[1] Liu Gang, Zhang Heng, Zhang Dian, et al. Optimization of Linpack for Loongson 3B processor[J]. Journal of Shenzhen University Science and Engineering, 2014, 31(3): 286-292. [doi:10.3724/SP.J.1249.2014.03286]

Optimization of Linpack for Loongson 3B processor

Journal of Shenzhen University Science and Engineering [ISSN: 1000-2618 / CN: 44-1401/N]

Volume:
31
Issue:
No. 3, 2014
Pages:
286-292
Section:
Electronics and Information Science
Publication date:
2014-05-20

Article Info

Title:
Optimization of Linpack for Loongson 3B processor
Article ID:
201403011
Author(s):
Liu Gang, Zhang Heng, Zhang Dian, and Mao Rui
College of Computer Science and Software Engineering, Guangdong Province Key Laboratory of Popular High Performance Computers, Shenzhen University, Shenzhen 518060, P. R. China
Keywords:
computer architecture; Loongson 3B processor; linear system package; matrix multiplication; data prefetching
CLC number:
TP 301;TP 319
DOI:
10.3724/SP.J.1249.2014.03286
Document code:
A
Abstract:
High Performance Linpack (HPL) is the Linpack benchmark package widely used in high-performance computing. Based on the architectural features of the Loongson 3B processor, a matrix blocking strategy is designed for matrix multiplication, the core of Linpack, and the Loongson 3B cache-lock mechanism is used to lock frequently accessed data blocks in the cache, which significantly reduces the cache miss rate. An efficient prefetching algorithm is also designed for the processor's memory-access acceleration unit so that computation time hides memory access time. In addition, hotspot functions called by Linpack, such as dtrsm and row swapping, are optimized, and the Linpack parameters are tuned through parameter training. Experimental results show that on the Loongson 3B processor the measured Linpack performance reaches about 60% of the theoretical peak for both a single node (4 cores) and two nodes (8 cores), roughly a 10-fold improvement over the unoptimized Linpack.
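The blocking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' kernel: the function names and the `block` parameter are hypothetical, the tile size is arbitrary, and the Loongson 3B cache-lock mechanism and memory-access acceleration unit are hardware features that are not modeled here. The sketch only shows how a tiled loop nest keeps reusing small sub-blocks of the operands, which is the property a cache-resident block is meant to exploit.

```python
def blocked_matmul(a, b, block=2):
    """Return the product of list-of-lists matrices a (n x k) and b (k x m).

    `block` is a hypothetical tile edge. In a tuned kernel it would be chosen
    so that the tiles in flight fit in the targeted cache level.
    Assumes both matrices are non-empty and rectangular.
    """
    n, k, m = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):            # tile rows of C
        for p0 in range(0, k, block):        # tile of the shared dimension
            for j0 in range(0, m, block):    # tile columns of C
                # Multiply one tile of A by one tile of B, accumulate into C.
                for i in range(i0, min(i0 + block, n)):
                    row_a, row_c = a[i], c[i]
                    for p in range(p0, min(p0 + block, k)):
                        a_ip, row_b = row_a[p], b[p]
                        for j in range(j0, min(j0 + block, m)):
                            row_c[j] += a_ip * row_b[j]
    return c


def naive_matmul(a, b):
    """Plain triple loop, used only to check the blocked version."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]
```

Calling `blocked_matmul(a, b, block=2)` should give the same result as `naive_matmul(a, b)` for any positive tile size, including sizes that do not divide the matrix dimensions. Only the order in which elements are visited changes, and that order is what a real kernel would tune against the cache.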

Memo

Received: 2013-04-25; Revised: 2014-02-25; Accepted: 2014-04-28
Foundation: National High-Tech Research and Development Program of China (2012AA01A30904); Academician Workstation Construction Projects in Guangdong Province (2012B090500020)
Corresponding author: Associate Professor Mao Rui. E-mail: mao@szu.edu.cn
Citation:Liu Gang, Zhang Heng, Zhang Dian, et al. Optimization of Linpack for Loongson 3B processor[J]. Journal of Shenzhen University Science and Engineering, 2014, 31(3): 286-292.(in Chinese)
About the author: Liu Gang (born 1978), male, Han ethnicity, from Hefei, Anhui Province; lecturer at Shenzhen University, Ph.D. E-mail: gliu@szu.edu.cn
Last Update: 2014-05-02