Journal of Shenzhen University Science and Engineering

Optimization of Linpack for the Loongson 3B processor

Liu Gang, Zhang Heng, Zhang Dian, and Mao Rui

College of Computer Science and Software Engineering, Guangdong Province Key Laboratory of Popular High Performance Computers, Shenzhen University, Shenzhen 518060, P.R. China

Keywords: computer architecture; Loongson 3B processor; linear system package (Linpack); matrix multiplication; data prefetching

DOI: 10.3724/SP.J.1249.2014.03286

High Performance Linpack (HPL) is the Linpack benchmark package widely adopted in high-performance computing. Based on the architectural features of the Loongson 3B processor, a matrix blocking strategy is designed for matrix multiplication, the core of Linpack, and the Loongson 3B cache-lock mechanism is used to lock frequently accessed data blocks in the cache, which significantly reduces the cache miss rate. An efficient prefetching algorithm is also designed for the processor's memory-access acceleration unit, so that computation time hides memory-access time. In addition, hot-spot functions called by Linpack, such as dtrsm and row swapping, are optimized individually, and the Linpack parameters are tuned through parameter training. Experimental results show that, on the Loongson 3B processor, the measured Linpack performance of both a single node (4 cores) and two nodes (8 cores) reaches about 60% of the theoretical peak, and that the optimized Linpack performs about 10 times better than the unoptimized version.
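The blocking idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names and the `block` parameter are assumptions made for illustration, and the Loongson 3B cache-lock and prefetch hardware are not modelled. The sketch only shows how a matrix product can be split into tiles so that each small working set is reused many times before the computation moves on.

```python
def matmul_naive(a, b):
    """Reference triple-loop product of two row-major lists of lists."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for p in range(k):
            aip = a[i][p]
            for j in range(m):
                c[i][j] += aip * b[p][j]
    return c


def matmul_blocked(a, b, block=4):
    """Tiled product: the three outer loops walk over tiles, the three
    inner loops multiply one tile of A by one tile of B.

    `block` is a hypothetical tile edge; on real hardware it would be
    chosen so that the tiles in use fit in the cache."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):
        for p0 in range(0, k, block):
            for j0 in range(0, m, block):
                # Multiply the (i0, p0) tile of A by the (p0, j0) tile of B
                # and accumulate into the (i0, j0) tile of C.
                for i in range(i0, min(i0 + block, n)):
                    row_a, row_c = a[i], c[i]
                    for p in range(p0, min(p0 + block, k)):
                        aip, row_b = row_a[p], b[p]
                        for j in range(j0, min(j0 + block, m)):
                            row_c[j] += aip * row_b[j]
    return c
```

For every element of C the tiled version accumulates the products in the same order of `p` as the reference loop, so both functions return identical results. Only the memory access pattern changes, which is what a cache-oriented optimization of this kind relies on.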

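Introduction
1 Analysis of the Linpack algorithm
2 Linpack optimization on the Loongson 3B processor
3 Experiments and performance analysis
4 Conclusion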
