High Performance Linpack (HPL) is a Linpack benchmark package widely adopted in high performance computing. An efficient partitioning strategy for matrix multiplication is introduced, based on the architectural features of the Loongson 3B processor, and a cache lock mechanism, which pins frequently used data blocks in the locked cache, is used to reduce cache misses. To hide memory access cost behind computation, a new prefetching algorithm is added to the memory access acceleration device. Other routines, such as dtrsm and row swapping, are also optimized, and the optimal value of each parameter is determined by training. Experimental results indicate that both the single-node (4 cores) and double-node (8 cores) configurations achieve about 60% of theoretical peak performance, nearly a 10-fold improvement over non-optimized Linpack.
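The partitioning idea in the abstract can be illustrated with a generic cache-blocked matrix multiplication. The sketch below is not the authors' Loongson 3B partitioning; it is a minimal, pure-Python illustration of tiling, and the function name `blocked_matmul` and the `block` parameter are hypothetical. In a tuned kernel the tile size would be chosen so that the working tiles fit in the (locked) cache, which is the kind of parameter the abstract says is found by training.

```python
def blocked_matmul(a, b, block=2):
    """Multiply matrices a (n x k) and b (k x m), given as lists of rows.

    The loops are tiled so that each (block x block) tile of a, b, and c
    is reused while it is still "hot". `block` is a hypothetical tuning
    parameter, not a value taken from the paper.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    # Outer three loops walk over tiles; inner three loops work inside one tile.
    for i0 in range(0, n, block):
        for p0 in range(0, k, block):
            for j0 in range(0, m, block):
                for i in range(i0, min(i0 + block, n)):
                    row_a = a[i]
                    row_c = c[i]
                    for p in range(p0, min(p0 + block, k)):
                        a_ip = row_a[p]
                        row_b = b[p]
                        for j in range(j0, min(j0 + block, m)):
                            row_c[j] += a_ip * row_b[j]
    return c
```

The result does not depend on `block`; only the order of memory accesses changes, which is why the tile size can be tuned freely for a given cache.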
Number of pages: 7
Journal: Shenzhen Daxue Xuebao (Ligong Ban)/Journal of Shenzhen University Science and Engineering
Publication status: Published - May 2014
Bibliographical note
Foundation: National High-Tech Research and Development Program of China (2012AA01A30904); Academician Workstation Construction Projects in Guangdong Province (2012B090500020).
- Computer architecture
- Data prefetching
- Linear system package
- Loongson 3B processor
- Matrix multiplication