Optimization of Linpack for Loongson 3B processor

Gang LIU, Heng ZHANG, Dian ZHANG, Rui MAO*

*Corresponding author for this work

Research output: Journal Publications; Journal Article (refereed); Research; peer-review

Abstract

High Performance Linpack (HPL) is a Linpack benchmark package widely adopted in high performance computing. A matrix partitioning strategy for matrix multiplication is designed around the architectural features of the Loongson 3B processor, and the processor's cache lock mechanism is used to lock frequently used data blocks in the cache, reducing the cache miss rate. So that computation time hides memory access time, a new prefetching algorithm is designed for the processor's memory access acceleration unit. Other functions called by Linpack, such as dtrsm and row swapping, are also optimized, and the Linpack parameters are tuned by training. Experimental results indicate that both the single-node (4 cores) and double-node (8 cores) configurations achieve about 60% of theoretical peak performance, a nearly 10-fold improvement over non-optimized Linpack.

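Chinese abstract (translated): HPL is the Linpack benchmark package widely used in high performance computing. Targeting the architectural characteristics of the Loongson 3B processor, a matrix partitioning strategy is designed for matrix multiplication, the core part of Linpack, and the Loongson 3B cache lock mechanism is used to lock frequently accessed data blocks in the cache, significantly reducing the cache miss rate. An efficient prefetching algorithm is also designed for the memory access acceleration unit of the Loongson 3B processor, so that computation time hides memory access time. In addition, hotspot functions called by Linpack, such as dtrsm and row swapping, are optimized, and the Linpack parameters are tuned through parameter training. Experimental results show that, on the Loongson 3B processor, the measured Linpack performance of both a single node (4 cores) and two nodes (8 cores) reaches about 60% of the theoretical peak, and the optimized Linpack is about 10 times faster than the unoptimized version.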
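
The partitioning idea mentioned in the abstract can be illustrated with a generic cache-blocked (tiled) matrix multiplication. This is a minimal sketch under stated assumptions, not the authors' Loongson 3B kernel: the function name `blocked_matmul` and the tile size `block` are hypothetical, and the sketch models neither cache locking nor hardware prefetching. It only shows how splitting the loops into tiles keeps a small working set in use at a time.

```python
def blocked_matmul(a, b, block=2):
    """Return the product a * b of two list-of-list matrices, computed tile by tile.

    `block` is a hypothetical tile size; a tuned kernel would choose it so
    that the tiles being worked on fit in cache.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):            # tile over rows of C
        for j0 in range(0, m, block):        # tile over columns of C
            for p0 in range(0, k, block):    # tile over the shared dimension
                # Accumulate the contribution of one tile of A and one tile of B
                for i in range(i0, min(i0 + block, n)):
                    row_a, row_c = a[i], c[i]
                    for p in range(p0, min(p0 + block, k)):
                        a_ip, row_b = row_a[p], b[p]
                        for j in range(j0, min(j0 + block, m)):
                            row_c[j] += a_ip * row_b[j]
    return c
```

Calling `blocked_matmul(a, b, block=64)` gives the same result as an untiled triple loop; only the order in which elements are visited changes, which is what affects cache behavior in a compiled implementation.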

Original language: English
Pages (from-to): 286-292
Number of pages: 7
Journal: Shenzhen Daxue Xuebao (Ligong Ban)/Journal of Shenzhen University Science and Engineering
Volume: 31
Issue number: 3
DOIs: 10.3724/SP.J.1249.2014.03286
Publication status: Published - May 2014
Externally published: Yes

Fingerprint

Data storage equipment
Costs

Bibliographical note

Foundation: National High-Tech Research and Development Program of China (2012AA01A30904); Academician Workstation Construction Projects in Guangdong Province (2012B090500020).

Keywords

  • Computer architecture
  • Data prefetching
  • Linear system package
  • Loongson 3B processor
  • Matrix multiplication

Cite this

@article{f96cdbb896284b5b90ea09ebcb1448f2,
title = "Optimization of Linpack for Loongson 3B processor",
abstract = "High Performance Linpack (HPL) is a Linpack benchmark package widely adopted in high performance computing. A matrix partitioning strategy for matrix multiplication is designed around the architectural features of the Loongson 3B processor, and the processor's cache lock mechanism is used to lock frequently used data blocks in the cache, reducing the cache miss rate. So that computation time hides memory access time, a new prefetching algorithm is designed for the processor's memory access acceleration unit. Other functions called by Linpack, such as dtrsm and row swapping, are also optimized, and the Linpack parameters are tuned by training. Experimental results indicate that both the single-node (4 cores) and double-node (8 cores) configurations achieve about 60{\%} of theoretical peak performance, a nearly 10-fold improvement over non-optimized Linpack.",
keywords = "Computer architecture, Data prefetching, Linear system package, Loongson 3B processor, Matrix multiplication",
author = "Gang LIU and Heng ZHANG and Dian ZHANG and Rui MAO",
note = "Foundation: National High-Tech Research and Development Program of China (2012AA01A30904); Academician Workstation Construction Projects in Guangdong Province (2012B090500020).",
year = "2014",
month = "5",
doi = "10.3724/SP.J.1249.2014.03286",
language = "English",
volume = "31",
pages = "286--292",
journal = "Shenzhen Daxue Xuebao (Ligong Ban)/Journal of Shenzhen University Science and Engineering",
issn = "1000-2618",
publisher = "Editorial Office of Journal of Shenzhen University",
number = "3",

}

Optimization of Linpack for Loongson 3B processor. / LIU, Gang; ZHANG, Heng; ZHANG, Dian; MAO, Rui.

In: Shenzhen Daxue Xuebao (Ligong Ban)/Journal of Shenzhen University Science and Engineering, Vol. 31, No. 3, 05.2014, p. 286-292.
