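Linpack benchmark: OpenBLAS + OpenMPI + HPL
# Environment setup
# Installing OpenBLAS
- Download OpenBLAS-0.3.10.tar.gz from the official site.
- Extract the archive and run `make` in the extracted directory. When the build finishes, it prints a summary like this: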
OpenBLAS build complete. (BLAS CBLAS LAPACK LAPACKE)
OS ... Linux
Architecture ... x86_64
BINARY ... 64bit
C compiler ... GCC (cmd & version : cc (Ubuntu 4.8.4-2ubuntu1~14.04.4) 4.8.4)
Fortran compiler ... GFORTRAN (cmd & version : GNU Fortran (Ubuntu 4.8.4-2ubuntu1~14.04.4) 4.8.4)
Library Name ... libopenblas_haswellp-r0.3.10.a (Multi-threading; Max num-threads is 32)
To install the library, you can run "make PREFIX=/path/to/your/installation install".
- Run `make PREFIX=/home/riolu/HPL/openblas install` to install the library.
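# Installing OpenMPI
- Download openmpi-4.0.5.tar.gz from the official site.
- Extract the archive and run `./configure --prefix=/home/riolu/HPL/openmpi` in the extracted directory. The end of the configure summary looks like this: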
Resource Managers
-----------------------
Cray Alps: no
Grid Engine: no
LSF: no
Moab: no
Slurm: yes
ssh/rsh: yes
Torque: no
OMPIO File Systems
-----------------------
Generic Unix FS: yes
Lustre: no
PVFS2/OrangeFS: no
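- Run `make`, then `make install`.
- Edit `~/.bashrc` and append the following lines:
export PATH=/home/riolu/HPL/openmpi/bin:$PATH
export INCLUDE=/home/riolu/HPL/openmpi/include:$INCLUDE
export LD_LIBRARY_PATH=/home/riolu/HPL/openmpi/lib:$LD_LIBRARY_PATH
- Save the file, then run `source ~/.bashrc`.
(Note: LibreOffice changes the file encoding when saving, so it is safer to edit the file on Windows and copy it over, or to edit it in Jupyter.)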
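Given the encoding problem noted above, one way to sidestep the editor entirely is to append the three lines from the shell. The following is only a sketch: `append_mpi_env` is a hypothetical helper name, and the prefix is passed in as an argument rather than hard-coded.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append the OpenMPI environment lines to a shell rc file
# without opening an editor. Arguments: <openmpi prefix> <rc file>.
append_mpi_env() {
    local prefix="$1" rcfile="$2"
    # Skip if the PATH line is already there, so repeated runs do not duplicate it.
    if grep -qF "export PATH=${prefix}/bin:" "$rcfile" 2>/dev/null; then
        return 0
    fi
    # Unquoted EOF: ${prefix} is expanded now, while \$PATH etc. stay literal.
    cat >> "$rcfile" <<EOF
export PATH=${prefix}/bin:\$PATH
export INCLUDE=${prefix}/include:\$INCLUDE
export LD_LIBRARY_PATH=${prefix}/lib:\$LD_LIBRARY_PATH
EOF
}
```

With the paths used in this post, the call would be `append_mpi_env /home/riolu/HPL/openmpi ~/.bashrc`, followed by `source ~/.bashrc`.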
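# Installing and compiling HPL
- Download hpl.tar.gz from the official site.
- Go into the `setup` directory of the extracted tree, find `Make.Linux_PII_CBLAS`, copy it to the parent directory, and rename it to `Make.Linux`.
- Edit `Make.Linux` as follows. `TOPdir` is the directory created by extracting the HPL archive, and `MPdir` is the MPI installation directory (OpenMPI here):
ARCH = Linux
TOPdir = $(HOME)/HPL/hpl-2.3
MPdir = $(HOME)/HPL/openmpi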
MPinc = -I$(MPdir)/include
MPlib = -L$(MPdir)/lib
LAdir = $(HOME)/HPL/openblas
LAinc = -I$(LAdir)/include
LAlib = $(LAdir)/lib/libopenblas_haswellp-r0.3.10.a
HPL_OPTS = -DHPL_CALL_CBLAS
CC = $(MPdir)/bin/mpicc
CCFLAGS = $(HPL_DEFS) -fomit-frame-pointer -fopenmp -O3 -funroll-loops
LINKER = $(MPdir)/bin/mpif77
LINKFLAGS = $(CCFLAGS)
- Run `make arch=Linux`.
- The `bin` directory of the HPL tree now contains a `Linux` directory holding `HPL.dat` and `xhpl`: the build is complete.
- From that directory, run `mpirun -np 4 ./xhpl`. It produces the correct result!
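# Tuning the test
- Check the available memory:
$ free -h
total used free shared buffers cached
Mem: 251G 138G 113G 14M 4.9G 104G
-/+ buffers/cache: 29G 222G
Swap: 7.6G 1.2G 6.4G
With a 50 GB budget on one node, $N = \sqrt{50} \times 10000 \approx 70000$; the maximum is $N = \sqrt{120} \times 10000 \approx 110000$. This rule of thumb sizes the $N \times N$ matrix of 8-byte doubles to about 80% of the memory budget $M$, that is, $8N^2 \approx 0.8M$.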
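The sizing rule above is easy to script. The sketch below applies the same $\sqrt{\text{GB}} \times 10000$ formula and, as an extra assumption not taken from the original run (which simply used N = 70000), rounds the result down to a multiple of the block size NB. `hpl_problem_size` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: problem size from the rule of thumb N = sqrt(GB) * 10000,
# rounded down to a multiple of the block size NB (the rounding is an assumption,
# not something the original run did).
# Arguments: <memory budget in GB> <NB>.
hpl_problem_size() {
    awk -v gb="$1" -v nb="$2" 'BEGIN {
        n = int(sqrt(gb) * 10000)   # rule of thumb from the text
        print n - (n % nb)          # largest multiple of NB not above n
    }'
}

hpl_problem_size 50 256    # prints 70656
hpl_problem_size 120 256   # prints 109312
```

The resulting value would go into the `Ns` line of `HPL.dat`.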
- Test run:
~/HPL/hpl-2.3/bin/Linux$ mpirun -np 8 ./xhpl
================================================================================
HPLinpack 2.3 -- High-Performance Linpack benchmark -- December 2, 2018
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 70000
NB : 256 192
PMAP : Row-major process mapping
P : 2
Q : 4
PFACT : Left
NBMIN : 2
NDIV : 2
RFACT : Left
BCAST : 1ring
DEPTH : 0
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
+ The matrix A is randomly generated for each test.
+ The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
+ The relative machine precision (eps) is taken to be 1.110223e-16
+ Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR00L2L2 70000 256 2 4 894.95 2.5551e+02
HPL_pdgesv() start time Fri Dec 4 22:56:34 2020
HPL_pdgesv() end time Fri Dec 4 23:11:29 2020
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 1.51800829e-03 ...... PASSED
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR00L2L2 70000 192 2 4 927.80 2.4647e+02
HPL_pdgesv() start time Fri Dec 4 23:12:17 2020
HPL_pdgesv() end time Fri Dec 4 23:27:45 2020
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 1.48198443e-03 ...... PASSED
================================================================================
Finished 2 tests with the following results:
2 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
- Combined results:
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR00L2L2 70000 192 2 4 927.80 2.4647e+02
WR00L2L2 70000 256 2 4 910.19 2.5124e+02
WR00L2L2 70000 256 2 4 894.95 2.5551e+02
WR00L2L2 70000 256 2 4 902.53 2.5337e+02
WR00L2L2 70000 336 2 4 877.33 2.6065e+02
WR00L2L2 70000 336 2 4 894.92 2.5553e+02
WR00L2L2 70000 384 2 4 862.82 2.6503e+02
================================================================================
NB=384 looks like the best choice: the measured peak is 265.03 Gflops, i.e. $2.6503 \times 10^{11}$ floating-point operations per second.
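Picking the best row by eye works for seven runs, but the comparison can also be scripted. The sketch below feeds the combined result rows to `awk` and reports the NB with the highest Gflops; `best_nb` is a hypothetical helper name, and it assumes every result row starts with `W` and has exactly the seven columns shown in the table.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read HPL result rows (T/V N NB P Q Time Gflops) on stdin
# and print the NB of the fastest run together with its Gflops value.
best_nb() {
    awk '$1 ~ /^W/ && NF == 7 {
        gflops = $7 + 0                       # "2.6503e+02" becomes 265.03
        if (gflops > best) { best = gflops; nb = $3 }
    }
    END { printf "NB=%s %.2f Gflops\n", nb, best }'
}

# Rows copied from the combined results table.
best_nb <<'EOF'
WR00L2L2 70000 192 2 4 927.80 2.4647e+02
WR00L2L2 70000 256 2 4 910.19 2.5124e+02
WR00L2L2 70000 256 2 4 894.95 2.5551e+02
WR00L2L2 70000 256 2 4 902.53 2.5337e+02
WR00L2L2 70000 336 2 4 877.33 2.6065e+02
WR00L2L2 70000 336 2 4 894.92 2.5553e+02
WR00L2L2 70000 384 2 4 862.82 2.6503e+02
EOF
# prints: NB=384 265.03 Gflops
```

Because header and separator lines do not match the row pattern, the raw `xhpl` output could be piped in directly, for example `mpirun -np 8 ./xhpl | best_nb`.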
# Binding processes to cores
~/HPL/openmpi/bin/mpirun -np 16 --bind-to core --map-by core --report-bindings ./xhpl
mpirun -np 16 --bind-to core --map-by core --report-bindings ./xhpl
[amax:18074] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
[amax:18074] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../..][../../../../../../../..]
[amax:18074] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../..][../../../../../../../..]
[amax:18074] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../..][../../../../../../../..]
[amax:18074] MCW rank 4 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../..][../../../../../../../..]
[amax:18074] MCW rank 5 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB/../..][../../../../../../../..]
[amax:18074] MCW rank 6 bound to socket 0[core 6[hwt 0-1]]: [../../../../../../BB/..][../../../../../../../..]
[amax:18074] MCW rank 7 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB][../../../../../../../..]
[amax:18074] MCW rank 8 bound to socket 1[core 8[hwt 0-1]]: [../../../../../../../..][BB/../../../../../../..]
[amax:18074] MCW rank 9 bound to socket 1[core 9[hwt 0-1]]: [../../../../../../../..][../BB/../../../../../..]
[amax:18074] MCW rank 10 bound to socket 1[core 10[hwt 0-1]]: [../../../../../../../..][../../BB/../../../../..]
[amax:18074] MCW rank 11 bound to socket 1[core 11[hwt 0-1]]: [../../../../../../../..][../../../BB/../../../..]
[amax:18074] MCW rank 12 bound to socket 1[core 12[hwt 0-1]]: [../../../../../../../..][../../../../BB/../../..]
[amax:18074] MCW rank 13 bound to socket 1[core 13[hwt 0-1]]: [../../../../../../../..][../../../../../BB/../..]
[amax:18074] MCW rank 14 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../..][../../../../../../BB/..]
[amax:18074] MCW rank 15 bound to socket 1[core 15[hwt 0-1]]: [../../../../../../../..][../../../../../../../BB]
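A binding report like the one above can be summarized to confirm that the ranks are spread evenly over the sockets. This is a rough sketch: `ranks_per_socket` is a hypothetical helper name, and it assumes the report lines keep the `MCW rank ... bound to socket N` wording seen here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count how many ranks were bound to each socket,
# reading "--report-bindings" lines on stdin.
ranks_per_socket() {
    awk '/MCW rank/ {
        # Pull the number out of "socket N" ("socket " is 7 characters long).
        if (match($0, /socket [0-9]+/)) {
            count[substr($0, RSTART + 7, RLENGTH - 7)]++
        }
    }
    END { for (s in count) print "socket " s ": " count[s] " ranks" }' | sort
}

# Four sample lines taken from the report above.
ranks_per_socket <<'EOF'
[amax:18074] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
[amax:18074] MCW rank 7 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB][../../../../../../../..]
[amax:18074] MCW rank 8 bound to socket 1[core 8[hwt 0-1]]: [../../../../../../../..][BB/../../../../../../..]
[amax:18074] MCW rank 15 bound to socket 1[core 15[hwt 0-1]]: [../../../../../../../..][../../../../../../../BB]
EOF
```

For a live run, the report may arrive on stderr rather than stdout, so a redirect such as `mpirun ... --report-bindings ./xhpl 2>&1 | ranks_per_socket` is probably needed; for the 16-rank run shown here it should report 8 ranks on each socket.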