# HPL Test with MKL + MPICH

# Environment Setup

- GPU driver
- CUDA
- Compiler: the system GNU compiler

Unfortunately, the CUDA on the cluster is version 9.0, which does not support MPI-3.0. I thought about downloading Intel MPI instead, but it can no longer be downloaded standalone: you have to get the whole OneAPI bundle, whose archive is as huge as the CUDA 11.0 one.
# MPICH

- Download mpich-3.2.1.tar.gz from the official site (wget works).
- Extract it, and create a new directory mpich as the install directory:

cd mpich-3.2.1/
./configure --prefix=/home/riolu/HPL/mpich CFLAGS="-fPIC" CXXFLAGS="-fPIC" --enable-shared --enable-sharedlibs=gcc --with-cuda=/usr/local/cuda-10.0/ --with-cuda-include=/usr/local/cuda-10.0/include --with-cuda-libpath=/usr/local/cuda-10.0/lib64

Without these flags, the build fails with:

libmpich.a(allreduce.o): relocation R_X86_64_32 against `.rodata.str1.8' can not be used when making a shared object; recompile with -fPIC
- Then:

make clean
make
make install
gedit ~/.bashrc

Append the following:

export PATH=/home/riolu/HPL/mpich/bin:$PATH
export MANPATH=/home/riolu/HPL/mpich/man:$MANPATH
export LD_LIBRARY_PATH=/home/riolu/HPL/mpich/lib:$LD_LIBRARY_PATH  # not sure whether this one is needed

Save, then run:

source ~/.bashrc
# Intel MKL

- Download the standalone installer: l_mkl_2019.5.281.tgz
- Extract it, enter the directory, run sh install.sh, and follow the prompts.
- Configure .bashrc by adding:

# added for intel
export LD_LIBRARY_PATH=/home/riolu/intel/mkl/lib/intel64:$LD_LIBRARY_PATH

Then run source ~/.bashrc. If you later get a "cannot find shared library" error, it is probably because this path was not added.
# HPL

- Download hpl-2.0_FERMI_v15.tgz from the official site (registration required).
- Enter the extracted directory and edit the parameters in Make.CUDA:

TOPdir = $(HOME)/HPL/hpl-2.0_FERMI_v15   # path to the HPL directory
..............................................................
MPdir = $(HOME)/HPL/mpich
MPinc = -I$(MPdir)/include
MPlib = -L$(MPdir)/lib
..............................................................
LAdir = $(HOME)/intel/mkl/lib/intel64
LAMP5dir = $(HOME)/intel/compilers_and_libraries/linux/lib/intel64
LAinc = -I$(HOME)/intel/mkl/include
LAlib = -L $(TOPdir)/src/cuda -ldgemm -L/usr/local/cuda-10.0/lib64 -lcuda -lcudart -lcublas -L$(LAdir) -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc) -I/usr/local/cuda-10.0/include
..............................................................
CC = $(MPdir)/bin/mpicc
..............................................................
- Edit src/cuda/Makefile so that its CUDA path points to cuda-10.0.
- Build:

make arch=CUDA clean_arch_all
make arch=CUDA
If the link fails with "/usr/bin/ld: cannot find -liomp5", create a symlink:

ln -s /home/intel/lib/intel64/libiomp5.so /home/intel/mkl/lib/intel64/libiomp5.so

- Two files are generated under bin/CUDA: HPL.dat and xhpl. In that directory, create a file test.sh with the following content:

export HPL_DIR=/home/riolu/HPL/hpl-2.0_FERMI_v15
export MKL_NUM_THREADS=6
export OMP_NUM_THREADS=2
export MKL_DYNAMIC=FALSE
export CUDA_DGEMM_SPLIT=0.954
export CUDA_DTRSM_SPLIT=0.946
export LD_LIBRARY_PATH=$HPL_DIR/src/cuda:$LD_LIBRARY_PATH
$HPL_DIR/bin/CUDA/xhpl

where:

- MKL_NUM_THREADS: number of CPU cores used per process
- OMP_NUM_THREADS: number of CPU cores used per GPU; for example, with 2 GPUs and 8 CPU cores, set OMP_NUM_THREADS=4
- CUDA_DGEMM_SPLIT: fraction of the DGEMM work sent to the GPU, roughly (GPU GFLOPS) / (GPU GFLOPS + CPU GFLOPS), or 350 / (350 + CPU cores per GPU * 4 * CPU base frequency)
- CUDA_DTRSM_SPLIT: fraction of the DTRSM work sent to the GPU, usually 0.05-0.10 lower than the DGEMM split
- HPL_DIR: path to the HPL directory

- Single-node run: ./run_linpack.sh
- Multi-node run (untested, for reference only): mpiexec.hydra -np 7 ./test.sh
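The split rule of thumb above is easy to evaluate numerically. A minimal sketch: the 350 GFLOPS GPU DGEMM estimate comes from the formula above, while the core count and clock speed below are hypothetical example values, not measurements from this cluster.

```python
def dgemm_split(cores_per_gpu: int, base_ghz: float, gpu_gflops: float = 350.0) -> float:
    """Rule-of-thumb CUDA_DGEMM_SPLIT: GPU share of the combined DGEMM rate."""
    # 4 double-precision flops per cycle per core, as in the formula above
    cpu_gflops = cores_per_gpu * 4 * base_ghz
    return gpu_gflops / (gpu_gflops + cpu_gflops)

# Hypothetical example: 2 CPU cores per GPU at a 2.1 GHz base clock
split = dgemm_split(cores_per_gpu=2, base_ghz=2.1)
print(f"CUDA_DGEMM_SPLIT ~ {split:.3f}")         # ~0.954 with these example values
print(f"CUDA_DTRSM_SPLIT ~ {split - 0.05:.3f}")  # typically 0.05-0.10 lower
```

With these example inputs the estimate lands near the 0.954 used in test.sh; in practice the splits are tuned empirically around this starting point.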
# Initial Results
================================================================================
HPLinpack 2.0 -- High-Performance Linpack benchmark -- September 10, 2008
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 25000
NB : 768
PMAP : Row-major process mapping
P : 1
Q : 1
PFACT : Left
NBMIN : 2
NDIV : 2
RFACT : Left
BCAST : 1ring
DEPTH : 1
SWAP : Spread-roll (long)
L1 : no-transposed form
U : no-transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR10L2L2 25000 768 1 1 10.58 9.843e+02
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0044272 ...... PASSED
================================================================================
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
That is about 3x the speed of the CPU-only run...
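As a sanity check, the reported rate can be reproduced from the N and Time fields in the log, using the standard HPL operation count of 2/3*N^3 + 2*N^2 floating-point operations:

```python
# Recompute the HPL rate from the log above: N = 25000, Time = 10.58 s.
N, time_s = 25000, 10.58
flops = (2.0 / 3.0) * N**3 + 2.0 * N**2  # standard HPL flop count
gflops = flops / time_s / 1e9
print(f"{gflops:.1f} Gflops")  # matches the reported 9.843e+02 up to the log's rounded time
```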
# Getting GPU Information

- List NVIDIA GPUs:

lspci | grep -i nvidia

- Show NVIDIA GPU details and current usage:

nvidia-smi
Wed Dec 9 02:23:00 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.129 Driver Version: 410.129 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000000:05:00.0 Off | 0 |
| N/A 34C P0 36W / 250W | 2391MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-PCIE... Off | 00000000:08:00.0 Off | 0 |
| N/A 40C P0 40W / 250W | 2391MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P100-PCIE... Off | 00000000:09:00.0 Off | 0 |
| N/A 46C P0 194W / 250W | 2391MiB / 16280MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P100-PCIE... Off | 00000000:84:00.0 Off | 0 |
| N/A 40C P0 39W / 250W | 2391MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla P100-PCIE... Off | 00000000:88:00.0 Off | 0 |
| N/A 36C P0 36W / 250W | 2391MiB / 16280MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla P100-PCIE... Off | 00000000:89:00.0 Off | 0 |
| N/A 42C P0 185W / 250W | 2391MiB / 16280MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 3451 C ...wei/HPL/hpl-2.0_FERMI_v15/bin/CUDA/xhpl 2363MiB |
| 1 3455 C ...wei/HPL/hpl-2.0_FERMI_v15/bin/CUDA/xhpl 2363MiB |
| 2 3452 C ...wei/HPL/hpl-2.0_FERMI_v15/bin/CUDA/xhpl 2363MiB |
| 3 3456 C ...wei/HPL/hpl-2.0_FERMI_v15/bin/CUDA/xhpl 2363MiB |
| 4 3453 C ...wei/HPL/hpl-2.0_FERMI_v15/bin/CUDA/xhpl 2363MiB |
| 5 3457 C ...wei/HPL/hpl-2.0_FERMI_v15/bin/CUDA/xhpl 2363MiB |
+-----------------------------------------------------------------------------+
Column meanings:

- Fan: fan speed, 0 to 100%, the speed the machine is requesting; shows N/A if the card is not fan-cooled or the fan is broken
- Temp: GPU internal temperature, in degrees Celsius
- Perf: performance state, P0 to P12; P0 is maximum performance, P12 is minimum
- Pwr: power usage
- Bus-Id: GPU bus information
- Disp.A: Display Active, i.e. whether the GPU's display output is initialized
- Memory-Usage: GPU memory usage
- Volatile GPU-Util: instantaneous GPU utilization
- Compute M.: compute mode
- Processes: GPU memory used by each process on each GPU

- Show GPU usage periodically:

watch -n 10 nvidia-smi

The value after the -n option is the refresh period, in seconds.

- Show GPU names:

nvidia-smi -L
GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-1f613914-b422-06ad-cd7d-d2649435f480)
GPU 1: Tesla P100-PCIE-16GB (UUID: GPU-d8f026e7-9714-39f6-af4a-f8328cbdb3c9)
GPU 2: Tesla P100-PCIE-16GB (UUID: GPU-fb22db8b-9de3-691e-53da-d34aec4b6abb)
GPU 3: Tesla P100-PCIE-16GB (UUID: GPU-ddf0cbc8-3c92-0b3d-5496-b65361fe18a0)
GPU 4: Tesla P100-PCIE-16GB (UUID: GPU-7a94ce44-3e66-d05d-3d58-d625930a2aad)
GPU 5: Tesla P100-PCIE-16GB (UUID: GPU-16423caf-0037-6742-2977-ba03f4937b9b)