- HPL Benchmark with MKL + MPICH -

# Environment Setup

  1. GPU driver

  2. CUDA

  3. Compiler: the system-provided GNU compiler

Unfortunately the CUDA on the cluster is version 9.0, which does not support MPI-3.0. I thought about downloading Intel MPI instead, but it can no longer be downloaded standalone; you have to grab the whole OneAPI package, whose archive is as huge as CUDA 11.0's.

# MPICH

  1. Download mpich-3.2.1.tar.gz from the official site (wget works).
  2. Unpack it, and create a directory mpich to serve as the install prefix:
cd mpich-3.2.1/
./configure --prefix=/home/riolu/HPL/mpich CFLAGS="-fPIC" CXXFLAGS="-fPIC" --enable-shared --enable-sharedlibs=gcc --with-cuda=/usr/local/cuda-10.0/ --with-cuda-include=/usr/local/cuda-10.0/include --with-cuda-libpath=/usr/local/cuda-10.0/lib64

Without these flags, the build fails with:

libmpich.a(allreduce.o): relocation R_X86_64_32 against `.rodata.str1.8' can not be used when making a shared object; recompile with -fPIC
  3. Next, build and install, then open the shell config:
make clean
make
make install
gedit ~/.bashrc

Append the following at the end:

export PATH=/home/riolu/HPL/mpich/bin:$PATH
export MANPATH=/home/riolu/HPL/mpich/man:$MANPATH
export LD_LIBRARY_PATH=/home/riolu/HPL/mpich/lib:$LD_LIBRARY_PATH  # not sure whether this one is needed

After saving, run:

source ~/.bashrc
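To confirm the freshly installed MPICH is the one actually picked up from PATH, a quick guarded check (purely a sanity sketch) can look like:

```shell
# Report where mpicc and mpiexec resolve from, without failing if they are absent
for tool in mpicc mpiexec; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool -> $(command -v "$tool")"   # should point into /home/riolu/HPL/mpich/bin
  else
    echo "$tool not found in PATH"
  fi
done
```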

# Intel MKL

  1. Download the standalone installer: l_mkl_2019.5.281.tgz
  2. Unpack it, enter the directory, run sh install.sh, and follow the prompts.
  3. Configure .bashrc by appending:
# added for intel
export LD_LIBRARY_PATH=/home/riolu/intel/mkl/lib/intel64:$LD_LIBRARY_PATH

Then run source ~/.bashrc
If you later get "shared library not found" errors, it is probably because this path was not added.
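A quick way to check that the MKL shared libraries really are where the path above points (the library names below are the standard MKL ones referenced later by the HPL link line; adjust if your layout differs):

```shell
# Check for the MKL runtime libraries under the install prefix used above
MKLLIB="$HOME/intel/mkl/lib/intel64"
for lib in libmkl_core.so libmkl_intel_lp64.so libmkl_intel_thread.so; do
  if [ -e "$MKLLIB/$lib" ]; then
    echo "found:   $MKLLIB/$lib"
  else
    echo "missing: $MKLLIB/$lib (check the install prefix and LD_LIBRARY_PATH)"
  fi
done
```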

# HPL

  1. Download hpl-2.0_FERMI_v15.tgz from the official download site
    (registration required).
  2. Enter the unpacked directory and edit the parameters in Make.CUDA:
TOPdir = $(HOME)/HPL/hpl-2.0_FERMI_v15   # directory where HPL lives
..............................................................
MPdir  = $(HOME)/HPL/mpich
MPinc = -I$(MPdir)/include          
MPlib = -L$(MPdir)/lib
..............................................................    
LAdir   =$(HOME)/intel/mkl/lib/intel64
LAMP5dir    = $(HOME)/intel/compilers_and_libraries/linux/lib/intel64
LAinc        = -I$(HOME)/intel/mkl/include
LAlib        = -L $(TOPdir)/src/cuda  -ldgemm -L/usr/local/cuda-10.0/lib64 -lcuda -lcudart -lcublas -L$(LAdir) -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5

HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc) -I/usr/local/cuda-10.0/include
..............................................................
CC   =$(MPdir)/bin/mpicc

..............................................................
  3. In src/cuda/Makefile, change the CUDA path to cuda-10.0 as well.
  4. Compile:
make arch=CUDA clean_arch_all
make arch=CUDA

If linking fails with /usr/bin/ld: cannot find -liomp5, you can create a symlink:

ln -s /home/intel/lib/intel64/libiomp5.so /home/intel/mkl/lib/intel64/libiomp5.so
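After make arch=CUDA succeeds, it may be worth confirming that the benchmark binary actually appeared (the path below is taken from the TOPdir value set above):

```shell
# Verify that the build produced the xhpl binary under bin/CUDA
BIN="$HOME/HPL/hpl-2.0_FERMI_v15/bin/CUDA/xhpl"
if [ -x "$BIN" ]; then
  echo "build ok: $BIN"
else
  echo "xhpl not found at $BIN (re-check Make.CUDA and the build log)"
fi
```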
  5. Two files are generated under bin/CUDA: HPL.dat and xhpl. In that directory, create a test.sh with the following content:
export HPL_DIR=/home/riolu/HPL/hpl-2.0_FERMI_v15
export MKL_NUM_THREADS=6
export OMP_NUM_THREADS=2
export MKL_DYNAMIC=FALSE
export CUDA_DGEMM_SPLIT=0.954
export CUDA_DTRSM_SPLIT=0.946
export LD_LIBRARY_PATH=$HPL_DIR/src/cuda:$LD_LIBRARY_PATH
$HPL_DIR/bin/CUDA/xhpl

where

  • MKL_NUM_THREADS: number of CPU cores used by each process
  • OMP_NUM_THREADS: number of CPU cores used per GPU
    For example, with 2 GPUs and 8 CPU cores, OMP_NUM_THREADS=4
  • CUDA_DGEMM_SPLIT: fraction of the DGEMM work sent to the GPU, roughly (GPU GFLOPS)/(GPU GFLOPS + CPU GFLOPS), or 350 / ( 350 + CPU cores per GPU * 4 * CPU base frequency )
  • CUDA_DTRSM_SPLIT: fraction of the DTRSM work sent to the GPU, usually 0.05-0.10 lower than the DGEMM split
  • HPL_DIR: path to the HPL tree
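As a worked example of the DGEMM split formula above, assuming a hypothetical node with 2 CPU cores driving each GPU and a 2.4 GHz base clock:

```shell
# Worked example of the split formula; the core count and clock below are assumptions
cores_per_gpu=2     # CPU cores per GPU (hypothetical)
freq_ghz=2.4        # CPU base frequency in GHz (hypothetical)
split=$(awk -v c="$cores_per_gpu" -v f="$freq_ghz" \
  'BEGIN { printf "%.3f", 350 / (350 + c * 4 * f) }')
echo "CUDA_DGEMM_SPLIT=$split"   # prints CUDA_DGEMM_SPLIT=0.948
```

Following the rule of thumb above, CUDA_DTRSM_SPLIT would then be set 0.05-0.10 lower.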
  6. Single-node run: ./test.sh
    Multi-node run (untested, for reference only): mpiexec.hydra -np 7 ./test.sh
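For the multi-node case (untested here, like the command above), MPICH's Hydra launcher takes a hostfile; a sketch with placeholder node names:

```shell
# Write a placeholder hostfile: one "host:slots" entry per node (names are hypothetical)
cat > hosts <<'EOF'
node01:4
node02:4
EOF
# The actual launch would then be (commented out, since it needs the real cluster):
# mpiexec.hydra -f hosts -np 8 ./test.sh
echo "hostfile lists $(grep -c . hosts) nodes"
```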

# Preliminary Results

	================================================================================
	HPLinpack 2.0  --  High-Performance Linpack benchmark  --   September 10, 2008
	Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
	Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
	Modified by Julien Langou, University of Colorado Denver
	================================================================================

	An explanation of the input/output parameters follows:
	T/V    : Wall time / encoded variant.
	N      : The order of the coefficient matrix A.
	NB     : The partitioning blocking factor.
	P      : The number of process rows.
	Q      : The number of process columns.
	Time   : Time in seconds to solve the linear system.
	Gflops : Rate of execution for solving the linear system.

	The following parameter values will be used:

	N      :   25000
	NB     :     768
	PMAP   : Row-major process mapping
	P      :       1
	Q      :       1
	PFACT  :    Left
	NBMIN  :       2
	NDIV   :       2
	RFACT  :    Left
	BCAST  :   1ring
	DEPTH  :       1
	SWAP   : Spread-roll (long)
	L1     : no-transposed form
	U      : no-transposed form
	EQUIL  : yes
	ALIGN  : 8 double precision words

	--------------------------------------------------------------------------------

	- The matrix A is randomly generated for each test.
	- The following scaled residual check will be computed:
		  ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
	- The relative machine precision (eps) is taken to be               1.110223e-16
	- Computational tests pass if scaled residuals are less than                16.0

	================================================================================
	T/V                N    NB     P     Q               Time                 Gflops
	--------------------------------------------------------------------------------
	WR10L2L2       25000   768     1     1              10.58              9.843e+02
	--------------------------------------------------------------------------------
	||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0044272 ...... PASSED
	================================================================================

	Finished      1 tests with the following results:
				  1 tests completed and passed residual checks,
				  0 tests completed and failed residual checks,
				  0 tests skipped because of illegal input values.
	--------------------------------------------------------------------------------

	End of Tests.
	================================================================================

Only about 3 times the CPU speed…

# Getting GPU Information

  1. Show the NVIDIA GPU model:
    lspci | grep -i nvidia
  2. Show NVIDIA GPU details and current usage:
    nvidia-smi
Wed Dec  9 02:23:00 2020
		+-----------------------------------------------------------------------------+
		| NVIDIA-SMI 410.129      Driver Version: 410.129      CUDA Version: 10.0     |
		|-------------------------------+----------------------+----------------------+
		| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
		| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
		|===============================+======================+======================|
		|   0  Tesla P100-PCIE...  Off  | 00000000:05:00.0 Off |                    0 |
		| N/A   34C    P0    36W / 250W |   2391MiB / 16280MiB |      0%      Default |
		+-------------------------------+----------------------+----------------------+
		|   1  Tesla P100-PCIE...  Off  | 00000000:08:00.0 Off |                    0 |
		| N/A   40C    P0    40W / 250W |   2391MiB / 16280MiB |      0%      Default |
		+-------------------------------+----------------------+----------------------+
		|   2  Tesla P100-PCIE...  Off  | 00000000:09:00.0 Off |                    0 |
		| N/A   46C    P0   194W / 250W |   2391MiB / 16280MiB |    100%      Default |
		+-------------------------------+----------------------+----------------------+
		|   3  Tesla P100-PCIE...  Off  | 00000000:84:00.0 Off |                    0 |
		| N/A   40C    P0    39W / 250W |   2391MiB / 16280MiB |      0%      Default |
		+-------------------------------+----------------------+----------------------+
		|   4  Tesla P100-PCIE...  Off  | 00000000:88:00.0 Off |                    0 |
		| N/A   36C    P0    36W / 250W |   2391MiB / 16280MiB |    100%      Default |
		+-------------------------------+----------------------+----------------------+
		|   5  Tesla P100-PCIE...  Off  | 00000000:89:00.0 Off |                    0 |
		| N/A   42C    P0   185W / 250W |   2391MiB / 16280MiB |    100%      Default |
		+-------------------------------+----------------------+----------------------+

		+-----------------------------------------------------------------------------+
		| Processes:                                                       GPU Memory |
		|  GPU       PID   Type   Process name                             Usage      |
		|=============================================================================|
		|    0      3451      C   ...wei/HPL/hpl-2.0_FERMI_v15/bin/CUDA/xhpl  2363MiB |
		|    1      3455      C   ...wei/HPL/hpl-2.0_FERMI_v15/bin/CUDA/xhpl  2363MiB |
		|    2      3452      C   ...wei/HPL/hpl-2.0_FERMI_v15/bin/CUDA/xhpl  2363MiB |
		|    3      3456      C   ...wei/HPL/hpl-2.0_FERMI_v15/bin/CUDA/xhpl  2363MiB |
		|    4      3453      C   ...wei/HPL/hpl-2.0_FERMI_v15/bin/CUDA/xhpl  2363MiB |
		|    5      3457      C   ...wei/HPL/hpl-2.0_FERMI_v15/bin/CUDA/xhpl  2363MiB |
		+-----------------------------------------------------------------------------+

Column meanings:
+ Fan: fan speed, 0 to 100%, as the target speed requested by the machine; shows N/A if the card is not fan-cooled or the fan is broken;
+ Temp: GPU temperature in degrees Celsius;
+ Perf: performance state, from P0 (maximum performance) down to P12 (minimum performance);
+ Pwr: power draw;
+ Bus-Id: information about the GPU's bus;
+ Disp.A: Display Active, i.e. whether the GPU's display output is initialized;
+ Memory Usage: GPU memory usage;
+ Volatile GPU-Util: instantaneous GPU utilization;
+ Compute M.: compute mode;
+ Processes: the GPU memory used by each process on each GPU.
  3. Refresh the GPU status periodically:
watch -n 10 nvidia-smi
The -n flag sets the refresh interval in seconds.
  4. List the GPU names:
nvidia-smi -L

GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-1f613914-b422-06ad-cd7d-d2649435f480)
GPU 1: Tesla P100-PCIE-16GB (UUID: GPU-d8f026e7-9714-39f6-af4a-f8328cbdb3c9)
GPU 2: Tesla P100-PCIE-16GB (UUID: GPU-fb22db8b-9de3-691e-53da-d34aec4b6abb)
GPU 3: Tesla P100-PCIE-16GB (UUID: GPU-ddf0cbc8-3c92-0b3d-5496-b65361fe18a0)
GPU 4: Tesla P100-PCIE-16GB (UUID: GPU-7a94ce44-3e66-d05d-3d58-d625930a2aad)
GPU 5: Tesla P100-PCIE-16GB (UUID: GPU-16423caf-0037-6742-2977-ba03f4937b9b)
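For scripting, nvidia-smi can also emit machine-readable CSV via its query interface; a guarded sketch (the query fields are standard nvidia-smi ones):

```shell
# Dump per-GPU utilization and memory as CSV; degrade gracefully without a GPU
if command -v nvidia-smi >/dev/null 2>&1; then
  gpu_stats=$(nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used --format=csv)
else
  gpu_stats="nvidia-smi not available on this machine"
fi
echo "$gpu_stats"
```

Combined with watch or a cron entry, this gives a lightweight utilization log during long HPL runs.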