| Previous revision Next revision |
— | pandoc:introduction-to-vsc:09_special_hardware:accelerators [2019/01/15 12:42] – Pandoc Auto-commit pandoc |
---|
| |
| ====== Special hardware (GPUs, KNLs, binfs) available & how to use it ====== |
| |
| * Article written by Siegfried Höfinger (VSC Team) <html><br></html>(last update 2019-01-15 by sh). |
| |
| ====== TOP500 List November 2018 ====== |
| |
| |
| <HTML> |
| <!--slide 1--> |
| <!--for nations flags see https://www.free-country-flags.com--> |
| </HTML> |
| ^ Rank^Nation ^Machine ^ Performance^Accelerators ^ |
| | 1.|{{:pandoc:introduction-to-vsc:09_special_hardware:accelerators:us.png?0x24}} |Summit | 144 PFLOPs/s|<html><font color="navy"></html>NVIDIA V100<html></font></html> | |
| | 2.|{{:pandoc:introduction-to-vsc:09_special_hardware:accelerators:us.png?0x24}} |Sierra | 95 PFLOPs/s|<html><font color="navy"></html>NVIDIA V100<html></font></html> | |
| | 3.|{{:pandoc:introduction-to-vsc:09_special_hardware:accelerators:cn.png?0x24}} |Sunway TaihuLight | 93 PFLOPs/s| | |
| | 4.|{{:pandoc:introduction-to-vsc:09_special_hardware:accelerators:cn.png?0x24}} |Tianhe-2A | 62 PFLOPs/s| | |
| | 5.|{{:pandoc:introduction-to-vsc:09_special_hardware:accelerators:ch.png?0x24}} |Piz Daint | 21 PFLOPs/s|<html><font color="navy"></html>NVIDIA P100<html></font></html> | |
| | 6.|{{:pandoc:introduction-to-vsc:09_special_hardware:accelerators:us.png?0x24}} |Trinity | 20 PFLOPs/s|<html><font color="#cc3300"></html>Intel Xeon Phi 7250/KNL<html></font></html> | |
| | 7.|{{:pandoc:introduction-to-vsc:09_special_hardware:accelerators:jp.png?0x24}} |ABCI | 20 PFLOPs/s|<html><font color="navy"></html>NVIDIA V100<html></font></html> | |
| | 8.|{{:pandoc:introduction-to-vsc:09_special_hardware:accelerators:de.png?0x24}} |SuperMUC-NG | 20 PFLOPs/s| | |
| | 9.|{{:pandoc:introduction-to-vsc:09_special_hardware:accelerators:us.png?0x24}} |Titan | 18 PFLOPs/s|<html><font color="navy"></html>NVIDIA K20x<html></font></html> | |
| | 10.|{{:pandoc:introduction-to-vsc:09_special_hardware:accelerators:us.png?0x24}} |Sequoia | 17 PFLOPs/s| | |
| |
| |
| <HTML> |
| <!--slide 2--> |
| </HTML> |
| ====== Components on VSC-3 ====== |
| |
| ^Model ^#cores^Clock Freq (GHz)^Memory (GB)^Bandwidth (GB/s)^TDP (Watt)^FP32/FP64 (GFLOPs/s)^ |
| |<html><font color="navy"></html>36+50x GeForce GTX-1080 n7[1-3]-[001-004,001-022,001-028]<html></font></html>| | | | | | | |
| |{{:pandoc:introduction-to-vsc:09_special_hardware:accelerators:nvidia-gtx-1080.jpg}} |2560 |1.61 |8 |320 |180 |8228/257 | |
| |<html><font color="navy"></html>4x Tesla K20m n72-02[4-5]<html></font></html> | | | | | | | |
| |{{:pandoc:introduction-to-vsc:09_special_hardware:accelerators:nvidia-k20m.png}} |2496 |0.71 |5 |208 |195 |3520/1175 | |
| |<html><font color="navy"></html>2x Tesla M60 n72-023<html></font></html> | | | | | | | |
| |{{:pandoc:introduction-to-vsc:09_special_hardware:accelerators:m60.jpeg}} |2048 |1.18 |8 |160 |265 |8500/266 | |
| |<html><font color="#cc3300"></html>4x KNL 7210 n25-05[0-3]<html></font></html> | | | | | | | |
| |{{:pandoc:introduction-to-vsc:09_special_hardware:accelerators:intel-knl.png}} |64 |1.30 |197+16<384 |102 |215 |5000+/2500+ | |
| |
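The FP32 figures for the two GPU rows with a single chip per card can be cross-checked with a rule of thumb that is not stated in the table itself and is assumed here: peak FP32 rate ≈ cores × clock (GHz) × 2, counting one fused multiply-add per core per cycle. The helper name `peak_gflops` is hypothetical. With the rounded clock values from the table the estimate lands close to, but not exactly on, the listed numbers.

```shell
#!/bin/bash
# peak_gflops CORES CLOCK_GHZ
# Rough peak estimate in GFLOPs/s, assuming 2 floating point operations
# (one fused multiply-add) per core per cycle. This is an assumption,
# not something taken from the table.
peak_gflops() {
  awk -v cores="$1" -v ghz="$2" 'BEGIN { printf "%.0f\n", cores * ghz * 2 }'
}

peak_gflops 2560 1.61   # GTX-1080 row (table lists 8228)
peak_gflops 2496 0.71   # K20m row     (table lists 3520)
```

The small deviations are consistent with the clock frequencies in the table being rounded to two decimals.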
| |
| <HTML> |
| <!--slide 3--> |
| </HTML> |
| ====== Working on GPU nodes ====== |
| |
| **Interactive mode** |
| |
| <code> |
| 1. salloc -N 1 -p gpu_gtx1080single --qos gpu_gtx1080single --gres gpu:1 (...perhaps -L intel@vsc) |
| |
| 2. squeue -u training |
| |
| 3. srun -n 1 hostname (...while still on the login node !) |
| |
| 4. ssh n72-012 (...or whichever node was assigned) |
| |
| 5. module load cuda/9.1.85 |
| cd ~/examples/09_special_hardware/gpu_gtx1080/matrixMul |
| nvcc ./matrixMul.cu |
| ./a.out |
| |
| cd ~/examples/09_special_hardware/gpu_gtx1080/matrixMulCUBLAS |
| nvcc matrixMulCUBLAS.cu -lcublas |
| ./a.out |
| |
| 6. nvidia-smi |
| |
| 7. /opt/sw/x86_64/glibc-2.17/ivybridge-ep/cuda/9.1.85/NVIDIA_CUDA-9.1_Samples/1_Utilities/deviceQuery/deviceQuery |
| </code> |
| <HTML> |
| <!--slide 4--> |
| </HTML> |
| ====== Working on GPU nodes cont. ====== |
| |
| **SLURM submission** [[examples/gpu_gtx1080/gpu_test.scrpt|gpu_test.scrpt]] |
| |
| <code bash> |
| #!/bin/bash |
| # usage: sbatch ./gpu_test.scrpt |
| # |
| #SBATCH -J gtx1080 |
| #SBATCH -N 1 |
| #SBATCH --partition gpu_gtx1080single |
| #SBATCH --qos gpu_gtx1080single |
| #SBATCH --gres gpu:1 |
| |
| module purge |
| module load cuda/9.1.85 |
| |
| nvidia-smi |
| /opt/sw/x86_64/glibc-2.17/ivybridge-ep/cuda/9.1.85/NVIDIA_CUDA-9.1_Samples/1_Utilities/deviceQuery/deviceQuery |
| </code> |
| <html><font color="navy"></html>**Exercise/Example/Problem:**<html></font></html> <html><br/></html> Using interactive mode or batch submission, determine whether ECC is enabled on the GPUs of type GTX-1080. |
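One possible way to approach the exercise is to ask `nvidia-smi` for the ECC mode directly instead of reading the full report. The following is a minimal sketch: `ecc_report` is a hypothetical helper, and the query relies on the `--query-gpu=name,ecc.mode.current` and `--format=csv,noheader` options of `nvidia-smi`. It only runs the query where `nvidia-smi` is found, i.e. on a GPU node.

```shell
#!/bin/bash
# ecc_report: read lines of the form "GPU name, ECC mode" from stdin
# and print one human-readable line per GPU.
ecc_report() {
  while IFS=',' read -r name ecc; do
    ecc="${ecc# }"   # drop the single blank that follows the comma
    case "$ecc" in
      Enabled)  echo "$name: ECC enabled" ;;
      Disabled) echo "$name: ECC disabled" ;;
      *)        echo "$name: ECC not supported or not reported ($ecc)" ;;
    esac
  done
}

# Only query the driver when nvidia-smi is available (GPU node).
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,ecc.mode.current --format=csv,noheader | ecc_report
fi
```

The same lines can be appended to a batch script such as gpu_test.scrpt after `module load cuda/9.1.85`; the result then appears in the SLURM output file.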
| |
| <HTML> |
| <!--slide 5--> |
| </HTML> |
| ====== Working on KNL nodes ====== |
| |
| **Interactive mode** |
| |
| <code> |
| 1. salloc -N 1 -p knl --qos knl -C knl -L intel@vsc |
| |
| 2. squeue -u training |
| |
| 3. srun -n 1 hostname (...while still on the login node !) |
| |
| 4. ssh n25-050 (...or whichever node was assigned) |
| |
| 5. module purge |
| |
| 6. module load intel/18 |
| cd ~/examples/09_special_hardware/knl |
| icc -xHost -qopenmp sample.c |
| export OMP_NUM_THREADS=16 |
| ./a.out |
| </code> |
| <HTML> |
| <!--slide 6--> |
| </HTML> |
| ====== Working on KNL nodes cont. ====== |
| |
| **SLURM submission** [[examples/knl/knl_test.scrpt|knl_test.scrpt]] |
| |
| <code bash> |
| #!/bin/bash |
| # usage: sbatch ./knl_test.scrpt |
| # |
| #SBATCH -J knl |
| #SBATCH -N 1 |
| #SBATCH --partition knl |
| #SBATCH --qos knl |
| #SBATCH -C knl |
| #SBATCH -L intel@vsc |
| |
| module purge |
| module load intel/18 |
| cat /proc/cpuinfo |
| export OMP_NUM_THREADS=16 |
| ./a.out |
| </code> |
| <html><font color="#cc3300"></html>**Exercise/Example/Problem:**<html></font></html> <html><br/></html> Given our KNL model, can you determine the current level of hyperthreading, i.e. 2x, 3x, 4x, or some other factor? |
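A hint for the exercise: on Linux, `/proc/cpuinfo` lists for each logical CPU both `siblings` (logical CPUs per package) and `cpu cores` (physical cores per package), so their ratio gives the number of hardware threads per core. The sketch below assumes these two fields are present; `threads_per_core` is a hypothetical helper name.

```shell
#!/bin/bash
# threads_per_core: read /proc/cpuinfo-style text from stdin and print
# siblings / cpu cores, taken from the first entry that reports them.
threads_per_core() {
  awk -F':' '
    /^siblings/  && !s { s = $2 + 0 }
    /^cpu cores/ && !c { c = $2 + 0 }
    END { if (s > 0 && c > 0) print s / c; else exit 1 }
  '
}

# Evaluate on the current node if /proc/cpuinfo is readable.
if [ -r /proc/cpuinfo ]; then
  echo "threads per core: $(threads_per_core < /proc/cpuinfo)"
fi
```

Run it on the KNL node itself (interactively or from knl_test.scrpt), not on the login node. Where `lscpu` is installed, its "Thread(s) per core" line should report the same value.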
| |
| <HTML> |
| <!--slide 7--> |
| </HTML> |
| ====== Working on binf nodes ====== |
| |
| **Interactive mode** |
| |
| <code> |
| 1. salloc -N 1 -p binf --qos normal_binf -C binf -L intel@vsc |
| |
| 2. squeue -u training |
| |
| 3. srun -n 4 hostname (...while still on the login node !) |
| |
| 4. ssh binf-11 (...or whichever node was assigned) |
| |
| 5. module purge |
| |
| 6. module load intel/17 |
| cd ~/examples/09_special_hardware/binf |
| icc -xHost -qopenmp sample.c |
| export OMP_NUM_THREADS=8 |
| ./a.out |
| </code> |
| <HTML> |
| <!--slide 8--> |
| </HTML> |
| ====== Working on binf nodes cont. ====== |
| |
| **SLURM submission** [[examples/binf/gromacs-5.1.4_binf/slrm.sbmt.scrpt|slrm.sbmt.scrpt]] |
| |
| <code bash> |
| #!/bin/bash |
| # usage: sbatch ./slrm.sbmt.scrpt |
| # |
| #SBATCH -J gmxbinfs |
| #SBATCH -N 2 |
| #SBATCH --partition binf |
| #SBATCH --qos normal_binf |
| #SBATCH -C binf |
| #SBATCH --ntasks-per-node 24 |
| #SBATCH --ntasks-per-core 1 |
| |
| module purge |
| module load intel/17 intel-mkl/2017 intel-mpi/2017 gromacs/5.1.4_binf |
| |
| export I_MPI_PIN=1 |
| export I_MPI_PIN_PROCESSOR_LIST=0-23 |
| export I_MPI_FABRICS=shm:tmi |
| export I_MPI_TMI_PROVIDER=psm2 |
| export OMP_NUM_THREADS=1 |
| export MDRUN_ARGS=" -dd 0 0 0 -rdd 0 -rcon 0 -dlb yes -dds 0.8 -tunepme -v -nsteps 10000 " |
| |
| mpirun -np $SLURM_NTASKS gmx_mpi mdrun ${MDRUN_ARGS} -s hSERT_5HT_PROD.0.tpr -deffnm hSERT_5HT_PROD.0 -px hSERT_5HT_PROD.0_px.xvg -pf hSERT_5HT_PROD.0_pf.xvg -swap hSERT_5HT_PROD.0.xvg |
| </code> |
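The job layout in the script above can be sanity-checked with simple arithmetic: 2 nodes with 24 tasks per node should give 48 MPI ranks for `mpirun -np $SLURM_NTASKS`, and the pinning list `0-23` covers exactly 24 cores per node. A minimal sketch of that check follows; the helper names `total_tasks` and `pin_count` are hypothetical.

```shell
#!/bin/bash
# total_tasks NODES TASKS_PER_NODE: number of MPI ranks in the job
total_tasks() {
  local nodes="$1" per_node="$2"
  echo $(( nodes * per_node ))
}

# pin_count LO-HI: number of cores in a simple range such as 0-23
pin_count() {
  local lo="${1%-*}" hi="${1#*-}"
  echo $(( hi - lo + 1 ))
}

# Values taken from the script above: -N 2, --ntasks-per-node 24, list 0-23
echo "expected number of MPI ranks: $(total_tasks 2 24)"
echo "cores in pinning list:        $(pin_count 0-23)"
```

If `--ntasks-per-node` is changed, `I_MPI_PIN_PROCESSOR_LIST` should be adjusted so that both numbers stay equal.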
| <HTML> |
| <!--slide 9--> |
| </HTML> |
| ====== Real-World Example, AMBER-16 ====== |
| |
| ^ Performance^Power Efficiency ^ |
| | {{:pandoc:introduction-to-vsc:09_special_hardware:accelerators:amber16.perf.png}}|{{:pandoc:introduction-to-vsc:09_special_hardware:accelerators:amber16.powereff.png}} | |
| |