====== GPUs available & how to use it ======

  * Article written by Siegfried Höfinger (VSC Team) <html><br></html>(last update 2019-01-15 by sh).

===== TOP500 List June 2024 =====

^ Rank^Nation ^Machine ^ Performance^Accelerators ^
| 1.|{{.:us.png?0x24}} | Frontier | 1206 PFLOPs/s | AMD Instinct MI250X |
| 2.|{{.:us.png?0x24}} | Aurora | 1012 PFLOPs/s | Intel Data Center GPU Max |
| 3.|{{.:us.png?0x24}} | Eagle | 561 PFLOPs/s | NVIDIA H100 |
| 4.|{{.:jp.png?0x24}} | Fugaku | 442 PFLOPs/s | |
| 5.| | LUMI | 379 PFLOPs/s | AMD Instinct MI250X |
| 6.|{{.:ch.png?0x24}} | Alps | 270 PFLOPs/s | NVIDIA GH200 Superchip |
| 7.|{{.:it.png?0x24}} | Leonardo | 241 PFLOPs/s | NVIDIA A100 SXM4 |
| 8.| | MareNostrum 5 ACC | 175 PFLOPs/s | NVIDIA H100 |
| 9.|{{.:us.png?0x24}} | Summit | 148 PFLOPs/s | NVIDIA V100 |
| 10.|{{.:us.png?0x24}} | Eos NVIDIA DGX | 121 PFLOPs/s | NVIDIA H100 |

===== TOP500 List November 2018 =====

^ Rank^Nation ^Machine ^ Performance^Accelerators ^
| 1.|{{:pandoc:introduction-to-vsc:09_special_hardware:accelerators:us.png?0x24}} |Summit | 144 PFLOPs/s|<html><font color="navy"></html>NVIDIA V100<html></font></html> |
| 2.|{{:pandoc:introduction-to-vsc:09_special_hardware:accelerators:us.png?0x24}} |Sierra | 95 PFLOPs/s|<html><font color="navy"></html>NVIDIA V100<html></font></html> |
| 3.|{{:pandoc:introduction-to-vsc:09_special_hardware:accelerators:cn.png?0x24}} |Sunway TaihuLight | 93 PFLOPs/s| |
| 4.|{{:pandoc:introduction-to-vsc:09_special_hardware:accelerators:cn.png?0x24}} |Tianhe-2A | 62 PFLOPs/s| |
| 5.|{{:pandoc:introduction-to-vsc:09_special_hardware:accelerators:ch.png?0x24}} |Piz Daint | 21 PFLOPs/s|<html><font color="navy"></html>NVIDIA P100<html></font></html> |
| 6.|{{:pandoc:introduction-to-vsc:09_special_hardware:accelerators:us.png?0x24}} |Trinity | 20 PFLOPs/s|<html><font color="#cc3300"></html>Intel Xeon Phi 7250/KNL<html></font></html> |
| 7.|{{:pandoc:introduction-to-vsc:09_special_hardware:accelerators:jp.png?0x24}} |ABCI | 20 PFLOPs/s|<html><font color="navy"></html>NVIDIA V100<html></font></html> |
| 8.|{{:pandoc:introduction-to-vsc:09_special_hardware:accelerators:de.png?0x24}} |SuperMUC-NG | 20 PFLOPs/s| |
| 9.|{{:pandoc:introduction-to-vsc:09_special_hardware:accelerators:us.png?0x24}} |Titan | 18 PFLOPs/s|<html><font color="navy"></html>NVIDIA K20x<html></font></html> |
| 10.|{{:pandoc:introduction-to-vsc:09_special_hardware:accelerators:us.png?0x24}} |Sequoia | 17 PFLOPs/s| |

===== Components on VSC-5 =====

^Model ^#cores ^Clock Freq (GHz)^Memory (GB)^Bandwidth (GB/s)^TDP (Watt)^FP32/FP64 (GFLOPs/s)^
|19x GeForce RTX-2080Ti n375-[001-019] - only in a special project | | | | | | |
|{{:pandoc:introduction-to-vsc:09_special_hardware:rtx-2080.jpg?nolink&200}} |4352 |1.35 |11 |616 |250 |13450/420 |
|45x2 NVIDIA A40 n306[6,7,8]-[001-019,001-019,001-007] | | | | | | |
|{{ :pandoc:introduction-to-vsc:09_special_hardware:a40.jpg?nolink&200|}} |10752 |1.305 |48 |696 |300 |37400/1169 |
|62x2 NVIDIA A100-40GB n307[1-4]-[001-015] | | | | | | |
|{{ :pandoc:introduction-to-vsc:09_special_hardware:a100.jpg?nolink&200|}} |6912 |0.765 |40 |1555 |250 |19500/9700 |
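The FP32/FP64 columns follow directly from core count and clock: peak throughput is roughly 2 FLOPs (one fused multiply-add) per core per cycle. As a sketch for the A40, under the assumption that the quoted 37400 GFLOPs/s refers to a boost clock of about 1.74 GHz rather than the 1.305 GHz base clock listed in the table:

```shell
# Peak FP32 throughput = 2 FLOPs per FMA x #CUDA cores x clock (GHz).
# Assumption: 37400 GFLOPs/s for the A40 corresponds to the ~1.74 GHz
# boost clock, not the 1.305 GHz base clock shown in the table.
awk 'BEGIN { printf "%.0f GFLOPs/s\n", 2 * 10752 * 1.74 }'
```

which lands within rounding distance of the 37400 GFLOPs/s quoted above.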
===== Components on VSC-3 =====

^Model ^#cores^Clock Freq (GHz)^Memory (GB)^Bandwidth (GB/s)^TDP (Watt)^FP32/FP64 (GFLOPs/s)^
|<html><font color="navy"></html>36+50x GeForce GTX-1080 n7[1-3]-[001-004,001-022,001-028]<html></font></html>| | | | | | |
|{{:pandoc:introduction-to-vsc:09_special_hardware:accelerators:nvidia-gtx-1080.jpg}} |2560 |1.61 |8 |320 |180 |8228/257 |
|<html><font color="navy"></html>4x Tesla K20m n72-02[4-5]<html></font></html> | | | | | | |
|{{:pandoc:introduction-to-vsc:09_special_hardware:accelerators:nvidia-k20m.png}} |2496 |0.71 |5 |208 |195 |3520/1175 |
|<html><font color="navy"></html>2x Tesla M60 n72-023<html></font></html> | | | | | | |
|{{:pandoc:introduction-to-vsc:09_special_hardware:accelerators:m60.jpeg}} |2048 |1.18 |8 |160 |265 |8500/266 |
|<html><font color="#cc3300"></html>4x KNL 7210 n25-05[0-3]<html></font></html> | | | | | | |
|{{:pandoc:introduction-to-vsc:09_special_hardware:accelerators:intel-knl.png}} |64 |1.30 |197+16<384 |102 |215 |5000+/2500+ |

===== Working on GPU nodes interactively =====

**Interactive mode**

<code>
1. VSC-5 > salloc -N 1 -p zen2_0256_a40x2 --qos zen2_0256_a40x2 --gres=gpu:2

2. VSC-5 > squeue -u $USER

3. VSC-5 > srun -n 1 hostname (...while still on the login node !)

4. VSC-5 > ssh n3066-012 (...or whichever other node has been assigned)

5. VSC-5 > module load cuda/9.1.85
cd ~/examples/09_special_hardware/gpu_gtx1080/matrixMul
nvcc ./matrixMul.cu
./a.out

cd ~/examples/09_special_hardware/gpu_gtx1080/matrixMulCUBLAS
nvcc matrixMulCUBLAS.cu -lcublas
./a.out

6. VSC-5 > nvidia-smi

7. VSC-5 > /opt/sw/x86_64/glibc-2.17/ivybridge-ep/cuda/9.1.85/NVIDIA_CUDA-9.1_Samples/1_Utilities/deviceQuery/deviceQuery
</code>
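Once inside the allocation, it is often useful to check which GPUs the job was actually granted. SLURM installations with GPU confinement commonly export ''CUDA_VISIBLE_DEVICES'' inside the job; this is a hedged sketch of that check, not VSC-specific documentation:

```shell
# Print the GPU indices visible to this job; falls back to "unset" when
# run outside an allocation (or when the variable is not exported).
echo "${CUDA_VISIBLE_DEVICES:-unset}"
```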

===== Working on GPU nodes using SLURM =====

**SLURM submission** gpu_test.scrpt

<code bash>
#!/bin/bash
#
# usage: sbatch ./gpu_test.scrpt
#
#SBATCH -J A40
#SBATCH -N 1                         # use -N only if you use both GPUs on the node, otherwise leave this line out
#SBATCH --partition zen2_0256_a40x2
#SBATCH --qos zen2_0256_a40x2
#SBATCH --gres=gpu:2                 # or --gres=gpu:1 if you only want to use half a node

module purge
/opt/sw/x86_64/glibc-2.17/ivybridge-ep/cuda/9.1.85/NVIDIA_CUDA-9.1_Samples/1_Utilities/deviceQuery/deviceQuery
</code>
<html><font color="navy"></html>**Exercise/Example/Problem:**<html></font></html> <html><br/></html> Using interactive mode or batch submission, figure out whether ECC is enabled on the GPUs of type gtx1080.
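One way to approach the exercise: ''nvidia-smi -q'' prints an ECC section per GPU. The snippet below parses a canned sample of that output; the sample text and its "Disabled" value are placeholders for illustration, not the actual answer for the gtx1080 nodes:

```shell
# Parse the current ECC mode out of (canned) "nvidia-smi -q"-style output.
sample='    Ecc Mode
        Current                           : Disabled
        Pending                           : Disabled'
printf '%s\n' "$sample" | awk '/Current/ { print $3 }'
```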

===== Working on KNL nodes =====

**Interactive mode**

<code>
1. salloc -N 1 -p knl --qos knl -C knl -L intel@vsc

2. squeue -u training

3. srun -n 1 hostname (...while still on the login node !)

4. ssh n25-050 (...or whichever other node has been assigned)

5. module purge

6. module load intel/18
cd ~/examples/09_special_hardware/knl
icc -xHost -qopenmp sample.c
export OMP_NUM_THREADS=16
./a.out
</code>
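sample.c is an OpenMP code, so the ''export OMP_NUM_THREADS=16'' line above controls how many threads ''./a.out'' spawns. The usual default rule can be sketched in shell, under the assumption that the OpenMP runtime falls back to the number of logical CPUs when the variable is unset:

```shell
# Use OMP_NUM_THREADS when set, otherwise fall back to all logical CPUs,
# mirroring the common OpenMP default.
threads="${OMP_NUM_THREADS:-$(nproc)}"
echo "$threads"
```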

===== Working on KNL nodes using SLURM =====

**SLURM submission** [[examples/knl/knl_test.scrpt|knl_test.scrpt]]

<code bash>
#!/bin/bash
# usage: sbatch ./knl_test.scrpt
#
#SBATCH -J knl
#SBATCH -N 1
#SBATCH --partition knl
#SBATCH --qos knl
#SBATCH -C knl
#SBATCH -L intel@vsc

module purge
module load intel/18
cat /proc/cpuinfo
export OMP_NUM_THREADS=16
./a.out
</code>
<html><font color="#cc3300"></html>**Exercise/Example/Problem:**<html></font></html> <html><br/></html> Given our KNL model, can you determine the current level of hyperthreading, i.e. 2x, 3x, 4x ?
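A possible approach: divide the number of logical processors in /proc/cpuinfo by the number of distinct physical core ids. The sketch below runs on a canned four-entry fragment instead of the real file; the file name cpuinfo.sample and the resulting 4x are illustrative only, not the answer for our nodes:

```shell
# Hyperthreading level = logical processors / distinct physical core ids.
# A canned 4-entry fragment stands in for the real /proc/cpuinfo.
cat > cpuinfo.sample <<'EOF'
processor : 0
core id   : 0
processor : 1
core id   : 0
processor : 2
core id   : 0
processor : 3
core id   : 0
EOF
awk -F': ' '/^processor/ { n++ } /^core id/ { c[$2] = 1 }
            END { k = 0; for (i in c) k++; print n / k "x" }' cpuinfo.sample
```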

===== Working on binf nodes =====

**Interactive mode**

<code>
1. salloc -N 1 -p binf --qos normal_binf -C binf -L intel@vsc

2. squeue -u training

3. srun -n 4 hostname (...while still on the login node !)

4. ssh binf-11 (...or whichever other node has been assigned)

5. module purge

6. module load intel/17
cd ~/examples/09_special_hardware/binf
icc -xHost -qopenmp sample.c
export OMP_NUM_THREADS=8
./a.out
</code>

===== Working on binf nodes using SLURM =====

**SLURM submission** [[examples/binf/gromacs-5.1.4_binf/slrm.sbmt.scrpt|slrm.sbmt.scrpt]]

<code bash>
#!/bin/bash
# usage: sbatch ./slrm.sbmt.scrpt
#
#SBATCH -J gmxbinfs
#SBATCH -N 2
#SBATCH --partition binf
#SBATCH --qos normal_binf
#SBATCH -C binf
#SBATCH --ntasks-per-node 24
#SBATCH --ntasks-per-core 1

module purge
module load intel/17 intel-mkl/2017 intel-mpi/2017 gromacs/5.1.4_binf

export I_MPI_PIN=1
export I_MPI_PIN_PROCESSOR_LIST=0-23
export I_MPI_FABRICS=shm:tmi
export I_MPI_TMI_PROVIDER=psm2
export OMP_NUM_THREADS=1
export MDRUN_ARGS=" -dd 0 0 0 -rdd 0 -rcon 0 -dlb yes -dds 0.8 -tunepme -v -nsteps 10000 "

mpirun -np $SLURM_NTASKS gmx_mpi mdrun ${MDRUN_ARGS} -s hSERT_5HT_PROD.0.tpr -deffnm hSERT_5HT_PROD.0 -px hSERT_5HT_PROD.0_px.xvg -pf hSERT_5HT_PROD.0_pf.xvg -swap hSERT_5HT_PROD.0.xvg
</code>
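The settings in this script have to agree with each other: with ''-N 2'' and ''--ntasks-per-node 24'', ''$SLURM_NTASKS'' resolves to 48 MPI ranks, and the pin list ''0-23'' must supply one CPU per rank on each node. A small consistency check (the shell variable names below are ours, not SLURM's):

```shell
# -N 2 nodes x 24 tasks per node = total MPI ranks ($SLURM_NTASKS).
nodes=2
tasks_per_node=24
echo "total ranks: $(( nodes * tasks_per_node ))"

# The I_MPI_PIN_PROCESSOR_LIST range "0-23" must cover 24 CPUs per node.
pin_list="0-23"
lo="${pin_list%-*}"
hi="${pin_list#*-}"
echo "pinned CPUs per node: $(( hi - lo + 1 ))"
```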

===== Real-World Example, AMBER-16 =====

^ Performance^Power Efficiency ^
| {{.:amber16.perf.png}}|{{.:amber16.powereff.png}} |
| |