====== Special types of hardware (GPUs, KNLs) available & how to access them ======

  * Article written by Siegfried Höfinger (VSC Team) (last update 2017-04-27 by sh).

====== TOP500 List Nov 2016 ======

^ Rank ^ Nation ^ Machine ^ Performance ^ Accelerators ^
| 1. | {{pandoc:introduction-to-vsc:09_special_hardware:01_accelerators:cn.png}} | Sunway TaihuLight | 93 PFLOPs/s | |
| 2. | {{pandoc:introduction-to-vsc:09_special_hardware:01_accelerators:cn.png}} | Tianhe-2 (MilkyWay-2) | 34 PFLOPs/s | Intel Xeon Phi 31S1P |
| 3. | {{pandoc:introduction-to-vsc:09_special_hardware:01_accelerators:us.png}} | Titan | 18 PFLOPs/s | NVIDIA K20x |
| 4. | {{pandoc:introduction-to-vsc:09_special_hardware:01_accelerators:us.png}} | Sequoia | 17 PFLOPs/s | |
| 5. | {{pandoc:introduction-to-vsc:09_special_hardware:01_accelerators:us.png}} | Cori | 14 PFLOPs/s | Intel Xeon Phi 7250 |
| 6. | {{pandoc:introduction-to-vsc:09_special_hardware:01_accelerators:jp.png}} | Oakforest-PACS | 14 PFLOPs/s | Intel Xeon Phi 7250 |
| 7. | {{pandoc:introduction-to-vsc:09_special_hardware:01_accelerators:jp.png}} | K-computer | 11 PFLOPs/s | |
| 8. | {{pandoc:introduction-to-vsc:09_special_hardware:01_accelerators:ch.png}} | Piz Daint | 10 PFLOPs/s | NVIDIA P100 |
| 9. | {{pandoc:introduction-to-vsc:09_special_hardware:01_accelerators:us.png}} | Mira | 9 PFLOPs/s | |
| 10. | {{pandoc:introduction-to-vsc:09_special_hardware:01_accelerators:us.png}} | Trinity | 8 PFLOPs/s | |

====== Components on VSC-3 ======

^ Model ^ #cores ^ Clock Freq (GHz) ^ Memory (GB) ^ Bandwidth (GB/s) ^ TDP (Watt) ^ FP32/FP64 (GFLOPs/s) ^
| 10x GeForce GTX-1080 n25-0[10-20] | | | | | | |
| {{pandoc:introduction-to-vsc:09_special_hardware:01_accelerators:nvidia-gtx-1080.jpg}} | 2560 | 1.61 | 8 | 320 | 180 | 8228/257 |
| 4x Tesla K20m n25-00[5-6] | | | | | | |
| {{pandoc:introduction-to-vsc:09_special_hardware:01_accelerators:nvidia-k20m.png}} | 2496 | 0.71 | 5 | 208 | 195 | 3520/1175 |
| 4x KNL 7210 n25-05[0-3] | | | | | | |
| {{pandoc:introduction-to-vsc:09_special_hardware:01_accelerators:intel-knl.png}} | 64 | 1.30 | 384 | 102 | 215 | 5000+/2500+ |

====== Working on GPU nodes ======

**Interactive mode**


1. salloc -N 1 -p gpu --qos=gpu_compute  -C gtx1080 --gres=gpu:1  (...perhaps -L intel@vsc)

2. squeue -u training

3. srun -n 1 hostname  (...while still on the login node !)

4. ssh n25-012  (...or whichever node was assigned)

5. module load cuda/8.0.27    
     cd ~/examples/09_special_hardware/gpu_gtx1080/matrixMul
     nvcc ./matrixMul.cu
     ./a.out

     cd ~/examples/09_special_hardware/gpu_gtx1080/matrixMulCUBLAS
     nvcc matrixMulCUBLAS.cu -lcublas
     ./a.out

6. nvidia-smi

7. /opt/sw/x86_64/glibc-2.17/IntelXeonE51620v3/cuda/8.0.27/NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery/deviceQuery

====== Working on GPU nodes cont. ======

**SLURM submission**


#!/bin/bash
#  usage: sbatch ./gpu_test.scrpt          
#
#SBATCH -J gtx1080     
#SBATCH -N 1
#SBATCH --partition=gpu         
#SBATCH --qos=gpu_compute
#SBATCH -C gtx1080     
#SBATCH --gres=gpu:1

module purge
module load cuda/8.0.27

nvidia-smi
/opt/sw/x86_64/glibc-2.17/IntelXeonE51620v3/cuda/8.0.27/NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery/deviceQuery
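
Besides the full ''nvidia-smi'' report, individual GPU properties can be requested through its query interface. The following is a minimal sketch, not a definitive recipe: ''format_ecc'' is a hypothetical helper introduced here for illustration, and the assumption is that the installed driver supports the ''ecc.mode.current'' query field. The value printed depends on the GPU model and driver and is not predicted here.

```shell
#!/bin/bash
# Sketch: summarise the ECC mode reported by nvidia-smi, one line per GPU.
# format_ecc is a hypothetical helper (not part of CUDA or SLURM);
# it expects "name, ecc-mode" CSV rows on stdin.
format_ecc() {
    local name mode
    while IFS=, read -r name mode; do
        [ -n "$name" ] || continue      # skip empty lines
        mode="${mode# }"                # drop the blank that follows the comma
        echo "$name: ECC mode $mode"
    done
}

# Only query the driver if nvidia-smi is available (i.e. on a GPU node).
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,ecc.mode.current --format=csv,noheader | format_ecc
else
    echo "nvidia-smi not found (run this on a GPU node)"
fi
```

The same lines can be appended to the batch script above or typed on the node in interactive mode; ''nvidia-smi -q -d ECC'' gives a more verbose view of the same information.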

**Exercise/Example/Problem:**
Using interactive mode or batch submission, determine whether ECC is enabled on GPUs of type gtx1080.

====== Working on KNL nodes ======

**Interactive mode**


1. salloc -N 1 -p knl --qos=knl -C knl -L intel@vsc

2. squeue -u training

3. srun -n 1 hostname

4. ssh n25-050  (...or whichever node was assigned)

5. module purge

6. module load intel/17.0.2
     cd ~/examples/09_special_hardware/knl
     icc -xHost -qopenmp sample.c
     export OMP_NUM_THREADS=16
     ./a.out
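
The steps above run the sample with a single thread count. To see how the program responds to different settings of ''OMP_NUM_THREADS'', a small sweep can be sketched as follows. ''sweep_threads'' is a hypothetical helper, and the thread counts 16, 32 and 64 are assumptions chosen for illustration only.

```shell
#!/bin/bash
# Sketch: run one program repeatedly with different OpenMP thread counts.
# sweep_threads is a hypothetical helper; usage: sweep_threads PROGRAM COUNT...
sweep_threads() {
    local prog="$1"
    shift
    local t
    for t in "$@"; do
        echo "== OMP_NUM_THREADS=$t =="
        # The variable is set for this single invocation only.
        OMP_NUM_THREADS="$t" "$prog"
    done
}

# Example sweep; it only runs if the compiled sample is present.
if [ -x ./a.out ]; then
    sweep_threads ./a.out 16 32 64
fi
```

Prefixing the command with the assignment keeps the shell's own ''OMP_NUM_THREADS'' untouched between runs.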

====== Working on KNL nodes cont. ======

**SLURM submission**


#!/bin/bash
#  usage: sbatch ./knl_test.scrpt          
#
#SBATCH -J knl         
#SBATCH -N 1
#SBATCH --partition=knl         
#SBATCH --qos=knl         
#SBATCH -C knl         
#SBATCH -L intel@vsc

module purge
module load intel/17.0.1
cat /proc/cpuinfo
export OMP_NUM_THREADS=16
./a.out
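
The raw ''/proc/cpuinfo'' listing is long, so a short summary can help. The sketch below assumes the x86 Linux layout, where each processor entry carries a ''siblings'' field (hardware threads per package) and a ''cpu cores'' field (physical cores per package); their ratio is the number of hardware threads per core. ''threads_per_core'' is a hypothetical helper introduced for illustration.

```shell
#!/bin/bash
# Sketch: derive hardware threads per core from a cpuinfo-style file.
# threads_per_core is a hypothetical helper; it defaults to /proc/cpuinfo.
threads_per_core() {
    local cpuinfo="${1:-/proc/cpuinfo}"
    if [ ! -r "$cpuinfo" ]; then
        echo "unknown"
        return 0
    fi
    # Take the first "siblings" and "cpu cores" entries and print their ratio.
    awk -F: '
        /^siblings/  && !s { s = $2 + 0 }
        /^cpu cores/ && !c { c = $2 + 0 }
        END {
            if (s > 0 && c > 0) print s / c
            else print "unknown"
        }' "$cpuinfo"
}

echo "hardware threads per core: $(threads_per_core)"
```

This must run on the allocated KNL node (inside the job script or after ''ssh'' to the node); on the login node it describes the login node's CPU instead.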

**Exercise/Example/Problem:**
Given our KNL model, can you determine the current level of hyperthreading, i.e. 2x, 3x, 4x, or some other factor?

====== Real-World Example, AMBER-16 ======

^ Performance ^ Power Efficiency ^
| {{pandoc:introduction-to-vsc:09_special_hardware:01_accelerators:amber16.perf.png}} | {{pandoc:introduction-to-vsc:09_special_hardware:01_accelerators:amber16.powereff.png}} |