====== VSC – supercomputers ======
* Article written by Claudia Blaas-Schenner (VSC Team)
(last update 2020-10-10 by cb).
**OUTLINE:**
* **VSC – Vienna Scientific Cluster**
$~$
* **Supercomputers for beginners – introducing VSC to our (new) users**
* Supercomputers for beginners – what is a supercomputer?
* VSC systems – what do they look like?
* VSC-4 – components of a supercomputer
* Parallel hardware architectures – which parallel programming models can be used?
* VSC compute nodes
* VSC node-interconnect
* VSC-3 ping-pong – intra-node vs. inter-node
----
====== VSC – Vienna Scientific Cluster ======
* **The VSC is** a joint high performance computing (HPC) facility of Austrian universities.
* **Our mission:** Within the limits of available resources we satisfy the HPC needs of our users.
* **VSC is primarily devoted to research.**
* **Who can use VSC?** Scientific personnel of the partner universities, see: https://vsc.ac.at/access
➠$~~$**https://vsc.ac.at/training**
**VSC course slides:**
➠$~~$➠$~~$➠$~~$**[[https://wiki.vsc.ac.at/doku.php?id=pandoc:introduction-to-vsc:01_supercomputers_for_beginners:00_linux|VSC-Linux]]**
➠$~~$➠$~~$➠$~~$**[[https://wiki.vsc.ac.at/doku.php?id=pandoc:introduction-to-vsc:01_supercomputers_for_beginners:00_intro|VSC-Intro]]**
----
====== Supercomputers for beginners ======
* **What is a supercomputer?**
* A supercomputer is a computer with a high level of performance as compared to a general-purpose computer. The performance of a supercomputer is commonly measured in floating-point operations per second (FLOPS)… [from Wikipedia]
* **A supercomputer is listed in the [[https://www.top500.org|TOP500]]**
^ System ^ Performance ^ TOP500 rank ^ GREEN500 rank ^ #1 of the TOP500 (same list) ^
|VSC-1 (2009) | 35 TFlop/s| 156 (11/2009)| 94 (06/2009)| 1.8 PFlop/s #1 (11/2009)|
|VSC-2 (2011) | 135 TFlop/s| 56 (06/2011)| 71 (06/2011)| 8 PFlop/s #1 (06/2011)|
|[[https://www.top500.org/system/178471|VSC-3 (2014)]]| 596 TFlop/s| 85 (11/2014)| 86 (11/2014)| 33 PFlop/s #1 (11/2014)|
|VSC-3 (………) | 596 TFlop/s| 461 (11/2017)| 175 (11/2017)| 93 PFlop/s #1 (11/2017)|
|[[https://www.top500.org/system/179697|VSC-4 (2019)]]| 2.7 PFlop/s| 82 (06/2019)| ———| 148 PFlop/s #1 (06/2019)|
|VSC-4 (………) | 2.7 PFlop/s| 105 (06/2020)| ———| 415 PFlop/s #1 (06/2020)|
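To put the table in perspective, the following minimal sketch computes each VSC entry as a percentage of the #1 system of the same TOP500 list. All numbers are copied from the table; the names `ROWS` and `share_of_top` are made up for this illustration.

```python
# (system, TOP500 list, VSC performance in TFlop/s, #1 system in PFlop/s)
# Values are copied from the table above.
ROWS = [
    ("VSC-1", "11/2009", 35.0, 1.8),
    ("VSC-2", "06/2011", 135.0, 8.0),
    ("VSC-3", "11/2014", 596.0, 33.0),
    ("VSC-3", "11/2017", 596.0, 93.0),
    ("VSC-4", "06/2019", 2700.0, 148.0),
    ("VSC-4", "06/2020", 2700.0, 415.0),
]


def share_of_top(vsc_tflops, top_pflops):
    """Return the VSC performance as a percentage of the #1 system."""
    top_tflops = top_pflops * 1000.0  # 1 PFlop/s = 1000 TFlop/s
    return 100.0 * vsc_tflops / top_tflops


for system, edition, vsc_tflops, top_pflops in ROWS:
    share = share_of_top(vsc_tflops, top_pflops)
    print(f"{system} ({edition}): {share:.2f} % of the #1 system")
```

On the list where each system first appears in the table, the VSC systems reach roughly 1.7 % to 1.9 % of the #1 system; on the later lists the share drops below 1 % because the top of the list keeps growing.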
----
====== VSC systems – what do they look like? ======
{{.:vsc.png}}
----
====== VSC-4 – components of a supercomputer ======
{{.:vsc4-schematic.png}}
$~$
* **login nodes** vs. **compute nodes**
* **shared** (login nodes, storage) vs. **user-exclusive** (compute nodes, option -N $~$ | $~$ **on VSC-4** optionally shared nodes, option -n)
----
====== Parallel hardware architectures ======
**How to connect cores (processing units)?**
{{.:hw-cores_margin.png?150}}
{{.:hw-architectures.png}}
----
====== VSC compute nodes ======
* **VSC-3**, **VSC-3+**, and **VSC-4** $~$ ➠ $~$ Intel CPUs $~$ ➠ $~$ different: $~$ **types**, $~$ **memory**, $~$ **# cores**, $~$ **# HCAs**
plus special types of hardware (GPUs on VSC-3) ➠ see: [[../09_special_hardware/accelerators.html#(4)|talk on special hardware]] and [[../05_submitting_batch_jobs/slurm.html#(11)|talk on SLURM]]
$~$
* **VSC-3**: $~$ **1 node** $~$ = $~$ **2 sockets** (CPUs), **8 cores** per socket (P), **2 threads** per core (T1/T2) $~$ + $~$ **2 HCAs**
{{.:vsc3-node.png}}
* **intra-socket**: 59.7 GB/s (max), **inter-socket** via QPI (QuickPath Interconnect): 32 GB/s (max)
* **inter-node** via dual rail Intel QDR-80: 4 GB/s (max) / 3.4 GB/s (eff) per HCA (host channel adapter)
* Avoiding slow data paths is the key to most performance optimizations! $~~~$ ➠ $~$**Affinity matters!**$~$
**processing units** (PU#) $~~~$ ➠ pinning
see: [[https://wiki.vsc.ac.at/doku.php?id=pandoc:introduction-to-vsc:05_submitting_batch_jobs:slurm#mpi_ntasks_per_node_pinning|article on SLURM]] and [[https://wiki.vsc.ac.at/doku.php?id=doku:vsc3_pinning|pinning@Wiki]]
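On the cluster, pinning is typically set up through SLURM and the MPI library (see the links above). Purely to illustrate what pinning means, the sketch below restricts a process to a single processing unit using Python's Linux-only `os.sched_getaffinity` and `os.sched_setaffinity`. This is a swapped-in technique for illustration, not the method described in the linked articles, and `pin_to_one_pu` is a hypothetical helper name.

```python
import os


def pin_to_one_pu():
    """Pin the calling process to one processing unit (Linux only).

    Returns the set of allowed PUs before and after pinning.
    """
    before = os.sched_getaffinity(0)   # PUs this process may currently use
    target = min(before)               # choose the lowest-numbered allowed PU
    os.sched_setaffinity(0, {target})  # restrict the process to that PU
    after = os.sched_getaffinity(0)
    return before, after


before, after = pin_to_one_pu()
print("allowed PUs before pinning:", sorted(before))
print("allowed PUs after pinning: ", sorted(after))

# Undo the pinning so that the rest of the program is not affected.
os.sched_setaffinity(0, before)
```

A pinned process cannot be migrated by the operating system to another core or socket, so its data stays close to the core that uses it.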
**memory hierarchy (mem_0064 nodes):**
L1 data cache: **32 kB**, private to core
L2 cache: **256 kB**, private to core (unified)
L3 cache: **20 MB**, shared by all cores of 1 socket
**memory: 32 GB per socket**
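The per-node totals follow directly from the figures above. The following minimal sketch does the arithmetic for a VSC-3 mem_0064 node; `NodeSpec` and its field names are hypothetical and only serve this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Per-node hardware figures as quoted in this section."""
    sockets: int
    cores_per_socket: int
    threads_per_core: int
    mem_per_socket_gb: int
    hcas: int
    hca_eff_gb_s: float  # effective bandwidth per HCA

    @property
    def cores(self):
        return self.sockets * self.cores_per_socket

    @property
    def processing_units(self):
        # Hardware threads that the operating system sees as PU#.
        return self.cores * self.threads_per_core

    @property
    def memory_gb(self):
        return self.sockets * self.mem_per_socket_gb

    @property
    def inter_node_eff_gb_s(self):
        # Both rails (HCAs) used together.
        return self.hcas * self.hca_eff_gb_s


VSC3_MEM_0064 = NodeSpec(
    sockets=2, cores_per_socket=8, threads_per_core=2,
    mem_per_socket_gb=32, hcas=2, hca_eff_gb_s=3.4,
)

print("cores per node:           ", VSC3_MEM_0064.cores)
print("processing units per node:", VSC3_MEM_0064.processing_units)
print("memory per node [GB]:     ", VSC3_MEM_0064.memory_gb)
print("inter-node eff. [GB/s]:   ", VSC3_MEM_0064.inter_node_eff_gb_s)
```

The 64 GB per node match the node type name mem_0064. The effective inter-node bandwidth of both HCAs together is almost an order of magnitude below the intra-socket maximum of 59.7 GB/s, which is why data placement matters.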
----
====== VSC node-interconnect schematic ======
* **VSC-3** $~$ ➠ $~$ **dual rail Intel QDR-80 $~$ ➠ $~$ 3-level fat-tree** (BF = 2:1 / 4:1)
* **VSC-4** $~$ ➠ $~$ **single rail Intel Omni-Path $~$ ➠ $~$ 2-level fat-tree** (BF = 2:1)
{{.:vsc-fabric-3.png}}
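The effect of a blocking factor (BF) can be estimated with a simple worst-case model. The sketch below assumes that BF is the ratio of downlinks to uplinks at a switch level, so that the bandwidth available per link is divided by BF when all traffic has to cross that level. It uses the effective 3.4 GB/s per HCA quoted for VSC-3 and evaluates each quoted factor on its own; how the factors map to the tree levels is not modelled. The function name `worst_case_bandwidth` is hypothetical.

```python
def worst_case_bandwidth(link_gb_s, blocking_factor):
    """Bandwidth per link [GB/s] if all traffic crosses an oversubscribed level.

    Assumption: blocking_factor = downlinks / uplinks of that level.
    """
    if blocking_factor < 1:
        raise ValueError("blocking factor must be >= 1")
    return link_gb_s / blocking_factor


HCA_EFF_GB_S = 3.4  # effective bandwidth per HCA on VSC-3

for bf in (1, 2, 4):
    bw = worst_case_bandwidth(HCA_EFF_GB_S, bf)
    print(f"BF = {bf}:1 -> {bw:.2f} GB/s per HCA (worst case)")
```

Jobs whose nodes hang on the same edge switch do not cross the oversubscribed levels and are not affected by this reduction.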
----
====== VSC-3 ping-pong – intra-node vs. inter-node ======
* **1 node** $~$ = $~$ 2 sockets with 8 cores per socket $~$ + $~$ **2 HCAs**
* **inter-node** $~$ = $~$ IB fabric = dual rail Intel QDR-80 = 3-level fat-tree (BF: 2:1 / 4:1)
* **ping-pong benchmark** $~$ = $~$ module load $~$ intel/16.0.3 $~$ intel-mpi/5.1.3 $~$ | $~$ openmpi/1.10.2 $~$ (1 HCA)
**MPI latency & bandwidth (plus typical values for comparison):**
^VSC-3: ^ latency [μs] ^ ^ typical values for: ^ latency^ bandwidth^
|intra-socket | 0.3 μs | | L1 cache | 1–2 ns| 100 GB/s|
|inter-socket | 0.7 μs | | L2/L3 c. | 3–10 ns| 50 GB/s|
|IB -1- edge | 1.4 μs | | memory | 100 ns| 10 GB/s|
|IB -2- leaf | 1.8 μs | | HPC networks | | |
|IB -3- spine | 2.3 μs | | (per node / 2 HCAs) | 1–10 μs| 1–8 GB/s|
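How latency and bandwidth combine can be sketched with the common linear model: transfer time = latency + message size / bandwidth. The example below applies it to the inter-node latencies from the table and the effective 3.4 GB/s per HCA quoted for VSC-3 (the benchmark used 1 HCA). Combining these two numbers is an assumption of this illustration, not a measured result, and all function names are hypothetical. 1 GB is taken as 10^9 bytes.

```python
# Inter-node MPI latencies from the table above, in microseconds.
LATENCY_US = {"IB -1- edge": 1.4, "IB -2- leaf": 1.8, "IB -3- spine": 2.3}
BANDWIDTH_GB_S = 3.4  # effective bandwidth of one HCA


def transfer_time_us(message_bytes, latency_us, bandwidth_gb_s):
    """Linear model: time = latency + size / bandwidth, in microseconds."""
    bytes_per_us = bandwidth_gb_s * 1e3  # 1 GB/s = 1000 bytes per microsecond
    return latency_us + message_bytes / bytes_per_us


def effective_bandwidth_gb_s(message_bytes, latency_us, bandwidth_gb_s):
    """Bandwidth actually achieved for one message of the given size."""
    t_us = transfer_time_us(message_bytes, latency_us, bandwidth_gb_s)
    return message_bytes / t_us / 1e3


def half_bandwidth_size_bytes(latency_us, bandwidth_gb_s):
    """Message size at which half of the peak bandwidth is reached."""
    return latency_us * bandwidth_gb_s * 1e3


for path, latency in LATENCY_US.items():
    n_half = half_bandwidth_size_bytes(latency, BANDWIDTH_GB_S)
    print(f"{path}: half of the bandwidth is reached at about {n_half:.0f} bytes")
    for size in (8, 1024, 1024 * 1024):
        bw = effective_bandwidth_gb_s(size, latency, BANDWIDTH_GB_S)
        print(f"  {size:>8d} bytes -> {bw:.3f} GB/s")
```

In this model, small messages are dominated by latency and large messages by bandwidth, so many small messages are much more expensive than a few large ones.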