This version (2020/10/20 09:13) is a draft.
VSC – supercomputers
- Article written by Claudia Blaas-Schenner (VSC Team) (last update 2020-10-10 by cb).
OUTLINE:
- VSC – Vienna Scientific Cluster – introducing VSC to our (new) users
- Supercomputers for beginners – what is a supercomputer ?
- VSC systems – what do they look like ?
- VSC-4 – components of a supercomputer
- Parallel hardware architectures – which parallel programming models can be used ?
- VSC compute nodes
- VSC node-interconnect
- VSC-3 ping-pong – intra-node vs. inter-node
VSC – Vienna Scientific Cluster
- The VSC is a joint high performance computing (HPC) facility of Austrian universities.
- Our mission: Within the limits of available resources, we satisfy the HPC needs of our users.
- VSC is primarily devoted to research.
- Who can use VSC? Scientific personnel of the partner universities, see: https://vsc.ac.at/access — VSC is also open to users from other Austrian academic and research institutions.
- Projects (test, funded, …): Access to VSC is granted on the basis of peer-reviewed projects.
- Project manager (= usually your supervisor): submits the project application, requests extensions, creates user accounts, …
VSC links: | Information provided:
---|---
https://vsc.ac.at | VSC homepage (general info)
https://service.vsc.ac.at | VSC service website (application)
https://wiki.vsc.ac.at | VSC user documentation
| VSC user support & contact
- VSC Training Courses: https://vsc.ac.at/training
- VSC course slides: VSC-Linux, VSC-Intro
Supercomputers for beginners
- What is a supercomputer ?
- A supercomputer is a computer with a high level of performance as compared to a general-purpose computer. The performance of a supercomputer is commonly measured in floating-point operations per second (FLOPS)… [from Wikipedia]
- A supercomputer is typically listed in the TOP500 ranking:
system | performance | TOP500 rank | GREEN500 rank | performance of #1 in TOP500
---|---|---|---|---
VSC-1 (2009) | 35 TFlop/s | 156 (11/2009) | 94 (06/2009) | 1.8 PFlop/s #1 (11/2009)
VSC-2 (2011) | 135 TFlop/s | 56 (06/2011) | 71 (06/2011) | 8 PFlop/s #1 (06/2011)
VSC-3 (2014) | 596 TFlop/s | 85 (11/2014) | 86 (11/2014) | 33 PFlop/s #1 (11/2014)
VSC-3 (………) | 596 TFlop/s | 461 (11/2017) | 175 (11/2017) | 93 PFlop/s #1 (11/2017)
VSC-4 (2019) | 2.7 PFlop/s | 82 (06/2019) | ——— | 148 PFlop/s #1 (06/2019)
VSC-4 (………) | 2.7 PFlop/s | 105 (06/2020) | ——— | 415 PFlop/s #1 (06/2020)
VSC systems – what do they look like ?
VSC-4 – components of a supercomputer
- login nodes vs. compute nodes
- shared (login nodes, storage) vs. user-exclusive (compute nodes, allocated with -N | on VSC-4 optionally shared nodes, requested with -n)
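A minimal SLURM job script can illustrate the two allocation styles; this is only a sketch — the job name, time limit, and program path are placeholders, and 48 cores per VSC-4 node is an assumption:

```shell
#!/bin/bash
#SBATCH --job-name=example        # placeholder job name
#SBATCH -N 2                      # two full compute nodes, user-exclusive
#SBATCH --ntasks-per-node=48     # assuming 48 cores per VSC-4 node
#SBATCH --time=00:10:00

# Alternative on VSC-4: request individual cores instead of whole
# nodes, which may place the job on a shared node:
#   #SBATCH -n 4

srun ./my_mpi_program             # placeholder executable
```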
Parallel hardware architectures
VSC compute nodes
- VSC-3, VSC-3+, and VSC-4 ➠ Intel CPUs ➠ different: types, memory, # cores, # HCAs
- plus special types of hardware (GPUs on VSC-3) ➠ see: talk on special hardware and talk on SLURM
- VSC-3: 1 node = 2 sockets (CPUs), 8 cores per socket (P), 2 threads per core (T1/T2) + 2 HCAs
- intra-socket: 59.7 GB/s (max), inter-socket via QPI (QuickPath Interconnect): 32 GB/s (max)
- inter-node via dual-rail Intel QDR-80: 4 GB/s (max) / 3.4 GB/s (eff) per HCA (host channel adapter)
- Avoiding slow data paths is the key to most performance optimizations! ➠ Affinity matters!
- pinning processes/threads to processing units (PU#) ➠ see: article on SLURM and pinning @ Wiki
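How pinning might be requested under SLURM, as a sketch — the option spelling (`--cpu_bind` vs. the newer `--cpu-bind`) depends on the SLURM version, and the executable name is a placeholder:

```shell
# Pin one MPI rank to each physical core of a single node:
srun -N 1 --ntasks-per-node=16 --cpu_bind=cores ./my_mpi_program

# Inspect the placement actually granted inside a job:
srun -N 1 --ntasks-per-node=2 --cpu_bind=cores \
     bash -c 'grep Cpus_allowed_list /proc/self/status'
```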
memory hierarchy (mem_0064 nodes):
- L1 data cache: 32 kB, private to each core
- L2 cache: 256 kB, private to each core (unified)
- L3 cache: 20 MB, shared by all cores of one socket
- memory: 32 GB per socket
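The cache sizes and NUMA layout above can be checked on a compute node with standard Linux tools (run inside a job; availability of `numactl` on the nodes is an assumption):

```shell
# Core/socket/thread topology and cache sizes as seen by the OS:
lscpu

# NUMA layout, e.g. 2 NUMA nodes with 32 GB each on a mem_0064 node:
numactl --hardware

# Cache sizes of core 0 as reported by the kernel:
cat /sys/devices/system/cpu/cpu0/cache/index*/size
```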
VSC node-interconnect schematic
- VSC-3 ➠ dual-rail Intel QDR-80 ➠ 3-level fat-tree (BF = 2:1 / 4:1)
- VSC-4 ➠ single-rail Intel Omni-Path ➠ 2-level fat-tree (BF = 2:1)
VSC-3 ping-pong – intra-node vs. inter-node
- 1 node = 2 sockets with 8 cores per socket + 2 HCAs
- inter-node = IB fabric = dual-rail Intel QDR-80 = 3-level fat-tree (BF: 2:1 / 4:1)
- ping-pong benchmark: module load intel/16.0.3 intel-mpi/5.1.3 | openmpi/1.10.2 (1 HCA)
MPI latency & bandwidth (plus typical values for comparison):

VSC-3: | latency
---|---
intra-socket | 0.3 μs
inter-socket | 0.7 μs
IB -1- edge | 1.4 μs
IB -2- leaf | 1.8 μs
IB -3- spine | 2.3 μs

typical values for: | latency | bandwidth
---|---|---
L1 cache | 1–2 ns | 100 GB/s
L2/L3 cache | 3–10 ns | 50 GB/s
memory | 100 ns | 10 GB/s
HPC networks (per node / 2 HCAs) | 1–10 μs | 1–8 GB/s
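The latencies above can be reproduced with any MPI ping-pong microbenchmark. A sketch using the Intel MPI Benchmarks (IMB-MPI1) with the modules listed above — the benchmark binary's availability and pinning via `I_MPI_PIN_PROCESSOR_LIST` are assumptions for an Intel MPI setup:

```shell
module load intel/16.0.3 intel-mpi/5.1.3

# Intra-socket: both ranks on socket 0 (cores 0 and 1):
I_MPI_PIN_PROCESSOR_LIST=0,1 mpirun -np 2 ./IMB-MPI1 PingPong

# Inter-socket: one rank per socket (cores 0 and 8 on a 2x8-core node):
I_MPI_PIN_PROCESSOR_LIST=0,8 mpirun -np 2 ./IMB-MPI1 PingPong

# Inter-node: one rank on each of two nodes via SLURM:
srun -N 2 --ntasks-per-node=1 ./IMB-MPI1 PingPong
```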