This version (2020/10/20 09:13) is a draft.
VSC – supercomputers
- Article written by Claudia Blaas-Schenner (VSC Team) (last update 2020-10-10 by cb).
OUTLINE:
- VSC – Vienna Scientific Cluster – introducing VSC to our (new) users
- Supercomputers for beginners – what is a supercomputer ?
- VSC systems – what do they look like ?
- VSC-4 – components of a supercomputer
- Parallel hardware architectures – which parallel programming models can be used ?
- VSC compute nodes
- VSC node-interconnect
- VSC-3 ping-pong – intra-node vs. inter-node
VSC – Vienna Scientific Cluster
- The VSC is a joint high performance computing (HPC) facility of Austrian universities.
- Our mission: Within the limits of available resources, we satisfy the HPC needs of our users.
- VSC is primarily devoted to research.
- Who can use VSC? Scientific personnel of the partner universities, see: https://vsc.ac.at/access — VSC is also open to users from other Austrian academic and research institutions.
- Projects (test, funded, …): Access to VSC is granted on the basis of peer-reviewed projects.
- Project manager (= usually your supervisor): submits the project application, requests extensions, creates user accounts, …
VSC links: | Information provided:
---|---
https://vsc.ac.at | VSC homepage (general info)
https://service.vsc.ac.at | VSC service website (application)
https://wiki.vsc.ac.at | VSC user documentation
| VSC user support & contact
- VSC Training Courses: https://vsc.ac.at/training
- VSC course slides: VSC-Linux, VSC-Intro
Supercomputers for beginners
- What is a supercomputer ?
- A supercomputer is a computer with a high level of performance as compared to a general-purpose computer. The performance of a supercomputer is commonly measured in floating-point operations per second (FLOPS)… [from Wikipedia]
- A supercomputer is typically listed in the TOP500 ranking:
system | performance | TOP500 rank | GREEN500 rank | performance of #1 in TOP500
---|---|---|---|---
VSC-1 (2009) | 35 TFlop/s | 156 (11/2009) | 94 (06/2009) | 1.8 PFlop/s #1 (11/2009)
VSC-2 (2011) | 135 TFlop/s | 56 (06/2011) | 71 (06/2011) | 8 PFlop/s #1 (06/2011)
VSC-3 (2014) | 596 TFlop/s | 85 (11/2014) | 86 (11/2014) | 33 PFlop/s #1 (11/2014)
VSC-3 (………) | 596 TFlop/s | 461 (11/2017) | 175 (11/2017) | 93 PFlop/s #1 (11/2017)
VSC-4 (2019) | 2.7 PFlop/s | 82 (06/2019) | ——— | 148 PFlop/s #1 (06/2019)
VSC-4 (………) | 2.7 PFlop/s | 105 (06/2020) | ——— | 415 PFlop/s #1 (06/2020)
VSC systems – what do they look like ?
VSC-4 – components of a supercomputer
- login nodes vs. compute nodes
- shared (login nodes, storage) vs. user-exclusive (compute nodes, allocated with -N | on VSC-4 optionally shared nodes, requested with -n)
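A minimal SLURM job script can illustrate the two allocation styles; this is only a sketch — the job name, time limit, and program path are placeholders, and 48 cores per VSC-4 node is an assumption:

```shell
#!/bin/bash
#SBATCH --job-name=example        # placeholder job name
#SBATCH -N 2                      # two full compute nodes, user-exclusive
#SBATCH --ntasks-per-node=48     # assuming 48 cores per VSC-4 node
#SBATCH --time=00:10:00

# Alternative on VSC-4: request individual cores instead of whole
# nodes, which may place the job on a shared node:
#   #SBATCH -n 4

srun ./my_mpi_program             # placeholder executable
```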
Parallel hardware architectures
VSC compute nodes
- VSC-3, VSC-3+, and VSC-4 ➠ Intel CPUs ➠ different: types, memory, # cores, # HCAs
- plus special types of hardware (GPUs on VSC-3) ➠ see: talk on special hardware and talk on SLURM
- VSC-3: 1 node = 2 sockets (CPUs), 8 cores per socket (P), 2 threads per core (T1/T2) + 2 HCAs
- intra-socket: 59.7 GB/s (max), inter-socket via QPI (QuickPath Interconnect): 32 GB/s (max)
- inter-node via dual-rail Intel QDR-80: 4 GB/s (max) / 3.4 GB/s (eff) per HCA (host channel adapter)
- Avoiding slow data paths is the key to most performance optimizations! ➠ Affinity matters!
- pinning processes/threads to processing units (PU#) ➠ see: article on SLURM and pinning @ Wiki
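How pinning might be requested under SLURM, as a sketch — the option spelling (`--cpu_bind` vs. the newer `--cpu-bind`) depends on the SLURM version, and the executable name is a placeholder:

```shell
# Pin one MPI rank to each physical core of a single node:
srun -N 1 --ntasks-per-node=16 --cpu_bind=cores ./my_mpi_program

# Inspect the placement actually granted inside a job:
srun -N 1 --ntasks-per-node=2 --cpu_bind=cores \
     bash -c 'grep Cpus_allowed_list /proc/self/status'
```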
memory hierarchy (mem_0064 nodes):
- L1 data cache: 32 kB, private to each core
- L2 cache: 256 kB, private to each core (unified)
- L3 cache: 20 MB, shared by all cores of one socket
- memory: 32 GB per socket
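The cache sizes and NUMA layout above can be checked on a compute node with standard Linux tools (run inside a job; availability of `numactl` on the nodes is an assumption):

```shell
# Core/socket/thread topology and cache sizes as seen by the OS:
lscpu

# NUMA layout, e.g. 2 NUMA nodes with 32 GB each on a mem_0064 node:
numactl --hardware

# Cache sizes of core 0 as reported by the kernel:
cat /sys/devices/system/cpu/cpu0/cache/index*/size
```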
VSC node-interconnect schematic
- VSC-3 ➠ dual-rail Intel QDR-80 ➠ 3-level fat-tree (BF = 2:1 / 4:1)
- VSC-4 ➠ single-rail Intel Omni-Path ➠ 2-level fat-tree (BF = 2:1)
VSC-3 ping-pong – intra-node vs. inter-node
- 1 node = 2 sockets with 8 cores per socket + 2 HCAs
- inter-node = IB fabric = dual-rail Intel QDR-80 = 3-level fat-tree (BF: 2:1 / 4:1)
- ping-pong benchmark: module load intel/16.0.3 intel-mpi/5.1.3 | openmpi/1.10.2 (1 HCA)
MPI latency & bandwidth (plus typical values for comparison):

VSC-3: | latency
---|---
intra-socket | 0.3 μs
inter-socket | 0.7 μs
IB -1- edge | 1.4 μs
IB -2- leaf | 1.8 μs
IB -3- spine | 2.3 μs

typical values for: | latency | bandwidth
---|---|---
L1 cache | 1–2 ns | 100 GB/s
L2/L3 cache | 3–10 ns | 50 GB/s
memory | 100 ns | 10 GB/s
HPC networks (per node / 2 HCAs) | 1–10 μs | 1–8 GB/s
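The latencies above can be reproduced with any MPI ping-pong microbenchmark. A sketch using the Intel MPI Benchmarks (IMB-MPI1) with the modules listed above — the benchmark binary's availability and pinning via `I_MPI_PIN_PROCESSOR_LIST` are assumptions for an Intel MPI setup:

```shell
module load intel/16.0.3 intel-mpi/5.1.3

# Intra-socket: both ranks on socket 0 (cores 0 and 1):
I_MPI_PIN_PROCESSOR_LIST=0,1 mpirun -np 2 ./IMB-MPI1 PingPong

# Inter-socket: one rank per socket (cores 0 and 8 on a 2x8-core node):
I_MPI_PIN_PROCESSOR_LIST=0,8 mpirun -np 2 ./IMB-MPI1 PingPong

# Inter-node: one rank on each of two nodes via SLURM:
srun -N 2 --ntasks-per-node=1 ./IMB-MPI1 PingPong
```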