====== Quick start guide for VSC-5 ======

===== Connecting =====

<code>
ssh <username>@vsc5.vsc.ac.at
</code>

or alternatively use a specific login node:

<code>
ssh <username>@l5[0-6].vsc.ac.at
</code>

===== Data storage =====

On VSC-5 the same directories exist as on VSC-4, since both clusters use the same IBM Spectrum Scale (GPFS) storage. Consequently, you have access to your data in ''$HOME'' and ''$DATA'' just as on VSC-4. There is, however, no BINF filesystem available.

===== Software installations =====

==== New SPACK without environments ====

Having worked with spack environments for some time, we encountered several severe issues which convinced us that we need a more practical way of maintaining software packages at VSC. There are now three separate spack installation trees corresponding to the CPU/GPU architectures on VSC:

  * skylake - Intel CPUs; works on Intel Skylake and Cascadelake CPUs
  * zen - AMD CPUs; works on Zen 2 and Zen 3 CPUs
  * cuda-zen - AMD CPUs + NVIDIA GPUs; works on all nodes equipped with graphics cards

By default the spack installation tree suitable for the current compute/login node is activated and is indicated by a **prefix** on the command line, e.g.:

<code>
zen [user@l51 ~]$
</code>

Read more about SPACK at:

  * [[doku:spack-transition | Transition to new SPACK without Environments]]
  * [[doku:spack]]
  * [[https://spack.readthedocs.io/en/latest/basic_usage.html|Official documentation of SPACK]]

==== Load a module ====

Most software is installed via SPACK, so you can use spack commands like ''spack find -ld xyz'' to get details about an installation. All these installations also provide a module: find available modules with ''module avail xyz'' and load them with ''module load xyz''. See [[doku:spack|SPACK - a package manager for HPC systems]] for more information.

Some software is still installed by hand; find the available [[doku:modules]] with ''module avail xyz'' and load them with ''module load xyz''.

===== Compile code =====

A program needs to be compiled on the same type of hardware it will later run on. If you have programs compiled for VSC-4, they will run on the ''cascadelake_0384'' partition, but not on the default partition, which uses AMD processors!

==== AMD: Zen3 ====

Most nodes of VSC-5 are based on AMD processors, including the login nodes. The matching spack installation tree (''zen'') is activated automatically, so you can use ''spack load <package>'' to load what you need, compile your program, and submit a job via slurm; a short compile sketch follows below.
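As a quick illustration, here is a minimal sketch of building a small program on one of the AMD (zen) nodes. The package name, compiler version and flags are examples only, not the exact installation on VSC-5; check ''spack find gcc'' for what is actually available:

<code>
# load a compiler from the active spack tree
# (run "spack find gcc" first and pick a concrete version if several exist)
spack load gcc

# compile with optimizations for the AMD Zen3 (Milan) cores
gcc -O3 -march=znver3 -o my_program my_program.c
</code>

A binary built with ''-march=znver3'' is tuned for the AMD nodes; for the Intel nodes, compile on a ''cascadelake_0384'' (or VSC-4) node instead, as described in the next section.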
The nodes have 2x AMD EPYC CPUs (Milan architecture), each equipped with 64 cores. In total, 128 physical cores (core-id 0-127) and 256 virtual cores are available per node.

The A100 GPU nodes have 512GB RAM and their two NVIDIA A100 cards have 40GB RAM each; 60 A100 nodes are installed. The A40 GPU nodes have 256GB RAM and their two NVIDIA A40 cards have 46GB RAM each; 45 A40 nodes are installed.

<code>
$ nvidia-smi
Tue Apr 26 15:42:00 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.39.01    Driver Version: 510.39.01    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:01:00.0 Off |                  Off |
| N/A   40C    P0    35W / 250W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  Off  | 00000000:81:00.0 Off |                  Off |
| N/A   37C    P0    37W / 250W |      0MiB / 40960MiB |     40%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
</code>

==== Intel: Cascadelake ====

If you have programs already compiled for VSC-4, they will also run on the ''cascadelake_0384'' partition of VSC-5. Otherwise, log in on a ''cascadelake_0384'' node (or any VSC-4 node), compile your program there, and then submit a job to the slurm partition ''cascadelake_0384''.

There are 48 nodes with 2x Intel Cascadelake CPUs, 48 cores each. In total, 96 physical cores (core-id 0-95) and 192 virtual cores are available per node. Each node has 384GB RAM.

===== SLURM =====

For the partition/queue setup see [[doku:vsc5_queue|Queue | Partition setup on VSC-5]]. Type ''sinfo -o %P'' to see the available partitions.

==== Submit a Job ====

Submit a job in ''slurm'' using a job script like this (minimal) example:

<code>
#!/bin/bash
#SBATCH -J <job_name>
#SBATCH -N 1

./my_program
</code>

This will submit a job in the default partition (zen3_0512) using the default QoS (zen3_0512).

To submit a job to the cascadelake nodes:

<code>
#!/bin/sh
#SBATCH -J <job_name>
#SBATCH -N 1
#SBATCH --partition=cascadelake_0384
#SBATCH --qos cascadelake_0384

./my_program
</code>

Job scripts for the AMD CPU nodes:

<code>
#!/bin/sh
#SBATCH -J <job_name>
#SBATCH -N 1
#SBATCH --partition=zen3_0512
#SBATCH --qos zen3_0512

./my_program
</code>

<code>
#!/bin/sh
#SBATCH -J <job_name>
#SBATCH -N 1
#SBATCH --partition=zen3_1024
#SBATCH --qos zen3_1024

./my_program
</code>

<code>
#!/bin/sh
#SBATCH -J <job_name>
#SBATCH -N 1
#SBATCH --partition=zen3_2048
#SBATCH --qos zen3_2048

./my_program
</code>

Example job script to use both GPUs on a GPU node:

<code>
#!/bin/sh
#SBATCH -J <job_name>
#SBATCH -N 1
#SBATCH --partition=zen3_0512_a100x2
#SBATCH --qos zen3_0512_a100x2
#SBATCH --gres=gpu:2

./my_program
</code>

Example job script to use only one GPU on a GPU node:

<code>
#!/bin/sh
#SBATCH -J <job_name>
#SBATCH --partition=zen3_0512_a100x2
#SBATCH --qos zen3_0512_a100x2
#SBATCH --gres=gpu:1

./my_program
</code>

Your job will then be constrained to one GPU and will not interfere with a second job on the node. It is not possible to access the other GPU card, which is not assigned to your job.

More at [[doku:slurm|Submitting batch jobs (SLURM)]], but bear in mind that the partitions differ from those on VSC-4!

Official Slurm documentation: https://slurm.schedmd.com

===== Intel MPI =====

When **using Intel-MPI on the AMD nodes and mpirun**, please set the following environment variable in your job script to allow for correct process pinning:

<code>
export I_MPI_PIN_RESPECT_CPUSET=0
</code>
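As an illustration, a complete job script for an Intel MPI run on the AMD nodes could look like the following sketch; the node count, rank count, program name and the way Intel MPI is loaded are placeholders, adapt them to your installation:

<code>
#!/bin/bash
#SBATCH -J <job_name>
#SBATCH -N 2
#SBATCH --partition=zen3_0512
#SBATCH --qos zen3_0512

# load the Intel MPI installation you compiled with
# (name is an example, check "module avail" / "spack find")
module load intel-oneapi-mpi

# let Intel MPI pin processes correctly on the AMD nodes
export I_MPI_PIN_RESPECT_CPUSET=0

# 2 nodes x 128 physical cores = 256 MPI ranks
mpirun -np 256 ./my_program
</code>

Submit it with ''sbatch'' as usual.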