On VSC-5, nodes with the same type of hardware are grouped into partitions. The quality of service (QOS), formerly called queue, defines the maximum run time of a job and the number and type of allocatable nodes.
For submitting jobs to slurm, three parameters are important:
```
#SBATCH --account=xxxxxx
#SBATCH --partition=xxxxx_xxxx
#SBATCH --qos=xxxxx_xxxx
```
Notes: If no partition and no QOS are given, the default `zen3_0512` is used for both.

Nodes of the same type of hardware are grouped into partitions; there are three basic types: standard CPU, high-memory, and GPU nodes.
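As a minimal sketch, a complete job script for the default partition could look like the following; the account name `p12345`, the job name, and the program to run are placeholders you must replace with your own:

```shell
#!/bin/bash
#SBATCH --job-name=example      # placeholder job name
#SBATCH --account=p12345        # placeholder: use your own project account
#SBATCH --partition=zen3_0512   # default CPU partition
#SBATCH --qos=zen3_0512         # QOS matching the partition
#SBATCH --nodes=1
#SBATCH --time=01:00:00         # request 1 hour of run time

# load your environment and start the program
./my_program
```

Submit it with `sbatch jobscript.sh`; account, partition, and QOS must fit together, otherwise the job is rejected.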
These are the partitions on VSC-5:
Partition | Nodes | Architecture | CPU | Cores per CPU (physical/with HT) | GPU | RAM | Use |
---|---|---|---|---|---|---|---|
zen3_0512* | 564 | AMD | 2x AMD 7713 | 64/128 | No | 512 GB | The default partition |
zen3_1024 | 120 | AMD | 2x AMD 7713 | 64/128 | No | 1 TB | High Memory partition |
zen3_2048 | 20 | AMD | 2x AMD 7713 | 64/128 | No | 2 TB | Higher Memory partition |
cascadelake_0384 | 48 | Intel | 2x Intel Cascadelake | 48/96 | No | 384 GB | Directly use programs compiled for VSC-4 |
zen2_0256_a40x2 | 45 | AMD | 2x AMD 7252 | 8/16 | 2x NVIDIA A40 | 256 GB | Best for single precision GPU code |
zen3_0512_a100x2 | 60 | AMD | 2x AMD 7713 | 64/128 | 2x NVIDIA A100 | 512 GB | Best for double precision GPU code |
Type `sinfo -o %P` on any node to see all available partitions.
For the sake of completeness: there are also internally used special partitions that cannot be selected manually:
Partition | Description |
---|---|
login5 | login nodes, not an actual slurm partition |
rackws5 | GUI login nodes, not an actual slurm partition |
_jupyter | variations of zen3, a40 and a100 nodes reserved for the jupyterhub |
The QOS defines the maximum run time of a job and the number and type of allocatable nodes.
The QOSs that are assigned to a specific user can be viewed with:
```
sacctmgr show user `id -u` withassoc format=user,defaultaccount,account,qos%40s,defaultqos%20s
```
All usable QOS are also shown right after login.
The following QOS are available for all normal (i.e. non-private) projects:
QOS name | Gives access to Partition | Hard run time limits | Description |
---|---|---|---|
zen3_0512 | zen3_0512 | 72h (3 days) | Default |
zen3_1024 | zen3_1024 | 72h (3 days) | High Memory |
zen3_2048 | zen3_2048 | 72h (3 days) | Higher Memory |
cascadelake_0384 | cascadelake_0384 | 72h (3 days) | |
zen2_0256_a40x2 | zen2_0256_a40x2 | 72h (3 days) | GPU Nodes |
zen3_0512_a100x2 | zen3_0512_a100x2 | 72h (3 days) | GPU Nodes |
zen3_0512_devel | 5 nodes on zen3_0512 | 10min | Fast Feedback |
If a project runs out of compute time, its jobs run with low priority and a reduced maximum run time limit in the idle QOS.
There is no idle QOS on the cascadelake_0384 partition or on the zen2_0256_a40x2 / zen3_0512_a100x2 GPU nodes.
QOS name | Gives access to Partition | Hard run time limits | Description |
---|---|---|---|
idle_0512 | zen3_0512 | 24h (1 day) | Projects out of compute time |
idle_1024 | zen3_1024 | 24h (1 day) | Projects out of compute time |
idle_2048 | zen3_2048 | 24h (1 day) | Projects out of compute time |
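For illustration, a project that has run out of compute time could still submit to the default partition via the idle QOS; this is a sketch, and the account name is a placeholder:

```shell
#SBATCH --account=p12345        # placeholder: your project account
#SBATCH --partition=zen3_0512   # partition matching the idle QOS
#SBATCH --qos=idle_0512         # low priority, max 24h run time
```

Expect longer queue waits, since idle jobs are scheduled with low priority.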
The devel QOS gives fast feedback when your job is running. Connect to the node where the job is actually running to check directly whether the threads/processes are doing what you expect. We recommend doing this before sending the job to one of the compute queues.
QOS name | Gives access to Partition | Hard run time limits |
---|---|---|
zen3_0512_devel | 5 nodes on zen3_0512 | 10min |
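A possible workflow with the devel QOS might look like the following sketch; the job script name and the node name are placeholders:

```shell
# submit a short test job to the devel QOS (job.sh is a placeholder)
sbatch --partition=zen3_0512 --qos=zen3_0512_devel --time=00:10:00 job.sh

# find out which node your job is running on
squeue --me

# connect to that node (replace NODENAME with the node shown by squeue)
ssh NODENAME

# watch your processes/threads directly on the node
top
```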
Private projects come with their own QOS; nevertheless, partition, QOS, and account have to fit together.
QOS name | Gives access to Partition | Hard run time limits | Description |
---|---|---|---|
p….._0… | various | up to 240h (10 days) | private queues |
For submitting jobs to slurm, three parameters are important:
```
#SBATCH --account=pxxxxx
#SBATCH --partition=zen3_xxxx
#SBATCH --qos=pxxxx_xxxx
```
The QOS's run time limits can also be requested via the command
```
sacctmgr show qos format=name%20s,priority,grpnodes,maxwall,description%40s
```
If you know how long your job usually runs, you can set the run time limit in SLURM:
```
#SBATCH --time=<time>
```
Of course this has to be below the QOS's run time limit. Your job might then start earlier, which is nice; but once the specified time has elapsed, the job is killed!
Acceptable time formats include: