Queue | Partition setup on VSC-5

On VSC-5, users can select both the type of hardware and the quality of service (QOS) under which their jobs run. Nodes with the same type of hardware are grouped into partitions; the QOS defines the maximum run time of a job and the number and type of allocatable nodes.

There are three basic types of hardware that differ in architecture:

  • Intel CPU nodes: there is only one variant with Cascadelake CPUs and 368GB RAM.
  • AMD CPU nodes: they all have Zen3 CPUs, but come in three memory versions: 512GB, 1TB and 2TB RAM.
  • GPU nodes: there are two versions, one with Zen2 CPUs, 256GB RAM and 2x nVidia A40 GPUs, and one with Zen3 CPUs, 512GB RAM and 2x nVidia A100 GPUs.

On VSC-5, the hardware is grouped into so-called partitions:

partition name | description
cascadelake_0384 | Intel CPU nodes
zen3_0512 | default, AMD CPU nodes with 512GB of memory
zen3_1024 | AMD CPU nodes with 1TB of memory
zen3_2048 | AMD CPU nodes with 2TB of memory
zen2_0256_a40x2 | GPU nodes with 2x nVidia A40
zen3_0512_a100x2 | GPU nodes with 2x nVidia A100
jupyter | reserved for the JupyterHub

Access to node partitions is granted by the so-called quality of service (QOS). The QOSs constrain the number of allocatable nodes and limit the job wall time. The naming scheme of the QOSs is: <project_type>_<memoryConfig>

The QOSs that are assigned to a specific user can be viewed with:

sacctmgr show user `id -u` withassoc format=user,defaultaccount,account,qos%40s,defaultqos%20s

The default QOS and all QOSs usable are also shown right after login.

In general, QOSs are distinguished by whether they are defined on the generally available compute nodes or on private nodes. A further distinction is whether a project still has computing time available or has already consumed it. In the latter case, the project's jobs run with low priority and a reduced maximum run time limit in the idle queue.

The devel queue gives the user fast feedback on whether a job runs as intended. It is possible to connect to the node where the job is actually running and to monitor it directly, e.g., to check whether the threads/processes are doing what is expected. This is advisable before sending the job to one of the 'computing' queues.
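
A short test job for the devel queue might look like the following sketch. The partition, QOS and 10-minute limit are taken from this page; the job name, the account placeholder p7xxxx and the commented-out program call are assumptions for illustration only.

```shell
#!/bin/bash
# Sketch of a short test job for the devel QOS (hard limit: 10 minutes).
# The job name is hypothetical; p7xxxx is a placeholder for your project account.
#SBATCH --job-name=devel_test
#SBATCH --nodes=1
#SBATCH --partition=zen3_0512
#SBATCH --qos=zen3_0512_devel
#SBATCH --account=p7xxxx
#SBATCH --time=10

# Record the node name so the job can be located while it runs.
node=$(hostname)
echo "job running on ${node}"

# Start the actual program here, e.g. (placeholder):
# ./my_program
```

While the job is running, `squeue -u $USER` lists the allocated node in the NODELIST column. Connecting to that node, for example with ssh, is assumed here to be the way to monitor the processes as described above.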

The QOSs' hard run time limits:

QOS | hard run time limit
zen3_0512 / zen3_1024 / zen3_2048 / cascadelake_0384 / zen2_0256_a40x2 / zen3_0512_a100x2 | 72h (3 days)
idle_0512 / idle_1024 / idle_2048 (there is no idle on cascadelake or GPUs) | 24h (1 day)
private queues p….._0… | up to 240h (10 days)
zen3_0512_devel (up to 5 nodes available) | 10min

The QOSs' run time limits can also be queried with the command

sacctmgr show qos  format=name%20s,priority,grpnodes,maxwall,description%40s

SLURM allows setting a run time limit below the default QOS's run time limit. After the specified time has elapsed, the job is killed:

#SBATCH --time=<time> 

Acceptable time formats include “minutes”, “minutes:seconds”, “hours:minutes:seconds”, “days-hours”, “days-hours:minutes” and “days-hours:minutes:seconds”.
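
To compare a requested time with a QOS limit, it can help to convert the formats listed above into seconds. The following is a minimal sketch; the function name slurm_time_to_seconds is hypothetical and not part of SLURM, and the input is assumed to be well-formed.

```shell
#!/bin/sh
# Hypothetical helper (not part of SLURM): convert a --time string to seconds.
# Supported formats: minutes, minutes:seconds, hours:minutes:seconds,
# days-hours, days-hours:minutes, days-hours:minutes:seconds.
slurm_time_to_seconds() {
    printf '%s\n' "$1" | awk '{
        days = 0; rest = $0
        i = index($0, "-")
        if (i > 0) {                      # a leading "days-" part is present
            days = substr($0, 1, i - 1)
            rest = substr($0, i + 1)
        }
        n = split(rest, f, ":")
        if (i > 0 || n == 3) {            # hours[:minutes[:seconds]]
            h = f[1]; m = f[2]; s = f[3]
        } else if (n == 2) {              # minutes:seconds
            h = 0; m = f[1]; s = f[2]
        } else {                          # minutes only
            h = 0; m = f[1]; s = 0
        }
        print days * 86400 + h * 3600 + m * 60 + s
    }'
}

# Example: both spellings of the 72h limit yield the same number of seconds.
slurm_time_to_seconds 72:00:00
slurm_time_to_seconds 3-00:00:00
```

Note that a bare number is interpreted as minutes, so `--time=10` requests ten minutes, not ten hours.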

For submitting jobs, three parameters are important:

#SBATCH --partition=xxxxx_xxxx
#SBATCH --qos=xxxxx_xxxx
#SBATCH --account=xxxxxx

The core hours will be charged to the specified account. If not specified, the default account (sacctmgr show user `id -u` withassoc format=defaultaccount) will be used.

ordinary projects

For ordinary projects the QOSs are:

QOS name | gives access to partition | description
zen3_0512 | zen3_0512 | default
zen3_1024 | zen3_1024 |
zen3_2048 | zen3_2048 |
cascadelake_0384 | cascadelake_0384 |
zen2_0256_a40x2 | zen2_0256_a40x2 |
zen3_0512_a100x2 | zen3_0512_a100x2 |
zen3_0512_devel | 5 nodes on zen3_0512 |

examples
#SBATCH --partition=zen3_0512
#SBATCH --qos=zen3_0512   
#SBATCH --account=p7xxxx   
  • Note that partition, qos, and account have to fit together.
  • If the account is not given, the default account (sacctmgr show user `id -u` withassoc format=defaultaccount) will be used.
  • If partition and qos are not given, default values are zen3_0512 for both.
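
The pairing of QOS and partition for ordinary projects can be written down as a small lookup, for example to sanity-check a job script before submission. This is only a sketch based on the table for ordinary projects; the function name partition_for_qos is hypothetical, and private or idle QOSs are deliberately not covered.

```shell
#!/bin/sh
# Hypothetical helper: print the partition that a QOS for ordinary projects
# gives access to, following the table for ordinary projects on this page.
partition_for_qos() {
    case $1 in
        zen3_0512|zen3_1024|zen3_2048|cascadelake_0384|zen2_0256_a40x2|zen3_0512_a100x2)
            # QOS and partition share the same name
            echo "$1" ;;
        zen3_0512_devel)
            # the devel QOS runs on nodes of the zen3_0512 partition
            echo zen3_0512 ;;
        *)
            # private and idle QOSs are not covered by this sketch
            return 1 ;;
    esac
}

# Example: look up the partition for the devel QOS.
partition_for_qos zen3_0512_devel
```

A non-zero return status signals that the QOS is not one of the ordinary-project QOSs listed above.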

private nodes projects

example
#SBATCH --partition=zen3_0512
#SBATCH --qos=p7xxx_xxxx
#SBATCH --account=p7xxxx 
doku/vsc5_queue.txt · Last modified: 2023/01/05 16:56 by goldenberg