Queue | Partition | QOS setup on VSC-4

This version (2024/05/28 07:48) was approved by fkocina. The previously approved version (2024/04/25 13:11) is available.

On VSC-4, nodes with the same type of hardware are grouped into partitions. The quality of service (QOS), formerly called queue, defines the maximum run time of a job and the number and type of allocatable nodes.

For submitting jobs to Slurm, three parameters are important:

#SBATCH --account=xxxxxx
#SBATCH --partition=skylake_xxxx
#SBATCH --qos=xxxxx_xxxx

Notes:

Core hours will be charged to the specified account.
Account, partition, and QOS have to fit together.
If the account is not given, the default account will be used.
If partition and QOS are not given, both default to skylake_0096.
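
Put together, the three directives sit at the top of a job script. The following is a minimal sketch, not an official template: the account xxxxxx is the placeholder used above and must be replaced with a real project account, and the job name and the reporting lines are only illustrative. The SLURM_JOB_* variables are set by Slurm inside a running job; the "unset" fallbacks apply when the script is run outside of one.

```shell
#!/bin/bash
# Illustrative job name, not prescribed by VSC.
#SBATCH --job-name=example
# Placeholder: replace xxxxxx with your project account.
#SBATCH --account=xxxxxx
# Default partition and the QOS that belongs to it.
#SBATCH --partition=skylake_0096
#SBATCH --qos=skylake_0096
#SBATCH --nodes=1

# Report which account, partition, and QOS the job actually runs under.
job_info="account=${SLURM_JOB_ACCOUNT:-unset} partition=${SLURM_JOB_PARTITION:-unset} qos=${SLURM_JOB_QOS:-unset}"
echo "$job_info"
```

Checking the reported values in the job output is a quick way to confirm that the defaults or the explicit settings were applied as intended.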

Nodes with the same type of hardware are grouped into partitions. There are three basic types of compute nodes, all with the same CPU but with different amounts of memory: 96 GB, 384 GB, and 768 GB.

These are the partitions on VSC-4:

Partition	Nodes	Architecture	CPU	Cores per CPU (physical/with HT)	GPU	RAM	Use
skylake_0096	702	Intel	2x Xeon Platinum 8174	24/48	No	96 GB	The default partition
skylake_0384	78	Intel	2x Xeon Platinum 8174	24/48	No	384 GB	High Memory partition
skylake_0768	12	Intel	2x Xeon Platinum 8174	24/48	No	768 GB	Higher Memory partition

Type sinfo -o %P on any node to see all the available partitions.
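
Since the partitions differ only in memory, choosing one comes down to the memory a job needs per node. The helper below is an illustrative sketch: pick_partition is a hypothetical name, and it treats the nominal RAM from the table as the limit, whereas the memory actually usable by a job is likely somewhat lower.

```shell
# pick_partition NEED_GB
# Print the smallest partition whose nominal RAM per node covers NEED_GB.
# Assumption: the nominal sizes 96/384/768 GB from the table are used as
# thresholds; real usable memory per node may be lower.
pick_partition() {
    need_gb=$1
    if [ "$need_gb" -le 96 ]; then
        echo skylake_0096
    elif [ "$need_gb" -le 384 ]; then
        echo skylake_0384
    elif [ "$need_gb" -le 768 ]; then
        echo skylake_0768
    else
        echo "no partition offers ${need_gb} GB per node" >&2
        return 1
    fi
}

pick_partition 200
```

Preferring the smallest partition that fits leaves the few high-memory nodes free for jobs that really need them.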

For completeness, the following special partitions are used internally and cannot be selected manually:

Partition	Description
login4	login nodes, not an actual slurm partition
rackws4	GUI login nodes, not an actual slurm partition
_jupyter	reserved for the jupyterhub

The QOS defines the maximum run time of a job and the number and type of allocatable nodes.

The QOSs that are assigned to a specific user can be viewed with:

sacctmgr show user `id -u` withassoc format=user,defaultaccount,account,qos%40s,defaultqos%20s

All usable QOSs are also shown right after login.
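
Because account, partition, and QOS have to fit together, it can help to check in a script whether a given QOS is among those assigned to you. The sketch below assumes the qos column of the sacctmgr command above is a comma-separated list; has_qos is a hypothetical helper name, and the sample list is made up for illustration.

```shell
# has_qos LIST NAME
# Succeed if NAME is one entry of the comma-separated LIST.
# Wrapping both sides in commas prevents skylake_0096 from matching
# as a prefix of skylake_0096_devel.
has_qos() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Made-up sample; on the cluster, copy the list from the qos column.
my_qos="skylake_0096,skylake_0384,skylake_0096_devel"

if has_qos "$my_qos" skylake_0384; then
    echo "skylake_0384 is available"
fi
```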

The following QOSs are available for all normal (= non-private) projects:

QOS name	Gives access to Partition	Hard run time limits	Description
skylake_0096	skylake_0096	72h (3 days)	Default
skylake_0384	skylake_0384	72h (3 days)	High Memory
skylake_0768	skylake_0768	72h (3 days)	Higher Memory

If a project runs out of compute time, its jobs run in the idle QOS, with low job priority and a reduced maximum run time limit.

QOS name	Gives access to Partition	Hard run time limits	Description
idle_0096	skylake_0096	24h (1 day)	Projects out of compute time
idle_0384	skylake_0384	24h (1 day)	Projects out of compute time
idle_0768	skylake_0768	24h (1 day)	Projects out of compute time

The devel QOS gives fast feedback on whether a job runs as intended. While the job is running, connect to the node it runs on and check directly whether the threads/processes are doing what you expect. We recommend doing this before sending the job to one of the compute queues.

QOS name	Gives access to Partition	Hard run time limits
skylake_0096_devel	5 nodes on skylake_0096	10min
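
A job script can be tried in the devel QOS without editing it, because options given on the sbatch command line take precedence over #SBATCH lines in the script. The sketch below only prints the command rather than running it; devel_submit_cmd is a hypothetical helper, job.sh a hypothetical script name, and xxxxxx the account placeholder.

```shell
# devel_submit_cmd ACCOUNT SCRIPT
# Print an sbatch command that sends SCRIPT to the devel QOS with the
# full 10-minute limit (--time=10 means 10 minutes).
devel_submit_cmd() {
    echo "sbatch --account=$1 --partition=skylake_0096 --qos=skylake_0096_devel --time=10 $2"
}

devel_submit_cmd xxxxxx job.sh
```

Once the job is running, squeue -u $USER lists the node or nodes it was placed on, which is where you would connect to watch the processes, for example with top.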

Private projects come with different QOSs; nevertheless, partition, QOS, and account have to fit together.

QOS name	Gives access to Partition	Hard run time limits	Description
p….._0…	various	up to 240h (10 days)	private queues

For submitting jobs of a private project to Slurm, the same three parameters are important:

#SBATCH --account=pxxxxx 
#SBATCH --partition=skylake_xxxx
#SBATCH --qos=pxxxx_xxxx

The run time limits of the QOSs can also be queried with the command:

sacctmgr show qos  format=name%20s,priority,grpnodes,maxwall,description%40s

If you know how long your job usually runs, you can set the run time limit in Slurm:

#SBATCH --time=<time>

This value must not exceed the run time limit of the QOS used. With a shorter limit your job might start earlier, but once the specified time has elapsed, the job is killed.

Acceptable time formats include:

“minutes”
“minutes:seconds”
“hours:minutes:seconds”
“days-hours”
“days-hours:minutes”
“days-hours:minutes:seconds”
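
To see how these formats compare with a QOS limit, a time specification can be converted to seconds. The helpers below are an illustrative sketch (slurm_time_to_seconds and fits_limit are hypothetical names, not Slurm commands); they assume well-formed input in one of the six formats above and do no validation.

```shell
# slurm_time_to_seconds SPEC
# Convert a --time value in one of the six formats listed above to seconds.
slurm_time_to_seconds() {
    echo "$1" | awk '{
        days = 0; h = 0; m = 0; s = 0
        spec = $0
        if (index(spec, "-") > 0) {
            # days-hours[:minutes[:seconds]]
            split(spec, dh, "-")
            days = dh[1]
            n = split(dh[2], f, ":")
            h = f[1]
            if (n >= 2) m = f[2]
            if (n >= 3) s = f[3]
        } else {
            n = split(spec, f, ":")
            if (n == 1)      { m = f[1] }                      # minutes
            else if (n == 2) { m = f[1]; s = f[2] }            # minutes:seconds
            else             { h = f[1]; m = f[2]; s = f[3] }  # hours:minutes:seconds
        }
        total = ((days * 24 + h) * 60 + m) * 60 + s
        print total
    }'
}

# fits_limit REQUESTED LIMIT
# Succeed if REQUESTED does not exceed LIMIT (both in --time syntax).
fits_limit() {
    [ "$(slurm_time_to_seconds "$1")" -le "$(slurm_time_to_seconds "$2")" ]
}

slurm_time_to_seconds 1-12:30    # 1 day, 12 hours, 30 minutes: prints 131400
fits_limit 2-00 72:00:00 && echo "2-00 fits within 72 hours"
```

Note that a bare number means minutes, so --time=72 requests 72 minutes, not 72 hours.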
