====== Queue | Partition | QOS setup on VSC-4 ======
On VSC-4, nodes of the same hardware type are grouped into partitions. The quality of service (QOS), formerly called //Queue//, defines the maximum run time of a job and the number and type of allocatable nodes.
For submitting jobs to [[doku:slurm]], three parameters are important:
<code>
#SBATCH --account=xxxxxx
#SBATCH --partition=skylake_xxxx
#SBATCH --qos=xxxxx_xxxx
</code>
Notes:
* Core hours will be charged to the specified account.
  * Account, partition, and QOS have to fit together.
* If the account is not given, the default account will be used.
* If partition and QOS are not given, default values are ''skylake_0096'' for both.
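Putting these together, a minimal job script using the default partition and QOS could look like the following sketch. The account name ''p7xxxx'' and the executable name are placeholders; substitute your own project account and program:

```shell
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --account=p7xxxx          # placeholder: use your own project account
#SBATCH --partition=skylake_0096  # default partition (96 GB nodes)
#SBATCH --qos=skylake_0096        # default QOS, must match the partition
#SBATCH --nodes=1

# placeholder for your actual workload
./my_program
```

Submit it with ''sbatch jobscript.sh''; core hours are then charged to the account given above.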
===== Partitions =====
Nodes of the same type of hardware are grouped into partitions. There are three basic types of compute nodes, all with the same CPU, but with different amounts of memory: 96 GB, 384 GB, and 768 GB.
These are the partitions on VSC-4:
^ Partition ^ Nodes ^ Architecture ^ CPU ^ Cores per CPU (physical/with HT) ^ GPU ^ RAM ^ Use ^
|skylake_0096 | 702 | Intel | 2x Xeon Platinum 8174 | 24/48 | No | 96 GB | The default partition |
|skylake_0384 | 78 | Intel | 2x Xeon Platinum 8174 | 24/48 | No | 384 GB | High Memory partition |
|skylake_0768 | 12 | Intel | 2x Xeon Platinum 8174 | 24/48 | No | 768 GB | Higher Memory partition |
Type ''sinfo -o %P'' on any node to see all the available partitions.
For the sake of completeness, there are also internally used //special// partitions, which cannot be selected manually:
^ Partition ^ Description ^
| login4 | login nodes, not an actual slurm partition |
| rackws4 | GUI login nodes, not an actual slurm partition |
| _jupyter | reserved for the jupyterhub |
===== Quality of service (QOS) =====
The QOS defines the maximum run time of a job and the number and type of allocatable nodes.
The QOSs that are assigned to a specific user can be viewed with:
<code>
sacctmgr show user `id -u` withassoc format=user,defaultaccount,account,qos%40s,defaultqos%20s
</code>
All usable QOS are also shown right after login.
==== QOS, Partitions and Run time limits ====
The following QoS are available for all normal (=non private) projects:
^ QOS name ^ Gives access to Partition ^ Hard run time limits ^ Description ^
| skylake_0096 | skylake_0096 | 72h (3 days) | Default |
| skylake_0384 | skylake_0384 | 72h (3 days) | High Memory |
| skylake_0768 | skylake_0768 | 72h (3 days) | Higher Memory |
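As noted above, partition and QOS have to fit together. For example, to run on the high-memory nodes, both parameters must be set to the matching values:

```shell
# request a 384 GB node: partition and QOS must match
#SBATCH --partition=skylake_0384
#SBATCH --qos=skylake_0384
```

Mixing, e.g., ''--partition=skylake_0384'' with ''--qos=skylake_0096'' will not work.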
==== Idle QOS ====
If a project runs out of compute time, its jobs run with low priority and a reduced maximum run time limit in the //idle// QOS.
^ QOS name ^ Gives access to Partition ^ Hard run time limits ^ Description ^
| idle_0096 | skylake_0096 | 24h (1 day) | Projects out of compute time |
| idle_0384 | skylake_0384 | 24h (1 day) | Projects out of compute time |
| idle_0768 | skylake_0768 | 24h (1 day) | Projects out of compute time |
==== Devel QOS ====
The //devel// QOS gives fast feedback when a job is running. Connect to the node where the job is running and use [[doku:monitoring|monitoring]] tools to check whether the threads/processes are doing what you expect. We recommend this before submitting the job to one of the ''compute'' queues.
^ QOS name ^ Gives access to Partition ^ Hard run time limits ^
| skylake_0096_devel | 5 nodes on skylake_0096 | 10min |
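A short test job on the devel QOS could be submitted like this sketch (the account name ''p7xxxx'' is a placeholder; note the 10-minute hard limit):

```shell
#!/bin/bash
#SBATCH --account=p7xxxx              # placeholder: use your own project account
#SBATCH --partition=skylake_0096      # devel QOS runs on the default partition
#SBATCH --qos=skylake_0096_devel      # at most 5 nodes, 10 min run time
#SBATCH --time=00:10:00
#SBATCH --nodes=1

# placeholder for a quick test of your workload
./my_program --test
```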
==== Private Projects ====
Private projects come with their own QOS; nevertheless, partition, QOS, and account have to fit together.
^ QOS name ^ Gives access to Partition ^ Hard run time limits ^ Description ^
| p....._0... | various | up to 240h (10 days) | private queues |
For submitting jobs to [[doku:slurm]], three parameters are important:
<code>
#SBATCH --account=pxxxxx
#SBATCH --partition=skylake_xxxx
#SBATCH --qos=pxxxx_xxxx
</code>
==== Run time ====
The QOS's run time limits can also be requested via the command
<code>
sacctmgr show qos format=name%20s,priority,grpnodes,maxwall,description%40s
</code>
If you know how long your job usually runs, you can set the run time limit in SLURM:
<code>
#SBATCH --time=
</code>
Of course, this has to be //below// the QOS's run time limit. Your job might then start earlier, which is nice; but once the specified time has elapsed, the job is killed!
Acceptable time formats include:
* "minutes"
* "minutes:seconds"
* "hours:minutes:seconds"
* "days-hours"
* "days-hours:minutes"
* "days-hours:minutes:seconds".
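For example, a limit of one and a half days can be written in either of these equivalent forms:

```shell
# 1 day and 12 hours, "days-hours" format
#SBATCH --time=1-12

# the same limit in "hours:minutes:seconds" format
#SBATCH --time=36:00:00
```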