====== Queue | Partition | QOS setup on VSC-4 ======

On VSC-4, nodes of the same type of hardware are grouped into partitions. The quality of service (QOS), formerly called //Queue//, defines the maximum run time of a job and the number and type of allocatable nodes. For submitting jobs to [[doku:slurm]], three parameters are important:

  #SBATCH --account=xxxxxx
  #SBATCH --partition=skylake_xxxx
  #SBATCH --qos=xxxxx_xxxx

Notes:
  * Core hours will be charged to the specified account.
  * Account, partition, and QOS have to fit together.
  * If no account is given, the default account is used.
  * If partition and QOS are not given, the default ''skylake_0096'' is used for both.

===== Partitions =====

There are three basic types of compute nodes, all with the same CPU but with different amounts of memory: 96 GB, 384 GB, and 768 GB. These are the partitions on VSC-4:

^ Partition ^ Nodes ^ Architecture ^ CPU ^ Cores per CPU (physical/with HT) ^ GPU ^ RAM ^ Use ^
| skylake_0096 | 702 | Intel | 2x Xeon Platinum 8174 | 24/48 | No | 96 GB | The default partition |
| skylake_0384 | 78 | Intel | 2x Xeon Platinum 8174 | 24/48 | No | 384 GB | High memory partition |
| skylake_0768 | 12 | Intel | 2x Xeon Platinum 8174 | 24/48 | No | 768 GB | Higher memory partition |

Type ''sinfo -o %P'' on any node to list all available partitions.

For the sake of completeness, there are internally used //special// partitions that cannot be selected manually:

^ Partition ^ Description ^
| login4 | login nodes, not an actual slurm partition |
| rackws4 | GUI login nodes, not an actual slurm partition |
| _jupyter | reserved for the jupyterhub |

===== Quality of service (QOS) =====

The QOS defines the maximum run time of a job and the number and type of allocatable nodes.
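Putting the three parameters together, a complete job script might look like the following sketch; the account ''p70999'', the job name, and the payload are placeholders, not real values:

```shell
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --account=p70999          # placeholder: use one of your own accounts
#SBATCH --partition=skylake_0384
#SBATCH --qos=skylake_0384        # account, partition, and QOS must fit together

echo "Job $SLURM_JOB_ID running on $(hostname)"
```

Submit it with ''sbatch job.sh''; with both partition and QOS omitted, the job would instead run in the default ''skylake_0096''.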
The QOS assigned to a specific user can be viewed with:

  sacctmgr show user `id -u` withassoc format=user,defaultaccount,account,qos%40s,defaultqos%20s

All usable QOS are also shown right after login.

==== QOS, Partitions and Run time limits ====

The following QOS are available for all normal (= non-private) projects:

^ QOS name ^ Gives access to Partition ^ Hard run time limits ^ Description ^
| skylake_0096 | skylake_0096 | 72h (3 days) | Default |
| skylake_0384 | skylake_0384 | 72h (3 days) | High memory |
| skylake_0768 | skylake_0768 | 72h (3 days) | Higher memory |

==== Idle QOS ====

If a project runs out of compute time, its jobs run with low priority and a reduced maximum run time limit in the //idle// QOS:

^ QOS name ^ Gives access to Partition ^ Hard run time limits ^ Description ^
| idle_0096 | skylake_0096 | 24h (1 day) | Projects out of compute time |
| idle_0384 | skylake_0384 | 24h (1 day) | Projects out of compute time |
| idle_0768 | skylake_0768 | 24h (1 day) | Projects out of compute time |

==== Devel QOS ====

The //devel// QOS gives fast feedback while a job is running. Connect to the node where the job is running and [[doku:monitoring|monitor]] it directly to check that the threads/processes are doing what you expect. We recommend this before sending the job to one of the ''compute'' queues.

^ QOS name ^ Gives access to Partition ^ Hard run time limits ^
| skylake_0096_devel | 5 nodes on skylake_0096 | 10min |

==== Private Projects ====

Private projects come with different QOS; nevertheless, partition, QOS, and account have to fit together.

^ QOS name ^ Gives access to Partition ^ Hard run time limits ^
| p....._0... | various | up to 240h (10 days) | private queues |

For submitting jobs to [[doku:slurm]], three parameters are important:

  #SBATCH --account=pxxxxx
  #SBATCH --partition=skylake_xxxx
  #SBATCH --qos=pxxxx_xxxx

==== Run time ====

The run time limits of the QOS can also be requested via the command:

  sacctmgr show qos format=name%20s,priority,grpnodes,maxwall,description%40s

If you know how long your job usually runs, you can set the run time limit in SLURM:

  #SBATCH --time=

Of course this has to be //below// the run time limit of the chosen QOS. Your job might start earlier, which is nice; but once the specified time has elapsed, the job is killed!

Acceptable time formats include:
  * ''minutes''
  * ''minutes:seconds''
  * ''hours:minutes:seconds''
  * ''days-hours''
  * ''days-hours:minutes''
  * ''days-hours:minutes:seconds''
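To illustrate how these formats are interpreted, here is a small helper function; it is a sketch of SLURM's parsing rules, not an official SLURM tool:

```shell
# Unofficial sketch: convert a sbatch --time specification to seconds,
# following the accepted formats listed above.
slurm_time_to_seconds() {
    spec=$1
    days=''
    case $spec in
        *-*) days=${spec%%-*}; spec=${spec#*-} ;;   # split off the "days-" part
    esac
    oldIFS=$IFS; IFS=:; set -- $spec; IFS=$oldIFS   # split the rest on ":"
    if [ -n "$days" ]; then
        # days-hours[:minutes[:seconds]]
        echo $(( days*86400 + $1*3600 + ${2:-0}*60 + ${3:-0} ))
    else
        case $# in
            1) echo $(( $1*60 )) ;;                 # minutes
            2) echo $(( $1*60 + $2 )) ;;            # minutes:seconds
            3) echo $(( $1*3600 + $2*60 + $3 )) ;;  # hours:minutes:seconds
        esac
    fi
}

slurm_time_to_seconds 90        # minutes     -> 5400
slurm_time_to_seconds 2:00:00   # h:m:s       -> 7200
slurm_time_to_seconds 1-12      # days-hours  -> 129600
```

Note that a bare number is read as //minutes//, while a number before a ''-'' is read as //days//, so ''%%--time=90%%'' and ''%%--time=1-12%%'' request quite different limits.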