====== Queue | Partition | QOS setup on VSC-4 ======

On VSC-4, nodes of the same type of hardware are grouped into partitions. The quality of service (QOS), formerly called //Queue//, defines the maximum run time of a job and the number and type of allocatable nodes. For submitting jobs to [[doku:slurm]], three parameters are important:

  #SBATCH --account=xxxxxx
  #SBATCH --partition=skylake_xxxx
  #SBATCH --qos=xxxxx_xxxx

Notes:
  * Core hours will be charged to the specified account.
  * Account, partition, and QOS have to fit together.
  * If no account is given, the default account is used.
  * If partition and QOS are not given, the default ''skylake_0096'' is used for both.

===== Partitions =====

There are three basic types of compute nodes, all with the same CPU but with different amounts of memory: 96 GB, 384 GB, and 768 GB. These are the partitions on VSC-4:

^ Partition ^ Nodes ^ Architecture ^ CPU ^ Cores per CPU (physical/with HT) ^ GPU ^ RAM ^ Use ^
| skylake_0096 | 702 | Intel | 2x Xeon Platinum 8174 | 24/48 | No | 96 GB | The default partition |
| skylake_0384 | 78 | Intel | 2x Xeon Platinum 8174 | 24/48 | No | 384 GB | High memory partition |
| skylake_0768 | 12 | Intel | 2x Xeon Platinum 8174 | 24/48 | No | 768 GB | Higher memory partition |

Type ''sinfo -o %P'' on any node to list all available partitions.

For the sake of completeness, there are internally used //special// partitions that cannot be selected manually:

^ Partition ^ Description ^
| login4 | login nodes, not an actual slurm partition |
| rackws4 | GUI login nodes, not an actual slurm partition |
| _jupyter | reserved for the jupyterhub |

===== Quality of service (QOS) =====

The QOS defines the maximum run time of a job and the number and type of allocatable nodes.
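Putting the three parameters together, a complete job script might look like the following sketch; the account ''p70999'', the job name, and the payload are placeholders, not real values:

```shell
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --account=p70999          # placeholder: use one of your own accounts
#SBATCH --partition=skylake_0384
#SBATCH --qos=skylake_0384        # account, partition, and QOS must fit together

echo "Job $SLURM_JOB_ID running on $(hostname)"
```

Submit it with ''sbatch job.sh''; with both partition and QOS omitted, the job would instead run in the default ''skylake_0096''.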
The QOS assigned to a specific user can be viewed with:

  sacctmgr show user `id -u` withassoc format=user,defaultaccount,account,qos%40s,defaultqos%20s

All usable QOS are also shown right after login.

==== QOS, Partitions and Run time limits ====

The following QOS are available for all normal (= non-private) projects:

^ QOS name ^ Gives access to Partition ^ Hard run time limits ^ Description ^
| skylake_0096 | skylake_0096 | 72h (3 days) | Default |
| skylake_0384 | skylake_0384 | 72h (3 days) | High memory |
| skylake_0768 | skylake_0768 | 72h (3 days) | Higher memory |

==== Idle QOS ====

If a project runs out of compute time, its jobs run with low priority and a reduced maximum run time limit in the //idle// QOS:

^ QOS name ^ Gives access to Partition ^ Hard run time limits ^ Description ^
| idle_0096 | skylake_0096 | 24h (1 day) | Projects out of compute time |
| idle_0384 | skylake_0384 | 24h (1 day) | Projects out of compute time |
| idle_0768 | skylake_0768 | 24h (1 day) | Projects out of compute time |

==== Devel QOS ====

The //devel// QOS gives fast feedback while a job is running. Connect to the node where the job is running and [[doku:monitoring|monitor]] it directly to check that the threads/processes are doing what you expect. We recommend this before sending the job to one of the ''compute'' queues.

^ QOS name ^ Gives access to Partition ^ Hard run time limits ^
| skylake_0096_devel | 5 nodes on skylake_0096 | 10min |

==== Private Projects ====

Private projects come with different QOS; nevertheless, partition, QOS, and account have to fit together.

^ QOS name ^ Gives access to Partition ^ Hard run time limits ^
| p....._0... | various | up to 240h (10 days) | private queues |

For submitting jobs to [[doku:slurm]], three parameters are important:

  #SBATCH --account=pxxxxx
  #SBATCH --partition=skylake_xxxx
  #SBATCH --qos=pxxxx_xxxx

==== Run time ====

The run time limits of the QOS can also be requested via the command:

  sacctmgr show qos format=name%20s,priority,grpnodes,maxwall,description%40s

If you know how long your job usually runs, you can set the run time limit in SLURM:

  #SBATCH --time=

Of course this has to be //below// the run time limit of the chosen QOS. Your job might start earlier, which is nice; but once the specified time has elapsed, the job is killed!

Acceptable time formats include:
  * ''minutes''
  * ''minutes:seconds''
  * ''hours:minutes:seconds''
  * ''days-hours''
  * ''days-hours:minutes''
  * ''days-hours:minutes:seconds''
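To illustrate how these formats are interpreted, here is a small helper function; it is a sketch of SLURM's parsing rules, not an official SLURM tool:

```shell
# Unofficial sketch: convert a sbatch --time specification to seconds,
# following the accepted formats listed above.
slurm_time_to_seconds() {
    spec=$1
    days=''
    case $spec in
        *-*) days=${spec%%-*}; spec=${spec#*-} ;;   # split off the "days-" part
    esac
    oldIFS=$IFS; IFS=:; set -- $spec; IFS=$oldIFS   # split the rest on ":"
    if [ -n "$days" ]; then
        # days-hours[:minutes[:seconds]]
        echo $(( days*86400 + $1*3600 + ${2:-0}*60 + ${3:-0} ))
    else
        case $# in
            1) echo $(( $1*60 )) ;;                 # minutes
            2) echo $(( $1*60 + $2 )) ;;            # minutes:seconds
            3) echo $(( $1*3600 + $2*60 + $3 )) ;;  # hours:minutes:seconds
        esac
    fi
}

slurm_time_to_seconds 90        # minutes     -> 5400
slurm_time_to_seconds 2:00:00   # h:m:s       -> 7200
slurm_time_to_seconds 1-12      # days-hours  -> 129600
```

Note that a bare number is read as //minutes//, while a number before a ''-'' is read as //days//, so ''%%--time=90%%'' and ''%%--time=1-12%%'' request quite different limits.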