doku:vsc5_queue: revision 2024/04/25 13:17 (current) by goldenberg; previous revision 2023/04/06 12:16 by msiegel
====== Queue | Partition setup on VSC-5 ======

On VSC-5, nodes of the same type of hardware are grouped into partitions. The quality of service (QOS), formerly called //queue//, defines the maximum run time of a job and the number and type of allocatable nodes.
For submitting jobs to SLURM, the following parameters are important:

  * Core hours will be charged to the specified account.
  * Account, partition, and QOS have to fit together.
  * If the account is not given, the default account will be used.
  * If partition and QOS are not given, the default values (''zen3_0512'') are used.
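Put together, a minimal job script can look like the following sketch; the account ''p70000'' is a placeholder, replace it with your own project:

```shell
#!/bin/bash
# Sketch of a minimal job script. "p70000" is a placeholder account.
# Partition and QOS fit together here (both zen3_0512, the defaults).
#SBATCH --job-name=example
#SBATCH --account=p70000
#SBATCH --partition=zen3_0512
#SBATCH --qos=zen3_0512
#SBATCH --ntasks=1

# The #SBATCH lines above are comments to bash and only read by SLURM.
host=$(hostname)
echo "job ran on $host"
```

Since partition and QOS both default to ''zen3_0512'', those two lines could be omitted in this particular example.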
===== Partitions =====

Nodes of the same type of hardware are grouped into partitions. There are three basic types:

  * Intel CPU nodes: There is only one variant, with Cascadelake CPUs and 384GB RAM.
  * AMD CPU nodes: They all have Zen3 CPUs, but come in three memory versions: 512GB, 1TB and 2TB RAM.
  * GPU nodes: There are two versions, one with Zen2 CPUs, 256GB RAM and 2x nVidia A40 GPUs, and one with Zen3 CPUs, 512GB RAM and 2x nVidia A100 GPUs.
These are the partitions on VSC-5:

^ Partition ^ Nodes ^ Architecture ^ CPU ^ Cores per CPU (physical/logical) ^ GPU ^ RAM ^ Use ^
| zen3_0512* | 564 | AMD | 2x AMD 7713 | 64/128 | No | 512 GB | The default partition |
| zen3_1024 | 120 | AMD | 2x AMD 7713 | 64/128 | No | 1 TB | High Memory partition |
| zen3_2048 | 20 | AMD | 2x AMD 7713 | 64/128 | No | 2 TB | Higher Memory partition |
| cascadelake_0384 | 48 | Intel | 2x Intel Cascadelake | 48/96 | No | 384 GB | Directly use programs compiled for VSC-4 |
| zen2_0256_a40x2 | 45 | AMD | 2x AMD 7252 | 8/16 | 2x NVIDIA A40 | 256 GB | Best for single precision GPU code |
| zen3_0512_a100x2 | 60 | AMD | 2x AMD 7713 | 64/128 | 2x NVIDIA A100 | 512 GB | Best for double precision GPU code |
The partitions and their current usage can be listed with the ''sinfo'' command.
For the sake of completeness, there are also internally used //special// partitions that cannot be selected manually:
^ Partition ^ Description ^
| login5 | login nodes, not an actual slurm partition |
| rackws5 | GUI login nodes, not an actual slurm partition |
| _jupyter | variations of zen3, a40 and a100 nodes reserved for the jupyterhub |
===== Quality of service (QOS) =====

The QOS defines the maximum run time of a job and the number and type of allocatable nodes.

The QOSs that are assigned to a specific user can be viewed with:
<code>
sacctmgr show user `id -u` withassoc format=user,defaultaccount,account,qos,defaultqos
</code>
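In scripts it can be handy to pull just the QOS column out of such output. The snippet below is a sketch; the sample text is invented for illustration and is not real ''sacctmgr'' output:

```shell
# Sketch: extract the comma-separated QOS column from sacctmgr-style
# output. The sample text below is made up for illustration only.
sample="User       Def Acct   Account    QOS                  Def QOS
---------- ---------- ---------- -------------------- ----------
71234      p70000     p70000     idle_0512,zen3_0512  zen3_0512"

# Skip the two header lines, print the 4th column, one QOS per line.
qos_list=$(printf '%s\n' "$sample" | awk 'NR>2 {print $4}' | tr ',' '\n')
echo "$qos_list"
```

The same pipeline applied to real ''sacctmgr'' output yields the QOS names you may use in ''#SBATCH --qos='' lines.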
All usable QOS are also shown right after login.
==== QOS, Partitions and Run time limits ====

The following QOS are available for all normal (= non-private) projects:
^ QOS name ^ Gives access to Partition ^ Hard run time limit ^ Description ^
| zen3_0512 | zen3_0512 | 72h (3 days) | Default |
| zen3_1024 | zen3_1024 | 72h (3 days) | High Memory |
| zen3_2048 | zen3_2048 | 72h (3 days) | Higher Memory |
| cascadelake_0384 | cascadelake_0384 | 72h (3 days) | |
| zen2_0256_a40x2 | zen2_0256_a40x2 | 72h (3 days) | GPU Nodes |
| zen3_0512_a100x2 | zen3_0512_a100x2 | 72h (3 days) | GPU Nodes |
| zen3_0512_devel | 5 nodes on zen3_0512 | 10min | Fast Feedback |
==== Idle QOS ====

If a project runs out of compute time, its jobs run with low priority and a reduced maximum run time in the //idle// QOS.

There is no //idle// QOS on the ''cascadelake_0384'' or GPU partitions.

^ QOS name ^ Gives access to Partition ^ Hard run time limit ^ Description ^
| idle_0512 | zen3_0512 | 24h (1 day) | Projects out of compute time |
| idle_1024 | zen3_1024 | 24h (1 day) | Projects out of compute time |
| idle_2048 | zen3_2048 | 24h (1 day) | Projects out of compute time |
==== Devel QOS ====

The //devel// QOS gives fast feedback when a job is running. Connect to the node where the job is running to monitor it directly.

^ QOS name ^ Gives access to Partition ^ Hard run time limit ^
| zen3_0512_devel | 5 nodes on zen3_0512 | 10min |
==== Private Projects ====

Private projects come with different QOS:

^ QOS name ^ Gives access to Partition ^ Hard run time limit ^
| p....._0... | private nodes | |
For submitting jobs to SLURM with a private project, set account, partition, and QOS accordingly:

<code bash>
#SBATCH --account=pxxxxx
#SBATCH --partition=zen3_xxxx
#SBATCH --qos=pxxxx_xxxx
</code>
==== Run time ====
The QOS's run time limits can also be requested via the command:

<code>
sacctmgr show qos format=name,maxwall
</code>
If you know how long your job usually runs, you can set the run time limit in SLURM:

<code>
#SBATCH --time=<time>
</code>

Of course this has to be //below// the QOS's hard run time limit.
Acceptable time formats include:

  * "minutes"
  * "minutes:seconds"
  * "hours:minutes:seconds"
  * "days-hours"
  * "days-hours:minutes"
  * "days-hours:minutes:seconds"
+ | * " | ||