====== Queue | Partition setup on VSC-5 ======

On VSC-5, nodes of the same type of hardware are grouped into partitions. The quality of service (QOS), formerly called //queue//, defines the maximum run time of a job and the number and type of allocatable nodes.

For submitting jobs to [[doku:slurm|SLURM]], three parameters are important:

<code bash>
#SBATCH --account=xxxxxx
#SBATCH --partition=xxxxx_xxxx
#SBATCH --qos=xxxxx_xxxx
</code>

Notes:

  * Core hours will be charged to the specified account.
  * Account, partition, and QOS have to fit together.
  * If the account is not given, the default account will be used.
  * If partition and QOS are not given, the default value ''zen3_0512'' is used for both.
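Putting these together, a minimal job script for the default partition could look as follows (a sketch: job name, account ''p7xxxx'' and the executable are placeholders):

<code bash>
#!/bin/bash
#SBATCH --job-name=my_job       # placeholder job name
#SBATCH --account=p7xxxx        # placeholder: your project account
#SBATCH --partition=zen3_0512   # the default partition
#SBATCH --qos=zen3_0512         # QOS must fit partition and account
#SBATCH --ntasks=1

./my_program                    # placeholder executable
</code>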
===== Partitions =====

Nodes of the same type of hardware are grouped into partitions. There are three basic types:

  * Intel CPU nodes: Cascade Lake CPUs with 384 GB RAM.
  * AMD CPU nodes: Zen3 CPUs with 512 GB, 1 TB or 2 TB RAM.
  * GPU nodes: there are two versions, one with Zen2 CPUs, 256 GB RAM and 2x nVidia A40 GPUs, and one with Zen3 CPUs, 512 GB RAM and 2x nVidia A100 GPUs.

These are the partitions on VSC-5:

^ Partition ^ Nodes ^ Architecture ^ CPU ^ GPU ^ RAM ^ Use ^
| zen3_0512* | 564 | AMD | 2x AMD 7713 (64 cores/CPU) | No | 512 GB | The default partition |
| zen3_1024 | 120 | AMD | 2x AMD 7713 (64 cores/CPU) | No | 1 TB | High memory partition |
| zen3_2048 | 20 | AMD | 2x AMD 7713 (64 cores/CPU) | No | 2 TB | Higher memory partition |
| cascadelake_0384 | 48 | Intel | 2x Intel Cascade Lake | No | 384 GB | Directly use programs compiled for VSC-4 |
| zen2_0256_a40x2 | 45 | AMD | 2x AMD 7252 (8 cores/CPU) | 2x nVidia A40 | 256 GB | Best for single precision GPU code |
| zen3_0512_a100x2 | 60 | AMD | 2x AMD 7713 (64 cores/CPU) | 2x nVidia A100 | 512 GB | Best for double precision GPU code |

Type ''sinfo -o %P'' on any node to see all available partitions (''*'' marks the default).
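For example, a job targeting the A100 GPU nodes pairs partition and QOS of the same name (a sketch; the account is a placeholder, and the ''--gres'' line assumes the standard SLURM way of requesting GPUs):

<code bash>
#SBATCH --account=p7xxxx              # placeholder
#SBATCH --partition=zen3_0512_a100x2
#SBATCH --qos=zen3_0512_a100x2
#SBATCH --gres=gpu:2                  # assumption: request both A100 GPUs of a node
</code>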
For the sake of completeness: there are internally used //special// partitions that cannot be selected manually:

^ Partition ^ Description ^
| login5 | login nodes, not an actual slurm partition |
| rackws5 | GUI login nodes, not an actual slurm partition |
| _jupyter | variations of zen3, A40 and A100 nodes reserved for the JupyterHub |
===== Quality of service (QOS) =====

The QOS defines the maximum run time of a job and the number and type of allocatable nodes.

The QOSs that are assigned to a specific user can be viewed with:

<code bash>
sacctmgr show user `id -u` withassoc format=user,defaultaccount,account,qos%40s,defaultqos%20s
</code>

All usable QOS are also shown right after login.
==== QOS, Partitions and Run time limits ====

The following QOS are available for all normal (i.e. non-private) projects:

^ QOS name ^ Gives access to Partition ^ Hard run time limits ^ Description ^
| zen3_0512 | zen3_0512 | 72h (3 days) | Default |
| zen3_1024 | zen3_1024 | 72h (3 days) | High memory |
| zen3_2048 | zen3_2048 | 72h (3 days) | Higher memory |
| cascadelake_0384 | cascadelake_0384 | 72h (3 days) | Use programs compiled for VSC-4 |
| zen2_0256_a40x2 | zen2_0256_a40x2 | 72h (3 days) | GPU nodes |
| zen3_0512_a100x2 | zen3_0512_a100x2 | 72h (3 days) | GPU nodes |
| zen3_0512_devel | 5 nodes on zen3_0512 | 10min | Fast feedback for testing |
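For instance, a job that needs the 1 TB nodes combines the matching pair (a sketch; the account is a placeholder):

<code bash>
#SBATCH --account=p7xxxx      # placeholder
#SBATCH --partition=zen3_1024
#SBATCH --qos=zen3_1024
</code>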
==== Idle QOS ====

If a project runs out of compute time, its jobs run with low priority and a reduced maximum run time limit in the //idle// QOS. There is no //idle// QOS for the cascadelake or GPU partitions.
^ QOS name ^ Gives access to Partition ^ Hard run time limits ^ Description ^
| idle_0512 | zen3_0512 | 24h (1 day) | Projects out of compute time |
| idle_1024 | zen3_1024 | 24h (1 day) | Projects out of compute time |
| idle_2048 | zen3_2048 | 24h (1 day) | Projects out of compute time |
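Submitting to the //idle// QOS works like any other partition/QOS pairing (a sketch; the account is a placeholder):

<code bash>
#SBATCH --account=p7xxxx      # placeholder
#SBATCH --partition=zen3_0512
#SBATCH --qos=idle_0512       # low priority, 24h hard run time limit
</code>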
==== Devel QOS ====

The //devel// QOS gives fast feedback to users that their job runs. Connect to the node where the job is actually running to monitor it directly.

^ QOS name ^ Gives access to Partition ^ Hard run time limits ^
| zen3_0512_devel | 5 nodes on zen3_0512 | 10min |
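A short test job in the //devel// QOS might be submitted like this (a sketch; the account is a placeholder, and the optional time limit must stay within the 10min cap):

<code bash>
#SBATCH --account=p7xxxx           # placeholder
#SBATCH --partition=zen3_0512
#SBATCH --qos=zen3_0512_devel      # 10min hard run time limit
#SBATCH --time=00:05:00            # optional: even shorter limit
</code>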
==== Private Projects ====

Private projects come with their own QOS; nevertheless partition, QOS, and account have to fit together.

^ QOS name ^ Gives access to Partition ^ Hard run time limits ^
| p....._0... | | |
For submitting jobs to [[doku:slurm|SLURM]], the same three parameters are important:

<code bash>
#SBATCH --account=pxxxxx
#SBATCH --partition=zen3_xxxx
#SBATCH --qos=pxxxx_xxxx
</code>
==== Run time ====

The QOS's run time limits can also be requested via the command:

<code bash>
sacctmgr show qos format=name%20s,priority,grpnodes,maxwall,description%40s
</code>

If you know how long your job usually runs, you can set the run time limit in SLURM:

<code bash>
#SBATCH --time=<time>
</code>

Of course this has to be //below// the QOS's hard run time limit. Your job might start earlier, which is nice; but once the specified time has elapsed, the job is killed!

Acceptable time formats are:

  * "minutes"
  * "minutes:seconds"
  * "hours:minutes:seconds"
  * "days-hours"
  * "days-hours:minutes"
  * "days-hours:minutes:seconds"