Differences
This shows you the differences between two versions of the page.
doku:slurm_multisite_admin [2022/12/22 21:57] – fsattari | doku:slurm_multisite_admin [2023/06/23 13:33] – [Multi-Cluster Slurm Operation] fsattari | ||
---|---|---|---|
Line 15: | Line 15: | ||
[...]# sacctmgr list cluster | [...]# sacctmgr list cluster | ||
- | | + | |
- | ---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- ------ | + | ------------------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- ------ |
- | | + | vsc4 X.X.X.X |
- | vscdev2 | + | vsc5 X.X.X.X |
</ | </ | ||
+ | {{: | ||
==== Node allocation policy ==== | ==== Node allocation policy ==== | ||
- | * The multi-cluster functionality | + | * To enable the multi-cluster functionality the use of SlurmDBD |
* When sbatch, salloc or srun is invoked with a cluster list, Slurm submits the job to the cluster that offers the earliest start time considering its queue of pending and running jobs | * When sbatch, salloc or srun is invoked with a cluster list, Slurm submits the job to the cluster that offers the earliest start time considering its queue of pending and running jobs | ||
* However, Slurm makes no subsequent effort to migrate the job to a different cluster if resources there become available because running jobs finish before their scheduled end times. | * However, Slurm makes no subsequent effort to migrate the job to a different cluster if resources there become available because running jobs finish before their scheduled end times. | ||
* By default, job IDs are not unique across multiple clusters. | * By default, job IDs are not unique across multiple clusters. | ||
- | {{: | + | {{: |
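The allocation policy above can be illustrated with a minimal sketch. It assumes the estimated start time on each cluster is already known; the function name `pick_cluster` and the timestamps are hypothetical and are not part of Slurm. The point it shows is that the choice is made once, at submission time, and is not revisited afterwards.

```python
from datetime import datetime


def pick_cluster(estimated_starts):
    """Return the name of the cluster with the earliest estimated start time.

    estimated_starts maps a cluster name to a datetime. This is only an
    illustration of the policy, not how Slurm computes the estimates.
    """
    return min(estimated_starts, key=estimated_starts.get)


# Hypothetical estimates for the two clusters used on this page.
estimates = {
    "vsc4": datetime(2023, 6, 23, 14, 0),
    "vsc5": datetime(2023, 6, 23, 13, 45),
}

chosen = pick_cluster(estimates)
print(chosen)
```

If jobs on the other cluster finish early after submission, the job still stays where it was placed, which matches the no-migration rule in the list above.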
Line 43: | Line 44: | ||
The third parameter after " | The third parameter after " | ||
- | * '' | + | * '' |
< | < | ||
- | [...]# sinfo -M vscdev,vscdev2 | + | [...]# sinfo -M vsc4,vsc5 |
- | CLUSTER: | + | CLUSTER: |
- | PARTITION AVAIL TIMELIMIT | + | PARTITION |
- | test* | + | jupyter |
- | test* | + | skylake_0768 |
+ | . | ||
+ | . | ||
+ | . | ||
- | CLUSTER: | + | |
- | PARTITION AVAIL TIMELIMIT | + | CLUSTER: |
- | test* | + | PARTITION |
- | test* | + | cascadelake_0384 |
+ | zen3_2048 | ||
+ | . | ||
+ | . | ||
+ | . | ||
</ | </ | ||
- | * '' | + | * '' |
< | < | ||
- | [...]# squeue -M vscdev,vscdev2 | + | [...]# squeue -M vsc4,vsc5 |
- | CLUSTER: | + | CLUSTER: |
- | JOBID PARTITION | + | |
- | 211 test | + | 178418 skylake_0 V_0.2_U_ |
+ | | ||
+ | . | ||
+ | . | ||
+ | . | ||
- | CLUSTER: | + | |
+ | CLUSTER: | ||
JOBID PARTITION | JOBID PARTITION | ||
- | 249 test | + | ... |
</ | </ | ||
* '' | * '' | ||
< | < | ||
- | | + | |
</ | </ | ||
==== Multi-Cluster job submission ==== | ==== Multi-Cluster job submission ==== | ||
Line 78: | Line 91: | ||
Assume a submission script '' | Assume a submission script '' | ||
< | < | ||
- | [username@node ~]$ sbatch job.sh </ | + | [username@node ~]$ sbatch job.sh </ |
A job can be submitted to a list of candidate clusters, in which case Slurm places it on the cluster that can start it soonest. For this to work, authentication between the clusters must be operational. | A job can be submitted to a list of candidate clusters, in which case Slurm places it on the cluster that can start it soonest. For this to work, authentication between the clusters must be operational. | ||
< | < | ||
- | [username@node ~]$ sbatch -M vscdev/vscdev2 | + | [username@node ~]$ sbatch -M vsc4/vsc5 job.sh |
</ | </ | ||
- | To submit a job to a specific cluster (here vscdev | + | To submit a job to a specific cluster (here vsc4 or vsc5) |
===== Federated Slurm ===== | ===== Federated Slurm ===== | ||
Line 98: | Line 111: | ||
* A cluster can only be part of one federation at a time | * A cluster can only be part of one federation at a time | ||
* The cluster ID is embedded in the existing 32-bit job ID | * The cluster ID is embedded in the existing 32-bit job ID | ||
- | {{: | + | {{: |
< | < | ||
[...]# sacctmgr list cluster withfed | [...]# sacctmgr list cluster withfed | ||
- | | + | |
---------- --------------- ------------ ----- -------- ------- ------------- ------- ------- ------------- -------- ----------- -------------------- | ---------- --------------- ------------ ----- -------- ------- ------------- ------- ------- ------------- -------- ----------- -------------------- | ||
- | vscdev | + | vsc4 |
- | vscdev2 | + | vsc5 |
</ | </ | ||
Line 117: | Line 130: | ||
< | < | ||
[...]# squeue -M vscdev, | [...]# squeue -M vscdev, | ||
- | CLUSTER: | + | CLUSTER: |
- | | + | |
- | ** 67109080** test test | + | |
- | CLUSTER: | + | CLUSTER: |
- | | + | |
- | ** 134217981** | + | |
</ | </ | ||
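The federated job IDs in the older output above (67109080 and 134217981) can be split back into a cluster ID and a local job ID. The sketch below assumes the layout described in Slurm's federation documentation, where the low 26 bits hold the local job ID and the upper 6 bits hold the ID of the origin cluster. The helper names are illustrative and not part of any Slurm tool.

```python
# Assumed layout (per Slurm's federation documentation):
# bits 0-25 = local job ID, bits 26-31 = origin cluster ID.
CLUSTER_ID_SHIFT = 26
LOCAL_ID_MASK = (1 << CLUSTER_ID_SHIFT) - 1


def split_fed_job_id(job_id):
    """Return (cluster_id, local_job_id) for a federated job ID."""
    return job_id >> CLUSTER_ID_SHIFT, job_id & LOCAL_ID_MASK


def make_fed_job_id(cluster_id, local_job_id):
    """Combine a cluster ID and a local job ID into a federated job ID."""
    return (cluster_id << CLUSTER_ID_SHIFT) | local_job_id


for fed_id in (67109080, 134217981):
    cluster_id, local_id = split_fed_job_id(fed_id)
    print(f"{fed_id}: cluster ID {cluster_id}, local job ID {local_id}")
```

Under that assumption, the first ID decodes to cluster 1 with local job 216 and the second to cluster 2 with local job 253, which is why job IDs are unique across the federation.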
Line 129: | Line 143: | ||
[root@node]# | [root@node]# | ||
Federation: vscdev_fed | Federation: vscdev_fed | ||
- | Self: vscdev2:X.X.X.X:X ID:2 FedState: | + | Self: vsc4:X.X.X.X:X ID:2 FedState: |
- | Sibling: | + | Sibling: |
</ | </ | ||
Line 136: | Line 150: | ||
==== Slurm Federation Workflow ==== | ==== Slurm Federation Workflow ==== | ||
- | {{ : | + | {{: |
===== Multi-Cluster vs Federation implementation ===== | ===== Multi-Cluster vs Federation implementation ===== | ||
- | {{: | ||
In short, multi-cluster operation provides a single interface for submitting jobs to several separate Slurm clusters, and the Slurm database can either be shared or be dedicated to each cluster. Federation, by contrast, merges the job and scheduling information of the clusters into one view and requires a single shared Slurm database. | In short, multi-cluster operation provides a single interface for submitting jobs to several separate Slurm clusters, and the Slurm database can either be shared or be dedicated to each cluster. Federation, by contrast, merges the job and scheduling information of the clusters into one view and requires a single shared Slurm database. |