Differences
doku:slurm_multisite_admin [2022/12/22 21:57] – fsattari | doku:slurm_multisite_admin [2022/12/22 23:00] – fsattari | ||
---|---|---|---|
Line 15: | Line 15: | ||
[...]# sacctmgr list cluster | [...]# sacctmgr list cluster | ||
- | | + | |
- | ---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- ------ | + | ------------------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- ------ |
- | | + | vsc4 X.X.X.X |
- | vscdev2 | + | vsc5 X.X.X.X |
</ | </ | ||
Line 29: | Line 29: | ||
* In plain multi-cluster operation, job IDs are not unique across clusters. | * In plain multi-cluster operation, job IDs are not unique across clusters. | ||
- | {{: | + | {{: |
Line 43: | Line 43: | ||
The third parameter after " | The third parameter after " | ||
- | * '' | + | * '' |
< | < | ||
- | [...]# sinfo -M vscdev,vscdev2 | + | [...]# sinfo -M vsc4,vsc5 |
- | CLUSTER: | + | CLUSTER: |
- | PARTITION AVAIL TIMELIMIT | + | PARTITION |
- | test* | + | jupyter |
- | test* | + | skylake_0768 |
+ | . | ||
+ | . | ||
+ | . | ||
- | CLUSTER: | + | |
- | PARTITION AVAIL TIMELIMIT | + | CLUSTER: |
- | test* | + | PARTITION |
- | test* | + | cascadelake_0384 |
+ | zen3_2048 | ||
+ | . | ||
+ | . | ||
+ | . | ||
</ | </ | ||
- | * '' | + | * '' |
< | < | ||
- | [...]# squeue -M vscdev,vscdev2 | + | [...]# squeue -M vsc4,vsc5 |
- | CLUSTER: | + | CLUSTER: |
- | JOBID PARTITION | + | |
- | 211 test | + | 178418 skylake_0 V_0.2_U_ |
+ | | ||
+ | . | ||
+ | . | ||
+ | . | ||
- | CLUSTER: | + | |
+ | CLUSTER: | ||
JOBID PARTITION | JOBID PARTITION | ||
- | 249 test | + | ... |
</ | </ | ||
* '' | * '' | ||
< | < | ||
- | | + | |
</ | </ | ||
==== Multi-Cluster job submission ==== | ==== Multi-Cluster job submission ==== | ||
Line 78: | Line 90: | ||
Assume a submission script '' | Assume a submission script '' | ||
< | < | ||
- | [username@node ~]$ sbatch job.sh </ | + | [username@node ~]$ sbatch job.sh </ |
A job can be submitted to a list of candidate clusters, in which case it goes to the cluster that can start it the soonest. For this to work, authentication between the clusters must be operational. | A job can be submitted to a list of candidate clusters, in which case it goes to the cluster that can start it the soonest. For this to work, authentication between the clusters must be operational. | ||
< | < | ||
- | [username@node ~]$ sbatch -M vscdev/vscdev2 | + | [username@node ~]$ sbatch -M vsc4/vsc5 job.sh |
</ | </ | ||
- | To submit a job to a specific cluster (here vscdev | + | To submit a job to a specific cluster (here vsc4 or vsc5) |
===== Federated Slurm ===== | ===== Federated Slurm ===== | ||
Line 98: | Line 110: | ||
* A cluster can only be part of one federation at a time | * A cluster can only be part of one federation at a time | ||
* The cluster ID is embedded in the 32-bit job ID | * The cluster ID is embedded in the 32-bit job ID | ||
- | {{: | + | {{: |
< | < | ||
[...]# sacctmgr list cluster withfed | [...]# sacctmgr list cluster withfed | ||
- | | + | |
---------- --------------- ------------ ----- -------- ------- ------------- ------- ------- ------------- -------- ----------- -------------------- | ---------- --------------- ------------ ----- -------- ------- ------------- ------- ------- ------------- -------- ----------- -------------------- | ||
- | vscdev | + | vsc4 |
- | vscdev2 | + | vsc5 |
</ | </ | ||
Line 117: | Line 129: | ||
< | < | ||
[...]# squeue -M vscdev, | [...]# squeue -M vscdev, | ||
- | CLUSTER: | + | CLUSTER: |
- | | + | |
- | ** 67109080** test test | + | |
- | CLUSTER: | + | CLUSTER: |
- | | + | |
- | ** 134217981** | + | |
</ | </ | ||
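The federated job IDs in the old revision's ''squeue'' output (67109080 and 134217981) can be split back into their parts. The following is a minimal sketch, assuming the bit layout described in the Slurm federation documentation (bits 0-25 hold the local job ID, bits 26-31 hold the origin cluster ID). The function names ''fed_cluster_id'' and ''fed_local_jobid'' are hypothetical helpers introduced for illustration; they are not Slurm commands.

```shell
#!/usr/bin/env bash
# Split a federated Slurm job ID into origin cluster ID and local job ID.
# Assumed layout (per the Slurm federation documentation):
#   bits 0-25  -> local job ID
#   bits 26-31 -> origin cluster ID

# Hypothetical helper: print the origin cluster ID (upper 6 bits).
fed_cluster_id() {
    echo $(( $1 >> 26 ))
}

# Hypothetical helper: print the local job ID (lower 26 bits).
fed_local_jobid() {
    echo $(( $1 & 0x3FFFFFF ))
}

# Decode the two job IDs shown in the squeue output above.
for jobid in 67109080 134217981; do
    echo "$jobid: cluster $(fed_cluster_id "$jobid"), local job $(fed_local_jobid "$jobid")"
done
```

Under that assumption, 67109080 decodes to cluster 1, local job 216, and 134217981 decodes to cluster 2, local job 253.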
Line 129: | Line 142: | ||
[root@node]# | [root@node]# | ||
Federation: vscdev_fed | Federation: vscdev_fed | ||
- | Self: vscdev2:X.X.X.X:X ID:2 FedState: | + | Self: vsc4:X.X.X.X:X ID:2 FedState: |
- | Sibling: | + | Sibling: |
</ | </ | ||
Line 136: | Line 149: | ||
==== Slurm Federation Workflow ==== | ==== Slurm Federation Workflow ==== | ||
- | {{ : | + | {{: |
===== Multi-Cluster vs Federation implementation ===== | ===== Multi-Cluster vs Federation implementation ===== | ||
- | {{: | ||
At a basic level, multi-cluster provides a single interface for submitting jobs to multiple separate Slurm clusters; the Slurm database can be shared or dedicated to each cluster. Federation, by contrast, unifies job and scheduling information across the clusters and requires a single shared Slurm database. | At a basic level, multi-cluster provides a single interface for submitting jobs to multiple separate Slurm clusters; the Slurm database can be shared or dedicated to each cluster. Federation, by contrast, unifies job and scheduling information across the clusters and requires a single shared Slurm database. | ||