====== Overview of Multi-Clustering ======

In a multi-cluster environment, it is possible to share resources such as compute nodes among the clusters. This maximizes resource utilization and reduces idle time.

  * Multi-Clustering allows running several Slurm clusters from the same control node.
  * In this case, different slurmctld daemons will be running on the same machine, and the system users can target commands to any (or all) of the clusters (a minimal configuration sketch is shown below).
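The exact setup is site-specific, but as a minimal sketch (assuming both slurmctld daemons run on a shared control node and report to a single slurmdbd; all host names and port numbers below are placeholders), each cluster gets its own ''slurm.conf'' and is registered in the shared accounting database:

<code>
# excerpt from each cluster's slurm.conf (placeholder values)
ClusterName=vsc4                                  # unique name per cluster
SlurmctldHost=ctrlnode                            # may be the same machine for both clusters
SlurmctldPort=6817                                # must differ for each slurmctld on a shared control node
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=dbdnode                     # all clusters point to the same slurmdbd

# register the clusters in the shared accounting database
[...]# sacctmgr add cluster vsc4
[...]# sacctmgr add cluster vsc5
</code>

The registered clusters then show up in ''sacctmgr list cluster'':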
<code>
[...]# sacctmgr list cluster

   Cluster     ControlHost  ControlPort   RPC  Share GrpJobs   GrpTRES GrpSubmit MaxJobs   MaxTRES MaxSubmit  MaxWall  QOS   Def QOS
---------- --------------- ------------ ----- ------ ------- --------- --------- ------- --------- --------- -------- ----- ---------
      vsc4         X.X.X.X           XX    XX      1                                                                  normal
      vsc5         X.X.X.X           XX    XX      1                                                                  normal
</code>
  
{{:doku:vsc4-vsc5-multiclustering.png?600|}}

==== Node allocation policy ====
  
  * To enable the multi-cluster functionality, the use of SlurmDBD and MUNGE (or authentication keys) is required.
  * When sbatch, salloc or srun is invoked with a cluster list, Slurm submits the job to the cluster that offers the earliest start time, considering its queue of pending and running jobs (see the example below the figure).
  * BUT Slurm will make no subsequent effort to migrate the job to a different cluster whose resources become available when running jobs finish before their scheduled end times.
  * By default, job IDs are not unique across multiple clusters (see the Federated Slurm section below).
  
{{:doku:mcslurm.png?700|}}
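For example, when a job is submitted with a cluster list, sbatch reports which cluster accepted it (the job ID and the chosen cluster below are purely illustrative):

<code>
[username@node ~]$ sbatch --clusters=vsc4,vsc5 job.sh
Submitted batch job 178418 on cluster vsc4
</code>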
  
  
The parameter given after "-M" or "--clusters=" in the Slurm call is the cluster name (or a list of possible clusters).
  
  * ''[...]$ sinfo -M vsc4,vsc5'' shows the state of the partitions and nodes of the two clusters (vsc4, vsc5).
  
<code>
[...]# sinfo -M vsc4,vsc5
CLUSTER: vsc4
PARTITION          AVAIL  TIMELIMIT  NODES  STATE  NODELIST
jupyter            up     infinite       3  idle   n4905-025,n4906-020,n4912-072
skylake_0768       up     infinite       9  alloc  n4911-[011-012,023-024,035-036,047-048,060]
...

CLUSTER: vsc5
PARTITION          AVAIL  TIMELIMIT  NODES  STATE  NODELIST
cascadelake_0384   up     infinite       5  unk*   n452-[001,003-004,007-008]
zen3_2048          up     infinite       9  down*  n3511-[011-013,015-020]
...
</code>
  * ''[...]$ squeue -M vsc4,vsc5'' shows which jobs are running on the two clusters, the current list of submitted jobs, their state, and their resource allocation. [[doku:slurm_job_reason_codes|Here]] is a description of the most important job reason codes returned by the squeue command.
  
<code>
[...]# squeue -M vsc4,vsc5
CLUSTER: vsc4
             JOBID  PARTITION      NAME    USER  ST   TIME  NODES  NODELIST(REASON)
             178418 skylake_0  V_0.2_U_  nobody  PD   0:00      2  n4905-025,n4906-020
             178420 skylake_0  V_0.3_U_  nobody  PD   0:00      2  n4905-025,n4906-020
             ...

CLUSTER: vsc5
             JOBID  PARTITION      NAME    USER  ST   TIME  NODES  NODELIST(REASON)
             ...
</code>
  * ''[...]$ scontrol -M vsc4'' (analogously for vsc5) is used to view the Slurm configuration, including: job, job step, node, partition, reservation, and overall system configuration. Without a command entered on the execute line, scontrol operates in an interactive mode and prompts for input. With a command entered on the execute line, scontrol executes that command and terminates. Unlike ''sinfo'' and ''squeue'', only one cluster can be used at a time with ''scontrol''.
<code>
[...]# scontrol -M vsc4 show job 178418
</code>
==== Multi-Cluster job submission ====
Assume a submission script ''job.sh''.
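For illustration, ''job.sh'' could be a minimal batch script along these lines (job name and resource requests are placeholders):

<code>
#!/bin/bash
#SBATCH --job-name=test          # placeholder job name
#SBATCH --ntasks=2               # placeholder resource request
#SBATCH --time=00:10:00          # placeholder time limit
# the target cluster(s) can also be fixed inside the script instead of on the command line:
##SBATCH --clusters=vsc4,vsc5

srun hostname
</code>

The script is then submitted either to the local cluster or to a list of clusters: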
<code>
[username@node ~]$ sbatch job.sh </code> Or <code> [username@node ~]$ sbatch -M vsc4,vsc5 job.sh </code>
The second form submits the job to a list of possible clusters and runs it on the cluster that can start the job the soonest. Of course, to achieve that, the authentication between the clusters must be operational.
    
  
<code>
[username@node ~]$ sbatch -M vsc4 job.sh
</code>
To submit a job to a specific cluster (here vsc4; use ''-M vsc5'' analogously for vsc5).
  
===== Federated Slurm =====
  * A cluster can only be part of one federation at a time.
  * The cluster ID is embedded within the original 32-bit job ID, making job IDs unique across the federation.
{{:doku:orig-federated-jobid.png?600|}}
  
<code>
[...]# sacctmgr list cluster withfed

   Cluster ControlHost ControlPort   RPC Share GrpSubmit MaxJobs MaxTRES MaxSubmit  MaxWall  QOS   Def QOS   Federation  ID    Features  FedState
---------- ----------- ----------- ----- ----- --------- ------- ------- --------- -------- ---- --------- ------------ --- ----------- ---------
      vsc4     X.X.X.X           X                                                          normal           vscdev_fed   1  synced:yes    ACTIVE
      vsc5     X.X.X.X           X                                                          normal           vscdev_fed   2  synced:yes    ACTIVE
</code>
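For reference, a federation such as the one listed above can be created with ''sacctmgr'' (a sketch; the federation and cluster names are taken from the listing above, and ''vsc6'' in the second command is a hypothetical additional cluster):

<code>
[...]# sacctmgr add federation vscdev_fed clusters=vsc4,vsc5

# a cluster can also be added to an existing federation later, e.g. a hypothetical vsc6:
[...]# sacctmgr modify federation vscdev_fed set clusters+=vsc6
</code>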
{{:doku:federation.png?600|}}

==== Federation Job Submission ====
  
<code>
[...]# squeue -M vsc4,vsc5
CLUSTER: vsc4
             JOBID          PARTITION      NAME    USER  ST   TIME  NODES  NODELIST(REASON)
             67109080       skylake_0  V_0.3_U_  nobody  PD   0:05      2  n4905-025,n4906-020

CLUSTER: vsc5
             JOBID          PARTITION      NAME    USER  ST   TIME  NODES  NODELIST(REASON)
             134217981      zen3_2048            nobody  PD   0:25      9  n3511-[011-013,015-020]
</code>
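The job IDs above also show how federation keeps IDs unique across clusters: assuming the usual Slurm federation layout (origin-cluster ID stored in the bits above the 26-bit local job ID), they can be decoded with a bit of shell arithmetic (a sketch, not an official tool):

<code>
# decode the federated job IDs from the squeue output above
for fed_id in 67109080 134217981; do
  cluster_id=$(( fed_id >> 26 ))             # origin cluster ID (cf. the ID column of sacctmgr list cluster withfed)
  local_id=$(( fed_id & ((1 << 26) - 1) ))   # cluster-local job ID
  echo "fed_id=$fed_id  cluster_id=$cluster_id  local_id=$local_id"
done
# prints: fed_id=67109080   cluster_id=1  local_id=216
#         fed_id=134217981  cluster_id=2  local_id=253
</code>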
  
<code>
[root@node]# scontrol show fed --sibling job
Federation: vscdev_fed
Self:       vsc4:X.X.X.X:X ID:1 FedState:ACTIVE Features:synced:yes
Sibling:    vsc5:X.X.X.X:X ID:2 FedState:ACTIVE Features:synced:yes PersistConnSend/Recv:Yes/Yes Synced:Yes
</code>
  
==== Slurm Federation Workflow ====
{{:doku:federationworkflow.png?900|}}
  
  
===== Multi-Cluster vs Federation implementation =====
  
  
In a basic comparison, multi-cluster provides one unified interface for submitting jobs to multiple separate Slurm clusters, and the Slurm database can either be shared or dedicated to each cluster, whereas federation merges the job and scheduling information of the clusters into one, and there must be a single, shared Slurm database.
  
The Burst-Buffer plugin adds a layer between the compute nodes and the parallel file system to improve network performance, I/O, and data staging.

{{:doku:bb-process.png?700|}}
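As an illustration only (the exact directives depend on which burst_buffer plugin is configured; the sketch below uses DataWarp-style ''#DW'' directives, and all paths and the application name are placeholders), a job script using a burst buffer could look like this:

<code>
#!/bin/bash
#SBATCH --job-name=bb_test
#SBATCH --ntasks=4
#SBATCH --time=00:30:00
# allocate a scratch burst buffer and stage data in and out around the job
#DW jobdw type=scratch access_mode=striped capacity=100GB
#DW stage_in  type=file source=/gpfs/data/input.dat destination=$DW_JOB_STRIPED/input.dat
#DW stage_out type=file source=$DW_JOB_STRIPED/output.dat destination=/gpfs/data/output.dat

srun ./my_app $DW_JOB_STRIPED/input.dat $DW_JOB_STRIPED/output.dat
</code>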