Overview of Multi-Clustering

This version (2022/12/22 20:12) is a draft.

Multi-Clustering allows running several Slurm clusters from the same control node.
In this case, separate slurmctld daemons run on the same machine, and users can direct commands to any (or all) of the clusters.
From an architecture standpoint, each cluster still relies on its own Slurm controller.
From an end-user standpoint, jobs can run not only on the local Slurm cluster but also on the extra resources provided by the additional Slurm clusters.

Multi-cluster Slurm operation lets users submit a job to any of the available clusters, not just the local one, and receive status information from those remote clusters.

[...]# sacctmgr list cluster

   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
    vscdev         X.X.X.X           XX    XX         1                                                                                           normal
   vscdev2         X.X.X.X           XX    XX         1                                                                                           normal

The multi-cluster functionality requires the use of SlurmDBD.
When sbatch, salloc or srun is invoked with a cluster list, Slurm submits the job to the cluster that offers the earliest start time, considering its queue of pending and running jobs.
However, Slurm makes no subsequent effort to migrate the job to a different cluster whose resources become available earlier, for example when running jobs finish before their scheduled end times.
Without federation, job IDs are not unique across multiple clusters.
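
The selection rule described above can be illustrated with a small sketch. Slurm performs this choice internally; the helper name pick_earliest_cluster and the start times below are hypothetical and only show the "earliest estimated start wins" idea.

```bash
# Hypothetical illustration only: Slurm makes this decision itself.
# Input on stdin: one "<cluster> <estimated_start_epoch>" pair per line.
pick_earliest_cluster() {
    # Keep the cluster whose estimated start time is the smallest.
    awk 'NR == 1 || $2 + 0 < min { min = $2 + 0; name = $1 } END { print name }'
}

# Made-up start times: vscdev2 could start the job 600 seconds earlier.
printf '%s\n' 'vscdev 1671740000' 'vscdev2 1671739400' | pick_earliest_cluster
# prints: vscdev2
```

Once the job has been routed this way it stays on the chosen cluster, even if another cluster frees up sooner afterwards.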

Slurm client commands offer the “-M, --clusters=” option to communicate with a list of clusters. The argument following “-M” or “--clusters=” is the cluster name (or a comma-separated list of cluster names).

[…]$ sinfo -M vscdev,vscdev2 shows the partitions and node states of the two clusters (vscdev, vscdev2).

[...]# sinfo -M vscdev,vscdev2
CLUSTER: vscdev
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
test*        up   infinite      1   unk* storage01
test*        up   infinite      2   idle storage[02-03]

CLUSTER: vscdev2
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
test*        up   infinite      1   unk* storage08
test*        up   infinite      4   idle storage[04-07]

[…]$ squeue -M vscdev,vscdev2 shows the jobs queued and running on the two clusters, together with their state and resource allocation. [[doku:slurm_job_reason_codes|Here]] is a description of the most important job reason codes returned by the squeue command.

[...]# squeue -M vscdev,vscdev2
CLUSTER: vscdev
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               211      test     test     root  R       0:04      2 storage[02-03]

CLUSTER: vscdev2
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               249      test     test     root  R       0:07      1 storage04

[…]$ scontrol -M vscdev (or -M vscdev2) is used to view the Slurm configuration, including job, job step, node, partition, reservation, and overall system configuration. Without a command entered on the execute line, scontrol operates in interactive mode and prompts for input. With a command entered on the execute line, scontrol executes that command and terminates. Unlike sinfo and squeue, scontrol can address only one cluster at a time.

 [...]# scontrol -M vscdev show job 211
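
Because scontrol addresses a single cluster per call, querying several clusters means repeating the command. A minimal sketch of a wrapper for that follows; the function name for_each_cluster is hypothetical and not part of Slurm.

```bash
# Hypothetical helper: run one Slurm command once per cluster.
# Usage: for_each_cluster CLUSTER_LIST COMMAND [ARGS...]
for_each_cluster() (
    list="$1"
    cmd="$2"
    shift 2
    # Cluster names contain no spaces, so splitting the comma list this way is safe.
    for cluster in $(printf '%s' "$list" | tr ',' ' '); do
        "$cmd" -M "$cluster" "$@"
    done
)

# Example (requires a working Slurm installation):
# for_each_cluster vscdev,vscdev2 scontrol show config
```

The function body runs in a subshell, so its variables do not leak into the calling shell.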

Submitting batch jobs

Assume a submission script job.sh
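
A minimal sketch of what such a job.sh could look like is shown here. The job name, partition, node count and time limit are assumptions chosen to match the example output on this page, not values prescribed by it.

```bash
#!/bin/bash
# Assumed example values; adjust them to the target cluster.
#SBATCH --job-name=test
#SBATCH --partition=test
#SBATCH --nodes=1
#SBATCH --time=00:05:00

# Slurm sets these variables inside a job; the fallbacks let the
# script also run outside of Slurm for a quick check.
job_id="${SLURM_JOB_ID:-unknown}"
cluster="${SLURM_CLUSTER_NAME:-unknown}"

echo "Job ${job_id} running on cluster ${cluster}, host $(uname -n)"
```

Printing the cluster name makes it easy to see in the job output where a multi-cluster submission actually ran.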

[username@node ~]$ sbatch job.sh

Or

 [username@node ~]$ sbatch -M vscdev,vscdev2 job.sh

This submits the job to a list of candidate clusters, and Slurm routes it to the cluster that can start it the soonest. For this to work, the authentication between the clusters must be operational.

 
[username@node ~]$ sbatch -M vscdev job.sh

This submits the job to one specific cluster (here vscdev; use -M vscdev2 to target the other).

Federation is based on the multi-cluster Slurm implementation but provides one aggregated system built from the independent Slurm clusters and works in a peer-to-peer way. In a federation, a job is submitted to the local cluster (on which the command is invoked) and is then replicated across the clusters in the federation. Each cluster then independently attempts to schedule the job based on its own scheduling policies. The clusters coordinate with the “origin” cluster (the cluster the job was submitted to) to schedule the job.

Federation is used to unify the job ID and scheduling information among all clusters in the federation.
Multiple, independent clusters can be used as one global resource.
slurmdbd pushes updates to all clusters in the federation.
A cluster can only be part of one federation at a time.
The cluster ID is embedded within the existing 32-bit job ID.

[...]# sacctmgr list cluster withfed
   Cluster     ControlHost  ControlPort   RPC     Share GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS Federation     ID             Features     FedState
---------- --------------- ------------ ----- --------- --------- ------- ------------- --------- ----------- -------------------- --------- ---------- ------ -------------------- ------------
    vscdev         X.X.X.X            X     X         1                                                                     normal           vscdev_fed      1           synced:yes       ACTIVE
   vscdev2         X.X.X.X            X     X         1                                                                     normal           vscdev_fed      2           synced:yes       ACTIVE

When a federated cluster receives a job submission, it will submit copies of the job (sibling jobs) to each eligible cluster. Each cluster will then independently attempt to schedule the job. Once a sibling job is started, the origin cluster revokes the remaining sibling jobs.

Job submission works the same way as in the multi-cluster setup. In addition, the --cluster-constraint=[!]<constraint_list> option restricts the job to the cluster(s) in the federation that match the given constraints.

[...]# squeue -M vscdev,vscdev2
CLUSTER: vscdev
             JOBID          PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          67109080          test          test     root  R       0:01      2 storage[02-03]

CLUSTER: vscdev2
             JOBID          PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
         134217981          test          test     root  R       0:11      1 storage04
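
The large job IDs in the squeue output above (67109080 and 134217981) carry the ID of the origin cluster. The sketch below splits such an ID, assuming the layout described in the Slurm federation documentation: the low 26 bits hold the local job ID and the bits above hold the cluster ID. The function name decode_fed_job_id is hypothetical.

```bash
# Assumed layout: bits 0-25 = local job ID, bits 26-31 = origin cluster ID.
decode_fed_job_id() {
    cluster_id=$(( $1 >> 26 ))
    local_id=$(( $1 & 0x3FFFFFF ))
    echo "cluster_id=${cluster_id} local_job_id=${local_id}"
}

decode_fed_job_id 67109080    # cluster_id=1 local_job_id=216
decode_fed_job_id 134217981   # cluster_id=2 local_job_id=253
```

The decoded cluster IDs 1 and 2 agree with the ID column that sacctmgr list cluster withfed reports for vscdev and vscdev2.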

[root@node]# scontrol show fed --sibling job
Federation: vscdev_fed
Self:       vscdev2:X.X.X.X:X ID:2 FedState:ACTIVE Features:synced:yes
Sibling:    vscdev:X.X.X.X:X  ID:1 FedState:ACTIVE Features:synced:yes PersistConnSend/Recv:Yes/Yes Synced:Yes

In short, multi-cluster provides a single interface to submit jobs to multiple separate Slurm clusters, and the Slurm database can be shared or dedicated to each cluster. Federation unifies the job and scheduling information of the clusters into one system, and the Slurm database must be shared.

I/O components are much slower than the compute parts of a supercomputer, therefore they can create bottlenecks if the bandwidth is saturated.

Data staging generates a large volume of traffic on the network connecting the compute nodes, since input and output data must be moved to and from them. Inter-process communication traffic flows over the same network, and mutual interference between the two types of traffic may degrade network performance. For example, burst traffic from data staging increases the delay of inter-process communication. Both types of traffic also compete for network bandwidth, which increases communication time.

The Burst-Buffer plugin adds a layer between the compute nodes and the parallel file system to improve network performance, I/O, and data staging.
