
The workload scheduler SLURM (Simple Linux Utility for Resource Management) is used as the batch system. See https://www.schedmd.com/ for details.

Basic commands:

sinfo                 # list information about partitions and node states
squeue                # list jobs in queue
sbatch <job-script>   # submit batch job
scancel <job-id>      # cancel job
srun ...              # run a parallel job within the SLURM environment
scontrol              # show detailed information on jobs and nodes

Example sinfo:

sinfo

PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST
E5-2690v4*    up 7-00:00:00      1  down* c1-10
E5-2690v4*    up 7-00:00:00     19   idle c1-[01-09,11-12],c2-[01-08]
Phi           up 7-00:00:00      1  drain c3-01
Phi           up 7-00:00:00      7   idle c3-[02-08]
E5-1650v4     up 7-00:00:00      1  down* c4-16
E5-1650v4     up 7-00:00:00      1  drain c4-01
E5-1650v4     up 7-00:00:00     14   idle c4-[02-15]
E5-1650v3     up 7-00:00:00      1   idle c5-01
  • down … node is unreachable by the slurm control daemon
  • drain … node is not available for job allocation
  • allocated … node is used by a job
  • idle … node is available for job allocation
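
The reason why a node is drained or down can be shown with scontrol (a minimal example using node c3-01 from the listing above):

scontrol show node c3-01 | grep -i reason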

List the names, node counts, and node lists of the available hardware partitions:

sinfo -o "%.10R %.5D %.N"

 PARTITION NODES NODELIST
 E5-2690v4    20 c1-[01-12],c2-[01-08]
       Phi     8 c3-[01-08]
 E5-1650v4    16 c4-[01-16]
 E5-1650v3     1 c5-01
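
The format string can be extended with further fields, e.g. CPUs (%c) and memory in MB (%m) per node; the exact output depends on the cluster:

sinfo -N -o "%.10N %.6c %.8m"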

To list node states of a specific partition:

sinfo -p Phi

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
Phi          up 7-00:00:00      1  drain c3-01
Phi          up 7-00:00:00      7   idle c3-[02-08]

Submit jobs

Example job script /opt/ohpc/pub/examples/slurm/mul/mpi/job.sh:

#!/bin/bash
#
#SBATCH -J your_job_name_here    # name of job to appear in squeue
#SBATCH -N 1                     # number of nodes requested
#SBATCH -o job.%j.out            # filename for stdout
#SBATCH -p E5-2690v4             # specify hardware partition
#SBATCH -q E5-2690v4-batch       # specify quality of service (QOS)
#SBATCH --ntasks-per-node=28     # request number of tasks for your job
#SBATCH --threads-per-core=1     # specify number of threads per core

env|grep SLURM                   # list of SLURM environment variables to use in job script

module purge
module load gnu7/7.2.0 openmpi/1.10.7 prun

echo
module list

echo
which gcc
which mpicc

mpicc hello.c -o hello           # compile and optimize your code directly on hardware

time mpirun -np $SLURM_NPROCS ./hello  # run your job with mpirun
echo
time prun ./hello                # run your job via prun wrapper script


The modules gnu7/7.2.0 and openmpi/1.10.7 are loaded within the script for this example:

sbatch job.sh

sbatch: info for user at sbatch
sbatch: current partition E5-2690v4
sbatch: current qos E5-2690v4-batch
Submitted batch job 548
squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               548 E5-2690v4  job_nam  usernam  R       0:01      1 c1-01

To cancel the job:

scancel 548
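
To cancel all of your own jobs at once (use with care):

scancel -u <username>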

  • If no output file is specified, STDOUT is written to a file called slurm-<job id>.out
  • In the example above, STDOUT is written to job.<job id>.out
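
STDERR can be redirected to a separate file in the same way, again using the %j placeholder for the job id:

#SBATCH -e job.%j.err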

If no free nodes are available, the job is shown as pending (PD):

squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               548 E5-2690v4  job_nam  usernam PD       0:00      1 (Priority)
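
SLURM's estimate for the start time of a pending job can be queried with the --start option:

squeue --start -j 548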

Customized format of squeue:

squeue -o "%.10A %.10u %.10g %.10P %.10p %.20S %.10j %.10D %.10N  %.10T"

JOBID      USER     GROUP  PARTITION   PRIORITY           START_TIME     NAME  NODES  NODELIST    STATE
  555  user_nam  group_na  E5-2690v4 0.00000023  2018-01-12T12:21:30  mpitest      1     c1-01  RUNNING

To show only the jobs of a specific user:

squeue -u <username>
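
To monitor your jobs continuously, squeue can be combined with the standard watch command (refreshing every 10 seconds here):

watch -n 10 squeue -u <username>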

  • create a job script as shown above (or copy it from /opt/ohpc/pub/examples/slurm/mul/mpi/job.sh)
  • submit the job
  • inspect the output file

Detailed information on a job is available via scontrol:

scontrol show job 548

JobId=548 JobName=job_name
   UserId=user_name(1000) GroupId=user_name(1000) MCS_label=N/A
   Priority=1015 Nice=0 Account=(null) QOS=e5-2690v4-batch
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:02 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2018-01-12T11:58:47 EligibleTime=2018-01-12T11:58:47
   StartTime=2018-01-12T11:58:47 EndTime=2018-01-19T11:58:47 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-01-12T11:58:47
   Partition=E5-2690v4 AllocNode:Sid=mul-hpc-81a-mgmt:27864
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=c1-01
   BatchHost=c1-01
   NumNodes=1 NumCPUs=28 NumTasks=28 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=28,mem=128655M,node=1,billing=40
   Socks/Node=* NtasksPerN:B:S:C=28:0:*:* CoreSpec=*
   MinCPUsNode=28 MinMemoryNode=128655M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/user_name/jz/mpi/job.sh
   WorkDir=/home/user_name/jz/mpi
   StdErr=/home/user_name/jz/mpi/job.548.out
   StdIn=/dev/null
   StdOut=/home/user_name/jz/mpi/job.548.out
   Power=
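
Single fields can be extracted from this output with grep, e.g. job state and run time:

scontrol show job 548 | grep -E 'JobState|RunTime'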

To get direct access to a compute node (this example can be found in /opt/ohpc/pub/examples/slurm/mul/mpi/srun_example):

srun -J test -p E5-2690v4 --qos E5-2690v4-batch -N 1 --ntasks-per-node=28 --pty /bin/bash

srun: info for user at sbatch
srun: current partition E5-2690v4
srun: current qos E5-2690v4-batch
[user_name@c1-01 ~]$ 
[user_name@c1-01 ~]$ prun hello
[prun] Master compute host = c1-01
[prun] Resource manager = slurm
[prun] Launch cmd = mpirun hello (family=openmpi)

 Hello, world (28 procs total)
    --> Process #  11 of  28 is alive. -> c1-01
    --> Process #  12 of  28 is alive. -> c1-01
    --> Process #  13 of  28 is alive. -> c1-01
    --> Process #  15 of  28 is alive. -> c1-01
    --> Process #  16 of  28 is alive. -> c1-01
    --> Process #  17 of  28 is alive. -> c1-01
    --> Process #  19 of  28 is alive. -> c1-01
    --> Process #  20 of  28 is alive. -> c1-01
    --> Process #  21 of  28 is alive. -> c1-01
    --> Process #  23 of  28 is alive. -> c1-01
    --> Process #  24 of  28 is alive. -> c1-01
    --> Process #  25 of  28 is alive. -> c1-01
    --> Process #  27 of  28 is alive. -> c1-01
    --> Process #   0 of  28 is alive. -> c1-01
    --> Process #   3 of  28 is alive. -> c1-01
    --> Process #   4 of  28 is alive. -> c1-01
    --> Process #   5 of  28 is alive. -> c1-01
    --> Process #   7 of  28 is alive. -> c1-01
    --> Process #   8 of  28 is alive. -> c1-01
    --> Process #   9 of  28 is alive. -> c1-01
    --> Process #  18 of  28 is alive. -> c1-01
    --> Process #  22 of  28 is alive. -> c1-01
    --> Process #  26 of  28 is alive. -> c1-01
    --> Process #   1 of  28 is alive. -> c1-01
    --> Process #  10 of  28 is alive. -> c1-01
    --> Process #   2 of  28 is alive. -> c1-01
    --> Process #   6 of  28 is alive. -> c1-01
    --> Process #  14 of  28 is alive. -> c1-01

Alternatively, the salloc command can be used:

salloc -N 1 -J test -p E5-2690v4 --qos E5-2690v4-batch --mem=10G

Then find out where your job is running:

squeue -u <username>

or

srun hostname

and connect to it:

ssh <node>

To get direct interactive access to a compute node, try:

salloc -N 1 -J test -p E5-2690v4 --qos E5-2690v4-batch --mem=10G  srun --pty --preserve-env $SHELL

It is possible to request a certain number of cores and a specific amount of memory in a job script, e.g. to ask for two cores on a node and a total of 2 GByte of memory:

#SBATCH -n 2
#SBATCH --mem=2G
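
Alternatively, memory can be requested per core with --mem-per-cpu (mutually exclusive with --mem); a sketch requesting the same total of 2 GByte for two tasks:

#SBATCH -n 2
#SBATCH --mem-per-cpu=1G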

The cores and the requested memory are then exclusively assigned to the processes of this job via cgroups. The current policy is that if the memory is not specified, the job cannot be submitted and an error will be displayed.


  • you have to specify memory
  • SLURM does not accept your job without a memory specification
  • choose the right amount of memory (see the sacct sketch below):
    • not too little
    • not too much
  • too little memory:
    • can lead to very low speed because of swapping
    • can lead to a crash of the job (experienced with Abaqus)
  • too much memory:
    • does not hurt performance and does not kill your job
    • but it costs you more of your fair share

  • memory must be specified because nodes are shared between jobs
  • if nodes were used exclusively, this would not be necessary
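
To choose a sensible value, the memory a finished job actually used can be checked with sacct (a sketch, assuming job accounting is enabled; 548 is the job id from above):

sacct -j 548 -o JobID,JobName,MaxRSS,Elapsed,State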

Job arrays

  • run similar, independent jobs at once that can be distinguished by one parameter
  • each task is treated as a separate job
  • example with start=1, end=30, step width=7:
#!/bin/sh
#SBATCH -J array
#SBATCH -N 1
#SBATCH --array=1-30:7

./my_binary $SLURM_ARRAY_TASK_ID
  • computed tasks: 1, 8, 15, 22, 29
5605039_[15-29] E5-2690v4    array   markus PD
5605039_1       E5-2690v4    array   markus  R
5605039_8       E5-2690v4    array   markus  R

Useful variables within an array job:

SLURM_ARRAY_JOB_ID
SLURM_ARRAY_TASK_ID
SLURM_ARRAY_TASK_STEP
SLURM_ARRAY_TASK_MAX
SLURM_ARRAY_TASK_MIN
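
A sketch of how these variables can be used inside an array job script (the input file naming is illustrative):

echo "task ${SLURM_ARRAY_TASK_ID} of array job ${SLURM_ARRAY_JOB_ID}"
./my_binary input_${SLURM_ARRAY_TASK_ID}.dat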

To limit the number of simultaneously running tasks to 2:

#SBATCH --array=1-30:7%2

Options and environment variables

  • normal jobs:

#SBATCH option          job environment variable
-N                      SLURM_JOB_NUM_NODES
--ntasks-per-core       SLURM_NTASKS_PER_CORE
--ntasks-per-node       SLURM_NTASKS_PER_NODE
--ntasks-per-socket     SLURM_NTASKS_PER_SOCKET
--ntasks, -n            SLURM_NTASKS
  • emails:
#SBATCH --mail-user=yourmail@example.com
#SBATCH --mail-type=BEGIN,END

Job dependencies

  1. Submit the first job and get its <job id>
  2. Submit the dependent job (and get its <job_id>):
#!/bin/bash
#SBATCH -J jobname
#SBATCH -N 2
#SBATCH -d afterany:<job_id>
srun  ./my_program

  3. Continue at step 2 for further dependent jobs
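
The job id can also be captured automatically at submission time with sbatch's --parsable option, which prints only the job id (a sketch with hypothetical script names):

jobid=$(sbatch --parsable first_job.sh)
sbatch -d afterany:${jobid} dependent_job.sh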

