
SLURM

  • Article written by Markus Stöhr (VSC Team) (last update 2017-10-09 by ms).

script examples/05_submitting_batch_jobs/job-quickstart.sh:

#!/bin/bash
#SBATCH -J h5test
#SBATCH -N 2

module purge
module load gcc/5.3 intel-mpi/5 hdf5/1.8.18-MPI

cp $VSC_HDF5_ROOT/share/hdf5_examples/c/ph5example.c .
mpicc -lhdf5 ph5example.c -o ph5example

mpirun -np 8  ./ph5example -c -v 

submission:

$ sbatch job.sh
Submitted batch job 5250981

check what is going on:

squeue -u $USER
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
5250981  mem_0128   h5test   markus  R       0:00      2 n323-[018-019]

Output files:

ParaEg0.h5
ParaEg1.h5
slurm-5250981.out

inspect the .h5 output files, e.g.:

h5dump ParaEg0.h5

cancel jobs:

scancel <job_id> 

or

scancel -n <job_name>

or

scancel -u $USER
  • job/batch script:
    • shell script that does everything needed to run your calculation
    • independent of queueing system
    • use simple scripts (max 50 lines, i.e. put complicated logic elsewhere)
    • load modules from scratch (purge, then load)
  • tell scheduler where/how to run jobs:
    • #nodes
    • nodetype
  • scheduler manages job allocation to compute nodes

partition      RAM (GB)  CPU                             Cores  IB (HCA)  #Nodes
mem_0064*      64        2x Intel E5-2650 v2 @ 2.60GHz   2×8    2xQDR     1849
mem_0128       128       2x Intel E5-2650 v2 @ 2.60GHz   2×8    2xQDR     140
mem_0256       256       2x Intel E5-2650 v2 @ 2.60GHz   2×8    2xQDR     50
vsc3plus_0064  64        2x Intel E5-2660 v2 @ 2.20GHz   2×10   1xFDR     816
vsc3plus_0256  256       2x Intel E5-2660 v2 @ 2.20GHz   2×10   1xFDR     48
binf           512-1536  2x Intel E5-2690 v4 @ 2.60GHz   2×14   1xFDR     17

* default partition, QDR: Intel Truescale Infinipath (40Gbit/s), FDR: Mellanox ConnectX-3 (56Gbit/s)

effective: 10/2018

  • + GPU nodes (see later)
  • specify partition in job script:
#SBATCH -p <partition>
partition      QOS
mem_0064*      normal_0064
mem_0128       normal_0128
mem_0256       normal_0256
vsc3plus_0064  vsc3plus_0064
vsc3plus_0256  vsc3plus_0256
binf           normal_binf
  • specify QOS in job script:
#SBATCH --qos <QOS>
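
For example, to request VSC-3 nodes from the mem_0128 partition together with its matching QOS from the table above:

#SBATCH --partition=mem_0128
#SBATCH --qos=normal_0128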

partition  RAM (GB)  CPU                               Cores  IB (HCA)  #Nodes
mem_0096*  96        2x Intel Platinum 8174 @ 3.10GHz  2×24   1xEDR     688
mem_0384   384       2x Intel Platinum 8174 @ 3.10GHz  2×24   1xEDR     78
mem_0768   768       2x Intel Platinum 8174 @ 3.10GHz  2×24   1xEDR     12

* default partition, EDR: Intel Omni-Path (100Gbit/s)

effective: 10/2020

partition  QOS
mem_0096*  mem_0096
mem_0384   mem_0384
mem_0768   mem_0768

  • Display information about partitions and their nodes:
sinfo -o %P
scontrol show partition mem_0064
scontrol show node n301-001

1.+2.:

sqos -acc
default_account:              p70824
        account:              p70824                    

    default_qos:         normal_0064                    
            qos:          devel_0128                    
                            goodluck                    
                      gpu_gtx1080amd                    
                    gpu_gtx1080multi                    
                   gpu_gtx1080single                    
                            gpu_k20m                    
                             gpu_m60                    
                                 knl                    
                         normal_0064                    
                         normal_0128                    
                         normal_0256                    
                         normal_binf                    
                       vsc3plus_0064                    
                       vsc3plus_0256

3.:

sqos
            qos_name total  used  free     walltime   priority partitions  
=========================================================================
         normal_0064  1782  1173   609   3-00:00:00       2000 mem_0064    
         normal_0256    15    24    -9   3-00:00:00       2000 mem_0256    
         normal_0128    93    51    42   3-00:00:00       2000 mem_0128    
          devel_0128    10    20   -10     00:10:00      20000 mem_0128    
            goodluck     0     0     0   3-00:00:00       1000 vsc3plus_0256,vsc3plus_0064,amd
                 knl     4     1     3   3-00:00:00       1000 knl         
         normal_binf    16     5    11   1-00:00:00       1000 binf        
    gpu_gtx1080multi     4     2     2   3-00:00:00       2000 gpu_gtx1080multi
   gpu_gtx1080single    50    18    32   3-00:00:00       2000 gpu_gtx1080single
            gpu_k20m     2     0     2   3-00:00:00       2000 gpu_k20m    
             gpu_m60     1     1     0   3-00:00:00       2000 gpu_m60     
       vsc3plus_0064   800   781    19   3-00:00:00       1000 vsc3plus_0064
       vsc3plus_0256    48    44     4   3-00:00:00       1000 vsc3plus_0256
      gpu_gtx1080amd     1     0     1   3-00:00:00       2000 gpu_gtx1080amd

naming convention:

QOS     Partition
*_0064  mem_0064

#SBATCH --account=xxxxxx
#SBATCH --qos=xxxxx_xxxx
#SBATCH --partition=mem_xxxx

For any of these lines that are omitted, the corresponding defaults are used (see the previous slides; the default partition is “mem_0064”).

default:

#!/bin/bash
#SBATCH -J jobname
#SBATCH -N number_of_nodes

do_my_work

job is submitted to:

  • partition mem_0064
  • qos normal_0064
  • default account

explicit:

#!/bin/bash
#SBATCH -J jobname
#SBATCH -N number_of_nodes
#SBATCH --partition=mem_xxxx
#SBATCH --qos=xxxxx_xxxx
#SBATCH --account=xxxxxx

do_my_work
  • must be a shell script (first line!)
  • ‘#SBATCH’ for marking SLURM parameters
  • environment variables are set by SLURM for use within the script (e.g. SLURM_JOB_NUM_NODES)
sbatch <SLURM_PARAMETERS> job.sh <JOB_PARAMETERS>
  • parameters are specified as in the job script
  • precedence: parameters given on the sbatch command line override parameters in the job script (see the example below)
  • be careful to place the SLURM parameters before the name of the job script
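
A minimal sketch of this precedence (job.sh, override_demo and my_program are placeholder names); the -N 4 given on the command line overrides the -N 2 inside the script:

$ cat job.sh
#!/bin/bash
#SBATCH -J override_demo
#SBATCH -N 2
srun ./my_program

$ sbatch -N 4 job.sh        # runs on 4 nodes, not 2
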
  • try these commands and find out which partition has to be used if you want to run in QOS ‘devel_0128’:
sqos
sqos -acc
  • find out which nodes are in the partition that allows running in ‘devel_0128’; further, check how much memory these nodes have:
scontrol show partition ...
scontrol show node ...
  • submit a one-node job to QOS devel_0128 that runs the following commands (a possible job script is sketched after this list):
hostname
free 
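
A possible job script for this exercise (a sketch; the job name is arbitrary, and according to the sqos output above the QOS devel_0128 runs on the mem_0128 partition):

#!/bin/bash
#SBATCH -J devel_test
#SBATCH -N 1
#SBATCH --partition=mem_0128
#SBATCH --qos=devel_0128

hostname
free
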
  • job submissions in a loop (takes a long time):
for i in {1..1000} 
do 
    sbatch job.sh $i
done
  • loop inside job script (sequential mpirun commands):
for i in {1..1000}
do
    mpirun my_program $i
done
  • submit/run a series of independent jobs via a single SLURM script
  • each job in the array gets a unique identifier (SLURM_ARRAY_TASK_ID) based on which various workloads can be organized
  • example (job_array.sh), 10 jobs, SLURM_ARRAY_TASK_ID=1,2,3…10
#!/bin/sh
#SBATCH -J array
#SBATCH -N 1
#SBATCH --array=1-10

echo "Hi, this is array job number"  $SLURM_ARRAY_TASK_ID
sleep $SLURM_ARRAY_TASK_ID
  • independent jobs: 1, 2, 3 … 10
VSC-4 >  squeue  -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
     406846_[7-10]  mem_0096    array       sh PD       0:00      1 (Resources)
          406846_4  mem_0096    array       sh  R    INVALID      1 n403-062
          406846_5  mem_0096    array       sh  R    INVALID      1 n403-072
          406846_6  mem_0096    array       sh  R    INVALID      1 n404-031
VSC-4 >  ls slurm-*
slurm-406846_10.out  slurm-406846_3.out  slurm-406846_6.out  slurm-406846_9.out
slurm-406846_1.out   slurm-406846_4.out  slurm-406846_7.out
slurm-406846_2.out   slurm-406846_5.out  slurm-406846_8.out
VSC-4 >  cat slurm-406846_8.out
Hi, this is array job number  8
  • fine-tuning via builtin variables (SLURM_ARRAY_TASK_MIN, SLURM_ARRAY_TASK_MAX…)
  • example of going in chunks of a certain size, e.g. 5, SLURM_ARRAY_TASK_ID=1,6,11,16
#SBATCH --array=1-20:5
  • example of limiting number of simultaneously running jobs to 2 (perhaps for licences)
#SBATCH --array=1-20:5%2
for ((i=1; i<=48; i++))
do
   stress --cpu 1 --timeout $i  &
done
wait
  • ‘&’: sends the process into the background, so the script can continue
  • ‘wait’: waits for all background processes; otherwise the script (and thus the job) would terminate before they finish
...
#SBATCH --array=1-144:48

j=$SLURM_ARRAY_TASK_ID
((j+=47))

for ((i=$SLURM_ARRAY_TASK_ID; i<=$j; i++))
do
   stress --cpu 1 --timeout $i  &
done
wait
  • normal jobs (a usage sketch follows below):

#SBATCH             job environment
-N                  SLURM_JOB_NUM_NODES
--ntasks-per-core   SLURM_NTASKS_PER_CORE
--ntasks-per-node   SLURM_NTASKS_PER_NODE
--ntasks, -n        SLURM_NTASKS
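
A minimal sketch of using these environment variables inside a job script (my_program is a placeholder):

#!/bin/bash
#SBATCH -J env_demo
#SBATCH -N 2
#SBATCH --ntasks-per-node=16

echo "running on $SLURM_JOB_NUM_NODES nodes with $SLURM_NTASKS_PER_NODE tasks per node"
srun ./my_program
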
  • emails:
#SBATCH --mail-user=yourmail@example.com
#SBATCH --mail-type=BEGIN,END
  • constraints:
#SBATCH -t, --time=<time>
#SBATCH --time-min=<time>

time format:

  • DD-HH[:MM[:SS]]
  • backfilling:
    • specify ‘--time’ or ‘--time-min’, which are estimates of the runtime of your job (see the sketch after this list)
    • runtimes shorter than the default (mostly 72h) may enable the scheduler to run your job on idle nodes that are waiting for a larger job
  • get the remaining running time for your job:
squeue -h -j $SLURM_JOBID -o %L
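
A sketch of such runtime estimates in a job script; the values are arbitrary examples in the DD-HH:MM:SS format given above:

#SBATCH --time=0-08:00:00       # estimated maximum runtime
#SBATCH --time-min=0-04:00:00   # minimum acceptable time limit, allows backfilling into shorter gaps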

VSC-3 >  slic

Within the SLURM submit script add the flags as shown by ‘slic’, e.g. when both Matlab and Mathematica are required:

#SBATCH -L matlab@vsc,mathematica@vsc

Intel licenses are needed only when compiling code, not for running the resulting executables.

  • core-h accounting is done for the entire period of reservation
  • contact service@vsc.ac.at
  • reservations are named after the project id
  • check for reservations:
VSC-3 >  scontrol show reservations
  • usage:
#SBATCH --reservation=<reservation_name>
  • check for available reservations. If there is one available, use it
  • specify an email address that notifies you when the job has finished
  • run the following matlab code in your job:
echo "2+2" | matlab

Example: Two nodes with two MPI processes each:

srun

#SBATCH -N 2
#SBATCH --tasks-per-node=2

srun --cpu_bind=map_cpu:0,24 ./my_mpi_program

mpirun

#SBATCH -N 2
#SBATCH --tasks-per-node=2

export I_MPI_PIN_PROCESSOR_LIST=0,24   # Intel MPI syntax 
mpirun ./my_mpi_program
  1. Submit the first job and note its <job_id>
  2. Submit the dependent job (and note its <job_id> for further dependencies):
#!/bin/bash
#SBATCH -J jobname
#SBATCH -N 2
#SBATCH -d afterany:<job_id>
srun  ./my_program

  3. Continue at 2. for further dependent jobs.
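
A sketch of building such a chain from the command line instead of editing <job_id> by hand; job.sh is a placeholder for a job script without a hard-coded -d line, and sbatch --parsable prints only the job id:

job_id=$(sbatch --parsable job.sh)                           # first job
for i in {1..3}
do
    job_id=$(sbatch --parsable -d afterany:$job_id job.sh)   # each job waits for the previous one
done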

