====== SLURM ======
* Article written by Markus Stöhr (VSC Team) (last update 2017-10-09 by ms).
==== Quickstart ====
script [[examples/job-quickstart.sh|examples/05_submitting_batch_jobs/job-quickstart.sh]]:
#!/bin/bash
#SBATCH -J h5test
#SBATCH -N 2
module purge
module load gcc/5.3 intel-mpi/5 hdf5/1.8.18-MPI
cp $VSC_HDF5_ROOT/share/hdf5_examples/c/ph5example.c .
mpicc -lhdf5 ph5example.c -o ph5example
mpirun -np 8 ./ph5example -c -v
submission:
$ sbatch job.sh
Submitted batch job 5250981
check what is going on:
squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5250981 mem_0128 h5test markus R 0:00 2 n323-[018-019]
Output files:
ParaEg0.h5
ParaEg1.h5
slurm-5250981.out
inspect the .h5 files:
h5dump
cancel jobs:
scancel <job_id>
or
scancel -n <job_name>
or
scancel -u $USER
===== Basic concepts =====
==== Queueing system ====
* job/batch script:
* shell script that does everything needed to run your calculation
* independent of queueing system
* **use simple scripts** (max 50 lines, i.e. put complicated logic elsewhere)
* load modules from scratch (purge, then load); see the minimal sketch below
* tell scheduler where/how to run jobs:
* #nodes
* nodetype
* …
* scheduler manages job allocation to compute nodes
{{.:queueing_basics.png?200}}
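A minimal skeleton along these lines (a sketch; job name and program are placeholders, the module names are taken from the quickstart example):
#!/bin/bash
#SBATCH -J myjob
#SBATCH -N 1
module purge                      # start from a clean environment
module load gcc/5.3 intel-mpi/5   # then load exactly what the job needs
./my_program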
==== SLURM: Accounts and Users ====
{{.:slurm_accounts.png}}
==== SLURM: Partition and Quality of Service ====
{{.:partitions.png}}
==== VSC-3 Hardware Types ====
^partition ^ RAM (GB) ^CPU ^ Cores ^ IB (HCA) ^ #Nodes ^
|mem_0064* | 64 |2x Intel E5-2650 v2 @ 2.60GHz| 2x8 | 2xQDR | 1849 |
|mem_0128 | 128 |2x Intel E5-2650 v2 @ 2.60GHz| 2x8 | 2xQDR | 140 |
|mem_0256 | 256 |2x Intel E5-2650 v2 @ 2.60GHz| 2x8 | 2xQDR | 50 |
|vsc3plus_0064| 64 |2x Intel E5-2660 v2 @ 2.20GHz| 2x10 | 1xFDR | 816 |
|vsc3plus_0256| 256 |2x Intel E5-2660 v2 @ 2.20GHz| 2x10 | 1xFDR | 48 |
|binf | 512 - 1536 |2x Intel E5-2690 v4 @ 2.60GHz| 2x14 | 1xFDR | 17 |
* the asterisk marks the default partition; QDR: Intel Truescale Infinipath (40 Gbit/s), FDR: Mellanox ConnectX-3 (56 Gbit/s)
effective: 10/2018
* + GPU nodes (see later)
* specify partition in job script:
#SBATCH -p <partition>
==== Standard QOS ====
^partition ^QOS ^
|mem_0064* |normal_0064 |
|mem_0128 |normal_0128 |
|mem_0256 |normal_0256 |
|vsc3plus_0064|vsc3plus_0064|
|vsc3plus_0256|vsc3plus_0256|
|binf |normal_binf |
* specify QOS in job script:
#SBATCH --qos=<qos>
----
==== VSC-4 Hardware Types ====
^partition^ RAM (GB) ^CPU ^ Cores ^ IB (HCA) ^ #Nodes ^
|mem_0096*| 96 |2x Intel Platinum 8174 @ 3.10GHz| 2x24 | 1xEDR | 688 |
|mem_0384 | 384 |2x Intel Platinum 8174 @ 3.10GHz| 2x24 | 1xEDR | 78 |
|mem_0768 | 768 |2x Intel Platinum 8174 @ 3.10GHz| 2x24 | 1xEDR | 12 |
* the asterisk marks the default partition; EDR: Intel Omni-Path (100 Gbit/s)
effective: 10/2020
==== Standard QOS ====
^partition^QOS ^
|mem_0096*|mem_0096|
|mem_0384 |mem_0384|
|mem_0768 |mem_0768|
----
==== VSC Hardware Types ====
* Display information about partitions and their nodes:
sinfo -o %P
scontrol show partition mem_0064
scontrol show node n301-001
==== QOS-Account/Project assignment ====
{{.:setup.png?200}}
1.+2.:
sqos -acc
default_account: p70824
account: p70824
default_qos: normal_0064
qos: devel_0128
goodluck
gpu_gtx1080amd
gpu_gtx1080multi
gpu_gtx1080single
gpu_k20m
gpu_m60
knl
normal_0064
normal_0128
normal_0256
normal_binf
vsc3plus_0064
vsc3plus_0256
==== QOS-Partition assignment ====
3.:
sqos
qos_name total used free walltime priority partitions
=========================================================================
normal_0064 1782 1173 609 3-00:00:00 2000 mem_0064
normal_0256 15 24 -9 3-00:00:00 2000 mem_0256
normal_0128 93 51 42 3-00:00:00 2000 mem_0128
devel_0128 10 20 -10 00:10:00 20000 mem_0128
goodluck 0 0 0 3-00:00:00 1000 vsc3plus_0256,vsc3plus_0064,amd
knl 4 1 3 3-00:00:00 1000 knl
normal_binf 16 5 11 1-00:00:00 1000 binf
gpu_gtx1080multi 4 2 2 3-00:00:00 2000 gpu_gtx1080multi
gpu_gtx1080single 50 18 32 3-00:00:00 2000 gpu_gtx1080single
gpu_k20m 2 0 2 3-00:00:00 2000 gpu_k20m
gpu_m60 1 1 0 3-00:00:00 2000 gpu_m60
vsc3plus_0064 800 781 19 3-00:00:00 1000 vsc3plus_0064
vsc3plus_0256 48 44 4 3-00:00:00 1000 vsc3plus_0256
gpu_gtx1080amd 1 0 1 3-00:00:00 2000 gpu_gtx1080amd
naming convention:
^QOS ^Partition^
|*_0064|mem_0064 |
----
==== Specification in job script ====
#SBATCH --account=xxxxxx
#SBATCH --qos=xxxxx_xxxx
#SBATCH --partition=mem_xxxx
For omitted lines, the corresponding defaults are used (see previous slides; the default partition is “mem_0064”).
==== Sample batch job ====
default:
#!/bin/bash
#SBATCH -J jobname
#SBATCH -N number_of_nodes
do_my_work
job is submitted to:
* partition mem_0064
* qos normal_0064
* default account
explicit:
#!/bin/bash
#SBATCH -J jobname
#SBATCH -N number_of_nodes
#SBATCH --partition=mem_xxxx
#SBATCH --qos=xxxxx_xxxx
#SBATCH --account=xxxxxx
do_my_work
* must be a shell script (first line!)
* ‘#SBATCH’ for marking SLURM parameters
* environment variables are set by SLURM for use within the script (e.g. ''%%SLURM_JOB_NUM_NODES%%'')
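For example, the MPI process count can be derived from the environment instead of being hard-coded (a sketch, assuming 16 cores per node as on the mem_0064 hardware):
#!/bin/bash
#SBATCH -J jobname
#SBATCH -N 2
# 2 nodes x 16 cores = 32 MPI processes
mpirun -np $(( SLURM_JOB_NUM_NODES * 16 )) ./my_mpi_program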
==== Job submission ====
sbatch job.sh
* parameters are specified as in job script
* precedence: sbatch parameters override parameters in job script
* be careful to place sbatch parameters **before** the job script name
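For example, the same job script can be sent to the devel QOS without editing it (partition/QOS pairing as in the assignment table above):
sbatch --partition=mem_0128 --qos=devel_0128 job.sh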
==== Exercises ====
* try these commands and find out which partition has to be used if you want to run in QOS ‘devel_0128’:
sqos
sqos -acc
* find out which nodes are in the partition that allows running in ‘devel_0128’. Further, check how much memory these nodes have:
scontrol show partition ...
scontrol show node ...
* submit a one-node job to QOS devel_0128 that runs the following commands:
hostname
free
==== Bad job practices ====
* job submissions in a loop (takes a long time):
for i in {1..1000}
do
sbatch job.sh $i
done
* loop inside job script (sequential mpirun commands):
for i in {1..1000}
do
mpirun my_program $i
done
==== Array jobs ====
* submit/run a series of **independent** jobs via a single SLURM script
* each job in the array gets a unique identifier (SLURM_ARRAY_TASK_ID) based on which various workloads can be organized
* example ([[examples/job_array.sh|job_array.sh]]), 10 jobs, SLURM_ARRAY_TASK_ID=1,2,3…10
#!/bin/sh
#SBATCH -J array
#SBATCH -N 1
#SBATCH --array=1-10
echo "Hi, this is array job number" $SLURM_ARRAY_TASK_ID
sleep $SLURM_ARRAY_TASK_ID
* independent jobs: 1, 2, 3 … 10
VSC-4 > squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
406846_[7-10] mem_0096 array sh PD 0:00 1 (Resources)
406846_4 mem_0096 array sh R INVALID 1 n403-062
406846_5 mem_0096 array sh R INVALID 1 n403-072
406846_6 mem_0096 array sh R INVALID 1 n404-031
VSC-4 > ls slurm-*
slurm-406846_10.out slurm-406846_3.out slurm-406846_6.out slurm-406846_9.out
slurm-406846_1.out slurm-406846_4.out slurm-406846_7.out
slurm-406846_2.out slurm-406846_5.out slurm-406846_8.out
VSC-4 > cat slurm-406846_8.out
Hi, this is array job number 8
* fine-tuning via built-in variables (SLURM_ARRAY_TASK_MIN, SLURM_ARRAY_TASK_MAX, …)
* example of going in chunks of a certain size, e.g. 5, SLURM_ARRAY_TASK_ID=1,6,11,16
#SBATCH --array=1-20:5
* example of limiting the number of simultaneously running jobs to 2 (perhaps due to licenses)
#SBATCH --array=1-20:5%2
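A typical use of the task id is to select a per-task input file (a sketch; the program and input file names are hypothetical):
#!/bin/bash
#SBATCH -J array
#SBATCH -N 1
#SBATCH --array=1-10
# task i reads input_i.dat and writes its own output file
./my_program input_${SLURM_ARRAY_TASK_ID}.dat > output_${SLURM_ARRAY_TASK_ID}.log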
==== Single core jobs ====
* use an entire compute node for several independent jobs
* example: [[examples/single_node_multiple_jobs.sh|single_node_multiple_jobs.sh]]:
for ((i=1; i<=48; i++))
do
stress --cpu 1 --timeout $i &
done
wait
* ‘&’: sends the process into the background, so the script can continue
* ‘wait’: waits for all background processes; without it the script (and hence the job) would terminate immediately
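The same pattern applies to real workloads, e.g. one serial run per core, each with its own input (a sketch; program and file names are hypothetical):
for ((i=1; i<=48; i++))
do
  ./my_serial_program input_$i.dat > output_$i.log &
done
wait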
==== Combination of array & single core job ====
* example: [[examples/combined_array_multiple_jobs.sh|combined_array_multiple_jobs.sh]]:
...
#SBATCH --array=1-144:48    # 3 array tasks with IDs 1, 49, 97
j=$SLURM_ARRAY_TASK_ID
((j+=47))                   # each task covers its own chunk of 48
for ((i=$SLURM_ARRAY_TASK_ID; i<=$j; i++))
do
stress --cpu 1 --timeout $i &
done
wait
==== Exercises ====
* files are located in folder ''%%examples/05_submitting_batch_jobs%%''
* look into [[examples/job_array.sh|job_array.sh]] and modify it such that the considered range is from 1 to 20 but in steps of 5
* look into [[examples/single_node_multiple_jobs.sh|single_node_multiple_jobs.sh]] and also change it to go in steps of 5
* run [[examples/combined_array_multiple_jobs.sh|combined_array_multiple_jobs.sh]] and check whether the output is reasonable
==== Job/process setup ====
* normal jobs:
^#SBATCH ^job environment ^
|-N |SLURM_JOB_NUM_NODES |
|--ntasks-per-core|SLURM_NTASKS_PER_CORE|
|--ntasks-per-node|SLURM_NTASKS_PER_NODE|
|--ntasks, -n |SLURM_NTASKS |
* emails:
#SBATCH --mail-user=yourmail@example.com
#SBATCH --mail-type=BEGIN,END
* constraints:
#SBATCH -t, --time=<time>
time format:
* DD-HH[:MM[:SS]]
* backfilling:
* specify ‘--time’ or ‘--time-min’, which are estimates of your job’s runtime
* runtimes shorter than the default (mostly 72h) may enable the scheduler to use idle nodes that are waiting for a larger job
* get the remaining running time for your job:
squeue -h -j $SLURM_JOBID -o %L
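Putting these options together (a sketch; mail address and runtime are placeholders):
#!/bin/bash
#SBATCH -J setup_demo
#SBATCH -N 1
#SBATCH --ntasks-per-node=16
#SBATCH --time=01:00:00                    # well below the 72h default, helps backfilling
#SBATCH --mail-user=yourmail@example.com
#SBATCH --mail-type=BEGIN,END
echo "running $SLURM_NTASKS tasks on $SLURM_JOB_NUM_NODES node(s)"
squeue -h -j $SLURM_JOBID -o %L            # remaining walltime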
==== Licenses ====
{{.:licenses.png}}
VSC-3 > slic
Within the SLURM submit script, add the flags as shown by ‘slic’, e.g. when both Matlab and Mathematica are required:
#SBATCH -L matlab@vsc,mathematica@vsc
Intel licenses are needed only for compiling code, not for running the resulting executables.
==== Reservation of compute nodes ====
* core-h accounting is done for the entire period of the reservation
* contact service@vsc.ac.at
* reservations are named after the project id
* check for reservations:
VSC-3 > scontrol show reservations
* usage:
#SBATCH --reservation=<reservation_name>
==== Exercises ====
* check for available reservations. If there is one available, use it
* specify an email address that notifies you when the job has finished
* run the following Matlab code in your job:
echo "2+2" | matlab
==== MPI + pinning ====
* understand what your code is doing and place the processes correctly
* use only a few processes per node if memory demand is high
* details for pinning: https://wiki.vsc.ac.at/doku.php?id=doku:vsc3_pinning
Example: Two nodes with two MPI processes each:
=== srun ===
#SBATCH -N 2
#SBATCH --tasks-per-node=2
srun --cpu_bind=map_cpu:0,24 ./my_mpi_program
=== mpirun ===
#SBATCH -N 2
#SBATCH --tasks-per-node=2
export I_MPI_PIN_PROCESSOR_LIST=0,24 # Intel MPI syntax
mpirun ./my_mpi_program
==== Job dependencies ====
- Submit the first job and get its <job_id>
- Submit the dependent job (which in turn gets its own <job_id>):
#!/bin/bash
#SBATCH -J jobname
#SBATCH -N 2
#SBATCH -d afterany:<job_id>
srun ./my_program
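The job id can be captured automatically with sbatch’s --parsable option, which prints only the job id (a sketch; the script names are placeholders):
# submit the first job and capture its job id
jid=$(sbatch --parsable job1.sh)
# the second job starts only after the first has finished
sbatch -d afterany:$jid job2.sh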