SLURM
- Article written by Markus Stöhr (VSC Team), last update 2017-10-09 by ms.
Quickstart
script examples/05_submitting_batch_jobs/job-quickstart.sh:
#!/bin/bash
#SBATCH -J h5test
#SBATCH -N 1

module purge
module load gcc/5.3 intel-mpi/5 hdf5/1.8.18-MPI

cp $VSC_HDF5_ROOT/share/hdf5_examples/c/ph5example.c .
mpicc -lhdf5 ph5example.c -o ph5example

mpirun -np 8 ./ph5example -c -v
submission:
$ sbatch job.sh
Submitted batch job 5250981
check what is going on:
squeue -u $USER
JOBID    PARTITION  NAME    USER    ST  TIME  NODES  NODELIST(REASON)
5250981  mem_0128   h5test  markus  R   0:00  2      n323-[018-019]
Output files:
ParaEg0.h5 ParaEg1.h5 slurm-5250981.out
inspect the .h5 files with:
h5dump
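For example, to look only at the header information (groups and datasets) of one of the output files, h5dump's -H option can be used:
h5dump -H ParaEg0.h5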
cancel jobs:
scancel <job_id>
or
scancel -n <job_name>
or
scancel -u $USER
Basic concepts
Queueing system
- job/batch script:
- shell script that does everything needed to run your calculation
- independent of the queueing system
- keep scripts simple (max. ~50 lines; put complicated logic elsewhere) – a minimal skeleton follows after this list
- load modules from scratch (purge, then load)
- tell scheduler where/how to run jobs:
- #nodes
- nodetype
- …
- scheduler manages job allocation to compute nodes
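A minimal skeleton following these rules might look like this (module names and the work script are placeholders):
#!/bin/bash
#SBATCH -J myjob          # job name
#SBATCH -N 1              # number of nodes

# load modules from scratch
module purge
module load <my_modules>

# keep complicated logic in a separate script/program
./run_my_calculation.sh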
SLURM: Accounts and Users
SLURM: Partition and Quality of Service
VSC-3 Hardware Types
partition | RAM (GB) | CPU | Cores | IB (HCA) | #Nodes |
---|---|---|---|---|---|
mem_0064* | 64 | 2x Intel E5-2650 v2 @ 2.60GHz | 2×8 | 2xQDR | 1849 |
mem_0128 | 128 | 2x Intel E5-2650 v2 @ 2.60GHz | 2×8 | 2xQDR | 140 |
mem_0256 | 256 | 2x Intel E5-2650 v2 @ 2.60GHz | 2×8 | 2xQDR | 50 |
vsc3plus_0064 | 64 | 2x Intel E5-2660 v2 @ 2.20GHz | 2×10 | 1xFDR | 816 |
vsc3plus_0256 | 256 | 2x Intel E5-2660 v2 @ 2.20GHz | 2×10 | 1xFDR | 48 |
binf | 512 - 1536 | 2x Intel E5-2690 v4 @ 2.60GHz | 2×14 | 1xFDR | 17 |
* default partition, QDR: Intel Truescale Infinipath (40Gbit/s), FDR: Mellanox ConnectX-3 (56Gbit/s)
effective: 10/2018
- + GPU nodes (see later)
- specify partition in job script:
#SBATCH -p <partition>
Standard QOS
partition | QOS |
---|---|
mem_0064* | normal_0064 |
mem_0128 | normal_0128 |
mem_0256 | normal_0256 |
vsc3plus_0064 | vsc3plus_0064 |
vsc3plus_0256 | vsc3plus_0256 |
binf | normal_binf |
- specify QOS in job script:
#SBATCH --qos <QOS>
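For example, to request the 128 GB VSC-3 nodes with their standard QOS (values taken from the two tables above):
#SBATCH --partition=mem_0128
#SBATCH --qos=normal_0128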
VSC-4 Hardware Types
partition | RAM (GB) | CPU | Cores | IB (HCA) | #Nodes |
---|---|---|---|---|---|
mem_0096* | 96 | 2x Intel Platinum 8174 @ 3.10GHz | 2×24 | 1xEDR | 688 |
mem_0384 | 384 | 2x Intel Platinum 8174 @ 3.10GHz | 2×24 | 1xEDR | 78 |
mem_0768 | 768 | 2x Intel Platinum 8174 @ 3.10GHz | 2×24 | 1xEDR | 12 |
* default partition, EDR: Intel Omni-Path (100Gbit/s)
effective: 10/2020
Standard QOS
partition | QOS |
---|---|
mem_0096* | mem_0096 |
mem_0384 | mem_0384 |
mem_0768 | mem_0768 |
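For example, to request the 384 GB VSC-4 nodes with their standard QOS:
#SBATCH --partition=mem_0384
#SBATCH --qos=mem_0384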
VSC Hardware Types
- Display information about partitions and their nodes:
sinfo -o %P
scontrol show partition mem_0064
scontrol show node n301-001
QOS-Account/Project assignment
1.+2.:
sqos -acc
default_account: p70824
account:         p70824

default_qos:     normal_0064
qos:             devel_0128
                 goodluck
                 gpu_gtx1080amd
                 gpu_gtx1080multi
                 gpu_gtx1080single
                 gpu_k20m
                 gpu_m60
                 knl
                 normal_0064
                 normal_0128
                 normal_0256
                 normal_binf
                 vsc3plus_0064
                 vsc3plus_0256
QOS-Partition assignment
3.:
sqos
qos_name            total  used  free  walltime    priority  partitions
=========================================================================
normal_0064          1782  1173   609  3-00:00:00      2000  mem_0064
normal_0256            15    24    -9  3-00:00:00      2000  mem_0256
normal_0128            93    51    42  3-00:00:00      2000  mem_0128
devel_0128             10    20   -10  00:10:00       20000  mem_0128
goodluck                0     0     0  3-00:00:00      1000  vsc3plus_0256,vsc3plus_0064,amd
knl                     4     1     3  3-00:00:00      1000  knl
normal_binf            16     5    11  1-00:00:00      1000  binf
gpu_gtx1080multi        4     2     2  3-00:00:00      2000  gpu_gtx1080multi
gpu_gtx1080single      50    18    32  3-00:00:00      2000  gpu_gtx1080single
gpu_k20m                2     0     2  3-00:00:00      2000  gpu_k20m
gpu_m60                 1     1     0  3-00:00:00      2000  gpu_m60
vsc3plus_0064         800   781    19  3-00:00:00      1000  vsc3plus_0064
vsc3plus_0256          48    44     4  3-00:00:00      1000  vsc3plus_0256
gpu_gtx1080amd          1     0     1  3-00:00:00      2000  gpu_gtx1080amd
naming convention:
QOS | Partition |
---|---|
*_0064 | mem_0064 |
Specification in job script
#SBATCH --account=xxxxxx
#SBATCH --qos=xxxxx_xxxx
#SBATCH --partition=mem_xxxx
For omitted lines the corresponding defaults are used (see previous slides); the default partition is “mem_0064”.
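A fully explicit example using the account and default QOS shown in the sqos output above:
#SBATCH --account=p70824
#SBATCH --qos=normal_0064
#SBATCH --partition=mem_0064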
Sample batch job
default:
#!/bin/bash
#SBATCH -J jobname
#SBATCH -N number_of_nodes

do_my_work
job is submitted to:
- partition mem_0064
- qos normal_0064
- default account
explicit:
#!/bin/bash
#SBATCH -J jobname
#SBATCH -N number_of_nodes
#SBATCH --partition=mem_xxxx
#SBATCH --qos=xxxxx_xxxx
#SBATCH --account=xxxxxx

do_my_work
- must be a shell script (first line!)
- ‘#SBATCH’ for marking SLURM parameters
- environment variables are set by SLURM for use within the script (e.g. SLURM_JOB_NUM_NODES)
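For example, SLURM_JOB_NUM_NODES can be used to derive the number of MPI processes; a sketch assuming the default mem_0064 nodes with 16 cores each:
# 16 cores per node on mem_0064; adjust for other node types
mpirun -np $((SLURM_JOB_NUM_NODES * 16)) ./my_program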
Job submission
sbatch <SLURM_PARAMETERS> job.sh <JOB_PARAMETERS>
- parameters are specified as in the job script
- precedence: sbatch command-line parameters override parameters in the job script
- be careful to place SLURM parameters before the job script name
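For example, to override the partition and QOS set in job.sh at submission time (a sketch; devel_0128 runs on mem_0128, as shown in the sqos output above):
sbatch --partition=mem_0128 --qos=devel_0128 job.sh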
Exercises
- try these commands and find out which partition has to be used if you want to run in QOS ‘devel_0128’:
sqos
sqos -acc
- find out which nodes are in the partition that allows running in ‘devel_0128’ and check how much memory these nodes have:
scontrol show partition ...
scontrol show node ...
- submit a one-node job to QOS ‘devel_0128’ that runs the following commands:
hostname
free
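One possible sketch of such a job script (the matching partition follows from the sqos output above):
#!/bin/bash
#SBATCH -J devel_test
#SBATCH -N 1
#SBATCH --partition=mem_0128
#SBATCH --qos=devel_0128

hostname
free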
Bad job practices
- job submissions in a loop (takes a long time):
for i in {1..1000}
do
  sbatch job.sh $i
done
- loop inside job script (sequential mpirun commands):
for i in {1..1000}
do
  mpirun my_program $i
done
Array jobs
- submit/run a series of independent jobs via a single SLURM script
- each job in the array gets a unique identifier (SLURM_ARRAY_TASK_ID) based on which various workloads can be organized
- example (job_array.sh), 10 jobs, SLURM_ARRAY_TASK_ID=1,2,3…10
#!/bin/sh
#SBATCH -J array
#SBATCH -N 1
#SBATCH --array=1-10

echo "Hi, this is array job number" $SLURM_ARRAY_TASK_ID
sleep $SLURM_ARRAY_TASK_ID
- independent jobs: 1, 2, 3 … 10
VSC-4 > squeue -u $user
JOBID          PARTITION  NAME   USER  ST  TIME     NODES  NODELIST(REASON)
406846_[7-10]  mem_0096   array  sh    PD  0:00     1      (Resources)
406846_4       mem_0096   array  sh    R   INVALID  1      n403-062
406846_5       mem_0096   array  sh    R   INVALID  1      n403-072
406846_6       mem_0096   array  sh    R   INVALID  1      n404-031
VSC-4 > ls slurm-*
slurm-406846_10.out  slurm-406846_3.out  slurm-406846_6.out  slurm-406846_9.out
slurm-406846_1.out   slurm-406846_4.out  slurm-406846_7.out
slurm-406846_2.out   slurm-406846_5.out  slurm-406846_8.out
VSC-4 > cat slurm-406846_8.out
Hi, this is array job number 8
- fine-tuning via builtin variables (SLURM_ARRAY_TASK_MIN, SLURM_ARRAY_TASK_MAX…)
- example of going in chunks of a certain size, e.g. 5, SLURM_ARRAY_TASK_ID=1,6,11,16
#SBATCH --array=1-20:5
- example of limiting the number of simultaneously running jobs to 2 (e.g. because of limited licenses)
#SBATCH --array=1-20:5%2
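Putting both together, a minimal sketch (task ids 1, 6, 11, 16; at most two array jobs run at the same time):
#!/bin/sh
#SBATCH -J array
#SBATCH -N 1
#SBATCH --array=1-20:5%2

echo "task $SLURM_ARRAY_TASK_ID of range $SLURM_ARRAY_TASK_MIN-$SLURM_ARRAY_TASK_MAX"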
Single core jobs
- use an entire compute node for several independent jobs
- example: single_node_multiple_jobs.sh:
for ((i=1; i<=48; i++))
do
  stress --cpu 1 --timeout $i &
done
wait
- ‘&’: sends the process into the background so the script can continue
- ‘wait’: waits for all background processes to finish; otherwise the script (and thus the job) would terminate immediately
Combination of array & single core job
- example: combined_array_multiple_jobs.sh:
...
#SBATCH --array=1-144:48

j=$SLURM_ARRAY_TASK_ID
((j+=47))

for ((i=$SLURM_ARRAY_TASK_ID; i<=$j; i++))
do
  stress --cpu 1 --timeout $i &
done
wait
Exercises
- files are located in folder
examples/05_submitting_batch_jobs
- look into job_array.sh and modify it such that the considered range is from 1 to 20 but in steps of 5
- look into single_node_multiple_jobs.sh and also change it to go in steps of 5
- run combined_array_multiple_jobs.sh and check whether the output is reasonable
Job/process setup
- normal jobs:
#SBATCH | job environment |
---|---|
-N | SLURM_JOB_NUM_NODES |
--ntasks-per-core | SLURM_NTASKS_PER_CORE |
--ntasks-per-node | SLURM_NTASKS_PER_NODE |
--ntasks, -n | SLURM_NTASKS |
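For example, a sketch requesting a fixed number of tasks (illustrative values; the corresponding variables are then available inside the script):
#SBATCH -N 2
#SBATCH --ntasks=32

echo "running $SLURM_NTASKS tasks on $SLURM_JOB_NUM_NODES nodes"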
- emails:
#SBATCH --mail-user=yourmail@example.com
#SBATCH --mail-type=BEGIN,END
- constraints:
#SBATCH -t, --time=<time>
#SBATCH --time-min=<time>
time format:
- DD-HH[:MM[:SS]]
- backfilling:
  - specify ‘--time’ or ‘--time-min’ as an estimate of the runtime of your job
  - runtimes shorter than the default (mostly 72h) may allow the scheduler to use idle nodes that are waiting for a larger job
- get the remaining running time for your job:
squeue -h -j $SLURM_JOBID -o %L
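This can be used inside a running job, e.g. to log the remaining time budget (sketch):
# query the remaining walltime of the current job
remaining=$(squeue -h -j $SLURM_JOBID -o %L)
echo "remaining walltime: $remaining"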
Licenses
VSC-3 > slic
Within the SLURM submit script, add the flags as shown by ‘slic’, e.g. when both Matlab and Mathematica are required:
#SBATCH -L matlab@vsc,mathematica@vsc
Intel licenses are needed only for compiling code, not for running the resulting executables.
Reservation of compute nodes
- core-hour accounting is done for the entire period of the reservation
- contact service@vsc.ac.at
- reservations are named after the project id
- check for reservations:
VSC-3 > scontrol show reservations
- usage:
#SBATCH --reservation=<reservation_name>
Exercises
- check for available reservations. If there is one available, use it
- specify an email address that notifies you when the job has finished
- run the following matlab code in your job:
echo "2+2" | matlab
MPI + pinning
- understand what your code is doing and place the processes correctly
- use only a few processes per node if memory demand is high
- details for pinning: https://wiki.vsc.ac.at/doku.php?id=doku:vsc3_pinning
Example: Two nodes with two MPI processes each:
srun
#SBATCH -N 2
#SBATCH --tasks-per-node=2

srun --cpu_bind=map_cpu:0,24 ./my_mpi_program
mpirun
#SBATCH -N 2
#SBATCH --tasks-per-node=2

export I_MPI_PIN_PROCESSOR_LIST=0,24   # Intel MPI syntax
mpirun ./my_mpi_program
Job dependencies
1. submit the first job and get its <job_id>
2. submit the dependent job (and get its <job_id>):
#!/bin/bash
#SBATCH -J jobname
#SBATCH -N 2
#SBATCH -d afterany:<job_id>

srun ./my_program
3. continue at 2. for further dependent jobs
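The job id of the first job can also be captured on the command line and passed to the dependent submission, e.g. (a sketch using sbatch's --parsable output, which prints only the job id):
# submit the first job and capture its job id
jid=$(sbatch --parsable job1.sh)

# submit the dependent job
sbatch -d afterany:$jid job2.sh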