SLURM

  • Article written by Markus Stöhr (VSC Team) (last update 2017-10-09 by ms).

script examples/05_submitting_batch_jobs/job-quickstart.sh:

#!/bin/bash
#SBATCH -J h5test        # job name
#SBATCH -N 1             # number of nodes

# start from a clean environment, then load the required modules
module purge
module load gcc/5.3 intel-mpi/5 hdf5/1.8.18-MPI

# copy the parallel HDF5 example, compile and run it
cp $VSC_HDF5_ROOT/share/hdf5_examples/c/ph5example.c .
mpicc -lhdf5 ph5example.c -o ph5example

mpirun -np 8 ./ph5example -c -v

submission:

sbatch job.sh
Submitted batch job 5250981

check what is going on:

squeue -u $USER
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
5250981  mem_0128   h5test   markus  R       0:00      2 n23-[018-019]

Output files:

ParaEg0.h5
ParaEg1.h5
slurm-5250981.out

inspect the .h5 files with:

h5dump
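
For example, applied to one of the output files listed above:

h5dump ParaEg0.h5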

cancel jobs:

scancel <job_id> 

or

scancel -n <job_name>

or

scancel -u $USER
  • job/batch script:
    • a shell script that does everything needed to run your calculation
    • independent of the queueing system
    • use simple scripts (max. 50 lines, i.e. put complicated logic elsewhere)
    • load modules from scratch (purge, then load)
  • tell the scheduler where/how to run jobs:
    • number of nodes
    • node type
  • the scheduler manages the allocation of jobs to compute nodes

partition   memory
mem_0064    64 GB (default)
mem_0128    128 GB
mem_0256    256 GB
  • All nodes have the same CPU configuration:
    • 16 cores
    • 2 x Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz (Ivy-Bridge)
  • Display information about partitions and their nodes:
sinfo -o %P
scontrol show partition mem_0064
scontrol show node n01-001

1.+2.:

sqos -acc
default_account:        p70824
        account:        p70824              

    default_qos:   normal_0064
            qos:    devel_0128              
                      goodluck              
                   gpu_compute              
                       gpu_vis              
                           knl              
                   normal_0064              
                   normal_0128              
                   normal_0256              

3.:

sqos
   qos_name total  free     walltime   prio partitions  
==========================================================
normal_0064  1796    43   3-00:00:00   2000 mem_0064    
normal_0256    15    -1   3-00:00:00   2000 mem_0256    
normal_0128    67    -3   3-00:00:00   2000 mem_0128    
 devel_0128    10     9     00:10:00  20000 mem_0128    
gpu_compute    12     3   3-00:00:00   1000 p70971_gpu,gpu
    gpu_vis     4     4   3-00:00:00   1000 p70971_gpu,gpu
   goodluck   470   470   3-00:00:00   1000             
        knl     4     4   3-00:00:00   1000 knl         

naming convention:

QOS      Partition
*_0064   mem_0064

#SBATCH --account=xxxxxx
#SBATCH --qos=xxxxx_xxxx
#SBATCH --partition=mem_xxxx

For omitted lines the corresponding defaults are used; as shown above, the default partition is “mem_0064”.

default:

#!/bin/bash
#SBATCH -J jobname
#SBATCH -N number_of_nodes

do_my_work

job is submitted to:

  • partition mem_0064
  • qos normal_0064
  • default account

explicit:

#!/bin/bash
#SBATCH -J jobname
#SBATCH -N number_of_nodes
#SBATCH --partition=mem_xxxx
#SBATCH --qos=xxxxx_xxxx
#SBATCH --account=xxxxxx

do_my_work
  • must be a shell script (the first line must be a shebang, e.g. '#!/bin/bash')
  • '#SBATCH' for marking SLURM parameters
  • environment variables are set by SLURM for use within the script (e.g. SLURM_JOB_NUM_NODES)
sbatch <SLURM_PARAMETERS> job.sh <JOB_PARAMETERS>
  • parameters are specified as in the job script
  • precedence: sbatch command line parameters override parameters in the job script
  • be careful to place SLURM parameters before the job script
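
For example, the following sbatch call overrides the QOS and partition given in job.sh (names taken from the tables above):

sbatch --qos=devel_0128 --partition=mem_0128 job.sh
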
  • try these commands and find out which partition has to be used if you want to run in QOS 'devel_0128':
sqos
sqos -acc
  • find out which nodes are in the partition that allows running in 'devel_0128', and check how much memory these nodes have:
scontrol show partition ...
scontrol show node ...
  • submit a one-node job to QOS devel_0128 that runs the following commands:
hostname
free 
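
A minimal job script for this exercise could look as follows (the default account is used; QOS and partition as determined above):

#!/bin/bash
#SBATCH -J exercise
#SBATCH -N 1
#SBATCH --qos=devel_0128
#SBATCH --partition=mem_0128

hostname
free
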
  • looped job submission (takes a long time):
for i in {1..1000} 
do 
    sbatch job.sh $i
done
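
Inside job.sh the value of $i passed on the sbatch command line is available as a positional parameter; a minimal sketch (assuming my_program takes the index as its only argument):

#!/bin/bash
#SBATCH -J loop
#SBATCH -N 1

i=$1                      # index passed via 'sbatch job.sh $i'
mpirun my_program $i
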
  • loop in job (sequential mpirun commands):
for i in {1..1000}
do
    mpirun my_program $i
done
  • run many similar, independent jobs at once that can be distinguished by one parameter
  • each task is treated as a separate job
  • example (job_array.sh, sleep.sh), start=1, end=30, step width=7:
#!/bin/sh
#SBATCH -J array
#SBATCH -N 1
#SBATCH --array=1-30:7

./sleep.sh $SLURM_ARRAY_TASK_ID
  • computed tasks: 1, 8, 15, 22, 29
5605039_[15-29] mem_0064    array   markus PD
5605039_1       mem_0064    array   markus  R
5605039_8       mem_0064    array   markus  R
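
sleep.sh itself is not shown here; a minimal sketch of such a helper (hypothetical, it only records its task id and sleeps) could be:

#!/bin/sh
# sleep.sh <task_id> - placeholder workload for the array example
echo "task $1 running on $(hostname)"
sleep 60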

useful variables within job:

SLURM_ARRAY_JOB_ID
SLURM_ARRAY_TASK_ID
SLURM_ARRAY_TASK_STEP
SLURM_ARRAY_TASK_MAX
SLURM_ARRAY_TASK_MIN

limit the number of simultaneously running jobs to 2:

#SBATCH --array=1-30:7%2
  • use a complete compute node for several tasks at once
...

max_num_tasks=16

...

for i in `seq $task_start $task_increment $task_end`
do
  ./$executable $i &
  check_running_tasks #sleeps as long as max_num_tasks are running
done
wait
  • '&': starts the binary in the background so that the script can continue
  • 'wait': waits for all background processes; otherwise the script (and with it the job) would finish before the tasks are done
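
check_running_tasks is only referenced in the excerpt above; a minimal sketch of such a helper, which sleeps while max_num_tasks background tasks are running, could be:

check_running_tasks ()
{
    # wait until fewer than $max_num_tasks background tasks are running
    while [ "$(jobs -r | wc -l)" -ge "$max_num_tasks" ]
    do
        sleep 10
    done
}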

job_array_some_tasks.sh:

...
#SBATCH --array=1-100:32

...

task_start=$SLURM_ARRAY_TASK_ID
task_end=$(( $SLURM_ARRAY_TASK_ID + $SLURM_ARRAY_TASK_STEP -1 ))
if [ $task_end -gt $SLURM_ARRAY_TASK_MAX ]; then
        task_end=$SLURM_ARRAY_TASK_MAX
fi
task_increment=1

...

for i in `seq $task_start $task_increment $task_end`
do
  ./$executable $i &
  check_running_tasks
done
wait
  • normal jobs:
#SBATCH               job environment
-N                    SLURM_JOB_NUM_NODES
--ntasks-per-core     SLURM_NTASKS_PER_CORE
--ntasks-per-node     SLURM_NTASKS_PER_NODE
--ntasks-per-socket   SLURM_NTASKS_PER_SOCKET
--ntasks, -n          SLURM_NTASKS
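
These variables can be used directly inside the job script, e.g. (a small sketch, my_mpi_program is a placeholder):

#!/bin/bash
#SBATCH -J vars
#SBATCH -N 2
#SBATCH -n 32

echo "running on $SLURM_JOB_NUM_NODES nodes with $SLURM_NTASKS tasks"
mpirun -np $SLURM_NTASKS ./my_mpi_program
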
  • emails:
#SBATCH --mail-user=yourmail@example.com
#SBATCH --mail-type=BEGIN,END
  • constraints:
#SBATCH -C --constraint
#SBATCH --gres=

#SBATCH -t, --time=<time>
#SBATCH --time-min=<time>

Valid time formats:

  • MM
  • [HH:]MM:SS
  • DD-HH[:MM[:SS]]
  • backfilling:
    • specify a '--time' or '--time-min' value that fits your job
    • short runtimes may enable the scheduler to use nodes that are idle while waiting for a large job
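
For example (values purely illustrative):

#SBATCH --time=1-00:00:00      # upper limit: 1 day
#SBATCH --time-min=04:00:00    # accept any slot of at least 4 hours (helps backfilling)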

slic

Within the job script, add the license flags as shown by 'slic', e.g. for using both Matlab and Mathematica:

#SBATCH -L matlab@vsc,mathematica@vsc

Intel licenses are needed only for compiling code, not for running it!

  • core-h accounting is done for the full reservation time
  • contact us, if needed
  • reservations are named after the project id
  • check for reservations:
scontrol show reservations
  • use it:
#SBATCH --reservation=
  • check for available reservations. If there is one available, use it
  • specify an email address that notifies you when the job has finished
  • run the following Matlab code in your job:
echo "2+2" | matlab

Example: Two nodes with two MPI processes each:

srun

#SBATCH -N 2
#SBATCH --tasks-per-node=2

srun --cpu_bind=map_cpu:0,8 ./my_mpi_program

mpirun

#SBATCH -N 2
#SBATCH --tasks-per-node=2

export I_MPI_PIN_PROCESSOR_LIST=0,8
mpirun ./my_mpi_program
  1. Submit first job and get its <job id>
  2. Submit dependent job (and get <job_id>):
#!/bin/bash
#SBATCH -J jobname
#SBATCH -N 2
#SBATCH -d afterany:<job_id>
srun  ./my_program

  3. continue at 2. for further dependent jobs
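
The job id needed for the dependency can be captured on the command line, e.g. (a sketch using sbatch's --parsable option; job1.sh and job2.sh are placeholder names):

jobid=$(sbatch --parsable job1.sh)
sbatch -d afterany:$jobid job2.sh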

