SLURM

  • Article written by Markus Stöhr (VSC Team) (last update 2017-10-09 by ms).

script examples/05_submitting_batch_jobs/job-quickstart.sh:

#!/bin/bash
#SBATCH -J h5test        # job name
#SBATCH -N 1             # number of nodes

# start from a clean environment, then load the required modules
module purge
module load gcc/5.3 intel-mpi/5 hdf5/1.8.18-MPI

# copy the parallel HDF5 example, compile and run it
cp $VSC_HDF5_ROOT/share/hdf5_examples/c/ph5example.c .
mpicc -lhdf5 ph5example.c -o ph5example

mpirun -np 8 ./ph5example -c -v

submission:

sbatch job.sh
Submitted batch job 5250981

check what is going on:

squeue -u $USER
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
5250981  mem_0128   h5test   markus  R       0:00      2 n23-[018-019]

Output files:

ParaEg0.h5
ParaEg1.h5
slurm-5250981.out

inspect the .h5 files with:

h5dump
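
For example, applied to one of the output files listed above:

h5dump ParaEg0.h5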

cancel jobs:

scancel <job_id> 

or

scancel -n <job_name>

or

scancel -u $USER
  • job/batch script:
    • a shell script that does everything needed to run your calculation
    • independent of the queueing system
    • use simple scripts (max. 50 lines, i.e. put complicated logic elsewhere)
    • load modules from scratch (purge, then load)
  • tell the scheduler where/how to run jobs:
    • number of nodes
    • node type
  • the scheduler manages the allocation of jobs to compute nodes

partition   memory
mem_0064    64 GB (default)
mem_0128    128 GB
mem_0256    256 GB
  • All nodes have the same CPU configuration:
    • 16 cores
    • 2 x Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz (Ivy-Bridge)
  • Display information about partitions and their nodes:
sinfo -o %P
scontrol show partition mem_0064
scontrol show node n01-001

1.+2.:

sqos -acc
default_account:        p70824
        account:        p70824              

    default_qos:   normal_0064
            qos:    devel_0128              
                      goodluck              
                   gpu_compute              
                       gpu_vis              
                           knl              
                   normal_0064              
                   normal_0128              
                   normal_0256              

3.:

sqos
   qos_name total  free     walltime   prio partitions  
==========================================================
normal_0064  1796    43   3-00:00:00   2000 mem_0064    
normal_0256    15    -1   3-00:00:00   2000 mem_0256    
normal_0128    67    -3   3-00:00:00   2000 mem_0128    
 devel_0128    10     9     00:10:00  20000 mem_0128    
gpu_compute    12     3   3-00:00:00   1000 p70971_gpu,gpu
    gpu_vis     4     4   3-00:00:00   1000 p70971_gpu,gpu
   goodluck   470   470   3-00:00:00   1000             
        knl     4     4   3-00:00:00   1000 knl         

naming convention:

QOS      Partition
*_0064   mem_0064

#SBATCH --account=xxxxxx
#SBATCH --qos=xxxxx_xxxx
#SBATCH --partition=mem_xxxx

For omitted lines the corresponding defaults are used; as shown above, the default partition is “mem_0064”.

default:

#!/bin/bash
#SBATCH -J jobname
#SBATCH -N number_of_nodes

do_my_work

job is submitted to:

  • partition mem_0064
  • qos normal_0064
  • default account

explicit:

#!/bin/bash
#SBATCH -J jobname
#SBATCH -N number_of_nodes
#SBATCH --partition=mem_xxxx
#SBATCH --qos=xxxxx_xxxx
#SBATCH --account=xxxxxx

do_my_work
  • must be a shell script (the first line must be a shebang, e.g. '#!/bin/bash')
  • '#SBATCH' for marking SLURM parameters
  • environment variables are set by SLURM for use within the script (e.g. SLURM_JOB_NUM_NODES)
sbatch <SLURM_PARAMETERS> job.sh <JOB_PARAMETERS>
  • parameters are specified as in the job script
  • precedence: sbatch command line parameters override parameters in the job script
  • be careful to place SLURM parameters before the job script
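
For example, the following sbatch call overrides the QOS and partition given in job.sh (names taken from the tables above):

sbatch --qos=devel_0128 --partition=mem_0128 job.sh
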
  • try these commands and find out which partition has to be used if you want to run in QOS 'devel_0128':
sqos
sqos -acc
  • find out which nodes are in the partition that allows running in 'devel_0128', and check how much memory these nodes have:
scontrol show partition ...
scontrol show node ...
  • submit a one-node job to QOS devel_0128 that runs the following commands:
hostname
free 
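
A minimal job script for this exercise could look as follows (the default account is used; QOS and partition as determined above):

#!/bin/bash
#SBATCH -J exercise
#SBATCH -N 1
#SBATCH --qos=devel_0128
#SBATCH --partition=mem_0128

hostname
free
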
  • looped job submission (takes a long time):
for i in {1..1000} 
do 
    sbatch job.sh $i
done
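
Inside job.sh the value of $i passed on the sbatch command line is available as a positional parameter; a minimal sketch (assuming my_program takes the index as its only argument):

#!/bin/bash
#SBATCH -J loop
#SBATCH -N 1

i=$1                      # index passed via 'sbatch job.sh $i'
mpirun my_program $i
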
  • loop in job (sequential mpirun commands):
for i in {1..1000}
do
    mpirun my_program $i
done
  • run many similar, independent jobs at once that can be distinguished by one parameter
  • each task is treated as a separate job
  • example (job_array.sh, sleep.sh), start=1, end=30, step width=7:
#!/bin/sh
#SBATCH -J array
#SBATCH -N 1
#SBATCH --array=1-30:7

./sleep.sh $SLURM_ARRAY_TASK_ID
  • computed tasks: 1, 8, 15, 22, 29
5605039_[15-29] mem_0064    array   markus PD
5605039_1       mem_0064    array   markus  R
5605039_8       mem_0064    array   markus  R
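
sleep.sh itself is not shown here; a minimal sketch of such a helper (hypothetical, it only records its task id and sleeps) could be:

#!/bin/sh
# sleep.sh <task_id> - placeholder workload for the array example
echo "task $1 running on $(hostname)"
sleep 60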

useful variables within job:

SLURM_ARRAY_JOB_ID
SLURM_ARRAY_TASK_ID
SLURM_ARRAY_TASK_STEP
SLURM_ARRAY_TASK_MAX
SLURM_ARRAY_TASK_MIN

limit the number of simultaneously running jobs to 2:

#SBATCH --array=1-30:7%2
  • use a complete compute node for several tasks at once
...

max_num_tasks=16

...

for i in `seq $task_start $task_increment $task_end`
do
  ./$executable $i &
  check_running_tasks #sleeps as long as max_num_tasks are running
done
wait
  • '&': starts the binary in the background so that the script can continue
  • 'wait': waits for all background processes; otherwise the script (and with it the job) would finish before the tasks are done
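
check_running_tasks is only referenced in the excerpt above; a minimal sketch of such a helper, which sleeps while max_num_tasks background tasks are running, could be:

check_running_tasks ()
{
    # wait until fewer than $max_num_tasks background tasks are running
    while [ "$(jobs -r | wc -l)" -ge "$max_num_tasks" ]
    do
        sleep 10
    done
}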

job_array_some_tasks.sh:

...
#SBATCH --array=1-100:32

...

task_start=$SLURM_ARRAY_TASK_ID
task_end=$(( $SLURM_ARRAY_TASK_ID + $SLURM_ARRAY_TASK_STEP -1 ))
if [ $task_end -gt $SLURM_ARRAY_TASK_MAX ]; then
        task_end=$SLURM_ARRAY_TASK_MAX
fi
task_increment=1

...

for i in `seq $task_start $task_increment $task_end`
do
  ./$executable $i &
  check_running_tasks
done
wait
  • normal jobs:
#SBATCH               job environment
-N                    SLURM_JOB_NUM_NODES
--ntasks-per-core     SLURM_NTASKS_PER_CORE
--ntasks-per-node     SLURM_NTASKS_PER_NODE
--ntasks-per-socket   SLURM_NTASKS_PER_SOCKET
--ntasks, -n          SLURM_NTASKS
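
These variables can be used directly inside the job script, e.g. (a small sketch, my_mpi_program is a placeholder):

#!/bin/bash
#SBATCH -J vars
#SBATCH -N 2
#SBATCH -n 32

echo "running on $SLURM_JOB_NUM_NODES nodes with $SLURM_NTASKS tasks"
mpirun -np $SLURM_NTASKS ./my_mpi_program
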
  • emails:
#SBATCH --mail-user=yourmail@example.com
#SBATCH --mail-type=BEGIN,END
  • constraints:
#SBATCH -C --constraint
#SBATCH --gres=

#SBATCH -t, --time=<time>
#SBATCH --time-min=<time>

Valid time formats:

  • MM
  • [HH:]MM:SS
  • DD-HH[:MM[:SS]]
  • backfilling:
    • specify a '--time' or '--time-min' value that fits your job
    • short runtimes may enable the scheduler to use nodes that are idle while waiting for a large job
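
For example (values purely illustrative):

#SBATCH --time=1-00:00:00      # upper limit: 1 day
#SBATCH --time-min=04:00:00    # accept any slot of at least 4 hours (helps backfilling)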

slic

Within the job script, add the license flags as shown by 'slic', e.g. for using both Matlab and Mathematica:

#SBATCH -L matlab@vsc,mathematica@vsc

Intel licenses are needed only for compiling code, not for running it!

  • core-h accounting is done for the full reservation time
  • contact us, if needed
  • reservations are named after the project id
  • check for reservations:
scontrol show reservations
  • use it:
#SBATCH --reservation=
  • check for available reservations. If there is one available, use it
  • specify an email address that notifies you when the job has finished
  • run the following Matlab code in your job:
echo "2+2" | matlab

Example: Two nodes with two MPI processes each:

srun

#SBATCH -N 2
#SBATCH --tasks-per-node=2

srun --cpu_bind=map_cpu:0,8 ./my_mpi_program

mpirun

#SBATCH -N 2
#SBATCH --tasks-per-node=2

export I_MPI_PIN_PROCESSOR_LIST=0,8
mpirun ./my_mpi_program
  1. Submit first job and get its <job id>
  2. Submit dependent job (and get <job_id>):
#!/bin/bash
#SBATCH -J jobname
#SBATCH -N 2
#SBATCH -d afterany:<job_id>
srun  ./my_program

  3. continue at 2. for further dependent jobs
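
The job id needed for the dependency can be captured on the command line, e.g. (a sketch using sbatch's --parsable option; job1.sh and job2.sh are placeholder names):

jobid=$(sbatch --parsable job1.sh)
sbatch -d afterany:$jobid job2.sh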

