====== SLURM ======
* Article written by Markus Stöhr (VSC Team) (last update 2017-10-09 by ms).
==== Quickstart ====
script [[examples/job-quickstart.sh|examples/05_submitting_batch_jobs/job-quickstart.sh]]:
#!/bin/bash
#SBATCH -J h5test
#SBATCH -N 2
module purge
module load gcc/5.3 intel-mpi/5 hdf5/1.8.18-MPI
cp $VSC_HDF5_ROOT/share/hdf5_examples/c/ph5example.c .
mpicc -lhdf5 ph5example.c -o ph5example
mpirun -np 8 ./ph5example -c -v
submission:
$ sbatch job.sh
Submitted batch job 5250981
check what is going on:
squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5250981 mem_0128 h5test markus R 0:00 2 n323-[018-019]
Output files:
ParaEg0.h5
ParaEg1.h5
slurm-5250981.out
inspect the .h5 files:
h5dump
cancel jobs:
scancel <job_id>
or
scancel -n <job_name>
or
scancel -u $USER
===== Basic concepts =====
==== Queueing system ====
* job/batch script:
* shell script that does everything needed to run your calculation
* independent of queueing system
* **use simple scripts** (max 50 lines, i.e. put complicated logic elsewhere)
* load modules from scratch (purge, then load); see the minimal sketch below
* tell scheduler where/how to run jobs:
* #nodes
* nodetype
* …
* scheduler manages job allocation to compute nodes
{{.:queueing_basics.png?200}}
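A minimal skeleton along these lines (a sketch; job name and program are placeholders, the module names are taken from the quickstart example):
#!/bin/bash
#SBATCH -J myjob
#SBATCH -N 1
module purge                      # start from a clean environment
module load gcc/5.3 intel-mpi/5   # then load exactly what the job needs
./my_program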
==== SLURM: Accounts and Users ====
{{.:slurm_accounts.png}}
==== SLURM: Partition and Quality of Service ====
{{.:partitions.png}}
==== VSC-3 Hardware Types ====
^partition ^ RAM (GB) ^CPU ^ Cores ^ IB (HCA) ^ #Nodes ^
|mem_0064* | 64 |2x Intel E5-2650 v2 @ 2.60GHz| 2x8 | 2xQDR | 1849 |
|mem_0128 | 128 |2x Intel E5-2650 v2 @ 2.60GHz| 2x8 | 2xQDR | 140 |
|mem_0256 | 256 |2x Intel E5-2650 v2 @ 2.60GHz| 2x8 | 2xQDR | 50 |
|vsc3plus_0064| 64 |2x Intel E5-2660 v2 @ 2.20GHz| 2x10 | 1xFDR | 816 |
|vsc3plus_0256| 256 |2x Intel E5-2660 v2 @ 2.20GHz| 2x10 | 1xFDR | 48 |
|binf | 512 - 1536 |2x Intel E5-2690 v4 @ 2.60GHz| 2x14 | 1xFDR | 17 |
* the asterisk marks the default partition; QDR: Intel Truescale Infinipath (40 Gbit/s), FDR: Mellanox ConnectX-3 (56 Gbit/s)
effective: 10/2018
* + GPU nodes (see later)
* specify partition in job script:
#SBATCH -p <partition>
==== Standard QOS ====
^partition ^QOS ^
|mem_0064* |normal_0064 |
|mem_0128 |normal_0128 |
|mem_0256 |normal_0256 |
|vsc3plus_0064|vsc3plus_0064|
|vsc3plus_0256|vsc3plus_0256|
|binf |normal_binf |
* specify QOS in job script:
#SBATCH --qos=<qos>
----
==== VSC-4 Hardware Types ====
^partition^ RAM (GB) ^CPU ^ Cores ^ IB (HCA) ^ #Nodes ^
|mem_0096*| 96 |2x Intel Platinum 8174 @ 3.10GHz| 2x24 | 1xEDR | 688 |
|mem_0384 | 384 |2x Intel Platinum 8174 @ 3.10GHz| 2x24 | 1xEDR | 78 |
|mem_0768 | 768 |2x Intel Platinum 8174 @ 3.10GHz| 2x24 | 1xEDR | 12 |
* the asterisk marks the default partition; EDR: Intel Omni-Path (100 Gbit/s)
effective: 10/2020
==== Standard QOS ====
^partition^QOS ^
|mem_0096*|mem_0096|
|mem_0384 |mem_0384|
|mem_0768 |mem_0768|
----
==== VSC Hardware Types ====
* Display information about partitions and their nodes:
sinfo -o %P
scontrol show partition mem_0064
scontrol show node n301-001
==== QOS-Account/Project assignment ====
{{.:setup.png?200}}
1.+2.:
sqos -acc
default_account: p70824
account: p70824
default_qos: normal_0064
qos: devel_0128
goodluck
gpu_gtx1080amd
gpu_gtx1080multi
gpu_gtx1080single
gpu_k20m
gpu_m60
knl
normal_0064
normal_0128
normal_0256
normal_binf
vsc3plus_0064
vsc3plus_0256
==== QOS-Partition assignment ====
3.:
sqos
qos_name total used free walltime priority partitions
=========================================================================
normal_0064 1782 1173 609 3-00:00:00 2000 mem_0064
normal_0256 15 24 -9 3-00:00:00 2000 mem_0256
normal_0128 93 51 42 3-00:00:00 2000 mem_0128
devel_0128 10 20 -10 00:10:00 20000 mem_0128
goodluck 0 0 0 3-00:00:00 1000 vsc3plus_0256,vsc3plus_0064,amd
knl 4 1 3 3-00:00:00 1000 knl
normal_binf 16 5 11 1-00:00:00 1000 binf
gpu_gtx1080multi 4 2 2 3-00:00:00 2000 gpu_gtx1080multi
gpu_gtx1080single 50 18 32 3-00:00:00 2000 gpu_gtx1080single
gpu_k20m 2 0 2 3-00:00:00 2000 gpu_k20m
gpu_m60 1 1 0 3-00:00:00 2000 gpu_m60
vsc3plus_0064 800 781 19 3-00:00:00 1000 vsc3plus_0064
vsc3plus_0256 48 44 4 3-00:00:00 1000 vsc3plus_0256
gpu_gtx1080amd 1 0 1 3-00:00:00 2000 gpu_gtx1080amd
naming convention:
^QOS ^Partition^
|*_0064|mem_0064 |
----
==== Specification in job script ====
#SBATCH --account=xxxxxx
#SBATCH --qos=xxxxx_xxxx
#SBATCH --partition=mem_xxxx
For omitted lines, the corresponding defaults are used (see previous slides; the default partition is “mem_0064”).
==== Sample batch job ====
default:
#!/bin/bash
#SBATCH -J jobname
#SBATCH -N number_of_nodes
do_my_work
job is submitted to:
* partition mem_0064
* qos normal_0064
* default account
explicit:
#!/bin/bash
#SBATCH -J jobname
#SBATCH -N number_of_nodes
#SBATCH --partition=mem_xxxx
#SBATCH --qos=xxxxx_xxxx
#SBATCH --account=xxxxxx
do_my_work
* must be a shell script (first line!)
* ‘#SBATCH’ for marking SLURM parameters
* environment variables are set by SLURM for use within the script (e.g. ''%%SLURM_JOB_NUM_NODES%%'')
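For example, the MPI process count can be derived from the environment instead of being hard-coded (a sketch, assuming 16 cores per node as on the mem_0064 hardware):
#!/bin/bash
#SBATCH -J jobname
#SBATCH -N 2
# 2 nodes x 16 cores = 32 MPI processes
mpirun -np $(( SLURM_JOB_NUM_NODES * 16 )) ./my_mpi_program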
==== Job submission ====
sbatch job.sh
* parameters are specified as in job script
* precedence: sbatch parameters override parameters in job script
* be careful to place sbatch parameters **before** the job script name
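For example, the same job script can be sent to the devel QOS without editing it (partition/QOS pairing as in the assignment table above):
sbatch --partition=mem_0128 --qos=devel_0128 job.sh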
==== Exercises ====
* try these commands and find out which partition has to be used if you want to run in QOS ‘devel_0128’:
sqos
sqos -acc
* find out which nodes are in the partition that allows running in ‘devel_0128’. Further, check how much memory these nodes have:
scontrol show partition ...
scontrol show node ...
* submit a one-node job to QOS devel_0128 that runs the following commands:
hostname
free
==== Bad job practices ====
* job submissions in a loop (takes a long time):
for i in {1..1000}
do
sbatch job.sh $i
done
* loop inside job script (sequential mpirun commands):
for i in {1..1000}
do
mpirun my_program $i
done
==== Array jobs ====
* submit/run a series of **independent** jobs via a single SLURM script
* each job in the array gets a unique identifier (SLURM_ARRAY_TASK_ID) based on which various workloads can be organized
* example ([[examples/job_array.sh|job_array.sh]]), 10 jobs, SLURM_ARRAY_TASK_ID=1,2,3…10
#!/bin/sh
#SBATCH -J array
#SBATCH -N 1
#SBATCH --array=1-10
echo "Hi, this is array job number" $SLURM_ARRAY_TASK_ID
sleep $SLURM_ARRAY_TASK_ID
* independent jobs: 1, 2, 3 … 10
VSC-4 > squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
406846_[7-10] mem_0096 array sh PD 0:00 1 (Resources)
406846_4 mem_0096 array sh R INVALID 1 n403-062
406846_5 mem_0096 array sh R INVALID 1 n403-072
406846_6 mem_0096 array sh R INVALID 1 n404-031
VSC-4 > ls slurm-*
slurm-406846_10.out slurm-406846_3.out slurm-406846_6.out slurm-406846_9.out
slurm-406846_1.out slurm-406846_4.out slurm-406846_7.out
slurm-406846_2.out slurm-406846_5.out slurm-406846_8.out
VSC-4 > cat slurm-406846_8.out
Hi, this is array job number 8
* fine-tuning via built-in variables (SLURM_ARRAY_TASK_MIN, SLURM_ARRAY_TASK_MAX, …)
* example of going in chunks of a certain size, e.g. 5, SLURM_ARRAY_TASK_ID=1,6,11,16
#SBATCH --array=1-20:5
* example of limiting the number of simultaneously running jobs to 2 (perhaps due to licenses)
#SBATCH --array=1-20:5%2
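A typical use of the task id is to select a per-task input file (a sketch; the program and input file names are hypothetical):
#!/bin/bash
#SBATCH -J array
#SBATCH -N 1
#SBATCH --array=1-10
# task i reads input_i.dat and writes its own output file
./my_program input_${SLURM_ARRAY_TASK_ID}.dat > output_${SLURM_ARRAY_TASK_ID}.log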
==== Single core jobs ====
* use an entire compute node for several independent jobs
* example: [[examples/single_node_multiple_jobs.sh|single_node_multiple_jobs.sh]]:
for ((i=1; i<=48; i++))
do
stress --cpu 1 --timeout $i &
done
wait
* ‘&’: sends the process into the background, so the script can continue
* ‘wait’: waits for all background processes; without it the script (and hence the job) would terminate immediately
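The same pattern applies to real workloads, e.g. one serial run per core, each with its own input (a sketch; program and file names are hypothetical):
for ((i=1; i<=48; i++))
do
  ./my_serial_program input_$i.dat > output_$i.log &
done
wait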
==== Combination of array & single core job ====
* example: [[examples/combined_array_multiple_jobs.sh|combined_array_multiple_jobs.sh]]:
...
#SBATCH --array=1-144:48    # 3 array tasks with IDs 1, 49, 97
j=$SLURM_ARRAY_TASK_ID
((j+=47))                   # each task covers its own chunk of 48
for ((i=$SLURM_ARRAY_TASK_ID; i<=$j; i++))
do
stress --cpu 1 --timeout $i &
done
wait
==== Exercises ====
* files are located in folder ''%%examples/05_submitting_batch_jobs%%''
* look into [[examples/job_array.sh|job_array.sh]] and modify it such that the considered range is from 1 to 20 but in steps of 5
* look into [[examples/single_node_multiple_jobs.sh|single_node_multiple_jobs.sh]] and also change it to go in steps of 5
* run [[examples/combined_array_multiple_jobs.sh|combined_array_multiple_jobs.sh]] and check whether the output is reasonable
==== Job/process setup ====
* normal jobs:
^#SBATCH ^job environment ^
|-N |SLURM_JOB_NUM_NODES |
|--ntasks-per-core|SLURM_NTASKS_PER_CORE|
|--ntasks-per-node|SLURM_NTASKS_PER_NODE|
|--ntasks, -n |SLURM_NTASKS |
* emails:
#SBATCH --mail-user=yourmail@example.com
#SBATCH --mail-type=BEGIN,END
* constraints:
#SBATCH -t, --time=<time>
time format:
* DD-HH[:MM[:SS]]
* backfilling:
* specify ‘--time’ or ‘--time-min’, which are estimates of your job’s runtime
* runtimes shorter than the default (mostly 72h) may enable the scheduler to use idle nodes that are waiting for a larger job
* get the remaining running time for your job:
squeue -h -j $SLURM_JOBID -o %L
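Putting these options together (a sketch; mail address and runtime are placeholders):
#!/bin/bash
#SBATCH -J setup_demo
#SBATCH -N 1
#SBATCH --ntasks-per-node=16
#SBATCH --time=01:00:00                    # well below the 72h default, helps backfilling
#SBATCH --mail-user=yourmail@example.com
#SBATCH --mail-type=BEGIN,END
echo "running $SLURM_NTASKS tasks on $SLURM_JOB_NUM_NODES node(s)"
squeue -h -j $SLURM_JOBID -o %L            # remaining walltime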
==== Licenses ====
{{.:licenses.png}}
VSC-3 > slic
Within the SLURM submit script, add the flags as shown by ‘slic’, e.g. when both Matlab and Mathematica are required:
#SBATCH -L matlab@vsc,mathematica@vsc
Intel licenses are needed only for compiling code, not for running the resulting executables.
==== Reservation of compute nodes ====
* core-h accounting is done for the entire period of the reservation
* contact service@vsc.ac.at
* reservations are named after the project id
* check for reservations:
VSC-3 > scontrol show reservations
* usage:
#SBATCH --reservation=<reservation_name>
==== Exercises ====
* check for available reservations. If there is one available, use it
* specify an email address that notifies you when the job has finished
* run the following Matlab code in your job:
echo "2+2" | matlab
==== MPI + pinning ====
* understand what your code is doing and place the processes correctly
* use only a few processes per node if memory demand is high
* details for pinning: https://wiki.vsc.ac.at/doku.php?id=doku:vsc3_pinning
Example: Two nodes with two MPI processes each:
=== srun ===
#SBATCH -N 2
#SBATCH --tasks-per-node=2
srun --cpu_bind=map_cpu:0,24 ./my_mpi_program
=== mpirun ===
#SBATCH -N 2
#SBATCH --tasks-per-node=2
export I_MPI_PIN_PROCESSOR_LIST=0,24 # Intel MPI syntax
mpirun ./my_mpi_program
==== Job dependencies ====
- Submit the first job and get its <job_id>
- Submit the dependent job (which in turn gets its own <job_id>):
#!/bin/bash
#SBATCH -J jobname
#SBATCH -N 2
#SBATCH -d afterany:<job_id>
srun ./my_program
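The job id can be captured automatically with sbatch’s --parsable option, which prints only the job id (a sketch; the script names are placeholders):
# submit the first job and capture its job id
jid=$(sbatch --parsable job1.sh)
# the second job starts only after the first has finished
sbatch -d afterany:$jid job2.sh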