 +====== SLURM ======
 +
 +  * Article written by Markus Stöhr (VSC Team), last update 2017-10-09 by ms.
 +
 +
 +
 +==== Quickstart ====
 +
 +script [[examples/job-quickstart.sh|examples/05_submitting_batch_jobs/job-quickstart.sh]]:
 +
 +<code>
 +#!/bin/bash
 +#SBATCH -J h5test        # job name
 +#SBATCH -N 1             # number of nodes
 +
 +# start from a clean module environment
 +module purge
 +module load gcc/5.3 intel-mpi/5 hdf5/1.8.18-MPI
 +
 +# compile the parallel HDF5 example that ships with the hdf5 module
 +cp $VSC_HDF5_ROOT/share/hdf5_examples/c/ph5example.c .
 +mpicc -lhdf5 ph5example.c -o ph5example
 +
 +mpirun -np 8 ./ph5example -c -v
 +
 +</code>
 +submission:
 +
 +<code>
 +$ sbatch job.sh
 +Submitted batch job 5250981
 +</code>
 +
 +check what is going on:
 +
 +<code>
 +squeue -u $USER
 +</code>
 +<code>
 +  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 +5250981  mem_0128   h5test   markus  R       0:00      2 n323-[018-019]
 +</code>
 +Output files:
 +
 +<code>
 +ParaEg0.h5
 +ParaEg1.h5
 +slurm-5250981.out
 +</code>
 +inspect the resulting .h5 files, e.g.:
 +
 +<code>
 +h5dump ParaEg0.h5
 +</code>
 +
 +cancel jobs:
 +
 +<code>
 +scancel <job_id> 
 +</code>
 +or
 +
 +<code>
 +scancel -n <job_name>
 +</code>
 +or
 +
 +<code>
 +scancel -u $USER
 +</code>
 +===== Basic concepts =====
 +
 +==== Queueing system ====
 +
 +  * job/batch script:
 +    * shell script that does everything needed to run your calculation
 +    * independent of queueing system
 +    * **use simple scripts** (max 50 lines, i.e. put complicated logic elsewhere)
 +    * load modules from scratch (purge, then load)
 +
 +
 +  * tell the scheduler where/how to run the job:
 +    * #nodes
 +    * nodetype
 +    * …
 +
 +
 +  * scheduler manages job allocation to compute nodes
 +
 +
 +
 +
 +{{..:queueing_basics.png?200}}
 +
 +==== SLURM: Accounts and Users ====
 +
 +{{..:slurm_accounts.png}}
 +
 +
 +==== SLURM: Partition and Quality of Service ====
 +
 +{{..:partitions.png}}
 +
 +
 +==== VSC-3 Hardware Types ====
 +
 +^partition    ^  RAM (GB)  ^CPU                          ^  Cores  ^  IB (HCA)  ^  #Nodes  ^
 +|mem_0064*    |     64     |2x Intel E5-2650 v2 @ 2.60GHz|  2x8    |   2xQDR    |   1849   |
 +|mem_0128     |    128     |2x Intel E5-2650 v2 @ 2.60GHz|  2x8    |   2xQDR    |   140    |
 +|mem_0256     |    256     |2x Intel E5-2650 v2 @ 2.60GHz|  2x8    |   2xQDR    |    50    |
 +|vsc3plus_0064|     64     |2x Intel E5-2660 v2 @ 2.20GHz|  2x10   |   1xFDR    |   816    |
 +|vsc3plus_0256|    256     |2x Intel E5-2660 v2 @ 2.20GHz|  2x10   |   1xFDR    |    48    |
 +|binf         | 512 - 1536 |2x Intel E5-2690 v4 @ 2.60GHz|  2x14   |   1xFDR    |    17    |
 +
 +
 +* default partition, QDR: Intel Truescale Infinipath (40Gbit/s), FDR: Mellanox ConnectX-3 (56Gbit/s)
 +
 +effective: 10/2018
 +
 +  * + GPU nodes (see later)
 +  * specify partition in job script:
 +
 +<code>
 +#SBATCH -p <partition>
 +</code>
 +==== Standard QOS ====
 +
 +^partition    ^QOS          ^
 +|mem_0064*    |normal_0064  |
 +|mem_0128     |normal_0128  |
 +|mem_0256     |normal_0256  |
 +|vsc3plus_0064|vsc3plus_0064|
 +|vsc3plus_0256|vsc3plus_0256|
 +|binf         |normal_binf  |
 +
 +
 +  * specify QOS in job script:
 +
 +<code>
 +#SBATCH --qos <QOS>
 +</code>
 +
 +----
 +
 +==== VSC-4 Hardware Types ====
 +
 +^partition^  RAM (GB)  ^CPU                             ^  Cores  ^  IB (HCA)  ^  #Nodes  ^
 +|mem_0096*|     96     |2x Intel Platinum 8174 @ 3.10GHz|  2x24   |   1xEDR    |   688    |
 +|mem_0384 |    384     |2x Intel Platinum 8174 @ 3.10GHz|  2x24   |   1xEDR    |    78    |
 +|mem_0768 |    768     |2x Intel Platinum 8174 @ 3.10GHz|  2x24   |   1xEDR    |    12    |
 +
 +
 +* default partition, EDR: Intel Omni-Path (100Gbit/s)
 +
 +effective: 10/2020
 +
 +==== Standard QOS ====
 +
 +^partition^QOS     ^
 +|mem_0096*|mem_0096|
 +|mem_0384 |mem_0384|
 +|mem_0768 |mem_0768|
 +
 +
 +
 +----
 +
 +==== VSC Hardware Types ====
 +
 +  * Display information about partitions and their nodes:
 +
 +<code>
 +sinfo -o %P
 +scontrol show partition mem_0064
 +scontrol show node n301-001
 +</code>
 +
 +==== QOS-Account/Project assignment ====
 +
 +
 +{{..:setup.png?200}}
 +
 +steps 1 and 2 in the figure:
 +
 +<code>
 +sqos -acc
 +</code>
 +
 +<code>
 +default_account:              p70824
 +        account:              p70824                    
 +
 +    default_qos:         normal_0064                    
 +            qos:          devel_0128                    
 +                            goodluck                    
 +                      gpu_gtx1080amd                    
 +                    gpu_gtx1080multi                    
 +                   gpu_gtx1080single                    
 +                            gpu_k20m                    
 +                             gpu_m60                    
 +                                 knl                    
 +                         normal_0064                    
 +                         normal_0128                    
 +                         normal_0256                    
 +                         normal_binf                    
 +                       vsc3plus_0064                    
 +                       vsc3plus_0256
 +</code>
 +
 +
 +==== QOS-Partition assignment ====
 +
 +
 +step 3 in the figure:
 +
 +<code>
 +sqos
 +</code>
 +<code>
 +            qos_name total  used  free     walltime   priority partitions  
 +=========================================================================
 +         normal_0064  1782  1173   609   3-00:00:00       2000 mem_0064    
 +         normal_0256    15    24    -9   3-00:00:00       2000 mem_0256    
 +         normal_0128    93    51    42   3-00:00:00       2000 mem_0128    
 +          devel_0128    10    20   -10     00:10:00      20000 mem_0128    
 +            goodluck               3-00:00:00       1000 vsc3plus_0256,vsc3plus_0064,amd
 +                 knl               3-00:00:00       1000 knl         
 +         normal_binf    16        11   1-00:00:00       1000 binf        
 +    gpu_gtx1080multi               3-00:00:00       2000 gpu_gtx1080multi
 +   gpu_gtx1080single    50    18    32   3-00:00:00       2000 gpu_gtx1080single
 +            gpu_k20m               3-00:00:00       2000 gpu_k20m    
 +             gpu_m60               3-00:00:00       2000 gpu_m60     
 +       vsc3plus_0064   800   781    19   3-00:00:00       1000 vsc3plus_0064
 +       vsc3plus_0256    48    44       3-00:00:00       1000 vsc3plus_0256
 +      gpu_gtx1080amd               3-00:00:00       2000 gpu_gtx1080amd
 +</code>
 +naming convention:
 +
 +^QOS   ^Partition^
 +|*_0064|mem_0064 |
 +
 +
 +
 +
 +
 +
 +----
 +
 +==== Specification in job script ====
 +
 +
 +<code>
 +#SBATCH --account=xxxxxx
 +#SBATCH --qos=xxxxx_xxxx
 +#SBATCH --partition=mem_xxxx
 +</code>
 +For omitted lines the corresponding defaults are used (see the previous slides); the default partition is “mem_0064”.
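 +
 +A filled-in example (the account p70824 is just the one shown in the sqos output above; substitute your own project):
 +
 +<code>
 +#SBATCH --account=p70824
 +#SBATCH --qos=normal_0064
 +#SBATCH --partition=mem_0064
 +</code>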
 +
 +
 +==== Sample batch job ====
 +
 +default:
 +
 +<code>
 +#!/bin/bash
 +#SBATCH -J jobname
 +#SBATCH -N number_of_nodes
 +
 +do_my_work
 +</code>
 +job is submitted to:
 +
 +  * partition mem_0064
 +  * qos normal_0064
 +  * default account
 +
 +
 +
 +explicit:
 +
 +<code>
 +#!/bin/bash
 +#SBATCH -J jobname
 +#SBATCH -N number_of_nodes
 +#SBATCH --partition=mem_xxxx
 +#SBATCH --qos=xxxxx_xxxx
 +#SBATCH --account=xxxxxx
 +
 +do_my_work
 +</code>
 +
 +
 +
 +  * must be a shell script (first line!)
 +  * ‘#SBATCH’ for marking SLURM parameters
 +  * environment variables are set by SLURM for use within the script (e.g. ''%%SLURM_JOB_NUM_NODES%%'')
 +
 +
 +
 +==== Job submission ====
 +
 +<code>
 +sbatch <SLURM_PARAMETERS> job.sh <JOB_PARAMETERS>
 +</code>
 +  * parameters are specified as in job script
 +  * precedence: command-line sbatch parameters override the corresponding parameters in the job script
 +  * be careful to place SLURM parameters **before** the job script
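 +
 +For example, a hypothetical call that overrides partition, QOS and node count on the command line and passes one argument to the job script:
 +
 +<code>
 +sbatch -N 2 --partition=mem_0128 --qos=devel_0128 job.sh input.dat
 +</code>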
 +
 +==== Exercises ====
 +
 +  * try these commands and find out which partition has to be used if you want to run in QOS ‘devel_0128’:
 +
 +<code>
 +sqos
 +sqos -acc
 +</code>
 +  * find out which nodes are in the partition that allows running in ‘devel_0128’ and check how much memory these nodes have:
 +
 +<code>
 +scontrol show partition ...
 +scontrol show node ...
 +</code>
 +  * submit a one node job to QOS devel_0128 with the following commands:
 +
 +<code>
 +hostname
 +free 
 +</code>
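 +
 +A minimal sketch of such a job script, using the partition/QOS pairing found with the commands above:
 +
 +<code>
 +#!/bin/bash
 +#SBATCH -J devel_test
 +#SBATCH -N 1
 +#SBATCH --partition=mem_0128
 +#SBATCH --qos=devel_0128
 +
 +hostname
 +free
 +</code>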
 +==== Bad job practices ====
 +
 +  * job submissions in a loop (takes a long time):
 +
 +<code>
 +for i in {1..1000} 
 +do 
 +    sbatch job.sh $i
 +done
 +</code>
 +
 +  * loop inside job script (sequential mpirun commands):
 +
 +<code>
 +for i in {1..1000}
 +do
 +    mpirun my_program $i
 +done
 +</code>
 +
 +
 +==== Array jobs ====
 +
 +  * submit/run a series of **independent** jobs via a single SLURM script
 +  * each job in the array gets a unique identifier (''%%SLURM_ARRAY_TASK_ID%%'') that can be used to organize the individual workloads
 +  * example ([[examples/job_array.sh|job_array.sh]]), 10 jobs, SLURM_ARRAY_TASK_ID=1,2,3…10
 +
 +<code>
 +#!/bin/sh
 +#SBATCH -J array
 +#SBATCH -N 1
 +#SBATCH --array=1-10
 +
 +echo "Hi, this is array job number"  $SLURM_ARRAY_TASK_ID
 +sleep $SLURM_ARRAY_TASK_ID
 +</code>
 +  * independent jobs: 1, 2, 3 … 10
 +
 +<code>
 +VSC-4 >  squeue -u $USER
 +             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 +     406846_[7-10]  mem_0096    array       sh PD       0:00      1 (Resources)
 +          406846_4  mem_0096    array       sh  R    INVALID      1 n403-062
 +          406846_5  mem_0096    array       sh  R    INVALID      1 n403-072
 +          406846_6  mem_0096    array       sh  R    INVALID      1 n404-031
 +</code>
 +
 +<code>
 +VSC-4 >  ls slurm-*
 +slurm-406846_10.out  slurm-406846_3.out  slurm-406846_6.out  slurm-406846_9.out
 +slurm-406846_1.out   slurm-406846_4.out  slurm-406846_7.out
 +slurm-406846_2.out   slurm-406846_5.out  slurm-406846_8.out
 +</code>
 +
 +<code>
 +VSC-4 >  cat slurm-406846_8.out
 +Hi, this is array job number  8
 +</code>
 +
 +
 +
 +  * fine-tuning via built-in variables (SLURM_ARRAY_TASK_MIN, SLURM_ARRAY_TASK_MAX…)
 +
 +  * example of going in chunks of a certain size, e.g. 5, SLURM_ARRAY_TASK_ID=1,6,11,16
 +
 +<code>
 +#SBATCH --array=1-20:5
 +</code>
 +
 +  * example of limiting the number of simultaneously running jobs to 2 (e.g. because of limited licenses)
 +
 +<code>
 +#SBATCH --array=1-20:5%2
 +</code>
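 +
 +A common pattern is to use the task ID to select per-task input; the program and file names below are hypothetical:
 +
 +<code>
 +...
 +#SBATCH --array=1-10
 +
 +# each array task processes its own input file
 +./my_program input_${SLURM_ARRAY_TASK_ID}.dat
 +</code>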
 +
 +
 +==== Single core jobs ====
 +
 +  * use an entire compute node for several independent jobs
 +  * example: [[examples/single_node_multiple_jobs.sh|single_node_multiple_jobs.sh]]:
 +
 +<code>
 +for ((i=1; i<=48; i++))
 +do
 +   stress --cpu 1 --timeout $i  &
 +done
 +wait
 +</code>
 +  * ‘&’: sends the process into the background, so the script can continue
 +  * ‘wait’: waits for all background processes; without it the script (and with it the job) would terminate immediately
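 +
 +The same pattern with a (hypothetical) serial program and per-task input/output files:
 +
 +<code>
 +for ((i=1; i<=48; i++))
 +do
 +   ./my_serial_program input_$i.dat > output_$i.log &
 +done
 +wait
 +</code>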
 +
 +
 +==== Combination of array & single core job ====
 +
 +  * example: [[examples/combined_array_multiple_jobs.sh|combined_array_multiple_jobs.sh]]:
 +
 +<code>
 +...
 +#SBATCH --array=1-144:48
 +
 +# each array task handles a chunk of 48 values:
 +# task IDs are 1, 49, 97 and each chunk ends at task ID + 47
 +j=$SLURM_ARRAY_TASK_ID
 +((j+=47))
 +
 +for ((i=$SLURM_ARRAY_TASK_ID; i<=$j; i++))
 +do
 +   stress --cpu 1 --timeout $i  &
 +done
 +wait
 +</code>
 +==== Exercises ====
 +
 +  * files are located in folder ''%%examples/05_submitting_batch_jobs%%''
 +  * look into [[examples/job_array.sh|job_array.sh]] and modify it such that the considered range is from 1 to 20 but in steps of 5
 +  * look into [[examples/single_node_multiple_jobs.sh|single_node_multiple_jobs.sh]] and also change it to go in steps of 5
 +  * run [[examples/combined_array_multiple_jobs.sh|combined_array_multiple_jobs.sh]] and check whether the output is reasonable
 +
 +==== Job/process setup ====
 +
 +  * normal jobs:
 +
 +^#SBATCH          ^job environment      ^
 +|-N               |SLURM_JOB_NUM_NODES  |
 +|--ntasks-per-core|SLURM_NTASKS_PER_CORE|
 +|--ntasks-per-node|SLURM_NTASKS_PER_NODE|
 +|--ntasks, -n     |SLURM_NTASKS         |
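 +
 +A short sketch of how these environment variables can be used inside the job script (program name is hypothetical):
 +
 +<code>
 +#!/bin/bash
 +#SBATCH -J setup_demo
 +#SBATCH -N 2
 +#SBATCH -n 32
 +
 +echo "running on $SLURM_JOB_NUM_NODES nodes with $SLURM_NTASKS tasks"
 +mpirun -np $SLURM_NTASKS ./my_mpi_program
 +</code>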
 +
 +  * emails:
 +
 +<code>
 +#SBATCH --mail-user=yourmail@example.com
 +#SBATCH --mail-type=BEGIN,END
 +</code>
 +
 +  * constraints:
 +
 +<code>
 +#SBATCH -t, --time=<time>
 +#SBATCH --time-min=<time>
 +</code>
 +
 +time format:
 +
 +  * DD-HH[:MM[:SS]]
 +
 +
 +
 +  * backfilling:
 +    * specify ''%%--time%%'' or ''%%--time-min%%'', which are estimates of the runtime of your job
 +    * runtimes shorter than the default (mostly 72h) may enable the scheduler to fill idle nodes that are waiting for a larger job (see the example after the next code block)
 +  * get the remaining running time for your job:
 +
 +<code>
 +squeue -h -j $SLURM_JOBID -o %L
 +</code>
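 +
 +Referring to the backfilling hint above: a hypothetical job that needs at most 8 hours, but can already make progress within 4 hours, could specify
 +
 +<code>
 +#SBATCH --time=08:00:00
 +#SBATCH --time-min=04:00:00
 +</code>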
 +
 +
 +==== Licenses ====
 +
 +{{..:licenses.png}}
 +
 +
 +<code>
 +VSC-3 >  slic
 +</code>
 +Within the SLURM submit script add the flags as shown by ‘slic’, e.g. when both Matlab and Mathematica are required:
 +
 +<code>
 +#SBATCH -L matlab@vsc,mathematica@vsc
 +</code>
 +Intel licenses are needed only when compiling code, not for running the resulting executables.
 +
 +==== Reservation of compute nodes ====
 +
 +  * core-h accounting is done for the entire period of reservation
 +  * contact service@vsc.ac.at
 +  * reservations are named after the project id
 +
 +  * check for reservations:
 +
 +<code>
 +VSC-3 >  scontrol show reservations
 +</code>
 +  * usage:
 +
 +<code>
 +#SBATCH --reservation=
 +</code>
 +
 +
 +==== Exercises ====
 +
 +  * check for available reservations. If there is one available, use it
 +  * specify an email address that notifies you when the job has finished
 +  * run the following Matlab code in your job:
 +
 +<code>
 +echo "2+2" | matlab
 +</code>
 +==== MPI + pinning ====
 +
 +  * understand what your code is doing and place the processes correctly
 +  * use only a few processes per node if memory demand is high
 +  * details for pinning: https://wiki.vsc.ac.at/doku.php?id=doku:vsc3_pinning
 +
 +Example: Two nodes with two MPI processes each:
 +
 +=== srun ===
 +
 +<code>
 +#SBATCH -N 2
 +#SBATCH --tasks-per-node=2
 +
 +# bind the two tasks on each node to cores 0 and 24
 +srun --cpu_bind=map_cpu:0,24 ./my_mpi_program
 +
 +</code>
 +
 +=== mpirun ===
 +
 +<code>
 +#SBATCH -N 2
 +#SBATCH --tasks-per-node=2
 +
 +export I_MPI_PIN_PROCESSOR_LIST=0,24   # Intel MPI syntax 
 +mpirun ./my_mpi_program
 +</code>
 +
 +
 +==== Job dependencies ====
 +
 +  - Submit the first job and note its <job_id>
 +  - Submit the dependent job (and note its <job_id> for further dependencies):
 +
 +<code>
 +#!/bin/bash
 +#SBATCH -J jobname
 +#SBATCH -N 2
 +#SBATCH -d afterany:<job_id>
 +srun  ./my_program
 +</code>
 +  - continue at step 2 for further dependent jobs
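 +
 +A sketch of chaining the submissions from the command line (script names are hypothetical); ''%%sbatch --parsable%%'' prints only the job id:
 +
 +<code>
 +job_id=$(sbatch --parsable first_job.sh)
 +sbatch -d afterany:$job_id second_job.sh
 +</code>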
 +
 +
 +----
 +
  