====== SLURM ======

  * Article written by Markus Stöhr (VSC Team); last update 2017-10-09 by ms.


==== Quickstart ====

script [[examples/job-quickstart.sh|examples/05_submitting_batch_jobs/job-quickstart.sh]]:

<code>
#!/bin/bash
#SBATCH -J h5test
#SBATCH -N 1

module purge
module load gcc/5.3 intel-mpi/5 hdf5/1.8.18-MPI

cp $VSC_HDF5_ROOT/share/hdf5_examples/c/ph5example.c .
mpicc -lhdf5 ph5example.c -o ph5example

mpirun -np 8 ./ph5example -c -v
</code>
submission:

<code>
$ sbatch job.sh
Submitted batch job 5250981
</code>

check what is going on:

<code>
squeue -u $USER
</code>
<code>
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
5250981  mem_0128   h5test   markus  R       0:00      2 n23-[018-019]
</code>
Output files:

<code>
ParaEg0.h5
ParaEg1.h5
slurm-5250981.out
</code>
try h5dump on the .h5 files:

<code>
h5dump
</code>
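
For example, to inspect only the metadata of one of the output files listed above (a minimal sketch; the file name is taken from the listing):

<code>
h5dump -H ParaEg0.h5    # -H prints only the header/structure, not the data
</code>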

cancel jobs:

<code>
scancel <job_id>
</code>
or

<code>
scancel -n <job_name>
</code>
or

<code>
scancel -u $USER
</code>
===== Basic concepts =====

==== Queueing system ====

  * job/batch script:
    * a shell script that does everything needed to run your calculation
    * independent of the queueing system
    * **use simple scripts** (max. 50 lines, i.e. put complicated logic elsewhere)
    * load modules from scratch (purge, then load)


  * tell the scheduler where/how to run jobs:
    * number of nodes
    * node type
    * …


  * the scheduler manages job allocation to the compute nodes




{{:pandoc:introduction-to-vsc:05_submitting_batch_jobs:slurm:queueing_basics.png?200}}

==== SLURM: Accounts and Users ====

{{:pandoc:introduction-to-vsc:05_submitting_batch_jobs:slurm:slurm_accounts.png}}


==== SLURM: Partition and Quality of Service ====

{{:pandoc:introduction-to-vsc:05_submitting_batch_jobs:slurm:partitions.png}}


==== VSC-3 Hardware Types ====

^partition    ^   RAM (GB)   ^CPU                          ^    Cores    ^  IB (HCA)  ^  #Nodes  ^
|mem_0064*    |      64      |2x Intel E5-2650 v2 @ 2.60GHz|     2x8     |   2xQDR    |   1849   |
|mem_0128     |     128      |2x Intel E5-2650 v2 @ 2.60GHz|     2x8     |   2xQDR    |   140    |
|mem_0256     |     256      |2x Intel E5-2650 v2 @ 2.60GHz|     2x8     |   2xQDR    |    50    |
|vsc3plus_0064|      64      |2x Intel E5-2660 v2 @ 2.20GHz|    2x10     |   1xFDR    |   816    |
|vsc3plus_0256|     256      |2x Intel E5-2660 v2 @ 2.20GHz|    2x10     |   1xFDR    |    48    |
|knl          |     208      |Intel Phi 7210 @ 1.30GHz     |    1x64     |   1xQDR    |    4     |
|haswell      |     128      |2x Intel E5-2660 v3 @ 2.60GHz|    2x10     |   1xFDR    |    8     |
|binf         |  512 - 1536  |2x Intel E5-2690 v4 @ 2.60GHz|    2x14     |   1xFDR    |    17    |
|amd          |   128, 256   |AMD EPYC 7551, 7551P         |  32,64,128  |   1xFDR    |    3     |


* default partition; QDR: Intel Infinipath, FDR: Mellanox ConnectX-3

effective: 10/2018

  * + GPU nodes (see later)
  * specify the partition in the job script:

<code>
#SBATCH -p <partition>
</code>

----

==== VSC-3 Hardware Types ====

  * Display information about partitions and their nodes:

<code>
sinfo -o %P
scontrol show partition mem_0064
scontrol show node n01-001
</code>

==== QOS-Account/Project assignment ====


{{:pandoc:introduction-to-vsc:05_submitting_batch_jobs:slurm:setup.png?200}}

Steps 1 and 2 (see figure):

<code>
sqos -acc
</code>

<code>
default_account:              p70824
        account:              p70824

    default_qos:         normal_0064
            qos:          devel_0128
                            goodluck
                      gpu_gtx1080amd
                    gpu_gtx1080multi
                   gpu_gtx1080single
                            gpu_k20m
                             gpu_m60
                                 knl
                         normal_0064
                         normal_0128
                         normal_0256
                         normal_binf
                       vsc3plus_0064
                       vsc3plus_0256
</code>


==== QOS-Partition assignment ====


Step 3 (see figure):

<code>
sqos
</code>
<code>
            qos_name total  used  free     walltime   priority partitions  
=========================================================================
         normal_0064  1782  1173   609   3-00:00:00       2000 mem_0064    
         normal_0256    15    24    -9   3-00:00:00       2000 mem_0256    
         normal_0128    93    51    42   3-00:00:00       2000 mem_0128    
          devel_0128    10    20   -10     00:10:00      20000 mem_0128    
            goodluck               3-00:00:00       1000 vsc3plus_0256,vsc3plus_0064,amd
                 knl               3-00:00:00       1000 knl         
         normal_binf    16        11   1-00:00:00       1000 binf        
    gpu_gtx1080multi               3-00:00:00       2000 gpu_gtx1080multi
   gpu_gtx1080single    50    18    32   3-00:00:00       2000 gpu_gtx1080single
            gpu_k20m               3-00:00:00       2000 gpu_k20m    
             gpu_m60               3-00:00:00       2000 gpu_m60     
       vsc3plus_0064   800   781    19   3-00:00:00       1000 vsc3plus_0064
       vsc3plus_0256    48    44       3-00:00:00       1000 vsc3plus_0256
      gpu_gtx1080amd               3-00:00:00       2000 gpu_gtx1080amd
</code>
naming convention:

^QOS   ^Partition^
|*_0064|mem_0064 |






----

==== Specification in job script ====


<code>
#SBATCH --account=xxxxxx
#SBATCH --qos=xxxxx_xxxx
#SBATCH --partition=mem_xxxx
</code>
For any omitted line the corresponding default is used (see the previous slides); the default partition is “mem_0064”.


==== Sample batch job ====

default:

<code>
#!/bin/bash
#SBATCH -J jobname
#SBATCH -N number_of_nodes

do_my_work
</code>
the job is submitted to:

  * partition mem_0064
  * qos normal_0064
  * the default account



explicit:

<code>
#!/bin/bash
#SBATCH -J jobname
#SBATCH -N number_of_nodes
#SBATCH --partition=mem_xxxx
#SBATCH --qos=xxxxx_xxxx
#SBATCH --account=xxxxxx

do_my_work
</code>



  * must be a shell script (note the first line!)
  * ‘#SBATCH’ marks SLURM parameters
  * environment variables are set by SLURM for use within the script (e.g. ''%%SLURM_JOB_NUM_NODES%%'', see the sketch below)

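A sketch of how such a variable might be used inside the job script; ''%%my_program%%'' and the assumption of 16 cores per node (mem_0064: 2x8) are placeholders, not part of the original example:

<code>
#!/bin/bash
#SBATCH -J jobname
#SBATCH -N 2

echo "running on $SLURM_JOB_NUM_NODES node(s)"
# assumption: 16 cores per node, as on the mem_0064 nodes (2x8)
mpirun -np $(( SLURM_JOB_NUM_NODES * 16 )) ./my_program
</code>
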


==== Job submission ====

<code>
sbatch <SLURM_PARAMETERS> job.sh <JOB_PARAMETERS>
</code>
  * parameters are specified as in the job script
  * precedence: sbatch command-line parameters override parameters in the job script (see the sketch below)
  * be careful to place the SLURM parameters **before** the job script

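A hedged usage sketch (''%%job.sh%%'' and ''%%input.dat%%'' are placeholders): the ''%%-N 2%%'' given on the command line overrides any ''%%-N%%'' line inside job.sh, while input.dat is passed to the script as its first argument:

<code>
sbatch -N 2 --qos=devel_0128 job.sh input.dat
</code>
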
==== Exercises ====

  * try these commands and find out which partition has to be used if you want to run in the QOS ‘devel_0128’:

<code>
sqos
sqos -acc
</code>
  * find out which nodes are in the partition that allows running in ‘devel_0128’ and check how much memory these nodes have:

<code>
scontrol show partition ...
scontrol show node ...
</code>
  * submit a one-node job to the QOS devel_0128 that runs the following commands:

<code>
hostname
free
</code>
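One possible solution sketch for the last exercise, assuming (as the sqos output above suggests) that devel_0128 runs on the partition mem_0128:

<code>
#!/bin/bash
#SBATCH -J devel_test
#SBATCH -N 1
#SBATCH --partition=mem_0128
#SBATCH --qos=devel_0128

hostname
free
</code>
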
==== Bad job practices ====

  * looped job submission (takes a long time):

<code>
for i in {1..1000}
do
    sbatch job.sh $i
done
</code>

  * loop in job (sequential mpirun commands):

<code>
for i in {1..1000}
do
    mpirun my_program $i
done
</code>
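
Both patterns are usually better expressed as a single array job (next slide); a minimal sketch, assuming my_program takes the task index as its only argument:

<code>
#!/bin/bash
#SBATCH -J array
#SBATCH -N 1
#SBATCH --array=1-1000

./my_program $SLURM_ARRAY_TASK_ID
</code>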


==== Array job ====

  * run many similar, **independent** jobs at once that are distinguished by **one parameter**
  * each task is treated as a separate job
  * example ([[examples/job_array.sh|job_array.sh]], [[examples/sleep.sh|sleep.sh]]), start=1, end=30, stepwidth=7:

<code>
#!/bin/sh
#SBATCH -J array
#SBATCH -N 1
#SBATCH --array=1-30:7

./sleep.sh $SLURM_ARRAY_TASK_ID
</code>
  * computed tasks: 1, 8, 15, 22, 29

<code>
5605039_[15-29] mem_0064    array   markus PD
5605039_1       mem_0064    array   markus  R
5605039_8       mem_0064    array   markus  R
</code>



useful variables within the job:

<code>
SLURM_ARRAY_JOB_ID
SLURM_ARRAY_TASK_ID
SLURM_ARRAY_TASK_STEP
SLURM_ARRAY_TASK_MAX
SLURM_ARRAY_TASK_MIN
</code>

limit the number of simultaneously running array tasks to 2:

<code>
#SBATCH --array=1-30:7%2
</code>


==== Single core ====

  * use a complete compute node for several tasks at once

  * example: [[examples/job_singlenode_manytasks.sh|job_singlenode_manytasks.sh]]:

<code>
...

max_num_tasks=16

...

for i in `seq $task_start $task_increment $task_end`
do
  ./$executable $i &
  check_running_tasks   # sleeps as long as max_num_tasks tasks are running (see sketch below)
done
wait
</code>

  * ‘&’: starts the binary in the background, so the script can continue
  * ‘wait’: waits for all background processes; otherwise the script (and with it the job) would finish before the tasks are done

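''%%check_running_tasks%%'' is defined in the example script; a minimal sketch of such a helper, assuming plain bash job control, could look like this:

<code>
check_running_tasks () {
  # wait while the number of running background jobs has reached the limit
  while [ $(jobs -r | wc -l) -ge $max_num_tasks ]
  do
    sleep 1
  done
}
</code>
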


==== Array job + single core ====

[[examples/job_array_some_tasks.sh|job_array_some_tasks.sh]]:

<code>
...
#SBATCH --array=1-100:32

...

task_start=$SLURM_ARRAY_TASK_ID
task_end=$(( $SLURM_ARRAY_TASK_ID + $SLURM_ARRAY_TASK_STEP - 1 ))
if [ $task_end -gt $SLURM_ARRAY_TASK_MAX ]; then
        task_end=$SLURM_ARRAY_TASK_MAX
fi
task_increment=1

...

for i in `seq $task_start $task_increment $task_end`
do
  ./$executable $i &
  check_running_tasks
done
wait
</code>
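With ''%%--array=1-100:32%%'' the resulting chunks, derived from the snippet above, are:

<code>
# SLURM_ARRAY_TASK_ID = 1, 33, 65, 97   (SLURM_ARRAY_TASK_STEP = 32)
# task  1 processes items  1-32
# task 33 processes items 33-64
# task 65 processes items 65-96
# task 97 processes items 97-100   (clipped at SLURM_ARRAY_TASK_MAX = 100)
</code>
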
==== Exercises ====

  * the files are located in the folder ''%%examples/05_submitting_batch_jobs%%''
  * download or copy [[examples/sleep.sh|sleep.sh]] and find out what it does
  * run [[examples/job_array.sh|job_array.sh]] with tasks 4-20 and stepwidth 3
  * start jobs for [[examples/job_singlenode_manytasks.sh|job_singlenode_manytasks.sh]] with max_num_tasks=16 and max_num_tasks=8; compare the job runtimes
  * run [[examples/job_array_some_tasks.sh|job_array_some_tasks.sh]]

==== Job/process setup ====

  * normal jobs:

^#SBATCH            ^job environment        ^
|-N                 |SLURM_JOB_NUM_NODES    |
|--ntasks-per-core  |SLURM_NTASKS_PER_CORE  |
|--ntasks-per-node  |SLURM_NTASKS_PER_NODE  |
|--ntasks-per-socket|SLURM_NTASKS_PER_SOCKET|
|--ntasks, -n       |SLURM_NTASKS           |

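For example (a sketch; ''%%my_mpi_program%%'' is a placeholder), requesting 2 nodes with 16 tasks each:

<code>
#SBATCH -N 2
#SBATCH --ntasks-per-node=16

# inside the job: SLURM_JOB_NUM_NODES=2, SLURM_NTASKS_PER_NODE=16
srun ./my_mpi_program
</code>
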
  * emails:

<code>
#SBATCH --mail-user=yourmail@example.com
#SBATCH --mail-type=BEGIN,END
</code>

  * constraints:

<code>
#SBATCH -C, --constraint=
#SBATCH --gres=

#SBATCH -t, --time=<time>
#SBATCH --time-min=<time>
</code>

Valid time formats:

  * MM
  * [HH:]MM:SS
  * DD-HH[:MM[:SS]]



  * backfilling:
    * specify a ''%%--time%%'' or ''%%--time-min%%'' that is realistic for your job (see the sketch below)
    * short runtimes may enable the scheduler to use nodes that are idle while waiting for a large job
  * get the remaining running time of your job:

<code>
squeue -h -j $SLURM_JOBID -o %L
</code>
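
A backfilling-friendly sketch of the time limits (the values are placeholders):

<code>
#SBATCH --time=06:00:00       # requested walltime (HH:MM:SS)
#SBATCH --time-min=02:00:00   # the scheduler may start the job with as little as 2 hours
</code>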


==== Licenses ====

{{:pandoc:introduction-to-vsc:05_submitting_batch_jobs:slurm:licenses.png}}

List the licenses available to you with:

<code>
slic
</code>
Within the job script, add the flags as shown by ‘slic’, e.g. for using both Matlab and Mathematica:

<code>
#SBATCH -L matlab@vsc,mathematica@vsc
</code>
Intel licenses are needed only for compiling code, not for running it!

==== Reservations of compute nodes ====

  * core-h accounting is done for the full reservation time
  * contact us if you need a reservation
  * reservations are named after the project id

  * check for reservations:

<code>
scontrol show reservations
</code>
  * use a reservation:

<code>
#SBATCH --reservation=
</code>


==== Exercises ====

  * check for available reservations; if there is one, use it
  * specify an email address so that you are notified when the job has finished
  * run the following Matlab code in your job:

<code>
echo "2+2" | matlab
</code>
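One possible job script for this exercise (a sketch; the email address is a placeholder, the module name may differ on your system, and the reservation line only applies if a reservation exists):

<code>
#!/bin/bash
#SBATCH -J matlab_test
#SBATCH -N 1
#SBATCH --mail-user=yourmail@example.com
#SBATCH --mail-type=END
#SBATCH -L matlab@vsc
##SBATCH --reservation=<reservation_name>   # uncomment and fill in if a reservation is available

module purge
module load Matlab    # assumption: check 'module avail' for the actual module name

echo "2+2" | matlab
</code>
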
==== MPI + NTASKS_PER_NODE + pinning ====

  * understand what your code is doing and place the processes accordingly
  * use only a few processes per node if the memory demand is high
  * details for pinning: https://wiki.vsc.ac.at/doku.php?id=doku:vsc3_pinning

Example: two nodes with two MPI processes each:

=== srun ===

<code>
#SBATCH -N 2
#SBATCH --tasks-per-node=2

srun --cpu_bind=map_cpu:0,8 ./my_mpi_program
</code>

=== mpirun ===

<code>
#SBATCH -N 2
#SBATCH --tasks-per-node=2

export I_MPI_PIN_PROCESSOR_LIST=0,8   # valid for Intel MPI only!
mpirun ./my_mpi_program
</code>


==== Job dependencies ====

  - Submit the first job and note its <job_id>
  - Submit the dependent job (and note its <job_id> as well):

<code>
#!/bin/bash
#SBATCH -J jobname
#SBATCH -N 2
#SBATCH -d afterany:<job_id>
srun ./my_program
</code>
3. Continue at step 2 for further dependent jobs.
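
A sketch of the whole chain on the command line (''%%--parsable%%'' makes sbatch print just the job id; job1.sh and job2.sh are placeholders):

<code>
jobid=$(sbatch --parsable job1.sh)
sbatch -d afterany:$jobid job2.sh
</code>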


----
  