 +====== SLURM ======
 +
 +  * Article written by Markus Stöhr (VSC Team), last update 2017-10-09 by ms.
 +
 +
 +
 +==== Quickstart ====
 +
 +script [[examples/job-quickstart.sh|examples/05_submitting_batch_jobs/job-quickstart.sh]]:
 +
 +<code>
 +#!/bin/bash
 +#SBATCH -J h5test        # job name
 +#SBATCH -N 1             # number of nodes
 +
 +# start from a clean module environment
 +module purge
 +module load gcc/5.3 intel-mpi/5 hdf5/1.8.18-MPI
 +
 +# compile the parallel HDF5 example that ships with the hdf5 module
 +cp $VSC_HDF5_ROOT/share/hdf5_examples/c/ph5example.c .
 +mpicc -lhdf5 ph5example.c -o ph5example
 +
 +mpirun -np 8 ./ph5example -c -v
 +
 +</code>
 +submission:
 +
 +<code>
 +$ sbatch job.sh
 +Submitted batch job 5250981
 +</code>
 +
 +check what is going on:
 +
 +<code>
 +squeue -u $USER
 +</code>
 +<code>
 +  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 +5250981  mem_0128   h5test   markus  R       0:00      2 n323-[018-019]
 +</code>
 +Output files:
 +
 +<code>
 +ParaEg0.h5
 +ParaEg1.h5
 +slurm-5250981.out
 +</code>
 +inspect the resulting .h5 files, e.g.:
 +
 +<code>
 +h5dump ParaEg0.h5
 +</code>
 +
 +cancel jobs:
 +
 +<code>
 +scancel <job_id> 
 +</code>
 +or
 +
 +<code>
 +scancel -n <job_name>
 +</code>
 +or
 +
 +<code>
 +scancel -u $USER
 +</code>
 +===== Basic concepts =====
 +
 +==== Queueing system ====
 +
 +  * job/batch script:
 +    * shell script that does everything needed to run your calculation
 +    * independent of queueing system
 +    * **use simple scripts** (max 50 lines, i.e. put complicated logic elsewhere)
 +    * load modules from scratch (purge, then load)
 +
 +
 +  * tell the scheduler where/how to run the job:
 +    * #nodes
 +    * nodetype
 +    * …
 +
 +
 +  * scheduler manages job allocation to compute nodes
 +
 +
 +
 +
 +{{..:queueing_basics.png?200}}
 +
 +==== SLURM: Accounts and Users ====
 +
 +{{..:slurm_accounts.png}}
 +
 +
 +==== SLURM: Partition and Quality of Service ====
 +
 +{{..:partitions.png}}
 +
 +
 +==== VSC-3 Hardware Types ====
 +
 +^partition    ^  RAM (GB)  ^CPU                          ^  Cores  ^  IB (HCA)  ^  #Nodes  ^
 +|mem_0064*    |     64     |2x Intel E5-2650 v2 @ 2.60GHz|  2x8    |   2xQDR    |   1849   |
 +|mem_0128     |    128     |2x Intel E5-2650 v2 @ 2.60GHz|  2x8    |   2xQDR    |   140    |
 +|mem_0256     |    256     |2x Intel E5-2650 v2 @ 2.60GHz|  2x8    |   2xQDR    |    50    |
 +|vsc3plus_0064|     64     |2x Intel E5-2660 v2 @ 2.20GHz|  2x10   |   1xFDR    |   816    |
 +|vsc3plus_0256|    256     |2x Intel E5-2660 v2 @ 2.20GHz|  2x10   |   1xFDR    |    48    |
 +|binf         | 512 - 1536 |2x Intel E5-2690 v4 @ 2.60GHz|  2x14   |   1xFDR    |    17    |
 +
 +
 +* default partition, QDR: Intel Truescale Infinipath (40Gbit/s), FDR: Mellanox ConnectX-3 (56Gbit/s)
 +
 +effective: 10/2018
 +
 +  * + GPU nodes (see later)
 +  * specify partition in job script:
 +
 +<code>
 +#SBATCH -p <partition>
 +</code>
 +==== Standard QOS ====
 +
 +^partition    ^QOS          ^
 +|mem_0064*    |normal_0064  |
 +|mem_0128     |normal_0128  |
 +|mem_0256     |normal_0256  |
 +|vsc3plus_0064|vsc3plus_0064|
 +|vsc3plus_0256|vsc3plus_0256|
 +|binf         |normal_binf  |
 +
 +
 +  * specify QOS in job script:
 +
 +<code>
 +#SBATCH --qos <QOS>
 +</code>
 +
 +----
 +
 +==== VSC-4 Hardware Types ====
 +
 +^partition^  RAM (GB)  ^CPU                             ^  Cores  ^  IB (HCA)  ^  #Nodes  ^
 +|mem_0096*|     96     |2x Intel Platinum 8174 @ 3.10GHz|  2x24   |   1xEDR    |   688    |
 +|mem_0384 |    384     |2x Intel Platinum 8174 @ 3.10GHz|  2x24   |   1xEDR    |    78    |
 +|mem_0768 |    768     |2x Intel Platinum 8174 @ 3.10GHz|  2x24   |   1xEDR    |    12    |
 +
 +
 +* default partition, EDR: Intel Omni-Path (100Gbit/s)
 +
 +effective: 10/2020
 +
 +==== Standard QOS ====
 +
 +^partition^QOS     ^
 +|mem_0096*|mem_0096|
 +|mem_0384 |mem_0384|
 +|mem_0768 |mem_0768|
 +
 +
 +
 +----
 +
 +==== VSC Hardware Types ====
 +
 +  * Display information about partitions and their nodes:
 +
 +<code>
 +sinfo -o %P
 +scontrol show partition mem_0064
 +scontrol show node n301-001
 +</code>
 +
 +==== QOS-Account/Project assignment ====
 +
 +
 +{{..:setup.png?200}}
 +
 +steps 1 and 2 in the figure:
 +
 +<code>
 +sqos -acc
 +</code>
 +
 +<code>
 +default_account:              p70824
 +        account:              p70824                    
 +
 +    default_qos:         normal_0064                    
 +            qos:          devel_0128                    
 +                            goodluck                    
 +                      gpu_gtx1080amd                    
 +                    gpu_gtx1080multi                    
 +                   gpu_gtx1080single                    
 +                            gpu_k20m                    
 +                             gpu_m60                    
 +                                 knl                    
 +                         normal_0064                    
 +                         normal_0128                    
 +                         normal_0256                    
 +                         normal_binf                    
 +                       vsc3plus_0064                    
 +                       vsc3plus_0256
 +</code>
 +
 +
 +==== QOS-Partition assignment ====
 +
 +
 +step 3 in the figure:
 +
 +<code>
 +sqos
 +</code>
 +<code>
 +            qos_name total  used  free     walltime   priority partitions  
 +=========================================================================
 +         normal_0064  1782  1173   609   3-00:00:00       2000 mem_0064    
 +         normal_0256    15    24    -9   3-00:00:00       2000 mem_0256    
 +         normal_0128    93    51    42   3-00:00:00       2000 mem_0128    
 +          devel_0128    10    20   -10     00:10:00      20000 mem_0128    
 +            goodluck               3-00:00:00       1000 vsc3plus_0256,vsc3plus_0064,amd
 +                 knl               3-00:00:00       1000 knl         
 +         normal_binf    16        11   1-00:00:00       1000 binf        
 +    gpu_gtx1080multi               3-00:00:00       2000 gpu_gtx1080multi
 +   gpu_gtx1080single    50    18    32   3-00:00:00       2000 gpu_gtx1080single
 +            gpu_k20m               3-00:00:00       2000 gpu_k20m    
 +             gpu_m60               3-00:00:00       2000 gpu_m60     
 +       vsc3plus_0064   800   781    19   3-00:00:00       1000 vsc3plus_0064
 +       vsc3plus_0256    48    44       3-00:00:00       1000 vsc3plus_0256
 +      gpu_gtx1080amd               3-00:00:00       2000 gpu_gtx1080amd
 +</code>
 +naming convention:
 +
 +^QOS   ^Partition^
 +|*_0064|mem_0064 |
 +
 +
 +
 +
 +
 +
 +----
 +
 +==== Specification in job script ====
 +
 +
 +<code>
 +#SBATCH --account=xxxxxx
 +#SBATCH --qos=xxxxx_xxxx
 +#SBATCH --partition=mem_xxxx
 +</code>
 +For omitted lines the corresponding defaults are used (see the previous slides); the default partition is “mem_0064”.
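 +
 +A filled-in example (the account p70824 is just the one shown in the sqos output above; substitute your own project):
 +
 +<code>
 +#SBATCH --account=p70824
 +#SBATCH --qos=normal_0064
 +#SBATCH --partition=mem_0064
 +</code>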
 +
 +
 +==== Sample batch job ====
 +
 +default:
 +
 +<code>
 +#!/bin/bash
 +#SBATCH -J jobname
 +#SBATCH -N number_of_nodes
 +
 +do_my_work
 +</code>
 +job is submitted to:
 +
 +  * partition mem_0064
 +  * qos normal_0064
 +  * default account
 +
 +
 +
 +explicit:
 +
 +<code>
 +#!/bin/bash
 +#SBATCH -J jobname
 +#SBATCH -N number_of_nodes
 +#SBATCH --partition=mem_xxxx
 +#SBATCH --qos=xxxxx_xxxx
 +#SBATCH --account=xxxxxx
 +
 +do_my_work
 +</code>
 +
 +
 +
 +  * must be a shell script (first line!)
 +  * ‘#SBATCH’ for marking SLURM parameters
 +  * environment variables are set by SLURM for use within the script (e.g. ''%%SLURM_JOB_NUM_NODES%%'')
 +
 +
 +
 +==== Job submission ====
 +
 +<code>
 +sbatch <SLURM_PARAMETERS> job.sh <JOB_PARAMETERS>
 +</code>
 +  * parameters are specified as in job script
 +  * precedence: command-line sbatch parameters override the corresponding parameters in the job script
 +  * be careful to place SLURM parameters **before** the job script
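 +
 +For example, a hypothetical call that overrides partition, QOS and node count on the command line and passes one argument to the job script:
 +
 +<code>
 +sbatch -N 2 --partition=mem_0128 --qos=devel_0128 job.sh input.dat
 +</code>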
 +
 +==== Exercises ====
 +
 +  * try these commands and find out which partition has to be used if you want to run in QOS ‘devel_0128’:
 +
 +<code>
 +sqos
 +sqos -acc
 +</code>
 +  * find out which nodes are in the partition that allows running in ‘devel_0128’ and check how much memory these nodes have:
 +
 +<code>
 +scontrol show partition ...
 +scontrol show node ...
 +</code>
 +  * submit a one node job to QOS devel_0128 with the following commands:
 +
 +<code>
 +hostname
 +free 
 +</code>
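 +
 +A minimal sketch of such a job script, using the partition/QOS pairing found with the commands above:
 +
 +<code>
 +#!/bin/bash
 +#SBATCH -J devel_test
 +#SBATCH -N 1
 +#SBATCH --partition=mem_0128
 +#SBATCH --qos=devel_0128
 +
 +hostname
 +free
 +</code>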
 +==== Bad job practices ====
 +
 +  * job submissions in a loop (takes a long time):
 +
 +<code>
 +for i in {1..1000} 
 +do 
 +    sbatch job.sh $i
 +done
 +</code>
 +
 +  * loop inside job script (sequential mpirun commands):
 +
 +<code>
 +for i in {1..1000}
 +do
 +    mpirun my_program $i
 +done
 +</code>
 +
 +
 +==== Array jobs ====
 +
 +  * submit/run a series of **independent** jobs via a single SLURM script
 +  * each job in the array gets a unique identifier (''%%SLURM_ARRAY_TASK_ID%%'') that can be used to organize the individual workloads
 +  * example ([[examples/job_array.sh|job_array.sh]]), 10 jobs, SLURM_ARRAY_TASK_ID=1,2,3…10
 +
 +<code>
 +#!/bin/sh
 +#SBATCH -J array
 +#SBATCH -N 1
 +#SBATCH --array=1-10
 +
 +echo "Hi, this is array job number"  $SLURM_ARRAY_TASK_ID
 +sleep $SLURM_ARRAY_TASK_ID
 +</code>
 +  * independent jobs: 1, 2, 3 … 10
 +
 +<code>
 +VSC-4 >  squeue -u $USER
 +             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 +     406846_[7-10]  mem_0096    array       sh PD       0:00      1 (Resources)
 +          406846_4  mem_0096    array       sh  R    INVALID      1 n403-062
 +          406846_5  mem_0096    array       sh  R    INVALID      1 n403-072
 +          406846_6  mem_0096    array       sh  R    INVALID      1 n404-031
 +</code>
 +
 +<code>
 +VSC-4 >  ls slurm-*
 +slurm-406846_10.out  slurm-406846_3.out  slurm-406846_6.out  slurm-406846_9.out
 +slurm-406846_1.out   slurm-406846_4.out  slurm-406846_7.out
 +slurm-406846_2.out   slurm-406846_5.out  slurm-406846_8.out
 +</code>
 +
 +<code>
 +VSC-4 >  cat slurm-406846_8.out
 +Hi, this is array job number  8
 +</code>
 +
 +
 +
 +  * fine-tuning via built-in variables (SLURM_ARRAY_TASK_MIN, SLURM_ARRAY_TASK_MAX…)
 +
 +  * example of going in chunks of a certain size, e.g. 5, SLURM_ARRAY_TASK_ID=1,6,11,16
 +
 +<code>
 +#SBATCH --array=1-20:5
 +</code>
 +
 +  * example of limiting the number of simultaneously running jobs to 2 (e.g. because of limited licenses)
 +
 +<code>
 +#SBATCH --array=1-20:5%2
 +</code>
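 +
 +A common pattern is to use the task ID to select per-task input; the program and file names below are hypothetical:
 +
 +<code>
 +...
 +#SBATCH --array=1-10
 +
 +# each array task processes its own input file
 +./my_program input_${SLURM_ARRAY_TASK_ID}.dat
 +</code>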
 +
 +
 +==== Single core jobs ====
 +
 +  * use an entire compute node for several independent jobs
 +  * example: [[examples/single_node_multiple_jobs.sh|single_node_multiple_jobs.sh]]:
 +
 +<code>
 +for ((i=1; i<=48; i++))
 +do
 +   stress --cpu 1 --timeout $i  &
 +done
 +wait
 +</code>
 +  * ‘&’: sends the process into the background, so the script can continue
 +  * ‘wait’: waits for all background processes; without it the script (and with it the job) would terminate immediately
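 +
 +The same pattern with a (hypothetical) serial program and per-task input/output files:
 +
 +<code>
 +for ((i=1; i<=48; i++))
 +do
 +   ./my_serial_program input_$i.dat > output_$i.log &
 +done
 +wait
 +</code>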
 +
 +
 +==== Combination of array & single core job ====
 +
 +  * example: [[examples/combined_array_multiple_jobs.sh|combined_array_multiple_jobs.sh]]:
 +
 +<code>
 +...
 +#SBATCH --array=1-144:48
 +
 +# each array task handles a chunk of 48 values:
 +# task IDs are 1, 49, 97 and each chunk ends at task ID + 47
 +j=$SLURM_ARRAY_TASK_ID
 +((j+=47))
 +
 +for ((i=$SLURM_ARRAY_TASK_ID; i<=$j; i++))
 +do
 +   stress --cpu 1 --timeout $i  &
 +done
 +wait
 +</code>
 +==== Exercises ====
 +
 +  * files are located in folder ''%%examples/05_submitting_batch_jobs%%''
 +  * look into [[examples/job_array.sh|job_array.sh]] and modify it such that the considered range is from 1 to 20 but in steps of 5
 +  * look into [[examples/single_node_multiple_jobs.sh|single_node_multiple_jobs.sh]] and also change it to go in steps of 5
 +  * run [[examples/combined_array_multiple_jobs.sh|combined_array_multiple_jobs.sh]] and check whether the output is reasonable
 +
 +==== Job/process setup ====
 +
 +  * normal jobs:
 +
 +^#SBATCH          ^job environment      ^
 +|-N               |SLURM_JOB_NUM_NODES  |
 +|--ntasks-per-core|SLURM_NTASKS_PER_CORE|
 +|--ntasks-per-node|SLURM_NTASKS_PER_NODE|
 +|--ntasks, -n     |SLURM_NTASKS         |
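 +
 +A short sketch of how these environment variables can be used inside the job script (program name is hypothetical):
 +
 +<code>
 +#!/bin/bash
 +#SBATCH -J setup_demo
 +#SBATCH -N 2
 +#SBATCH -n 32
 +
 +echo "running on $SLURM_JOB_NUM_NODES nodes with $SLURM_NTASKS tasks"
 +mpirun -np $SLURM_NTASKS ./my_mpi_program
 +</code>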
 +
 +  * emails:
 +
 +<code>
 +#SBATCH --mail-user=yourmail@example.com
 +#SBATCH --mail-type=BEGIN,END
 +</code>
 +
 +  * constraints:
 +
 +<code>
 +#SBATCH -t, --time=<time>
 +#SBATCH --time-min=<time>
 +</code>
 +
 +time format:
 +
 +  * DD-HH[:MM[:SS]]
 +
 +
 +
 +  * backfilling:
 +    * specify ''%%--time%%'' or ''%%--time-min%%'', which are estimates of the runtime of your job
 +    * runtimes shorter than the default (mostly 72h) may enable the scheduler to fill idle nodes that are waiting for a larger job (see the example after the next code block)
 +  * get the remaining running time for your job:
 +
 +<code>
 +squeue -h -j $SLURM_JOBID -o %L
 +</code>
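 +
 +Referring to the backfilling hint above: a hypothetical job that needs at most 8 hours, but can already make progress within 4 hours, could specify
 +
 +<code>
 +#SBATCH --time=08:00:00
 +#SBATCH --time-min=04:00:00
 +</code>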
 +
 +
 +==== Licenses ====
 +
 +{{..:licenses.png}}
 +
 +
 +<code>
 +VSC-3 >  slic
 +</code>
 +Within the SLURM submit script add the flags as shown by ‘slic’, e.g. when both Matlab and Mathematica are required:
 +
 +<code>
 +#SBATCH -L matlab@vsc,mathematica@vsc
 +</code>
 +Intel licenses are needed only when compiling code, not for running the resulting executables.
 +
 +==== Reservation of compute nodes ====
 +
 +  * core-h accounting is done for the entire period of reservation
 +  * contact service@vsc.ac.at
 +  * reservations are named after the project id
 +
 +  * check for reservations:
 +
 +<code>
 +VSC-3 >  scontrol show reservations
 +</code>
 +  * usage:
 +
 +<code>
 +#SBATCH --reservation=
 +</code>
 +
 +
 +==== Exercises ====
 +
 +  * check for available reservations. If there is one available, use it
 +  * specify an email address that notifies you when the job has finished
 +  * run the following Matlab code in your job:
 +
 +<code>
 +echo "2+2" | matlab
 +</code>
 +==== MPI + pinning ====
 +
 +  * understand what your code is doing and place the processes correctly
 +  * use only a few processes per node if memory demand is high
 +  * details for pinning: https://wiki.vsc.ac.at/doku.php?id=doku:vsc3_pinning
 +
 +Example: Two nodes with two MPI processes each:
 +
 +=== srun ===
 +
 +<code>
 +#SBATCH -N 2
 +#SBATCH --tasks-per-node=2
 +
 +# bind the two tasks on each node to cores 0 and 24
 +srun --cpu_bind=map_cpu:0,24 ./my_mpi_program
 +
 +</code>
 +
 +=== mpirun ===
 +
 +<code>
 +#SBATCH -N 2
 +#SBATCH --tasks-per-node=2
 +
 +export I_MPI_PIN_PROCESSOR_LIST=0,24   # Intel MPI syntax 
 +mpirun ./my_mpi_program
 +</code>
 +
 +
 +==== Job dependencies ====
 +
 +  - Submit the first job and note its <job_id>
 +  - Submit the dependent job (and note its <job_id> for further dependencies):
 +
 +<code>
 +#!/bin/bash
 +#SBATCH -J jobname
 +#SBATCH -N 2
 +#SBATCH -d afterany:<job_id>
 +srun  ./my_program
 +</code>
 +  - continue at step 2 for further dependent jobs
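 +
 +A sketch of chaining the submissions from the command line (script names are hypothetical); ''%%sbatch --parsable%%'' prints only the job id:
 +
 +<code>
 +job_id=$(sbatch --parsable first_job.sh)
 +sbatch -d afterany:$job_id second_job.sh
 +</code>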
 +
 +
 +----
 +
  