  * Article written by Markus Stöhr (VSC Team) <html><br></html>(last update 2017-10-09 by ms).
  
<code>
sbatch job.sh
Submitted batch job 5250981
</code>

check what is going on:
  
<code>
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
5250981  mem_0128   h5test   markus  R       0:00      2 n323-[018-019]
</code>
Output files:
<code>
h5dump
</code>

cancel jobs:

<code>
scancel <job_id>
</code>
or

<code>
scancel -n <job_name>
</code>
or

<code>
scancel -u $USER
</code>
===== Basic concepts =====
    * shell script that does everything needed to run your calculation
    * independent of queueing system
    * **use simple scripts** (max 50 lines, i.e. put complicated logic elsewhere)
    * load modules from scratch (purge, then load), as in the sketch below
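A minimal sketch of such a script body, assuming bash; the module and program names are only placeholders:

<code>
#!/bin/bash
# start from a clean, reproducible environment
module purge
module load gcc              # placeholder module name

./my_calculation             # placeholder executable
</code>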
  
    * #nodes
    * nodetype
    * ...
  
  
  
  
{{.:queueing_basics.png?200}}
  
==== SLURM: Accounts and Users ====
  
{{.:slurm_accounts.png}}
  
  
==== SLURM: Partition and Quality of Service ====
  
{{.:partitions.png}}
  
  
==== VSC-3 Hardware Types ====
  
^partition    ^  RAM (GB)  ^CPU                          ^  Cores  ^  IB (HCA)  ^  #Nodes  ^
|mem_0064*    |     64     |2x Intel E5-2650 v2 @ 2.60GHz|   2x8   |   2xQDR    |   1849   |
|mem_0128     |    128     |2x Intel E5-2650 v2 @ 2.60GHz|   2x8   |   2xQDR    |   140    |
|mem_0256     |    256     |2x Intel E5-2650 v2 @ 2.60GHz|   2x8   |   2xQDR    |    50    |
|vsc3plus_0064|     64     |2x Intel E5-2660 v2 @ 2.20GHz|  2x10   |   1xFDR    |   816    |
|vsc3plus_0256|    256     |2x Intel E5-2660 v2 @ 2.20GHz|  2x10   |   1xFDR    |    48    |
|binf         | 512 - 1536 |2x Intel E5-2690 v4 @ 2.60GHz|  2x14   |   1xFDR    |    17    |
  
* default partition, QDR: Intel Truescale Infinipath (40Gbit/s), FDR: Mellanox ConnectX-3 (56Gbit/s)

effective: 10/2018

  * + GPU nodes (see later)
  * specify partition in job script:
<code>
#SBATCH -p <partition>
</code>
==== Standard QOS ====

^partition    ^QOS          ^
|mem_0064*    |normal_0064  |
|mem_0128     |normal_0128  |
|mem_0256     |normal_0256  |
|vsc3plus_0064|vsc3plus_0064|
|vsc3plus_0256|vsc3plus_0256|
|binf         |normal_binf  |

  * specify QOS in job script:

<code>
#SBATCH --qos <QOS>
</code>

----

==== VSC-4 Hardware Types ====

^partition^  RAM (GB)  ^CPU                             ^  Cores  ^  IB (HCA)  ^  #Nodes  ^
|mem_0096*|     96     |2x Intel Platinum 8174 @ 3.10GHz|  2x24   |   1xEDR    |   688    |
|mem_0384 |    384     |2x Intel Platinum 8174 @ 3.10GHz|  2x24   |   1xEDR    |    78    |
|mem_0768 |    768     |2x Intel Platinum 8174 @ 3.10GHz|  2x24   |   1xEDR    |    12    |

* default partition, EDR: Intel Omni-Path (100Gbit/s)

effective: 10/2020

==== Standard QOS ====

^partition^QOS     ^
|mem_0096*|mem_0096|
|mem_0384 |mem_0384|
|mem_0768 |mem_0768|

----

==== VSC Hardware Types ====

  * Display information about partitions and their nodes:
<code>
sinfo -o %P
scontrol show partition mem_0064
scontrol show node n301-001
</code>
  
==== QOS-Account/Project assignment ====
  
{{.:setup.png?200}}
  
1.+2.:
  
<code>
default_account:              p70824
        account:              p70824

    default_qos:         normal_0064
            qos:          devel_0128
                            goodluck
                      gpu_gtx1080amd
                    gpu_gtx1080multi
                   gpu_gtx1080single
                            gpu_k20m
                             gpu_m60
                                 knl
                         normal_0064
                         normal_0128
                         normal_0256
                         normal_binf
                       vsc3plus_0064
                       vsc3plus_0256
</code>
  
  
==== QOS-Partition assignment ====
  
3.:

<code>
            qos_name total  used  free     walltime   priority partitions
=========================================================================
         normal_0064  1782  1173   609   3-00:00:00       2000 mem_0064
         normal_0256    15    24    -9   3-00:00:00       2000 mem_0256
         normal_0128    93    51    42   3-00:00:00       2000 mem_0128
          devel_0128    10    20   -10     00:10:00      20000 mem_0128
            goodluck     0     0     0   3-00:00:00       1000 vsc3plus_0256,vsc3plus_0064,amd
                 knl     4     1     3   3-00:00:00       1000 knl
         normal_binf    16     5    11   1-00:00:00       1000 binf
    gpu_gtx1080multi     0     0     0   3-00:00:00       2000 gpu_gtx1080multi
   gpu_gtx1080single    50    18    32   3-00:00:00       2000 gpu_gtx1080single
            gpu_k20m     0     0     0   3-00:00:00       2000 gpu_k20m
             gpu_m60     0     0     0   3-00:00:00       2000 gpu_m60
       vsc3plus_0064   800   781    19   3-00:00:00       1000 vsc3plus_0064
       vsc3plus_0256    48    44     4   3-00:00:00       1000 vsc3plus_0256
      gpu_gtx1080amd     0     0     0   3-00:00:00       2000 gpu_gtx1080amd
</code>
naming convention:

^QOS   ^Partition^
|*_0064|mem_0064 |
  
  
  
==== Specification in job script ====
  
<code>
#SBATCH --partition=mem_xxxx
</code>
For omitted lines the corresponding defaults are used. See previous slides; the default partition is "mem_0064".
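For illustration, a sketch of an explicit specification; the account, QOS and partition values below are placeholders:

<code>
#SBATCH --account=p7xxxx
#SBATCH --qos=normal_0064
#SBATCH --partition=mem_0064
</code>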
  
  
  
  * must be a shell script (first line!)
  * '#SBATCH' for marking SLURM parameters
  * environment variables are set by SLURM for use within the script (e.g. ''%%SLURM_JOB_NUM_NODES%%''; see the sketch below)
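A minimal sketch illustrating these points, assuming bash (job name and node count are arbitrary examples):

<code>
#!/bin/bash
#SBATCH -J myjob
#SBATCH -N 2

echo "running on $SLURM_JOB_NUM_NODES nodes"
</code>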
  
  
==== Exercises ====
  
  * try these commands and find out which partition has to be used if you want to run in QOS 'devel_0128':
  
<code>
sqos -acc
</code>
  * find out which nodes are in the partition that allows running in 'devel_0128'. Further, check how much memory these nodes have:
  
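One possible way to check this (a sketch; replace ''%%mem_0128%%'' with the partition identified in the previous step):

<code>
scontrol show partition mem_0128
sinfo -p mem_0128 -o "%N %m"
</code>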
==== Bad job practices ====
  
  * job submissions in a loop (takes a long time):
  
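A minimal sketch of such a (discouraged) submission loop; the parameter range and script name are only illustrative:

<code>
for i in $(seq 1 300)
do
    sbatch job.sh $i
done
</code>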
  
  * loop inside job script (sequential mpirun commands):
  
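A sketch of this pattern, with placeholder program and input names:

<code>
for i in $(seq 1 10)
do
    mpirun ./my_mpi_program input_$i
done
</code>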
  
  
==== Array jobs ====
  
  * submit/run a series of **independent** jobs via a single SLURM script
  * each job in the array gets a unique identifier (SLURM_ARRAY_TASK_ID) based on which the various workloads can be organized
  * example ([[examples/job_array.sh|job_array.sh]]), 10 jobs, SLURM_ARRAY_TASK_ID=1,2,3…10:
  
<code>
#SBATCH -J array
#SBATCH -N 1
#SBATCH --array=1-10
  
echo "Hi, this is array job number"  $SLURM_ARRAY_TASK_ID
sleep $SLURM_ARRAY_TASK_ID
</code>
  * independent jobs: 1, 2, 3 … 10
  
<code>
VSC-4 >  squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
     406846_[7-10]  mem_0096    array       sh PD       0:00      1 (Resources)
          406846_4  mem_0096    array       sh  R    INVALID      1 n403-062
          406846_5  mem_0096    array       sh  R    INVALID      1 n403-072
          406846_6  mem_0096    array       sh  R    INVALID      1 n404-031
</code>
  
<code>
VSC-4 >  ls slurm-*
slurm-406846_10.out  slurm-406846_3.out  slurm-406846_6.out  slurm-406846_9.out
slurm-406846_1.out   slurm-406846_4.out  slurm-406846_7.out
slurm-406846_2.out   slurm-406846_5.out  slurm-406846_8.out
</code>
  
<code>
VSC-4 >  cat slurm-406846_8.out
Hi, this is array job number  8
</code>
  
  
  * fine-tuning via builtin variables (SLURM_ARRAY_TASK_MIN, SLURM_ARRAY_TASK_MAX, …)
  * example of going in chunks of a certain size, e.g. 5, SLURM_ARRAY_TASK_ID=1,6,11,16:
  
<code>
#SBATCH --array=1-20:5
</code>
  
  * example of limiting the number of simultaneously running jobs to 2 (e.g. because of licenses):
  
<code>
#SBATCH --array=1-20:5%2
</code>
  
==== Single core jobs ====

  * use an entire compute node for several independent jobs
  * example: [[examples/single_node_multiple_jobs.sh|single_node_multiple_jobs.sh]]:
 + 
<code>
for ((i=1; i<=48; i++))
do
   stress --cpu 1 --timeout $i &
done
wait
</code>
  * '&': send the process into the background, the script can continue
  * 'wait': waits for all processes in the background, otherwise the script would terminate immediately
  
==== Combination of array & single core job ====
  
  * example: [[examples/combined_array_multiple_jobs.sh|combined_array_multiple_jobs.sh]]:
  
<code>
...
#SBATCH --array=1-144:48
  
j=$SLURM_ARRAY_TASK_ID
((j+=47))
  
for ((i=$SLURM_ARRAY_TASK_ID; i<=$j; i++))
do
   stress --cpu 1 --timeout $i &
done
wait
</code>
==== Exercises ====
  
  * files are located in folder ''%%examples/05_submitting_batch_jobs%%''
  * look into [[examples/job_array.sh|job_array.sh]] and modify it such that the considered range is from 1 to 20 but in steps of 5
  * look into [[examples/single_node_multiple_jobs.sh|single_node_multiple_jobs.sh]] and also change it to go in steps of 5
  * run [[examples/combined_array_multiple_jobs.sh|combined_array_multiple_jobs.sh]] and check whether the output is reasonable
  
==== Job/process setup ====
  * normal jobs:
  
^#SBATCH          ^job environment      ^
|-N               |SLURM_JOB_NUM_NODES  |
|--ntasks-per-core|SLURM_NTASKS_PER_CORE|
|--ntasks-per-node|SLURM_NTASKS_PER_NODE|
|--ntasks, -n     |SLURM_NTASKS         |
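For instance, two full nodes with 16 tasks each could be requested like this (values are illustrative):

<code>
#SBATCH -N 2
#SBATCH --ntasks-per-node=16
</code>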
  
  * emails:
<code>
#SBATCH --mail-type=BEGIN,END
</code>
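The recipient is set with ''%%--mail-user%%''; the address below is a placeholder:

<code>
#SBATCH --mail-user=first.last@example.com
</code>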
  * constraints:
  
<code>
#SBATCH -t, --time=<time>
#SBATCH --time-min=<time>
</code>
  
time format:
  
  * DD-HH[:MM[:SS]]
  
  
  
  * backfilling:
    * specify '--time' or '--time-min', which are estimates of the runtime of your job
    * a runtime shorter than the default (mostly 72h) may enable the scheduler to use idle nodes that are waiting for a larger job
  * get the remaining running time for your job:
  
<code>
squeue -h -j $SLURM_JOBID -o %L
</code>
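To benefit from backfilling, request a time limit shorter than the partition default in the job script, e.g. (value illustrative):

<code>
#SBATCH --time=08:00:00
</code>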
  
  
==== Licenses ====
  
{{.:licenses.png}}
  
  
<code>
VSC-3 >  slic
</code>
Within the SLURM submit script add the flags as shown by 'slic', e.g. when both Matlab and Mathematica are required:
  
<code>
#SBATCH -L matlab@vsc,mathematica@vsc
</code>
Intel licenses are needed only when compiling code, not for running the resulting executables.
  
==== Reservation of compute nodes ====
  
  * core-h accounting is done for the entire period of the reservation
  * contact service@vsc.ac.at
  * reservations are named after the project id
  
  
<code>
VSC-3 >  scontrol show reservations
</code>
  * usage:
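A minimal sketch of requesting a reservation in the job script; the reservation name is a placeholder (actual names are listed by ''%%scontrol show reservations%%''):

<code>
#SBATCH --reservation=<reservation_name>
</code>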
  
<code>
echo "2+2" | matlab
</code>
==== MPI + pinning ====
  
  * understand what your code is doing and place the processes correctly
  * details for pinning: https://wiki.vsc.ac.at/doku.php?id=doku:vsc3_pinning
  
Example: Two nodes with two MPI processes each:
  
=== srun ===
<code>
#SBATCH --tasks-per-node=2
  
srun --cpu_bind=map_cpu:0,24 ./my_mpi_program
  
</code>
=== mpirun ===

<code>
#SBATCH --tasks-per-node=2
  
export I_MPI_PIN_PROCESSOR_LIST=0,24   # Intel MPI syntax
mpirun ./my_mpi_program
</code>
  
----
  