====== SLURM ======

  * Article written by Markus Stöhr (VSC Team) (last update 2017-10-09 by ms).

==== Quickstart ====

script [[examples/job-quickstart.sh|examples/05_submitting_batch_jobs/job-quickstart.sh]]:

<code bash>
#!/bin/bash
#SBATCH -J h5test
#SBATCH -N 1

module purge
module load gcc/5.3 intel-mpi/5 hdf5/1.8.18-MPI

cp $VSC_HDF5_ROOT/share/hdf5_examples/c/ph5example.c .
mpicc -lhdf5 ph5example.c -o ph5example

mpirun -np 8 ./ph5example -c -v
</code>

submission:

<code>
$ sbatch job.sh
Submitted batch job 5250981
</code>

check what is going on:

<code>
$ squeue -u $USER
  JOBID PARTITION   NAME    USER ST  TIME NODES NODELIST(REASON)
5250981  mem_0128 h5test  markus  R  0:00     2 n323-[018-019]
</code>

output files:

<code>
ParaEg0.h5
ParaEg1.h5
slurm-5250981.out
</code>

try ''%%h5dump%%'' on the .h5 files.

cancel jobs:

<code>
scancel <job_id>
scancel -n <job_name>
scancel -u $USER
</code>

===== Basic concepts =====

==== Queueing system ====

  * job/batch script:
    * shell script that does everything needed to run your calculation
    * independent of the queueing system
    * **use simple scripts** (max 50 lines, i.e. put complicated logic elsewhere)
    * load modules from scratch (purge, then load)
  * tell the scheduler where/how to run jobs:
    * number of nodes
    * node type
    * ...
  * the scheduler manages job allocation to the compute nodes

{{.:queueing_basics.png?200}}

==== SLURM: Accounts and Users ====

{{.:slurm_accounts.png}}

==== SLURM: Partition and Quality of Service ====

{{.:partitions.png}}

==== VSC-3 Hardware Types ====

^partition     ^ RAM (GB)   ^CPU                          ^ Cores ^ IB (HCA) ^ #Nodes ^
|mem_0064*     | 64         |2x Intel E5-2650 v2 @ 2.60GHz| 2x8   | 2xQDR    | 1849   |
|mem_0128      | 128        |2x Intel E5-2650 v2 @ 2.60GHz| 2x8   | 2xQDR    | 140    |
|mem_0256      | 256        |2x Intel E5-2650 v2 @ 2.60GHz| 2x8   | 2xQDR    | 50     |
|vsc3plus_0064 | 64         |2x Intel E5-2660 v2 @ 2.20GHz| 2x10  | 1xFDR    | 816    |
|vsc3plus_0256 | 256        |2x Intel E5-2660 v2 @ 2.20GHz| 2x10  | 1xFDR    | 48     |
|binf          | 512 - 1536 |2x Intel E5-2690 v4 @ 2.60GHz| 2x14  | 1xFDR    | 17     |

  * ''%%*%%'' marks the default partition; QDR: Intel Truescale Infinipath (40Gbit/s), FDR: Mellanox ConnectX-3 (56Gbit/s); effective: 10/2018
  * + GPU nodes (see later)
  * specify the partition in the job script: ''%%#SBATCH -p <partition>%%''

==== Standard QOS ====

^partition     ^QOS          ^
|mem_0064*     |normal_0064  |
|mem_0128      |normal_0128  |
|mem_0256      |normal_0256  |
|vsc3plus_0064 |vsc3plus_0064|
|vsc3plus_0256 |vsc3plus_0256|
|binf          |normal_binf  |

  * specify the QOS in the job script: ''%%#SBATCH --qos=<qos>%%''

----

==== VSC-4 Hardware Types ====

^partition ^ RAM (GB) ^CPU                             ^ Cores ^ IB (HCA) ^ #Nodes ^
|mem_0096* | 96       |2x Intel Platinum 8174 @ 3.10GHz| 2x24  | 1xEDR    | 688    |
|mem_0384  | 384      |2x Intel Platinum 8174 @ 3.10GHz| 2x24  | 1xEDR    | 78     |
|mem_0768  | 768      |2x Intel Platinum 8174 @ 3.10GHz| 2x24  | 1xEDR    | 12     |

  * ''%%*%%'' marks the default partition; EDR: Intel Omni-Path (100Gbit/s); effective: 10/2020

==== Standard QOS ====

^partition ^QOS     ^
|mem_0096* |mem_0096|
|mem_0384  |mem_0384|
|mem_0768  |mem_0768|

----

==== VSC Hardware Types ====

  * display information about partitions and their nodes:

<code>
sinfo -o %P
scontrol show partition mem_0064
scontrol show node n301-001
</code>

==== QOS-Account/Project assignment ====

{{.:setup.png?200}}

1.+2.:

<code>
$ sqos -acc
default_account: p70824
        account: p70824

    default_qos: normal_0064
            qos: devel_0128
                 goodluck
                 gpu_gtx1080amd
                 gpu_gtx1080multi
                 gpu_gtx1080single
                 gpu_k20m
                 gpu_m60
                 knl
                 normal_0064
                 normal_0128
                 normal_0256
                 normal_binf
                 vsc3plus_0064
                 vsc3plus_0256
</code>

==== QOS-Partition assignment ====

3.:

<code>
$ sqos
         qos_name total  used  free    walltime  priority partitions
=========================================================================
      normal_0064  1782  1173   609  3-00:00:00      2000 mem_0064
      normal_0256    15    24    -9  3-00:00:00      2000 mem_0256
      normal_0128    93    51    42  3-00:00:00      2000 mem_0128
       devel_0128    10    20   -10    00:10:00     20000 mem_0128
         goodluck     0     0     0  3-00:00:00      1000 vsc3plus_0256,vsc3plus_0064,amd
              knl     4     1     3  3-00:00:00      1000 knl
      normal_binf    16     5    11  1-00:00:00      1000 binf
 gpu_gtx1080multi     4     2     2  3-00:00:00      2000 gpu_gtx1080multi
gpu_gtx1080single    50    18    32  3-00:00:00      2000 gpu_gtx1080single
         gpu_k20m     2     0     2  3-00:00:00      2000 gpu_k20m
          gpu_m60     1     1     0  3-00:00:00      2000 gpu_m60
    vsc3plus_0064   800   781    19  3-00:00:00      1000 vsc3plus_0064
    vsc3plus_0256    48    44     4  3-00:00:00      1000 vsc3plus_0256
   gpu_gtx1080amd     1     0     1  3-00:00:00      2000 gpu_gtx1080amd
</code>

naming convention:

^QOS    ^Partition^
|*_0064 |mem_0064 |

----

==== Specification in job script ====

<code bash>
#SBATCH --account=xxxxxx
#SBATCH --qos=xxxxx_xxxx
#SBATCH --partition=mem_xxxx
</code>

For omitted lines the corresponding defaults are used. See the previous slides; the default partition is "mem_0064".

==== Sample batch job ====

default:

<code bash>
#!/bin/bash
#SBATCH -J jobname
#SBATCH -N number_of_nodes

do_my_work
</code>

The job is submitted to:
  * partition mem_0064
  * qos normal_0064
  * the default account

explicit:

<code bash>
#!/bin/bash
#SBATCH -J jobname
#SBATCH -N number_of_nodes
#SBATCH --partition=mem_xxxx
#SBATCH --qos=xxxxx_xxxx
#SBATCH --account=xxxxxx

do_my_work
</code>

  * must be a shell script (first line!)
  * ''%%#SBATCH%%'' marks SLURM parameters
  * environment variables are set by SLURM for use within the script (e.g. ''%%SLURM_JOB_NUM_NODES%%'')

==== Job submission ====

<code>
sbatch job.sh
</code>

  * parameters are specified as in the job script
  * precedence: sbatch command-line parameters override parameters in the job script
  * be careful to place SLURM parameters **before** the job script name

==== Exercises ====

  * try these commands and find out which partition has to be used if you want to run in QOS 'devel_0128':

<code>
sqos
sqos -acc
</code>

  * find out which nodes are in the partition that allows running in 'devel_0128'; further, check how much memory these nodes have:

<code>
scontrol show partition ...
scontrol show node ...
</code>
  * submit a one-node job to QOS devel_0128 with the following commands:

<code>
hostname
free
</code>

==== Bad job practices ====

  * job submissions in a loop (takes a long time):

<code bash>
for i in {1..1000}
do
    sbatch job.sh $i
done
</code>

  * loop inside the job script (sequential mpirun commands):

<code bash>
for i in {1..1000}
do
    mpirun my_program $i
done
</code>

==== Array jobs ====

  * submit/run a series of **independent** jobs via a single SLURM script
  * each job in the array gets a unique identifier (SLURM_ARRAY_TASK_ID), based on which various workloads can be organized
  * example ([[examples/job_array.sh|job_array.sh]]), 10 jobs, SLURM_ARRAY_TASK_ID=1,2,3...10:

<code bash>
#!/bin/sh
#SBATCH -J array
#SBATCH -N 1
#SBATCH --array=1-10

echo "Hi, this is array job number" $SLURM_ARRAY_TASK_ID
sleep $SLURM_ARRAY_TASK_ID
</code>

  * independent jobs: 1, 2, 3 ... 10

<code>
VSC-4 > squeue -u $USER
        JOBID PARTITION  NAME USER ST    TIME NODES NODELIST(REASON)
406846_[7-10]  mem_0096 array   sh PD    0:00     1 (Resources)
     406846_4  mem_0096 array   sh  R INVALID     1 n403-062
     406846_5  mem_0096 array   sh  R INVALID     1 n403-072
     406846_6  mem_0096 array   sh  R INVALID     1 n404-031

VSC-4 > ls slurm-*
slurm-406846_10.out  slurm-406846_3.out  slurm-406846_6.out  slurm-406846_9.out
slurm-406846_1.out   slurm-406846_4.out  slurm-406846_7.out
slurm-406846_2.out   slurm-406846_5.out  slurm-406846_8.out

VSC-4 > cat slurm-406846_8.out
Hi, this is array job number 8
</code>

  * fine-tuning via built-in variables (SLURM_ARRAY_TASK_MIN, SLURM_ARRAY_TASK_MAX, ...)
  * example of going in chunks of a certain size, e.g. 5, SLURM_ARRAY_TASK_ID=1,6,11,16:

<code bash>
#SBATCH --array=1-20:5
</code>

  * example of limiting the number of simultaneously running jobs to 2 (perhaps for licenses):

<code bash>
#SBATCH --array=1-20:5%2
</code>

==== Single core jobs ====

  * use an entire compute node for several independent jobs
  * example: [[examples/single_node_multiple_jobs.sh|single_node_multiple_jobs.sh]]:

<code bash>
for ((i=1; i<=48; i++))
do
    stress --cpu 1 --timeout $i &
done
wait
</code>

  * ''%%&%%'': sends a process into the background, so the script can continue
  * ''%%wait%%'': waits for all processes in the background; otherwise the script would terminate immediately

==== Combination of array & single core job ====

  * example: [[examples/combined_array_multiple_jobs.sh|combined_array_multiple_jobs.sh]]:

<code bash>
...
#SBATCH --array=1-144:48

j=$SLURM_ARRAY_TASK_ID
((j+=47))

for ((i=$SLURM_ARRAY_TASK_ID; i<=$j; i++))
do
    stress --cpu 1 --timeout $i &
done
wait
</code>

==== Exercises ====

  * files are located in the folder ''%%examples/05_submitting_batch_jobs%%''
  * look into [[examples/job_array.sh|job_array.sh]] and modify it such that the considered range is from 1 to 20, but in steps of 5
  * look into [[examples/single_node_multiple_jobs.sh|single_node_multiple_jobs.sh]] and also change it to go in steps of 5
  * run [[examples/combined_array_multiple_jobs.sh|combined_array_multiple_jobs.sh]] and check whether the output is reasonable

==== Job/process setup ====

  * normal jobs:

^#SBATCH          ^job environment      ^
|-N               |SLURM_JOB_NUM_NODES  |
|--ntasks-per-core|SLURM_NTASKS_PER_CORE|
|--ntasks-per-node|SLURM_NTASKS_PER_NODE|
|--ntasks, -n     |SLURM_NTASKS         |

  * emails:

<code bash>
#SBATCH --mail-user=yourmail@example.com
#SBATCH --mail-type=BEGIN,END
</code>

  * constraints:

<code bash>
#SBATCH -t, --time=<time>
</code>

time format:
  * DD-HH[:MM[:SS]]

  * backfilling:
    * specify ''%%--time%%'' or ''%%--time-min%%'', which are estimates of the runtime of your job
    * runtimes shorter than the default (mostly 72h) may enable the scheduler to use idle nodes that are waiting for a larger job
  * get the remaining running time of your job:

<code>
squeue -h -j $SLURM_JOBID -o %L
</code>

==== Licenses ====

{{.:licenses.png}}

<code>
VSC-3 > slic
</code>

Within the SLURM submit script, add the flags as shown by 'slic', e.g. when both Matlab and Mathematica are required:

<code bash>
#SBATCH -L matlab@vsc,mathematica@vsc
</code>

Intel licenses are needed only when compiling code, not for running the resulting executables.

==== Reservation of compute nodes ====

  * core-h accounting is done for the entire period of the reservation
  * contact service@vsc.ac.at
  * reservations are named after the project id
  * check for reservations:

<code>
VSC-3 > scontrol show reservations
</code>

  * usage:

<code bash>
#SBATCH --reservation=<reservation_name>
</code>

==== Exercises ====

  * check for available reservations; if there is one available, use it
  * specify an email address that notifies you when the job has finished
  * run the following Matlab code in your job:

<code>
echo "2+2" | matlab
</code>

==== MPI + pinning ====

  * understand what your code is doing and place the processes correctly
  * use only a few processes per node if the memory demand is high
  * details for pinning: https://wiki.vsc.ac.at/doku.php?id=doku:vsc3_pinning

Example: two nodes with two MPI processes each:

=== srun ===

<code bash>
#SBATCH -N 2
#SBATCH --tasks-per-node=2

srun --cpu_bind=map_cpu:0,24 ./my_mpi_program
</code>

=== mpirun ===

<code bash>
#SBATCH -N 2
#SBATCH --tasks-per-node=2

export I_MPI_PIN_PROCESSOR_LIST=0,24   # Intel MPI syntax
mpirun ./my_mpi_program
</code>

==== Job dependencies ====

  - Submit the first job and get its <job_id>.
  - Submit the dependent job (and get its <job_id>):

<code bash>
#!/bin/bash
#SBATCH -J jobname
#SBATCH -N 2
#SBATCH -d afterany:<job_id>

srun ./my_program
</code>
  - Continue at 2. for further dependent jobs.
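The steps above can also be scripted: ''%%sbatch --parsable%%'' prints only the job id, which can be fed straight into ''%%--dependency%%'' for the next submission. A minimal sketch, assuming a job script named ''%%job.sh%%'' and a chain of three jobs; the stub function is only there so the loop can be tried on a machine without SLURM:

<code bash>
#!/bin/bash
# Chain three runs of job.sh: each job starts only after the previous one ended.
# 'sbatch --parsable' prints just the job id (no "Submitted batch job" prefix).
# NOTE: the stub below replaces sbatch only when testing off-cluster.
command -v sbatch >/dev/null 2>&1 || sbatch() { echo "$RANDOM"; }

jobid=$(sbatch --parsable job.sh)        # step 1: first job, no dependency
chain=1
for step in 2 3; do                      # step 2: submit dependent jobs
    jobid=$(sbatch --parsable --dependency=afterany:"$jobid" job.sh)
    chain=$((chain + 1))
done
echo "chained $chain jobs, last job id: $jobid"
</code>

''%%afterany%%'' starts the next job whether the previous one succeeded or failed; use ''%%afterok%%'' to continue only after success.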
----
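The chunk arithmetic used in combined_array_multiple_jobs.sh above can be checked without SLURM by setting SLURM_ARRAY_TASK_ID by hand. A sketch, with the ''%%stress%%'' call replaced by a counter so it runs anywhere:

<code bash>
#!/bin/bash
# With --array=1-144:48, the task ids are 1, 49, 97; each task handles 48 items.
# SLURM_ARRAY_TASK_ID is set manually here; inside a job, SLURM provides it.
export SLURM_ARRAY_TASK_ID=49

j=$SLURM_ARRAY_TASK_ID
((j+=47))                         # last item of this task's chunk

count=0
for ((i=$SLURM_ARRAY_TASK_ID; i<=$j; i++))
do
    count=$((count + 1))          # stand-in for: stress --cpu 1 --timeout $i &
done
echo "task $SLURM_ARRAY_TASK_ID handles items $SLURM_ARRAY_TASK_ID..$j ($count items)"
</code>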