====== SLURM ======

  * Article written by Markus Stöhr (VSC Team)



==== Quickstart ====

script [[examples/...|job.sh]]:

<code>
#!/bin/bash
#SBATCH -J h5test
#SBATCH -N 1

module purge
module load gcc/5.3 intel-mpi/5 hdf5/...

cp $VSC_HDF5_ROOT/.../ph5example.c .
mpicc -lhdf5 ph5example.c -o ph5example

mpirun -np 8 ./ph5example
</code>
submission:

<code>
$ sbatch job.sh
Submitted batch job 5250981
</code>

check what is going on:

<code>
squeue -u $USER
</code>
<code>
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
5250981  mem_0064   h5test      ...
</code>
Output files:

<code>
ParaEg0.h5
ParaEg1.h5
slurm-5250981.out
</code>
try h5dump on the .h5 files:

<code>
h5dump
</code>

cancel jobs:

<code>
scancel <job id>
</code>
or

<code>
scancel -n <job name>
</code>
or

<code>
scancel -u $USER
</code>
===== Basic concepts =====

==== Queueing system ====

  * job/batch script:
    * a shell script that does everything needed to run your calculation
    * independent of the queueing system
    * **use simple scripts** (max 50 lines, i.e. put complicated logic elsewhere)
    * load modules from scratch (purge, then load)

  * tell the scheduler where/how to run jobs:
    * #nodes
    * node type
    * …

  * the scheduler manages the allocation of jobs to compute nodes



{{:...}}

==== SLURM: Accounts and Users ====

{{:...}}


==== SLURM: Partition and Quality of Service ====

{{:...}}

==== VSC-3 Hardware Types ====

^ partition      ^ RAM (GB) ^ CPU                  ^ ... ^
| mem_0064*      | 64       | ...                  | ... |
| mem_0128       | 128      | ...                  | ... |
| mem_0256       | 256      | ...                  | ... |
| vsc3plus_0064  | 64       | ...                  | ... |
| vsc3plus_0256  | 256      | ...                  | ... |
| knl            | ...      | ...                  | ... |
| haswell        | ...      | ...                  | ... |
| binf           | ...      | ...                  | ... |
| amd            | 128, 256 | AMD EPYC 7551, 7551P | ... |

  * * = default partition; QDR: Intel Infinipath, FDR: Mellanox ConnectX-3

effective: 10/2018

  * + GPU nodes (see later)
  * specify the partition in the job script:

<code>
#SBATCH -p <partition>
</code>

----

==== VSC-3 Hardware Types ====

  * display information about partitions and their nodes:

<code>
sinfo -o %P
scontrol show partition mem_0064
scontrol show node n01-001
</code>

==== QOS-Account/User assignment ====


{{:...}}

1.+2.:

<code>
sqos -acc
</code>

<code>
default_account:                  ...
        account:                  ...

    default_qos:          normal_0064
            qos:           devel_0128
                             goodluck
                       gpu_gtx1080amd
                     gpu_gtx1080multi
                                  ...
</code>


==== QOS-Partition assignment ====


3.:

<code>
sqos
</code>
<code>
            qos_name  total  used  free     walltime  ...
=========================================================================
                 ...
          devel_0128    ...
            goodluck    ...
                 ...
    gpu_gtx1080multi    ...
                 ...
            gpu_k20m    ...
                 ...
      gpu_gtx1080amd    ...
</code>
naming convention:

^ QOS    ^ partition ^
| *_0064 | mem_0064  |


----

==== Specification in job script ====


<code>
#SBATCH --account=xxxxxx
#SBATCH --qos=xxxxx_xxxx
#SBATCH --partition=mem_xxxx
</code>
If any of these lines is omitted, the corresponding default is used (see the previous slides); the default partition is “mem_0064”.

==== Sample batch job ====

default:

<code>
#!/bin/bash
#SBATCH -J jobname
#SBATCH -N number_of_nodes

do_my_work
</code>
the job is submitted to:

  * partition mem_0064
  * qos normal_0064
  * the default account

explicit:

<code>
#!/bin/bash
#SBATCH -J jobname
#SBATCH -N number_of_nodes
#SBATCH ...
#SBATCH ...
#SBATCH --partition=mem_xxxx
#SBATCH --qos=xxxxx_xxxx
#SBATCH --account=xxxxxx

do_my_work
</code>


  * must be a shell script (first line!)
  * ‘#SBATCH’ lines are read by the scheduler; for the shell they are just comments
  * environment variables are set by SLURM for use within the script (e.g. ''SLURM_JOB_ID'')

==== Job submission ====

<code>
sbatch <SLURM parameters> <job script> <parameters for job script>
</code>
  * parameters can be given on the command line just as in the job script
  * precedence: sbatch command-line parameters override parameters in the job script
  * be careful to place SLURM parameters **before** the job script (see the sketch below)
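
For illustration, a hedged example of this precedence (script name and values are placeholders): options given before the script override the corresponding ‘#SBATCH’ lines.

<code>
# assumes job.sh contains "#SBATCH -J h5test" and "#SBATCH -N 1";
# the command-line options below take precedence over those directives
sbatch -J other_name -N 2 --qos=devel_0128 job.sh
</code>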

==== Exercises ====

  * try these commands and find out which partition has to be used if you want to run in the QOS ‘devel_0128’:

<code>
sqos
sqos -acc
</code>
  * find out which nodes are in the partition that allows running in ‘devel_0128’; also check how much memory these nodes have:

<code>
scontrol show partition ...
scontrol show node ...
</code>
  * submit a one-node job to the QOS devel_0128 that runs the following commands:

<code>
hostname
free
</code>
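
A minimal sketch of such a job script; the partition is an assumption based on the naming convention above (devel_0128 → mem_0128), so verify it with ''sqos'' first:

<code>
#!/bin/bash
#SBATCH -J exercise            # job name (arbitrary)
#SBATCH -N 1                   # one node
#SBATCH --qos=devel_0128
#SBATCH --partition=mem_0128   # assumed partition matching *_0128

hostname
free
</code>
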
==== Bad job practices ====

  * looped job submission (takes a long time):

<code>
for i in {1..1000}
do
    sbatch job.sh $i
done
</code>

  * loop in the job (sequential mpirun commands):

<code>
for i in {1..1000}
do
    mpirun my_program $i
done
</code>

==== Array job ====

  * run similar, **independent** jobs at once that can be distinguished by **one parameter**
  * each task is treated as a separate job
  * example ([[examples/...]]):

<code>
#!/bin/sh
#SBATCH -J array
#SBATCH -N 1
#SBATCH --array=1-30:7

./sleep.sh $SLURM_ARRAY_TASK_ID
</code>
  * computed tasks: 1, 8, 15, 22, 29

<code>
5605039_[15-29]  mem_0064   array  ...
5605039_1        mem_0064   array  ...
5605039_8        mem_0064   array  ...
</code>
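
The called ''sleep.sh'' is not shown on the slide; a minimal sketch of what such a helper could look like (the echo text and sleep duration are assumptions):

<code>
#!/bin/bash
# hypothetical helper: pretend to work for a number of seconds
# derived from the array task ID passed as the first argument
echo "task $1 running on $(hostname)"
sleep $(( 10 + $1 ))
</code>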



useful variables within the job:

<code>
SLURM_ARRAY_JOB_ID
SLURM_ARRAY_TASK_ID
SLURM_ARRAY_TASK_STEP
SLURM_ARRAY_TASK_MAX
SLURM_ARRAY_TASK_MIN
</code>

limit the number of simultaneously running tasks to 2:

<code>
#SBATCH --array=1-30:7%2
</code>


==== Single core ====

  * use a complete compute node for several tasks at once

  * example: [[examples/...]]

<code>
...

max_num_tasks=16

...

for i in `seq $task_start $task_increment $task_end`
do
   ./... $i &
   check_running_tasks    # sleeps as long as max_num_tasks are running
done
wait
</code>

  * ‘&’: sends the process into the background, so the loop can start the next task immediately
  * ‘wait’: waits for all background processes; otherwise the script (and with it the job) would finish while tasks are still running
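
The helper ''check_running_tasks'' is not shown on the slide; a minimal bash sketch of what it could do (the function name is kept, the implementation is an assumption):

<code>
check_running_tasks () {
    # block while the number of running background jobs is at the limit
    while [ $(jobs -r | wc -l) -ge $max_num_tasks ]
    do
        sleep 1
    done
}
</code>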



==== Array job + single core ====

[[examples/...]]:

<code>
...
#SBATCH --array=1-100:...

...

task_start=$SLURM_ARRAY_TASK_ID
task_end=$(( $SLURM_ARRAY_TASK_ID + $SLURM_ARRAY_TASK_STEP - 1 ))
if [ $task_end -gt $SLURM_ARRAY_TASK_MAX ]; then
   task_end=$SLURM_ARRAY_TASK_MAX
fi
task_increment=1

...

for i in `seq $task_start $task_increment $task_end`
do
   ./... $i &
   check_running_tasks
done
wait
</code>
==== Exercises ====

  * the files are located in the folder ''...''
  * download or copy [[examples/...]]
  * run [[examples/...]]
  * start a job for [[examples/...]]
  * run [[examples/...]]

==== Job/process setup ====

  * normal jobs:

^ #SBATCH             ^ environment variable    ^
| -N                  | SLURM_JOB_NUM_NODES     |
| --ntasks-per-core   | SLURM_NTASKS_PER_CORE   |
| --ntasks-per-node   | SLURM_NTASKS_PER_NODE   |
| --ntasks-per-socket | SLURM_NTASKS_PER_SOCKET |
| --ntasks, -n        | SLURM_NTASKS            |
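
A short sketch of how a directive and its environment variable line up inside a job (the values are placeholders):

<code>
#!/bin/bash
#SBATCH -J setup_test
#SBATCH -N 2
#SBATCH --ntasks-per-node=16

# inside the job SLURM exports the matching variables
echo "nodes:          $SLURM_JOB_NUM_NODES"
echo "tasks per node: $SLURM_NTASKS_PER_NODE"
</code>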

  * emails:

<code>
#SBATCH --mail-user=yourmail@example.com
#SBATCH --mail-type=BEGIN,END
</code>

  * constraints:

<code>
#SBATCH -C, --constraint=<feature>
#SBATCH --gres=<resource>

#SBATCH -t, --time=<time>
#SBATCH --time-min=<time>
</code>

Valid time formats:

  * MM
  * [HH:]MM:SS
  * DD-HH[:MM[:SS]]
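
For illustration, a few ways to request a time limit in these formats (the values are arbitrary examples):

<code>
#SBATCH --time=30          # 30 minutes
#SBATCH --time=01:30:00    # 1 hour 30 minutes
#SBATCH --time=2-12:00     # 2 days 12 hours
</code>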



  * backfilling:
    * specify a ‘--time’ or ‘--time-min’ that is realistic for your job
    * short runtimes may enable the scheduler to run your job on idle nodes that are waiting for a large job
  * get the remaining running time of your job:

<code>
squeue -h -j $SLURM_JOBID -o %L
</code>


==== Licenses ====

{{:...}}


<code>
slic
</code>
Within the job script add the license flags as shown by ‘slic’, e.g. for using both Matlab and Mathematica:

<code>
#SBATCH -L matlab@vsc,mathematica@vsc
</code>
Intel licenses are needed only for compiling code, not for running it!

==== Reservations of compute nodes ====

  * core-h accounting is done for the full reservation time
  * contact us if you need a reservation
  * reservations are named after the project id

  * check for reservations:

<code>
scontrol show reservations
</code>
  * use a reservation:

<code>
#SBATCH --reservation=<reservation name>
</code>


==== Exercises ====

  * check for available reservations; if there is one available, use it
  * specify an email address that notifies you when the job has finished
  * run the following Matlab code in your job:

<code>
echo "..."
</code>
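
A sketch of a job script combining these points; the reservation name, the Matlab module name, and the Matlab snippet are assumptions, so substitute what ''scontrol show reservations'' and ''module avail'' report:

<code>
#!/bin/bash
#SBATCH -J matlab_test
#SBATCH -N 1
#SBATCH --reservation=training            # assumed reservation name
#SBATCH --mail-user=yourmail@example.com
#SBATCH --mail-type=END                   # mail when the job has finished

module purge
module load Matlab                        # assumed module name

echo "disp(2+2); exit" | matlab -nodisplay
</code>
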
==== MPI + NTASKS_PER_NODE + pinning ====

  * understand what your code is doing and place the processes correctly
  * use only a few processes per node if the memory demand is high
  * details for pinning: https://...

Example: two nodes with two MPI processes each:

=== srun ===

<code>
#SBATCH -N 2
#SBATCH --tasks-per-node=2

srun --cpu_bind=map_cpu:... ./...
</code>

=== mpirun ===

<code>
#SBATCH -N 2
#SBATCH --tasks-per-node=2

export I_MPI_PIN_PROCESSOR_LIST=0,...
mpirun ./...
</code>
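
A filled-in sketch for comparison, assuming dual-socket nodes on which core IDs 0 and 8 lie on different sockets and a binary called ''./my_mpi_program'' (both are assumptions):

<code>
#!/bin/bash
#SBATCH -J pinning_test
#SBATCH -N 2
#SBATCH --tasks-per-node=2

# srun variant: pin the two tasks of each node to cores 0 and 8
srun --cpu_bind=map_cpu:0,8 ./my_mpi_program

# Intel MPI variant (alternative): same placement via the pin list
#export I_MPI_PIN_PROCESSOR_LIST=0,8
#mpirun ./my_mpi_program
</code>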


==== Job dependencies ====

  - Submit the first job and note its <job id>
  - Submit the dependent job (and note its <job id> for further dependent jobs):

<code>
#!/bin/bash
#SBATCH -J jobname
#SBATCH -N 2
#SBATCH -d afterany:<job id>
srun ./...
</code>
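
A hedged sketch of the two steps on the command line (''job1.sh'' and ''job2.sh'' are placeholder script names); ''sbatch --parsable'' prints only the job id, which makes it easy to capture:

<code>
# step 1: submit the first job and capture its job id
jobid=$(sbatch --parsable job1.sh)

# step 2: submit the dependent job, to start only after the first one ended
sbatch -d afterany:$jobid job2.sh
</code>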


----