In contrast to SGE, which was previously employed on VSC-1 and VSC-2, the scheduler on VSC-3, VSC-4, and VSC-5 is SLURM (Simple Linux Utility for Resource Management).
[…]$ sinfo
gives information on which partitions are available for job submission. Note: What SGE on VSC-2 termed a 'queue' is now called a 'partition' under SLURM.
[…]$ scontrol
is used to view the SLURM configuration, including job, job step, node, partition, reservation, and overall system configuration. Without a command entered on the execute line, scontrol operates in interactive mode and prompts for input. With a command entered on the execute line, scontrol executes that command and terminates.
[…]$ scontrol show job 567890
shows information on the job with number 567890.
[…]$ scontrol show partition
shows information on available partitions.
[…]$ squeue
shows the current list of submitted jobs, their state and resource allocation. Here is a description of the most important job reason codes returned by the squeue command.
On VSC-4 and VSC-5, spack is used to install and provide modules, see SPACK - a package manager for HPC systems. The methods described in modules can still be used for backwards compatibility, but we suggest using spack.
In order to set environment variables needed for a specific application, the module environment may be used:
module avail
lists the available application software, compilers, parallel environments, and libraries
module list
shows the currently loaded packages of your session
module unload <xyz>
unloads a particular package <xyz> from your session
module load <xyz>
loads a particular package <xyz> into your session
module display <xyz> OR module show <xyz>
shows module details such as the full path of the module file and all (or most) of the environment changes the modulefile will make if loaded
module purge
unloads all loaded modulefiles
To load or unload a selected module, copy and paste the name exactly as listed by module avail.
module load/unload directives may also be included in the top part of a job submission script.
When all required/intended modules have been loaded, user packages may be compiled as usual.
The compute nodes of VSC-4 are configured with the following parameters in SLURM:
CoresPerSocket=24 Sockets=2 ThreadsPerCore=2
And the primary nodes of VSC-5 with:
CoresPerSocket=64 Sockets=2 ThreadsPerCore=2
This reflects the fact that hyperthreading is activated on all compute nodes, so that 96 virtual cores on VSC-4 and 256 virtual cores on VSC-5 may be utilized on each node. In the batch script, hyperthreading is selected by adding the line
#SBATCH --ntasks-per-core=2
which allows for 2 tasks per core.
Some codes may experience a performance gain from using all virtual cores; GROMACS, for example, seems to profit. But note that using all virtual cores also leads to more communication and may degrade the performance of large MPI jobs.
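As a small illustration of the node parameters above, the number of virtual cores per node is the product of the three SLURM values. The helper name logical_cpus is hypothetical and only used for this sketch; the input values are the ones quoted for VSC-4 and VSC-5.

```shell
#!/bin/bash
# Virtual (logical) CPUs per node = Sockets * CoresPerSocket * ThreadsPerCore.
# 'logical_cpus' is a hypothetical helper name used for illustration only.
logical_cpus() {
    local sockets=$1 cores_per_socket=$2 threads_per_core=$3
    echo $(( sockets * cores_per_socket * threads_per_core ))
}

logical_cpus 2 24 2   # VSC-4 values from above: prints 96
logical_cpus 2 64 2   # VSC-5 values from above: prints 256
```

With ThreadsPerCore set to 1 in the call, the same helper gives the number of physical cores per node (48 and 128).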
NOTE on accounting: the project's core-h are always calculated as job_walltime * nnodes * ncpus, where ncpus is the number of physical cores per node. SLURM's built-in sreport command yields wrong accounting statistics because (depending on the job script) the multiplier is the number of virtual cores instead of the number of physical cores. You may instead use the accounting script introduced in this section.
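The accounting formula can be checked by hand with a few lines of shell. This is only a sketch of the arithmetic, not the accounting script mentioned above; the function name core_hours and the choice of seconds as the walltime unit are assumptions made for the example.

```shell
#!/bin/bash
# core-h = job_walltime [h] * nnodes * ncpus (physical cores per node).
# 'core_hours' is a hypothetical helper; the walltime is passed in seconds.
core_hours() {
    local walltime_s=$1 nnodes=$2 ncpus=$3
    awk -v t="$walltime_s" -v n="$nnodes" -v c="$ncpus" \
        'BEGIN { printf "%.2f\n", t / 3600 * n * c }'
}

# A 3-hour job on 2 VSC-4 nodes (48 physical cores each): prints 288.00
core_hours 10800 2 48
```

Note that the number of physical cores is used even when the job runs with --ntasks-per-core=2.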
On VSC-4 & VSC-5 there is a set of nodes that accept jobs that do not require entire exclusive nodes (anything from 1 core to less than a full node). These nodes are set up to accommodate different jobs from different users until they are full. They are automatically used for such types of jobs. All other nodes are assigned completely (and exclusively) to a job whenever the '-N' argument is used.
Depending on the demands of a certain application, the partition (grouping hardware according to its type) and quality of service (QOS; defining the run time etc.) can be selected. Additionally, the run time of a job can be limited in the job script to a value lower than the run time limit of the selected QOS. This enables backfilling, which may lead to a shorter waiting time in the queue.
It is recommended to write the job script using a text editor on the VSC Linux cluster or on any Linux/Mac system. Editors in Windows may add invisible characters to the job file which render it unreadable, so that it cannot be executed.
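The run time limit is passed with #SBATCH --time, which accepts, among other forms, plain minutes, HH:MM:SS, and D-HH:MM:SS. The sketch below converts a number of minutes into the longer notation; the helper name minutes_to_slurm_time is hypothetical, and the QOS limits themselves are not known here and must be looked up for the chosen QOS.

```shell
#!/bin/bash
# Convert a run time in minutes to the HH:MM:SS or D-HH:MM:SS notation
# accepted by '#SBATCH --time'. Hypothetical helper for illustration.
minutes_to_slurm_time() {
    local total=$1
    local days=$(( total / 1440 ))
    local hours=$(( (total % 1440) / 60 ))
    local mins=$(( total % 60 ))
    if (( days > 0 )); then
        printf '%d-%02d:%02d:00\n' "$days" "$hours" "$mins"
    else
        printf '%02d:%02d:00\n' "$hours" "$mins"
    fi
}

minutes_to_slurm_time 60     # prints 01:00:00
minutes_to_slurm_time 1500   # prints 1-01:00:00
```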
Assume a submission script check.slrm
#!/bin/bash
#
#SBATCH -J chk
#SBATCH -N 2
#SBATCH --ntasks-per-node=48
#SBATCH --ntasks-per-core=1
#SBATCH --mail-type=BEGIN                # first have to state the type of event to occur
#SBATCH --mail-user=<email@address.at>   # and then your email address

# when srun is used, you need to set:
srun -l -N2 -n96 a.out
# or
# mpirun -np 96 a.out
In order to send the job to specific queues, see Queue | Partition setup on VSC-4 or Queue | Partition setup on VSC-5.
[username@l42 ~]$ sbatch check.slrm       # to submit the job
[username@l42 ~]$ squeue -u `whoami`      # to check the status of own jobs
[username@l42 ~]$ scancel JOBID           # for premature removal, where JOBID
                                          # is obtained from the previous command
SLURM Script:
#SBATCH -N 3
#SBATCH --ntasks-per-node=2
#SBATCH -c 8

export OMP_NUM_THREADS=8
srun myhybridcode.exe
mpirun pins processes to cores. At least in the case of pure MPI processes (without any threads), the best performance has been observed with our default pinning (pinning to the physical cpus 0, 1, …, 15). If you need to use hybrid MPI/OpenMP, you may have to disable our default pinning by including the following lines in the job script (unset the default, then export the one list that matches your number of processes per node):
unset I_MPI_PIN_PROCESSOR_LIST
export I_MPI_PIN_PROCESSOR_LIST=0,8                    # configuration for 2 processes / node
export I_MPI_PIN_PROCESSOR_LIST=0,4,8,12               # 4 processes / node
export I_MPI_PIN_PROCESSOR_LIST=0,2,4,6,8,10,12,14     # 8 processes / node
or use the shell script:
if [ $PROC_PER_NODE -gt 1 ]
then
    unset I_MPI_PIN_PROCESSOR_LIST
    if [ $PROC_PER_NODE -eq 2 ]
    then
        export I_MPI_PIN_PROCESSOR_LIST=0,8                  # configuration for 2 processes / node
    elif [ $PROC_PER_NODE -eq 4 ]
    then
        export I_MPI_PIN_PROCESSOR_LIST=0,4,8,12             # 4 processes / node
    elif [ $PROC_PER_NODE -eq 8 ]
    then
        export I_MPI_PIN_PROCESSOR_LIST=0,2,4,6,8,10,12,14   # 8 processes / node
    else
        export I_MPI_PIN=disable
    fi
fi
See also the Intel Environment Variables.
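The three processor lists above follow one pattern: the processes are spread evenly over the 16 physical cpus. A minimal sketch that generates such a list for other process counts is given below. The function name pin_list is hypothetical, and the even-stride rule is inferred from the listed values rather than taken from the Intel documentation.

```shell
#!/bin/bash
# Build an evenly spaced cpu list for I_MPI_PIN_PROCESSOR_LIST.
# 'pin_list' is a hypothetical helper; arguments: processes per node,
# physical cores per node.
pin_list() {
    local procs=$1 cores=$2
    if (( procs < 1 || procs > cores )); then
        echo "pin_list: need 1 <= procs <= cores" >&2
        return 1
    fi
    local stride=$(( cores / procs ))
    local list="" i
    for (( i = 0; i < procs; i++ )); do
        list+="${list:+,}$(( i * stride ))"
    done
    echo "$list"
}

pin_list 2 16   # prints 0,8
pin_list 4 16   # prints 0,4,8,12
pin_list 8 16   # prints 0,2,4,6,8,10,12,14
```

In a job script the result could be used as export I_MPI_PIN_PROCESSOR_LIST=$(pin_list 4 16), after the unset shown above.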
This example is for using a set of 4 nodes to compute a series of jobs in two stages, each of them split into two separate subjobs.
vi check.slrm
#!/bin/bash
#
#SBATCH -J chk
#SBATCH -N 4
#SBATCH --ntasks-per-node=48
#SBATCH --ntasks-per-core=1

scontrol show hostnames $SLURM_NODELIST > ./nodelist

srun -l -N2 -r0 -n96 job1.scrpt &
srun -l -N2 -r2 -n96 job2.scrpt &
wait

srun -l -N2 -r2 -n96 job3.scrpt &
srun -l -N2 -r0 -n96 job4.scrpt &
wait
the file 'nodelist' has been written for information only;
it is important to send the jobs into the background (&) and insert the 'wait' at each synchronization point;
with -r2 one can define an offset in the node list; in particular, -r2 means taking nodes number 2 and 3 from the set of four (where the list starts with node number 0). Hence a combination of -N, -r, and -n allows full control over all involved cores and the tasks they are going to be used for.
Given a large number of single-core jobs, the aim is to schedule them such that the required core hours are minimized. The following script sends 64 tasks to a single node. The first 16 tasks are distributed to the 16 cores. Whenever a task terminates, the next task is started.
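To make the interplay of -N, -r, and -n explicit, the following sketch prints (but does not run) the srun lines for one stage. The helper name print_stage is hypothetical; the node counts and script names are those of the example above.

```shell
#!/bin/bash
# Print the srun lines for one stage: each subjob gets 'nodes_per_step'
# consecutive nodes, starting at offset 0 in the node list.
# Hypothetical dry-run helper; nothing is submitted.
print_stage() {
    local total_nodes=$1 nodes_per_step=$2 tasks_per_node=$3
    shift 3
    local offset=0 script
    for script in "$@"; do
        if (( offset + nodes_per_step > total_nodes )); then
            echo "print_stage: not enough nodes for $script" >&2
            return 1
        fi
        echo "srun -l -N${nodes_per_step} -r${offset} -n$(( nodes_per_step * tasks_per_node )) ${script} &"
        offset=$(( offset + nodes_per_step ))
    done
    echo "wait"
}

# 4 nodes, 2 nodes per subjob, 48 tasks per node
print_stage 4 2 48 job1.scrpt job2.scrpt
```

For these values the printed lines match the first stage of the script above: offsets 0 and 2, 96 tasks each, followed by wait.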
#!/bin/bash

#SBATCH -J TEST                ## name
#SBATCH -N 1                   ## 1 node
#SBATCH --ntasks=16            ## number of tasks per node
#SBATCH --time=08:00:00

tasks_to_be_done=64            ## total number of tasks to be scheduled:
                               ## set this to your preferred number
max_tasks=16                   ## number of tasks per node.
current_task=0                 ## initialization
running_tasks=0                ## initialization

while (($current_task < $tasks_to_be_done))
do
   ## "test" MUST BE REPLACED by your application
   ## counts the number of tasks currently running
   running_tasks=`ps -C test --no-headers | wc -l`

   while (($running_tasks < $max_tasks && ${current_task} < ${tasks_to_be_done}))
   do
      ((current_task++))
      ## run application (here named 'test_--.sh'): replace "test" accordingly
      ./test_${current_task}.sh -${current_task} &   ## run executables test_1.sh, test_2.sh, ...
      running_tasks=`ps -C test --no-headers | wc -l`
   done
done
wait
The test_--.sh scripts were of the form
#!/bin/bash
sleep 10
date
hostname
ps | grep test | grep -v grep | wc -l
Another example for single core jobs with the parallelization in the innermost loop:
#!/bin/bash
#
#SBATCH -J testjob
#SBATCH -N 1
#SBATCH --ntasks-per-node=16
#SBATCH --ntasks-per-core=1
##SBATCH --mail-type=BEGIN               # first have to state the type of event to occur
##SBATCH --mail-user=user@example.com    # and then your email address
#SBATCH --time=60                        # to use backfilling of the slurm queues

export max_num_tasks=16
export executable=dt24
# dk=100

for n in {1027..20000..231}; do
  echo
  for k in $((n)) $((n-1)) $((n/2)) $((n/4)) $((n/10)) ; do
    for rdb in $(eval echo 1 {8..16..4}); do
      for blocksize in $(eval echo {64..$n..32}); do
        ./$executable $n $k $rdb $blocksize $dk >dt2.${n}.${k}.${rdb}.${blocksize} 2>&1 &
        ############################## Here starts the parallelization
        while (("$(ps -C $executable --no-headers | wc -l)" == $max_num_tasks )); do
          sleep 10 # replace by appropriate value
        done
        ############################## Here ends the parallelization
      done
    done
  done
done
wait # wait for final jobs to finish
(The SLURM-inherent command #SBATCH --array starting_value-end_value:stepwidth does not provide this functionality since nodes are exclusively occupied by one job.)
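Both scripts above limit the number of concurrent tasks by polling ps. A different technique, shown here only as a sketch and not as the method used on VSC, is to let the shell itself block until one background task exits, using wait -n (this requires bash 4.3 or newer). The names run_throttled and my_task are hypothetical; my_task stands in for your application.

```shell
#!/bin/bash
# Placeholder for the real single-core application (hypothetical).
my_task() {
    sleep 0.1
    echo "task $1 done"
}

# Run my_task once per argument, with at most $1 tasks in the background.
run_throttled() {
    local max_tasks=$1
    shift
    local running=0 arg
    for arg in "$@"; do
        if (( running >= max_tasks )); then
            wait -n                      # block until any one task exits (bash >= 4.3)
            running=$(( running - 1 ))
        fi
        my_task "$arg" &
        running=$(( running + 1 ))
    done
    wait                                 # wait for the remaining tasks
}

# Five tasks, at most two at a time
run_throttled 2 1 2 3 4 5
```

Compared with polling, this avoids choosing a sleep interval and does not depend on the process name of the application. The order in which the tasks report completion is not fixed.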
Here are examples how to run multiple MPI jobs on a single node
In general, a host file is not needed. However, there may be situations where a host file is necessary, for example for Mathematica batch jobs. A host machines file may be generated with the following script:
#!/bin/bash
#
#SBATCH -J par                      # job name
#SBATCH -N 2                        # number of nodes=2
#SBATCH --ntasks-per-node=48        # uses all cpus of one node
#SBATCH --ntasks-per-core=1
#SBATCH --threads-per-core=1

module load Mathematica/10.0.2      # load desired version

scontrol show hostnames $SLURM_NODELIST > ./nodelist

rm -f machines_tmp                  # -f: no error if the file does not exist yet

tasks_per_node=48                   # change number accordingly
nodes=2                             # change number accordingly
for ((line=1; line<=nodes; line++))
do
  for ((tN=1; tN<=tasks_per_node; tN++))
  do
    head -$line nodelist | tail -1 >> machines_tmp
  done
done

echo "this is the content of the host machines file"
cat machines_tmp
Note that tasks_per_node and nodes have to be changed accordingly.
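The nested loop above repeats every host name tasks_per_node times. The same expansion can be sketched as a small function that takes the host names as arguments, so that the number of nodes no longer has to be set by hand. The name make_machines and the host names in the example call are hypothetical.

```shell
#!/bin/bash
# Print each given host name 'tasks_per_node' times, one per line.
# 'make_machines' is a hypothetical helper for illustration.
make_machines() {
    local tasks_per_node=$1
    shift
    local host i
    for host in "$@"; do
        for (( i = 0; i < tasks_per_node; i++ )); do
            echo "$host"
        done
    done
}

# Example with two made-up host names and 2 tasks per node
make_machines 2 nodeA nodeB
```

Inside a job, the host names could be supplied with make_machines 48 $(scontrol show hostnames $SLURM_NODELIST) > machines_tmp.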
Slurm is no longer configured to automatically requeue jobs which were aborted due to node failures. If you want such jobs to be requeued automatically, you can ask for this with the following option in your job script:
#SBATCH --requeue
#!/bin/bash
#SBATCH -J array_job
#SBATCH -N 1
#SBATCH --tasks-per-node=16

#run all tasks from index 1 to 1000 with stepwidth 3, i.e. 1,4,7, ...
#SBATCH --array=1-1000:3

#useful variables:
#$SLURM_ARRAY_JOB_ID
#$SLURM_ARRAY_TASK_ID
#$SLURM_JOB_NODELIST

./start_my_task $SLURM_ARRAY_TASK_ID
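To see which values SLURM_ARRAY_TASK_ID takes for a given --array=first-last:step specification, the index list can be reproduced with seq. This is only an illustration of the index arithmetic; the helper name array_indices is hypothetical.

```shell
#!/bin/bash
# List the task indices generated by '--array=first-last:step'.
# 'array_indices' is a hypothetical helper built on seq.
array_indices() {
    local first=$1 last=$2 step=$3
    seq "$first" "$step" "$last"
}

array_indices 1 1000 3 | head -n 3   # prints 1, 4, 7 (one per line)
array_indices 1 1000 3 | wc -l       # 334 array tasks in total
```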
Dependent jobs can be used to create a job chain that is executed one job after another.
#!/bin/bash
#SBATCH -J jobname
#SBATCH -N 2
#SBATCH -d afterany:<job_id>
srun ./my_program
ERROR_MEMORY=200
ERROR_INFINIBAND_HW=201
ERROR_INFINIBAND_SW=202
ERROR_IPOIB=203
ERROR_BEEGFS_SERVICE=204
ERROR_BEEGFS_USER=205
ERROR_BEEGFS_SCRATCH=206
ERROR_NFS=207
ERROR_USER_GROUP=220
ERROR_USER_HOME=221
ERROR_GPFS_START=228
ERROR_GPFS_MOUNT=229
ERROR_GPFS_UNMOUNT=230
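Instead of writing the job id into each script by hand, a chain can also be built at submission time by passing the dependency on the command line. The sketch below uses sbatch --parsable, which prints only the job id (optionally followed by ';' and the cluster name). The function name submit_chain, the SBATCH override variable, and the script names in the usage line are assumptions made for this example.

```shell
#!/bin/bash
# Submit the given job scripts as a chain: each job starts only after
# the previous one has ended (afterany). Hypothetical helper; the
# submit command can be overridden via the SBATCH variable for testing.
submit_chain() {
    local submit=${SBATCH:-sbatch}
    local prev="" script jobid
    for script in "$@"; do
        if [ -z "$prev" ]; then
            jobid=$("$submit" --parsable "$script") || return 1
        else
            jobid=$("$submit" --parsable --dependency="afterany:$prev" "$script") || return 1
        fi
        jobid=${jobid%%;*}               # strip an optional ';cluster' suffix
        echo "submitted $script as job $jobid"
        prev=$jobid
    done
}
```

A call such as submit_chain step1.slrm step2.slrm step3.slrm would then submit three jobs that run one after another; with this approach the #SBATCH -d line is not needed inside the scripts.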