===== SLURM =====
==== Submission of batch jobs ====
The workload scheduler ''%%SLURM%%'' (Simple Linux Utility for Resource Management) is used as the batch system. https://www.schedmd.com/
Basic commands:
sinfo # list information about partitions and node states
squeue # list jobs in queue
sbatch # submit batch job
scancel # cancel job
srun ... # run a parallel job within the SLURM environment
scontrol # show various information on jobs and nodes
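For example, ''%%scontrol%%'' can show the details of a partition or of a single node; a minimal sketch, using the partition and node names of this cluster:
scontrol show partition E5-2690v4 # time limit, QOS and nodes of a partition
scontrol show node c1-01 # CPUs, memory and state of a node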
----
===== SLURM =====
Example ''%%sinfo%%'':
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
E5-2690v4* up 7-00:00:00 1 down* c1-10
E5-2690v4* up 7-00:00:00 19 idle c1-[01-09,11-12],c2-[01-08]
Phi up 7-00:00:00 1 drain c3-01
Phi up 7-00:00:00 7 idle c3-[02-08]
E5-1650v4 up 7-00:00:00 1 down* c4-16
E5-1650v4 up 7-00:00:00 1 drain c4-01
E5-1650v4 up 7-00:00:00 14 idle c4-[02-15]
E5-1650v3 up 7-00:00:00 1 idle c5-01
* down … node is unreachable by the slurm control daemon
* drain … node is not available for job allocation
* allocated … node is used by a job
* idle … node is available for job allocation
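''%%sinfo%%'' can also filter nodes by state with its standard ''%%-t%%'' option, e.g.:
sinfo -t idle # list only idle nodes
sinfo -t drain,down # list only drained or down nodes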
----
===== SLURM =====
List the names, number of nodes and node names of the available hardware partitions:
sinfo -o "%.10R %.5D %.N"
PARTITION NODES NODELIST
E5-2690v4 20 c1-[01-12],c2-[01-08]
Phi 8 c3-[01-08]
E5-1650v4 16 c4-[01-16]
E5-1650v3 1 c5-01
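Further standard format fields can be added in the same way, e.g. ''%%%c%%'' (CPUs per node) and ''%%%m%%'' (memory per node in MB) for a node-oriented listing:
sinfo -N -o "%.10N %.5c %.8m"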
----
===== SLURM =====
To list node states of a specific partition:
sinfo -p Phi
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
Phi up 7-00:00:00 1 drain c3-01
Phi up 7-00:00:00 7 idle c3-[02-08]
----
===== SLURM =====
=== Submit jobs ===
Example job script ''%%/opt/ohpc/pub/examples/slurm/mul/mpi/job.sh%%'':
#!/bin/bash
#
#SBATCH -J your_job_name_here # name of job to appear in squeue
#SBATCH -N 1 # number of nodes requested
#SBATCH -o job.%j.out # filename for stdout
#SBATCH -p E5-2690v4 # specify hardware partition
#SBATCH -q E5-2690v4-batch # specify quality of service (QOS)
#SBATCH --ntasks-per-node=28 # request number of tasks for your job
#SBATCH --threads-per-core=1 # specify number of threads per core
env|grep SLURM # list of SLURM environment variables to use in job script
module purge
module load gnu7/7.2.0 openmpi/1.10.7 prun
echo
module list
echo
which gcc
which mpicc
mpicc hello.c -o hello # compile and optimize your code directly on hardware
time mpirun -np $SLURM_NPROCS ./hello # run your job with mpirun
echo
time prun ./hello # run your job via prun wrapper script
----
===== SLURM =====
=== Submit jobs ===
The modules ''%%gnu7/7.2.0%%'' and ''%%openmpi/1.10.7%%'' are loaded by the job script for this example:
sbatch job.sh
sbatch: info for user at sbatch
sbatch: current partition E5-2690v4
sbatch: current qos E5-2690v4-batch
Submitted batch job 548
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
548 E5-2690v4 job_nam usernam R 0:01 1 c1-01
To cancel the job:
scancel 548
----
===== SLURM =====
==== Output files ====
* If no file is specified, STDOUT will be written to a file called ''%%slurm-<jobid>.out%%''
* In the example of the previous slides STDOUT is written to ''%%job.<jobid>.out%%''
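STDOUT and STDERR can also be redirected separately; ''%%%j%%'' in the filename expands to the job ID (standard ''%%sbatch%%'' filename patterns):
#SBATCH -o job.%j.out # STDOUT to job.<jobid>.out
#SBATCH -e job.%j.err # STDERR to job.<jobid>.err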
----
===== SLURM =====
If no free nodes are available, the job will be shown as pending (PD):
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
548 E5-2690v4 job_nam usernam PD 0:00 1 (Priority)
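For pending jobs, an estimated start time can be shown with the standard ''%%--start%%'' option (the estimate changes as other jobs finish):
squeue --start -u <username>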
----
===== SLURM =====
Customized format of ''%%squeue%%'':
squeue -o "%.10A %.10u %.10g %.10P %.10p %.20S %.10j %.10D %.10N %.10T"
JOBID USER GROUP PARTITION PRIORITY START_TIME NAME NODES NODELIST STATE
555 user_nam group_na E5-2690v4 0.00000023 2018-01-12T12:21:30 mpitest 1 c1-01 RUNNING
To show only the jobs of a specific user:
squeue -u <username>
----
===== SLURM =====
==== Exercise ====
* create a job script as in the previous slides (or copy it from: /opt/ohpc/pub/examples/slurm/mul/mpi/job.sh)
* submit the job
* inspect the output file
----
===== SLURM =====
scontrol show job 548
JobId=548 JobName=job_name
UserId=user_name(1000) GroupId=user_name(1000) MCS_label=N/A
Priority=1015 Nice=0 Account=(null) QOS=e5-2690v4-batch
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:02 TimeLimit=7-00:00:00 TimeMin=N/A
SubmitTime=2018-01-12T11:58:47 EligibleTime=2018-01-12T11:58:47
StartTime=2018-01-12T11:58:47 EndTime=2018-01-19T11:58:47 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2018-01-12T11:58:47
Partition=E5-2690v4 AllocNode:Sid=mul-hpc-81a-mgmt:27864
ReqNodeList=(null) ExcNodeList=(null)
NodeList=c1-01
BatchHost=c1-01
NumNodes=1 NumCPUs=28 NumTasks=28 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=28,mem=128655M,node=1,billing=40
Socks/Node=* NtasksPerN:B:S:C=28:0:*:* CoreSpec=*
MinCPUsNode=28 MinMemoryNode=128655M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/user_name/jz/mpi/job.sh
WorkDir=/home/user_name/jz/mpi
StdErr=/home/user_name/jz/mpi/job.548.out
StdIn=/dev/null
StdOut=/home/user_name/jz/mpi/job.548.out
Power=
----
===== SLURM =====
==== Interactive jobs (1) ====
To get direct access to a compute node (this example can be found in ''%%/opt/ohpc/pub/examples/slurm/mul/mpi/srun_example%%''):
srun -J test -p E5-2690v4 --qos E5-2690v4-batch -N 1 --ntasks-per-node=28 --pty /bin/bash
srun: info for user at sbatch
srun: current partition E5-2690v4
srun: current qos E5-2690v4-batch
[user_name@c1-01 ~]$
[user_name@c1-01 ~]$ prun hello
[prun] Master compute host = c1-01
[prun] Resource manager = slurm
[prun] Launch cmd = mpirun hello (family=openmpi)
Hello, world (28 procs total)
--> Process # 11 of 28 is alive. -> c1-01
--> Process # 12 of 28 is alive. -> c1-01
--> Process # 13 of 28 is alive. -> c1-01
--> Process # 15 of 28 is alive. -> c1-01
--> Process # 16 of 28 is alive. -> c1-01
--> Process # 17 of 28 is alive. -> c1-01
--> Process # 19 of 28 is alive. -> c1-01
--> Process # 20 of 28 is alive. -> c1-01
--> Process # 21 of 28 is alive. -> c1-01
--> Process # 23 of 28 is alive. -> c1-01
--> Process # 24 of 28 is alive. -> c1-01
--> Process # 25 of 28 is alive. -> c1-01
--> Process # 27 of 28 is alive. -> c1-01
--> Process # 0 of 28 is alive. -> c1-01
--> Process # 3 of 28 is alive. -> c1-01
--> Process # 4 of 28 is alive. -> c1-01
--> Process # 5 of 28 is alive. -> c1-01
--> Process # 7 of 28 is alive. -> c1-01
--> Process # 8 of 28 is alive. -> c1-01
--> Process # 9 of 28 is alive. -> c1-01
--> Process # 18 of 28 is alive. -> c1-01
--> Process # 22 of 28 is alive. -> c1-01
--> Process # 26 of 28 is alive. -> c1-01
--> Process # 1 of 28 is alive. -> c1-01
--> Process # 10 of 28 is alive. -> c1-01
--> Process # 2 of 28 is alive. -> c1-01
--> Process # 6 of 28 is alive. -> c1-01
--> Process # 14 of 28 is alive. -> c1-01
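Leaving the interactive shell ends the ''%%srun%%'' job and releases the node again:
[user_name@c1-01 ~]$ exit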
----
===== SLURM =====
==== Interactive jobs (2) (exercise) ====
Alternatively the ''%%salloc%%'' command can be used:
salloc -N 1 -J test -p E5-2690v4 --qos E5-2690v4-batch --mem=10G
Then find out where your job is running:
squeue -u <username>
or
srun hostname
and connect to it:
ssh <nodename>
----
===== SLURM =====
==== Interactive jobs (2) (exercise) ====
To get direct interactive access to a compute node try:
salloc -N 1 -J test -p E5-2690v4 --qos E5-2690v4-batch --mem=10G srun --pty --preserve-env $SHELL
----
===== SLURM =====
==== Requesting resources ====
It is possible to request a certain number of cores and a specific amount of memory in a job script. E.g. to ask for two cores of a node and a total of 2 GByte of memory:
#SBATCH -n 2
#SBATCH --mem=2G
The cores and the requested memory are then exclusively assigned to the processes of this job via cgroups. The current policy is that if the memory is not specified, the job cannot be submitted and an error will be displayed.
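Memory can alternatively be requested per core with the standard ''%%--mem-per-cpu%%'' option (''%%--mem%%'' and ''%%--mem-per-cpu%%'' are mutually exclusive):
#SBATCH -n 2
#SBATCH --mem-per-cpu=1G # 2 tasks x 1 GByte = 2 GByte in total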
----
===== SLURM: memory =====
* you **have to** specify memory
* slurm does not accept your job without a memory specification
* choose the right amount of memory:
  * not too little
  * not too much
* too **little** memory:
  * could lead to very low speed because of swapping
  * could lead to a crash of the job (experienced with Abaqus)
* too **much** memory:
  * does not hurt performance and does not kill your job
  * but it costs you more of your fair share
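To pick a realistic value, the actual peak memory use of a finished job can be checked with the standard accounting command ''%%sacct%%'' (''%%MaxRSS%%'' is the peak resident memory of a task):
sacct -j <jobid> --format=JobID,JobName,Elapsed,MaxRSS,State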
----
===== SLURM: memory =====
==== why have this annoying feature anyway? ====
* because of shared usage of nodes
* if we used nodes **only exclusively**, this would not be necessary
----
===== SLURM =====
==== Array jobs ====
* run similar, **independent** jobs at once that can be distinguished by **one parameter**
* each task will be treated as a separate job
* example start=1, end=30, step width=7 (a variant using the task ID follows below):
#!/bin/sh
#SBATCH -J array
#SBATCH -N 1
#SBATCH --array=1-30:7
./my_binary $SLURM_ARRAY_TASK_ID
* computed tasks: 1, 8, 15, 22, 29
5605039_[15-29] E5-2690v4 array markus PD
5605039_1 E5-2690v4 array markus R
5605039_8 E5-2690v4 array markus R
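The task ID can be used, for example, to select a different input file per task; the file naming below is hypothetical:
./my_binary input_${SLURM_ARRAY_TASK_ID}.dat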
----
===== SLURM =====
Useful variables within the job:
SLURM_ARRAY_JOB_ID
SLURM_ARRAY_TASK_ID
SLURM_ARRAY_TASK_STEP
SLURM_ARRAY_TASK_MAX
SLURM_ARRAY_TASK_MIN
Limit the number of simultaneously running jobs to 2:
#SBATCH --array=1-30:7%2
----
===== SLURM =====
* normal jobs:
^#SBATCH ^job environment ^
|-N |SLURM_JOB_NUM_NODES |
|--ntasks-per-core |SLURM_NTASKS_PER_CORE |
|--ntasks-per-node |SLURM_NTASKS_PER_NODE |
|--ntasks-per-socket|SLURM_NTASKS_PER_SOCKET|
|--ntasks, -n |SLURM_NTASKS |
* emails:
#SBATCH --mail-user=yourmail@example.com
#SBATCH --mail-type=BEGIN,END
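Other standard ''%%--mail-type%%'' values include FAIL and ALL, e.g. to be notified only when a job fails:
#SBATCH --mail-type=FAIL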
----
===== SLURM =====
==== Job dependencies ====
- Submit the first job and note its <job id>
- Submit the dependent job (and get its <job id>):
#!/bin/bash
#SBATCH -J jobname
#SBATCH -N 2
#SBATCH -d afterany:<job id>
srun ./my_program
- Continue at step 2 for further dependent jobs
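Dependency chains can also be built in a shell script: ''%%sbatch --parsable%%'' (a standard option) prints just the job ID, which can be passed to the next submission; the script names below are hypothetical:
jobid=$(sbatch --parsable first_job.sh)
sbatch -d afterany:$jobid second_job.sh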
----