SLURM
Submission of batch jobs
As batch system, the workload scheduler SLURM (Simple Linux Utility for Resource Management) is used: https://www.schedmd.com/
Basic commands:
sinfo                  # list information about partitions and node states
squeue                 # list jobs in queue
sbatch <job-script>    # submit batch job
scancel <job-id>       # cancel job
srun ...               # run a parallel job within the SLURM environment
scontrol               # gives various information on jobs and nodes
Example sinfo:

sinfo
PARTITION  AVAIL  TIMELIMIT   NODES  STATE  NODELIST
E5-2690v4* up     7-00:00:00      1  down*  c1-10
E5-2690v4* up     7-00:00:00     19  idle   c1-[01-09,11-12],c2-[01-08]
Phi        up     7-00:00:00      1  drain  c3-01
Phi        up     7-00:00:00      7  idle   c3-[02-08]
E5-1650v4  up     7-00:00:00      1  down*  c4-16
E5-1650v4  up     7-00:00:00      1  drain  c4-01
E5-1650v4  up     7-00:00:00     14  idle   c4-[02-15]
E5-1650v3  up     7-00:00:00      1  idle   c5-01
- down … node is unreachable by the slurm control daemon
- drain … node is not available for job allocation
- allocated … node is used by a job
- idle … node is available for job allocation
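To find out why a node is in the down or drain state, SLURM records a reason that can be listed (an additional hint using standard SLURM commands; the node name is taken from the example above):

sinfo -R
scontrol show node c3-01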
List the names, number of nodes and node names of the available hardware partitions:
sinfo -o "%.10R %.5D %.N"
 PARTITION  NODES  NODELIST
 E5-2690v4     20  c1-[01-12],c2-[01-08]
       Phi      8  c3-[01-08]
 E5-1650v4     16  c4-[01-16]
 E5-1650v3      1  c5-01
To list node states of a specific partition:
sinfo -p Phi
PARTITION  AVAIL  TIMELIMIT   NODES  STATE  NODELIST
Phi        up     7-00:00:00      1  drain  c3-01
Phi        up     7-00:00:00      7  idle   c3-[02-08]
Submit jobs
Example job script /opt/ohpc/pub/examples/slurm/mul/mpi/job.sh:

#!/bin/bash
#
#SBATCH -J your_job_name_here    # name of job to appear in squeue
#SBATCH -N 1                     # number of nodes requested
#SBATCH -o job.%j.out            # filename for stdout
#SBATCH -p E5-2690v4             # specify hardware partition
#SBATCH -q E5-2690v4-batch       # specify quality of service (QOS)
#SBATCH --ntasks-per-node=28     # request number of tasks for your job
#SBATCH --threads-per-core=1     # specify number of threads per core

env | grep SLURM                 # list of SLURM environment variables to use in job script

module purge
module load gnu7/7.2.0 openmpi/1.10.7 prun
echo
module list
echo
which gcc
which mpicc

mpicc hello.c -o hello           # compile and optimize your code directly on hardware

time mpirun -np $SLURM_NPROCS ./hello   # run your job with mpirun
echo
time prun ./hello                # run your job via prun wrapper script
The modules gnu7/7.2.0 and openmpi/1.10.7 should be loaded for this example:
sbatch job.sh
sbatch: info for user at sbatch
sbatch: current partition E5-2690v4
sbatch: current qos E5-2690v4-batch
Submitted batch job 548
squeue
JOBID  PARTITION     NAME     USER  ST  TIME  NODES  NODELIST(REASON)
  548  E5-2690v4  job_nam  usernam   R  0:01      1  c1-01
To cancel the job:
scancel 548
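As an additional hint (standard scancel options, not part of the original example), all of your own jobs can be cancelled at once:

scancel -u <username>                  # cancel all of your jobs
scancel -u <username> --state=PENDING  # cancel only your pending jobs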
Output files
- If no file is specified, STDOUT will be written to a file called slurm-<job id>.out
- In the example of the previous slides, STDOUT is written to job.<job id>.out (redirecting STDERR separately is sketched below)
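By default STDERR ends up in the same file as STDOUT; a minimal sketch for redirecting it separately with the standard SBATCH options (the file names are just examples):

#SBATCH -o job.%j.out    # %j is replaced by the job id
#SBATCH -e job.%j.err    # write STDERR to a separate file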
If no free nodes are available the job will be shown as pending (PD):
squeue
JOBID  PARTITION     NAME     USER  ST  TIME  NODES  NODELIST(REASON)
  548  E5-2690v4  job_nam  usernam  PD  0:00      1  (Priority)
Customized format of squeue:
squeue -o "%.10A %.10u %.10g %.10P %.10p %.20S %.10j %.10D %.10N %.10T"
JOBID  USER      GROUP     PARTITION  PRIORITY    START_TIME           NAME     NODES  NODELIST  STATE
  555  user_nam  group_na  E5-2690v4  0.00000023  2018-01-12T12:21:30  mpitest      1     c1-01  RUNNING
To show only jobs of specific user:
squeue -u <username>
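To monitor your jobs periodically, the listing can be repeated automatically (a hint assuming the watch utility is available on the login node):

watch -n 10 squeue -u <username>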
Exercise
- create a job script as in the previous slides (or copy it from: /opt/ohpc/pub/examples/slurm/mul/mpi/job.sh)
- submit the job
- inspect the output file
Detailed information on a job can be displayed with scontrol:
scontrol show job 548
JobId=548 JobName=job_name
   UserId=user_name(1000) GroupId=user_name(1000) MCS_label=N/A
   Priority=1015 Nice=0 Account=(null) QOS=e5-2690v4-batch
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:02 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2018-01-12T11:58:47 EligibleTime=2018-01-12T11:58:47
   StartTime=2018-01-12T11:58:47 EndTime=2018-01-19T11:58:47 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-01-12T11:58:47
   Partition=E5-2690v4 AllocNode:Sid=mul-hpc-81a-mgmt:27864
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=c1-01
   BatchHost=c1-01
   NumNodes=1 NumCPUs=28 NumTasks=28 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=28,mem=128655M,node=1,billing=40
   Socks/Node=* NtasksPerN:B:S:C=28:0:*:* CoreSpec=*
   MinCPUsNode=28 MinMemoryNode=128655M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/user_name/jz/mpi/job.sh
   WorkDir=/home/user_name/jz/mpi
   StdErr=/home/user_name/jz/mpi/job.548.out
   StdIn=/dev/null
   StdOut=/home/user_name/jz/mpi/job.548.out
   Power=
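scontrol can be used in the same way for nodes and partitions (a hint with names taken from the sinfo examples above):

scontrol show node c1-01
scontrol show partition E5-2690v4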
Interactive jobs (1)
To get direct access to a compute node (this example can be found in /opt/ohpc/pub/examples/slurm/mul/mpi/srun_example):

srun -J test -p E5-2690v4 --qos E5-2690v4-batch -N 1 --ntasks-per-node=28 --pty /bin/bash
srun: info for user at sbatch
srun: current partition E5-2690v4
srun: current qos E5-2690v4-batch
[user_name@c1-01 ~]$
[user_name@c1-01 ~]$ prun hello
[prun] Master compute host = c1-01
[prun] Resource manager = slurm
[prun] Launch cmd = mpirun hello (family=openmpi)

 Hello, world (28 procs total)
    --> Process # 11 of 28 is alive. -> c1-01
    --> Process # 12 of 28 is alive. -> c1-01
    --> Process # 13 of 28 is alive. -> c1-01
    --> Process # 15 of 28 is alive. -> c1-01
    --> Process # 16 of 28 is alive. -> c1-01
    --> Process # 17 of 28 is alive. -> c1-01
    --> Process # 19 of 28 is alive. -> c1-01
    --> Process # 20 of 28 is alive. -> c1-01
    --> Process # 21 of 28 is alive. -> c1-01
    --> Process # 23 of 28 is alive. -> c1-01
    --> Process # 24 of 28 is alive. -> c1-01
    --> Process # 25 of 28 is alive. -> c1-01
    --> Process # 27 of 28 is alive. -> c1-01
    --> Process #  0 of 28 is alive. -> c1-01
    --> Process #  3 of 28 is alive. -> c1-01
    --> Process #  4 of 28 is alive. -> c1-01
    --> Process #  5 of 28 is alive. -> c1-01
    --> Process #  7 of 28 is alive. -> c1-01
    --> Process #  8 of 28 is alive. -> c1-01
    --> Process #  9 of 28 is alive. -> c1-01
    --> Process # 18 of 28 is alive. -> c1-01
    --> Process # 22 of 28 is alive. -> c1-01
    --> Process # 26 of 28 is alive. -> c1-01
    --> Process #  1 of 28 is alive. -> c1-01
    --> Process # 10 of 28 is alive. -> c1-01
    --> Process #  2 of 28 is alive. -> c1-01
    --> Process #  6 of 28 is alive. -> c1-01
    --> Process # 14 of 28 is alive. -> c1-01
Interactive jobs (2) (exercise)
Alternatively the salloc command can be used:
salloc -N 1 -J test -p E5-2690v4 --qos E5-2690v4-batch --mem=10G
Then find out where your job is running:
squeue -u <username>
or
srun hostname
and connect to it:
ssh <node>
To get direct interactive access to a compute node, try:
salloc -N 1 -J test -p E5-2690v4 --qos E5-2690v4-batch --mem=10G
srun --pty --preserve-env $SHELL
Requesting resources
It is possible to request a certain number of cores and a specific amount of memory in a job script. E.g. to ask for two cores of a node and a total of 2 GByte of memory:
#SBATCH -n 2
#SBATCH --mem=2G
The cores and the requested memory are then exclusively assigned to the processes of this job via cgroups. The current policy is that if the memory is not specified, the job cannot be submitted and an error will be displayed.
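As a sketch of an alternative, memory can also be requested per core with the standard --mem-per-cpu option (--mem and --mem-per-cpu are mutually exclusive; whether the local policy accepts this form is an assumption here):

#SBATCH -n 2
#SBATCH --mem-per-cpu=1G    # 2 cores x 1 GByte = 2 GByte in total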
SLURM: memory
- you have to specify memory
- SLURM does not accept your job without a memory specification
- choose the right amount of memory (a way to check the actual usage is sketched after this list):
  - not too little
  - not too much
- too little memory:
  - could lead to very low speed because of swapping
  - could lead to a crash of the job (experienced with Abaqus)
- too much memory:
  - does not hurt performance and does not kill your job
  - but it costs you more of your fair share
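To pick a sensible value, the memory actually used by a finished job can be compared with the requested amount via the accounting database (a sketch; it assumes job accounting with sacct is enabled on the cluster, and 548 is the job id from the earlier example):

sacct -j 548 --format=JobID,JobName,ReqMem,MaxRSS,Elapsed,State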
Why have this annoying feature anyway?
- because of the shared usage of nodes
- if nodes were only used exclusively, this would not be necessary
Array jobs
- run many similar, independent jobs at once that can be distinguished by one parameter
- each task will be treated as a separate job
- example with start=1, end=30, step width=7:

#!/bin/sh
#SBATCH -J array
#SBATCH -N 1
#SBATCH --array=1-30:7

./my_binary $SLURM_ARRAY_TASK_ID
- computed tasks: 1, 8, 15, 22, 29
5605039_[15-29]  E5-2690v4  array  markus  PD
5605039_1        E5-2690v4  array  markus  R
5605039_8        E5-2690v4  array  markus  R
Useful variables within the job:

SLURM_ARRAY_JOB_ID
SLURM_ARRAY_TASK_ID
SLURM_ARRAY_TASK_STEP
SLURM_ARRAY_TASK_MAX
SLURM_ARRAY_TASK_MIN
Limit the number of simultaneously running jobs to 2:
#SBATCH --array=1-30:7%2
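A sketch of how the task id is typically used to select per-task input and output files (the file naming and the arguments to my_binary are assumptions, not part of the original example):

#!/bin/sh
#SBATCH -J array
#SBATCH -N 1
#SBATCH --mem=2G          # memory must be specified (see above)
#SBATCH --array=1-30:7%2

# hypothetical naming scheme: one input and one output file per task id
INPUT=input_${SLURM_ARRAY_TASK_ID}.dat
OUTPUT=result_${SLURM_ARRAY_TASK_ID}.out
./my_binary "$INPUT" > "$OUTPUT"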
- normal jobs:
| #SBATCH | job environment |
| --- | --- |
| -N | SLURM_JOB_NUM_NODES |
| --ntasks-per-core | SLURM_NTASKS_PER_CORE |
| --ntasks-per-node | SLURM_NTASKS_PER_NODE |
| --ntasks-per-socket | SLURM_NTASKS_PER_SOCKET |
| --ntasks, -n | SLURM_NTASKS |
- emails:
#SBATCH --mail-user=yourmail@example.com
#SBATCH --mail-type=BEGIN,END
Job dependencies
1. Submit the first job and get its <job id>
2. Submit the dependent job (and get its <job_id>):

   #!/bin/bash
   #SBATCH -J jobname
   #SBATCH -N 2
   #SBATCH -d afterany:<job_id>

   srun ./my_program

3. Continue at step 2 for further dependent jobs.
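The chain can also be scripted: sbatch --parsable prints only the job id, which can be passed directly to the next submission (a sketch; the script names are placeholders):

first=$(sbatch --parsable first_job.sh)
second=$(sbatch --parsable -d afterany:$first second_job.sh)
sbatch -d afterany:$second third_job.sh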