===== SLURM =====

==== Submission of batch jobs ====

The workload scheduler ''%%SLURM%%'' (Simple Linux Utility for Resource Management) is used as the batch system: https://www.schedmd.com/

Basic commands:

<code>
sinfo      # list information about partitions and node states
squeue     # list jobs in the queue
sbatch     # submit a batch job
scancel    # cancel a job
srun ...   # run a parallel job within the SLURM environment
scontrol   # show detailed information on jobs and nodes
</code>

----

===== SLURM =====

Example ''%%sinfo%%'':

<code>
sinfo
PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
E5-2690v4*    up 7-00:00:00      1  down*  c1-10
E5-2690v4*    up 7-00:00:00     19   idle  c1-[01-09,11-12],c2-[01-08]
Phi           up 7-00:00:00      1  drain  c3-01
Phi           up 7-00:00:00      7   idle  c3-[02-08]
E5-1650v4     up 7-00:00:00      1  down*  c4-16
E5-1650v4     up 7-00:00:00      1  drain  c4-01
E5-1650v4     up 7-00:00:00     14   idle  c4-[02-15]
E5-1650v3     up 7-00:00:00      1   idle  c5-01
</code>

  * down … node is unreachable by the SLURM control daemon
  * drain … node is not available for job allocation
  * allocated … node is used by a job
  * idle … node is available for job allocation

----

===== SLURM =====

List the names, number of nodes and node names of the available hardware partitions:

<code>
sinfo -o "%.10R %.5D %.N"
 PARTITION NODES NODELIST
 E5-2690v4    20 c1-[01-12],c2-[01-08]
       Phi     8 c3-[01-08]
 E5-1650v4    16 c4-[01-16]
 E5-1650v3     1 c5-01
</code>

----

===== SLURM =====

To list the node states of a specific partition:

<code>
sinfo -p Phi
PARTITION AVAIL  TIMELIMIT NODES STATE NODELIST
Phi          up 7-00:00:00     1 drain c3-01
Phi          up 7-00:00:00     7 idle  c3-[02-08]
</code>

----

===== SLURM =====

=== Submit jobs ===

Example job script ''%%/opt/ohpc/pub/examples/slurm/mul/mpi/job.sh%%'':

<code bash>
#!/bin/bash
#
#SBATCH -J your_job_name_here   # name of job to appear in squeue
#SBATCH -N 1                    # number of nodes requested
#SBATCH -o job.%j.out           # filename for stdout (%j = job ID)
#SBATCH -p E5-2690v4            # specify hardware partition
#SBATCH -q E5-2690v4-batch      # specify quality of service (QOS)
#SBATCH --ntasks-per-node=28    # request number of tasks for your job
#SBATCH --threads-per-core=1    # specify number of threads per core

env | grep SLURM                # list SLURM environment variables to use in the job script

module purge
module load gnu7/7.2.0 openmpi/1.10.7 prun
echo
module list
echo
which gcc
which mpicc

mpicc hello.c -o hello          # compile and optimize your code directly on the hardware

time mpirun -np $SLURM_NPROCS ./hello   # run your job with mpirun
echo
time prun ./hello               # run your job via the prun wrapper script
</code>

----

===== SLURM =====

=== Submit jobs ===

The modules ''%%gnu7/7.2.0%%'' and ''%%openmpi/1.10.7%%'' should be loaded for this example:

<code>
sbatch job.sh
sbatch: info for user at sbatch
sbatch: current partition E5-2690v4
sbatch: current qos E5-2690v4-batch
Submitted batch job 548

squeue
 JOBID PARTITION    NAME    USER ST  TIME NODES NODELIST(REASON)
   548 E5-2690v4 job_nam usernam  R  0:01     1 c1-01
</code>

To cancel the job:

<code>
scancel 548
</code>

----

===== SLURM =====

==== Output files ====

  * If no output file is specified, STDOUT is written to a file called ''%%slurm-<jobid>.out%%''
  * In the example of the previous slides STDOUT is written to ''%%job.<jobid>.out%%'' (the ''%%%j%%'' in the job script is replaced by the job ID)

----

===== SLURM =====

If no free nodes are available, the job will be shown as pending (PD):

<code>
squeue
 JOBID PARTITION    NAME    USER ST  TIME NODES NODELIST(REASON)
   548 E5-2690v4 job_nam usernam PD  0:00     1 (Priority)
</code>

----

===== SLURM =====

Customized format of ''%%squeue%%'':

<code>
squeue -o "%.10A %.10u %.10g %.10P %.10p %.20S %.10j %.10D %.10N %.10T"
 JOBID     USER    GROUP PARTITION   PRIORITY          START_TIME    NAME NODES NODELIST   STATE
   555 user_nam group_na E5-2690v4 0.00000023 2018-01-12T12:21:30 mpitest     1    c1-01 RUNNING
</code>

To show only the jobs of a specific user:

<code>
squeue -u <username>
</code>
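If job accounting is configured on the cluster (an assumption, not shown on the slides above), finished jobs can also be inspected with ''%%sacct%%''; a minimal sketch:

<code>
# show state, runtime and peak memory of job 548 (requires SLURM accounting)
sacct -j 548 --format=JobID,JobName,State,Elapsed,MaxRSS
</code>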
----

===== SLURM =====

==== Exercise ====

  * create a job script as in the previous slides (or copy it from ''%%/opt/ohpc/pub/examples/slurm/mul/mpi/job.sh%%'')
  * submit the job
  * inspect the output file

----

===== SLURM =====

<code>
scontrol show job 548
JobId=548 JobName=job_name
   UserId=user_name(1000) GroupId=user_name(1000) MCS_label=N/A
   Priority=1015 Nice=0 Account=(null) QOS=e5-2690v4-batch
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:02 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2018-01-12T11:58:47 EligibleTime=2018-01-12T11:58:47
   StartTime=2018-01-12T11:58:47 EndTime=2018-01-19T11:58:47 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-01-12T11:58:47
   Partition=E5-2690v4 AllocNode:Sid=mul-hpc-81a-mgmt:27864
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=c1-01 BatchHost=c1-01
   NumNodes=1 NumCPUs=28 NumTasks=28 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=28,mem=128655M,node=1,billing=40
   Socks/Node=* NtasksPerN:B:S:C=28:0:*:* CoreSpec=*
   MinCPUsNode=28 MinMemoryNode=128655M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/user_name/jz/mpi/job.sh
   WorkDir=/home/user_name/jz/mpi
   StdErr=/home/user_name/jz/mpi/job.548.out
   StdIn=/dev/null
   StdOut=/home/user_name/jz/mpi/job.548.out
   Power=
</code>

----

===== SLURM =====

==== Interactive jobs (1) ====

To get direct access to a compute node (this example can be found in ''%%/opt/ohpc/pub/examples/slurm/mul/mpi/srun_example%%''):

<code>
srun -J test -p E5-2690v4 --qos E5-2690v4-batch -N 1 --ntasks-per-node=28 --pty /bin/bash
srun: info for user at sbatch
srun: current partition E5-2690v4
srun: current qos E5-2690v4-batch
[user_name@c1-01 ~]$
[user_name@c1-01 ~]$ prun hello
[prun] Master compute host = c1-01
[prun] Resource manager = slurm
[prun] Launch cmd = mpirun hello (family=openmpi)

 Hello, world (28 procs total)
    --> Process # 11 of 28 is alive. -> c1-01
    --> Process # 12 of 28 is alive. -> c1-01
    --> Process # 13 of 28 is alive. -> c1-01
    --> Process # 15 of 28 is alive. -> c1-01
    --> Process # 16 of 28 is alive. -> c1-01
    --> Process # 17 of 28 is alive. -> c1-01
    --> Process # 19 of 28 is alive. -> c1-01
    --> Process # 20 of 28 is alive. -> c1-01
    --> Process # 21 of 28 is alive. -> c1-01
    --> Process # 23 of 28 is alive. -> c1-01
    --> Process # 24 of 28 is alive. -> c1-01
    --> Process # 25 of 28 is alive. -> c1-01
    --> Process # 27 of 28 is alive. -> c1-01
    --> Process #  0 of 28 is alive. -> c1-01
    --> Process #  3 of 28 is alive. -> c1-01
    --> Process #  4 of 28 is alive. -> c1-01
    --> Process #  5 of 28 is alive. -> c1-01
    --> Process #  7 of 28 is alive. -> c1-01
    --> Process #  8 of 28 is alive. -> c1-01
    --> Process #  9 of 28 is alive. -> c1-01
    --> Process # 18 of 28 is alive. -> c1-01
    --> Process # 22 of 28 is alive. -> c1-01
    --> Process # 26 of 28 is alive. -> c1-01
    --> Process #  1 of 28 is alive. -> c1-01
    --> Process # 10 of 28 is alive. -> c1-01
    --> Process #  2 of 28 is alive. -> c1-01
    --> Process #  6 of 28 is alive. -> c1-01
    --> Process # 14 of 28 is alive. -> c1-01
</code>
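''%%srun%%'' can also launch a single command directly instead of an interactive shell; a minimal sketch using the partition and QOS from above:

<code>
# one task per node on two nodes; prints the allocated host names
srun -p E5-2690v4 --qos E5-2690v4-batch -N 2 --ntasks-per-node=1 hostname
</code>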
----

===== SLURM =====

==== Interactive jobs (2) (exercise) ====

Alternatively the ''%%salloc%%'' command can be used:

<code>
salloc -N 1 -J test -p E5-2690v4 --qos E5-2690v4-batch --mem=10G
</code>

Then find out where your job is running:

<code>
squeue -u <username>
</code>

or

<code>
srun hostname
</code>

and connect to it:

<code>
ssh <nodename>
</code>

----

===== SLURM =====

==== Interactive jobs (2) (exercise) ====

To get direct interactive access to a compute node try:

<code>
salloc -N 1 -J test -p E5-2690v4 --qos E5-2690v4-batch --mem=10G
srun --pty --preserve-env $SHELL
</code>

----

===== SLURM =====

==== Requesting resources ====

It is possible to request a certain number of cores and a specific amount of memory in a job script, e.g. to ask for two cores of a node and a total of 2 GByte of memory:

<code>
#SBATCH -n 2
#SBATCH --mem=2G
</code>

The cores and the requested memory are then exclusively assigned to the processes of this job via cgroups. The current policy is that if the memory is not specified, the job cannot be submitted and an error will be displayed.

----

===== SLURM: memory =====

  * you **have to** specify memory
  * SLURM does not accept your job without a memory specification
  * choose the right amount of memory:
    * not too little
    * not too much
  * too **little** memory:
    * could lead to very low speed because of swapping
    * could lead to a crash of the job (experienced with Abaqus)
  * too **much** memory:
    * does not hurt performance and does not kill your job
    * but it costs you more of your fair share

----

===== SLURM: memory =====

==== why have this annoying feature anyway? ====

  * because of shared usage of nodes
  * if we used nodes **only exclusively**, this would not be necessary

----

===== SLURM =====

==== Array jobs ====

  * run similar, **independent** jobs at once that can be distinguished by **one parameter**
  * each task will be treated as a separate job
  * example with start=1, end=30, step width=7:

<code bash>
#!/bin/sh
#SBATCH -J array
#SBATCH -N 1
#SBATCH --array=1-30:7
./my_binary $SLURM_ARRAY_TASK_ID
</code>

  * computed tasks: 1, 8, 15, 22, 29

''%%squeue%%'' then shows the array tasks:

<code>
5605039_[15-29] E5-2690v4     array   markus PD
      5605039_1 E5-2690v4     array   markus  R
      5605039_8 E5-2690v4     array   markus  R
</code>

----

===== SLURM =====

Useful variables within an array job:

<code>
SLURM_ARRAY_JOB_ID
SLURM_ARRAY_TASK_ID
SLURM_ARRAY_TASK_STEP
SLURM_ARRAY_TASK_MAX
SLURM_ARRAY_TASK_MIN
</code>

Limit the number of simultaneously running array tasks to 2:

<code>
#SBATCH --array=1-30:7%2
</code>

----

===== SLURM =====

  * normal jobs:

^ #SBATCH             ^ job environment         ^
| -N                  | SLURM_JOB_NUM_NODES     |
| --ntasks-per-core   | SLURM_NTASKS_PER_CORE   |
| --ntasks-per-node   | SLURM_NTASKS_PER_NODE   |
| --ntasks-per-socket | SLURM_NTASKS_PER_SOCKET |
| --ntasks, -n        | SLURM_NTASKS            |

  * emails:

<code>
#SBATCH --mail-user=yourmail@example.com
#SBATCH --mail-type=BEGIN,END
</code>
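A minimal sketch of a job script that echoes the matching environment variables (variable names as in the table above; the job name ''%%envdemo%%'' is just a placeholder):

<code bash>
#!/bin/bash
#SBATCH -J envdemo
#SBATCH -N 2
#SBATCH -n 8                    # total tasks, sets SLURM_NTASKS
#SBATCH --ntasks-per-node=4
#SBATCH --mem=1G                # memory must be specified on this cluster
#SBATCH --mail-user=yourmail@example.com
#SBATCH --mail-type=BEGIN,END

# the scheduler exports one variable per option (see table above)
echo "nodes:          $SLURM_JOB_NUM_NODES"    # from -N
echo "tasks per node: $SLURM_NTASKS_PER_NODE"  # from --ntasks-per-node
echo "total tasks:    $SLURM_NTASKS"           # from -n
</code>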
----

===== SLURM =====

==== Job dependencies ====

  - Submit the first job and note its <jobid>
  - Submit the dependent job (and note its <jobid> as well):

<code bash>
#!/bin/bash
#SBATCH -J jobname
#SBATCH -N 2
#SBATCH -d afterany:<jobid>
srun ./my_program
</code>

Repeat step 2 for further dependent jobs; a scripted version of such a chain is sketched below.
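Capturing the job ID can be automated with ''%%sbatch --parsable%%''; a minimal sketch (the script names ''%%first_job.sh%%'' etc. are placeholders):

<code bash>
# --parsable makes sbatch print only the job ID
jobid=$(sbatch --parsable first_job.sh)

# the second job starts only after the first one has finished (in any state)
jobid2=$(sbatch --parsable -d afterany:$jobid second_job.sh)

# chain further jobs in the same way
sbatch -d afterany:$jobid2 third_job.sh
</code>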
----