Node access and job control
- Article written by Jan Zabloudil (VSC Team) (last update 2017-10-09 by jz).
Node access
- … after submitting a job script
- … in interactive sessions
Job scripts: sbatch
[jz@l31 somedirectory]$ sbatch job.sh
Submitted batch job 54321
[jz@l31 somedirectory]$ squeue -u jz
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
     54321  mem_0064     test       jz  R       0:04      1 n20-038
[jz@l31 somedirectory]$ ssh n20-038
Last login: Wed Mar 15 14:26:01 2017 from l31.cm.cluster
[jz@n20-038 ~]$
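The contents of job.sh are not shown above. A minimal sketch of such a batch script could look as follows (job name, node count and run time are placeholders, not the actual file):

#!/bin/bash
#SBATCH -J test              # job name (placeholder)
#SBATCH -N 1                 # number of nodes (placeholder)
#SBATCH --time=00:10:00      # run time limit (placeholder)

# commands to be executed on the allocated node(s)
hostname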
Interactive jobs: salloc
Option 1:
[jz@l31 somedirectory]$ salloc -N1
salloc: Pending job allocation 5115879
salloc: job 5115879 queued and waiting for resources
salloc: job 5115879 has been allocated resources
salloc: Granted job allocation 5115879
[jz@l31 somedirectory]$ squeue -u jz
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   5115879  mem_0064     bash       jz  R       0:18      1 n07-043
NOTE: the SLURM prolog script is not run automatically in this case. The prolog performs basic tasks such as
- cleaning up after the previous job,
- checking basic node functionality,
- adapting firewall settings to access license servers.
Interactive jobs: salloc, cont.
To trigger the execution of the prolog, run an srun command, e.g.:
[jz@l31 somedirectory]$ srun hostname
Then access the node:
[jz@l31 somedirectory]$ ssh n07-043
Warning: Permanently added 'n07-043,10.141.7.43' (ECDSA) to the list of known hosts.
[jz@n07-043 ~]$
Interactive jobs: salloc, cont.
Option 2:
[jz@l31 somedirectory]$ salloc -N1 srun --pty --preserve-env $SHELL
salloc: Pending job allocation 5115908
salloc: job 5115908 queued and waiting for resources
salloc: job 5115908 has been allocated resources
salloc: Granted job allocation 5115908
[jz@n09-046 somedirectory]$
[jz@l31 somedirectory]$ salloc srun -N1 --pty --preserve-env $SHELL
salloc: Pending job allocation 5115909
salloc: job 5115909 queued and waiting for resources
salloc: job 5115909 has been allocated resources
salloc: Granted job allocation 5115909
[jz@n15-062 somedirectory]$
No job on node
[jz@l31 somedirectory]$ squeue -u jz
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
[jz@l31 somedirectory]$ ssh n15-002
Warning: Permanently added 'n15-002,10.141.15.2' (ECDSA) to the list of known hosts.
Access denied: user jz (uid=70497) has no active jobs on this node.
Connection closed by 10.141.15.2
Exercise 1: interactive job
1.) Log in as user training:
[...]$ ssh training@vsc3.vsc.ac.at
or
[...]$ su - training
Create a directory and copy the example:
[training@l31]$ mkdir my_directory_name
[training@l31]$ cd my_directory_name
[training@l31 my_directory_name]$ cp -r ~/examples/06_node_access_job_control/linpack .
[training@l31 my_directory_name]$ ls
HPL.dat  job.sh  xhpl
Exercise 1: interactive job, cont.
2.) Allocate one node for an interactive session:
(name the job in a meaningful way with the -J option)
[training@l31 linpack]$ salloc -J jz_hpl -N 1
salloc: Pending job allocation 5260456
salloc: job 5260456 queued and waiting for resources
salloc: job 5260456 has been allocated resources
salloc: Granted job allocation 5260456
[training@l31 linpack]$ squeue -u training
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   5260456  mem_0064   jz_hpl training  R       0:11      1 n41-001
3.) Run an srun command:
[training@l31 linpack]$ srun hostname
Exercise 1: interactive job, cont.
4.) Log in to the allocated node and execute the program:
[training@l31 linpack]$ ssh n41-001
Last login: Thu Apr 20 15:38:33 2017 from l31.cm.cluster
[training@n41-001 ~]$ cd my_directory_name/linpack
[training@n41-001 linpack]$ module load intel/17 intel-mkl/2017 intel-mpi/2017
Loading intel/17 from: /cm/shared/apps/intel/compilers_and_libraries_2017.2.174/linux
Loading intel-mkl/2017 from: /cm/shared/apps/intel/compilers_and_libraries_2017.2.174/linux/mkl
Loading intel-mpi/2017 from: /cm/shared/apps/intel/compilers_and_libraries_2017.2.174/linux/mpi
[training@n41-001 linpack]$ mpirun -np 16 ./xhpl
Number of Intel(R) Xeon Phi(TM) coprocessors : 0
================================================================================
HPLinpack 2.1  --  High-Performance Linpack benchmark  --  October 26, 2012
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
Exercise 1: interactive job, cont.
5.) Stop the process (Ctrl+C), log out of the node, and delete the job:
[training@n41-001 linpack]$ exit
[training@l31 linpack]$ exit
exit
salloc: Relinquishing job allocation 5260456
salloc: Job allocation 5260456 has been revoked.
Exercise 1: interactive job, summary
[training@l31 linpack]$ salloc -J jz_hpl -N 1
[training@l31 linpack]$ squeue -u training
[training@l31 linpack]$ srun hostname
[training@l31 linpack]$ ssh nXX-YYY
[training@nXX-YYY ~]$ module load intel/17 intel-mkl/2017 intel-mpi/2017
[training@nXX-YYY ~]$ cd my_directory_name/linpack
[training@nXX-YYY linpack]$ mpirun -np 16 ./xhpl
Stop the process (Ctrl+C):
[training@nXX-YYY linpack]$ exit
[training@l31 linpack]$ exit
Exercise 2: job script
1.) Now submit the same job with a job script:
[training@l31 my_directory_name]$ sbatch job.sh
[training@l31 my_directory_name]$ squeue -u training
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   5262833  mem_0064      hpl training  R       0:04      1 n41-001
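The provided job.sh is not reproduced here. Based on the interactive run in Exercise 1, a rough sketch of what it might contain (the SBATCH directives are assumptions):

#!/bin/bash
#SBATCH -J hpl                    # job name, as shown in squeue above
#SBATCH -N 1                      # one node, as in Exercise 1
#SBATCH --ntasks-per-node=16      # assumption, matching mpirun -np 16

module load intel/17 intel-mkl/2017 intel-mpi/2017
mpirun -np 16 ./xhpl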
2.) Log in to the node listed under NODELIST and run commands that give you information about your job:
[training@l31 my_directory_name]$ ssh n41-001
Last login: Fri Apr 21 08:43:07 2017 from l31.cm.cluster
Exercise 2: job script, cont.
[training@n41-001 ~]$ top
top - 13:06:06 up 19 days,  2:57,  2 users,  load average: 12.42, 4.16, 1.51
Tasks: 442 total,  17 running, 425 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.1 us,  0.0 sy,  0.0 ni, 99.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 65940112 total, 62756072 free,  1586292 used,  1597748 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 62903424 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
12737 training  20   0 1915480  68580  18780 R 106.2  0.1   0:12.57 xhpl
12739 training  20   0 1915480  68372  18584 R 106.2  0.1   0:12.55 xhpl
12742 training  20   0 1856120  45412  13172 R 106.2  0.1   0:12.57 xhpl
12744 training  20   0 1856120  45412  13172 R 106.2  0.1   0:12.52 xhpl
12745 training  20   0 1856120  45464  13224 R 106.2  0.1   0:12.56 xhpl
...
- 1 … show the load on each CPU core
- Shift+H … show threads
3.) Cancel the job:
[training@n41-001 ~]$ exit
[training@l31 ~]$ scancel <job ID>
Prolog Failure
If a check in the SLURM prolog script fails on one of the nodes assigned to your job, you will see a message like the following in your slurm-$JOBID.out file:
Error running slurm prolog: 228
The error code (228 in this example) tells you which check failed. The currently defined error codes are:
ERROR_MEMORY=200
ERROR_INFINIBAND_HW=201
ERROR_INFINIBAND_SW=202
ERROR_IPOIB=203
ERROR_BEEGFS_SERVICE=204
ERROR_BEEGFS_USER=205
ERROR_BEEGFS_SCRATCH=206
ERROR_NFS=207
ERROR_USER_GROUP=220
ERROR_USER_HOME=221
ERROR_GPFS_START=228
ERROR_GPFS_MOUNT=229
ERROR_GPFS_UNMOUNT=230
- The node will be drained (unavailable for subsequent jobs until it is fixed).
- Resubmit your job (see the commands sketched below).
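A short sketch of how to proceed; job.sh stands for your own job script:

[jz@l31 somedirectory]$ sbatch job.sh    # resubmit the job
[jz@l31 somedirectory]$ sinfo -R         # list down/drained nodes together with the reason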
squeue options
[jz@l31 ~]$ squeue -u jz
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   5115915  mem_0064     bash       jz  R      34:50      1 n05-071
[jz@l31 ~]$ alias squeue='squeue -o "%.18i %.10Q %.12q %.8j %.8u %8U %.2t %.10M %.6D %R" -u jz'
[jz@l31 ~]$ squeue
             JOBID   PRIORITY          QOS     NAME     USER USER     ST       TIME  NODES NODELIST(REASON)
           5115915      30209        admin     bash       jz 70497     R      33:41      1 n05-071
- Formatting option: -o <output_format>, --format=<output_format>
- The format of each field is %[[.]size]type
  - "." means right justification
  - "size" is the field width
  - "type" is the field type (JOBID, PRIORITY, etc.)
[jz@l31 ~]$ man squeue
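To keep the customized output format across sessions, the alias can be placed in the shell startup file, e.g. (a sketch; replace jz with your own username):

# in ~/.bashrc
alias squeue='squeue -o "%.18i %.10Q %.12q %.8j %.8u %8U %.2t %.10M %.6D %R" -u jz'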
scontrol
[training@l31 linpack]$ scontrol show job 5264675
JobId=5264675 JobName=hpl
   UserId=training(72127) GroupId=p70824(70824) MCS_label=N/A
   Priority=1861 Nice=0 Account=p70824 QOS=normal_0064
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:36 TimeLimit=3-00:00:00 TimeMin=N/A
   SubmitTime=2017-04-21T11:30:18 EligibleTime=2017-04-21T11:30:18
   StartTime=2017-04-21T11:30:49 EndTime=2017-04-24T11:30:56 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=mem_0064 AllocNode:Sid=l31:28554
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=n41-001
   BatchHost=n41-001
   NumNodes=1 NumCPUs=32 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=32,node=1,gres/cpu_mem_0064=32
   Socks/Node=* NtasksPerN:B:S:C=16:0:*:* CoreSpec=*
   MinCPUsNode=16 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=cpu_mem_0064:32 Reservation=(null)
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/lv70824/training/Examples/06_node_access_job_control/linpack/job.sh
   WorkDir=/home/lv70824/training/Examples/06_node_access_job_control/linpack
   StdErr=/home/lv70824/training/Examples/06_node_access_job_control/linpack/slurm-5264675.out
   StdIn=/dev/null
   StdOut=/home/lv70824/training/Examples/06_node_access_job_control/linpack/slurm-5264675.out
   Power=
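Besides showing job information, scontrol can also modify pending or running jobs. A few commonly used subcommands, sketched with the job ID from above as a placeholder:

[training@l31 linpack]$ scontrol hold 5264675                                # keep a pending job from starting
[training@l31 linpack]$ scontrol release 5264675                             # release a held job
[training@l31 linpack]$ scontrol update JobId=5264675 TimeLimit=1-00:00:00   # reduce the time limit (raising it usually requires admin rights)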
Accounting: sacct
Default output:
[jz@l31 ~]$ sacct -j 5115879
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
5115879            bash   mem_0064   sysadmin         32  COMPLETED      0:0
Specify output fields:
[jz@l31 ~]$ sacct -j 5115879 -o jobid,jobname,cluster,nodelist,Start,End,cputime,cputimeraw,ncpus,qos,account,ExitCode
       JobID    JobName    Cluster        NodeList               Start                 End    CPUTime CPUTimeRAW      NCPUS        QOS    Account ExitCode
------------ ---------- ---------- --------------- ------------------- ------------------- ---------- ---------- ---------- ---------- ---------- --------
5115879            bash       vsc3         n07-043 2017-03-15T14:31:39 2017-03-15T14:36:43   02:42:08       9728         32      admin   sysadmin      0:0
Inspect the man page for more options:
[jz@l31 ~]$ man sacct
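For example, to list all of your jobs in a given time window rather than a single job ID, the -S/-E and -u options can be combined (a sketch):

[jz@l31 ~]$ sacct -u jz -S 2017-03-01 -E 2017-03-31 -o jobid,jobname,nodelist,state,elapsed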
Accounting: vsc3CoreHours.py
[jz@l31 ~]$ vsc3CoreHours.py -h
usage: vsc3CoreHours.py [-h] [-S STARTTIME] [-E ENDTIME] [-D DURATION]
                        [-A ACCOUNTLIST] [-u USERNAMES] [-uni UNI]
                        [-d DETAIL_LEVEL] [-keys KEYS]

getting cpu usage - start - end time

optional arguments:
  -h, --help       show this help message and exit
  -S STARTTIME     start time, e.g., 2015-04[-01[T10[:04[:01]]]]
  -E ENDTIME       end time, e.g., 2015-04[-01[T10[:04[:01]]]]
  -D DURATION      duration, display the last D days
  -A ACCOUNTLIST   give comma separated list of projects for which the
                   accounting data is calculated, e.g., p70yyy,p70xxx.
                   Default: primary project.
  -u USERNAMES     give comma separated list of usernames
  -uni UNI         get usage statistics for one university
  -d DETAIL_LEVEL  set detail level; default=0
  -keys KEYS       show data from qos or user perspective, use either qos
                   or user; default=user
Accounting: vsc3CoreHours.py, cont.
[jz@l31 ~]$ vsc3CoreHours.py -S 2017-01-01 -E 2017-03-31 -u jz
===================================================
Accounting data time range
From: 2017-01-01 00:00:00
To:   2017-03-31 00:00:00
===================================================
Getting accounting information for the following account/user combinations:
account: sysadmin   users: jz
getting data, excluding these qos: goodluck, gpu_vis, gpu_compute
===============================
account          core_h
_______________________________
sysadmin       56844.87
_______________________________
total          56844.87