====== Node access and job control ======

  * Article written by Jan Zabloudil (VSC Team)

===== Node access =====

You can access compute nodes:

  - … after submitting a job script
  - … in interactive sessions

----
===== Job scripts: sbatch =====

<code bash>
[jz@l31 somedirectory]$ sbatch job.sh
Submitted batch job 54321
[jz@l31 somedirectory]$ squeue -u jz
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
     54321       ...      ...       jz  R        ...      1 n20-038
[jz@l31 somedirectory]$ ssh n20-038
Last login: Wed Mar 15 14:26:01 2017 from l31.cm.cluster
[jz@n20-038 ~]$
</code>
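The job script ''job.sh'' itself is not shown above. A minimal sketch of what it might contain (the job name, node count, time limit, and the commands are assumptions taken from the exercises below; adapt them to your own job):

<code bash>
#!/bin/bash
#SBATCH -J hpl            # job name (assumption)
#SBATCH -N 1              # number of nodes
#SBATCH --time=01:00:00   # wall-time limit (assumption)

# commands executed on the first allocated node
module load intel/17
mpirun -np 16 ./xhpl
</code>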
+ | |||
+ | ---- | ||
+ | |||
+ | ===== Interactive jobs: salloc ===== | ||
+ | |||
+ | < | ||
+ | < | ||
+ | <div style=" | ||
+ | </ | ||
+ | === Option 1: === | ||
+ | |||
+ | <code bash> | ||
+ | [jz@l31 somedirectory]$ salloc -N1 | ||
+ | salloc: Pending job allocation 5115879 | ||
+ | salloc: job 5115879 queued and waiting for resources | ||
+ | salloc: job 5115879 has been allocated resources | ||
+ | salloc: Granted job allocation 5115879 | ||
+ | [jz@l31 somedirectory]$ squeue -u jz | ||
+ | JOBID PARTITION | ||
+ | | ||
+ | </ | ||
+ | |||
+ | **NOTE:** the slurm prolog script is **not** automatically run in this case. The prolog performs basic tasks such as | ||
+ | |||
+ | * clean up from previous job, | ||
+ | * checking basic node functionality, | ||
+ | * adapting firewall settings to access license servers. | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | ---- | ||
+ | |||
+ | ===== Interactive jobs: salloc ===== | ||
+ | |||
+ | To trigger the execution of the prolog you need to run an **srun** command, e.g.: | ||
+ | |||
+ | <code bash> | ||
+ | [jz@l31 somedirectory]$ srun hostname | ||
+ | </ | ||
+ | |||
+ | Then access the node: | ||
+ | |||
+ | <code bash> | ||
+ | [jz@l31 somedirectory]$ ssh n07-043 | ||
+ | Warning: Permanently added ' | ||
+ | [jz@n07-043 ~]$ | ||
+ | </ | ||
+ | |||
+ | |||
+ | |||
+ | ---- | ||
+ | |||
+ | ===== Interactive jobs: salloc ===== | ||
+ | |||
+ | === Option 2: === | ||
+ | |||
+ | <code bash> | ||
+ | [jz@l31 somedirectory]$ salloc -N1 srun --pty --preserve-env $SHELL | ||
+ | salloc: Pending job allocation 5115908 | ||
+ | salloc: job 5115908 queued and waiting for resources | ||
+ | salloc: job 5115908 has been allocated resources | ||
+ | salloc: Granted job allocation 5115908 | ||
+ | [jz@n09-046 somedirectory]$ | ||
+ | </ | ||
+ | <code bash> | ||
+ | [jz@l31 somedirectory]$ salloc srun -N1 --pty --preserve-env $SHELL | ||
+ | salloc: Pending job allocation 5115909 | ||
+ | salloc: job 5115909 queued and waiting for resources | ||
+ | salloc: job 5115909 has been allocated resources | ||
+ | salloc: Granted job allocation 5115909 | ||
+ | [jz@n15-062 somedirectory]$ | ||
+ | </ | ||
+ | |||
+ | |||
+ | ---- | ||
+ | |||
+ | ===== No job on node ===== | ||
+ | |||
+ | <code bash> | ||
+ | [jz@l31 somedirectory]$ squeue -u jz | ||
+ | JOBID PARTITION | ||
+ | |||
+ | [jz@l31 somedirectory]$ ssh n15-002 | ||
+ | Warning: Permanently added ' | ||
+ | Access denied: user jz (uid=70497) has no active jobs on this node. | ||
+ | Connection closed by 10.141.15.2 | ||
+ | </ | ||
+ | |||
+ | ---- | ||
+ | |||
+ | ===== Exercise 1: interactive job ===== | ||
+ | |||
+ | 1.) login as user // | ||
+ | |||
+ | <code bash> | ||
+ | [...]$ ssh training@vsc3.vsc.ac.at | ||
+ | </ | ||
+ | or | ||
+ | |||
+ | <code bash> | ||
+ | [...]$ su - training | ||
+ | </ | ||
+ | create a directory and copy example: | ||
+ | |||
+ | <code bash> | ||
+ | [training@l31]$ mkdir my_directory_name | ||
+ | [training@l31]$ cd my_directory_name | ||
+ | [training@l31 my_directory_name]$ cp -r ~/ | ||
+ | [training@l31 my_directory_name]$ ls | ||
+ | HPL.dat | ||
+ | </ | ||
+ | |||
+ | |||
+ | |||
+ | ---- | ||
+ | |||
+ | ===== Exercise 1: interactive job, cont. ===== | ||
+ | |||
+ | 2.) Allocate one node for an interactive session: | ||
+ | |||
+ | (name the job in a useful way with the ‘-J’ option) | ||
+ | |||
+ | <code bash> | ||
+ | [training@l31 linpack]$ salloc -J jz_hpl -N 1 | ||
+ | salloc: Pending job allocation 5260456 | ||
+ | salloc: job 5260456 queued and waiting for resources | ||
+ | salloc: job 5260456 has been allocated resources | ||
+ | salloc: Granted job allocation 5260456 | ||
+ | [training@l31 linpack]$ squeue -u training | ||
+ | JOBID PARTITION | ||
+ | | ||
+ | </ | ||
+ | |||
+ | 3.) Run an srun command: | ||
+ | |||
+ | <code bash> | ||
+ | [training@l31 linpack]$ srun hostname | ||
+ | </ | ||
+ | |||
+ | |||
+ | |||
+ | ---- | ||
+ | |||
+ | ===== Exercise 1: interactive job, cont. ===== | ||
+ | |||
+ | 4.) Login to allocated node and execute a program: | ||
+ | |||
+ | <code bash> | ||
+ | [training@l31 linpack]$ ssh n41-001 | ||
+ | Last login: Thu Apr 20 15:38:33 2017 from l31.cm.cluster | ||
+ | |||
+ | [training@n41-001 ~]$ cd my_directory_name/ | ||
+ | |||
+ | [training@n41-001 linpack]$ module load intel/17 intel-mkl/ | ||
+ | Loading intel/17 from: / | ||
+ | Loading intel-mkl/ | ||
+ | Loading intel-mpi/ | ||
+ | |||
+ | [training@n41-001 linpack]$ mpirun -np 16 ./xhpl | ||
+ | |||
+ | Number of Intel(R) Xeon Phi(TM) coprocessors : 0 | ||
+ | ================================================================================ | ||
+ | HPLinpack 2.1 -- High-Performance Linpack benchmark | ||
+ | Written by A. Petitet and R. Clint Whaley, | ||
+ | Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK | ||
+ | Modified by Julien Langou, University of Colorado Denver | ||
+ | ================================================================================ | ||
+ | |||
+ | </ | ||
+ | |||
+ | |||
+ | |||
+ | ---- | ||
+ | |||
+ | ===== Exercise 1: interactive job, cont. ===== | ||
+ | |||
+ | 5.) stop the process (Ctlr+C or Strg+C), log out of the node and the delete job: | ||
+ | |||
+ | <code bash> | ||
+ | [training@n41-001 linpack]$ exit | ||
+ | [training@l31 linpack]$ exit | ||
+ | exit | ||
+ | salloc: Relinquishing job allocation 5260456 | ||
+ | salloc: Job allocation 5260456 has been revoked. | ||
+ | </ | ||
+ | |||
+ | |||
+ | |||
+ | ---- | ||
+ | |||
+ | ===== Exercise 1: interactive job, summary ===== | ||
+ | |||
+ | <code bash> | ||
+ | [training@l31 linpack]$ salloc -J jz_hpl -N 1 | ||
+ | [training@l31 linpack]$ squeue -u training | ||
+ | [training@l31 linpack]$ srun hostname | ||
+ | [training@l31 linpack]$ ssh nXX-YYY | ||
+ | [training@nXX-YYY ~]$ module load intel/17 intel-mkl/ | ||
+ | [training@nXX-YYY ~]$ cd my_directory_name/ | ||
+ | [training@nXX-YYY linpack]$ mpirun -np 16 ./xhpl | ||
+ | </ | ||
+ | stop the process (Ctlr+C or Strg+C) | ||
+ | |||
+ | <code bash> | ||
+ | [training@nXX-YYY linpack]$ exit | ||
+ | [training@l31 linpack]$ exit | ||
+ | </ | ||
+ | |||
+ | |||
+ | ---- | ||
+ | |||
+ | ===== Exercise 2: job script ===== | ||
+ | |||
+ | 1.) Now submit the same job with a job script: | ||
+ | |||
+ | <code bash> | ||
+ | [training@l31 my_directory_name]$ sbatch job.sh | ||
+ | [training@l31 my_directory_name]$ squeue -u training | ||
+ | JOBID PARTITION | ||
+ | | ||
+ | </ | ||
+ | |||
+ | 2.) Login to the node listed under NODELIST and perform commands which will give you information about your job | ||
+ | |||
+ | <code bash> | ||
+ | [training@l31 my_directory_name]$ ssh n41-001 | ||
+ | Last login: Fri Apr 21 08:43:07 2017 from l31.cm.cluster | ||
+ | </ | ||
+ | |||
+ | |||
+ | |||
+ | ---- | ||
+ | |||
+ | ===== Exercise 2: job script, cont. ===== | ||
+ | |||
+ | <code bash> | ||
+ | [training@n41-001 ~]$ top | ||
+ | </ | ||
+ | |||
+ | <code bash> | ||
+ | top - 13:06:06 up 19 days, 2:57, 2 users, | ||
+ | Tasks: 442 total, | ||
+ | %Cpu(s): | ||
+ | KiB Mem : 65940112 total, 62756072 free, 1586292 used, 1597748 buff/cache | ||
+ | KiB Swap: 0 total, | ||
+ | |||
+ | PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND | ||
+ | 12737 training | ||
+ | 12739 training | ||
+ | 12742 training | ||
+ | 12744 training | ||
+ | 12745 training | ||
+ | ... | ||
+ | </ | ||
+ | |||
+ | * 1 … show cpu core load | ||
+ | * Shift+H … show threads | ||
+ | |||
+ | |||
+ | |||
+ | ---- | ||
+ | |||
+ | 3.) Cancel the job: | ||
+ | |||
+ | <code bash> | ||
+ | [training@n41-001 ~]$ exit | ||
+ | [training@l31 ~]$ scancel <job ID> | ||
+ | </ | ||
+ | < | ||
+ | </ | ||
+ | </ | ||
+ | </ | ||
+ | |||
+ | ---- | ||
+ | |||
+ | ===== Prolog Failure ===== | ||
+ | |||
+ | If a check in the SLURM prolog script fails on one of the nodes assigned to your job, you will see a message like the following in your slurm-$JOBID.out file: | ||
+ | |||
+ | <code bash> | ||
+ | Error running slurm prolog: 228 | ||
+ | </ | ||
+ | The error code (228) tells you what kind of check has failed. A list of currently existing error codes is: | ||
+ | |||
+ | <code bash> | ||
+ | ERROR_MEMORY=200 | ||
+ | ERROR_INFINIBAND_HW=201 | ||
+ | ERROR_INFINIBAND_SW=202 | ||
+ | ERROR_IPOIB=203 | ||
+ | ERROR_BEEGFS_SERVICE=204 | ||
+ | ERROR_BEEGFS_USER=205 | ||
+ | ERROR_BEEGFS_SCRATCH=206 | ||
+ | ERROR_NFS=207 | ||
+ | ERROR_USER_GROUP=220 | ||
+ | ERROR_USER_HOME=221 | ||
+ | ERROR_GPFS_START=228 | ||
+ | ERROR_GPFS_MOUNT=229 | ||
+ | ERROR_GPFS_UNMOUNT=230 | ||
+ | </ | ||
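In a post-processing script you may want to translate the numeric code back to its name. A small sketch using the table above (this helper is hypothetical, not part of the VSC tooling; only a few codes are shown):

<code bash>
# hypothetical helper: map a prolog exit code to the error name listed above
prolog_error_name() {
    case "$1" in
        200) echo ERROR_MEMORY ;;
        201) echo ERROR_INFINIBAND_HW ;;
        207) echo ERROR_NFS ;;
        228) echo ERROR_GPFS_START ;;
        *)   echo "unknown code $1" ;;
    esac
}

prolog_error_name 228   # prints ERROR_GPFS_START
</code>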
  * the affected node will be //drained// (unavailable for subsequent jobs until the problem is fixed)
  * resubmit your job

----
+ | |||
+ | ===== squeue options ===== | ||
+ | |||
+ | <code bash> | ||
+ | [jz@l31 ~]$ squeue -u jz | ||
+ | JOBID PARTITION | ||
+ | | ||
+ | |||
+ | [jz@l31 ~]$ alias squeue=' | ||
+ | [jz@l31 ~]$ squeue | ||
+ | | ||
+ | | ||
+ | </ | ||
+ | |||
+ | * Formatting option: -o < | ||
+ | * The format of each field is '' | ||
+ | * “.” means right justification. | ||
+ | * “size” is the field width | ||
+ | * “type” the field type (JOBID, PRIORITY, etc.) | ||
+ | |||
+ | |||
+ | <code bash> | ||
+ | [jz@l31 ~]$ man squeue | ||
+ | </ | ||
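The ''size'' and ''.'' rules behave like printf-style fixed-width fields; for instance ''%.18i %.9P'' prints the job ID right-justified in 18 characters and the partition in 9. A plain-shell illustration of the same idea (the job ID and partition name are made-up example values):

<code bash>
# right-justified fixed-width columns, as squeue's '%.18i %.9P' would produce
printf '%18s %9s\n' JOBID PARTITION
printf '%18s %9s\n' 5115879 batch
</code>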
+ | |||
+ | |||
+ | |||
+ | ---- | ||
+ | |||
+ | ===== scontrol ===== | ||
+ | |||
+ | <code bash> | ||
+ | [training@l31 linpack]$ scontrol show job 5264675 | ||
+ | JobId=5264675 JobName=hpl | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | </ | ||
+ | |||
+ | ---- | ||
+ | |||
+ | ===== Accounting: sacct ===== | ||
+ | |||
+ | default: | ||
+ | |||
+ | <code bash> | ||
+ | [jz@l31 ~]$ sacct -j 5115879 | ||
+ | | ||
+ | ------------ ---------- ---------- ---------- ---------- ---------- -------- | ||
+ | 5115879 | ||
+ | </ | ||
+ | |||
+ | specify options: | ||
+ | |||
+ | <code bash> | ||
+ | [jz@l31 ~]$ sacct -j 5115879 -o jobid, | ||
+ | | ||
+ | ------------ ---------- ---------- --------------- ------------------- ------------------- ---------- ---------- ---------- ---------- ---------- -------- | ||
+ | 5115879 | ||
+ | </ | ||
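sacct reports elapsed times as HH:MM:SS (long jobs get a D- day prefix). To use such a value in a script you may want it in seconds; a minimal bash sketch for the HH:MM:SS case (the example value is made up):

<code bash>
# convert an HH:MM:SS elapsed string to seconds (day prefix not handled)
elapsed="01:30:00"
IFS=: read -r h m s <<< "$elapsed"
echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))   # prints 5400
</code>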
+ | |||
+ | inspect man page for more options: | ||
+ | |||
+ | <code bash> | ||
+ | [jz@l31 ~]$ man sacct | ||
+ | </ | ||
+ | |||
+ | |||
+ | |||
+ | ---- | ||
+ | |||
+ | ===== Accounting: vsc3CoreHours.py ===== | ||
+ | |||
+ | <code bash> | ||
+ | [jz@l31 ~]$ vsc3CoreHours.py -h | ||
+ | usage: vsc3CoreHours.py [-h] [-S STARTTIME] [-E ENDTIME] [-D DURATION] | ||
+ | [-A ACCOUNTLIST] [-u USERNAMES] [-uni UNI] | ||
+ | [-d DETAIL_LEVEL] [-keys KEYS] | ||
+ | |||
+ | getting cpu usage - start - end time | ||
+ | |||
+ | optional arguments: | ||
+ | -h, --help | ||
+ | -S STARTTIME | ||
+ | -E ENDTIME | ||
+ | -D DURATION | ||
+ | -A ACCOUNTLIST | ||
+ | | ||
+ | | ||
+ | -u USERNAMES | ||
+ | -uni UNI get usage statistics for one university | ||
+ | -d DETAIL_LEVEL | ||
+ | -keys KEYS show data from qos or user perspective, | ||
+ | user; default=user | ||
+ | </ | ||
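The core hours reported by the tool follow the usual accounting rule: core hours = allocated cores × wall-clock hours. For example, one node for two hours (assuming 16 cores per node, which is an assumption about the hardware) costs:

<code bash>
# core hours = nodes * cores_per_node * walltime_hours
nodes=1; cores_per_node=16; walltime_hours=2
echo $(( nodes * cores_per_node * walltime_hours ))   # prints 32
</code>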
+ | |||
+ | |||
+ | |||
+ | ---- | ||
+ | |||
+ | ===== Accounting: vsc3CoreHours.py, | ||
+ | |||
+ | <code bash> | ||
+ | [jz@l31 ~]$ vsc3CoreHours.py -S 2017-01-01 -E 2017-03-31 -u jz | ||
+ | =================================================== | ||
+ | Accounting data time range | ||
+ | From: 2017-01-01 00:00:00 | ||
+ | To: | ||
+ | =================================================== | ||
+ | Getting accounting information for the following account/ | ||
+ | account: | ||
+ | getting data, excluding these qos: goodluck, gpu_vis, gpu_compute | ||
+ | =============================== | ||
+ | | ||
+ | _______________________________ | ||
+ | sysadmin | ||
+ | _______________________________ | ||
+ | | ||
+ | </ | ||
+ | |||
+ | |||
+ | < | ||
+ | <!--# profiling | ||
+ | |||
+ | ## hdf5 file | ||
+ | |||
+ | ```{.bash} | ||
+ | [jz@l31 ~]$ squeue -u jz | ||
+ | JOBID PARTITION | ||
+ | | ||
+ | [jz@l31 ~]$ ls / | ||
+ | / | ||
+ | ``` | ||
+ | --> | ||
+ | </ | ||
+ | |||
+ | ---- | ||