====== Node access & job control ======

pandoc:introduction-to-vsc:06_node_access_job_control:node_access_job_control (current revision: 2020/10/20, Pandoc Auto-commit)

  * Article written by Jan Zabloudil (VSC Team)
====== Node access ======

  - ... after submitting a job script
  - ... in interactive sessions

----
====== Job scripts: sbatch ======

<code bash>
[jz@l31 somedirectory]$ sbatch job.sh        # submit batch job
Submitted batch job 54321
[jz@l31 somedirectory]$ squeue -u jz         # find out on which node(s) the job is running
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
     54321 ...
[jz@l31 somedirectory]$ ssh n20-038          # log in to a node where the job runs
Last login: Wed Mar 15 14:26:01 2017 from l31.cm.cluster
[jz@n20-038 ~]$
</code>

more on job scripts from Claudia
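The contents of ''job.sh'' are not shown on this slide. As an illustration only, a minimal SLURM batch script could look like the sketch below — job name, resources, time limit, and the program to run are all placeholders, not VSC defaults:

```shell
#!/bin/bash
# Hypothetical minimal job.sh -- every value below is a placeholder,
# adapt it to your project and to the cluster's partitions.
#SBATCH --job-name=my_job          # name shown in squeue
#SBATCH --nodes=1                  # number of nodes
#SBATCH --ntasks-per-node=16       # tasks (e.g. MPI ranks) per node
#SBATCH --time=01:00:00            # wallclock limit

./my_program                       # replace with your executable
```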
----
====== Interactive jobs: salloc ======

=== Option 1: ===

<code bash>
[jz@l31 somedirectory]$ salloc -N1
salloc: Pending job allocation 5115879
salloc: job 5115879 queued and waiting for resources
salloc: job 5115879 has been allocated resources
salloc: Granted job allocation 5115879
[jz@l31 somedirectory]$ squeue -u jz
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   5115879 ...
</code>

**NOTE:** the slurm prolog script is **not** automatically run in this case. The prolog performs basic tasks such as

  * cleaning up after the previous job,
  * checking basic node functionality,
  * adapting firewall settings to access license servers.
----
====== Interactive jobs: salloc, cont. ======

To trigger the execution of the prolog you need to run an **srun** command, e.g.:

<code bash>
[jz@l31 somedirectory]$ srun hostname
</code>

Then access the node:

<code bash>
[jz@l31 somedirectory]$ ssh n07-043
Warning: Permanently added 'n07-043' (ECDSA) to the list of known hosts.
[jz@n07-043 ~]$
</code>

----
====== Interactive jobs: salloc, cont. ======

=== Option 2: ===

<code bash>
[jz@l31 somedirectory]$ salloc -N1 srun --pty --preserve-env $SHELL
salloc: Pending job allocation 5115908
salloc: job 5115908 queued and waiting for resources
salloc: job 5115908 has been allocated resources
salloc: Granted job allocation 5115908
[jz@n09-046 somedirectory]$
</code>

<code bash>
[jz@l31 somedirectory]$ salloc srun -N1 --pty --preserve-env $SHELL
salloc: Pending job allocation 5115909
salloc: job 5115909 queued and waiting for resources
salloc: job 5115909 has been allocated resources
salloc: Granted job allocation 5115909
[jz@n15-062 somedirectory]$
</code>

----
====== No job on node ======

<code bash>
[jz@l31 somedirectory]$ squeue -u jz
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

[jz@l31 somedirectory]$ ssh n15-002
Warning: Permanently added 'n15-002' (ECDSA) to the list of known hosts.
Access denied: user jz (uid=70497) has no active jobs on this node.
Connection closed by 10.141.15.2
</code>
----

====== Exercise 1: interactive job ======

1.) login as user //training//:

<code bash>
[...]$ ssh training@vsc3.vsc.ac.at
</code>

or

<code bash>
[...]$ su - training
</code>

create a directory and copy the example:

<code bash>
[training@l31]$ mkdir my_directory_name
[training@l31]$ cd my_directory_name
[training@l31 my_directory_name]$ cp -r ~/... .
[training@l31 my_directory_name]$ ls
HPL.dat ...
</code>

----
====== Exercise 1: interactive job, cont. ======

2.) Allocate one node for an interactive session (name the job in a useful way with the ''-J'' option):

<code bash>
[training@l31 linpack]$ salloc -J jz_hpl -N 1
salloc: Pending job allocation 5260456
salloc: job 5260456 queued and waiting for resources
salloc: job 5260456 has been allocated resources
salloc: Granted job allocation 5260456
[training@l31 linpack]$ squeue -u training    # find out which node(s) is(are) allocated
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   5260456 ...
</code>

3.) Run an srun command:

<code bash>
[training@l31 linpack]$ srun hostname
</code>

----
====== Exercise 1: interactive job, cont. ======

4.) Login to the allocated node and execute the program:

<code bash>
[training@l31 linpack]$ ssh n41-001
Last login: Thu Apr 20 15:38:33 2017 from l31.cm.cluster

[training@n41-001 ~]$ cd my_directory_name/

[training@n41-001 linpack]$ module load intel/17 intel-mkl/... intel-mpi/...
Loading intel/17 from: /...
Loading intel-mkl/...
Loading intel-mpi/...

[training@n41-001 linpack]$ mpirun -np 16 ./xhpl

Number of Intel(R) Xeon Phi(TM) coprocessors : 0
================================================================================
HPLinpack 2.1 -- High-Performance Linpack benchmark
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
</code>

----
====== Exercise 1: interactive job, cont. ======

5.) stop the process (Ctrl+C or Strg+C), log out of the node and delete the job:

<code bash>
[training@n41-001 linpack]$ exit
[training@l31 linpack]$ exit
exit
salloc: Relinquishing job allocation 5260456
salloc: Job allocation 5260456 has been revoked.
</code>

----

====== Exercise 1: interactive job, summary ======

<code bash>
[training@l31 linpack]$ salloc -J jz_hpl -N 1
[training@l31 linpack]$ squeue -u training
[training@l31 linpack]$ srun hostname
[training@l31 linpack]$ ssh nXX-YYY
[training@nXX-YYY ~]$ module purge
[training@nXX-YYY ~]$ module load intel/17 intel-mkl/... intel-mpi/...
[training@nXX-YYY ~]$ cd my_directory_name/
[training@nXX-YYY linpack]$ mpirun -np 16 ./xhpl
</code>

stop the process (Ctrl+C or Strg+C)

<code bash>
[training@nXX-YYY linpack]$ exit
[training@l31 linpack]$ exit
</code>

----
<code bash>
[training@l31 my_directory_name]$ squeue -u training
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
       ...
</code>
<code bash>
[training@l31 my_directory_name]$ ssh n341-001
Last login: Fri Apr 21 08:43:07 2017 from l31.cm.cluster
</code>
<code bash>
[training@n341-001 ~]$ top
</code>

  * 1 … show cpu core load
  * Shift+H … show individual threads
----

====== Exercise 2: job script, cont. ======

<code bash>
[training@n341-001 ~]$ ps -U training u     # user-oriented output format
[training@n341-001 ~]$ ps -U training f     # forest: show the process hierarchy
[training@n341-001 ~]$ ps -U training e     # show the environment after the command
[training@n341-001 ~]$ ps -U training eu
[training@n341-001 ~]$ ps -U training ef
[training@n341-001 ~]$ ps -U training euf
</code>

3.) Cancel the job:

<code bash>
[training@n341-001 ~]$ exit
[training@l31 my_directory_name]$ scancel <job ID>
</code>

----
----

===== Prolog Failure =====

If a check in the SLURM prolog script fails on one of the nodes assigned to your job, you will see a message like the following in your slurm-$JOBID.out file:

<code bash>
Error running slurm prolog: 228
</code>

The error code (228) tells you what kind of check has failed. A list of currently existing error codes is:

<code bash>
ERROR_MEMORY=200
ERROR_INFINIBAND_HW=201
ERROR_INFINIBAND_SW=202
ERROR_IPOIB=203
ERROR_BEEGFS_SERVICE=204
ERROR_BEEGFS_USER=205
ERROR_BEEGFS_SCRATCH=206
ERROR_NFS=207
ERROR_USER_GROUP=220
ERROR_USER_HOME=221
ERROR_GPFS_START=228
ERROR_GPFS_MOUNT=229
ERROR_GPFS_UNMOUNT=230
</code>

  * the node will be //drained// (unavailable for subsequent jobs until fixed)
  * resubmit your job
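To make the mapping concrete, the number printed after ''Error running slurm prolog:'' corresponds to exactly one symbolic name from the table above. The helper function below is only an illustration of how to read the table, not part of VSC's tooling:

```shell
# Hypothetical helper (illustration only): translate a SLURM prolog
# exit code from the table above into its symbolic name.
prolog_error_name() {
  case "$1" in
    200) echo ERROR_MEMORY ;;
    201) echo ERROR_INFINIBAND_HW ;;
    202) echo ERROR_INFINIBAND_SW ;;
    203) echo ERROR_IPOIB ;;
    204) echo ERROR_BEEGFS_SERVICE ;;
    205) echo ERROR_BEEGFS_USER ;;
    206) echo ERROR_BEEGFS_SCRATCH ;;
    207) echo ERROR_NFS ;;
    220) echo ERROR_USER_GROUP ;;
    221) echo ERROR_USER_HOME ;;
    228) echo ERROR_GPFS_START ;;
    229) echo ERROR_GPFS_MOUNT ;;
    230) echo ERROR_GPFS_UNMOUNT ;;
    *)   echo UNKNOWN ;;
  esac
}

prolog_error_name 228   # prints ERROR_GPFS_START
```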
<code bash>
VSC-4 > squeue -u $user
    JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   409879 ...

VSC-4 > squeue
    JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
..........................................................................
   407141 ...
409880_[1-6] ...
   409879 ...
   402078 ...
</code>

  * optional reformatting via ''-o'', for example:

<code bash>
VSC-4 > squeue -u $user -o "%.18i %.10Q %.12q %.8j %.8u %8U %.2t %.10M %.6D %R"
             JOBID   PRIORITY          QOS     NAME     USER      UID ST       TIME  NODES NODELIST(REASON)
            409879 ...
</code>

  * see ''man squeue'' for the full list of format specifiers
----
<code bash>
VSC-4 > scontrol show job 409879
JobId=409879 JobName=...
   ...
   LastSchedEval=2020-10-12T15:...
   Partition=mem_0096 ...
   ...
</code>

----
<code bash>
VSC-4 > sacct -j 409878
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
409878       ...
</code>

adjust the output fields with ''-o'' (see ''man sacct'' for all available fields):

<code bash>
VSC-4 > sacct -j 409878 -o jobid,...
       JobID ...
------------ ---------- ---------- --------------- ------------------- ------------------- ...
409878       ...
</code>
<code bash>
VSC-3 > vsc3CoreHours.py -h
usage: vsc3CoreHours.py [-h] [-S STARTTIME] [-E ENDTIME] [-D DURATION]
                        [-A ACCOUNTLIST] [-u USERNAMES] [-uni UNI]
</code>

<code bash>
VSC-3 > vsc3CoreHours.py -S 2020-01-01 -E 2020-10-01 -u $user
===================================================
Accounting data time range
From: 2020-01-01 00:00:00
To:   2020-10-01 00:00:00
===================================================
Getting accounting information for the following account/user combinations:
account: ...
getting data, excluding these qos: goodluck, gpu_vis, gpu_compute
===============================
...
_______________________________
sysadmin ...
_______________________________
total ...
</code>
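The numbers reported by such accounting tools are plain core hours. Assuming the usual definition (allocated cores × wallclock time), a quick sanity check can be done directly in the shell:

```shell
# core_hours CORES SECONDS -> integer core hours
# (assumption: core hours = allocated cores x wallclock time)
core_hours() {
  local cores=$1 seconds=$2
  echo $(( cores * seconds / 3600 ))
}

core_hours 16 7200    # one 16-core node for two hours -> 32 core hours
```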
----