====== Node access and job control ======

  * Article written by Jan Zabloudil (VSC Team)

----
====== Node access ======


----
====== 1) Job scripts: sbatch ======

(DEMO)

<code bash>
[me@l31]$ sbatch job.sh
[me@l31]$ squeue -u me       # find out on which node(s) the job is running
[me@l31]$ ssh <nodename>     # e.g. ssh n320-038
[me@n320-038]$ ...           # connected to node n320-038
</code>
More about job scripts in Claudia's part of the course.
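The demo assumes a job script ''job.sh'' already exists. A minimal sketch of such a script (the job name and resource request are placeholders, not the course's actual settings):

<code bash>
#!/bin/bash
#SBATCH -J my_job       # job name shown in squeue
#SBATCH -N 1            # request one node

# everything below runs on the (first) allocated node
hostname
</code>

Lines starting with ''#SBATCH'' are directives read by ''sbatch''; to the shell they are ordinary comments, so the same file can also be run directly.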
+ | |||
+ | |||
+ | ---- | ||
+ | |||
====== Interactive jobs: salloc ======

<code bash>
[me@l31 linpack]$ salloc -J me_hpl -N 1   # allocate one node for job "me_hpl"
[me@l31 linpack]$ squeue -u me            # find out which node(s) is(are) allocated
[me@l31 linpack]$ srun hostname           # print the allocated node's name
[me@l31 linpack]$ ssh n3XX-YYY            # log in to the allocated node
[me@n3XX-YYY ~]$ module purge
[me@n3XX-YYY ~]$ module load intel/17 intel-mkl/
[me@n3XX-YYY ~]$ cd my_directory_name/
[me@n3XX-YYY linpack]$ mpirun -np 16 ./xhpl
</code>
Stop the process with Ctrl+C (Strg+C on a German keyboard).

<code bash>
[me@n3XX-YYY linpack]$ exit   # leave the compute node
[me@l31 linpack]$ exit        # give up the allocation
</code>

----

====== Interactive jobs: salloc ======

===== Notes =====

<code bash>
[me@l31]$ srun hostname
</code>
**NOTE:** the SLURM prolog script is **not** automatically run in this case. The prolog performs basic tasks such as:

  * cleaning up from the previous job,
  * checking basic node functionality,
  * adapting firewall settings for access to license servers.

----

===== Exercise 2: job script =====

1.) Now submit the same job with a job script:

<code bash>
[training@l31 my_directory_name]$ sbatch job.sh
[training@l31 my_directory_name]$ squeue -u training
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               ...
</code>

2.) Log in to the node listed under NODELIST and run commands that show information about your job:

<code bash>
[training@l31 my_directory_name]$ ssh n341-001
Last login: Fri Apr 21 08:43:07 2017 from l31.cm.cluster
</code>

----

===== Exercise 2: job script, cont. =====

<code bash>
[training@n341-001 ~]$ top
</code>

<code bash>
top - 13:06:06 up 19 days,  2:57,  2 users,  load average: ...
Tasks: 442 total,   ...
%Cpu(s): ...
KiB Mem : 65940112 total, 62756072 free,  1586292 used,  1597748 buff/cache
KiB Swap:        0 total,  ...

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
12737 training  ...
12739 training  ...
12742 training  ...
12744 training  ...
12745 training  ...
...
</code>

  * 1 … show the load per CPU core
  * Shift+H … show threads

----

3.) Cancel the job:

<code bash>
[training@n341-001 ~]$ exit
[training@l31 ~]$ scancel <job ID>
</code>

----

===== Prolog Failure =====

If a check in the SLURM prolog script fails on one of the nodes assigned to your job, you will see a message like the following in your slurm-$JOBID.out file:

<code bash>
Error running slurm prolog: 228
</code>
The error code (228) tells you which check has failed. The currently defined error codes are:

<code bash>
ERROR_MEMORY=200
ERROR_INFINIBAND_HW=201
ERROR_INFINIBAND_SW=202
ERROR_IPOIB=203
ERROR_BEEGFS_SERVICE=204
ERROR_BEEGFS_USER=205
ERROR_BEEGFS_SCRATCH=206
ERROR_NFS=207
ERROR_USER_GROUP=220
ERROR_USER_HOME=221
ERROR_GPFS_START=228
ERROR_GPFS_MOUNT=229
ERROR_GPFS_UNMOUNT=230
</code>

  * the node will be //drained// (unavailable for subsequent jobs until fixed)
  * resubmit your job
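The lookup can also be scripted; a small sketch (bash associative array, codes copied from the table above) that translates a prolog exit code into its symbolic name:

<code bash>
#!/bin/bash
# map a prolog error code (as reported in slurm-$JOBID.out)
# to its symbolic name, using the table above
declare -A prolog_error=(
  [200]=ERROR_MEMORY         [201]=ERROR_INFINIBAND_HW  [202]=ERROR_INFINIBAND_SW
  [203]=ERROR_IPOIB          [204]=ERROR_BEEGFS_SERVICE [205]=ERROR_BEEGFS_USER
  [206]=ERROR_BEEGFS_SCRATCH [207]=ERROR_NFS            [220]=ERROR_USER_GROUP
  [221]=ERROR_USER_HOME      [228]=ERROR_GPFS_START     [229]=ERROR_GPFS_MOUNT
  [230]=ERROR_GPFS_UNMOUNT
)

code=228
echo "prolog error $code: ${prolog_error[$code]:-unknown}"   # prolog error 228: ERROR_GPFS_START
</code>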
+ | |||
+ | |||
+ | ---- | ||
+ | |||
===== squeue options =====

<code bash>
VSC-4 > squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            409879 ...

VSC-4 > squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               ...
            407141 ...
      409880_[1-6] ...
            409879 ...
            402078 ...
</code>

  * the output format can be adjusted with ''-o'', for example:

<code bash>
VSC-4 > squeue -u $USER -o "%.18i %.10Q %.12q %.8j %.8u %8U %.2t %.10M %.6D %R"
             JOBID   PRIORITY          QOS     NAME     USER UID       ST       TIME  NODES NODELIST(REASON)
            409879 ...
</code>
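For scripting it is often enough to pull a single column out of the ''squeue'' output; a sketch with ''awk'' (the sample text below stands in for live cluster output, and the partition/user values are placeholders):

<code bash>
# squeue-style output captured as text (values are placeholders)
squeue_out='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
409879 skylake_0 training train01 R 0:02 1 n341-001'

# print the last column (NODELIST) of every job line, skipping the header
echo "$squeue_out" | awk 'NR > 1 { print $NF }'   # n341-001
</code>

In real use ''squeue'' can do the same directly, e.g. ''squeue -h -o %R -u $USER'' (''-h'' suppresses the header line).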
+ | |||
+ | |||
+ | < | ||
+ | </ | ||
+ | </ | ||
+ | |||
+ | ---- | ||
+ | |||
===== scontrol =====

<code bash>
VSC-4 > scontrol show job 409879
JobId=409879 JobName=training
   ...
</code>

----

===== Accounting: sacct =====

default output:

<code bash>
VSC-4 > sacct -j 409878
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
409878       ...
</code>

adjust the output with ''-o'':

<code bash>
VSC-4 > sacct -j 409878 -o jobid,...
   ...
------------ ---------- ---------- --------------- ------------------- ------------------- ---------- ---------- ---------- ---------- ---------- --------
409878       ...
</code>
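The ''AllocCPUS'' and ''Elapsed'' fields from ''sacct'' are enough to estimate the core hours a job consumed (core hours = AllocCPUS × elapsed hours). A sketch of the arithmetic, assuming the sacct-style ''[DD-]HH:MM:SS'' elapsed format (the 96-core example job is made up):

<code bash>
#!/bin/bash
elapsed_to_hours() {
  # convert a sacct-style "[DD-]HH:MM:SS" elapsed time into decimal hours
  local e=$1 days=0
  case $e in *-*) days=${e%%-*}; e=${e#*-};; esac
  local h=${e%%:*} rest=${e#*:}
  local m=${rest%%:*} s=${rest##*:}
  awk -v d="$days" -v h="$h" -v m="$m" -v s="$s" \
      'BEGIN { printf "%.2f", (d*24 + h) + m/60 + s/3600 }'
}

# e.g. a job with AllocCPUS=96 that ran for 30 minutes:
hours=$(elapsed_to_hours 00:30:00)
awk -v h="$hours" 'BEGIN { printf "%.1f core hours\n", 96 * h }'   # 48.0 core hours
</code>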
+ | |||
+ | |||
+ | |||
+ | ---- | ||
+ | |||
===== Accounting: vsc3CoreHours.py =====

<code bash>
VSC-3 > vsc3CoreHours.py -h
usage: vsc3CoreHours.py [-h] [-S STARTTIME] [-E ENDTIME] [-D DURATION]
                        [-A ACCOUNTLIST] [-u USERNAMES] [-uni UNI]
                        [-d DETAIL_LEVEL] [-keys KEYS]

getting cpu usage - start - end time

optional arguments:
  -h, --help       show this help message and exit
  -S STARTTIME     ...
  -E ENDTIME       ...
  -D DURATION      ...
  -A ACCOUNTLIST   ...
  -u USERNAMES     ...
  -uni UNI         get usage statistics for one university
  -d DETAIL_LEVEL  ...
  -keys KEYS       show data from qos or user perspective, qos or
                   user; default=user
</code>

----

===== Accounting: vsc3CoreHours.py, cont. =====

<code bash>
VSC-3 > vsc3CoreHours.py -S 2020-01-01 -E 2020-10-01 -u sh
===================================================
Accounting data time range
From: 2020-01-01 00:00:00
To:   2020-10-01 00:00:00
===================================================
Getting accounting information for the following account/user combinations:
account: ...
getting data, excluding these qos: goodluck, gpu_vis, gpu_compute
===============================
...
_______________________________
sysadmin ...
_______________________________
...
</code>

+ | |||
+ | < | ||
+ | <!--# profiling | ||
+ | |||
+ | ## hdf5 file | ||
+ | |||
+ | ```{.bash} | ||
+ | [jz@l31 ~]$ squeue -u jz | ||
+ | JOBID PARTITION | ||
+ | | ||
+ | [jz@l31 ~]$ ls / | ||
+ | / | ||
+ | ``` | ||
+ | --> | ||
+ | </ | ||
+ | |||
----