 +====== Node access and job control ======
 +
 +  * Article written by Jan Zabloudil (VSC Team) <html><br></html>(last update 2017-10-09 by jz).
 +
 +
 +
 +----
 +
 +====== Node access ======
 +
 +{{.:folie_12_1_connect.png?0x600}}
 +
 +
 +----
 +
 +====== Node access ======
 +
 +{{.:folie_12_2_connect.png?0x600}}
 +
 +
 +----
 +
 +====== Node access ======
 +
 +{{.:folie_14_salloc.png?0x600}}
 +
 +
 +----
 +
 +====== 1) Job scripts: sbatch ======
 +
 +(DEMO)
 +
 +<code bash>
 +[me@l31]$ sbatch job.sh     # submit batch job
 +[me@l31]$ squeue -u me      # find out on which node(s) job is running
 +[me@l31]$ ssh <nodename>    # connect to node
 +[me@n320-038]$ ...          # connected to node named n320-038
 +</code>
 +More about this from Claudia.
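 +
 +The job script itself is not shown on the slide; a minimal sketch consistent with the linpack demo below could look like this (the partition name appears in the exercise output later in this article, everything else is an assumption, not the official script):
 +
 +<code bash>
 +#!/bin/bash
 +#SBATCH -J hpl                  # job name (shows up in squeue)
 +#SBATCH -N 1                    # number of nodes
 +#SBATCH --partition=mem_0064    # assumption: partition as in the exercise output
 +
 +module purge
 +module load intel/17 intel-mkl/2017 intel-mpi/2017
 +
 +mpirun -np 16 ./xhpl
 +</code>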
 +
 +
 +----
 +
 +====== Interactive jobs: salloc ======
 +
 +2) with ''salloc'', EXAMPLE: linpack (DEMO)
 +
 +<code bash>
 +[me@l31 ~]$ cd ~/examples/06_node_access_job_control/linpack
 +[me@l31 linpack]$ salloc -J me_hpl -N 1   # allocate 1 node
 +[me@l31 linpack]$ squeue -u me            # find out which node(s) are allocated
 +[me@l31 linpack]$ srun hostname           # trigger prolog: clean-up, node check, license access
 +[me@l31 linpack]$ ssh n3XX-YYY            # connect to node n3XX-YYY
 +[me@n3XX-YYY ~]$ module purge
 +[me@n3XX-YYY ~]$ module load intel/17 intel-mkl/2017 intel-mpi/2017
 +[me@n3XX-YYY ~]$ cd my_directory_name/linpack
 +[me@n3XX-YYY linpack]$ mpirun -np 16 ./xhpl
 +</code>
 +Stop the process (Ctrl+C; Strg+C on a German keyboard).
 +
 +<code bash>
 +[me@n3XX-YYY linpack]$ exit
 +[me@l31 linpack]$ exit
 +</code>
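 +
 +As a variation (a sketch, not part of the demo), the MPI run can also be started from the login node inside the allocation with ''srun'', without ssh-ing to the compute node; whether this works out of the box depends on the MPI/PMI configuration:
 +
 +<code bash>
 +[me@l31 linpack]$ salloc -J me_hpl -N 1
 +[me@l31 linpack]$ module purge
 +[me@l31 linpack]$ module load intel/17 intel-mkl/2017 intel-mpi/2017
 +[me@l31 linpack]$ srun -n 16 ./xhpl   # srun launches the tasks on the allocated node
 +[me@l31 linpack]$ exit                # release the allocation
 +</code>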
 +
 +----
 +
 +====== Interactive jobs: salloc ======
 +
 +===== notes: =====
 +
 +<code bash>
 +[me@l31]$ srun hostname
 +</code>
 +**NOTE:** with ''salloc'' the SLURM prolog script is **not** run automatically; it is only triggered by the first ''srun'' on the allocated node(s), e.g. ''srun hostname'' as above (a short sketch follows the list below). The prolog performs basic tasks such as
 +
 +  * cleaning up after the previous job,
 +  * checking basic node functionality,
 +  * adapting firewall settings to access license servers.
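 +
 +A minimal sketch of triggering the prolog explicitly after ''salloc'' (assuming a two-node allocation; the node names are placeholders):
 +
 +<code bash>
 +[me@l31]$ salloc -N 2
 +[me@l31]$ srun hostname    # one task per node, runs the prolog on each allocated node
 +n3XX-YYY
 +n3XX-ZZZ
 +</code>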
 +
 +
 +----
 +
 +===== Exercise 2: job script =====
 +
 +1.) Now submit the same job with a job script:
 +
 +<code bash>
 +[training@l31 my_directory_name]$ sbatch job.sh
 +[training@l31 my_directory_name]$ squeue -u training
 +             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 +           5262833  mem_0064      hpl training R        0:04      1 n341-001
 +</code>
 +
 +2.) Log in to the node listed under NODELIST and run commands that give you information about your job:
 +
 +<code bash>
 +[training@l31 my_directory_name]$ ssh n341-001
 +Last login: Fri Apr 21 08:43:07 2017 from l31.cm.cluster
 +</code>
 +
 +
 +
 +----
 +
 +===== Exercise 2: job script, cont. =====
 +
 +<code bash>
 +[training@n341-001 ~]$ top
 +</code>
 +
 +<code bash>
 +top - 13:06:06 up 19 days,  2:57,  2 users,  load average: 12.42, 4.16, 1.51
 +Tasks: 442 total,  17 running, 425 sleeping,   0 stopped,   0 zombie
 +%Cpu(s):  0.1 us,  0.0 sy,  0.0 ni, 99.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
 +KiB Mem : 65940112 total, 62756072 free,  1586292 used,  1597748 buff/cache
 +KiB Swap:        0 total,        0 free,        0 used. 62903424 avail Mem
 +
 +  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 +12737 training  20   0 1915480  68580  18780 R 106.2  0.1   0:12.57 xhpl
 +12739 training  20   0 1915480  68372  18584 R 106.2  0.1   0:12.55 xhpl
 +12742 training  20   0 1856120  45412  13172 R 106.2  0.1   0:12.57 xhpl
 +12744 training  20   0 1856120  45412  13172 R 106.2  0.1   0:12.52 xhpl
 +12745 training  20   0 1856120  45464  13224 R 106.2  0.1   0:12.56 xhpl
 +...
 +</code>
 +
 +  * press ''1'' … show the load per CPU core
 +  * press ''Shift+H'' … show threads
 +
 +
 +
 +----
 +
 +3.) Cancel the job:
 +
 +<code bash>
 +[training@n341-001 ~]$ exit
 +[training@l31 ~]$ scancel <job ID>
 +</code>
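 +
 +''scancel'' can also cancel jobs by user or by job name instead of by job ID (a sketch, not part of the exercise):
 +
 +<code bash>
 +[training@l31 ~]$ scancel -u training   # cancel all jobs of user training
 +[training@l31 ~]$ scancel -n hpl        # cancel all jobs named hpl
 +</code>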
 +
 +----
 +
 +===== Prolog Failure =====
 +
 +If a check in the SLURM prolog script fails on one of the nodes assigned to your job, you will see a message like the following in your slurm-$JOBID.out file:
 +
 +<code bash>
 +Error running slurm prolog: 228
 +</code>
 +The error code (228 in this example) tells you which check has failed. The currently defined error codes are:
 +
 +<code bash>
 +ERROR_MEMORY=200
 +ERROR_INFINIBAND_HW=201
 +ERROR_INFINIBAND_SW=202
 +ERROR_IPOIB=203
 +ERROR_BEEGFS_SERVICE=204
 +ERROR_BEEGFS_USER=205
 +ERROR_BEEGFS_SCRATCH=206
 +ERROR_NFS=207
 +ERROR_USER_GROUP=220
 +ERROR_USER_HOME=221
 +ERROR_GPFS_START=228
 +ERROR_GPFS_MOUNT=229
 +ERROR_GPFS_UNMOUNT=230
 +</code>
 +
 +  * the affected node will be //drained//, i.e. unavailable for subsequent jobs until it is fixed (how to inspect the node state is sketched below)
 +  * simply resubmit your job
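 +
 +To check whether a node has indeed been drained, and why, the standard SLURM commands can be used; a short sketch (node name taken from the exercise above):
 +
 +<code bash>
 +VSC-4 >  sinfo -R                      # list down/drained nodes together with the reason
 +VSC-4 >  scontrol show node n341-001   # look at the State= and Reason= fields of one node
 +</code>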
 +
 +
 +----
 +
 +===== squeue options =====
 +
 +<code bash>
 +VSC-4 >  squeue -u $USER
 +             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 +            409879  mem_0096 training       sh  R    1:10:59      3 n407-030,n411-[007-008]
 +
 +VSC-4 >  squeue  -p mem_0096 | more
 +             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 +             ..........................................................................
 +            407141  mem_0096 interfac   mjech3 PD       0:00      4 (Priority)
 +      409880_[1-6]  mem_0096 OP1_2_Fi wiesmann PD       0:00      1 (Resources,Priority)
 +            409879  mem_0096 training       sh  R    1:30:16      3 n407-030,n411-[007-008]
 +            402078  mem_0096      sim mboesenh  R 4-22:46:13      1 n403-007
 +</code>
 +
 +  * the output can optionally be reformatted via ''-o'', for example:
 +
 +<code bash>
 +VSC-4 >  squeue -u $USER -o "%.18i %.10Q  %.12q %.8j %.8u %8U %.2t %.10M %.6D %R"
 +             JOBID   PRIORITY           QOS     NAME     USER UID      ST       TIME  NODES NODELIST(REASON)
 +            409879    3000982         admin training       sh 71177        1:34:55      3 n407-030,n411-[007-008]
 +</code>
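 +
 +A few further ''squeue'' options that are often useful (a generic sketch, not specific to VSC):
 +
 +<code bash>
 +VSC-4 >  squeue -u $USER -t PD     # show only pending jobs
 +VSC-4 >  squeue -u $USER --start   # expected start times of pending jobs
 +VSC-4 >  squeue -j 409879 -l       # long format for a single job
 +</code>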
 +
 +
 +
 +----
 +
 +===== scontrol =====
 +
 +<code bash>
 +VSC-4 >  scontrol show job 409879
 +JobId=409879 JobName=training
 +   UserId=sh(71177) GroupId=sysadmin(60000) MCS_label=N/A
 +   Priority=3000982 Nice=0 Account=sysadmin QOS=admin
 +   JobState=RUNNING Reason=None Dependency=(null)
 +   Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
 +   RunTime=01:36:35 TimeLimit=10-00:00:00 TimeMin=N/A
 +   SubmitTime=2020-10-12T15:50:44 EligibleTime=2020-10-12T15:50:44
 +   AccrueTime=2020-10-12T15:50:44
 +   StartTime=2020-10-12T15:51:07 EndTime=2020-10-22T15:51:07 Deadline=N/A
 +   PreemptTime=None SuspendTime=None SecsPreSuspend=0
 +   LastSchedEval=2020-10-12T15:51:07
 +   Partition=mem_0096 AllocNode:Sid=l40:186217
 +   ReqNodeList=(null) ExcNodeList=(null)
 +   NodeList=n407-030,n411-[007-008]
 +   BatchHost=n407-030
 +   NumNodes=3 NumCPUs=288 NumTasks=3 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
 +   TRES=cpu=288,mem=288888M,node=3,billing=288,gres/cpu_mem_0096=288
 +   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
 +   MinCPUsNode=1 MinMemoryNode=96296M MinTmpDiskNode=0
 +   Features=(null) DelayBoot=00:00:00
 +   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
 +   Command=(null)
 +   WorkDir=/home/fs60000/sh
 +   Power=
 +   TresPerNode=cpu_mem_0096:96
 +</code>
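 +
 +Besides ''show job'', ''scontrol'' can also be used to hold, release, or modify your own jobs; a short sketch (job ID taken from the example above, the new time limit is arbitrary):
 +
 +<code bash>
 +VSC-4 >  scontrol hold 409879                                # put a pending job on hold
 +VSC-4 >  scontrol release 409879                             # release it again
 +VSC-4 >  scontrol update JobId=409879 TimeLimit=1-00:00:00   # reduce the time limit (users can usually only decrease it)
 +</code>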
 +
 +
 +----
 +
 +===== Accounting: sacct =====
 +
 +Default output:
 +
 +<code bash>
 +
 +VSC-4 >  sacct -j 409878 
 +       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
 +------------ ---------- ---------- ---------- ---------- ---------- -------- 
 +409878         training   mem_0096   sysadmin        288  COMPLETED      0:0 
 +</code>
 +
 +Adjusting the output via ''-o'':
 +
 +<code bash>
 +VSC-4 >  sacct -j 409878 -o jobid,jobname,cluster,nodelist,Start,End,cputime,cputimeraw,ncpus,qos,account,ExitCode
 +       JobID    JobName    Cluster        NodeList               Start                 End    CPUTime CPUTimeRAW      NCPUS        QOS    Account ExitCode 
 +------------ ---------- ---------- --------------- ------------------- ------------------- ---------- ---------- ---------- ---------- ---------- -------- 
 +409878         training       vsc4 n407-030,n411-+ 2020-10-12T15:49:42 2020-10-12T15:51:06   06:43:12      24192        288      admin   sysadmin      0:0 
 +</code>
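 +
 +''sacct'' can also list all jobs of a user within a time range; a sketch (user and dates chosen for illustration):
 +
 +<code bash>
 +VSC-4 >  sacct -u training -S 2020-10-01 -E 2020-10-13 -X -o jobid,jobname,partition,state,elapsed,ncpus
 +</code>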
 +
 +
 +
 +----
 +
 +===== Accounting: vsc3CoreHours.py =====
 +
 +<code bash>
 +VSC-3 >  vsc3CoreHours.py -h
 +usage: vsc3CoreHours.py [-h] [-S STARTTIME] [-E ENDTIME] [-D DURATION]
 +                        [-A ACCOUNTLIST] [-u USERNAMES] [-uni UNI]
 +                        [-d DETAIL_LEVEL] [-keys KEYS]
 +
 +getting cpu usage - start - end time
 +
 +optional arguments:
 +  -h, --help       show this help message and exit
 +  -S STARTTIME     start time, e.g., 2015-04[-01[T10[:04[:01]]]]
 +  -E ENDTIME       end time, e.g., 2015-04[-01[T10[:04[:01]]]]
 +  -D DURATION      duration, display the last D days
 +  -A ACCOUNTLIST   give comma separated list of projects for which the
 +                   accounting data is calculated, e.g., p70yyy,p70xxx.
 +                   Default: primary project.
 +  -u USERNAMES     give comma separated list of usernames
 +  -uni UNI         get usage statistics for one university
 +  -d DETAIL_LEVEL  set detail level; default=0
 +  -keys KEYS       show data from qos or user perspective, use either qos or
 +                   user; default=user
 +</code>
 +
 +
 +
 +----
 +
 +===== Accounting: vsc3CoreHours.py, cont. =====
 +
 +<code bash>
 +VSC-3 >  vsc3CoreHours.py -S 2020-01-01 -E 2020-10-01 -u sh
 +===================================================
 +Accounting data time range
 +From: 2020-01-01 00:00:00
 +To:   2020-10-01 00:00:00
 +===================================================
 +Getting accounting information for the following account/user combinations:
 +account:   sysadmin users: sh
 +getting data, excluding these qos: goodluck, gpu_vis, gpu_compute
 +===============================
 +             account     core_h
 +_______________________________
 +            sysadmin   25775.15
 +_______________________________
 +               total   25775.15
 +</code>
 +
 +
 +<HTML>
 +<!--# profiling
 +
 +## hdf5 file
 +
 +```{.bash}
 +[jz@l31 ~]$ squeue -u jz
 +             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 +           5115915  mem_0064     bash       jz  R      40:12      1 n05-071
 +[jz@l31 ~]$ ls /opt/profiling/slurm/jz/5115915_0_n05-071.h5
 +/opt/profiling/slurm/jz/5115915_0_n05-071.h5
 +```
 +-->
 +</HTML>
 +
 +----
 +
  