====== Node access and job control ======

  * Article written by Jan Zabloudil (VSC Team) <html><br></html>(last update 2017-10-09 by jz).


===== Node access =====

  - ... after submitting a job script
  - ... in interactive sessions


----

===== Job scripts: sbatch =====

<code bash>
[jz@l31 somedirectory]$ sbatch job.sh
Submitted batch job 54321
[jz@l31 somedirectory]$ squeue -u jz
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             54321  mem_0064     test       jz  R       0:04      1 n20-038
[jz@l31 somedirectory]$ ssh n20-038
Last login: Wed Mar 15 14:26:01 2017 from l31.cm.cluster
[jz@n20-038 ~]$
</code>
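
The contents of ''job.sh'' are not shown here; a minimal sketch of what such a script could look like (job name, node count, and partition are assumptions matching the ''squeue'' output above):

<code bash>
#!/bin/bash
#SBATCH -J test        # job name (NAME column in squeue)
#SBATCH -N 1           # number of nodes
#SBATCH -p mem_0064    # partition (assumed from the squeue output)

# commands executed on the allocated node
hostname
</code>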

----

===== Interactive jobs: salloc =====
=== Option 1: ===

<code bash>
[jz@l31 somedirectory]$ salloc -N1
salloc: Pending job allocation 5115879
salloc: job 5115879 queued and waiting for resources
salloc: job 5115879 has been allocated resources
salloc: Granted job allocation 5115879
[jz@l31 somedirectory]$ squeue -u jz
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           5115879  mem_0064     bash       jz  R       0:18      1 n07-043
</code>

**NOTE:** the Slurm prolog script is **not** automatically run in this case. The prolog performs basic tasks such as:

  * cleaning up after the previous job,
  * checking basic node functionality,
  * adapting firewall settings for access to license servers.

----

===== Interactive jobs: salloc, cont. =====

To trigger the execution of the prolog you need to run an **srun** command, e.g.:

<code bash>
[jz@l31 somedirectory]$ srun hostname
</code>
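
''hostname'' is executed on the allocated node (here ''n07-043''), which causes the prolog to run on that node.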

Then access the node:

<code bash>
[jz@l31 somedirectory]$ ssh n07-043
Warning: Permanently added 'n07-043,10.141.7.43' (ECDSA) to the list of known hosts.
[jz@n07-043 ~]$
</code>

----

===== Interactive jobs: salloc, cont. =====

=== Option 2: ===

<code bash>
[jz@l31 somedirectory]$ salloc -N1 srun --pty --preserve-env $SHELL
salloc: Pending job allocation 5115908
salloc: job 5115908 queued and waiting for resources
salloc: job 5115908 has been allocated resources
salloc: Granted job allocation 5115908
[jz@n09-046 somedirectory]$
</code>
<code bash>
[jz@l31 somedirectory]$ salloc srun -N1 --pty --preserve-env $SHELL
salloc: Pending job allocation 5115909
salloc: job 5115909 queued and waiting for resources
salloc: job 5115909 has been allocated resources
salloc: Granted job allocation 5115909
[jz@n15-062 somedirectory]$
</code>
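
In the first variant the node count ''-N1'' is passed to ''salloc'', in the second to ''srun''; in both cases the interactive shell runs directly on an allocated compute node, as the changed prompt shows.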

----

===== No job on node =====

<code bash>
[jz@l31 somedirectory]$ squeue -u jz
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

[jz@l31 somedirectory]$ ssh n15-002
Warning: Permanently added 'n15-002,10.141.15.2' (ECDSA) to the list of known hosts.
Access denied: user jz (uid=70497) has no active jobs on this node.
Connection closed by 10.141.15.2
</code>
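
To see which nodes you may currently log in to, you can list the nodes of your running jobs, e.g.:

<code bash>
[jz@l31 somedirectory]$ squeue -u jz -o "%N"
</code>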

----

===== Exercise 1: interactive job =====

1.) Log in as user //training//:

<code bash>
[...]$ ssh training@vsc3.vsc.ac.at
</code>
or

<code bash>
[...]$ su - training
</code>
Create a directory and copy the example:

<code bash>
[training@l31]$ mkdir my_directory_name
[training@l31]$ cd my_directory_name
[training@l31 my_directory_name]$ cp -r ~/examples/06_node_access_job_control/linpack .
[training@l31 my_directory_name]$ ls
HPL.dat  job.sh  xhpl
</code>

----

===== Exercise 1: interactive job, cont. =====

2.) Allocate one node for an interactive session:

(give the job a useful name with the ''-J'' option)

<code bash>
[training@l31 linpack]$ salloc -J jz_hpl -N 1
salloc: Pending job allocation 5260456
salloc: job 5260456 queued and waiting for resources
salloc: job 5260456 has been allocated resources
salloc: Granted job allocation 5260456
[training@l31 linpack]$ squeue -u training
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           5260456  mem_0064   jz_hpl training  R       0:11      1 n41-001
</code>

3.) Run an srun command to trigger the prolog:

<code bash>
[training@l31 linpack]$ srun hostname
</code>
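
As before, this prints the hostname of the allocated node (here ''n41-001'') and triggers the prolog on it.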

----

===== Exercise 1: interactive job, cont. =====

4.) Log in to the allocated node and execute a program:

<code bash>
[training@l31 linpack]$ ssh n41-001
Last login: Thu Apr 20 15:38:33 2017 from l31.cm.cluster

[training@n41-001 ~]$ cd my_directory_name/linpack

[training@n41-001 linpack]$ module load intel/17 intel-mkl/2017 intel-mpi/2017
  Loading intel/17 from: /cm/shared/apps/intel/compilers_and_libraries_2017.2.174/linux
  Loading intel-mkl/2017 from: /cm/shared/apps/intel/compilers_and_libraries_2017.2.174/linux/mkl
  Loading intel-mpi/2017 from: /cm/shared/apps/intel/compilers_and_libraries_2017.2.174/linux/mpi

[training@n41-001 linpack]$ mpirun -np 16 ./xhpl

Number of Intel(R) Xeon Phi(TM) coprocessors : 0
================================================================================
HPLinpack 2.1  --  High-Performance Linpack benchmark  --   October 26, 2012
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
</code>

----

===== Exercise 1: interactive job, cont. =====

5.) Stop the process (Ctrl+C; Strg+C on German keyboards), log out of the node, and delete the job:

<code bash>
[training@n41-001 linpack]$ exit
[training@l31 linpack]$ exit
exit
salloc: Relinquishing job allocation 5260456
salloc: Job allocation 5260456 has been revoked.
</code>

----

===== Exercise 1: interactive job, summary =====

<code bash>
[training@l31 linpack]$ salloc -J jz_hpl -N 1
[training@l31 linpack]$ squeue -u training
[training@l31 linpack]$ srun hostname
[training@l31 linpack]$ ssh nXX-YYY
[training@nXX-YYY ~]$ module load intel/17 intel-mkl/2017 intel-mpi/2017
[training@nXX-YYY ~]$ cd my_directory_name/linpack
[training@nXX-YYY linpack]$ mpirun -np 16 ./xhpl
</code>
Stop the process (Ctrl+C; Strg+C on German keyboards):

<code bash>
[training@nXX-YYY linpack]$ exit
[training@l31 linpack]$ exit
</code>

----

===== Exercise 2: job script =====

1.) Now submit the same job with a job script:

<code bash>
[training@l31 my_directory_name]$ sbatch job.sh
[training@l31 my_directory_name]$ squeue -u training
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           5262833  mem_0064      hpl training  R       0:04      1 n41-001
</code>
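
The script ''job.sh'' is part of the copied example and is not reproduced here. Based on the job parameters reported by ''scontrol'' below (job name ''hpl'', one node, 16 tasks) and the interactive run in Exercise 1, it might look roughly like this:

<code bash>
#!/bin/bash
#SBATCH -J hpl    # job name (NAME column in squeue)
#SBATCH -N 1      # one node
#SBATCH -n 16     # 16 MPI tasks

# same environment and command as in the interactive run
module purge
module load intel/17 intel-mkl/2017 intel-mpi/2017
mpirun -np 16 ./xhpl
</code>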

2.) Log in to the node listed under NODELIST and run commands that show information about your job:

<code bash>
[training@l31 my_directory_name]$ ssh n41-001
Last login: Fri Apr 21 08:43:07 2017 from l31.cm.cluster
</code>

----

===== Exercise 2: job script, cont. =====

<code bash>
[training@n41-001 ~]$ top
</code>

<code bash>
top - 13:06:06 up 19 days,  2:57,  2 users,  load average: 12.42, 4.16, 1.51
Tasks: 442 total,  17 running, 425 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.1 us,  0.0 sy,  0.0 ni, 99.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 65940112 total, 62756072 free,  1586292 used,  1597748 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 62903424 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
12737 training  20   0 1915480  68580  18780 R 106.2  0.1   0:12.57 xhpl
12739 training  20   0 1915480  68372  18584 R 106.2  0.1   0:12.55 xhpl
12742 training  20   0 1856120  45412  13172 R 106.2  0.1   0:12.57 xhpl
12744 training  20   0 1856120  45412  13172 R 106.2  0.1   0:12.52 xhpl
12745 training  20   0 1856120  45464  13224 R 106.2  0.1   0:12.56 xhpl
...
</code>

Useful keys inside ''top'':

  * ''1'' ... show the load per CPU core
  * ''Shift+H'' ... show threads

----

===== Exercise 2: job script, cont. =====

<code bash>
[training@n41-001 ~]$ ps -U training u
</code>
<code bash>
[training@n41-001 ~]$ ps -U training f
</code>
<code bash>
[training@n41-001 ~]$ ps -U training e
</code>
<code bash>
[training@n41-001 ~]$ ps -U training eu
</code>
<code bash>
[training@n41-001 ~]$ ps -U training ef
</code>
<code bash>
[training@n41-001 ~]$ ps -U training euf
</code>
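
The trailing letters are BSD-style ''ps'' options and can be combined:

  * ''u'' ... user-oriented output format
  * ''f'' ... show the process hierarchy as a tree ("forest")
  * ''e'' ... show the environment after the command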

3.) Cancel the job (the job ID is shown by ''squeue''):

<code bash>
[training@n41-001 ~]$ scancel <job ID>
</code>

----

===== Exercise 3: interactive job with srun =====

<code bash>
[training@l31 linpack]$ salloc -J jz_hpl -N 1
</code>
<code bash>
[training@l31 linpack]$ squeue -u training
</code>
<code bash>
[training@l31 linpack]$ srun -n 2 hostname
</code>
<code bash>
[training@l31 linpack]$ module purge
</code>
<code bash>
[training@l31 linpack]$ module load intel/17 intel-mkl/2017 intel-mpi/2017
</code>
<code bash>
[training@l31 linpack]$ export I_MPI_PMI_LIBRARY=/cm/shared/apps/slurm/current/lib/libpmi.so
</code>
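
Setting ''I_MPI_PMI_LIBRARY'' points Intel MPI to Slurm's PMI library, so that the MPI processes can be started directly with **srun** instead of **mpirun**: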
<code bash>
[training@l31 linpack]$ srun -n 16 ./xhpl
</code>

Type Ctrl+C (Strg+C on German keyboards) to stop the job, then remove the job allocation:

<code bash>
[training@l31 linpack]$ exit
exit
salloc: Relinquishing job allocation 5264616
</code>

----

===== squeue options =====

<code bash>
[jz@l31 ~]$ squeue -u jz
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           5115915  mem_0064     bash       jz  R      34:50      1 n05-071

[jz@l31 ~]$ alias squeue='squeue -o "%.18i %.10Q  %.12q %.8j %.8u %8U %.2t %.10M %.6D %R" -u jz'
[jz@l31 ~]$ squeue
             JOBID   PRIORITY           QOS     NAME     USER USER     ST       TIME  NODES NODELIST(REASON)
           5115915      30209         admin     bash       jz 70497          33:41      1 n05-071
</code>

  * Formatting option: ''-o <output_format>'', ''--format=<output_format>''
  * The format of each field is ''%%%[[.]size]type%%''.
  * ''.'' means right justification.
  * ''size'' is the field width.
  * ''type'' is the field type (JOBID, PRIORITY, etc.).

<code bash>
[jz@l31 ~]$ man squeue
</code>

----

===== scontrol =====

<code bash>
[training@l31 linpack]$ scontrol show job 5264675
JobId=5264675 JobName=hpl
   UserId=training(72127) GroupId=p70824(70824) MCS_label=N/A
   Priority=1861 Nice=0 Account=p70824 QOS=normal_0064
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:36 TimeLimit=3-00:00:00 TimeMin=N/A
   SubmitTime=2017-04-21T11:30:18 EligibleTime=2017-04-21T11:30:18
   StartTime=2017-04-21T11:30:49 EndTime=2017-04-24T11:30:56 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=mem_0064 AllocNode:Sid=l31:28554
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=n41-001
   BatchHost=n41-001
   NumNodes=1 NumCPUs=32 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=32,node=1,gres/cpu_mem_0064=32
   Socks/Node=* NtasksPerN:B:S:C=16:0:*:* CoreSpec=*
   MinCPUsNode=16 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=cpu_mem_0064:32 Reservation=(null)
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/lv70824/training/Examples/06_node_access_job_control/linpack/job.sh
   WorkDir=/home/lv70824/training/Examples/06_node_access_job_control/linpack
   StdErr=/home/lv70824/training/Examples/06_node_access_job_control/linpack/slurm-5264675.out
   StdIn=/dev/null
   StdOut=/home/lv70824/training/Examples/06_node_access_job_control/linpack/slurm-5264675.out
   Power=
</code>
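
Besides inspecting jobs, ''scontrol'' can also modify them. A few common job-control commands (with a placeholder job ID; increasing limits may require administrator privileges):

<code bash>
[training@l31 linpack]$ scontrol hold <job ID>       # prevent a pending job from starting
[training@l31 linpack]$ scontrol release <job ID>    # release a held job
[training@l31 linpack]$ scontrol update JobId=<job ID> TimeLimit=1-00:00:00   # modify job parameters
</code>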

----

===== Accounting: sacct =====

Default output:

<code bash>
[jz@l31 ~]$ sacct -j 5115879
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
5115879            bash   mem_0064   sysadmin         32  COMPLETED      0:0
</code>

Specifying output fields:

<code bash>
[jz@l31 ~]$ sacct -j 5115879 -o jobid,jobname,cluster,nodelist,Start,End,cputime,cputimeraw,ncpus,qos,account,ExitCode
       JobID    JobName    Cluster        NodeList               Start                 End    CPUTime CPUTimeRAW      NCPUS        QOS    Account ExitCode
------------ ---------- ---------- --------------- ------------------- ------------------- ---------- ---------- ---------- ---------- ---------- --------
5115879            bash       vsc3         n07-043 2017-03-15T14:31:39 2017-03-15T14:36:43   02:42:08       9728         32      admin   sysadmin      0:0
</code>
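
Note: ''CPUTime'' is the elapsed run time multiplied by the number of CPUs; here the job ran for 304 seconds (14:31:39 to 14:36:43) on 32 CPUs, i.e. 9728 CPU-seconds, which ''CPUTimeRAW'' reports directly and ''CPUTime'' shows as 02:42:08.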

Inspect the man page for more options:

<code bash>
[jz@l31 ~]$ man sacct
</code>

----

===== Accounting: vsc3CoreHours.py =====

<code bash>
[jz@l31 ~]$ vsc3CoreHours.py -h
usage: vsc3CoreHours.py [-h] [-S STARTTIME] [-E ENDTIME] [-D DURATION]
                        [-A ACCOUNTLIST] [-u USERNAMES] [-uni UNI]
                        [-d DETAIL_LEVEL] [-keys KEYS]

getting cpu usage - start - end time

optional arguments:
  -h, --help       show this help message and exit
  -S STARTTIME     start time, e.g., 2015-04[-01[T10[:04[:01]]]]
  -E ENDTIME       end time, e.g., 2015-04[-01[T10[:04[:01]]]]
  -D DURATION      duration, display the last D days
  -A ACCOUNTLIST   give comma separated list of projects for which the
                   accounting data is calculated, e.g., p70yyy,p70xxx.
                   Default: primary project.
  -u USERNAMES     give comma separated list of usernames
  -uni UNI         get usage statistics for one university
  -d DETAIL_LEVEL  set detail level; default=0
  -keys KEYS       show data from qos or user perspective, use either qos or
                   user; default=user
</code>

----

===== Accounting: vsc3CoreHours.py, cont. =====

<code bash>
[jz@l31 ~]$ vsc3CoreHours.py -S 2017-01-01 -E 2017-03-31 -u jz
===================================================
Accounting data time range
From: 2017-01-01 00:00:00
To:   2017-03-31 00:00:00
===================================================
Getting accounting information for the following account/user combinations:
account:   sysadmin users: jz
getting data, excluding these qos: goodluck, gpu_vis, gpu_compute
===============================
             account     core_h
_______________________________
            sysadmin   56844.87
_______________________________
               total   56844.87
</code>
----
  