  * Article written by Jan Zabloudil (VSC Team) <html><br></html>(last update 2017-10-09 by jz).
  
  
  
----
  
  
====== Node access ======

{{..:folie_12_1_connect.png?0x600}}
  
  
----
  
====== Node access ======

{{..:folie_12_2_connect.png?0x600}}
  
  
----
  
====== Node access ======

{{..:folie_14_salloc.png?0x600}}
  
  
----
  
====== 1) Job scripts: sbatch ======

(DEMO)
  
<code bash>
[me@l31]$ sbatch job.sh     # submit the batch job
[me@l31]$ squeue -u me      # find out on which node(s) the job is running
[me@l31]$ ssh <nodename>    # connect to one of these nodes
[me@n320-038]$ ...          # now logged in on node n320-038
</code>
more about this from Claudia
  
  
----
  
====== 2) Interactive jobs: salloc ======

with salloc, EXAMPLE: linpack (DEMO)
  
<code bash>
# example: ~/examples/06_node_access_job_control/linpack
[me@l31 linpack]$ salloc -J me_hpl -N 1   # allocate one node
[me@l31 linpack]$ squeue -u me            # find out which node(s) are allocated
[me@l31 linpack]$ srun hostname           # trigger the prolog: cleanup, node check, licence access
[me@l31 linpack]$ ssh n3XX-YYY            # connect to node n3XX-YYY
[me@n3XX-YYY ~]$ module purge
[me@n3XX-YYY ~]$ module load intel/17 intel-mkl/2017 intel-mpi/2017
[me@n3XX-YYY ~]$ cd my_directory_name/linpack
[me@n3XX-YYY linpack]$ mpirun -np 16 ./xhpl
</code>
stop the process (Ctrl+C; Strg+C on German keyboards)

<code bash>
[me@n3XX-YYY linpack]$ exit
[me@l31 linpack]$ exit
</code>
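A variant that skips the separate ''ssh'' step: ''salloc'' can start the interactive shell directly on the first allocated node via **srun**, which also triggers the prolog (illustrative session):

<code bash>
[me@l31 linpack]$ salloc -N1 srun --pty --preserve-env $SHELL
salloc: Granted job allocation 5115908
[me@n09-046 linpack]$
</code>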
----
  
====== Interactive jobs: salloc ======

===== Notes =====

<code bash>
[me@l31]$ srun hostname
</code>
**NOTE:** the slurm prolog script is **not** automatically run when an allocation is obtained with plain ''salloc''; an **srun** command (e.g. ''srun hostname'' as above) triggers it. The prolog performs basic tasks such as

  * cleaning up after the previous job,
  * checking basic node functionality,
  * adapting firewall settings for access to license servers.
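Leaving the interactive session afterwards releases the allocation; the second ''exit'' is acknowledged by Slurm (job ID here is from one illustrative run):

<code bash>
[me@n3XX-YYY ~]$ exit     # leave the compute node
[me@l31 ~]$ exit          # give up the allocation
exit
salloc: Relinquishing job allocation 5260456
salloc: Job allocation 5260456 has been revoked.
</code>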
  
  
----

<code bash>
[training@l31 my_directory_name]$ squeue -u training
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           5262833  mem_0064      hpl training  R       0:04      1 n341-001
</code>
  
  
<code bash>
[training@l31 my_directory_name]$ ssh n341-001
Last login: Fri Apr 21 08:43:07 2017 from l31.cm.cluster
</code>
  
<code bash>
[training@n341-001 ~]$ top
</code>
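If the display is cluttered, standard ''top'' can be restricted to a single user with ''-u'':

<code bash>
[training@n341-001 ~]$ top -u training   # show only processes owned by user training
</code>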
  
  
<code bash>
[training@n341-001 ~]$ exit
[training@l31 ~]$ scancel <job ID>
</code>
<code bash>
ERROR_GPFS_UNMOUNT=230
</code>
  * node will be //drained// (unavailable for subsequent jobs until fixed)
  * resubmit your job
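Which nodes are currently drained, and why, can be listed with standard Slurm tooling:

<code bash>
VSC-4 >  sinfo -R    # list down/drained nodes together with the recorded reason
</code>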
  
<code bash>
VSC-4 >  squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            409879  mem_0096 training       sh  R    1:10:59      3 n407-030,n411-[007-008]

VSC-4 >  squeue -p mem_0096 | more
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             ..........................................................................
            407141  mem_0096 interfac   mjech3 PD       0:00      4 (Priority)
      409880_[1-6]  mem_0096 OP1_2_Fi wiesmann PD       0:00      1 (Resources,Priority)
            409879  mem_0096 training       sh  R    1:30:16      3 n407-030,n411-[007-008]
            402078  mem_0096      sim mboesenh  R 4-22:46:13      1 n403-007
</code>
  
  * optional reformatting via ''-o'', for example:
  
<code bash>
VSC-4 >  squeue -u $USER -o "%.18i %.10Q  %.12q %.8j %.8u %8U %.2t %.10M %.6D %R"
             JOBID   PRIORITY           QOS     NAME     USER UID      ST       TIME  NODES NODELIST(REASON)
            409879    3000982         admin training       sh 71177     R    1:34:55      3 n407-030,n411-[007-008]
</code>
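Each field in the format string has the form ''%[.][size]type'': the dot requests right-justification, ''size'' is the column width, and ''type'' picks the value (JOBID, PRIORITY, etc.; see ''man squeue''). To make the extended view the default, wrap it in a shell alias:

<code bash>
VSC-4 >  alias squeue='squeue -o "%.18i %.10Q  %.12q %.8j %.8u %8U %.2t %.10M %.6D %R" -u $USER'
VSC-4 >  squeue          # now always shows the extended columns for your own jobs
</code>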
  
  
  
----
  
<code bash>
VSC-4 >  scontrol show job 409879
JobId=409879 JobName=training
   UserId=sh(71177) GroupId=sysadmin(60000) MCS_label=N/A
   Priority=3000982 Nice=0 Account=sysadmin QOS=admin
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=01:36:35 TimeLimit=10-00:00:00 TimeMin=N/A
   SubmitTime=2020-10-12T15:50:44 EligibleTime=2020-10-12T15:50:44
   AccrueTime=2020-10-12T15:50:44
   StartTime=2020-10-12T15:51:07 EndTime=2020-10-22T15:51:07 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2020-10-12T15:51:07
   Partition=mem_0096 AllocNode:Sid=l40:186217
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=n407-030,n411-[007-008]
   BatchHost=n407-030
   NumNodes=3 NumCPUs=288 NumTasks=3 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=288,mem=288888M,node=3,billing=288,gres/cpu_mem_0096=288
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=96296M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/fs60000/sh
   Power=
   TresPerNode=cpu_mem_0096:96
</code>
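The output is verbose; individual fields can be fished out with ''grep'', e.g.:

<code bash>
VSC-4 >  scontrol show job 409879 | grep NodeList   # prints the lines containing NodeList
</code>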
  
----
  
<code bash>
VSC-4 >  sacct -j 409878
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
409878         training   mem_0096   sysadmin        288  COMPLETED      0:0
</code>
  
adjust the output format:
  
<code bash>
VSC-4 >  sacct -j 409878 -o jobid,jobname,cluster,nodelist,Start,End,cputime,cputimeraw,ncpus,qos,account,ExitCode
       JobID    JobName    Cluster        NodeList               Start                 End    CPUTime CPUTimeRAW      NCPUS        QOS    Account ExitCode
------------ ---------- ---------- --------------- ------------------- ------------------- ---------- ---------- ---------- ---------- ---------- --------
409878         training       vsc4 n407-030,n411-+ 2020-10-12T15:49:42 2020-10-12T15:51:06   06:43:12      24192        288      admin   sysadmin      0:0
</code>
  
<code bash>
VSC-3 >  vsc3CoreHours.py -h
usage: vsc3CoreHours.py [-h] [-S STARTTIME] [-E ENDTIME] [-D DURATION]
                        [-A ACCOUNTLIST] [-u USERNAMES] [-uni UNI]
                        ...
</code>
  
<code bash>
VSC-3 >  vsc3CoreHours.py -S 2020-01-01 -E 2020-10-01 -u sh
===================================================
Accounting data time range
From: 2020-01-01 00:00:00
To:   2020-10-01 00:00:00
===================================================
Getting accounting information for the following account/user combinations:
account:   sysadmin users: sh
getting data, excluding these qos: goodluck, gpu_vis, gpu_compute
===============================
             account     core_h
_______________________________
            sysadmin   25775.15
_______________________________
               total   25775.15
</code>
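Via ''-A'' (see the usage above) the query can also be run for whole project accounts rather than single users; the account name below is only an illustration:

<code bash>
VSC-3 >  vsc3CoreHours.py -S 2020-01-01 -E 2020-10-01 -A p70824
</code>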
  
  
----
  