# Monitoring Processes & Threads

## CPU Load
  
There are several ways to monitor the CPU load distribution of your
job's threads: either live, directly on the compute node, or by
modifying the job script or the application code.
  
### Live

So we assume your program runs, but could it be faster? SLURM gives you
a `Job ID`; type `squeue --job myjobid` to find out which node your job
runs on, say n372-007. Type `ssh n372-007` to connect to that node,
then type `top` to start a simple task manager:

```
[myuser@l32]$ sbatch job.sh
[myuser@l32]$ squeue -u myuser
JOBID    PARTITION  NAME      USER    ST  TIME  NODES  NODELIST(REASON)
1098917  mem_0096   gmx_mpi   myuser  R   0:02   1     n372-007
[myuser@l32]$ ssh n372-007
[myuser@n372-007]$ top
```

Within `top`, hit the following keys (case sensitive): `H t 1`. Now you
should be able to see the load on all the available CPUs, for example:
```
top - 16:31:51 up 181 days,  1:04,  3 users,  load average: 1.67, 3.39, 3.61
Threads: 239 total,   2 running, 237 sleeping,   0 stopped,   0 zombie
%Cpu0  :  69.8/29.2   99[|||||||||||||||||||||||||||||||||||||||||||||||| ]
%Cpu1  :  97.0/2.3    99[|||||||||||||||||||||||||||||||||||||||||||||||| ]
%Cpu2  :  98.7/0.7    99[|||||||||||||||||||||||||||||||||||||||||||||||| ]
%Cpu3  :  95.7/4.0   100[|||||||||||||||||||||||||||||||||||||||||||||||| ]
%Cpu4  :  99.0/0.3    99[|||||||||||||||||||||||||||||||||||||||||||||||| ]
%Cpu5  :  98.7/0.3    99[|||||||||||||||||||||||||||||||||||||||||||||||| ]
%Cpu6  :  99.3/0.0    99[|||||||||||||||||||||||||||||||||||||||||||||||||]
%Cpu7  :  99.0/0.0    99[|||||||||||||||||||||||||||||||||||||||||||||||| ]
KiB Mem : 65861076 total, 60442504 free,  1039244 used,  4379328 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 62613824 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
18876 myuser    20   0 9950.2m 303908 156512 S 99.3  0.5   0:11.14 gmx_mpi
18856 myuser    20   0 9950.2m 303908 156512 S 99.0  0.5   0:12.28 gmx_mpi
18870 myuser    20   0 9950.2m 303908 156512 R 99.0  0.5   0:11.20 gmx_mpi
18874 myuser    20   0 9950.2m 303908 156512 S 99.0  0.5   0:11.25 gmx_mpi
18872 myuser    20   0 9950.2m 303908 156512 S 98.7  0.5   0:11.19 gmx_mpi
18873 myuser    20   0 9950.2m 303908 156512 S 98.7  0.5   0:11.15 gmx_mpi
18871 myuser    20   0 9950.2m 303908 156512 S 96.3  0.5   0:11.09 gmx_mpi
18875 myuser    20   0 9950.2m 303908 156512 S 95.7  0.5   0:11.02 gmx_mpi
18810 root      20   0       0      0      0 S  6.6  0.0   0:00.70 nv_queue
...
```
In our example all 8 threads are utilised, which is good. The opposite
is not necessarily true, however: sometimes even the best case only
uses 40% on most CPUs!

The columns `VIRT` and `RES` indicate the *virtual* and *resident*
memory usage of each process, respectively (in kB unless noted
otherwise). The column `COMMAND` lists the name of the command or
application.

In the following screenshot we can see stats for all 32 threads of a
compute node running `VASP`:

{{ :doku:top_vasp_2.png }}
### Job Script

If you are using `Intel-MPI` you might include this option in your
batch script:

```
export I_MPI_DEBUG=4
```

With this debug level, Intel MPI prints process pinning information at
startup, i.e. which MPI ranks are placed on which cores and nodes.
### Application Code

If your application code is in `C`, information about the locality of
processes and threads can be obtained via library functions, using
either of the following libraries:

#### mpi.h
```
#include "mpi.h"
...
MPI_Get_processor_name(processor_name, &namelen);
```
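
For reference, a minimal self-contained program built around this call
could look as follows (the file name, variable names and output format
are illustrative additions, not taken from the original example):

```
#include <stdio.h>
#include <mpi.h>

/* Each MPI rank prints the node (processor name) it runs on. */
int main(int argc, char **argv)
{
    int rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);

    printf("rank %d runs on %s\n", rank, processor_name);

    MPI_Finalize();
    return 0;
}

//  compile (example): mpiicc -o whereAmI whereAmI.c
```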
#### sched.h (scheduling parameters)
```
#include <sched.h>
...
CPU_ID = sched_getcpu();
```
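
As an illustration of how this call is typically used (the OpenMP
wrapper, file name and output format are illustrative additions, not
part of the original snippet), each thread can report the core it is
currently running on; note that `sched_getcpu()` is a GNU extension,
hence the `_GNU_SOURCE` define:

```
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <omp.h>

/* Each OpenMP thread reports the CPU it is currently scheduled on. */
int main(void)
{
    #pragma omp parallel
    {
        int cpu_id = sched_getcpu();
        printf("thread %d runs on CPU %d\n", omp_get_thread_num(), cpu_id);
    }
    return 0;
}

//  compile (example): gcc -fopenmp -o threadIds threadIds.c
```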
#### hwloc.h (Hardware locality)
```
#include <hwloc.h>
...

//  compile: mpiicc -qopenmp -o ompMpiCoreIds ompMpiCoreIds.c -lhwloc
```
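
The `ompMpiCoreIds.c` listing above is abbreviated; as a rough sketch
of the hwloc calls involved (not the full original program), the
processing unit a thread last ran on can be queried like this:

```
#include <stdio.h>
#include <hwloc.h>

/* Query the processing unit (PU) the calling thread last ran on. */
int main(void)
{
    hwloc_topology_t topology;
    hwloc_bitmap_t set = hwloc_bitmap_alloc();

    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);

    hwloc_get_last_cpu_location(topology, set, HWLOC_CPUBIND_THREAD);
    printf("last ran on PU %d\n", hwloc_bitmap_first(set));

    hwloc_bitmap_free(set);
    hwloc_topology_destroy(topology);
    return 0;
}

//  compile (example): gcc -o cpuLocation cpuLocation.c -lhwloc
```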
  