====== Monitoring Processes & Threads ======
  
===== CPU Load =====
  
There are several ways to monitor the CPU load distribution of your job's
threads: either [[doku:monitoring#Live|live]] directly on the compute node, or by
modifying the [[doku:monitoring#Job Script|job script]] or the [[doku:monitoring#Application Code|application code]].
  
==== Live ====
  
So we assume your program runs, but could it be faster? [[doku:SLURM]] gives
you a ''Job ID''; type ''squeue --job myjobid'' to find out on which node your
job runs, say n4905-007. Type ''ssh n4905-007'' to connect to that node, then
type ''top'' to start a simple task manager:
  
<code sh>
[myuser@l42]$ sbatch job.sh
[myuser@l42]$ squeue -u myuser
JOBID    PARTITION      NAME      USER    ST  TIME  NODES  NODELIST(REASON)
1098917  skylake_0096   gmx_mpi   myuser  R   0:02  1      n4905-007
[myuser@l42]$ ssh n4905-007
[myuser@n4905-007]$ top
</code>
  
Within ''top'', hit the following keys (case sensitive): ''H t 1''. Now you
should be able to see the load on all the available CPUs, for example:
  
<code>
top - 16:31:51 up 181 days,  1:04,  3 users,  load average: 1.67, 3.39, 3.61
Threads: 239 total,   2 running, 237 sleeping,   0 stopped,   0 zombie
...
18810 root      20              0      0 S  6.6  0.0   0:00.70 nv_queue
...
</code>
  
In our example all 8 threads are utilised, which is good. The opposite
would be a job that keeps only a single thread busy while blocking
most CPUs!
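If you prefer a one-shot, scriptable check over the interactive ''top'', the per-thread load can also be listed with ''ps'' (a sketch using standard procps options; run it on the compute node):

<code sh>
# -eL lists every thread of every process;
# the psr column shows the CPU a thread last ran on
ps -eL -o pid,tid,psr,pcpu,comm | head
</code>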
  
The columns ''VIRT'' and ''RES'' indicate the //virtual// and //resident//
memory usage of each process, respectively (in kB unless noted otherwise).
The column ''COMMAND'' lists the name of the command or
application.
  
In the following screenshot we can see stats for all 32 threads of a compute node running [[doku:VASP]]:
  
{{ :doku:top_vasp_2.png }}
  
  
==== Job Script ====

If you are using ''Intel-MPI'' you might include this option in your batch script:

  I_MPI_DEBUG=4
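For instance, a minimal batch script using this option could look as follows (a sketch; job name, resources and the binary name ''./my_program'' are assumptions):

<code sh>
#!/bin/bash
#SBATCH --job-name=mpi_debug
#SBATCH --nodes=1

# I_MPI_DEBUG=4 makes Intel-MPI print process pinning
# information when the job starts
export I_MPI_DEBUG=4

mpirun ./my_program
</code>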
  
==== Application Code ====
  
If your application code is in ''C'', information about the locality of
processes and threads can be obtained via library functions from any of
the following libraries:
  
=== mpi.h ===

<code cpp>
#include "mpi.h"
...  MPI_Get_processor_name(processor_name, &namelen);
</code>
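A complete, minimal program around this call might look as follows (a sketch; compile with your MPI wrapper, e.g. ''mpiicc'' or ''mpicc''):

<code c>
#include "mpi.h"
#include <stdio.h>

int main(int argc, char **argv)
{
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int rank, namelen;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* fills processor_name with the host name this rank runs on */
    MPI_Get_processor_name(processor_name, &namelen);
    printf("rank %d runs on %s\n", rank, processor_name);

    MPI_Finalize();
    return 0;
}
</code>

Running it with e.g. ''mpirun -np 4 ./a.out'' shows which node each rank landed on.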
  
=== sched.h (scheduling parameters) ===

<code cpp>
#include <sched.h>
...  CPU_ID = sched_getcpu();
</code>
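For example, a self-contained sketch around ''sched_getcpu()'' (the function is a glibc extension, hence the ''_GNU_SOURCE'' define):

<code c>
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    /* CPU the calling thread currently runs on, -1 on error */
    int cpu_id = sched_getcpu();
    printf("running on CPU %d\n", cpu_id);
    return 0;
}
</code>

In an OpenMP region, each thread can make the same call to log its own core.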
=== hwloc.h (Hardware locality) ===

<code cpp>
#include <hwloc.h>
...
  
//  compile: mpiicc -qopenmp -o ompMpiCoreIds ompMpiCoreIds.c -lhwloc
</code>
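A minimal stand-alone hwloc sketch (an assumption, not the full ''ompMpiCoreIds.c''; link with ''-lhwloc'') that just counts the cores the library detects:

<code c>
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topology;

    hwloc_topology_init(&topology);  /* allocate the topology object */
    hwloc_topology_load(topology);   /* detect the actual machine */

    printf("cores detected: %d\n",
           hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_CORE));

    hwloc_topology_destroy(topology);
    return 0;
}
</code>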
  
===== GPU Load =====
  
We assume your program uses a GPU and runs as expected, so could it be
faster? On the same node where your job runs (see the CPU load section),
maybe in a new terminal, type ''watch nvidia-smi'' to start a simple task
manager for the graphics card. ''watch'' just repeats a command every
2 seconds and thus acts as a live monitor for the GPU. In our example
below the GPU utilisation is around 80% most of the time, which is very
good already.
  
<code>
Every 2.0s: nvidia-smi                                 Wed Jun 22 16:42:52 2022
Wed Jun 22 16:42:52 2022
...
|    0   N/A  N/A     21045      C   gmx_mpi                           159MiB |
+-----------------------------------------------------------------------------+
</code>
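If you only need the utilisation numbers, e.g. for logging, ''nvidia-smi'' can also print selected fields repeatedly (a sketch; requires a node with a GPU):

<code sh>
# print GPU and memory utilisation as CSV, repeating every 2 seconds
nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv -l 2
</code>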
  
doku/monitoring.1655973032.txt.gz · Last modified: 2022/06/23 08:30 by msiegel