# Monitoring Processes & Threads

## CPU Load
There are several ways to monitor the CPU load distribution over the
threads of your job: either live, via the job script, or from within
the application code.
### Live

So we assume your program runs, but could it be faster? SLURM gives you
a `Job ID`; type `squeue --job myjobid` to find out on which node your
job runs, say n4905-007. Type `ssh n4905-007` to connect to the given
node, then type `top` to start a simple task manager:
```
[myuser@l42]$ sbatch job.sh
[myuser@l42]$ squeue -u myuser
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   1098917 ...
[myuser@l42]$ ssh n4905-007
[myuser@n4905-007]$ top
```
Within `top`, press `1` (per-CPU view) and `H` (threads view); now you
should be able to see the load on all the available CPUs, as an
example:
```
top - 16:31:51 up 181 days,  1:04,  3 users, ...
Threads: 239 total, ...
...
18810 root      20 ...
...
```
In our example all 8 threads are utilised, which is good. The opposite
would be a high load on only one or two cores and hardly any load on
most CPUs!
The columns `VIRT` and `RES` show the *virtual* and *resident* memory
usage of each process (unless noted otherwise in kB). The column
`COMMAND` shows the name of the application.
In the following screenshot we can see stats for all 32 threads of a
compute node running a job:

*(screenshot: `top` output showing the load of all 32 threads of a compute node)*
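If you only want a quick, one-off snapshot instead of the interactive
`top` view, `ps` can list every thread of your program together with
the core it was last running on (a minimal sketch; `my_program` is a
placeholder for your executable's name):

```
# one line per thread: thread ID, last used core (PSR), CPU share, command
ps -C my_program -L -o tid,psr,pcpu,comm
```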

### Job Script

If you are using `Intel MPI` you might include this option in your
batch script:

```
I_MPI_DEBUG=4
```
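A minimal sketch of how this could look in a Slurm batch script (the
`#SBATCH` values and the program name are placeholders, not recommended
settings); with `I_MPI_DEBUG=4`, Intel MPI reports, among other things,
how the ranks are pinned to nodes and cores at startup:

```
#!/bin/bash
#SBATCH --job-name=pinning_check   # placeholder job name
#SBATCH --ntasks=8                 # placeholder task count

# make Intel MPI print rank placement / pinning info to the job output
export I_MPI_DEBUG=4

mpirun ./my_program                # my_program is a placeholder
```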

### Application Code

If your application code is in `C`, information about the locality of
processes and threads can be obtained via library functions using any
of the following libraries:

#### mpi.h

```
#include "mpi.h"
... MPI_Get_processor_name(processor_name, &name_len);
```

#### sched.h (scheduling parameters)

```
#include <sched.h>
... CPU_ID = sched_getcpu();
```

#### hwloc.h (Hardware locality)

```
#include <hwloc.h>
...
// compile: mpiicc -qopenmp -o ompMpiCoreIds ompMpiCoreIds.c -lhwloc
```
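As a minimal, self-contained sketch of such a locality check (this is
not the `ompMpiCoreIds` source referenced above; the file name, compile
line, and launch line are assumptions), the following hybrid MPI/OpenMP
program combines `MPI_Get_processor_name()` and `sched_getcpu()` to
print which node and core every thread ends up on:

```
// coreIds.c -- print node and core ID for every MPI rank / OpenMP thread
// compile (assumption): mpiicc -qopenmp -o coreIds coreIds.c
// run     (assumption): mpirun -np 2 ./coreIds
#define _GNU_SOURCE
#include <mpi.h>
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);  // node this rank runs on

    #pragma omp parallel
    {
        int thread = omp_get_thread_num();
        int cpu_id = sched_getcpu();                     // core this thread runs on right now
        printf("node %s  rank %d  thread %d  core %d\n",
               processor_name, rank, thread, cpu_id);
    }

    MPI_Finalize();
    return 0;
}
```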
+ | |||
+ | ===== GPU Load ===== | ||
+ | |||

We assume your program uses a GPU and that it runs as expected, so
could it be faster? On the same node where your job runs (see the CPU
load section above), maybe in a new terminal, type `watch nvidia-smi`
to start a simple task manager for the graphics card. `watch` repeats
a command every 2 seconds and acts as a live monitor for the GPU. In
our example below the GPU utilisation is around 80% most of the time,
which is already very good.
+ | |||
+ | < | ||
+ | Every 2.0s: nvidia-smi | ||
+ | Wed Jun 22 16:42:52 2022 | ||
+ | +-----------------------------------------------------------------------------+ | ||
+ | | NVIDIA-SMI 460.32.03 | ||
+ | |-------------------------------+----------------------+----------------------+ | ||
+ | | GPU Name Persistence-M| Bus-Id | ||
+ | | Fan Temp Perf Pwr: | ||
+ | | | ||
+ | |===============================+======================+======================| | ||
+ | | | ||
+ | | 36% | ||
+ | | | ||
+ | +-------------------------------+----------------------+----------------------+ | ||
+ | |||
+ | +-----------------------------------------------------------------------------+ | ||
+ | | Processes: | ||
+ | | GPU | ||
+ | | ID | ||
+ | |=============================================================================| | ||
+ | | 0 | ||
+ | +-----------------------------------------------------------------------------+ | ||
+ | </ | ||
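If you prefer plain numbers over the full table, `nvidia-smi` can also
loop on its own and print only selected values (a sketch; the fields
and the 2 second interval are just one possible choice):

```
# print time, GPU and memory utilisation, and used memory every 2 seconds
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory,memory.used --format=csv -l 2
```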