====== Monitoring ======

===== CPU Load =====

There are several ways to monitor the CPU load distribution of the
threads of your job: either [[doku:monitoring#live|live]] on the
compute node, or via the [[doku:monitoring#job_script|job script]] or
the [[doku:monitoring#application_code|application code]].

==== Live ====

So we assume your program runs, but could it be faster? Submit your
job with ''sbatch job.sh'' and find out with ''squeue'' on which node the
job runs; say n4905-007. Type ''ssh n4905-007'' to connect to that
node. Type ''top'' to start a simple task manager:
<code>
[myuser@l42]$ sbatch job.sh
[myuser@l42]$ squeue -u myuser
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   1098917       ...      ...   myuser  R        ...      1 n4905-007
[myuser@l42]$ ssh n4905-007
[myuser@n4905-007]$ top
</code>
Within ''top'', press the key ''1'' to show the load of each individual
CPU and ''H'' to list every thread separately; you should then be able
to see the load on all the available CPUs, as an example:
<code>
top - 16:31:51 up 181 days,  1:04,  3 users,  ...
Threads: 239 total, ...
%Cpu0  : 69.8/...
%Cpu1  : 97.0/...
%Cpu2  : 98.7/...
%Cpu3  : 95.7/...
%Cpu4  : 99.0/...
%Cpu5  : 98.7/...
%Cpu6  : 99.3/...
%Cpu7  : 99.0/...
KiB Mem : 65861076 total, 60442504 free,  1039244 used,  4379328 buff/cache
KiB Swap:        0 total, ...

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
18876 myuser    ...
18856 myuser    ...
18870 myuser    ...
18874 myuser    ...
18872 myuser    ...
18873 myuser    ...
18871 myuser    ...
18875 myuser    ...
18810 root      20 ...
  ...
</code>

In our example all 8 threads are utilised, which is good. The opposite
is not necessarily true, however: sometimes even the best possible case
still uses only 40% on most CPUs!

The columns ''VIRT'' and ''RES'' indicate the //virtual// and //resident//
memory usage of each process (by default in kB). The column ''COMMAND''
lists the name of the application.
+ | |||
+ | In the following screenshot we can see stats for all 32 threads of a compute node running '' | ||
+ | |||
+ | {{ : | ||
+ | |||
+ | |||
==== Job Script ====

If you are using ''Intel MPI'', you can set the following environment
variable in your job script; at debug level 4 the MPI library prints,
among other things, the pinning of each process at startup:

<code>
I_MPI_DEBUG=4
</code>
==== Application Code ====

If your application code is in ''C/C++'', information about where
processes and threads run can be obtained via library functions using
either of the following libraries:

=== mpi.h ===

<code cpp>
#include "mpi.h"
...
MPI_Get_processor_name(processor_name, &name_len);
</code>
+ | |||
+ | === sched.h (scheduling parameters) === | ||
+ | |||
+ | < | ||
#include < | #include < | ||
- | ... | + | ... CPU_ID = sched_getcpu(); |
- | CPU_ID = sched_getcpu(); | + | |
</ | </ | ||

=== hwloc.h (Hardware locality) ===

<code cpp>
#include <hwloc.h>
...
// compile: mpiicc -qopenmp -o ompMpiCoreIds ompMpiCoreIds.c -lhwloc
</code>
+ | |||
+ | ===== GPU Load ===== | ||
+ | |||
+ | We assume you program uses a GPU, and your program runs as expected, | ||
+ | so could it be faster? On the same node where your job runs (see CPU | ||
+ | load section), maybe in a new terminal, type '' | ||
+ | start a simple task manager for the graphics card. '' | ||
+ | repeats a command every 2 seconds, acts as a live monitor for the | ||
+ | GPU. In our example below the GPU utilisation is around 80% the most | ||
+ | time, which is very good already. | ||
+ | |||
+ | < | ||
+ | Every 2.0s: nvidia-smi | ||
+ | Wed Jun 22 16:42:52 2022 | ||
+ | +-----------------------------------------------------------------------------+ | ||
+ | | NVIDIA-SMI 460.32.03 | ||
+ | |-------------------------------+----------------------+----------------------+ | ||
+ | | GPU Name Persistence-M| Bus-Id | ||
+ | | Fan Temp Perf Pwr: | ||
+ | | | ||
+ | |===============================+======================+======================| | ||
+ | | | ||
+ | | 36% | ||
+ | | | ||
+ | +-------------------------------+----------------------+----------------------+ | ||
+ | |||
+ | +-----------------------------------------------------------------------------+ | ||
+ | | Processes: | ||
+ | | GPU | ||
+ | | ID | ||
+ | |=============================================================================| | ||
+ | | 0 | ||
+ | +-----------------------------------------------------------------------------+ | ||
</ | </ | ||