Differences
This shows you the differences between two versions of the page.
doku:monitoring [2017/03/17 14:09] – ir | doku:monitoring [2022/06/22 16:53] – rewrite in markdown msiegel | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== Monitor where threads/ | + | # Monitoring Processes & Threads |
- | There are several ways to monitor your job, either in live time directly on the compute node or by modifying the job script or the application code: | + | ## CPU Load |
- | * < | + | |
- | {{ : | + | There are several ways to monitor how the CPU load of your job is |
- | < | + | distributed across its threads: live, directly on the compute node, by |
- | [xy@l32]$ sbatch | + | modifying the job script, or by modifying the application code. |
- | [xy@l32]$ squeue -u xy | + | |
- | JOBID PARTITION | + | |
- | 5066692 | + | |
- | [xy@l32]$ ssh n09-005 | + | |
- | [xy@n09-005]$ top | + | |
- | </ | + | |
- | When typing < | + | |
- | The columns VIRT and RES indicate the virtual, resident memory usage of each process (per default in kB). | + | ### Live |
- | | + | |
- | | + | Suppose your program runs, but could it be faster? SLURM gives you |
- | < | + | a `Job ID`. Type `squeue --job myjobid` to find out on which node your |
- | #include "mpi.h" | + | job runs, say n372-007. Type `ssh n372-007` to connect to that |
+ | node. Type `top` to start a simple task manager: | ||
+ | |||
+ | ``` | ||
+ | [myuser@l32]$ sbatch job.sh | ||
+ | [myuser@l32]$ squeue -u myuser | ||
+ | JOBID PARTITION | ||
+ | 1098917 | ||
+ | [myuser@l32]$ ssh n372-007 | ||
+ | [myuser@n372-007]$ top | ||
+ | ``` | ||
+ | |||
+ | Within `top`, hit the following keys (case sensitive): `H t 1`. Now you | ||
+ | should see the load on each of the available CPUs, for | ||
+ | example: | ||
+ | |||
+ | ``` | ||
+ | top - 16:31:51 up 181 days, 1:04, 3 users, | ||
+ | Threads: 239 total, | ||
+ | %Cpu0 : 69.8/29.2 | ||
+ | %Cpu1 | ||
+ | %Cpu2 : 98.7/0.7 99[|||||||||||||||||||||||||||||||||||||||||||||||| ] | ||
+ | %Cpu3 | ||
+ | %Cpu4 : 99.0/0.3 99[|||||||||||||||||||||||||||||||||||||||||||||||| ] | ||
+ | %Cpu5 | ||
+ | %Cpu6 : 99.3/0.0 99[|||||||||||||||||||||||||||||||||||||||||||||||||] | ||
+ | %Cpu7 : 99.0/ | ||
+ | KiB Mem : 65861076 total, 60442504 free, 1039244 used, 4379328 buff/ | ||
+ | KiB Swap: 0 total, | ||
+ | |||
+ | PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND | ||
+ | 18876 myuser | ||
+ | 18856 myuser | ||
+ | 18870 myuser | ||
+ | 18874 myuser | ||
+ | 18872 myuser | ||
+ | 18873 myuser | ||
+ | 18871 myuser | ||
+ | 18875 myuser | ||
+ | 18810 root 20 | ||
... | ... | ||
- | MPI_Get_processor_name(processor_name, | + | ``` |
- | </ | + | |
- | < | + | In our example all 8 threads are utilised, which is good. The converse |
+ | does not hold, however: sometimes even the best case uses only 40% on | ||
+ | most CPUs! | ||
+ | |||
+ | The columns `VIRT` and `RES` indicate the *virtual* and *resident* | ||
+ | memory usage of each process, respectively (in kB unless noted | ||
+ | otherwise). The column `COMMAND` lists the name of the command or | ||
+ | application. | ||
+ | |||
+ | In the following screenshot we can see stats for all 32 threads of a compute node running `VASP`: | ||
+ | |||
+ | {{ : | ||
+ | |||
+ | |||
+ | ### Job Script | ||
+ | |||
+ | If you are using `Intel-MPI`, you can set this environment variable in your batch script to make Intel MPI print debug information about process placement at startup: | ||
+ | ``` | ||
+ | I_MPI_DEBUG=4 | ||
+ | ``` | ||
+ | |||
+ | ### Application Code | ||
+ | |||
+ | If your application is written in `C`, information about the locality of | ||
+ | processes and threads can be obtained via functions from any | ||
+ | of the following libraries: | ||
+ | |||
+ | #### mpi.h | ||
+ | ``` | ||
+ | #include " | ||
+ | ... | ||
+ | ``` | ||
+ | |||
+ | #### sched.h (scheduling parameters) | ||
+ | ``` | ||
#include < | #include < | ||
- | ... | + | ... CPU_ID = sched_getcpu(); |
- | CPU_ID = sched_getcpu(); | + | ``` |
- | </ | + | |
- | < | + | #### hwloc.h (Hardware locality) |
+ | ``` | ||
#include < | #include < | ||
... | ... | ||
Line 45: | Line 111: | ||
// compile: mpiicc -qopenmp -o ompMpiCoreIds ompMpiCoreIds.c -lhwloc | // compile: mpiicc -qopenmp -o ompMpiCoreIds ompMpiCoreIds.c -lhwloc | ||
- | </ | + | ``` |