This is an old revision of the document!


Environment variables : Setting environment variables does unfortunately not always lead to the expected outcome. ➠ Thus, it is highly encouraged to always monitor if the job is doing what it is supposed to.

There are several ways to monitor your job, either in live time directly on the compute node or by modifying the job script or the application code:

  • live submit the job and connect with the compute node

[xy@l32]$ sbatch job.sh
[xy@l32]$ squeue -u xy
JOBID    PARTITION  NAME      USER  ST  TIME  NODES  NODELIST(REASON)
5066692  mem_0064   nn1tn8th   xy   R   0:02    1    n09-005
[xy@l32]$ ssh n09-005
[xy@n09-005]$ top     

When typing 1 during the top call, per-core-information is obtained, e.g., about the cpu-usage, compare with the picture to the right. The user can select the parameters to be shown from a list displayed when typing f.

  • batch script (Intel-MPI) set: I_MPI_DEBUG=4
  • code via library functions information about the locality of processes and threads can be obtained (libraries: mpi.h or in C-code hwloc.h (hardware locality) or sched.h (scheduling parameters))
#include "mpi.h"
...
MPI_Get_processor_name(processor_name, &namelen);
#include <sched.h>
...
CPU_ID = sched_getcpu();
#include <hwloc.h>
...
    hwloc_topology_t topology;
    hwloc_cpuset_t cpuset;
    hwloc_obj_t obj;
    hwloc_topology_init ( &topology);
    hwloc_topology_load ( topology);
    hwloc_bitmap_t set = hwloc_bitmap_alloc();
    hwloc_obj_t pu;
    err = hwloc_get_proc_cpubind(topology, getpid(), set, HWLOC_CPUBIND_PROCESS);
    pu = hwloc_get_pu_obj_by_os_index(topology, hwloc_bitmap_first(set));
    int my_coreid = hwloc_bitmap_first(set);
    hwloc_bitmap_free(set);
    hwloc_topology_destroy(topology);

//  compile: mpiicc -qopenmp -o ompMpiCoreIds ompMpiCoreIds.c -lhwloc
  • doku/monitoring.1489759135.txt.gz
  • Last modified: 2017/03/17 13:58
  • by ir