Processor affinity and/or pinning
Need for processor affinity and/or pinning
Job performance may require that the user controls processor affinity and/or pinning. The default configuration on VSC-3 is convenient in most cases; to tune it for a specific case, the following items may be considered:
- <html><span style=“color:blue;font-size:100%;”>minimizing communication paths:</span></html> communication between cores of the same socket is fastest; it becomes slower in this order: between sockets, between nodes inside the same island*, and between nodes of different islands*.
- <html><span style=“color:blue;font-size:100%;”>data locality effects:</span></html> the cores of one node do not have uniform memory access, see also VSC 3 architecture. The physical cores 0–7 and the virtual cores 16–23 belong to socket 0 and have the fastest access to the memory of this socket (NUMA node P#0). The cores of socket 0 reach the memory of socket 1 (NUMA node P#1) over the inter-socket interconnect (QPI), which is slower.
- processes should obviously be <html><span style=“color:blue;font-size:100%;”>evenly distributed</span></html> over the node(s); nevertheless, it can happen unintentionally that all threads are placed on one core while the other cores idle.
* The nodes of VSC-3 are grouped in 8 islands. Messages within a single island involve hops within the lower two levels of the InfiniBand hierarchy; messages between two islands involve hops within all three levels.
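The core numbering described in the list above can be written down as a small lookup. This is only a sketch in Python: the function names `socket_of` and `same_socket` are hypothetical, and it assumes the layout stated on this page (cores 0–15 physical, 16–31 virtual, 8 physical cores per socket).

```python
def socket_of(core_id: int) -> int:
    """Return the socket (0 or 1) that a logical core ID belongs to.

    Assumed layout (taken from the text): cores 0-7 and 16-23 sit on
    socket 0, cores 8-15 and 24-31 on socket 1.
    """
    if not 0 <= core_id < 32:
        raise ValueError(f"core ID {core_id} is outside 0..31")
    # Virtual cores 16..31 map back onto physical cores 0..15.
    return (core_id % 16) // 8


def same_socket(core_a: int, core_b: int) -> bool:
    """True if both cores have their fastest memory access on the same NUMA node."""
    return socket_of(core_a) == socket_of(core_b)


print([socket_of(c) for c in (0, 7, 8, 15, 16, 23, 24, 31)])
# prints [0, 0, 1, 1, 0, 0, 1, 1]
```

Such a check can help when choosing a core list by hand, e.g. to verify that threads sharing data stay on one socket.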
Environment variables: Unfortunately, setting environment variables does <html><span style=“color:blue;font-size:100%;”>not always lead to the expected outcome.</span> </html> It is therefore strongly recommended to <html><font color=#cc3300>always ➠</font></html> monitor whether the job is doing what it is supposed to.
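One simple way to check the effective pinning from inside a job is to ask the operating system which cores the current process may run on. The following Python sketch is not part of the VSC tooling; the helper name `allowed_cores` is hypothetical, and it assumes a Linux node where `os.sched_getaffinity` is available (with a fallback to all cores elsewhere).

```python
import os


def allowed_cores() -> list:
    """Return the sorted list of core IDs the current process may run on."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: the affinity mask as set by SLURM, MPI, or likwid-pin.
        return sorted(os.sched_getaffinity(0))
    # Fallback for platforms without affinity support: assume all cores.
    return list(range(os.cpu_count() or 1))


if __name__ == "__main__":
    # SLURM_PROCID is set by srun; "?" is shown when run outside of srun.
    rank = os.environ.get("SLURM_PROCID", "?")
    print(f"task {rank} on {os.uname().nodename}: cores {allowed_cores()}")
```

Started once per task, e.g. via srun with the same binding options as the real job, it prints one line per task with the cores that task is bound to.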
Types of parallel jobs
- pure OpenMP jobs
- pure MPI jobs
- hybrid jobs
1. Pure OpenMP jobs
OpenMP threads are pinned with <html><font color=#cc3300>KMP_AFFINITY.</font></html> Its default pin processor list is given by proclist=[{0…7,16…23},…,{8…15,24…31}]. This syntax says that each of the threads 0–7 runs on one of the cores of socket 0 and each of the threads 8–15 on one of the cores of socket 1, see also VSC 3 architecture.
The elements of proclist list the allowed core IDs for the threads 0, 1, … in order. Sets of allowed cores are given in curly brackets, single cores as plain numbers, e.g., proclist=[0,1,2,3,4,5,{6,7},{6,7},8,9,10,11,12,{13,14,15},{13,14,15},{13,14,15}].
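To avoid typing errors in long lists, the proclist string can be generated. The sketch below is in Python; the helper `format_proclist` is a hypothetical name, and it only produces the `proclist=[…]` fragment in the syntax described above, not a complete KMP_AFFINITY value.

```python
def format_proclist(entries):
    """Build a 'proclist=[...]' fragment for KMP_AFFINITY.

    entries[i] describes thread i: either a single core ID (int) or an
    iterable of allowed core IDs, which is written in curly brackets.
    """
    parts = []
    for entry in entries:
        if isinstance(entry, int):
            parts.append(str(entry))
        else:
            cores = sorted(set(entry))
            if not cores:
                raise ValueError("empty set of allowed cores")
            parts.append("{" + ",".join(str(c) for c in cores) + "}")
    return "proclist=[" + ",".join(parts) + "]"


print(format_proclist([0, 1, (6, 7), (6, 7), 8]))
# prints proclist=[0,1,{6,7},{6,7},8]
```

The result can be pasted into the KMP_AFFINITY setting of a job script, e.g. after "verbose,granularity=fine,".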
Example
icc -qopenmp -o my_program myprogram.c
#!/bin/bash
#SBATCH -J pureOMP
# OpenMP is only running on one node
#SBATCH -N 1
export OMP_NUM_THREADS=4
export KMP_AFFINITY="verbose,granularity=fine,proclist=[0,4,8,12]"
./my_program
2. Pure MPI jobs
MPI processes are pinned with <html><font color=#cc3300>I_MPI_PIN_PROCESSOR_LIST.</font></html> The default environment on VSC-3 is I_MPI_PIN_PROCESSOR_LIST=0-15, i.e., the processes are pinned to the physical cores only. Cores 0–15 are physical, cores 16–31 virtual, see also VSC 3 architecture.
Example
module load intel/18.0.2 intel-mpi/2018.2
mpiicc -o my_program myprogram.c
With srun:
#!/bin/bash
#SBATCH -J pureMPI
#SBATCH -N 4
#SBATCH --ntasks-per-node=4
module load intel/18.0.2 intel-mpi/2018.2
export I_MPI_DEBUG=4
NUMBER_OF_MPI_PROCESSES=16
export OMP_NUM_THREADS=1
srun -n $NUMBER_OF_MPI_PROCESSES --cpu_bind=map_cpu:0,4,8,12 ./my_program
With mpirun:
#!/bin/bash
#SBATCH -J pureMPI
#SBATCH -N 4
#SBATCH --ntasks-per-node=4
module load intel/18.0.2 intel-mpi/2018.2
export I_MPI_DEBUG=4
export I_MPI_PIN_PROCESSOR_LIST=0,4,8,12
NUMBER_OF_MPI_PROCESSES=16
export OMP_NUM_THREADS=1
mpirun -n $NUMBER_OF_MPI_PROCESSES ./my_program
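For comparison with the I_MPI_DEBUG output, the placement one would expect from such a core list can be tabulated. This Python sketch rests on two assumptions that should be checked against the actual job output: ranks are distributed block-wise over the nodes, and the i-th rank on a node is pinned to the i-th entry of the core list. The function name `expected_placement` is hypothetical.

```python
def expected_placement(n_ranks, ranks_per_node, core_list):
    """Map each MPI rank to the (node index, core ID) it is expected to use.

    Assumes block distribution of ranks over nodes and that the i-th
    local rank is pinned to core_list[i].
    """
    if ranks_per_node > len(core_list):
        raise ValueError("core list is shorter than the number of ranks per node")
    placement = {}
    for rank in range(n_ranks):
        node, local_rank = divmod(rank, ranks_per_node)
        placement[rank] = (node, core_list[local_rank])
    return placement


for rank, (node, core) in expected_placement(16, 4, [0, 4, 8, 12]).items():
    print(f"rank {rank:2d} -> node {node}, core {core}")
```

For the scripts above (16 processes, 4 per node, cores 0,4,8,12) this gives, e.g., rank 5 on the second node (index 1) and core 4.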
3. Hybrid jobs
Unfortunately, the combination of KMP_AFFINITY and I_MPI_PIN_PROCESSOR_LIST does not always lead to the expected pinning. For example, with 4 MPI processes and 4 OpenMP threads each on one node, all threads of one MPI process end up on the same core, no matter how the two variables are changed.
<html>
Therefore, <span style=“color:blue;font-size:100%;”>pinning with a binary CPU mask</span> is introduced in the following. Each of the 16 cores of one </html> node <html>is represented by one digit of a binary number: 1 is assigned to cores where a thread should run and 0 to the remaining cores. One such mask is built for each of the MPI processes running on the node.
</html>
The binary mask is translated into hexadecimal, which can easily be obtained with, e.g., the binaryhexconverter or Python: hex(int("1111000000000000",2)) yields 0xf000.
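The masks for all MPI processes of a node can also be generated instead of converted one by one. The Python sketch below uses hypothetical helper names (`cores_to_mask`, `masks_for_node`) and assumes the usual convention that the lowest-order bit of the mask stands for core 0, so that 1111000000000000 selects cores 12–15.

```python
def cores_to_mask(cores):
    """Return the hex CPU mask that has one bit set per listed core ID."""
    mask = 0
    for core in cores:
        if core < 0:
            raise ValueError("core IDs must be non-negative")
        mask |= 1 << core  # bit 0 (rightmost digit) corresponds to core 0
    return hex(mask)


def masks_for_node(ranks_per_node, threads_per_rank, first_core=0):
    """One mask per MPI process, each covering a consecutive block of cores."""
    masks = []
    for local_rank in range(ranks_per_node):
        start = first_core + local_rank * threads_per_rank
        masks.append(cores_to_mask(range(start, start + threads_per_rank)))
    return masks


print("mask_cpu:" + ",".join(masks_for_node(4, 4)))
# prints mask_cpu:0xf,0xf0,0xf00,0xf000
```

With 4 MPI processes per node and 4 threads each, this reproduces the mask list used with --cpu_bind in the hybrid job script.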
module load intel/18.0.2 intel-mpi/2018.2
mpiicc -qopenmp -o my_program myprogram.c
#!/bin/bash
#
#SBATCH -J mapCPU
#SBATCH -N 4
#SBATCH --time=00:60:00
module load intel/18.0.2 intel-mpi/2018.2
export I_MPI_DEBUG=4
NUMBER_OF_MPI_PROCESSES=16
export OMP_NUM_THREADS=4
srun -n $NUMBER_OF_MPI_PROCESSES --cpu_bind=mask_cpu:0xf,0xf0,0xf00,0xf000 ./my_program
Likwid 4.0
Background:
It is proving increasingly difficult to exert control over the assignment of different threads to the available CPU cores in multi-threaded OpenMP applications. Particularly troublesome are hybrid MPI/OpenMP codes. Here, the developer usually has a comprehensive knowledge of the regions running in parallel, but relies on the OS for optimal assignment of different physical cores to the individual computing threads. A variety of methods do exist to explicitly state the link between a CPU core and a particular thread. However, in practice many of these configurations turn out to be either non-functional, dependent on MPI versions, or frequently ineffective and, moreover, are overruled by the queuing system (e.g. SLURM). In the following, the auxiliary tool likwid-pin is described. It has shown promise in successfully managing arbitrary thread assignment to individual CPU cores in a more general way.
Example:
Suppose we have the following little test program, test_mpit_var.c, and want to run it using 8 threads on a single compute node on the following set of physical cores: 3, 4, 2, 1, 6, 5, 7, 9. After compilation, e.g., via mpigcc -fopenmp ./test_mpit_var.c, the following SLURM submit script could be used:
#!/bin/bash
#
#SBATCH -J tmv
#SBATCH -N 1
#SBATCH --time=00:10:00
module purge
module load gcc/5.3 likwid/4.2.0
module load intel/18.0.2 intel-mpi/2018.2
export OMP_NUM_THREADS=8
likwid-pin -c 3,3,4,2,1,6,5,7,9 ./a.out
- Note the repeated declaration of the initial core #3. It is required because a single main task is started first, which subsequently branches out into 8 parallel threads.
- Thread #0 must run on the same core as the parent process (main task), e.g. core #3 in the above example.
- There are plenty of additional ways to define appropriate masks for thread domains (see link below). For example, to employ all available physical cores on both sockets in an explicit order, export OMP_NUM_THREADS=16 could be set and then likwid-pin -c 2,2,0,1,3,7,4,6,5,10,8,9,11,15,12,14,13 ./a.out could be called.
- likwid-pin works exactly the same way with the Intel compilers. For example, the above submit script would have led to the same type of results when the program is compiled with mpiicc -qopenmp ./test_mpit_var.c.
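Following the notes above, the core list passed to likwid-pin starts with the core of the main task, listed twice. A small Python sketch (the helper name `likwid_core_list` is hypothetical) builds this argument from the desired thread-to-core order:

```python
def likwid_core_list(cores):
    """Build the argument for 'likwid-pin -c' from an ordered list of cores.

    The first core is repeated: once for the parent process (main task)
    and once for thread #0, which has to run on the same core.
    """
    cores = list(cores)
    if not cores:
        raise ValueError("at least one core is required")
    if len(set(cores)) != len(cores):
        raise ValueError("each core should be listed only once")
    return ",".join(str(c) for c in [cores[0]] + cores)


print("likwid-pin -c " + likwid_core_list([3, 4, 2, 1, 6, 5, 7, 9]) + " ./a.out")
# prints likwid-pin -c 3,3,4,2,1,6,5,7,9 ./a.out
```

The number of cores passed to the helper should equal OMP_NUM_THREADS.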
MPI/OpenMP:
likwid-pin may also be used for hybrid MPI/OpenMP applications. For example, to run the little test program on 4 nodes using 16 threads per node, the submit script simply has to be modified in the following way:
#!/bin/bash
#
#SBATCH -J tmvmxd
#SBATCH -N 4
#SBATCH --time=00:60:00
module purge
module load gcc/5.3 likwid/4.2.0
module load intel/18.0.2 intel-mpi/2018.2
export OMP_NUM_THREADS=16
srun -n4 likwid-pin -c 0,0-15 ./a.out