====== Pinning ======

Various tools and applications, such as OpenMP, OpenMPI, and IntelMPI, can be employed for pinning (assigning processes and threads to cores and nodes) to enhance the speed and efficiency of parallelized programs.
  
===== Need for processor affinity and/or pinning =====
  
To improve job performance, users can adjust processor affinity and/or pinning. The default cluster settings are generally convenient, but for specific cases consider the following:
  
  - <html><span style="color:blue;font-size:100%;">minimizing communication paths:</span></html> communication between cores of the same socket is fastest; it slows down in this sequence: between sockets, then between nodes.
  - <html><span style="color:blue;font-size:100%;">data locality effects:</span></html> the cores of one node do not have uniform memory access.

To optimize program parallelization, which involves allocating multiple processes and threads to nodes and cores, it is essential to understand the cluster being used and its configuration. This includes details such as the maximum number of processes/threads that can run on a node, which is limited by the number of available cores. It is also important to understand the distinction between threads and processes: threads are generally faster because resource control is decoupled, but they are therefore limited to a single node (utilizing shared memory).
==== Cluster compute nodes and cores ====

== VSC 4 ==
Physical cores for processes/threads of Socket 0 are numbered from 0 to 23.
Physical cores for processes/threads of Socket 1 are numbered from 24 to 47.
Virtual cores for processes/threads are numbered from 48 to 95.
== VSC 5 (Cascade Lake) ==
Physical cores for processes/threads of Socket 0 are numbered from 0 to 47.
Physical cores for processes/threads of Socket 1 are numbered from 48 to 95.
Virtual cores for processes are numbered from 96 to 191.
== VSC 5 (Zen) ==
Physical cores for processes/threads of Socket 0 are numbered from 0 to 63.
Physical cores for processes/threads of Socket 1 are numbered from 64 to 127.
Virtual cores are numbered from 128 to 255.
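To verify this numbering on a particular node, standard Linux tools can be used; a minimal sketch, assuming ''lscpu'' and ''numactl'' are available on the node:
<code>
# show sockets, cores per socket and hyperthreads per core
lscpu | grep -E "Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core"

# show which core IDs belong to which NUMA node (socket)
numactl --hardware
</code>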
  
  
**Environment variables:**
For MPI, OpenMP, and hybrid applications, environment variables such as ''proclist'' (OpenMP) or ''I_MPI_PIN_PROCESSOR_LIST'' (Intel MPI) must be configured according to the cluster configuration and the number of processes, threads, and nodes required by your parallelized application. After setting the environment variables, we recommend to <html><font color=#cc3300>always &#x27A0;</font></html> [[doku:monitoring|monitor]] the job(s).
  
  
==== 1. Pure OpenMP jobs ====
  
**OpenMP threads** are pinned with affinity environment variables such as <html><font color=#cc3300>KMP_AFFINITY</font></html> (Intel) or <html><font color=#cc3300>GOMP_CPU_AFFINITY</font></html> (GCC). The default pin processor list is given by <html><font color=#cc3300>0,1,...,n</font></html> (n is the last core of a compute node).
  
  
== Compiler examples supporting OpenMP ==
The ''spack compilers'' command lists the available compilers. However, a more common practice is to use ''module avail gcc'' or ''module avail intel'' and then load the desired compiler:
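For example, to list the installed GCC and Intel compiler modules before loading one:
<code>
module avail gcc
module avail intel
</code>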
**ICC Example**
<code>
module load intel-oneapi-compilers/2023.1.0-gcc9.5.0-j52vcxx
icc -fopenmp -o myprogram myprogram.c
</code>
**GCC Example**
<code>
module load --auto gcc/12.2.0-gcc-9.5.0-aegzcbj
gcc -fopenmp -o myprogram myprogram.c
</code>
Note the flag ''-fopenmp'', which instructs the compiler to enable OpenMP functionality.

**Example Job Script for ICC**
  
<code>
#!/bin/bash
#SBATCH -J pureOMP
#SBATCH -N 1               # OpenMP runs on a single node

export OMP_NUM_THREADS=4
export KMP_AFFINITY="verbose,granularity=fine,proclist=[0,4,8,12]"

./myprogram
</code>
  
**Example Job Script for GCC**

<code>
#!/bin/bash
#SBATCH -J pureOMP
#SBATCH -N 1

export OMP_NUM_THREADS=4
export GOMP_CPU_AFFINITY="8-11"
./myprogram
</code>
**OMP_PROC_BIND and OMP_PLACES**

<code>
# Example: place each thread on its own core
export OMP_PLACES="{0:1},{1:1},{2:1},{3:1}"

# Specify whether threads may be moved between CPUs using OMP_PROC_BIND
# "true" indicates that threads should be bound to the specified places
export OMP_PROC_BIND=true
</code>
OMP_PLACES specifies the placement of threads; in this example, each thread is assigned to its own core. OMP_PROC_BIND is set to "true" to indicate that threads should be bound to the specified places. The rest of your batch script remains the same. Note that you might need to adjust the OMP_PLACES configuration based on your specific hardware architecture and the desired thread placement strategy.
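Instead of explicit core numbers, thread placement can also be expressed with the abstract names defined by the OpenMP standard; a short sketch (whether ''close'' or ''spread'' is preferable depends on the application):
<code>
# one place per physical core, no explicit core IDs needed
export OMP_PLACES=cores

# keep threads close together (fill one socket first) ...
export OMP_PROC_BIND=close
# ... or spread them evenly over the available places
# export OMP_PROC_BIND=spread
</code>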
  
Make sure to check the OpenMP documentation and your system's specifics to fine-tune these parameters for optimal performance. Additionally, monitor the performance of your parallelized code to ensure that the chosen thread placement strategy meets your performance goals.
==== 2. Pure MPI jobs ====
  
**MPI processes:** In a distributed computing environment, processes often need to communicate with each other across multiple cores and **nodes**. This communication is facilitated by the Message Passing Interface (MPI), a standardized and widely used communication protocol in high-performance computing. Unlike threads, processes are not decoupled from resource control.
There are several MPI implementations, including OpenMPI, Intel MPI, and MPICH, each offering different options for process pinning, i.e., the assignment of processes to specific processor cores.
  
To choose the optimal MPI implementation for your parallelized application, follow these steps:
  
  * Understand Your Application's Requirements: Consider scalability, compatibility, and any unique features your application needs.
  * Explore Available MPI Implementations: Investigate popular MPI implementations like OpenMPI, Intel MPI, and MPICH. Explore their features, advantages, and limitations through their official documentation.
  * Check Compatibility: Ensure the selected MPI implementation is compatible with the system architecture and meets any specific requirements. Seek guidance from system administrators or relevant documentation.
  * Experiment with Basic Commands: After selecting an MPI implementation, experiment with basic commands like mpirun, mpiexec, and srun (see the sketch after this list).
  * Seek Assistance: Don't hesitate to seek help if you have questions or face challenges.
  * Additional Resources: Explore MPI tutorials at: [[https://vsc.ac.at/research/vsc-research-center/vsc-school-seminar/|VSC training events]]
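A minimal sketch of such basic launch commands, assuming a program already compiled as ''./myprogram'' with the matching MPI module loaded (the process count is an arbitrary example):
<code>
# launch 4 MPI processes with the implementation's own launchers
mpirun -n 4 ./myprogram
mpiexec -n 4 ./myprogram

# or launch through Slurm
srun -n 4 ./myprogram
</code>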
The default pin processor list is given by <html><font color="#cc3300">0,1,...,n</font></html> (n is the last core of a computing node).
  
==== Examples ====
=== Compatibility and Compilers ===
Various MPI compiler wrappers exist, catering to different programming languages such as C, C++, and Fortran, for the common MPI implementations Intel MPI, Open MPI, and MPICH:
**OpenMPI**
  * C: mpicc
  * C++: mpic++ or mpiCC
  * Fortran: mpifort or mpif77 for Fortran 77, mpif90 for Fortran 90

**Intel MPI**
  * C: mpiicc
  * C++: mpiicpc
  * Fortran: mpiifort

**MPICH**
  * C: mpicc
  * C++: mpic++
  * Fortran: mpifort
Use the ''module avail'' command to investigate available MPI versions by specifying your preferred MPI module, for example ''module avail openmpi''. Similarly, you can check for available compiler versions compatible with MPI by using ''module avail'' followed by your preferred compiler, providing a comprehensive overview of the available options.
Following are a few Slurm script examples written for C applications with various compiler versions. These examples provide a glimpse into writing batch scripts and serve as a practical guide for creating your own.
Note that environment variables differ between MPI implementations (OpenMPI, Intel MPI, and MPICH), and the Slurm scripts also vary between srun and mpiexec. Adjust your Slurm scripts depending on whether you use srun or mpiexec for process launching.
=== OPENMPI ===
**SRUN**
<code>
#!/bin/bash
#
#SBATCH -N 2
#SBATCH --ntasks-per-node 4
#SBATCH --ntasks-per-core 1

NUMBER_OF_MPI_PROCESSES=8

module purge
module load openmpi/4.1.4-gcc-8.5.0-p6nh7mw

mpicc -o openmpi openmpi.c
srun -n $NUMBER_OF_MPI_PROCESSES --mpi=pmi2 --cpu_bind=map_cpu:0,4,8,12 ./openmpi
</code>
Note: The //--mpi=pmi2// flag is a command-line argument commonly used when executing MPI applications. It specifies the MPI launch system to be used. In this context, pmi2 refers to the Process Management Interface (PMI-2), which provides an interface for managing processes in parallel applications. PMI-2 is often used in conjunction with resource management systems like Slurm to run MPI applications on a cluster.
  
**MPIEXEC**
<code>
#!/bin/bash
#
#SBATCH -N 2
#SBATCH --ntasks-per-node 4
#SBATCH --ntasks-per-core 1

NUMBER_OF_MPI_PROCESSES=8
export OMPI_MCA_hwloc_base_binding_policy=core
export OMPI_MCA_hwloc_base_cpu_set=0,6,16,64

module purge
module load openmpi/4.1.2-gcc-9.5.0-hieglt7

mpicc -o openmpi openmpi.c
mpiexec -n $NUMBER_OF_MPI_PROCESSES ./openmpi
</code>
=== INTELMPI ===
**SRUN**
<code>
#!/bin/bash
#
#SBATCH -M vsc5
#SBATCH -N 2
#SBATCH --ntasks-per-node 4
#SBATCH --ntasks-per-core 1

export I_MPI_DEBUG=4
NUMBER_OF_MPI_PROCESSES=8
export I_MPI_PIN_PROCESSOR_LIST=0,6,16,64

module purge
module load intel/19.1.3
module load intel-mpi/2021.5.0

mpiicc -o intelmpi intelmpi.c
srun -n $NUMBER_OF_MPI_PROCESSES --cpu_bind=map_cpu:0,4,8,12 ./intelmpi
</code>
**MPIEXEC**
<code>
#!/bin/bash
#
#SBATCH -N 2
#SBATCH --ntasks-per-node 4
#SBATCH --ntasks-per-core 1

export I_MPI_DEBUG=4
NUMBER_OF_MPI_PROCESSES=8
export I_MPI_PIN_PROCESSOR_LIST=0,6,16,64

module purge
module load intel/19.1.3
module load intel-mpi/2021.5.0

mpiicc -o intelmpi intelmpi.c
mpiexec -n $NUMBER_OF_MPI_PROCESSES ./intelmpi
</code>
  
=== MPICH ===
**SRUN**
<code>
#!/bin/bash
#
#SBATCH -N 2
#SBATCH --ntasks-per-node 4
#SBATCH --ntasks-per-core 1

NUMBER_OF_MPI_PROCESSES=8

module purge
module load --auto mpich/4.0.2-gcc-12.2.0-vdvlylu

mpicc -o mpich mpich.c
srun -n $NUMBER_OF_MPI_PROCESSES --cpu_bind=map_cpu:0,4,8,12 ./mpich
</code>
Note: The flag ''--auto'' loads all dependencies.

**MPIEXEC**
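A minimal sketch of the corresponding mpiexec launch, assuming the same MPICH module and source file as in the srun example above; MPICH's Hydra ''mpiexec'' accepts ''-bind-to core'' to pin each process to a single core:
<code>
#!/bin/bash
#
#SBATCH -N 2
#SBATCH --ntasks-per-node 4
#SBATCH --ntasks-per-core 1

NUMBER_OF_MPI_PROCESSES=8

module purge
module load --auto mpich/4.0.2-gcc-12.2.0-vdvlylu

mpicc -o mpich mpich.c
# Hydra's mpiexec does the pinning itself here (one core per process)
mpiexec -bind-to core -n $NUMBER_OF_MPI_PROCESSES ./mpich
</code>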
==== 3. Hybrid jobs ====

MPI (Message Passing Interface) is used for communication between processes across multiple nodes. Running OpenMP within each node can be advantageous, as it reduces data exchange because the threads of a node share its memory. Combining both approaches (hybrid jobs) can therefore enhance performance: threads are assigned to cores within one node, and communication between nodes is handled by MPI processes. This is achieved through CPU binding. Compared to exclusively using MPI processes across and within nodes (for example, 3 nodes with 3 MPI processes on each node and no OpenMP), the hybrid use of MPI and OpenMP typically yields better performance. The example below uses 3 nodes with one MPI process per node and 3 OpenMP threads per process:
  
<code>
#!/bin/bash
#
#SBATCH -J mapCPU
#SBATCH -N 3
#SBATCH -n 3
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=3
#SBATCH --time=00:60:00

export I_MPI_DEBUG=1
NUMBER_OF_MPI_PROCESSES=3
export OMP_NUM_THREADS=3

module load intel/19.1.3
module load intel-mpi/2021.5.0
mpiicc -qopenmp -o myprogram myprogram.c

srun -n $NUMBER_OF_MPI_PROCESSES --cpu_bind=mask_cpu:0xf,0xf0,0xf00 ./myprogram
</code>
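The hexadecimal values passed to ''--cpu_bind=mask_cpu'' encode one bit per core: ''0xf'' selects cores 0-3 for the first task, ''0xf0'' cores 4-7 for the second, and ''0xf00'' cores 8-11 for the third. A quick way to derive such a mask from a binary core pattern, e.g. with Python:
<code>
# bits are read from right to left, one per core; here cores 8-11 are set
python3 -c 'print(hex(int("111100000000", 2)))'   # prints 0xf00
</code>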
  