Pinning
Tools and runtimes such as OpenMP, OpenMPI, and IntelMPI can be used for pinning (assigning processes and threads to specific cores and nodes) to improve the speed and efficiency of parallelized programs.
Need for processor affinity and/or pinning
To improve job performance, users can adjust processor affinity and/or pinning. The default cluster settings are generally convenient, but for specific cases, consider the following:
- minimizing communication paths: communication is fastest between cores of the same socket, slower between sockets, and slowest between nodes
- data locality effects: the cores of one node do not have uniform memory access, so where a thread runs relative to its data matters.
To parallelize a program well, that is, to distribute its processes and threads across nodes and cores, you need to understand the cluster being used and its configuration. This includes the maximum number of processes/threads that can run on a node, which is limited by the number of available cores. It is also important to understand the difference between threads and processes: threads are generally faster because they are decoupled from resource control, but for that reason they are limited to a single node, where they share memory.
Cluster: compute nodes and cores
VSC 4
Physical cores for processes/threads of Socket 0 are numbered from 0 to 23. Physical cores for processes/threads of Socket 1 are numbered from 24 to 47. Virtual cores for processes/threads are numbered from 48 to 95.
VSC 5 (Cascade Lake)
Physical cores for processes/threads of Socket 0 are numbered from 0 to 47. Physical cores for processes/threads of Socket 1 are numbered from 48 to 95. Virtual cores for processes/threads are numbered from 96 to 191.
VSC 5 (Zen)
Physical cores for processes/threads of Socket 0 are numbered from 0 to 63. Physical cores for processes/threads of Socket 1 are numbered from 64 to 127. Virtual cores for processes/threads are numbered from 128 to 255.
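The numbering scheme above can be reproduced with a few lines of shell. This is only a sketch: the helper name core_ranges is hypothetical, and it assumes two sockets' worth of physical cores are numbered first, followed by one virtual core per physical core, as in the three node types listed above.

```shell
# Print the core ID ranges of a node: physical cores first (socket by socket),
# then the virtual cores (assumption: one virtual core per physical core).
# Usage: core_ranges <sockets> <physical cores per socket>
core_ranges() (
    sockets=$1
    per_socket=$2
    physical=$((sockets * per_socket))
    s=0
    while [ "$s" -lt "$sockets" ]; do
        first=$((s * per_socket))
        last=$((first + per_socket - 1))
        echo "socket $s: physical cores $first-$last"
        s=$((s + 1))
    done
    echo "virtual cores: $physical-$((2 * physical - 1))"
)

core_ranges 2 24   # VSC 4 values from the text above
```

On a compute node, the output of lscpu can be used to check the actual topology against these ranges.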
Environment variables: For MPI, OpenMP, and hybrid jobs, environment variables such as proclist (OpenMP) and I_MPI_PIN_PROCESSOR_LIST (IntelMPI) must be set to match the cluster configuration and the number of processes, threads, and nodes that your parallelized application requires. After setting the environment variables, we recommend always monitoring the job(s).
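One simple way to monitor the effect of a pinning setting is to print the CPUs a process is allowed to run on. The following is a minimal sketch for Linux nodes; the helper name allowed_cpus is hypothetical, and it relies on the Cpus_allowed_list field that the Linux kernel exposes in /proc/<pid>/status.

```shell
# Print the list of CPUs a process is allowed to run on (Linux only).
# Usage: allowed_cpus [status-file]   (default: /proc/self/status)
allowed_cpus() {
    awk '/^Cpus_allowed_list:/ { print $2 }' "${1:-/proc/self/status}"
}

# Inside a job script, print the binding before starting the application.
if [ -r /proc/self/status ]; then
    echo "allowed CPUs: $(allowed_cpus)"
fi
```

For a process that is already running, the same information is available by passing /proc/<pid>/status as the argument.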
Types of parallel jobs
- pure OpenMP jobs
- pure MPI jobs
- hybrid jobs
1. Pure OpenMP jobs
OpenMP threads are pinned with the affinity variable of the compiler's runtime (KMP_AFFINITY for ICC, GOMP_CPU_AFFINITY for GCC). The default pin processor list is 0,1,…,n (n is the last core of a compute node).
Compiler examples supporting OpenMP
The spack compilers command lists the available compilers. However, a more common practice is to use the module avail gcc or module avail intel commands and then load the desired compiler:
ICC Example
    module load intel-oneapi-compilers/2023.1.0-gcc9.5.0-j52vcxx
    icc -fopenmp -o myprogram myprogram.c
GCC Example
    module load --auto gcc/12.2.0-gcc-9.5.0-aegzcbj
    gcc -fopenmp -o myprogram myprogram.c
Note the flag -fopenmp, necessary to instruct the compiler to enable OpenMP functionality.
Example Job Script for ICC
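    #!/bin/bash
    #SBATCH -J pureOMP
    #SBATCH -N 1

    export OMP_NUM_THREADS=4
    # Pin the 4 threads to cores 0, 4, 8 and 12 (proclist is used with the explicit affinity type)
    export KMP_AFFINITY="verbose,granularity=fine,proclist=[0,4,8,12],explicit"

    ./myprogram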
Example Job Script for GCC
    #!/bin/bash
    #SBATCH -J pureOMP
    #SBATCH -N 1

    export OMP_NUM_THREADS=4
    # Pin the 4 threads to cores 8, 9, 10 and 11
    export GOMP_CPU_AFFINITY="8-11"

    ./myprogram
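Core lists such as 0,4,8,12 do not have to be typed by hand. The sketch below builds one from a start core, a stride, and a thread count; the helper name cpu_list is hypothetical, and the example assumes that GOMP_CPU_AFFINITY accepts a comma-separated list, as documented for libgomp.

```shell
# Build a comma-separated core list: <count> cores, beginning at <start>,
# each <stride> apart.
# Usage: cpu_list <start> <stride> <count>
cpu_list() (
    start=$1
    stride=$2
    count=$3
    list=$start
    i=1
    while [ "$i" -lt "$count" ]; do
        list="$list,$((start + i * stride))"
        i=$((i + 1))
    done
    echo "$list"
)

export OMP_NUM_THREADS=4
# One thread on every fourth core, starting at core 0
export GOMP_CPU_AFFINITY="$(cpu_list 0 4 "$OMP_NUM_THREADS")"
echo "GOMP_CPU_AFFINITY=$GOMP_CPU_AFFINITY"
```

The same list can be placed inside proclist=[...] for KMP_AFFINITY.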
OMP_PROC_BIND and OMP_PLACES
    # Example: one place per core (cores 0 to 3); threads are assigned to the places in turn
    export OMP_PLACES="{0:1},{1:1},{2:1},{3:1}"
    # Specify whether threads may be moved between CPUs using OMP_PROC_BIND
    # "true" indicates that threads should be bound to the specified places
    export OMP_PROC_BIND=true
OMP_PLACES specifies where threads may be placed. In this example, each place consists of exactly one core, and the threads are distributed over these places in a round-robin fashion. OMP_PROC_BIND is set to “true” to indicate that threads should be bound to the specified places. The rest of your batch script should remain the same. Note that you might need to adjust the OMP_PLACES configuration based on your specific hardware architecture and the desired thread placement strategy.
Make sure to check the OpenMP documentation and your system's specifics to fine-tune these parameters for optimal performance. Additionally, monitor the performance of your parallelized code to ensure that the chosen thread placement strategy meets your performance goals.
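Besides explicit core numbers, the OpenMP specification also defines abstract place names and binding policies. The fragment below is a sketch of such a setup, not a recommendation for a particular code; it assumes a runtime that supports OpenMP 4.0 places, and OMP_DISPLAY_AFFINITY additionally assumes OpenMP 5.0 support, which older compilers may lack.

```shell
export OMP_NUM_THREADS=4
# One place per physical core instead of an explicit core list
export OMP_PLACES=cores
# "close" keeps threads on neighbouring places; "spread" distributes them
export OMP_PROC_BIND=close
# OpenMP 5.0: let the runtime report the binding of each thread
export OMP_DISPLAY_AFFINITY=true
```

With the affinity report enabled, the placement chosen by the runtime can be compared against the intended one before running long jobs.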
2. Pure MPI jobs
MPI processes: In a distributed computing environment, processes often need to communicate with each other across multiple cores and nodes. This communication is handled by the Message Passing Interface (MPI), a standardized and widely used communication protocol in high-performance computing. Unlike threads, processes are not decoupled from resource control. There are several MPI implementations, including OpenMPI, Intel MPI, and MPICH, each offering different options for process pinning, which is the assignment of processes to specific processor cores.
To choose the optimal MPI implementation for your parallelized application, follow these steps:
- Understand Your Application's Requirements: Consider scalability, compatibility, and any unique features your application needs.
- Explore Available MPI Implementations: Investigate popular MPI implementations like OpenMPI, Intel MPI, and MPICH. Explore their features, advantages, and limitations through their official documentation.
- Check Compatibility: Ensure the selected MPI implementation is compatible with the system architecture and meets any specific requirements. Seek guidance from system administrators or relevant documentation.
- Experiment with Basic Commands: After selecting an MPI implementation, experiment with basic commands like mpirun, mpiexec, and srun.
- Seek Assistance: Don't hesitate to seek help if you have questions or face challenges.
- Additional Resources: Explore MPI tutorials at: VSC training events
The default pin processor list is given by 0,1,…,n (n is the last core of a computing node).
Examples
Compatibility and Compilers
MPI compiler wrappers exist for different programming languages such as C, C++, and Fortran. The wrappers of the MPI implementations OpenMPI, Intel MPI, and MPICH are:
OpenMPI
- C: mpicc
- C++: mpic++ or mpiCC
- Fortran: mpifort, or mpif77 for Fortran 77 and mpif90 for Fortran 90
Intel MPI
- C: mpiicc
- C++: mpiicpc
- Fortran: mpiifort
MPICH
- C: mpicc
- C++: mpic++
- Fortran: mpifort
Use the 'module avail' command to list the available versions of your preferred MPI module, for example 'module avail openmpi'. In the same way, 'module avail' followed by your preferred compiler lists the compiler versions that can be used with MPI.
The following Slurm script examples are written for C applications with various compiler versions and serve as a practical starting point for writing your own batch scripts. Note that the environment variables differ between MPI implementations (OpenMPI, Intel MPI, and MPICH), and that the scripts differ depending on whether processes are launched with srun or mpiexec (mpirun). Adjust your Slurm scripts accordingly.
OPENMPI
SRUN
    #!/bin/bash
    #
    #SBATCH -N 2
    #SBATCH --ntasks-per-node 4
    #SBATCH --ntasks-per-core 1

    NUMBER_OF_MPI_PROCESSES=8

    module purge
    module load openmpi/4.1.4-gcc-8.5.0-p6nh7mw

    mpicc -o openmpi openmpi.c
    srun -n $NUMBER_OF_MPI_PROCESSES --mpi=pmi2 --cpu_bind=map_cpu:0,4,8,12 ./openmpi
Note: The --mpi=pmi2 flag tells srun which MPI launch mechanism to use. Here, pmi2 refers to the Process Management Interface (PMI-2), an interface for managing the processes of a parallel application. PMI-2 is not part of the MPI standard itself; MPI implementations use it to cooperate with resource management systems like SLURM (Simple Linux Utility for Resource Management) when starting MPI applications on a cluster.
MPIEXEC
    #!/bin/bash
    #
    #SBATCH -N 2
    #SBATCH --ntasks-per-node 4
    #SBATCH --ntasks-per-core 1

    NUMBER_OF_MPI_PROCESSES=8
    export OMPI_MCA_hwloc_base_binding_policy=core
    export OMPI_MCA_hwloc_base_cpu_set=0,6,16,64

    module purge
    module load openmpi/4.1.2-gcc-9.5.0-hieglt7

    mpicc -o openmpi openmpi.c
    mpiexec -n $NUMBER_OF_MPI_PROCESSES ./openmpi
INTELMPI
SRUN
    #!/bin/bash
    #
    #SBATCH -M vsc5
    #SBATCH -N 2
    #SBATCH --ntasks-per-node 4
    #SBATCH --ntasks-per-core 1

    export I_MPI_DEBUG=4
    NUMBER_OF_MPI_PROCESSES=8
    export I_MPI_PIN_PROCESSOR_LIST=0,6,16,64

    module purge
    module load intel/19.1.3
    module load intel-mpi/2021.5.0

    mpiicc -o intelmpi intelmpi.c
    srun -n $NUMBER_OF_MPI_PROCESSES --cpu_bind=map_cpu:0,4,8,12 ./intelmpi
MPIEXEC
    #!/bin/bash
    #
    #SBATCH -N 2
    #SBATCH --ntasks-per-node 4
    #SBATCH --ntasks-per-core 1

    export I_MPI_DEBUG=4
    NUMBER_OF_MPI_PROCESSES=8
    export I_MPI_PIN_PROCESSOR_LIST=0,6,16,64

    module purge
    module load intel/19.1.3
    module load intel-mpi/2021.5.0

    mpiicc -o intelmpi intelmpi.c
    mpiexec -n $NUMBER_OF_MPI_PROCESSES ./intelmpi
MPICH
SRUN
    #!/bin/bash
    #
    #SBATCH -N 2
    #SBATCH --ntasks-per-node 4
    #SBATCH --ntasks-per-core 1

    NUMBER_OF_MPI_PROCESSES=8

    module purge
    module load --auto mpich/4.0.2-gcc-12.2.0-vdvlylu

    mpicc -o mpich mpich.c
    srun -n $NUMBER_OF_MPI_PROCESSES --cpu_bind=map_cpu:0,4,8,12 ./mpich
Note: The flag --auto loads all dependencies.
3. Hybrid jobs
MPI (Message Passing Interface) handles communication between processes across multiple nodes. Running OpenMP threads within each node can be advantageous because the threads share memory and are decoupled from resource control, which reduces data exchange. In such hybrid jobs, threads are assigned to cores within one node, and communication between nodes is handled by MPI processes. This can be achieved through CPU binding. Compared with using only MPI processes both across and within nodes, the hybrid use of MPI and OpenMP typically yields better performance. As an illustration, the following configuration uses 3 nodes with one MPI process per node and 3 OpenMP threads per process:
    #!/bin/bash
    #
    #SBATCH -J mapCPU
    #SBATCH -N 3
    #SBATCH -n 3
    #SBATCH --ntasks-per-node=1
    #SBATCH --cpus-per-task=3
    #SBATCH --time=01:00:00

    export I_MPI_DEBUG=1
    NUMBER_OF_MPI_PROCESSES=3
    export OMP_NUM_THREADS=3

    module load intel/19.1.3
    module load intel-mpi/2021.5.0

    mpiicc -qopenmp -o myprogram myprogram.c
    # mask_cpu takes hexadecimal CPU masks: 0xf = CPUs 0-3, 0xf0 = CPUs 4-7, 0xf00 = CPUs 8-11
    srun -n $NUMBER_OF_MPI_PROCESSES --cpu_bind=mask_cpu:0xf,0xf0,0xf00 ./myprogram
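The hexadecimal masks passed to mask_cpu are easier to check once they are translated back into core numbers. The following sketch does this with shell arithmetic; the helper name mask_to_cpus is hypothetical, and because it relies on the shell's integer arithmetic it only handles masks that fit into a signed 64-bit integer (CPU IDs below 63).

```shell
# Translate a hexadecimal CPU mask (as used by mask_cpu) into a list of CPU IDs.
# Bit i of the mask corresponds to CPU i.
# Usage: mask_to_cpus <mask>   e.g. mask_to_cpus 0xf0
mask_to_cpus() (
    mask=$(($1))
    cpu=0
    list=""
    while [ "$mask" -gt 0 ]; do
        if [ $((mask & 1)) -eq 1 ]; then
            list="${list:+$list,}$cpu"
        fi
        mask=$((mask >> 1))
        cpu=$((cpu + 1))
    done
    echo "$list"
)

for m in 0xf 0xf0 0xf00; do
    echo "$m -> CPUs $(mask_to_cpus "$m")"
done
```

Each of the three masks from the job script selects four CPUs. A mask that selects exactly three CPUs, matching OMP_NUM_THREADS=3, would be 0x7.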