====== Core Pinning ======
Various tools and applications, e.g. the MPI and OpenMP runtimes, provide mechanisms for pinning processes and threads to specific cores.
===== Need for processor affinity and/or pinning =====
To improve job performance, users can adjust the processor affinity, i.e. pin processes and threads to specific cores.
To optimize program parallelization, involving the allocation of multiple processes and threads to the available cores, it helps to know how the cores of the compute nodes are numbered:
====Cluster compute nodes and cores====

==VSC 4==
Physical cores for processes/threads of Socket 0 are numbered from 0 to 23.
Physical cores for processes/threads of Socket 1 are numbered from 24 to 47.
Virtual cores for processes/threads are numbered from 48 to 95.
==VSC 5 (Cascade Lake)==
Physical cores for processes/threads of Socket 0 are numbered from 0 to 47.
Physical cores for processes/threads of Socket 1 are numbered from 48 to 95.
Virtual cores for processes are numbered from 96 to 191.
==VSC 5 (Zen)==
Physical cores for processes/threads of Socket 0 are numbered from 0 to 63.
Physical cores for processes/threads of Socket 1 are numbered from 64 to 127.
Virtual cores are numbered from 128 to 255.
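To see how the cores of a particular node are numbered, standard Linux tools can be used; a minimal sketch, assuming ''lscpu'' and ''numactl'' are available on the compute node:
<code>
# summary of sockets, cores per socket and hardware threads per core
lscpu | grep -E "Socket|Core|Thread|NUMA"

# detailed listing of which core IDs belong to which NUMA node/socket
numactl --hardware
</code>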
**Environment variables:**
For MPI, OpenMP, and hybrid job applications, pinning is controlled via environment variables, which have to be set before the program is executed (e.g. in the job script).

==== 1. Pure OpenMP jobs ====
**OpenMP threads** are pinned with compiler-specific environment variables such as ''KMP_AFFINITY'' (Intel) and ''GOMP_CPU_AFFINITY'' (GCC), or with the portable variables ''OMP_PLACES'' and ''OMP_PROC_BIND'' (see below).
==Compiler examples supporting OpenMP==
The ''spack compilers'' command lists the available compilers. However, a more common practice is to use the ''module avail gcc'' or ''module avail intel'' commands and then load the desired compiler:

**ICC Example**
<code>
module load intel-oneapi-compilers
icc -fopenmp -o myprogram myprogram.c
</code>
**GCC Example**
<code>
module load --auto gcc
gcc -fopenmp -o myprogram myprogram.c
</code>
Note the flag ''-fopenmp'', which is necessary to instruct the compiler to enable OpenMP functionality.

**Example Job Script for ICC**
<code>
#!/bin/bash
#SBATCH -J pureOMP
#SBATCH -N 1

export OMP_NUM_THREADS=4
# example: pin the 4 threads to cores 0-3 (adapt the proclist to your needs)
export KMP_AFFINITY="granularity=fine,proclist=[0,1,2,3],explicit"
./myprogram
</code>
**Example Job Script for GCC**
<code>
#!/bin/bash
#SBATCH -J pureOMP
#SBATCH -N 1

export OMP_NUM_THREADS=4
# example: pin the 4 threads to cores 0-3 (adapt the list to your needs)
export GOMP_CPU_AFFINITY="0-3"
./myprogram
</code>
**OMP_PROC_BIND AND OMP_PLACES**

<code>
# Example: place the threads on individual cores in a round-robin fashion
export OMP_PLACES="{0},{1},{2},{3}"

# Specify whether threads may be moved between CPUs using OMP_PROC_BIND
# "true" binds each thread to its place; "false" allows the runtime to migrate threads
export OMP_PROC_BIND=true
</code>

''OMP_PLACES'' specifies the placement of threads. In this example, each thread is assigned to a specific core in a round-robin fashion.
''OMP_PROC_BIND'' is set to "true", so threads are bound to their places and are not migrated between cores.
The rest of your batch script remains the same.
Note that you might need to adjust the ''OMP_PLACES'' configuration based on your specific hardware architecture and the desired thread placement strategy.
Make sure to check the OpenMP documentation and your system's documentation for details on thread placement.
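For example, a complete batch script using ''OMP_PLACES''/''OMP_PROC_BIND'' instead of the compiler-specific variables could look like this (a sketch, assuming the same ''myprogram'' executable as above):
<code>
#!/bin/bash
#SBATCH -J pureOMP
#SBATCH -N 1

export OMP_NUM_THREADS=4
# one place per core; threads are bound to their places
export OMP_PLACES="{0},{1},{2},{3}"
export OMP_PROC_BIND=true
./myprogram
</code>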
==== 2. Pure MPI jobs ====
**MPI processes** are pinned to cores by the MPI library and/or by Slurm.
In a distributed computing environment, MPI processes run on many cores, possibly spread over several nodes, and communicate with each other over the network.
There are several MPI implementations, including OpenMPI, Intel MPI, and MPICH, each offering different options for process pinning, which is the assignment of processes to specific processor cores.
To choose the optimal MPI implementation for your parallelized application, consider the following points:
  * Understand Your Application's Requirements.
  * Explore Available MPI Implementations: OpenMPI, Intel MPI, and MPICH are available as modules.
  * Check Compatibility: the MPI library must match the compiler and the libraries your application uses.
  * Experiment with Basic Commands: After selecting an MPI implementation, try it on a small test case first (see the sketch below).
  * Seek Assistance: Don't hesitate to seek help if you have questions or face challenges.
  * Additional Resources: Explore MPI tutorials at the [[https://vsc.ac.at/|VSC website]].
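For instance, the installed MPI modules and the compiler behind a wrapper can be inspected like this (a sketch; module names depend on the current installation):
<code>
# list the installed MPI modules
module avail openmpi
module avail intel-mpi
module avail mpich

# after loading one implementation, check what the wrapper actually calls
mpicc --version      # underlying compiler
mpicc -show          # Intel MPI / MPICH: print the full compile command
mpicc --showme       # Open MPI equivalent
</code>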

The default pin processor list is given by the environment variable ''I_MPI_PIN_PROCESSOR_LIST'' (Intel MPI); the examples below show how to set it explicitly.

==== Examples ====
===Compatibility and Compilers===
Various MPI compiler wrappers exist for the different programming languages (C, C++, Fortran) and MPI implementations (Open MPI, Intel MPI, MPICH):

**OpenMPI**
  * C: mpicc
  * C++: mpic++ or mpiCC
  * Fortran: mpifort or mpif77 for Fortran 77, mpif90 for Fortran 90

**Intel MPI**
  * C: mpiicc
  * C++: mpiicpc
  * Fortran: mpiifort

**MPICH**
  * C: mpicc
  * C++: mpic++
  * Fortran: mpifort

Use the compiler wrapper that matches your MPI implementation and programming language to compile your application.
Following are a few Slurm script examples written for C applications with various compiler versions. These examples provide a glimpse into writing batch scripts and serve as a practical guide for creating your own scripts.
Note that the environment variables differ between the MPI implementations (OpenMPI, Intel MPI, and MPICH), and the Slurm scripts also vary between srun and mpiexec. Adjust your Slurm scripts accordingly.
===OPENMPI===
**SRUN**
<code>
#!/bin/bash
#
#SBATCH -N 2
#SBATCH --ntasks-per-node 4
#SBATCH --ntasks-per-core 1

NUMBER_OF_MPI_PROCESSES=8

module purge
module load openmpi/4.1.4-gcc-8.5.0-p6nh7mw
mpicc -o openmpi openmpi.c
srun -n $NUMBER_OF_MPI_PROCESSES --cpu_bind=map_cpu:0,6,16,64 ./openmpi
</code>
Note: The //--cpu_bind// option of ''srun'' determines to which cores the MPI processes are pinned.
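To verify that the requested pinning is actually applied, Slurm can report the binding it uses; a minimal sketch (prepending ''verbose'' to the ''--cpu_bind'' specification):
<code>
srun -n $NUMBER_OF_MPI_PROCESSES --cpu_bind=verbose,map_cpu:0,6,16,64 ./openmpi
</code>
With Open MPI's ''mpiexec'', the option ''--report-bindings'' provides similar output.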
**MPIEXEC**
<code>
#!/bin/bash
#
#SBATCH -N 2
#SBATCH --ntasks-per-node 4
#SBATCH --ntasks-per-core 1

NUMBER_OF_MPI_PROCESSES=8
export OMPI_MCA_hwloc_base_binding_policy=core
export OMPI_MCA_hwloc_base_cpu_set=0,6,16,64

module purge
module load openmpi/4.1.2-gcc-9.5.0-hieglt7

mpicc -o openmpi openmpi.c
mpiexec -n $NUMBER_OF_MPI_PROCESSES ./openmpi
</code>

===INTELMPI===
**SRUN**
<code>
#!/bin/bash
#
#SBATCH -M vsc5
#SBATCH -N 2
#SBATCH --ntasks-per-node 4
#SBATCH --ntasks-per-core 1

export I_MPI_DEBUG=4
NUMBER_OF_MPI_PROCESSES=8
export I_MPI_PIN_PROCESSOR_LIST=0,4,8,12

module purge
module load intel/19.1.3
module load intel-mpi

mpiicc -o intelmpi intelmpi.c
srun -n $NUMBER_OF_MPI_PROCESSES --cpu_bind=map_cpu:0,4,8,12 ./intelmpi
</code>
**MPIEXEC**
<code>
#!/bin/bash
#
#SBATCH -N 2
#SBATCH --ntasks-per-node 4
#SBATCH --ntasks-per-core 1

export I_MPI_DEBUG=4
NUMBER_OF_MPI_PROCESSES=8
export I_MPI_PIN_PROCESSOR_LIST=0,4,8,12

module purge
module load intel/19.1.3
module load intel-mpi
mpiicc -o intelmpi intelmpi.c
mpiexec -n $NUMBER_OF_MPI_PROCESSES ./intelmpi
</code>
===MPICH===
**SRUN**
<code>
#!/bin/bash
#
#SBATCH -N 2
#SBATCH --ntasks-per-node 4
#SBATCH --ntasks-per-core 1

NUMBER_OF_MPI_PROCESSES=8

module purge
module load --auto mpich

mpicc -o mpich mpich.c
srun -n $NUMBER_OF_MPI_PROCESSES --cpu_bind=map_cpu:0,4,8,12 ./mpich
</code>

Note: The flag ''--auto'' loads all dependencies of the module.

**MPIEXEC**
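A minimal mpiexec sketch for MPICH, assuming the Hydra process manager (its ''-bind-to'' option controls the pinning) and the same ''mpich.c'' program as in the srun example above:
<code>
#!/bin/bash
#
#SBATCH -N 2
#SBATCH --ntasks-per-node 4
#SBATCH --ntasks-per-core 1

NUMBER_OF_MPI_PROCESSES=8

module purge
module load --auto mpich

mpicc -o mpich mpich.c
# bind each MPI process to one core
mpiexec -n $NUMBER_OF_MPI_PROCESSES -bind-to core ./mpich
</code>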
==== 3. Hybrid jobs ====

MPI (Message Passing Interface) is utilized for communication between processes across multiple nodes. Executing OpenMP on each respective node can be advantageous: typically only one or a few MPI processes are started per node, and each of them spawns several OpenMP threads that are pinned to the node's cores.
<code>
#!/bin/bash
#
#SBATCH -J mapCPU
#SBATCH -N 3
#SBATCH -n 3
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=3
#SBATCH --time=00:10:00

export I_MPI_DEBUG=1
NUMBER_OF_MPI_PROCESSES=3
export OMP_NUM_THREADS=3

module load intel/19.1.3
module load intel-mpi
mpiicc -qopenmp -o myprogram myprogram.c

# each MPI process is bound to cores 0-2 of its node (mask 0x7)
srun -n $NUMBER_OF_MPI_PROCESSES --cpu_bind=mask_cpu:0x7 ./myprogram
</code>
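To check where the threads of each MPI rank actually run, the OpenMP runtime can print its affinity decisions; a sketch (assuming an OpenMP 5.0 capable runtime for ''OMP_DISPLAY_AFFINITY''):
<code>
# print the binding of every OpenMP thread at program start
export OMP_DISPLAY_AFFINITY=TRUE
# Intel MPI prints its process pinning at higher debug levels
export I_MPI_DEBUG=4
srun -n $NUMBER_OF_MPI_PROCESSES --cpu_bind=mask_cpu:0x7 ./myprogram
</code>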