====== GROMACS ======

Our recommendation:

  - Use the newest hardware: use **1 GPU** on a GPU partition, e.g. ''zen2_0256_a40x2''.
  - Do some **performance analysis** to decide whether a single GPU node (likely) or multiple CPU nodes via MPI (unlikely) better suit your problem.

In most cases it does not make sense to run on multiple GPU nodes with MPI, whether using one or two GPUs per node.

===== CPU or GPU Partition? =====

First you have to decide on which hardware GROMACS should run; we call this a ''partition''.

===== Installations =====

List the installed GROMACS modules to pick a suitable variant.

Because of the low efficiency of GROMACS on many nodes with many GPUs via MPI, we do not provide a variant with both MPI and CUDA support.

We provide the following GROMACS variants:

==== GPU but no MPI ====

We recommend the GPU nodes; use the ''cuda-zen'' environment:

**cuda-zen**:
  * Gromacs +cuda ~mpi, all compiled with **GCC**

Since the ''cuda-zen'' variant is built with CUDA but without MPI (''+cuda ~mpi''), it is meant for runs on a single GPU node.
==== MPI but no GPU ====

For GROMACS on CPU only, but with MPI, use the ''zen'' or ''skylake'' environments:

**zen**:
  * Gromacs +openmpi +blas +lapack ~cuda, all compiled with **GCC**
  * Gromacs +openmpi +blas +lapack ~cuda, all compiled with **AOCC**

**skylake**:
  * Gromacs +**open**mpi +blas +lapack ~cuda, all compiled with **GCC**
  * Gromacs +**open**mpi +blas +lapack ~cuda, all compiled with **Intel**
  * Gromacs +**intel**mpi +blas +lapack ~cuda, all compiled with **GCC**
  * Gromacs +**intel**mpi +blas +lapack ~cuda, all compiled with **Intel**

In some of these packages there is no plain ''gmx'' binary; the MPI-enabled binary is called ''gmx_mpi''.
===== Batch Script =====

Write a job script for ''sbatch'' that includes:

  * some SLURM parameters: the ''#SBATCH'' options
  * exported environment variables: e.g. ''CUDA_VISIBLE_DEVICES''
  * cleaning the environment: ''module purge''
  * loading modules: ''module load gromacs/...''
  * starting the program in question: ''gmx mdrun''

<code bash mybatchscript.sh>
#!/bin/bash
#SBATCH --partition=zen2_0256_a40x2

unset OMP_NUM_THREADS
export CUDA_VISIBLE_DEVICES=0

module purge
module load gromacs/2022.2-gcc-9.5.0-...

gmx mdrun -s topol.tpr
</code>

Submit the job with ''sbatch mybatchscript.sh''.
===== Performance =====

==== CPU / GPU Load ====

There is a whole page dedicated to monitoring the CPU and GPU load of your jobs.

==== Short Example ====

As a short example we ran ''gmx mdrun'' with various option combinations on a GPU node. The following table lists our 5 tests: without any options GROMACS already runs fine (a). Setting the number of tasks (b) is not needed; if set wrong, it can even slow the calculation down significantly (c) due to over-provisioning! We advise enforcing pinning; in our example it does not show any effect (d), so we assume the tasks are already pinned automatically. The only further improvement we could get was with the options in (e).

^ # ^ cmd ^ ns / day ^ CPU load / % ^ GPU load / % ^ notes ^
| a | ''gmx mdrun'' | | | | defaults already run fine |
| b | | | | | number of tasks set explicitly |
| c | | | | | number of tasks set wrong; over-provisioning |
| d | | | | | pinning enforced; no visible effect |
| e | | | | | best performance |
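As a sketch only, the variants described above map to standard ''gmx mdrun'' flags roughly as follows; the exact commands and thread counts of our tests are not reproduced here, so the numbers below are placeholders, not recommendations:

```shell
# Hypothetical sketches of the tested variants (flags are standard gmx mdrun options):
gmx mdrun -s topol.tpr                # (a) defaults: usually already fine
gmx mdrun -s topol.tpr -ntomp 16      # (b) thread count set explicitly (not needed)
gmx mdrun -s topol.tpr -ntomp 128     # (c) set too high: over-provisioning slows it down
gmx mdrun -s topol.tpr -pin on        # (d) enforce pinning
```

Check ''gmx mdrun -h'' for the full list of performance-related options.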
==== 7 Test Cases ====

Since GROMACS is used in many and very different ways, it makes sense to benchmark various scenarios:

  - R-143a in hexane (20,248 atoms) with very high output rate
  - a short RNA piece with explicit water (31,889 atoms)
  - a protein inside a membrane surrounded by explicit water (80,289 atoms)
  - a VSC user's test case (50,897 atoms)
  - a protein in explicit water (170,320 atoms)
  - a protein membrane channel with explicit water (615,924 atoms)
  - a huge virus protein (1,066,628 atoms)

Take a look at the test results of the case that most resembles your application.

In this chart we tested our various hardware on the 7 test cases; recent GPUs like the ''A100'' and ''A40'' perform best:

<achart>
{
  series: [{
    name: 'Test 1',
    data: [191, 144, 128, 125, 145, 127, 92, 62, 57, 60, 57, 29, 28, 27, 17, 7.4, 7.4]
  }, {
    name: 'Test 2',
    data: [525, 442, 449, 455, 471, 317, 228, 228, 207, 193, 152, 73, 74, 61, 46, 18, 18]
  }, {
    name: 'Test 3',
    data: [205, 143, 164, 130, 113, 164, 103, 66, 68, 58, 48, 24, 25, 23, 14, 6.2, 6]
  }, {
    name: 'Test 4',
    data: [463, 333, 273, 246, 229, 276, 103, 165, 170, 158, 143, 69, 67, 54, 40, 16, 16]
  }, {
    name: 'Test 5',
    data: [168, 139, 162, 147, 131, 174, 94, 61, 59, 58, 43, 18, 18, 22, 10, 5.2, 5]
  }, {
    name: 'Test 6',
    data: [9.6, 8.1, 16, 8.4, 9.9, 7.3, 12, 4.3, 3.1, 3.1, 4.6, 1.7, 1.7, 1.6, 1, 0.4, 0.4]
  }, {
    name: 'Test 7',
    data: [27.2, 13, 25, 21.8, 1.4, 24.6, 18, 8.6, 8, 7.6, 8, 3.1, 3.1, 3, 1.7, 0.7, 0.7]
  }],
  chart: {
    type: 'bar',
    height: 350,
    stacked: true,
  },
  plotOptions: {
    bar: {
      horizontal: true,
    },
  },
  title: {
    text: 'GROMACS performance on various hardware',
  },
  xaxis: {
    categories: [
      "1x A40",
      "1x RTX2080TI",
      "1x A100",
      "4x GTX1080 M",
      "2x A40",
      "8x GTX1080 M",
      "2x A100",
      "2x GTX1080 M",
      "1x GTX1080 M",
      "1x GTX1080 S",
      "0x A100",
      "0x GTX1080 M",
      "0x A40",
      "1x K20M",
      "0x K20M",
      "0x GTX1080 S",
      "0x RTX2080TI",
    ],
    title: {
      text: "ns / day",
    },
  },
  legend: {
    position: 'right',
    horizontalAlign: 'left',
    title: {
      text: "Test #"
    },
  }
}
</achart>

==== Many GPUs ====

In most cases 1 GPU is **better** than 2 GPUs!

In some cases, for example a large molecule like test 7, you might want to run GROMACS on both GPUs of a node. We strongly encourage you to test whether you actually benefit from the second GPU.

To find out whether more GPUs mean more work done we need some math: the parallel efficiency **η** is the ratio of the [[https://en.wikipedia.org/wiki/Speedup|speedup]] factor **S(N)** to the number of processors **N**:

η = S(N) / N

In this chart we compare the parallel efficiency **η** of the 7 test cases with two GPUs versus one GPU on VSC-5 ''A40'' and ''A100'' nodes. An efficiency of 0.5 means two GPUs are only as fast as one; most measured values are at or below 0.5, so the second GPU rarely pays off.

Set the number of GPUs visible to GROMACS on a node with ''CUDA_VISIBLE_DEVICES''.

<achart>
{
  series: [{
    name: '2x A40',
    data: [0.38, 0.45, 0.28, 0.25, 0.39, 0.52, 0.03]
  }, {
    name: '2x A100',
    data: [0.36, 0.25, 0.31, 0.19, 0.29, 0.38, 0.36]
  }],
  chart: {
    type: 'bar',
    height: 350,
  },
  title: {
    text: 'Parallel efficiency: 2 GPUs versus 1 GPU',
  },
  xaxis: {
    categories: [
      "Test 1",
      "Test 2",
      "Test 3",
      "Test 4",
      "Test 5",
      "Test 6",
      "Test 7",
    ],
  },
  yaxis: {
    title: {
      text: "parallel efficiency",
    },
  },
  legend: {
    position: 'right',
    horizontalAlign: 'left',
  }
}
</achart>

==== Many nodes with many GPUs ====

In most cases one node is **better** than more nodes.

In some cases, for example a large molecule like test 7, you might want to run GROMACS on multiple nodes in parallel using MPI, with one GPU on each node. We strongly encourage you to test whether you actually benefit from running with GPUs on many nodes: GROMACS can perform considerably worse on many nodes in parallel than on a single one!

Run GROMACS on multiple nodes with:

<code bash>
#SBATCH --nodes 2
gmx mdrun ...
</code>

The following chart shows how the 7 test cases scale with the number of nodes.
<achart>
{
  series: [{
    name: 'Test 1',
    data: [ 42.374, 40.176, 39.439, 38.252, 35.744, 30.811 ]
  }, {
    name: 'Test 2',
    data: [ 82.513, 81.25, 84.805, 81.894, 72.589, 62.855 ]
  }, {
    name: 'Test 3',
    data: [ 94.069, 99.788, 97.9, 100.509, 95.666, 83.485 ]
  }, {
    name: 'Test 4',
    data: [ 115.179, 117.999, 115.028, 114.967, 103.8, 0 ]
  }, {
    name: 'Test 5',
    data: [ 67.147, 76.027, 80.627, 80.903, 83.031, 68.702 ]
  }, {
    name: 'Test 6',
    data: [ 10.612, 11.963, 10.996, 14.37, 35.482, 34.988 ]
  }, {
    name: 'Test 7',
    data: [ 17.92, 21.604, 30.482, 37.497, 35.448, 43.254 ]
  }],
  chart: {
    type: 'bar',
    height: 350,
    stacked: true,
  },
  plotOptions: {
    bar: {
      horizontal: true,
    },
  },
  title: {
    text: 'GROMACS performance on multiple nodes',
  },
  xaxis: {
    categories: [
      "1 Node",
      "2 Nodes",
      "4 Nodes",
      "8 Nodes",
      "16 Nodes",
      "32 Nodes",
    ],
    title: {
      text: "ns / day",
    },
  },
  legend: {
    position: 'right',
    horizontalAlign: 'left',
    title: {
      text: "Test #"
    },
  }
}
</achart>

Note: the computation for test 4 on 32 nodes timed out before GROMACS could estimate its performance. We can safely assume that this case, too, performs worse on 32 nodes than on fewer.

==== Many ranks on many nodes with many GPUs ====

Quick summary:
  * use **1 rank per node** for most (small) problems
  * use up to **8 ranks per node** only for large problems

If you want to run GROMACS on multiple nodes and multiple GPUs in parallel using MPI, it is best to tell MPI explicitly how many processes should be launched on each node. Test this yourself with your specific application. Based on our tests listed in the following chart we recommend 1 rank per node for most (small) problems, and up to 8 ranks per node only for large problems:

<code bash>
#SBATCH --nodes 2
mpirun -np 16 \
    gmx_mpi mdrun ...
</code>

The reason for this is that the graphics card does more work than the CPU. GROMACS needs to copy data between the different ranks on the CPUs and all GPUs, which takes more time with more ranks. GROMACS notices that and reports the time spent waiting, e.g. as ''Wait GPU state copy'' in the log.

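The rank arithmetic behind the recommendation above can be sketched directly in the batch script; the numbers here are examples, not tuned values:

```shell
# Total MPI ranks = nodes x ranks per node (example numbers, tune for your case).
NODES=2
RANKS_PER_NODE=8    # 1 for most problems, up to 8 for large ones
NP=$((NODES * RANKS_PER_NODE))
echo "launching $NP ranks"    # prints: launching 16 ranks
```

Passing the computed value as ''mpirun -np $NP'' keeps the script consistent when you change the number of nodes.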
<achart>
{
  series: [{
    name: 'Test 1',
    data: [ 43.644, 46.385, 32.454, 37.333, 19.084, 16.136, 4.824 ]
  }, {
    name: 'Test 2',
    data: [ 390.057, 138.831, 89.078, 78.769, 39.94, 35.99, 9.545 ]
  }, {
    name: 'Test 3',
    data: [ 82.997, 39.682, 33.176, 80.643, 48.766, 29.216, 13.972 ]
  }, {
    name: 'Test 4',
    data: [ 144.859, 52.099, 35.469, 96.125, 55.373, 32.502, 14.864 ]
  }, {
    name: 'Test 5',
    data: [ 30.174, 35.561, 39.051, 68.824, 39.012, 34.442, 10.475 ]
  }, {
    name: 'Test 6',
    data: [ 18.282, 10.061, 15.62, 20.889, 17.528, 16.452, 7.534 ]
  }, {
    name: 'Test 7',
    data: [ 26.499, 14.855, 22.433, 26.672, 21.686, 19.323, 7.879 ]
  }],
  chart: {
    type: 'bar',
    height: 350,
    stacked: true,
  },
  plotOptions: {
    bar: {
      horizontal: true,
    },
  },
  title: {
    text: 'GROMACS performance with various MPI ranks per node',
  },
  xaxis: {
    categories: [
      "1 Rank",
      "2 Ranks",
      "4 Ranks",
      "8 Ranks",
      "16 Ranks",
      "28* Ranks",
      "64 Ranks",
    ],
    title: {
      text: "ns / day",
    },
  },
  legend: {
    position: 'right',
    horizontalAlign: 'left',
    title: {
      text: "Test #"
    },
  }
}
</achart>

===== Links =====

The benchmarks are based on three articles by NHR@FAU, featuring in-depth analyses of GROMACS performance on various GPU systems, multi-GPU setups, and comparisons with CPUs:

https://

https://

https://