====== GROMACS ======

Our recommendation:
  - Use the **most recent version** of GROMACS that we provide or build your own.
  - Use the newest hardware: the GPU partitions on VSC-5.
  - Do some **performance analysis** to decide whether a single GPU node (likely) or multiple GPU nodes (unlikely) are the more efficient choice for your case.

In most cases it does not make sense to run on multiple GPU nodes with MPI, whether using one or two GPUs per node.

===== CPU or GPU Partition? =====

First you have to decide on which hardware GROMACS should run; we call this a ''partition'', described in detail at [[doku:slurm|SLURM]]. The partition (and matching QOS) has to be set in the batch script, see the example below.

===== Installations =====

Type ''spack find -l gromacs'' to list all installed GROMACS packages.

Because of the low efficiency of GROMACS on many nodes with many GPUs via MPI, we do not provide a variant with both CUDA and MPI support.

We provide the following GROMACS variants:

==== GPU but no MPI ====

We recommend the GPU nodes; use the ''cuda-zen'' spack environment:

**cuda-zen**:
  * Gromacs +cuda ~mpi, all compiled with **GCC**

Since these packages are built without MPI, there is no ''gmx_mpi'' binary; start GROMACS with ''gmx mdrun''.

==== MPI but no GPU ====

For GROMACS with MPI, but without GPU support, use the ''zen'' or ''skylake'' spack environments:

**zen**:
  * Gromacs +openmpi +blas +lapack ~cuda, all compiled with **GCC**
  * Gromacs +openmpi +blas +lapack ~cuda, all compiled with **AOCC**

**skylake**:
  * Gromacs +**open**mpi +blas +lapack ~cuda, all compiled with **GCC**
  * Gromacs +**open**mpi +blas +lapack ~cuda, all compiled with **Intel**
  * Gromacs +**intel**mpi +blas +lapack ~cuda, all compiled with **GCC**
  * Gromacs +**intel**mpi +blas +lapack ~cuda, all compiled with **Intel**

In some of these packages there is no ''gmx'' binary, only ''gmx_mpi''; start GROMACS with ''gmx_mpi mdrun'' then.
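
A minimal sketch of how to find and load one of these packages; the exact package names, versions and hashes differ between the spack environments, so check the output of ''spack find'' first:

<code bash>
# list all installed GROMACS packages with their hashes
spack find -l gromacs

# list the matching environment modules
module avail gromacs

# load one of them (name shortened here, pick a full name from the list above)
module load gromacs/2022.2-gcc-9.5.0-...

# MPI variants provide gmx_mpi instead of gmx
which gmx gmx_mpi
</code>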

===== Batch Script =====

A simple batch script contains:

  * some SLURM parameters: the ''#SBATCH'' lines
  * set environment variables: ''unset OMP_NUM_THREADS''
  * clean modules: ''module purge''
  * load modules: ''module load gromacs/...''
  * start the program in question: ''gmx mdrun -s topol.tpr''

<code bash mybatchscript.sh>
#!/bin/bash
#SBATCH --job-name=myname
#SBATCH --partition=zen2_0256_a40x2
#SBATCH --qos=zen2_0256_a40x2
#SBATCH --gres=gpu:1

unset OMP_NUM_THREADS

module purge
module load gromacs/2022.2-gcc-9.5.0-...

gmx mdrun -s topol.tpr
</code>

Type ''sbatch mybatchscript.sh'' to submit the batch script to [[doku:slurm|SLURM]].
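
After submission, the usual SLURM commands can be used to keep an eye on the job; a short sketch (''<jobid>'' is the number printed by ''sbatch''):

<code bash>
# submit the batch script
sbatch mybatchscript.sh

# check the state of your jobs in the queue
squeue --user $USER

# follow the GROMACS output while the job runs (default SLURM output file)
tail -f slurm-<jobid>.out
</code>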

===== Performance =====

==== CPU / GPU Load ====

There is a whole page dedicated to [[doku:monitoring|monitoring]] the CPU and GPU load of your jobs; for GROMACS the relevant sections are those on CPU and GPU load.
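
A quick way to check the load yourself, assuming you have a shell on the compute node on which your job is running:

<code bash>
# GPU utilisation, refreshed every 5 seconds
nvidia-smi -l 5

# CPU utilisation of your own processes
top -u $USER
</code>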

==== Short Example ====

As a short example we ran ''gmx mdrun'' with different options on one GPU node and compared the performance. Without any options GROMACS already runs fine (a). Setting the number of OpenMP threads (b) is not needed; a wrong value (c) can even slow the calculation down significantly due to over-provisioning. Pinning (d) showed no effect in our example; we assume the tasks are already pinned automatically. The only further improvement came from also offloading the update step to the GPU with ''-update gpu'' (e).

The following table lists our 5 tests:

^ # ^ cmd ^ ns / day ^ cpu load / % ^ gpu load / % ^ notes ^
| a | -- | 160 | 100 | 80 | |
| b | ''-ntomp 8'' | 160 | 100 | 80 | |
| c | ''-ntomp 16'' | 140 | 40 | 70 | GROMACS warning: over-provisioning |
| d | ''-pin on'' | 160 | 100 | 80 | |
| e | ''-update gpu'' | 170 | 100 | 90 | |
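
Based on row (e) above, a minimal run command for a single GPU node might look like this; ''topol.tpr'' stands in for your own input file:

<code bash>
# also offload the update step to the GPU (test e above)
gmx mdrun -s topol.tpr -update gpu
</code>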

==== 7 Test Cases ====

Since GROMACS is used in many and very different ways, it makes sense to benchmark various scenarios:

  - R-143a in hexane (20,248 atoms)
  - a short RNA piece with explicit water (31,889 atoms)
  - a protein inside a membrane surrounded by explicit water (80,289 atoms)
  - a VSC user's test case (50,897 atoms)
  - a protein in explicit water (170,320 atoms)
  - a protein membrane channel with explicit water (615,924 atoms)
  - a huge virus protein (1,066,628 atoms)

Take a look at the test case that most resembles your own application.

The following table lists the performance of our various hardware for the 7 test cases, from recent GPU nodes down to CPU-only runs (the leading number is the number of GPUs used, ''0x'' means CPU only); all values are in ns / day:

^ Hardware ^ Test 1 ^ Test 2 ^ Test 3 ^ Test 4 ^ Test 5 ^ Test 6 ^ Test 7 ^
| 1x A40 | 191 | 525 | 205 | 463 | 168 | 9.6 | 27.2 |
| 1x RTX2080TI | 144 | 442 | 143 | 333 | 139 | 8.1 | 13 |
| 1x A100 | 128 | 449 | 164 | 273 | 162 | 16 | 25 |
| 4x GTX1080 M | 125 | 455 | 130 | 246 | 147 | 8.4 | 21.8 |
| 2x A40 | 145 | 471 | 113 | 229 | 131 | 9.9 | 1.4 |
| 8x GTX1080 M | 127 | 317 | 164 | 276 | 174 | 7.3 | 24.6 |
| 2x A100 | 92 | 228 | 103 | 103 | 94 | 12 | 18 |
| 2x GTX1080 M | 62 | 228 | 66 | 165 | 61 | 4.3 | 8.6 |
| 1x GTX1080 M | 57 | 207 | 68 | 170 | 59 | 3.1 | 8 |
| 1x GTX1080 S | 60 | 193 | 58 | 158 | 58 | 3.1 | 7.6 |
| 0x A100 | 57 | 152 | 48 | 143 | 43 | 4.6 | 8 |
| 0x GTX1080 M | 29 | 73 | 24 | 69 | 18 | 1.7 | 3.1 |
| 0x A40 | 28 | 74 | 25 | 67 | 18 | 1.7 | 3.1 |
| 1x K20M | 27 | 61 | 23 | 54 | 22 | 1.6 | 3 |
| 0x K20M | 17 | 46 | 14 | 40 | 10 | 1 | 1.7 |
| 0x GTX1080 | 7.4 | 18 | 6.2 | 16 | 5.2 | 0.4 | 0.7 |
| 0x RTX2080TI | 7.4 | 18 | 6 | 16 | 5 | 0.4 | 0.7 |

==== Many GPUs ====

In most cases 1 GPU is **better** than 2 GPUs!

In some cases, for example with a very large system, 2 GPUs might still be faster; test this for your own application.

To find out if more GPUs mean more work done we need some math: the parallel efficiency **η** is the ratio of the [[https://en.wikipedia.org/wiki/Speedup|speedup]] S(N) on N GPUs to the number of GPUs N:

η = S(N) / N
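
As a worked example with made-up numbers: if one GPU reaches 100 ns/day and two GPUs reach 130 ns/day, the speedup is S(2) = 130 / 100 = 1.3 and the parallel efficiency is η = 1.3 / 2 = 0.65. A small shell snippet to compute this from your own results:

<code bash>
# ns/day with 1 GPU and with 2 GPUs (hypothetical numbers, replace with your own)
ns_day_1gpu=100
ns_day_2gpu=130

# speedup S(2) and parallel efficiency eta = S(2) / 2
awk -v one="$ns_day_1gpu" -v two="$ns_day_2gpu" \
    'BEGIN { s = two / one; printf "S(2) = %.2f, eta = %.2f\n", s, s / 2 }'
</code>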

In the following table we compare the GROMACS parallel efficiency **η** of the 7 test cases with two GPUs versus one GPU on the VSC-5 ''A40'' and ''A100'' nodes.

Set the number of GPUs on the node that are visible to GROMACS with ''CUDA_VISIBLE_DEVICES''.
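
A minimal sketch; the GPU IDs ''0'' and ''1'' refer to the two GPUs of a VSC-5 GPU node:

<code bash>
# use only the first GPU of the node
export CUDA_VISIBLE_DEVICES=0
gmx mdrun -s topol.tpr

# use both GPUs of the node
export CUDA_VISIBLE_DEVICES=0,1
gmx mdrun -s topol.tpr
</code>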

^ GPUs ^ Test 1 ^ Test 2 ^ Test 3 ^ Test 4 ^ Test 5 ^ Test 6 ^ Test 7 ^
| 2x A40 | 0.38 | 0.45 | 0.28 | 0.25 | 0.39 | 0.52 | 0.03 |
| 2x A100 | 0.36 | 0.25 | 0.31 | 0.19 | 0.29 | 0.38 | 0.36 |

==== Many nodes with many GPUs ====

In most cases one node is **better** than more nodes.

In some cases, for example a large molecule like Test 7, you might want to run GROMACS on multiple nodes in parallel using MPI, with multiple GPUs (one on each node). We strongly encourage you to test whether you actually benefit from more nodes before starting long production runs.

Run GROMACS on multiple nodes with:
<code bash>
#SBATCH --nodes 2

mpirun gmx_mpi mdrun ...
</code>

Take a look at the chapter [[doku:gromacs#many_ranks_on_many_nodes_with_many_gpus|Many ranks on many nodes with many GPUs]] below before scaling out to many nodes.
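
A more complete sketch of a multi-node batch script; the ''--map-by ppr:1:node'' option assumes Open MPI and places exactly one rank on each node:

<code bash>
#!/bin/bash
#SBATCH --job-name=gromacs-multinode
#SBATCH --nodes=2
#SBATCH --gres=gpu:1

module purge
module load gromacs/...   # an MPI-enabled variant, see Installations above

# 2 nodes, 1 MPI rank per node, 1 GPU per node
mpirun -np 2 --map-by ppr:1:node gmx_mpi mdrun -s topol.tpr
</code>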

GROMACS performance for the 7 test cases on 1 to 32 nodes, in ns / day:

^ Nodes ^ Test 1 ^ Test 2 ^ Test 3 ^ Test 4 ^ Test 5 ^ Test 6 ^ Test 7 ^
| 1 Node | 42.374 | 82.513 | 94.069 | 115.179 | 67.147 | 10.612 | 17.92 |
| 2 Nodes | 40.176 | 81.25 | 99.788 | 117.999 | 76.027 | 11.963 | 21.604 |
| 4 Nodes | 39.439 | 84.805 | 97.9 | 115.028 | 80.627 | 10.996 | 30.482 |
| 8 Nodes | 38.252 | 81.894 | 100.509 | 114.967 | 80.903 | 14.37 | 37.497 |
| 16 Nodes | 35.744 | 72.589 | 95.666 | 103.8 | 83.031 | 35.482 | 35.448 |
| 32 Nodes | 30.811 | 62.855 | 83.485 | 0 | 68.702 | 34.988 | 43.254 |

Note: with 32 nodes the computation for Test 4 timed out before GROMACS was able to estimate a performance (hence the 0). We can safely assume that this case is also less performant on 32 nodes than on fewer nodes.

==== Many ranks on many nodes with many GPUs ====

Quick summary:
  * Most (small) problems: 1 or 2 ranks per node
  * Large problems: up to 8 ranks per node

If you want to run GROMACS on multiple nodes and multiple GPUs in parallel using MPI, it is best to tell MPI how many processes should be launched on each node, and to test this yourself with your specific application. Based on our tests listed in the following table we recommend 1 rank per node for most (small) problems, and up to 8 ranks per node only for large problems:

<code bash>
#SBATCH --nodes 2

# 2 nodes x 8 ranks per node = 16 MPI ranks in total
mpirun -np 16 \
    gmx_mpi mdrun ...
</code>
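
To pin down the number of ranks per node explicitly, Open MPI (an assumption; adapt for other MPI implementations) offers the ''--map-by'' option:

<code bash>
#SBATCH --nodes 2

# place exactly 8 ranks on each of the 2 nodes
mpirun -np 16 --map-by ppr:8:node \
    gmx_mpi mdrun ...
</code>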

The reason for this is that the graphics card does more work than the CPU: GROMACS needs to copy data between the different ranks on the CPUs and all GPUs, which takes more time with more ranks. GROMACS notices this and prints a corresponding note in its log output.

GROMACS performance for the 7 test cases with different numbers of MPI ranks per node, in ns / day:

^ Ranks ^ Test 1 ^ Test 2 ^ Test 3 ^ Test 4 ^ Test 5 ^ Test 6 ^ Test 7 ^
| 1 Rank | 43.644 | 390.057 | 82.997 | 144.859 | 30.174 | 18.282 | 26.499 |
| 2 Ranks | 46.385 | 138.831 | 39.682 | 52.099 | 35.561 | 10.061 | 14.855 |
| 4 Ranks | 32.454 | 89.078 | 33.176 | 35.469 | 39.051 | 15.62 | 22.433 |
| 8 Ranks | 37.333 | 78.769 | 80.643 | 96.125 | 68.824 | 20.889 | 26.672 |
| 16 Ranks | 19.084 | 39.94 | 48.766 | 55.373 | 39.012 | 17.528 | 21.686 |
| 28* Ranks | 16.136 | 35.99 | 29.216 | 32.502 | 34.442 | 16.452 | 19.323 |
| 64 Ranks | 4.824 | 9.545 | 13.972 | 14.864 | 10.475 | 7.534 | 7.879 |

===== Links =====

The benchmarks are based on three articles by NHR@FAU, featuring in-depth analyses of GROMACS performance on various GPU systems, multi-GPU setups and comparisons with CPUs:

https://...

https://...

https://...