Our recommendation:
  - Use the **most recent version** of GROMACS that we provide or build your own.
  - Use the newest hardware: use **1 GPU** on one of the GPU partitions.
  - Do some **performance analysis** to decide if a single GPU node (likely) or multiple CPU nodes via MPI (unlikely) better suits your problem.

In most cases it does not make sense to run on multiple GPU nodes with MPI, whether using one or two GPUs per node.
===== CPU or GPU Partition? =====

First you have to decide on which hardware GROMACS should run; we call this a ''partition''.
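For a first overview you can, for example, list the partitions and their GPUs with SLURM's ''sinfo'' on any login node (a generic sketch; the format string is just one possible choice):

<code bash>
# list every partition with its node count and generic resources (GPUs)
sinfo --format="%P %D %G"
</code>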
===== Installations =====

Check on a login node which GROMACS installations are available. Because of the low efficiency of GROMACS on many nodes with many GPUs via MPI, we do not provide a variant that combines MPI and CUDA.
We provide the following GROMACS variants (''+feature'' means built with, ''~feature'' built without that feature):
==== GPU but no MPI ====

We recommend the GPU nodes; use the ''cuda-zen'' environment:
**cuda-zen**:
  * Gromacs +cuda ~mpi, all compiled with **GCC**

Since these installations are built without MPI, the executable is called ''gmx'' rather than ''gmx_mpi''.
==== MPI but no GPU ====

For GROMACS on CPU only but with MPI, use the ''zen'' or ''skylake'' environments:
**zen**:
  * Gromacs +openmpi +blas +lapack ~cuda, all compiled with **GCC**
  * Gromacs +openmpi +blas +lapack ~cuda, all compiled with **AOCC**

**skylake**:
  * Gromacs +**open**mpi +blas +lapack ~cuda, all compiled with **GCC**
  * Gromacs +**open**mpi +blas +lapack ~cuda, all compiled with **Intel**
  * Gromacs +**intel**mpi +blas +lapack ~cuda, all compiled with **GCC**
  * Gromacs +**intel**mpi +blas +lapack ~cuda, all compiled with **Intel**

In some of these packages there is no ''gmx'' binary; the MPI-enabled builds provide ''gmx_mpi'' instead.
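If you are unsure which launcher a given installation provides, you can load it and check; a small sketch (''gromacs'' stands for whatever concrete module name ''module avail'' lists):

<code bash>
module purge
module load gromacs     # replace with a concrete version from `module avail gromacs`
type gmx gmx_mpi        # shows which of the two executables is on the PATH
</code>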
===== Batch Script =====
  * clean modules: ''module purge''
  * load modules: ''module load gromacs''
  * starting the program in question: ''gmx mdrun''
<code bash mybatchscript.sh>
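#!/bin/bash
# Note: this SBATCH header is only a sketch and not part of the original example;
# partition, GPU count and time limit are placeholders you have to adapt.
#SBATCH --job-name=gromacs
#SBATCH --nodes=1
#SBATCH --partition=<gpu-partition>   # a GPU partition from `sinfo`
#SBATCH --gres=gpu:1                  # request one GPU
#SBATCH --time=01:00:00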
module purge
module load gromacs        # pick a specific version from `module avail gromacs`
gmx mdrun -s topol.tpr     # -s names the input run file (.tpr)
</code>
Submit the job with ''sbatch mybatchscript.sh''; it is put into the queue and executed automatically as soon as the requested resources are available.
==== CPU / GPU Load ====
There is a whole page dedicated to monitoring the CPU and GPU load of your jobs; for GROMACS, the sections on CPU load and GPU load are the relevant ones.
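As a quick manual check, independent of the monitoring page, you can log in to the node your job is running on and look at the utilisation yourself (a generic sketch):

<code bash>
squeue -u $USER    # shows the node(s) your job is running on
ssh <nodename>     # log in to that node while your job runs there
top -u $USER       # CPU load of your processes
nvidia-smi         # GPU utilisation and memory usage
</code>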
==== Short Example ====

As a short example we ran ''gmx mdrun'' with different options on one GPU node. We don't actually care about the simulation result here; we just want to know how many **ns/day** we can get. GROMACS prints this performance estimate at the end of every run.

The following table lists our 5 tests: Without any options GROMACS already runs fine (a). Setting the number of tasks (b) is not needed; if set wrong, it can even slow the calculation down significantly ( c ) due to over-provisioning! We would advise to enforce pinning; in our example it does not show any effect though (d), we assume that the tasks are already pinned automatically. The only further improvement we could get was with an option that puts more load on the GPU (e).
^ # ^ cmd ^ ns / day ^ cpu load / % ^ gpu load / % ^ notes ^
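For illustration, typical ''gmx mdrun'' options touched by such tests look like the following; these are standard GROMACS options, not necessarily the exact command lines behind the table above:

<code bash>
gmx mdrun -s topol.tpr                                # defaults: GROMACS decides on its own
gmx mdrun -s topol.tpr -ntomp 16                      # fix the number of OpenMP threads per rank
gmx mdrun -s topol.tpr -pin on                        # enforce thread pinning
gmx mdrun -s topol.tpr -nb gpu -pme gpu -update gpu   # shift more work onto the GPU
</code>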
==== 7 Test Cases ====

Since GROMACS performance depends strongly on the simulated system, we use several test cases of different sizes to benchmark various scenarios:
  - R-143a in hexane (20,248 atoms) with very high output rate
  - a short RNA piece with explicit water (31,889 atoms)
  - a protein inside a membrane surrounded by explicit water (80,289 atoms)
  - a VSC users test case (50,897 atoms)
  - a protein in explicit water (170,320 atoms)
  - a protein membrane channel with explicit water (615,924 atoms)
(Chart: GROMACS performance in ns/day for the test cases on 1 and 2 GPUs.)
In most cases 1 GPU is **better** than 2 GPUs!

In some cases, for example a large molecule like Test 7, you might want to run GROMACS on 2 GPUs.

To find out if more GPUs mean more work done we need some math: the parallel efficiency **η** is the ratio of the [[https://en.wikipedia.org/wiki/Speedup|speedup]] S(N) to the number of GPUs N:
η = S(N) / N
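For example, with made-up numbers: if a test case reaches 100 ns/day on 1 GPU and 150 ns/day on 2 GPUs, the speedup is S(2) = 150 / 100 = 1.5 and the parallel efficiency is η = 1.5 / 2 = 0.75, i.e. the two GPUs are used at only 75 % efficiency.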
In this chart we compare the parallel efficiency η of our test cases on 1 and 2 GPUs.
Set the number of GPUs on the node that are visible to GROMACS; one way to do this is sketched below.
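One way to do this, assumed here for illustration, is to restrict which GPUs the CUDA runtime exposes before starting ''mdrun'':

<code bash>
export CUDA_VISIBLE_DEVICES=0     # GROMACS sees only the first GPU
gmx mdrun -s topol.tpr

export CUDA_VISIBLE_DEVICES=0,1   # GROMACS sees both GPUs of the node
gmx mdrun -s topol.tpr
</code>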
In most cases one node is **better** than more nodes.
In some cases, for example a large molecule like Test 7, you might want to run GROMACS on multiple nodes in parallel using MPI, with multiple GPUs (one on each node). We strongly encourage you to test if you actually benefit from running with GPUs on many nodes; a job can be slower on many nodes in parallel than on a single one, even considerably!
Run GROMACS on multiple nodes with MPI; a sketch of such a job script follows.
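A minimal sketch of such a job script; the SBATCH values are placeholders, and an installation that provides ''gmx_mpi'' is assumed:

<code bash>
#!/bin/bash
#SBATCH --job-name=gromacs-mpi
#SBATCH --nodes=4                # number of nodes to test
#SBATCH --ntasks-per-node=1      # one MPI rank per node
#SBATCH --gres=gpu:1             # only useful with a build that has both MPI and CUDA

module purge
module load gromacs              # pick an MPI-enabled variant

srun gmx_mpi mdrun -s topol.tpr
</code>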
(Chart: GROMACS performance in ns/day for the test cases on an increasing number of nodes, up to 32. The updated series are Test 3: 94.069, 99.788, 97.9, 100.509, 95.666, 83.485 ns/day and Test 4: 115.179, 117.999, 115.028, 114.967, 103.8 ns/day with the last data point missing.)
+ | |||
+ | Note: the computation timed out for 4 with 32 nodes, before gromacs was able to estimate a performance. We can safely assume this example case is going to be less performant on 32 than on fewer nodes too. | ||
==== Many ranks on many nodes with many GPUs ====

  * Large problem: 8 ranks per node
If you want to run GROMACS with many ranks on many nodes, you have to tell MPI how many processes should be launched on each node; see the sketch below.
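With SLURM this is typically done with ''--ntasks-per-node''; other launchers have equivalent options (e.g. Open MPI's ''--map-by ppr:N:node''). A sketch with 8 ranks per node:

<code bash>
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8      # 8 MPI ranks on each node

srun gmx_mpi mdrun -s topol.tpr
</code>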
The reason for this is that the graphics card does more work than the CPU. GROMACS needs to balance and communicate between the different ranks on the CPUs and all the GPUs, and this takes more time with more ranks. In Test 1 with 16 ranks on 1 node, for example, a considerable share of the total runtime is spent on this overhead.
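GROMACS reports this breakdown itself: the ''md.log'' of every run ends with a cycle and time accounting table and the overall performance. A quick way to inspect it (a generic sketch):

<code bash>
# print the time-accounting breakdown and the final ns/day estimate
grep -A 30 "C Y C L E" md.log
grep "Performance:" md.log
</code>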
(Chart: GROMACS performance for different numbers of ranks per node.)
===== Links =====

The benchmarks are based on three articles by NHR@FAU, featuring in-depth analyses of GROMACS performance on various GPU systems, multi-GPU setups, and comparisons with CPU-only runs:

https://

https://

https://