====== GROMACS ======

Our recommendation:
  - Use the most **recent version** of GROMACS that we provide or build your own.
  - Use the newest **hardware**: the partitions ''...''
  - Do some **performance analysis** to decide if a single GPU node (likely) or multiple CPU nodes via MPI (unlikely) better suits your problem.
In most cases it does not make sense to run on multiple GPU nodes with MPI; whether that holds for your problem is best checked with a short test run.
===== GPU Partition =====

GPU nodes have to be requested via [[doku:slurm|SLURM]]. On any login node, type ''sinfo'' to get a list of the available partitions, or take a look at [[doku:...]]. The partition has to be set in the batch script, see the example below. Be aware that each partition has different hardware, so choose the parameters accordingly. GROMACS decides mostly on its own how it wants to run.
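
For example, a quick way to list the partitions and the GPUs they offer; the ''sinfo'' format string is our choice, not a site requirement:

<code bash>
# one line per partition: name, node count, generic resources (GPUs)
sinfo -o "%P %D %G"
</code>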

A batch script for GROMACS typically contains:
  * some SLURM parameters: the ''#SBATCH'' lines
  * export environment variables: e.g. ''CUDA_VISIBLE_DEVICES''
  * clean modules: ''module purge''
  * load modules: ''module load gromacs/2022.2-gcc-9.5.0-...''
  * starting the program in question: ''gmx_mpi mdrun''
<code bash mybatchscript.sh>
#!/bin/bash
#SBATCH --partition=...

module purge
module load gromacs/2022.2-gcc-9.5.0-...

gmx_mpi mdrun -s topol.tpr
</code>
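
Submit the script and check on the job with the standard SLURM commands:

<code bash>
sbatch mybatchscript.sh   # submit the job
squeue -u $USER           # list your jobs and their state
</code>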

===== Performance =====

==== CPU / GPU Load ====
There is a whole page dedicated to [[doku:...|monitoring]] the CPU and GPU load.

==== Short Example ====

As a short example we ran ''gmx_mpi mdrun -s topol.tpr'' with different options (see the table below). The simulation was only run for a short time; we don't actually care about the result, we just want to know how many **ns/day** we can get. GROMACS reports that at the end of every run, so such a short test can be done in no time.

The following table lists our 5 tests: without any options GROMACS already runs fine (a). Setting the number of tasks (b) is not needed; if set wrong it can even slow the calculation down significantly (c) due to over-provisioning! We would advise to pin the tasks with ''-pin on''; in this example it does not show any effect though (d), we assume that the tasks are pinned automatically already. The only further improvement came from offloading the update step to the GPU with ''-update gpu'' (e).
^ # ^ cmd ^ ns / day ^ cpu load / % ^ gpu load / % ^ notes ^
| a | ''--'' | 160 | 100 | 80 | |
| b | ''-ntomp 8'' | 160 | 100 | 80 | |
| c | ''-ntomp 16'' | 140 | 40 | 70 | GROMACS warning: over-provisioning |
| d | ''-pin on'' | 160 | 100 | 80 | |
| e | ''-update gpu'' | 170 | 100 | 90 | |
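
For reference, run (e) corresponds to a command along these lines; the ''-nsteps'' limit is our addition to keep the benchmark short:

<code bash>
# short benchmark; GROMACS prints "Performance: ... ns/day" at the end of the log
gmx_mpi mdrun -s topol.tpr -pin on -update gpu -nsteps 10000
</code>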


==== 7 Test Cases ====

Since GROMACS is used in many very different ways, it makes sense to benchmark various scenarios:

  - a VSC user's test case (??? atoms)
  - R-143a in hexane (20,248 atoms) with a very high output rate
  - a short RNA piece with explicit water (31,889 atoms)
  - a protein inside a membrane surrounded by explicit water (80,289 atoms)
  - a protein in explicit water (170,320 atoms)
  - a protein membrane channel with explicit water (615,924 atoms)
  - a huge virus protein (1,066,628 atoms)

Take a look at the test results for the case that most resembles your application.

In this chart we tested our various hardware on the 7 test cases; some recent GPUs like the ''A100'' and ''A40'' are clearly faster than the older models. The prefix gives the number of GPUs used, so ''0x'' means a CPU-only run on the respective node type.

<apexchart>
{
  series: [{
    name: 'Test 1',
    data: [191, 144, 128, 125, 145, 127, 92, 62, 57, 60, 57, 29, 28, 27, 17, 7.4, 7.4]
  }, {
    name: 'Test 2',
    data: [525, 442, 449, 455, 471, 317, 228, 228, 207, 193, 152, 73, 74, 61, 46, 18, 18]
  }, {
    name: 'Test 3',
    data: [205, 143, 164, 130, 113, 164, 103, 66, 68, 58, 48, 24, 25, 23, 14, 6.2, 6]
  }, {
    name: 'Test 4',
    data: [463, 333, 273, 246, 229, 276, 103, 165, 170, 158, 143, 69, 67, 54, 40, 16, 16]
  }, {
    name: 'Test 5',
    data: [168, 139, 162, 147, 131, 174, 94, 61, 59, 58, 43, 18, 18, 22, 10, 5.2, 5]
  }, {
    name: 'Test 6',
    data: [9.6, 8.1, 16, 8.4, 9.9, 7.3, 12, 4.3, 3.1, 3.1, 4.6, 1.7, 1.7, 1.6, 1, 0.4, 0.4]
  }, {
    name: 'Test 7',
    data: [27.2, 13, 25, 21.8, 1.4, 24.6, 18, 8.6, 8, 7.6, 8, 3.1, 3.1, 3, 1.7, 0.7, 0.7]
  }],
  chart: {
    type: 'bar',
    height: 350,
    stacked: true,
  },
  plotOptions: {
    bar: {
      horizontal: true,
    },
  },
  title: {
    text: 'GROMACS performance: 7 test cases on various hardware',
  },
  xaxis: {
    categories: [
      "1x A40",
      "1x RTX2080TI",
      "1x A100",
      "4x GTX1080 M",
      "2x A40",
      "8x GTX1080 M",
      "2x A100",
      "2x GTX1080 M",
      "1x GTX1080 M",
      "1x GTX1080 S",
      "0x A100",
      "0x GTX1080 M",
      "0x A40",
      "1x K20M",
      "0x K20M",
      "0x GTX1080 S",
      "0x RTX2080TI",
    ],
    title: {
      text: "ns / day"
    },
  },
  legend: {
    position: 'bottom',
    horizontalAlign: 'left',
    title: {
      text: "Test #"
    },
  }
}
</apexchart>


==== Many GPUs ====

In most cases 1 GPU is **better** than 2 GPUs!

In some cases, for example a large molecule like Test 7, you might want to run GROMACS on both GPUs. We strongly encourage you to test if you actually benefit from the second GPU.

To find out if more GPUs mean more work done we need some math: the parallel efficiency **η** is the ratio of the [[https://en.wikipedia.org/wiki/Speedup|speedup]] S(N) to the number of processors N:

η = S(N) / N

For example, η = 0.5 with two GPUs means S(2) = 1: the two GPUs together get exactly as much done as a single one.

In this chart we compare the GROMACS parallel efficiency **η** of the 7 test cases with two GPUs versus one GPU on the VSC-5 ''A40'' and ''A100'' nodes.

Set which GPUs on the node are visible to GROMACS with the environment variable ''CUDA_VISIBLE_DEVICES''.
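
A minimal sketch, reusing the ''topol.tpr'' input from the example above:

<code bash>
# expose only the first GPU to GROMACS; "0,1" would expose both
export CUDA_VISIBLE_DEVICES=0
gmx_mpi mdrun -s topol.tpr
</code>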

<apexchart>
{
  series: [{
    name: '2x A40',
    data: [0.38, 0.45, 0.28, 0.25, 0.39, 0.52, 0.03]
  }, {
    name: '2x A100',
    data: [0.36, 0.25, 0.31, 0.19, 0.29, 0.38, 0.36]
  }],
  chart: {
    type: 'bar',
    height: 350,
  },
  title: {
    text: 'Parallel efficiency η: 2 GPUs vs 1 GPU',
  },
  xaxis: {
    categories: [
      "Test 1",
      "Test 2",
      "Test 3",
      "Test 4",
      "Test 5",
      "Test 6",
      "Test 7",
    ],
  },
  yaxis: {
    title: {
      text: "parallel efficiency η"
    },
  },
  legend: {
    position: 'bottom',
    horizontalAlign: 'left',
  }
}
</apexchart>


==== Many nodes with many GPUs ====

In most cases one node is **better** than more nodes.

In some cases, for example a large molecule like Test 7, you might want to run GROMACS on multiple nodes in parallel using MPI, with multiple GPUs (one on each node). We strongly encourage you to test if you actually benefit from running with GPUs on many nodes. GROMACS can perform worse on many nodes in parallel than on a single one, even considerably!

Run GROMACS on multiple nodes with:

<code bash>
#SBATCH --nodes 2

mpirun gmx_mpi mdrun ...
</code>
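
Putting the pieces together, a sketch of a two-node job script; the partition and module version are placeholders, pick yours as described above:

<code bash>
#!/bin/bash
#SBATCH --nodes 2
#SBATCH --partition=...   # a GPU partition, see sinfo
#SBATCH --gres=gpu:1      # request one GPU per node

module purge
module load gromacs/...

# one MPI rank per node, see the discussion of ranks below
mpirun -np 2 gmx_mpi mdrun -s topol.tpr
</code>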

Take a look at the chapter [[doku:...]].

<apexchart>
{
  series: [{
    name: 'Test 1',
    data: [ 42.374, 40.176, 39.439, 38.252, 35.744, 30.811 ]
  }, {
    name: 'Test 2',
    data: [ 82.513, 81.25, 84.805, 81.894, 72.589, 62.855 ]
  }, {
    name: 'Test 3',
    data: [ 0, 0, 0, 0, 0, 0 ]
  }, {
    name: 'Test 4',
    data: [ 0, 0, 0, 0, 0, 0 ]
  }, {
    name: 'Test 5',
    data: [ 67.147, 76.027, 80.627, 80.903, 83.031, 68.702 ]
  }, {
    name: 'Test 6',
    data: [ 10.612, 11.963, 10.996, 14.37, 35.482, 34.988 ]
  }, {
    name: 'Test 7',
    data: [ 17.92, 21.604, 30.482, 37.497, 35.448, 43.254 ]
  }],
  chart: {
    type: 'bar',
    height: 350,
    stacked: true,
  },
  plotOptions: {
    bar: {
      horizontal: true,
    },
  },
  title: {
    text: 'GROMACS performance: 7 test cases on 1 to 32 GPU nodes',
  },
  xaxis: {
    categories: [
      "1 Node",
      "2 Nodes",
      "4 Nodes",
      "8 Nodes",
      "16 Nodes",
      "32 Nodes",
    ],
    title: {
      text: "ns / day"
    },
  },
  legend: {
    position: 'bottom',
    horizontalAlign: 'left',
    title: {
      text: "Test #"
    },
  }
}
</apexchart>


==== Many ranks on many nodes with many GPUs ====

Quick summary:
  * most (small) problems: 1 or 2 ranks per node
  * large problems: up to 8 ranks per node

If you want to run GROMACS on multiple nodes and multiple GPUs in parallel using MPI, best tell MPI explicitly how many processes should be launched on each node (e.g. via ''mpirun -np''), and test yourself with your specific application. Based on our tests listed in the following chart we recommend 1 rank per node for most (small) problems, and up to 8 ranks per node only for large problems. For example, ''-np 16'' on 2 nodes gives 8 ranks per node:

<code bash>
#SBATCH --nodes 2
mpirun -np 16 \
    gmx_mpi mdrun \
    ...
</code>

The reason for this is that the graphics card does more work than the CPU. GROMACS needs to copy data between the different ranks on the CPUs and all GPUs, which takes more time with more ranks. GROMACS notices that and shows ''Wait GPU state copy'' in the log; take for example Test 1 with 16 ranks on 1 node: the ''Wait GPU state copy'' accounts for a large share of the time spent!
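
To check this in your own run, look at the cycle accounting table at the end of the GROMACS log (log file name ''md.log'' assumed here):

<code bash>
# show the "Wait GPU state copy" row of the cycle accounting table
grep "Wait GPU state copy" md.log
</code>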

<apexchart>
{
  series: [{
    name: 'Test 1',
    data: [ 43.644, 46.385, 32.454, 37.333, 19.084, 16.136, 4.824 ]
  }, {
    name: 'Test 2',
    data: [ 390.057, 138.831, 89.078, 78.769, 39.94, 35.99, 9.545 ]
  }, {
    name: 'Test 3',
    data: [ 82.997, 39.682, 33.176, 80.643, 48.766, 29.216, 13.972 ]
  }, {
    name: 'Test 4',
    data: [ 144.859, 52.099, 35.469, 96.125, 55.373, 32.502, 14.864 ]
  }, {
    name: 'Test 5',
    data: [ 30.174, 35.561, 39.051, 68.824, 39.012, 34.442, 10.475 ]
  }, {
    name: 'Test 6',
    data: [ 18.282, 10.061, 15.62, 20.889, 17.528, 16.452, 7.534 ]
  }, {
    name: 'Test 7',
    data: [ 26.499, 14.855, 22.433, 26.672, 21.686, 19.323, 7.879 ]
  }],
  chart: {
    type: 'bar',
    height: 350,
    stacked: true,
  },
  plotOptions: {
    bar: {
      horizontal: true,
    },
  },
  title: {
    text: 'GROMACS performance: 7 test cases with various numbers of MPI ranks',
  },
  xaxis: {
    categories: [
      "1 Rank",
      "2 Ranks",
      "4 Ranks",
      "8 Ranks",
      "16 Ranks",
      "28* Ranks",
      "64 Ranks",
    ],
    title: {
      text: "ns / day"
    },
  },
  legend: {
    position: 'bottom',
    horizontalAlign: 'left',
    title: {
      text: "Test #"
    },
  }
}
</apexchart>