====== GROMACS ======
Our recommendation:

  - Use the **most recent version** of GROMACS.
  - Use the newest hardware: use **1 GPU**, e.g. on the partition ''zen2_0256_a40x2''.
  - Do some **performance analysis**.

In most cases it does not make sense to run on multiple GPU nodes with MPI, whether using one or two GPUs per node.

===== CPU or GPU Partition? =====

First you have to decide on which hardware GROMACS should run; we call this a ''partition''. The partition has to be set in the batch script, see the example below.
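
To see which partitions are available, a minimal sketch using standard SLURM tools (the exact partition names depend on the cluster):

<code bash>
# list partitions, their node counts, and generic resources (GPUs)
sinfo -o "%P %D %G"
</code>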

===== Installations =====

Type ''spack find -l gromacs'' to list the available GROMACS packages.
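
For example, to pick and load one specific variant by its hash, a minimal sketch (the hash ''abcd123'' is illustrative; depending on the cluster setup you may use ''module load'' instead, as in the batch script below):

<code bash>
# list installed GROMACS variants with their spack hashes
spack find -l gromacs
# load one specific variant by its hash
spack load /abcd123
</code>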

Because of the low efficiency of GROMACS on many nodes with many GPUs via MPI, we do not provide a variant with both ''+mpi'' and ''+cuda''.

We provide the following GROMACS variants:

==== GPU but no MPI ====

We recommend the GPU nodes; use the ''cuda-zen'' spack environment:

**cuda-zen**:
  * Gromacs +cuda ~mpi, all compiled with **GCC**

Since the ''cuda-zen'' packages are built without MPI, start GROMACS with ''gmx'' instead of ''gmx_mpi'', as in the example batch script below.

==== MPI but no GPU ====

For Gromacs on CPU only, but with MPI, use the ''zen'' or ''skylake'' spack environments:

**zen**:
  * Gromacs +openmpi +blas +lapack ~cuda, all compiled with **GCC**
  * Gromacs +openmpi +blas +lapack ~cuda, all compiled with **AOCC**

**skylake**:
  * Gromacs +**open**mpi +blas +lapack ~cuda, all compiled with **GCC**
  * Gromacs +**open**mpi +blas +lapack ~cuda, all compiled with **Intel**
  * Gromacs +**intel**mpi +blas +lapack ~cuda, all compiled with **GCC**
  * Gromacs +**intel**mpi +blas +lapack ~cuda, all compiled with **Intel**

In some of these packages there is no ''gmx'' binary; use ''gmx_mpi'' instead.
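
A minimal sketch of starting such an MPI variant (the input file name ''topol.tpr'' is just an example):

<code bash>
# CPU-only, MPI-parallel run with the MPI binary
mpirun gmx_mpi mdrun -s topol.tpr
</code>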

===== Batch Script =====

Write a batch script, so your job can be scheduled efficiently with [[doku:slurm|SLURM]]; include:

  * some SLURM parameters: the ''#SBATCH'' lines
  * environment variables: e.g. ''unset OMP_NUM_THREADS''
  * clean modules: ''module purge''
  * load modules: ''module load gromacs/...''
  * starting the program in question: ''gmx mdrun''

<code bash>
#!/bin/bash
#SBATCH --job-name=myname
#SBATCH --partition=zen2_0256_a40x2
#SBATCH --qos=zen2_0256_a40x2
#SBATCH --gres=gpu:1

unset OMP_NUM_THREADS

module purge
module load gromacs/2022.2-gcc-9.5.0-...

gmx mdrun -s topol.tpr
</code>

Submit the script with ''sbatch'': you get back the job id, and your job will be scheduled and executed automatically.
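
For example (the script name ''mybatchscript.sh'' is just an assumption):

<code bash>
# submit the batch script; prints "Submitted batch job <jobid>"
sbatch mybatchscript.sh
# list your own pending and running jobs
squeue -u $USER
</code>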

===== Performance =====

==== CPU / GPU Load ====

There is a whole page dedicated to [[doku:monitoring|monitoring]]; for GROMACS the relevant sections are those on CPU load and GPU load.
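
A quick manual check is also possible with standard tools; a minimal sketch, assuming you have a shell on the compute node where the job runs:

<code bash>
# CPU load of your processes, press 'q' to quit
top
# GPU utilization and memory, refreshed every 2 seconds
nvidia-smi -l 2
</code>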

==== Short Example ====

As a short example we ran ''gmx mdrun -s topol.tpr'' with various options; we do not actually care about the result here. Without any options GROMACS already runs fine (a). Setting the number of tasks (b, c) is not needed; if set wrong, it can even slow the calculation down significantly (over-provisioning)! Enforcing pinning (d) does not show any effect; we assume the tasks are already pinned automatically. The only improvement we found was the ''-update gpu'' option (e), which puts more load on the GPU. This might not work, however, if more than one GPU is used.

The following table lists our 5 tests:

^ # ^ cmd ^ ns / day ^ CPU load / % ^ GPU load / % ^ notes ^
| a | -- | 160 | 100 | 80 | |
| b | ''-ntomp 8'' | 160 | 100 | 80 | |
| c | ''-ntomp 16'' | 140 | 40 | 70 | GROMACS warning: over-provisioning |
| d | ''-pin on'' | 160 | 100 | 80 | |
| e | ''-update gpu'' | 170 | 100 | 90 | |

==== 7 Test Cases ====

Since GROMACS is used in many and very different ways, it makes sense to benchmark various scenarios:

  - R-143a in hexane (20,248 atoms) with very high output rate
  - a short RNA piece with explicit water (31,889 atoms)
  - a protein inside a membrane surrounded by explicit water (80,289 atoms)
  - a VSC user's test case (50,897 atoms)
  - a protein in explicit water (170,320 atoms)
  - a protein membrane channel with explicit water (615,924 atoms)
  - a huge virus protein (1,066,628 atoms)

Take a look at the test case that most resembles your application.

In this chart we compare our various hardware on the 7 test cases; recent GPUs like the ''A40'' and ''A100'' lead in most tests:

<apexchart>
{
  series: [{
    name: 'Test 1',
    data: [191, 144, 128, 125, 145, 127, 92, 62, 57, 60, 57, 29, 28, 27, 17, 7.4, 7.4]
  }, {
    name: 'Test 2',
    data: [525, 442, 449, 455, 471, 317, 228, 228, 207, 193, 152, 73, 74, 61, 46, 18, 18]
  }, {
    name: 'Test 3',
    data: [205, 143, 164, 130, 113, 164, 103, 66, 68, 58, 48, 24, 25, 23, 14, 6.2, 6]
  }, {
    name: 'Test 4',
    data: [463, 333, 273, 246, 229, 276, 103, 165, 170, 158, 143, 69, 67, 54, 40, 16, 16]
  }, {
    name: 'Test 5',
    data: [168, 139, 162, 147, 131, 174, 94, 61, 59, 58, 43, 18, 18, 22, 10, 5.2, 5]
  }, {
    name: 'Test 6',
    data: [9.6, 8.1, 16, 8.4, 9.9, 7.3, 12, 4.3, 3.1, 3.1, 4.6, 1.7, 1.7, 1.6, 1, 0.4, 0.4]
  }, {
    name: 'Test 7',
    data: [27.2, 13, 25, 21.8, 1.4, 24.6, 18, 8.6, 8, 7.6, 8, 3.1, 3.1, 3, 1.7, 0.7, 0.7]
  }],
  chart: {
    type: 'bar',
    height: 350,
    stacked: true,
  },
  plotOptions: {
    bar: {
      horizontal: true,
    },
  },
  title: {
    text: 'GROMACS performance of the 7 test cases on various hardware'
  },
  xaxis: {
    categories: [
      "1x A40",
      "1x RTX2080TI",
      "1x A100",
      "4x GTX1080 M",
      "2x A40",
      "8x GTX1080 M",
      "2x A100",
      "2x GTX1080 M",
      "1x GTX1080 M",
      "1x GTX1080 S",
      "0x A100",
      "0x GTX1080 M",
      "0x A40",
      "1x K20M",
      "0x K20M",
      "0x GTX1080 S",
      "0x RTX2080TI",
    ],
    title: {
      text: "ns / day"
    },
  },
  legend: {
    position: 'bottom',
    horizontalAlign: 'left',
    title: {
      text: "Test #"
    },
  }
}
</apexchart>

==== Many GPUs ====

In most cases 1 GPU is **better** than 2 GPUs!

In some cases, for example a large molecule like Test 7, you might want to run GROMACS on both GPUs. We strongly encourage you to test whether you actually benefit from the second GPU.

To find out if more GPUs mean more work done we need some math: the parallel efficiency **η** is the ratio of the [[https://en.wikipedia.org/wiki/Speedup|speedup]] S(N) to the number of processors N:

η = S(N) / N
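
For example, in the first chart Test 1 reaches 191 ns/day on 1x A40 and 145 ns/day on 2x A40, so the speedup is S(2) = 145 / 191 ≈ 0.76 and the parallel efficiency is η = 0.76 / 2 ≈ 0.38; well below 1, so the second GPU is mostly wasted.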

In this chart we compare the GROMACS parallel efficiency **η** of the 7 test cases with two GPUs versus one GPU on the VSC-5 ''A40'' and ''A100'' nodes.

Set the number of GPUs on the node that are visible to GROMACS with ''CUDA_VISIBLE_DEVICES''.
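
A minimal sketch (GPU device ids start at 0):

<code bash>
# make only the first GPU visible to GROMACS
export CUDA_VISIBLE_DEVICES=0
# or make both GPUs visible
# export CUDA_VISIBLE_DEVICES=0,1
gmx mdrun -s topol.tpr
</code>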

<apexchart>
{
  series: [{
    name: '2x A40',
    data: [0.38, 0.45, 0.28, 0.25, 0.39, 0.52, 0.03]
  }, {
    name: '2x A100',
    data: [0.36, 0.25, 0.31, 0.19, 0.29, 0.38, 0.36]
  }],
  chart: {
    type: 'bar',
    height: 350,
  },
  title: {
    text: 'GROMACS parallel efficiency η: 2 GPUs versus 1 GPU'
  },
  xaxis: {
    categories: [
      "Test 1",
      "Test 2",
      "Test 3",
      "Test 4",
      "Test 5",
      "Test 6",
      "Test 7",
    ],
  },
  yaxis: {
    title: {
      text: "η"
    },
  },
  legend: {
    position: 'bottom',
    horizontalAlign: 'left',
  }
}
</apexchart>

==== Many nodes with many GPUs ====

In most cases one node is **better** than more nodes.

In some cases, for example a large molecule like Test 7, you might want to run GROMACS on multiple nodes in parallel using MPI, with multiple GPUs (one per node). We strongly encourage you to test whether you actually benefit from running with GPUs on many nodes; GROMACS can perform considerably worse on many nodes in parallel than on a single one!

Run GROMACS on multiple nodes with:

<code bash>
#SBATCH --nodes 2
mpirun gmx_mpi mdrun ...
</code>

Take a look at the chapter [[#many_ranks_on_many_nodes_with_many_gpus|many ranks on many nodes with many GPUs]] below before scaling out.

<apexchart>
{
  series: [{
    name: 'Test 1',
    data: [ 42.374, 40.176, 39.439, 38.252, 35.744, 30.811 ]
  }, {
    name: 'Test 2',
    data: [ 82.513, 81.25, 84.805, 81.894, 72.589, 62.855 ]
  }, {
    name: 'Test 3',
    data: [ 94.069, 99.788, 97.9, 100.509, 95.666, 83.485 ]
  }, {
    name: 'Test 4',
    data: [ 115.179, 117.999, 115.028, 114.967, 103.8, 0 ]
  }, {
    name: 'Test 5',
    data: [ 67.147, 76.027, 80.627, 80.903, 83.031, 68.702 ]
  }, {
    name: 'Test 6',
    data: [ 10.612, 11.963, 10.996, 14.37, 35.482, 34.988 ]
  }, {
    name: 'Test 7',
    data: [ 17.92, 21.604, 30.482, 37.497, 35.448, 43.254 ]
  }],
  chart: {
    type: 'bar',
    height: 350,
    stacked: true,
  },
  plotOptions: {
    bar: {
      horizontal: true,
    },
  },
  title: {
    text: 'GROMACS performance of the 7 test cases on 1 to 32 nodes'
  },
  xaxis: {
    categories: [
      "1 Node",
      "2 Nodes",
      "4 Nodes",
      "8 Nodes",
      "16 Nodes",
      "32 Nodes",
    ],
    title: {
      text: "ns / day"
    },
  },
  legend: {
    position: 'bottom',
    horizontalAlign: 'left',
    title: {
      text: "Test #"
    },
  }
}
</apexchart>

Note: for Test 4 on 32 nodes the computation timed out before GROMACS was able to estimate a performance. We can safely assume that this case, too, performs worse on 32 nodes than on fewer.

==== Many ranks on many nodes with many GPUs ====

Quick summary:

  * most (small) problems: 1 or 2 ranks per node
  * large problems: up to 8 ranks per node

If you want to run GROMACS on multiple nodes and multiple GPUs in parallel using MPI, it is best to tell MPI how many processes should be launched on each node, e.g. with ''mpirun --map-by ppr:N:node''. We strongly encourage you to test this yourself with your specific application. Based on our tests listed in the following chart, we recommend 1 rank per node for most (small) problems, and up to 8 ranks per node only for large problems:

<code bash>
#SBATCH --nodes 2

# large problem: 8 ranks per node x 2 nodes = 16 MPI ranks (Open MPI syntax)
mpirun -np 16 \
    --map-by ppr:8:node \
    gmx_mpi mdrun ...
</code>

The reason for this is that the graphics card does more work than the CPU. GROMACS needs to copy data between the different ranks on the CPUs and all GPUs, which takes more time with more ranks. GROMACS notices this and reports the lost time in the log, e.g. as ''Wait GPU state copy''.

<apexchart>
{
  series: [{
    name: 'Test 1',
    data: [ 43.644, 46.385, 32.454, 37.333, 19.084, 16.136, 4.824 ]
  }, {
    name: 'Test 2',
    data: [ 390.057, 138.831, 89.078, 78.769, 39.94, 35.99, 9.545 ]
  }, {
    name: 'Test 3',
    data: [ 82.997, 39.682, 33.176, 80.643, 48.766, 29.216, 13.972 ]
  }, {
    name: 'Test 4',
    data: [ 144.859, 52.099, 35.469, 96.125, 55.373, 32.502, 14.864 ]
  }, {
    name: 'Test 5',
    data: [ 30.174, 35.561, 39.051, 68.824, 39.012, 34.442, 10.475 ]
  }, {
    name: 'Test 6',
    data: [ 18.282, 10.061, 15.62, 20.889, 17.528, 16.452, 7.534 ]
  }, {
    name: 'Test 7',
    data: [ 26.499, 14.855, 22.433, 26.672, 21.686, 19.323, 7.879 ]
  }],
  chart: {
    type: 'bar',
    height: 350,
    stacked: true,
  },
  plotOptions: {
    bar: {
      horizontal: true,
    },
  },
  title: {
    text: 'GROMACS performance of the 7 test cases with different numbers of MPI ranks'
  },
  xaxis: {
    categories: [
      "1 Rank",
      "2 Ranks",
      "4 Ranks",
      "8 Ranks",
      "16 Ranks",
      "28* Ranks",
      "64 Ranks",
    ],
    title: {
      text: "ns / day"
    },
  },
  legend: {
    position: 'bottom',
    horizontalAlign: 'left',
    title: {
      text: "Test #"
    },
  }
}
</apexchart>

===== Links =====

The benchmarks are based on three articles by NHR@FAU, featuring in-depth analysis of GROMACS performance on various GPU systems, multi-GPU setups, and comparison with CPUs:

https://

https://

https://