doku:gromacs — revision 2023/11/23 12:27 (current) by msiegel
Our recommendation:
  - Use the **most recent version** of GROMACS that we provide or build your own.
  - Use the newest hardware: use **1 GPU** on the partitions ''
  - Do some **performance analysis** to decide whether a single GPU node (likely) or multiple CPU nodes via MPI (unlikely) better suits your problem.
In most cases it does not make sense to run GROMACS on multiple GPU nodes with MPI, whether using one or two GPUs per node.
===== CPU or GPU Partition? =====
First you have to decide which hardware GROMACS should run on; we call this a ''
===== Installations =====
Type ''
Because of the low efficiency of GROMACS on many nodes with many GPUs via MPI, we do not provide

We provide the following GROMACS variants:

==== GPU but no MPI ====

We recommend GPU nodes; use the ''

**cuda-zen**:
  * Gromacs +cuda ~mpi, all compiled with **GCC**

Since the ''

==== MPI but no GPU ====

For Gromacs on CPU only but with MPI, use ''

**zen**:
  * Gromacs +openmpi +blas +lapack ~cuda, all compiled with **GCC**
  * Gromacs +openmpi +blas +lapack ~cuda, all compiled with **AOCC**

**skylake**:
  * Gromacs +**open**mpi +blas +lapack ~cuda, all compiled with **GCC**
  * Gromacs +**open**mpi +blas +lapack ~cuda, all compiled with **Intel**
  * Gromacs +**intel**mpi +blas +lapack ~cuda, all compiled with **GCC**
  * Gromacs +**intel**mpi +blas +lapack ~cuda, all compiled with **Intel**

In some of these packages, there is no ''
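The variant strings above (''+cuda'', ''~mpi'', …) follow Spack's spec syntax, so the installed variants can be inspected and loaded with Spack. A sketch of how this might look — the version and variant string in the ''spack load'' line are hypothetical examples, not the exact packages on our clusters:

```shell
# List the GROMACS packages Spack knows about, including their variants
# (+cuda, +openmpi, ...); the output depends on the local installation.
spack find -v gromacs

# Load one specific variant into the current shell, e.g. a CUDA build
# without MPI (version and variant string are hypothetical):
spack load gromacs@2023 +cuda ~mpi
```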
===== Batch Script =====
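A minimal job script for the single-GPU setup recommended above might look as follows. This is a sketch: the partition name and CPU count are placeholders to adapt to your system, and the environment setup assumes the Spack installations from the previous section.

```bash
#!/bin/bash
#SBATCH --job-name=gromacs
#SBATCH --partition=<gpu-partition>   # placeholder: one of the GPU partitions
#SBATCH --gres=gpu:1                  # 1 GPU, as recommended above
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16            # adjust to the cores available per GPU

# Hypothetical environment setup; see the Installations section.
spack load gromacs +cuda

# Single rank, offloading the main work to the GPU.
gmx mdrun -s topol.tpr -ntomp $SLURM_CPUS_PER_TASK -nb gpu -pme gpu -update gpu
```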
benchmark various scenarios:
  - R-143a in hexane (20,248 atoms) with very high output rate
  - a short RNA piece with explicit water (31,889 atoms)
  - a protein inside a membrane surrounded by explicit water (80,289 atoms)
  - a VSC user's test case (50,897 atoms)
  - a protein in explicit water (170,320 atoms)
  - a protein membrane channel with explicit water (615,924 atoms)
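Each scenario can be benchmarked with a short, restart-free run. A sketch — the ''.tpr'' file name is hypothetical, while the flags are standard ''gmx mdrun'' options:

```bash
# Run a fixed number of steps and reset the timers halfway through, so the
# reported ns/day excludes load-balancing startup effects; skip writing the
# final configuration, since only the performance numbers matter here.
gmx mdrun -s benchmark.tpr -nsteps 10000 -resethway -noconfout

# The performance estimate (ns/day) is printed at the end of the log:
grep -A1 'Performance' md.log
```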
}, {
  name: 'Test 3',
  data: [ 94.069, 99.788, 97.9, 100.509, 95.666, 83.485 ]
}, {
  name: 'Test 4',
  data: [ 115.179, 117.999, 115.028, 114.967, 103.8, 0 ]
}, {
  name: 'Test 5',
},
title: {
  text: '
},
xaxis: {
}
</
Note: the computation for test case 4 on 32 nodes timed out before GROMACS was able to estimate a performance. We can safely assume that this case, too, performs worse on 32 nodes than on fewer.
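The poor multi-node scaling can be made explicit by relating each measurement to ideal linear scaling. A small sketch, assuming the six ''Test 3'' values above are ns/day measured on 1, 2, 4, 8, 16 and 32 nodes respectively (the node counts are an assumption based on the note above):

```shell
# Parallel efficiency = perf(n) / (n * perf(1)); performance values are
# taken from the Test 3 data series above.
base=94.069
for pair in "1 94.069" "2 99.788" "4 97.9" "8 100.509" "16 95.666" "32 83.485"; do
  n=${pair%% *}; p=${pair#* }
  awk -v n="$n" -v p="$p" -v b="$base" \
      'BEGIN { printf "%2d nodes: efficiency %5.1f%%\n", n, 100 * p / (b * n) }'
done
```

Even though the absolute ns/day barely changes, the efficiency drops to a few percent at 32 nodes — 32 times the hardware for roughly the same throughput — which is why we recommend a single GPU node instead.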
},
title: {
  text: '
},
xaxis: {
}
</
===== Links =====

The benchmarks are based on three articles by NHR@FAU, featuring in-depth analyses of GROMACS performance on various GPU systems, multi-GPU setups, and comparisons with CPU:

https://

https://

https://
+ | |||