Our recommendation: Follow these three steps to get the fastest program:

  - Use the **most recent version** of GROMACS that we provide or build your own.
  - Use the newest hardware: use **1 GPU** on the partitions ''zen2_0256_a40x2'' or ''zen3_0512_a100x2'' on VSC-5.
  - Do some **performance analysis** to decide whether a single GPU node (likely) or multiple CPU nodes via MPI (unlikely) better suits your problem.

In most cases it does not make sense to run on multiple GPU nodes with MPI, whether using one or two GPUs per node.
  
===== CPU or GPU Partition =====
  
First you have to decide on which hardware GROMACS should run; we call this a ''partition'', described in detail at [[doku:slurm | SLURM]]. On any login node, type ''sinfo'' to get a list of the available partitions, or take a look at [[doku:vsc4_queue]] and [[doku:vsc5_queue]]. We recommend the partitions ''zen2_0256_a40x2'' or ''zen3_0512_a100x2'' on VSC-5; they have plenty of nodes available with 2 GPUs each. The partition has to be set in the batch script, see the example below. Be aware that each partition has different hardware, so choose the parameters accordingly. GROMACS mostly decides on its own how it wants to work, so don't be surprised if it ignores settings like environment variables.
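
For example, you can check the state of the recommended GPU partitions from any login node; a minimal sketch using standard ''sinfo'' options:

<code bash>
# show partition name, generic resources (GPUs), node count and node state
sinfo --partition=zen2_0256_a40x2,zen3_0512_a100x2 --format="%P %G %D %t"
</code>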
  
===== Installations =====
  
Type ''spack find -l gromacs'' or ''module avail gromacs'' on the ''cuda-zen''/''zen''/''skylake'' [[doku:spack-transition|spack trees]] on VSC-5/4. You can list available variants with [[doku:spack]]: ''spack find -l gromacs +cuda'' or ''spack find -l gromacs +mpi''.
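
Once you have picked a variant, load it into your shell environment before running; a minimal sketch (the hash is hypothetical, copy the one printed by ''spack find -l''):

<code bash>
spack find -l gromacs +cuda   # list GPU-enabled builds with their hashes
spack load gromacs +cuda      # load a build by spec ...
spack load /abc1234           # ... or by its (hypothetical) hash from the first column
</code>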
  
Because of the low efficiency of GROMACS on many nodes with many GPUs via MPI, we do not provide ''gromacs + cuda + mpi''.
  
We provide the following GROMACS variants:
  
==== GPU but no MPI ====

We recommend the GPU nodes; use the ''cuda-zen'' [[doku:spack-transition|spack tree]] on VSC-5:

**cuda-zen**:
  * Gromacs +cuda ~mpi, all compiled with **GCC**

Since the ''gromacs + cuda'' packages do not have MPI support, there is no ''gmx_mpi'' binary, only ''gmx''.
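
A minimal sketch of a single-GPU run with such a build, assuming a ''topol.tpr'' input and that the package is loaded as described above:

<code bash>
spack load gromacs +cuda    # GPU build from the cuda-zen tree
gmx mdrun -s topol.tpr      # no gmx_mpi here; GROMACS offloads to the GPU on its own
</code>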

==== MPI but no GPU ====

For GROMACS on CPU only but with MPI, use the ''zen'' [[doku:spack-transition|spack tree]] on VSC-5 or ''skylake'' on VSC-4:

**zen**:
  * Gromacs +openmpi +blas +lapack ~cuda, all compiled with **GCC**
  * Gromacs +openmpi +blas +lapack ~cuda, all compiled with **AOCC**

**skylake**:
  * Gromacs +**open**mpi +blas +lapack ~cuda, all compiled with **GCC**
  * Gromacs +**open**mpi +blas +lapack ~cuda, all compiled with **Intel**
  * Gromacs +**intel**mpi +blas +lapack ~cuda, all compiled with **GCC**
  * Gromacs +**intel**mpi +blas +lapack ~cuda, all compiled with **Intel**

In some of these packages there is no ''gmx'' binary, only ''gmx_mpi'', but this also works on a single node.
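
A minimal sketch of a CPU-only MPI run with ''gmx_mpi'' inside a SLURM job; the rank count is a placeholder, adapt it to your nodes:

<code bash>
spack load gromacs +mpi ~cuda                # CPU build with MPI support
srun --ntasks=8 gmx_mpi mdrun -s topol.tpr   # srun starts the MPI ranks
</code>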
  
===== Batch Script =====
</code>
  
Type ''sbatch myscript.sh'' to submit your batch script to [[doku:SLURM]]. You get the job ID, and your job will be scheduled and executed automatically.
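
For example, assuming the batch script above is saved as ''myscript.sh'', you can follow the job with the usual SLURM commands:

<code bash>
sbatch myscript.sh   # prints: Submitted batch job <jobid>
squeue -u $USER      # check whether the job is pending or running
scancel <jobid>      # cancel the job if something went wrong
</code>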
  
  
==== CPU / GPU Load ====
  
There is a whole page dedicated to [[doku:monitoring]] the CPU and GPU; for GROMACS the relevant sections are [[doku:monitoring#Live]] and [[doku:monitoring#GPU]].
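
A quick manual check, assuming you are allowed to open a shell on the node your running job occupies (the [[doku:monitoring]] page describes the recommended workflow):

<code bash>
squeue -u $USER   # the NODELIST column shows the node(s) of your running job
ssh n3071-001     # hypothetical node name, use the one from NODELIST
top               # CPU load per process
nvidia-smi        # GPU utilisation and memory on the GPU nodes
</code>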
  
  
==== Short Example ====
  
As a short example we ran ''gmx mdrun -s topol.tpr'' with different options, where ''topol.tpr'' is just some sample topology. We don't actually care about the result; we just want to know how many **ns/day** we can get, which GROMACS tells you at the end of every run. Such a short test can be done in no time.
  
The following table lists our 5 tests: without any options GROMACS already runs fine (a). Setting the number of tasks (b) is not needed; if set wrong it can even slow the calculation down significantly ( c ) due to over-provisioning! We would advise to enforce pinning; in our example it does not show any effect though (d), we assume the tasks are already pinned automatically. The only further improvement we could get was the ''-update gpu'' option, which puts more load on the GPU (e).
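
The kind of ''mdrun'' options compared in the table look roughly like this; a sketch, the exact task and thread counts are placeholders:

<code bash>
gmx mdrun -s topol.tpr                      # (a) let GROMACS decide on its own
gmx mdrun -s topol.tpr -ntmpi 1 -ntomp 16   # (b)/( c ) set task and thread counts manually
gmx mdrun -s topol.tpr -pin on              # (d) enforce pinning
gmx mdrun -s topol.tpr -update gpu          # (e) run the update step on the GPU as well
</code>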
  
^ # ^ cmd         ^ ns / day ^ cpu load / % ^ gpu load / % ^ notes                               ^
benchmark various scenarios:
  
  - R-143a in hexane (20,248 atoms) with very high output rate
  - a short RNA piece with explicit water (31,889 atoms)
  - a protein inside a membrane surrounded by explicit water (80,289 atoms)
  - a VSC user's test case (50,897 atoms)
  - a protein in explicit water (170,320 atoms)
  - a protein membrane channel with explicit water (615,924 atoms)
In most cases one node is **better** than more nodes.
  
In some cases, for example a large molecule like Test 7, you might want to run GROMACS on multiple nodes in parallel using MPI, with multiple GPUs (one per node). We strongly encourage you to test whether you actually benefit from running with GPUs on many nodes. GROMACS can perform worse on many nodes in parallel than on a single one, even considerably!
  
Run GROMACS on multiple nodes with:
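
A sketch of the relevant batch-script lines for such a multi-node MPI run; node and rank counts are placeholders, the rest follows the batch script example above:

<code bash>
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1   # placeholder: one MPI rank per node

srun gmx_mpi mdrun -s topol.tpr
</code>
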
    }, {
        name: 'Test 3',
        data: [ 94.069, 99.788, 97.9, 100.509, 95.666, 83.485 ]
    }, {
        name: 'Test 4',
        data: [ 115.179, 117.999, 115.028, 114.967, 103.8, 0 ]
    }, {
        name: 'Test 5',
    },
    title: {
        text: 'GROMACS Benchmarks: Performance on multiple Nodes with dual A100 GPUs each via MPI'
    },
    xaxis: {
}
</achart>

Note: for Test 4 on 32 nodes the computation timed out before GROMACS was able to estimate a performance. We can safely assume that this example case would perform worse on 32 nodes than on fewer nodes as well.
  
  
</code>
  
The reason for this is that the graphics card does more work than the CPU. GROMACS needs to copy data between the different ranks on the CPUs and all GPUs, which takes more time with more ranks. GROMACS notices this and shows ''Wait GPU state copy'' at the end of the log. As an example, in our Test 1 with 16 ranks on 1 node, ''Wait GPU state copy'' amounts to 44.5% of the time spent!
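
You can check this in your own run; ''md.log'' is the default ''mdrun'' log file name:

<code bash>
grep "Wait GPU state copy" md.log   # timing row from the performance table at the end of the log
</code>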
  
<achart>
    },
    title: {
        text: 'GROMACS Benchmarks: Performance with various Ranks on 2 Nodes with dual A100 GPUs each'
    },
    xaxis: {
}
</achart>

===== Links =====

The benchmarks are based on three articles by NHR@FAU, featuring in-depth analyses of GROMACS performance on various GPU systems, multi-GPU setups, and comparisons with CPUs:

https://hpc.fau.de/2022/02/10/gromacs-performance-on-different-gpu-types/

https://hpc.fau.de/2021/08/10/gromacs-shootout-intel-icelake-vs-nvidia-gpu/

https://hpc.fau.de/2021/06/18/multi-gpu-gromacs-jobs-on-tinygpu/
  
  