===== GPU Partition =====
  
First you have to decide on which hardware GROMACS should run; we call this a ''partition'', described in detail at [[doku:slurm | SLURM]]. On any login node, type ''sinfo'' to get a list of the available partitions, or take a look at [[doku:vsc4_queue]] and [[doku:vsc5_queue]]. We recommend the partitions ''zen2_0256_a40x2'' or ''zen3_0512_a100x2'' on VSC-5; they have plenty of nodes available with 2 GPUs each. The partition has to be set in the batch script, see the example below. Be aware that each partition has different hardware, so choose the parameters accordingly. GROMACS decides mostly on its own how it wants to work, so don't be surprised if it ignores settings like environment variables.
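
For example, to check how many nodes of the recommended GPU partitions are currently available, a minimal sketch (the format string just selects a few ''sinfo'' columns):

<code bash>
# partition name, availability, node count and generic resources (GPUs)
sinfo -o "%P %a %D %G" | grep -E "a40x2|a100x2"
</code>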

===== Installations =====

We provide the following GROMACS installations:

  * ''gromacs + cuda'': GPU Nodes, use the ''cuda-zen'' [[doku:spack-transition|spack tree]] on VSC-5.
  * ''gromacs + mpi'': CPU only, use the ''zen''/''skylake'' [[doku:spack-transition|spack trees]] on VSC-5/4.

Type ''spack find -l gromacs'' or ''module avail gromacs'' in the ''cuda-zen''/''zen''/''skylake'' [[doku:spack-transition|spack trees]] on VSC-5/4 to see the installed versions. You can list the available variants with [[doku:spack]]: ''spack find -l gromacs +cuda'' or ''spack find -l gromacs +mpi''.

Because of the low efficiency of GROMACS on many nodes with many GPUs via MPI, we do not provide ''gromacs + cuda + mpi''. The ''gromacs + cuda'' packages therefore do not have MPI support; there is no ''gmx_mpi'' binary, only ''gmx''.
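
A quick sketch of finding and loading an installation on a VSC-5 login node (version strings and hashes will differ; the module name below is deliberately truncated, complete it from the list):

<code bash>
spack find -l gromacs +cuda                 # list CUDA-enabled installations with their hashes
module avail gromacs                        # or list the corresponding modules
module load gromacs/2022.2-gcc-9.5.0-...    # load one of them
</code>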
  
===== Batch Script =====
  * clean modules: ''module purge''
  * load modules: ''module load gromacs/2022.2-gcc-9.5.0-...''
  * starting the program in question: ''gmx ...''
  
<code bash mybatchscript.sh>
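#!/bin/bash
# Sketch of a typical SBATCH header (illustrative values, not fixed settings):
# partition as recommended above; adjust job name, node and GPU count,
# and any QOS/account options to your project.
#SBATCH --job-name=gromacs_test
#SBATCH --nodes=1
#SBATCH --partition=zen2_0256_a40x2
#SBATCH --gres=gpu:2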
module purge
module load gromacs/2022.2-gcc-9.5.0-...
gmx mdrun -s topol.tpr
</code>
  
Type ''sbatch mybatchscript.sh'' to submit your batch script to [[doku:SLURM]]. You get the job ID, and your job will be scheduled and executed automatically.
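
A short usage sketch:

<code bash>
sbatch mybatchscript.sh   # prints the job ID, e.g. "Submitted batch job <jobid>"
squeue -u $USER           # check the state of your jobs
</code>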
  
  
==== CPU / GPU Load ====
  
There is a whole page dedicated to [[doku:monitoring]] the CPU and GPU; for GROMACS the relevant sections are [[doku:monitoring#Live]] and [[doku:monitoring#GPU]].
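
As a minimal sketch, assuming you can log in to the node where your job runs (see the monitoring page for the full workflow), standard tools give a first impression of the load:

<code bash>
top           # live CPU load per process
nvidia-smi    # GPU utilisation and memory usage
</code>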
  
  
==== Short Example ====
  
As a short example we ran ''gmx mdrun -s topol.tpr'' with different options, where ''topol.tpr'' is just some sample topology. We don't actually care about the result; we just want to know how many **ns/day** we can get, which GROMACS tells you at the end of every run. Such a short test can be done in no time.
  
The following table lists our 5 tests: Without any options GROMACS already runs fine (a). Setting the number of tasks (b) is not needed; if set wrong, it can even slow the calculation down significantly (c) due to overprovisioning! We would advise to enforce pinning; in our example it does not show any effect though (d), so we assume that the tasks are pinned automatically already. The only further improvement we could get was using the ''-update gpu'' option, which puts more load on the GPU (e).
  
^ # ^ cmd         ^ ns / day ^ cpu load / % ^ gpu load / % ^ notes                               ^
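
For illustration, the kind of command-line variations we compared could look like the following sketch (option values are examples, not the exact benchmark settings):

<code bash>
gmx mdrun -s topol.tpr                       # (a) let GROMACS decide everything
gmx mdrun -s topol.tpr -ntmpi 4 -ntomp 8     # (b)/(c) set the number of tasks explicitly
gmx mdrun -s topol.tpr -pin on               # (d) enforce pinning
gmx mdrun -s topol.tpr -update gpu           # (e) offload the update step to the GPU
</code>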
==== 7 Test Cases ====
  
Since GROMACS is used in many and very different ways, it makes sense to benchmark various scenarios:
  
//Chart: GROMACS Benchmarks, 7 tests on various hardware//
In most cases 1 GPU is **better** than 2 GPUs!
  
In some cases, for example a large molecule like Test 7, you might want to run GROMACS on both GPUs. We strongly encourage you to test whether you actually benefit from the second GPU.
  
To find out if more GPUs mean more work done we need some math: the parallel efficiency **η** is the ratio of the [[https://en.wikipedia.org/wiki/Speedup | speedup]] factor **S(N)** and the number of processors **N**:
η = S(N) / N
  
In this chart we compare the GROMACS parallel efficiency **η** of the 7 test cases with two GPUs versus one GPU on VSC-5 ''A40'' and ''A100'' Nodes. Ideally two GPUs would bring a speedup **S** of 2, but also an **N** of 2, so the parallel efficiency would be 1. In reality, the speedup (the ratio of the **ns/day** achieved with two GPUs to that with one GPU) will be less than 2, and so the efficiency will be less than 1. An efficiency of 0.45 for example means that with two GPUs, more than half of the used resources were wasted! Because of the high demand on our GPU Nodes we kindly ask you to test your case, and only use more than 1 GPU if your efficiency is above 0.5.
  
Set the number of GPUs on the node visible to GROMACS with ''export CUDA_VISIBLE_DEVICES=0'' for one GPU or ''export CUDA_VISIBLE_DEVICES=0,1'' for both GPUs.
  
//Chart: GROMACS parallel efficiency η, two GPUs versus one GPU on ''A40'' and ''A100'' Nodes//
In most cases one node is **better** than more nodes.
  
In some cases, for example a large molecule like Test 7, you might want to run GROMACS on multiple nodes in parallel using MPI, with multiple GPUs (one per node). We strongly encourage you to test if you actually benefit from running with GPUs on many nodes. GROMACS can perform worse on many nodes in parallel than on a single one, even considerably!
  
Run GROMACS on multiple nodes with MPI.
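A minimal sketch, assuming the MPI-enabled build with the ''gmx_mpi'' binary and OpenMPI's ''mpirun'' (adjust the module and the rank mapping to your case):

<code bash>
# launch one MPI rank per node; each rank drives the GPU of its node
mpirun --map-by ppr:1:node gmx_mpi mdrun -s topol.tpr
</code>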
  
//Chart: GROMACS Benchmarks: Performance on multiple Nodes via MPI//
  * Large problem: 8 ranks per node
  
If you want to run GROMACS on multiple nodes and multiple GPUs in parallel using MPI, it is best to tell MPI how many processes should be launched on each node with ''--map-by ppr:1:node''. MPI would normally use many ranks, which is good for CPUs, but bad with GPUs! We encourage you to test a few different numbers of ranks, as sketched below.
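
A minimal sketch of such a test, again assuming ''gmx_mpi'' and OpenMPI (compare the **ns/day** reported at the end of each run):

<code bash>
mpirun --map-by ppr:1:node gmx_mpi mdrun -s topol.tpr   # 1 rank per node
mpirun --map-by ppr:2:node gmx_mpi mdrun -s topol.tpr   # 2 ranks per node
mpirun --map-by ppr:8:node gmx_mpi mdrun -s topol.tpr   # 8 ranks per node, e.g. for large problems
</code>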
  
The reason for this is that the graphics card does more work than the CPU. GROMACS needs to copy data between the different ranks on the CPUs and all GPUs, which takes more time with more ranks. GROMACS notices that and shows ''Wait GPU state copy'' at the end of the log. As an example, in our test 1 with 16 ranks on 1 node, the ''Wait GPU state copy'' amounts to 44.5% of the time spent!
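
To check this timer in your own runs, you can search the GROMACS log for it (a sketch; ''md.log'' is the default log file name of ''mdrun''):

<code bash>
# show the "Wait GPU state copy" row of the performance table
grep "Wait GPU state copy" md.log
</code>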
  
//Chart: GROMACS Benchmarks: Performance with various Ranks on 2 Nodes with 1 GPU each//