====== Going for efficiency - how to get performance out ======
^Type ^Theory ^
|**Operations** per second (GFlops/s) | |
|**Memory** throughput (GB/s) | |
|**Network** throughput (GB/s) | |
|**Network** latency (µs) |1.4-1.8 |
|**Storage** throughput ("/global", GB/s) |20 |
|**Storage** latency (ms) |0.03 |

"Operations": Floating point / 16 cores @3GHz; Values per node\\
"Memory": Mixture of three cache levels and main memory; Values per node\\
"Network": Medium shared by all users (large jobs only); Values per node\\
"Storage": Medium shared by all users; Values per application
Which one is the **limiting factor** for a given code?
  * Large problem size => subalgorithm with highest exponent of complexity becomes dominant
  * **Measure** compute time, preferably with profiler
  * Beware: **cache effects** can 'change' algorithmic complexity
  * Often: algorithmic complexity is well known and documented
===== Very important libraries =====
  * **Level 3 BLAS** (="Matrix Multiplication")
    * O(n^3) operations with O(n^2) memory accesses
    * portable performance (often faster by a factor of 10)
  * Naming schemes
    * UPPER vs. lower case
    * added underscores
  * Call by value / call by reference
  * Order of parameters: size of string parameters sometimes at the very end of parameter list
  * I/O libraries
    * Hint: use I/O of only one language, or
    * link with correct runtime libraries, or
    * link by calling compiler (e.g. gfortran) instead of linker (e.g. ld)
  * Copy-in and/or copy-out: can affect performance
  * Class/
===== Memory optimization =====
To keep CPUs busy, data has to be kept near the execution units, e.g. in **Level 1 Cache**.
Methods
  * ...
===== Example: Row - Column Access =====

Column Access

<code>
for(j=0; j<N; j++)
  for(i=0; i<N; i++)
    A[i][j] += 2.0;
</code>
Strided memory access in C\\
Contiguous memory access in Fortran

Row Access

<code>
for(i=0; i<N; i++)
  for(j=0; j<N; j++)
    A[i][j] += 2.0;
</code>
Contiguous memory access in C\\
Strided memory access in Fortran
===== Example: Blocking =====
<code>
//Three nested loops
N=10000;
for(i=0; i<N; i++)
  for(j=0; j<N; j++)
    for(k=0; k<N; k++)
      C[i][j] += A[i][k]*B[k][j];
//Large N => cache thrashing
</code>
<code>
//Blocked access
l=48; N=10000;
for(i=0; i<N; i+=l)
  for(j=0; j<N; j+=l)
    for(k=0; k<N; k+=l)
      for(ii=i; ii<min(i+l,N); ii++)
        for(jj=j; jj<min(j+l,N); jj++)
          for(kk=k; kk<min(k+l,N); kk++)
            C[ii][jj] += A[ii][kk]*B[kk][jj];
//Large N => reusing data in fast cache levels
</code>
Adapt parameter l to size of cache – separately for each cache level
Even better:

<code>
double ONE=1.0;
dgemm("N", "N", N, N, N, ONE, A, N, B, N, ONE, C, N);
// dgemm (TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC)
</code>
===== Storage and Network =====
Small messages are costly.

Example: Write 100MB

  * in one block to storage: 100MB / 2GB/s + 2µs = 50.002ms
  * 1000000 times 100B: 1000000 * 100B / 2GB/s + 1000000 * 2µs = 2050ms (blocking the file server)
===== Tools =====
  * VSC School Trainings upon request
https://vsc.ac.at