Differences

This shows you the differences between two versions of the page.

--- pandoc:introduction-to-vsc:10_performance:10_performance [2017/10/19 05:46] – Pandoc Auto-commit pandoc
+++ pandoc:introduction-to-vsc:10_performance:10_performance [2018/03/21 11:33] – Pandoc Auto-commit pandoc
@@ Line 1: / Line 1: @@
 ====== Going for efficiency - how to get out performance ======
@@ Line 20: / Line 22: @@
 |**Network** throughput (GB/s)             |2*4      |2*3.4     |2*2                     |0.2 / 0.0002         |
 |**Network** latency (µs)                  |1.4-1.8  |1.4-1.8   |1.4-1.8                 |2                    |
-|**Storage** throughput ("/global", GB/s)  |20       |15        |10                      |1                    |
+|**Storage** throughput (“/global”, GB/s)  |20       |15        |10                      |1                    |
 |**Storage** latency (ms)                  |0.03     |1         |1                       |1000                 |
-"Operations": Floating point / 16 cores @3GHz; Values per node\\
+“Operations”: Floating point / 16 cores @3GHz; Values per node\\
-"Memory": Mixture of three cache levels and main memory; Values per node\\
+“Memory”: Mixture of three cache levels and main memory; Values per node\\
-"Network": Medium shared by all users (large jobs only); Values per node\\
+“Network”: Medium shared by all users (large jobs only); Values per node\\
-"Storage": Medium shared by all users; Values per application
+“Storage”: Medium shared by all users; Values per application
 Which one is the **limiting factor** for a given code?
@@ Line 55: / Line 57: @@
   * Large problem size => subalgorithm with highest exponent of complexity becomes dominant
     * **Measure** compute time, preferably with profiler
-    * Beware: **cache effects** can 'change' algorithmic complexity
+    * Beware: **cache effects** can ‘change’ algorithmic complexity
   * Often: algorithmic complexity is well known and documented
@@ Line 138: / Line 140: @@
 ===== Very important libraries =====
-  * **Level 3 BLAS** (="Matrix Multiplication")
+  * **Level 3 BLAS** (=“Matrix Multiplication”)
     * O(n^3) operations with O(n^2) memory accesses
     * portable performance (often faster by a factor of 10)
@@ Line 179: / Line 181: @@
   * Naming schemes
-    * UPPER vs. lower case
+    * UPPER vs. lower case
     * added underscores
@@ Line 200: / Line 202: @@
   * Call by value / call by reference
   * Order of parameters: size of string parameters sometimes at the very end of parameter list
-  * I/O libraries
+  * I/O libraries * Hint: use I/O of only one language, or * link with correct runtime libraries, or * link by calling compiler (e.g. gfortran) instead of linker (e.g. ld)
-    * Hint: use I/O of only one language, or
-    * link with correct runtime libraries, or
-    * link by calling compiler (e.g. gfortran) instead of linker (e.g. ld)
   * Copy-in and/or copy-out: can affect performance
-  * Class/module information in objects ("*.o") only useful in one language
+  * Class/module information in objects (“*.o“) only useful in one language
 ===== Interoperability (3) =====
@@ Line 237: / Line 236: @@
 ===== Memory optimization =====
-To keep CPUs busy, data has to be kept near the execution units, e.g. in **Level 1 Cache**
+To keep CPUs busy, data has to be kept near the execution units, e.g. in **Level 1 Cache**
 Methods
@@ Line 276: / Line 275: @@
 ===== Example: Blocking =====
-Three nested loops
 <code>
+//Three nested loops
 N=10000;
 for(i=0;i<N;i++)
@@ Line 285: / Line 283: @@
     for(k=0;k<N;k++)
       C[i][j]+=A[i][k]*B[k][j];
+//Large N => cache thrashing
 </code>
-Large N => cache thrashing
-Blocked access
 <code>
+//Blocked access
 l=48; N=10000;
 for(i=0;i<N;i+=l)
@@ Line 300: / Line 296: @@
           for(kk=k;kk<min(k+l,N);kk++)
             C[ii][jj]+=A[ii][kk]*B[kk][jj];
+//Large N => reusing data in fast cache levels
 </code>
-Large N => reusing data in fast cache levels
+Adapt parameter l to size of cache – separately for each cache level
@@ Line 324: / Line 321: @@
   * 1000000 times 100B: 1000000 * 100B * 2GB/s + 1000000*2µs = 2050ms (blocking the file server)
-===== Parallelization =====
-  * MPI<html><br></html> 20.-22.11.2017<html><br></html> VSC Training Course: Parallelization with MPI<html><br></html> Claudia Blaas-Schenner and Irene Reichl
-  * OpenMP<html><br></html> 23.-24.11.2017<html><br></html> VSC Training Course: Shared memory parallelization with OpenMP<html><br></html> Lukas Einkemmer (lectures+practicals; Department of Mathematics, University of Innsbruck),<html><br></html> Claudia Blaas-Schenner and Irene Reichl (practicals only; VSC Team, TU Wien)
-http://vsc.ac.at
 ===== Tools =====
@@ Line 350: / Line 340: @@
   * VSC School Trainings upon request
-https://wiki.vsc.ac.at/doku.php?id=doku:perf-report -- https://wiki.vsc.ac.at/doku.php?id=doku:forge
+https://wiki.vsc.ac.at/doku.php?id=doku:perf-report – https://wiki.vsc.ac.at/doku.php?id=doku:forge