Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
Last revisionBoth sides next revision
pandoc:introduction-to-vsc:10_performance:10_performance [2017/10/19 05:46] – Pandoc Auto-commit pandocpandoc:introduction-to-vsc:10_performance:10_performance [2018/03/21 11:33] – Pandoc Auto-commit pandoc
Line 1: Line 1:
 +
 +
 ====== Going for efficiency - how to get out performance ====== ====== Going for efficiency - how to get out performance ======
  
Line 20: Line 22:
 |**Network** throughput (GB/s)             |2*4      |2*3.4     |2*2                     |0.2 / 0.0002         | |**Network** throughput (GB/s)             |2*4      |2*3.4     |2*2                     |0.2 / 0.0002         |
 |**Network** latency (µs)                  |1.4-1.8  |1.4-1.8   |1.4-1.8                 |2                    | |**Network** latency (µs)                  |1.4-1.8  |1.4-1.8   |1.4-1.8                 |2                    |
-|**Storage** throughput ("/global", GB/s)  |20       |15        |10                      |1                    |+|**Storage** throughput (/global, GB/s)  |20       |15        |10                      |1                    |
 |**Storage** latency (ms)                  |0.03     |1         |1                       |1000                 | |**Storage** latency (ms)                  |0.03     |1         |1                       |1000                 |
  
-"Operations": Floating point / 16 cores @3GHz; Values per node\\ +Operations: Floating point / 16 cores @3GHz; Values per node\\ 
-"Memory": Mixture of three cache levels and main memory; Values per node\\ +Memory: Mixture of three cache levels and main memory; Values per node\\ 
-"Network": Medium shared by all users (large jobs only); Values per node\\ +Network: Medium shared by all users (large jobs only); Values per node\\ 
-"Storage": Medium shared by all users; Values per application+Storage: Medium shared by all users; Values per application
  
 Which one is the **limiting factor** for a given code? Which one is the **limiting factor** for a given code?
Line 55: Line 57:
   * Large problem size => subalgorithm with highest exponent of complexity becomes dominant   * Large problem size => subalgorithm with highest exponent of complexity becomes dominant
     * **Measure** compute time, preferably with profiler     * **Measure** compute time, preferably with profiler
-    * Beware: **cache effects** can 'changealgorithmic complexity+    * Beware: **cache effects** can change’ algorithmic complexity
   * Often: algorithmic complexity is well known and documented   * Often: algorithmic complexity is well known and documented
  
Line 138: Line 140:
 ===== Very important libraries ===== ===== Very important libraries =====
  
-  * **Level 3 BLAS** (="Matrix Multiplication")+  * **Level 3 BLAS** (=Matrix Multiplication)
     * O(n^3) operations with O(n^2) memory accesses     * O(n^3) operations with O(n^2) memory accesses
     * portable performance (often faster by a factor of 10)     * portable performance (often faster by a factor of 10)
Line 179: Line 181:
  
   * Naming schemes   * Naming schemes
-    * UPPER vs. lower case+    * UPPER vs. lower case
     * added underscores     * added underscores
  
Line 200: Line 202:
   * Call by value / call by reference   * Call by value / call by reference
   * Order of parameters: size of string parameters sometimes at the very end of parameter list   * Order of parameters: size of string parameters sometimes at the very end of parameter list
-  * I/O libraries +  * I/O libraries * Hint: use I/O of only one language, or * link with correct runtime libraries, or * link by calling compiler (e.g. gfortran) instead of linker (e.g. ld)
-    * Hint: use I/O of only one language, or +
-    * link with correct runtime libraries, or +
-    * link by calling compiler (e.g. gfortran) instead of linker (e.g. ld)+
   * Copy-in and/or copy-out: can affect performance   * Copy-in and/or copy-out: can affect performance
-  * Class/module information in objects ("*.o") only useful in one language+  * Class/module information in objects (*.o) only useful in one language
  
 ===== Interoperability (3) ===== ===== Interoperability (3) =====
Line 237: Line 236:
 ===== Memory optimization ===== ===== Memory optimization =====
  
-To keep CPUs busy, data has to be kept near the execution units, e.g. in **Level 1 Cache**+To keep CPUs busy, data has to be kept near the execution units, e.g. in **Level 1 Cache**
  
 Methods Methods
Line 276: Line 275:
  
 ===== Example: Blocking ===== ===== Example: Blocking =====
- 
-Three nested loops 
  
 <code> <code>
 +//Three nested loops
 N=10000; N=10000;
 for(i=0;i<N;i++) for(i=0;i<N;i++)
Line 285: Line 283:
     for(k=0;k<N;k++)     for(k=0;k<N;k++)
       C[i][j]+=A[i][k]*B[k][j];       C[i][j]+=A[i][k]*B[k][j];
 +//Large N => cache thrashing
 </code> </code>
-Large N => cache thrashing 
- 
- 
-Blocked access 
  
 <code> <code>
 +//Blocked access
 l=48; N=10000; l=48; N=10000;
 for(i=0;i<N;i+=l) for(i=0;i<N;i+=l)
Line 300: Line 296:
           for(kk=k;kk<min(k+l,N);kk++)           for(kk=k;kk<min(k+l,N);kk++)
             C[ii][jj]+=A[ii][kk]*B[kk][jj];             C[ii][jj]+=A[ii][kk]*B[kk][jj];
 +//Large N => reusing data in fast cache levels
 </code> </code>
-Large N => reusing data in fast cache levels+Adapt parameter l to size of cache – separately for each cache level
  
  
Line 324: Line 321:
  
   * 1000000 times 100B: 1000000 * 100B * 2GB/s + 1000000*2µs = 2050ms (blocking the file server)   * 1000000 times 100B: 1000000 * 100B * 2GB/s + 1000000*2µs = 2050ms (blocking the file server)
- 
-===== Parallelization ===== 
- 
-  * MPI<html><br></html> 20.-22.11.2017<html><br></html> VSC Training Course: Parallelization with MPI<html><br></html> Claudia Blaas-Schenner and Irene Reichl 
-  * OpenMP<html><br></html> 23.-24.11.2017<html><br></html> VSC Training Course: Shared memory parallelization with OpenMP<html><br></html> Lukas Einkemmer (lectures+practicals; Department of Mathematics, University of Innsbruck),<html><br></html> Claudia Blaas-Schenner and Irene Reichl (practicals only; VSC Team, TU Wien) 
- 
-http://vsc.ac.at 
  
 ===== Tools ===== ===== Tools =====
Line 350: Line 340:
   * VSC School Trainings upon request   * VSC School Trainings upon request
  
-https://wiki.vsc.ac.at/doku.php?id=doku:perf-report -- https://wiki.vsc.ac.at/doku.php?id=doku:forge+https://wiki.vsc.ac.at/doku.php?id=doku:perf-report – https://wiki.vsc.ac.at/doku.php?id=doku:forge
  
  • pandoc/introduction-to-vsc/10_performance/10_performance.txt
  • Last modified: 2020/10/20 08:09
  • by pandoc