====== Going for efficiency - how to get out performance ======
  
  
^Type                                      ^Theory   ^Measured  ^Well-tuned Application  ^Untuned Application  ^
|**Operations** per second (GFlops/s)      |384      |340       |340                     |30 / 2               |
|**Memory** throughput (GB/s)              |120      |80        |80                      |5                    |
|**Network** throughput (GB/s)             |2*4      |2*3.4     |2*2                     |0.2 / 0.0002         |
|**Network** latency (µs)                  |1.4-1.8  |1.4-1.8   |1.4-1.8                 |2                    |
|**Storage** throughput (/global, GB/s)    |20       |15        |10                      |1                    |
|**Storage** latency (ms)                  |0.03     |1         |1                       |1000                 |

Operations: Floating point / 16 cores @3GHz; Values per node\\
Memory: Mixture of three cache levels and main memory; Values per node\\
Network: Medium shared by all users (large jobs only); Values per node\\
Storage: Medium shared by all users; Values per application
  
Which one is the **limiting factor** for a given code?
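A quick way to tell is to compare a kernel's arithmetic intensity (floating-point operations per byte of memory traffic) with the ratios in the table. A minimal sketch, using a daxpy-like loop as the example kernel and the table's measured values only as rough orders of magnitude:

<code>
/* Example kernel: y[i] += a*x[i]
   2 flops per iteration, roughly 3 doubles = 24 bytes of memory traffic. */
#include <stddef.h>

void daxpy_like(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Arithmetic intensity: 2 flops / 24 bytes ~ 0.08 flops/byte.
   At ~80 GB/s measured memory throughput this allows at most ~7 GFlops/s,
   far below the ~340 GFlops/s compute limit => this loop is memory-bound. */
</code>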
  * Large problem size => subalgorithm with highest exponent of complexity becomes dominant
    * **Measure** compute time, preferably with profiler (see the sketch below)
    * Beware: **cache effects** can 'change' algorithmic complexity
  * Often: algorithmic complexity is well known and documented
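A minimal sketch of the measurement idea, using only the C standard library; the matrix-vector kernel is just a stand-in for the code under test. If T(2N)/T(N) is roughly 2^k, the dominant term scales like N^k:

<code>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

/* Stand-in kernel: dense matrix-vector product, O(n^2) work. */
static double time_kernel(int n)
{
    double *A = malloc((size_t)n * n * sizeof *A);
    double *x = malloc((size_t)n * sizeof *x);
    double *y = calloc((size_t)n, sizeof *y);
    for (size_t i = 0; i < (size_t)n * n; i++) A[i] = 1.0;
    for (int i = 0; i < n; i++) x[i] = 1.0;

    clock_t t0 = clock();
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            y[i] += A[(size_t)i * n + j] * x[j];
    double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

    free(A); free(x); free(y);
    return sec;
}

int main(void)
{
    int n = 2000;
    double t1 = time_kernel(n), t2 = time_kernel(2 * n);
    /* estimated exponent of the dominant term */
    printf("k = %.1f\n", log(t2 / t1) / log(2.0));
    return 0;
}
</code>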
  
===== Very important libraries =====
  
  * **Level 3 BLAS** (= Matrix Multiplication)
    * O(n^3) operations with O(n^2) memory accesses
    * portable performance (often faster by a factor of 10)
  
  * Naming schemes
    * UPPER vs. lower case
    * added underscores
  
  * Call by value / call by reference (see the sketch after this list)
  * Order of parameters: size of string parameters sometimes at the very end of parameter list
  * I/O libraries
    * Hint: use I/O of only one language, or
    * link with correct runtime libraries, or
    * link by calling compiler (e.g. gfortran) instead of linker (e.g. ld)
  * Copy-in and/or copy-out: can affect performance
  * Class/module information in objects ("*.o") only useful in one language
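To make the naming and argument-passing conventions concrete, here is a minimal sketch of C calling a Fortran subroutine. The routine vsum and its Fortran source are invented for the example; the lower-case name with a trailing underscore matches gfortran's default mangling, and other compilers may differ:

<code>
/* Fortran side (assumed, in vsum.f90, compiled with gfortran):
     subroutine vsum(n, x, s)
       integer :: n
       double precision :: x(n), s
       s = sum(x(1:n))
     end subroutine
   gfortran lower-cases the name and appends an underscore: vsum_ */

#include <stdio.h>

void vsum_(int *n, double *x, double *s);   /* all arguments by reference */

int main(void)
{
    double x[4] = {1.0, 2.0, 3.0, 4.0}, s = 0.0;
    int n = 4;
    vsum_(&n, x, &s);          /* call by reference, as Fortran expects */
    printf("sum = %f\n", s);
    return 0;
}

/* Link by calling the compiler driver, e.g.  gfortran main.o vsum.o,
   so that both the C and the Fortran runtime libraries are pulled in. */
</code>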
===== Memory optimization =====
  
To keep CPUs busy, data has to be kept near the execution units, e.g. in **Level 1 Cache**.
  
Methods
  * ..
  
===== Example: Row - Column =====

Column Access

 +for(j=0;i<N;i++) 
 +  for(i=0;j<N;j++) 
 +    A[i][j]+=2.0; 
 +</code> 
 +Strided memory access in C\\ 
 +Contiguous memory access in Fortran 
 + 
 + 
 +Row Access 
 + 
 +<code> 
 +for(i=0;i<N;i++) 
 +  for(j=0;j<N;j++) 
 +    A[i][j]+=2.0; 
 +</code> 
 +Contiguous in memory in C\\ 
 +Strided memory access in Fortran 
 + 
 + 
 + 
 + 
===== Example: Blocking =====

<code>
//Three nested loops
N=10000;
for(i=0;i<N;i++)
  for(j=0;j<N;j++)
    for(k=0;k<N;k++)
      C[i][j]+=A[i][k]*B[k][j];
//Large N => cache thrashing
</code>

<code>
//Blocked access
l=48; N=10000;
for(i=0;i<N;i+=l)
  for(j=0;j<N;j+=l)
    for(k=0;k<N;k+=l)
      for(ii=i;ii<min(i+l,N);ii++)
        for(jj=j;jj<min(j+l,N);jj++)
          for(kk=k;kk<min(k+l,N);kk++)
            C[ii][jj]+=A[ii][kk]*B[kk][jj];
//Large N => reusing data in fast cache levels
</code>
Adapt the parameter l to the size of the cache, separately for each cache level.
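As a rough rule of thumb (assuming double precision and that one l x l block of each of the three matrices should stay resident): 3 * l * l * 8 bytes <= cache size. For a typical 32 KB L1 data cache this gives l of about 36, for a 256 KB L2 cache about 100.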

Even better: call an optimized **BLAS** routine and let it do the blocking:

<code>
double ONE=1.0;
dgemm("N","N",N,N,N,&ONE,A,N,B,N,&ONE,C,N);
// dgemm (TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC)
</code>
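How dgemm is actually reached from C depends on the installed BLAS: either through the Fortran symbol (typically dgemm_, all arguments passed by reference, as in the mixed-language notes above) or through the CBLAS interface. A minimal CBLAS sketch, assuming row-major C arrays and a linked BLAS that ships cblas.h:

<code>
#include <cblas.h>

/* C += A*B for n x n matrices stored row-major in plain double arrays */
void matmul(int n, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,        /* M, N, K       */
                1.0, A, n,      /* alpha, A, lda */
                     B, n,      /* B, ldb        */
                1.0, C, n);     /* beta, C, ldc  */
}
</code>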

===== Storage and Network =====

Small messages are costly.

Example: Write 100MB

  * in one block to storage: 100MB / 2GB/s + 2µs = 50.002ms
  * 1000000 times 100B: 1000000 * 100B / 2GB/s + 1000000 * 2µs = 2050ms (blocking the file server)
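The same contrast in a minimal I/O sketch (the file name out.dat and the 100-byte record size are made up for the example; the C library's own buffering and the parallel file system will shift the absolute numbers):

<code>
#include <stdio.h>
#include <stdlib.h>

#define TOTAL (100u * 1000u * 1000u)   /* 100 MB */

int main(void)
{
    char *buf = calloc(TOTAL, 1);
    FILE *f = fopen("out.dat", "wb");
    if (!buf || !f) return 1;

    /* variant 1: one large write - cost dominated by bandwidth */
    fwrite(buf, 1, TOTAL, f);

    /* variant 2: a million 100-byte writes - cost dominated by per-operation overhead
    for (size_t i = 0; i < TOTAL; i += 100)
        fwrite(buf + i, 1, 100, f);
    */

    fclose(f);
    free(buf);
    return 0;
}
</code>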
  
===== Tools =====
  * VSC School Trainings upon request
  
https://wiki.vsc.ac.at/doku.php?id=doku:perf-report -- https://wiki.vsc.ac.at/doku.php?id=doku:forge
  