====== Going for efficiency - how to get performance out ======
^Type ^Theory ^
|**Operations** per second (GFlops/s) | |
|**Memory** throughput (GB/s) | |
|**Network** throughput (GB/s) | |
|**Network** latency (µs) |1.4-1.8 |
|**Storage** throughput ("/global", GB/s) |20 |
|**Storage** latency (ms) |0.03 |

"Operations": Floating point / 16 cores @3GHz; Values per node\\
"Memory": Mixture of three cache levels and main memory; Values per node\\
"Network": Medium shared by all users (large jobs only); Values per node\\
"Storage": Medium shared by all users; Values per application
Which one is the **limiting factor** for a given code?
  * Large problem size => subalgorithm with highest exponent of complexity becomes dominant
  * **Measure** compute time, preferably with profiler
  * Beware: **cache effects** can 'change' algorithmic complexity
  * Often: algorithmic complexity is well known and documented
===== Very important libraries =====
  * **Level 3 BLAS** (="Matrix Multiplication")
    * O(n^3) operations with O(n^2) memory accesses
    * portable performance (often faster by a factor of 10)
  * Naming schemes
    * UPPER vs. lower case
    * added underscores
  * Call by value / call by reference
  * Order of parameters: size of string parameters sometimes at the very end of parameter list
  * I/O libraries
    * Hint: use I/O of only one language, or
    * link with correct runtime libraries, or
    * link by calling compiler (e.g. gfortran) instead of linker (e.g. ld)
  * Copy-in and/or copy-out: can affect performance
  * Class/
===== Memory optimization =====
To keep CPUs busy, data has to be kept near the execution units, e.g. in **Level 1 Cache**.
Methods
  * ...
===== Example: Row - Column Access =====

Column Access

<code>
for(j=0; j<N; j++)
  for(i=0; i<N; i++)
    A[i][j] += 2.0;
</code>
Strided memory access in C\\
Contiguous memory access in Fortran

Row Access

<code>
for(i=0; i<N; i++)
  for(j=0; j<N; j++)
    A[i][j] += 2.0;
</code>
Contiguous memory access in C\\
Strided memory access in Fortran
===== Example: Blocking =====
<code>
//Three nested loops
N=10000;
for(i=0; i<N; i++)
  for(j=0; j<N; j++)
    for(k=0; k<N; k++)
      C[i][j] += A[i][k]*B[k][j];
//Large N => cache thrashing
</code>
<code>
//Blocked access
l=48; N=10000;
for(i=0; i<N; i+=l)
  for(j=0; j<N; j+=l)
    for(k=0; k<N; k+=l)
      for(ii=i; ii<min(i+l,N); ii++)
        for(jj=j; jj<min(j+l,N); jj++)
          for(kk=k; kk<min(k+l,N); kk++)
            C[ii][jj] += A[ii][kk]*B[kk][jj];
//Large N => reusing data in fast cache levels
</code>
Adapt parameter l to size of cache – separately for each cache level
Even better:

<code>
double ONE=1.0;
dgemm("N", "N", N, N, N, ONE, A, N, B, N, ONE, C, N);
// dgemm (TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC)
</code>
===== Storage and Network =====
Small messages are costly.

Example: Write 100MB

  * in one block to storage: 100MB / 2GB/s + 2µs = 50.002ms
  * 1000000 times 100B: 1000000 * 100B / 2GB/s + 1000000 * 2µs = 2050ms (blocking the file server)
===== Tools =====
  * VSC School Trainings upon request
https://vsc.ac.at