====== Going for efficiency - how to get out performance ======
|**Network** throughput (GB/s) | |
|**Network** latency (µs) |1.4-1.8 |
|**Storage** throughput ("/global", GB/s) |20 |
|**Storage** latency (ms) |0.03 |

"Operations": Floating point / 16 cores @3GHz; Values per node\\
"Memory": Mixture of three cache levels and main memory; Values per node\\
"Network": Medium shared by all users (large jobs only); Values per node\\
"Storage": Medium shared by all users; Values per application
Which one is the **limiting factor** for a given code?
  * Large problem size => subalgorithm with highest exponent of complexity becomes dominant
  * **Measure** compute time, preferably with profiler
  * Beware: **cache effects** can 'change' algorithmic complexity
  * Often: algorithmic complexity is well known and documented
===== Very important libraries =====
  * **Level 3 BLAS** (="Matrix Multiplication")
    * O(n^3) operations with O(n^2) memory accesses
    * portable performance (often faster by a factor of 10)
  * Naming schemes
    * UPPER vs. lower case
    * added underscores
  * Call by value / call by reference
  * Order of parameters: size of string parameters sometimes at the very end of parameter list
  * I/O libraries
    * Hint: use I/O of only one language, or
    * link with correct runtime libraries, or
    * link by calling compiler (e.g. gfortran) instead of linker (e.g. ld)
  * Copy-in and/or copy-out: can affect performance
  * Class/
===== Memory optimization =====

To keep CPUs busy, data has to be kept near the execution units, e.g. in **Level 1 Cache**.

Methods
===== Example: Blocking =====

<code>
//Three nested loops
N=10000;
for(i=0; i<N; i++)
  for(j=0; j<N; j++)
    for(k=0; k<N; k++)
      C[i][j]+=A[i][k]*B[k][j];
//Large N => cache thrashing
</code>
- | |||
- | |||
- | Blocked access | ||
< | < | ||
+ | //Blocked access | ||
l=48; N=10000; | l=48; N=10000; | ||
for(i=0; | for(i=0; | ||
Line 300: | Line 296: | ||
for(kk=k; | for(kk=k; | ||
C[ii][jj]+=A[ii][kk]*B[kk][jj]; | C[ii][jj]+=A[ii][kk]*B[kk][jj]; | ||
+ | //Large N => reusing data in fast cache levels | ||
</ | </ | ||
- | Large N => reusing data in fast cache levels | + | Adapt parameter l to size of cache – separately for each cache level |
  * 1000000 times 100B: 1000000 * (100B / 2GB/s + 2µs) = 50ms + 2000ms = 2050ms (blocking the file server)
- | |||
- | ===== Parallelization ===== | ||
- | |||
- | * MPI< | ||
- | * OpenMP< | ||
- | |||
- | http:// | ||
===== Tools =====
  * VSC School Trainings upon request

https://