====== Going for efficiency - how to get performance out ======

  * Article written by Dieter Kvasnicka (VSC Team)

===== Overview =====

  * Performance?
  * Know algorithmic complexity
  * Libraries
  * Interoperability

===== Performance? =====

What is performance?

^ Type ^ Theory ^
| **Operations** per second (GFlop/s) | |
| **Memory** throughput (GB/s) | |
| **Network** throughput (GB/s) | |
| **Network** latency (µs) | 1.4-1.8 |
| **Storage** throughput (GB/s) | |
| **Storage** latency (ms) | 0.03 |

"Operations": [..]\\
"Memory": [..]\\
"Network": [..]\\
"Storage": [..]

Which one is the **limiting factor** for a given code?

+ | |||
+ | ===== Know algorithmic complexity ===== | ||
+ | |||
+ | * // | ||
+ | * HPC (High Performance Computing): execution time spent on just **a few lines of code** | ||
+ | |||
+ | * Linear Algebra | ||
+ | * (Dense) matrix multiplication: | ||
+ | * (Dense) matrix vector multiplication: | ||
+ | * Dot product: O(n) | ||
+ | * (Dense) system of linear equations / least squares: O(n^3) | ||
+ | * (Dense) Eigensolver: | ||
+ | * FFT: O(n * log(n)) | ||
+ | * Sparse algorithms: O(n) .. O(n^3) | ||
+ | * .. | ||
+ | |||
+ | |||
===== Know algorithmic complexity (2) =====

  * Other codes
    * Identify the **main parameters**
    * Identify the **scaling** (usually polynomial)

  * Large problem size => the subalgorithm with the highest exponent of complexity becomes dominant
  * **Measure** compute time, preferably with a profiler
  * Beware: **cache effects** can 'change' the apparent algorithmic complexity
  * Often the algorithmic complexity is well known and documented

===== Know algorithmic complexity - Example =====

Example: program with subprograms of different algorithmic complexity, O(n^2) and O(n^3)

<code c>
#include <stdio.h>
#include <stdlib.h>

static double sink;   /* keeps the compiler from removing the loops */

static void square(long n)                 /* O(n^2) iterations */
{ printf("square\n");
  for(long i=0; i<n*n; i++) sink+=1.0;
}
void cube(long n)                          /* O(n^3) iterations, scaled down by a constant */
{ printf("cube\n");
  for(long i=0; i<n*n*n/10000; i++) sink+=1.0;
}
int main(int argc, char *argv[])
{ long n=0;
  if(argc == 2) n=atoi(argv[1]);
  printf("n=%ld\n", n);
  square(n);
  cube(n);
  return 0;
}
</code>
===== Know algorithmic complexity - Example (2) =====

Makefile

<code make>
short: a.out
	time ./a.out 5000
	gprof -Q -b a.out gmon.out
	time ./a.out 20000
	gprof -Q -b a.out gmon.out

long: gmon.out
	gprof a.out gmon.out | less

gmon.out: a.out
	time ./a.out 5000

a.out: test_gprof.c
	gcc -std=c99 -Wall -pg test_gprof.c

clean:
	rm gmon.out a.out 2>/dev/null
</code>
===== Know algorithmic complexity - Example (3) =====

Execution: the dominating subroutine changes with the problem size

<code>
[profiling]$ make
gcc -std=c99 -Wall -pg test_gprof.c
time ./a.out 5000
[..]
  %   cumulative   self
 time   seconds   seconds   calls   name
 [..]                               square
 [..]                               cube
time ./a.out 20000
[..]
  %   cumulative   self
 time   seconds   seconds   calls   name
 [..]                               cube
 [..]                               square
</code>
===== Libraries =====

Use libraries for good reasons:

  * execution performance
  * efficient development
  * well tested
  * portability
  * maintenance (by others!)
  * ..

===== Very important libraries =====

  * **Level 3 BLAS** (= "Matrix Multiplication")
    * O(n^3) operations with O(n^2) memory accesses
    * portable performance (often faster by a factor of 10)
    * including triangular and symmetric / hermitian matrices
    * any three-loop operation of the form $C=\alpha AB + \beta C$ can be written using Level 3 BLAS

  * MKL: free **Math Kernel Library** from Intel
    * Linear Algebra
    * FFT
    * Neural networks
    * Statistics
    * Sparse algorithms
    * Parallel
    * ..
  * **HDF5**: Hierarchical Data Format: efficient and portable I/O library

===== Interoperability =====

Any programming language can call subroutines written in any other programming language:

  * C + Fortran
  * Matlab + C
  * Octave + C++
  * Java + MKL
  * Python + R
  * Haskell + Assembler
  * Javascript + Nvidia Cuda

Combine your **best known** tool with the **best performing** tool

===== Interoperability (2) =====

==== Wrapper ====

In many cases wrapper routines are required, mostly written in C (as the language of the operating system).

Difficulties can be:

  * Naming schemes
    * UPPER vs. lower case
    * added underscores

<code>
[test]$ grep -i dsbevk\( dsbevk.F90
      SUBROUTINE DSBEVK(JOBZ, [..]
[test]$ nm a.out | grep -i dsbevk
0000000000417504 T dsbevk_
000000000041abe1 t dsbevk_._omp_fn.0
</code>
===== Interoperability (2a) =====

==== Wrapper ====

In many cases wrapper routines are required, mostly written in C (as the language of the operating system).

Difficulties can be:

  * Naming schemes
  * Call by value / call by reference
  * Order of parameters: the size of string parameters is sometimes passed at the very end of the parameter list
  * I/O libraries
    * Hint: use the I/O of only one language, or
    * link with the correct runtime libraries, or
    * link by calling the compiler (e.g. gfortran) instead of the linker (e.g. ld)
  * Copy-in and/or copy-out: can affect performance
  * Class/ [..]

===== Interoperability (3) =====

==== Wrapper example ====

<code fortran>
[C_Fortran]$ more f.f
      program y
      write(*,'(I1," + ",I1," = ",I1)') 3, 5, intadd(3,5)
      end program
</code>
<code c>
[C_Fortran]$ more c.c
int intadd_(int *a, int *b) { return (*a)+(*b); }
</code>
<code make>
[C_Fortran]$ more makefile
run: a.out
	./a.out

a.out: c.c f.f
	gfortran c.c f.f
</code>
<code>
[C_Fortran]$ make
gfortran c.c f.f
./a.out
3 + 5 = 8
</code>
===== Memory optimization =====

To keep the CPUs busy, data has to be kept near the execution units, e.g. in the **Level 1 Cache**

Methods:

  * Optimized libraries
  * Blocking
  * Reordering of operations
  * Reuse data as much as possible
  * Reuse cache lines (64 byte = 8 doubles): the smallest unit of memory access
  * Access by row / column (depending on the chosen language)
  * ..

===== Example: Row - Column =====

Column Access

<code c>
for(j=0; j<N; j++)
  for(i=0; i<N; i++)
    A[i][j]+=2.0;
</code>
Strided memory access in C\\
Contiguous memory access in Fortran

Row Access

<code c>
for(i=0; i<N; i++)
  for(j=0; j<N; j++)
    A[i][j]+=2.0;
</code>
Contiguous memory access in C\\
Strided memory access in Fortran

===== Example: Blocking =====

<code c>
//Three nested loops
N=10000;
for(i=0; i<N; i++)
  for(j=0; j<N; j++)
    for(k=0; k<N; k++)
      C[i][j]+=A[i][k]*B[k][j];
//Large N => cache thrashing
</code>

<code c>
//Blocked access
l=48; N=10000;
for(i=0; i<N; i+=l)
  for(j=0; j<N; j+=l)
    for(k=0; k<N; k+=l)
      for(ii=i; ii<i+l && ii<N; ii++)
        for(jj=j; jj<j+l && jj<N; jj++)
          for(kk=k; kk<k+l && kk<N; kk++)
            C[ii][jj]+=A[ii][kk]*B[kk][jj];
//Large N => reuses data in the fast cache levels
</code>
Adapt the parameter l to the size of the cache, separately for each cache level.

Even better: use **dgemm** (Level 3 BLAS)

<code c>
double ONE=1.0;
dgemm_("N", "N", &N, &N, &N, &ONE, A, &N, B, &N, &ONE, C, &N);
// dgemm (TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC)
</code>

===== Storage and Network =====

Small messages are costly.

Example: Write 100MB

  * in one block to storage: 100MB / 2GB/s + 2µs = 50.002ms
  * in 1000000 pieces of 100B each: 1000000 * 100B / 2GB/s + 1000000 * 2µs = 2050ms (blocking the file server)

===== Tools =====

  * Debugger
    * gdb
    * Allinea ddt
  * Profiler
    * gprof
    * Allinea Performance Reports: quick performance overview (Runtime - Memory - Threads - I/O - MPI)
  * Instrumentation
    * Allinea map
    * Intel performance tools

<code>
$ ls /
[..] advisor [..] inspector [..] vtune [..]
</code>

  * VSC School Trainings upon request

https://