====== Going for efficiency - how to get performance out ======

  * Article written by Dieter Kvasnicka (VSC Team)
  * (last update 2017-10-17 by dk)

===== Overview =====

  * Performance?
  * Know algorithmic complexity
  * Libraries
  * Interoperability

===== Performance? =====

What is performance?

^ Type ^ Theory ^ Measured ^ Well-tuned application ^ Untuned application ^
| **Operations** per second (GFlop/s) | 384 | 340 | 340 | 30 / 2 |
| **Memory** throughput (GB/s) | 120 | 80 | 80 | 5 |
| **Network** throughput (GB/s) | 2*4 | 2*3.4 | 2*2 | 0.2 / 0.0002 |
| **Network** latency (µs) | 1.4-1.8 | 1.4-1.8 | 1.4-1.8 | 2 |
| **Storage** throughput ("/global", GB/s) | 20 | 15 | 10 | 1 |
| **Storage** latency (ms) | 0.03 | 1 | 1 | 1000 |

"Operations": floating point, 16 cores @ 3 GHz; values per node\\
"Memory": mixture of three cache levels and main memory; values per node\\
"Network": medium shared by all users (large jobs only); values per node\\
"Storage": medium shared by all users; values per application

Which one is the **limiting factor** for a given code?

===== Know algorithmic complexity =====

  * //Algorithmic complexity//: scaling with varying input parameters
  * HPC (High Performance Computing): execution time is spent on just **a few lines of code**
  * Linear algebra
    * (Dense) matrix multiplication: O(n^3)
    * (Dense) matrix-vector multiplication: O(n^2)
    * Dot product: O(n)
    * (Dense) system of linear equations / least squares: O(n^3)
    * (Dense) eigensolver: O(n^3)
    * FFT: O(n * log(n))
    * Sparse algorithms: O(n) .. O(n^3)
    * ..

===== Know algorithmic complexity (2) =====

  * Other codes
    * Identify **main parameters**
    * Identify **scaling** (usually polynomial)
  * Large problem size => the subalgorithm with the highest exponent of complexity becomes dominant
  * **Measure** compute time, preferably with a profiler
  * Beware: **cache effects** can 'change' the observed algorithmic complexity
  * Often: the algorithmic complexity is well known and documented

===== Know algorithmic complexity - Example =====

Example: program with subprograms, running with varying problem size

<code c>
#include <stdio.h>
#include <stdlib.h>

static volatile double sink;        /* keeps the compiler from removing the loops */

static void square(long n)          /* O(n^2) dummy work */
{
    printf("Inside square \n");
    for (long i = 0; i < n; i++)
        for (long j = 0; j < n; j++)
            sink += (double)i * j;
}

static void cube(long n)            /* O(n^3) dummy work on a reduced size */
{
    printf("Inside cube \n");
    for (long i = 0; i < n/20; i++)
        for (long j = 0; j < n/20; j++)
            for (long k = 0; k < n/20; k++)
                sink += (double)i * j * k;
}

int main(int argc, char **argv)
{
    long n = (argc > 1) ? atol(argv[1]) : 5000;
    square(n);
    cube(n);
    return 0;
}
</code>

===== Know algorithmic complexity - Example (2) =====

Makefile

<code make>
short: a.out
	time ./a.out 5000
	gprof -Q -b a.out gmon.out
	time ./a.out 20000
	gprof -Q -b a.out gmon.out

long: gmon.out
	gprof a.out gmon.out | less

gmon.out: a.out
	time ./a.out

a.out: test_gprof.c
	gcc -std=c99 -Wall -pg test_gprof.c

clean:
	rm gmon.out a.out 2>/dev/null
</code>

===== Know algorithmic complexity - Example (3) =====

Execution: the dominating subroutine changes with the problem size

<code>
[profiling]$ make
gcc -std=c99 -Wall -pg test_gprof.c
time ./a.out 5000
[..]
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 70.75      0.07     0.07        1    70.75    70.75  square
 30.32      0.10     0.03        1    30.32    30.32  cube
time ./a.out 20000
[..]
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 66.23      1.75     1.75        1     1.75     1.75  cube
 34.84      2.67     0.92        1     0.92     0.92  square
</code>

===== Libraries =====

Use libraries - for good reasons:

  * execution performance
  * efficient development
  * well tested
  * portability
  * maintenance (by others!)
  * ..

===== Very important libraries =====

  * **Level 3 BLAS** (= "matrix multiplication")
    * O(n^3) operations with O(n^2) memory accesses
    * portable performance (often faster by a factor of 10)
    * including triangular and symmetric / hermitian matrices
    * any three-loop operation of the form $C=\alpha AB + \beta C$ can be written using Level 3 BLAS
  * **MKL**: free **Math Kernel Library** from Intel
    * Linear algebra
    * FFT
    * Neural networks
    * Statistics
    * Sparse algorithms
    * Parallel
    * ..
  * **HDF5**: Hierarchical Data Format: efficient and portable I/O library (see the sketch below)
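The HDF5 C API writes self-describing binary files that other languages and tools can read directly. A minimal sketch of writing a double array, assuming HDF5 is installed; the file name ''matrix.h5'' and dataset name ''/A'' are arbitrary, and the program can be compiled with the ''h5cc'' wrapper:

<code c>
#include "hdf5.h"

int main(void)
{
    double data[100][200];                    /* example array to be stored */
    hsize_t dims[2] = {100, 200};

    for (int i = 0; i < 100; i++)             /* fill with some example values */
        for (int j = 0; j < 200; j++)
            data[i][j] = i + 0.001 * j;

    /* create the file, describe the array shape, create the dataset */
    hid_t file  = H5Fcreate("matrix.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "/A", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* write the whole array in one call */
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}
</code>

The resulting file can afterwards be read from Fortran, Python (h5py) or other HDF5-aware tools without a conversion step.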
===== Interoperability =====

Any programming language can call subroutines of any other programming language

  * C + Fortran
  * Matlab + C
  * Octave + C++
  * Java + MKL
  * Python + R
  * Haskell + Assembler
  * Javascript + Nvidia Cuda

Combine your **best known** with the **best performing** tool

===== Interoperability (2) =====

==== Wrapper ====

In many cases wrapper routines are required, mostly written in C (as the language of the operating system)

Difficulties can be

  * Naming schemes
    * UPPER vs. lower case
    * added underscores

<code>
[test]$ grep -i dsbevk\( dsbevk.F90
SUBROUTINE DSBEVK(JOBZ, RANGE, UPLO, B2T, N, ABAND, LDABAND, VL, VU, IL, IU, ABSTOL, NE, W, Z, LDZ, &
[test]$ nm a.out | grep -i dsbevk
0000000000417504 T dsbevk_
000000000041abe1 t dsbevk_._omp_fn.0
</code>

===== Interoperability (2a) =====

==== Wrapper ====

In many cases wrapper routines are required, mostly written in C (as the language of the operating system)

Difficulties can be

  * Naming schemes
  * Call by value / call by reference
  * Order of parameters: the sizes of string parameters are sometimes placed at the very end of the parameter list
  * I/O libraries
    * Hint: use the I/O of only one language, or
    * link with the correct runtime libraries, or
    * link by calling the compiler (e.g. gfortran) instead of the linker (e.g. ld)
  * Copy-in and/or copy-out: can affect performance
  * Class/module information in object files ("*.o") is only useful in one language

===== Interoperability (3) =====

==== Wrapper example ====

<code>
[C_Fortran]$ more f.f
      program y
      write(*,'(I1," + ",I1," = ",I1)') 3, 5, intadd(3,5)
      end program
[C_Fortran]$ more c.c
int intadd_(int *a, int *b) {
  return (*a)+(*b);
}
[C_Fortran]$ more makefile
run: a.out
	./a.out
a.out: c.c f.f
	gfortran c.c f.f
[C_Fortran]$ make
gfortran c.c f.f
./a.out
3 + 5 = 8
</code>

===== Memory optimization =====

To keep the CPUs busy, data has to be kept near the execution units, e.g. in the **Level 1 cache**

Methods

  * Optimized libraries
  * Blocking
  * Reordering of operations
  * Reuse data as much as possible
  * Reuse cache lines (64 byte = 8 doubles): the smallest unit of memory access
  * Access by row / column (depending on the chosen language)
  * ..

===== Example: Row - Column =====

Column access

<code c>
for (j = 0; j < N; j++)
    for (i = 0; i < N; i++)
        sum += a[i][j];          /* first index varies fastest */
</code>

Strided memory access in C\\
Contiguous memory access in Fortran

Row access

<code c>
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        sum += a[i][j];          /* last index varies fastest */
</code>

Contiguous memory access in C\\
Strided memory access in Fortran

===== Example: Blocking =====

Three nested loops

<code c>
N = 10000;
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
            c[i][j] += a[i][k] * b[k][j];
</code>

=> cache thrashing

Blocked access

<code c>
l = 48; N = 10000;
for (i = 0; i < N; i += l)
  for (j = 0; j < N; j += l)
    for (k = 0; k < N; k += l)
      for (ii = i; ii < i+l && ii < N; ii++)
        for (jj = j; jj < j+l && jj < N; jj++)
          for (kk = k; kk < k+l && kk < N; kk++)
            c[ii][jj] += a[ii][kk] * b[kk][jj];
</code>

=> reusing data in the fast cache levels

Adapt the block size l to the size of the cache – separately for each cache level

Even better:

<code c>
double ONE = 1.0;
dgemm_("N", "N", &N, &N, &N, &ONE, A, &N, B, &N, &ONE, C, &N);
/* dgemm(TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC) */
</code>

===== Storage and Network =====

Small messages are costly

Example: write 100 MB

  * in one block to storage: 100 MB / 2 GB/s + 2 µs = 50.002 ms
  * in 1000000 pieces of 100 B: 1000000 * (100 B / 2 GB/s) + 1000000 * 2 µs = 2050 ms (blocking the file server)

===== Tools =====

  * Debugger
    * gdb
    * Allinea DDT
  * Profiler
    * gprof
    * Allinea Performance Reports: quick performance overview (Runtime - Memory - Threads - I/O - MPI)
  * Instrumentation
    * Allinea MAP
  * Intel performance tools

<code>
$ ls /cm/shared/apps/intel
[..] advisor [..] inspector [..] vtune [..]
</code>

  * VSC School Trainings upon request

https://wiki.vsc.ac.at/doku.php?id=doku:perf-report – https://wiki.vsc.ac.at/doku.php?id=doku:forge