====== Going for efficiency - how to squeeze out performance ======

  * Article written by Dieter Kvasnicka (VSC Team) <html><br></html>(last update 2017-10-17 by dk).

===== Overview =====

  * Performance?
  * Know algorithmic complexity
  * Libraries
  * Interoperability

===== Performance? =====

What is performance?

^ Type ^ Theory ^ Measured ^ Well-tuned Application ^ Untuned Application ^
| **Operations** per second (GFlop/s) | 384 | 340 | 340 | 30 / 2 |
| **Memory** throughput (GB/s) | 120 | 80 | 80 | 5 |
| **Network** throughput (GB/s) | 2*4 | 2*3.4 | 2*2 | 0.2 / 0.0002 |
| **Network** latency (µs) | 1.4-1.8 | 1.4-1.8 | 1.4-1.8 | 2 |
| **Storage** throughput ("/global", GB/s) | 20 | 15 | 10 | 1 |
| **Storage** latency (ms) | 0.03 | 1 | 1 | 1000 |

"Operations": floating point / 16 cores @ 3 GHz; values per node\\
"Memory": mixture of three cache levels and main memory; values per node\\
"Network": medium shared by all users (large jobs only); values per node\\
"Storage": medium shared by all users; values per application

Which one is the **limiting factor** for a given code?

===== Know algorithmic complexity =====

  * //Algorithmic complexity//: how the run time scales as the input parameters vary
  * HPC (High Performance Computing): execution time is typically spent on just **a few lines of code**
  * Linear algebra:
    * (Dense) matrix multiplication: O(n^3)
    * (Dense) matrix-vector multiplication: O(n^2)
    * Dot product: O(n)
    * (Dense) system of linear equations / least squares: O(n^3)
    * (Dense) eigensolver: O(n^3)
    * FFT: O(n log n)
    * Sparse algorithms: O(n) .. O(n^3)
    * ..
===== Know algorithmic complexity (2) =====

  * Other codes:
    * Identify the **main parameters**
    * Identify the **scaling** (usually polynomial)
    * Large problem sizes => the subalgorithm with the highest exponent becomes dominant
  * **Measure** compute time, preferably with a profiler
  * Beware: **cache effects** can 'change' the apparent algorithmic complexity
  * Often the algorithmic complexity is well known and documented

===== Know algorithmic complexity - Example =====

Example: program with subprograms, running with varying problem size

<code>
#include <stdio.h>
#include <stdlib.h>

static void square(long n) {
    printf("Inside square\n");
    for (long i = 0; i < n*n; i++);          /* O(n^2) busy loop */
}

void cube(long n) {
    printf("Inside cube\n");
    for (long i = 0; i < n*n/10000*n; i++);  /* O(n^3), scaled down */
}

int main(int argc, char *argv[]) {
    long n;
    if (argc == 2)
        n = atol(argv[1]);
    else
        n = 4000;
    printf("Inside main\n");
    square(n);
    cube(n);
    return 0;
}
</code>

===== Know algorithmic complexity - Example (2) =====

Makefile (recipe lines are indented with a tab)

<code>
short: a.out
	time ./a.out 5000
	gprof -Q -b a.out gmon.out
	time ./a.out 20000
	gprof -Q -b a.out gmon.out

long: gmon.out
	gprof a.out gmon.out | less

gmon.out: a.out
	time ./a.out

a.out: test_gprof.c
	gcc -std=c99 -Wall -pg test_gprof.c

clean:
	rm -f gmon.out a.out
</code>

===== Know algorithmic complexity - Example (3) =====

Execution: the dominating subroutine changes with the problem size

<code>
[profiling]$ make
gcc -std=c99 -Wall -pg test_gprof.c
time ./a.out 5000
[..]
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
70.75      0.07     0.07        1    70.75    70.75  square
30.32      0.10     0.03        1    30.32    30.32  cube

time ./a.out 20000
[..]
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
66.23      1.75     1.75        1     1.75     1.75  cube
34.84      2.67     0.92        1     0.92     0.92  square
</code>

===== Libraries =====

Use libraries - for good reasons:

  * execution performance
  * efficient development
  * well tested
  * portability
  * maintenance (by others!)
  * ..
===== Very important libraries =====

  * **Level 3 BLAS** (= "matrix multiplication")
    * O(n^3) operations with O(n^2) memory accesses
    * portable performance (often faster by a factor of 10)
    * includes triangular and symmetric / hermitian matrices
    * any three-loop operation of the form $C=\alpha AB + \beta C$ can be written using Level 3 BLAS
  * **MKL**: free **Math Kernel Library** from Intel
    * linear algebra
    * FFT
    * neural networks
    * statistics
    * sparse algorithms
    * parallel
    * ..
  * **HDF5**: Hierarchical Data Format: efficient and portable I/O library

===== Interoperability =====

Any programming language can call subroutines written in any other programming language:

  * C + Fortran
  * Matlab + C
  * Octave + C++
  * Java + MKL
  * Python + R
  * Haskell + Assembler
  * Javascript + Nvidia CUDA

Combine your **best known** tool with the **best performing** tool

===== Interoperability (2) =====

==== Wrapper ====

In many cases wrapper routines are required, mostly written in C (as the language of the operating system)

Difficulties can be

  * Naming schemes
    * UPPER vs. lower case
    * added underscores

<code>
[test]$ grep -i dsbevk\( dsbevk.F90
      SUBROUTINE DSBEVK(JOBZ, RANGE, UPLO, B2T, N, ABAND, LDABAND, VL, VU, IL, IU, ABSTOL, NE, W, Z, LDZ, &
[test]$ nm a.out | grep -i dsbevk
0000000000417504 T dsbevk_
000000000041abe1 t dsbevk_._omp_fn.0
</code>

===== Interoperability (2a) =====

==== Wrapper ====

In many cases wrapper routines are required, mostly written in C (as the language of the operating system)

Difficulties can be

  * Naming schemes
  * Call by value / call by reference
  * Order of parameters: the sizes of string parameters sometimes come at the very end of the parameter list
  * I/O libraries
    * Hint: use the I/O of only one language, or
    * link with the correct runtime libraries, or
    * link by calling the compiler (e.g. gfortran) instead of the linker (e.g. ld)
  * Copy-in and/or copy-out: can affect performance
  * Class/module information in objects ("*.o") is only useful in one language

===== Interoperability (3) =====

==== Wrapper example ====

<code fortran>
[C_Fortran]$ more f.f
      program y
      write(*,'(I1," + ",I1," = ",I1)') 3, 5, intadd(3,5)
      end program
</code>

<code>
[C_Fortran]$ more c.c
int intadd_(int *a, int *b) {
    return (*a)+(*b);
}
</code>

<code>
[C_Fortran]$ more makefile
run: a.out
	./a.out

a.out: c.c f.f
	gfortran c.c f.f
</code>

<code>
[C_Fortran]$ make
gfortran c.c f.f
./a.out
3 + 5 = 8
</code>

===== Memory optimization =====

To keep the CPUs busy, data has to be kept near the execution units, e.g. in the **Level 1 cache**

Methods:

  * Optimized libraries
  * Blocking
  * Reordering of operations
  * Reuse data as much as possible
  * Reuse cache lines (64 byte = 8 doubles): the smallest unit of memory access
  * Access by row / column (depending on the chosen language)
  * ..

===== Example: Row - Column =====

Column access

<code>
for(j=0;j<N;j++)
  for(i=0;i<N;i++)
    A[i][j]+=2.0;
</code>

Strided memory access in C\\
Contiguous memory access in Fortran

Row access

<code>
for(i=0;i<N;i++)
  for(j=0;j<N;j++)
    A[i][j]+=2.0;
</code>

Contiguous memory access in C\\
Strided memory access in Fortran

===== Example: Blocking =====

<code>
//Three nested loops
N=10000;
for(i=0;i<N;i++)
  for(j=0;j<N;j++)
    for(k=0;k<N;k++)
      C[i][j]+=A[i][k]*B[k][j];
//Large N => cache thrashing
</code>

<code>
//Blocked access
l=48; N=10000;
for(i=0;i<N;i+=l)
  for(j=0;j<N;j+=l)
    for(k=0;k<N;k+=l)
      for(ii=i;ii<min(i+l,N);ii++)
        for(jj=j;jj<min(j+l,N);jj++)
          for(kk=k;kk<min(k+l,N);kk++)
            C[ii][jj]+=A[ii][kk]*B[kk][jj];
//Large N => reusing data in the fast cache levels
</code>

Adapt the parameter l to the cache size – separately for each cache level

Even better:

<code>
double ONE=1.0;
// calling the Fortran BLAS from C: all arguments are passed by reference
dgemm_("N","N",&N,&N,&N,&ONE,A,&N,B,&N,&ONE,C,&N);
// dgemm(TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC)
</code>

===== Storage and Network =====

Small messages are costly

Example: write 100 MB

  * in one block to storage:
100 MB / 2 GB/s + 2 µs = 50.002 ms\\
  * as 1000000 messages of 100 B: 1000000 * (100 B / 2 GB/s + 2 µs) = 2050 ms (blocking the file server)

===== Tools =====

  * Debugger
    * gdb
    * Allinea DDT
  * Profiler
    * gprof
    * Allinea Performance Reports: quick performance overview (runtime - memory - threads - I/O - MPI)
  * Instrumentation
    * Allinea MAP
    * Intel performance tools

<code>
$ ls /cm/shared/apps/intel
[..] advisor [..] inspector [..] vtune [..]
</code>

  * VSC School Trainings upon request

https://wiki.vsc.ac.at/doku.php?id=doku:perf-report – https://wiki.vsc.ac.at/doku.php?id=doku:forge