Going for efficiency - how to get performance out
- Article written by Dieter Kvasnicka (VSC Team); last update 2017-10-17 by dk.
Overview
- Performance?
- Know algorithmic complexity
- Libraries
- Interoperability
Performance?
What is performance?
| Type | Theory | Measured | Well-tuned Application | Untuned Application |
| --- | --- | --- | --- | --- |
| Operations per second (GFlop/s) | 384 | 340 | 340 | 30 / 2 |
| Memory throughput (GB/s) | 120 | 80 | 80 | 5 |
| Network throughput (GB/s) | 2*4 | 2*3.4 | 2*2 | 0.2 / 0.0002 |
| Network latency (µs) | 1.4-1.8 | 1.4-1.8 | 1.4-1.8 | 2 |
| Storage throughput ("/global", GB/s) | 20 | 15 | 10 | 1 |
| Storage latency (ms) | 0.03 | 1 | 1 | 1000 |
“Operations”: Floating point / 16 cores @3GHz; Values per node
“Memory”: Mixture of three cache levels and main memory; Values per node
“Network”: Medium shared by all users (large jobs only); Values per node
“Storage”: Medium shared by all users; Values per application
Which one is the limiting factor for a given code?
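A back-of-the-envelope estimate often answers this. For example, a dot product of two double-precision vectors of length n performs about 2n floating point operations but must load 16n bytes, i.e. 0.125 Flop per byte; at the 120 GB/s from the table this caps it at roughly 15 GFlop/s, far below the 384 GFlop/s compute peak, so it is memory bound. A large dense matrix multiplication, with O(n^3) operations on O(n^2) data, is compute bound instead.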
Know algorithmic complexity
- Algorithmic complexity: Scaling with varying input parameters
- HPC (High Performance Computing): execution time spent on just a few lines of code
- Linear Algebra
- (Dense) matrix multiplication: O(n^3)
- (Dense) matrix vector multiplication: O(n^2)
- Dot product: O(n)
- (Dense) system of linear equations / least squares: O(n^3)
- (Dense) Eigensolver: O(n^3)
- FFT: O(n * log(n))
- Sparse algorithms: O(n) .. O(n^3)
- ..
Know algorithmic complexity (2)
- Other codes
- Identify main parameters
- Identify scaling (usually polynomial)
- Large problem size ⇒ subalgorithm with highest exponent of complexity becomes dominant
- Measure compute time, preferably with profiler
- Beware: cache effects can ‘change’ algorithmic complexity
- Often: algorithmic complexity is well known and documented
Know algorithmic complexity - Example
Example: program with subprograms, running with varying problem size
#include <stdio.h>
#include <stdlib.h>

static void square(long n)
{
    printf("Inside square \n");
    for(long i=0; i<n*n; i++);          /* O(n^2) busy loop */
}

void cube(long n)
{
    printf("Inside cube \n");
    for(long i=0; i<n*n/10000*n; i++);  /* O(n^3) busy loop (scaled down) */
}

int main(int argc, char *argv[])
{
    long n;
    if(argc == 2)
        n = atoi(argv[1]);
    else
        n = 4000;
    printf("Inside main\n");
    square(n);
    cube(n);
    return 0;
}
Know algorithmic complexity - Example (2)
Makefile
short: a.out
	time ./a.out 5000
	gprof -Q -b a.out gmon.out
	time ./a.out 20000
	gprof -Q -b a.out gmon.out

long: gmon.out
	gprof a.out gmon.out | less

gmon.out: a.out
	time ./a.out

a.out: test_gprof.c
	gcc -std=c99 -Wall -pg test_gprof.c

clean:
	rm gmon.out a.out 2>/dev/null
Know algorithmic complexity - Example (3)
Execution: dominating subroutine changes with problem size
[profiling]$ make
gcc -std=c99 -Wall -pg test_gprof.c
time ./a.out 5000
[..]
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 70.75      0.07     0.07        1    70.75    70.75  square
 30.32      0.10     0.03        1    30.32    30.32  cube
time ./a.out 20000
[..]
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 66.23      1.75     1.75        1     1.75     1.75  cube
 34.84      2.67     0.92        1     0.92     0.92  square
Libraries
Use libraries for good reasons:
- execution performance
- efficient development
- well tested
- portability
- maintenance (by others!)
- ..
Very important libraries
- Level 3 BLAS (=“Matrix Multiplication”)
- O(n^3) operations with O(n^2) memory accesses
- portable performance (often faster by a factor of 10)
- including triangular and symmetric / hermitian matrices
- any three-loop operation with $C=\alpha AB + \beta C$ can be written using Level 3 BLAS
- MKL: free Math Kernel Library from Intel
- Linear Algebra
- FFT
- Neural networks
- Statistics
- Sparse algorithms
- Parallel
- ..
- HDF5: Hierarchical Data Format: efficient and portable library
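As a minimal sketch of writing data with the HDF5 C API (the file and dataset names are made up; compile with the h5cc wrapper or link against libhdf5):

#include "hdf5.h"

int main(void)
{
    /* small 4 x 6 matrix to be stored */
    double data[4][6];
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 6; j++)
            data[i][j] = i * 6 + j;

    hsize_t dims[2] = {4, 6};
    hid_t file  = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(2, dims, NULL);               /* 2D dataspace */
    hid_t dset  = H5Dcreate2(file, "/matrix", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}

The resulting file is self-describing and can be read from any language with HDF5 bindings.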
Interoperability
Any programming language can call subroutines of any other programming language
- C + Fortran
- Matlab + C
- Octave + C++
- Java + MKL
- Python + R
- Haskell + Assembler
- Javascript + Nvidia Cuda
Combine the tool you know best with the tool that performs best
Interoperability (2)
Wrapper
In many cases wrapper routines are required, mostly in C (as the language of the operating system)
Difficulties can be
- Naming schemes
- UPPER vs. lower case
- added underscores
[test]$ grep -i dsbevk\( dsbevk.F90
SUBROUTINE DSBEVK(JOBZ, RANGE, UPLO, B2T, N, ABAND, LDABAND, VL, VU, IL, IU, ABSTOL, NE, W, Z, LDZ, &
[test]$ nm a.out | grep -i dsbevk
0000000000417504 T dsbevk_
000000000041abe1 t dsbevk_._omp_fn.0
Interoperability (2a)
Further difficulties can be
- Call by value / call by reference
- Order of parameters: the lengths of string (CHARACTER) arguments are sometimes appended at the very end of the parameter list (see the sketch after this list)
- I/O libraries
  * Hint: use I/O of only one language, or
  * link with correct runtime libraries, or
  * link by calling compiler (e.g. gfortran) instead of linker (e.g. ld)
- Copy-in and/or copy-out: can affect performance
- Class/module information in objects (“*.o”) only useful in one language
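A sketch of the calling convention behind these points (the routine SOLVE and its arguments are hypothetical, and the type of the hidden length argument is compiler dependent: older gfortran passes an int, newer versions a size_t):

/* Fortran side (hypothetical):
     SUBROUTINE SOLVE(UPLO, N, A)
     CHARACTER UPLO
     INTEGER N
     DOUBLE PRECISION A(N,N)                                             */

/* C prototype: lower-case name plus trailing underscore, every argument
   passed by reference, and the length of each CHARACTER argument
   appended at the very end of the parameter list.                       */
void solve_(const char *uplo, const int *n, double *a, int uplo_len);

void call_solve(int n, double *a)
{
    solve_("U", &n, a, 1);   /* 1 = length of the string "U" */
}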
Interoperability (3)
Wrapper example
[C_Fortran]$ more f.f
      program y
      write(*,'(I1," + ",I1," = ",I1)') 3, 5, intadd(3,5)
      end program

[C_Fortran]$ more c.c
int intadd_(int *a, int *b)
{
  return (*a)+(*b);
}

[C_Fortran]$ more makefile
run: a.out
	./a.out

a.out: c.c f.f
	gfortran c.c f.f

[C_Fortran]$ make
gfortran c.c f.f
./a.out
3 + 5 = 8
Memory optimization
To keep CPUs busy, data has to be kept near the execution units, e.g. in Level 1 Cache
Methods
- Optimized libraries
- Blocking
- Reordering of operations
- Reuse data as much as possible
- Reuse cache lines (64 bytes = 8 doubles): the smallest unit of memory access
- Access by row / column (depending on chosen language)
- ..
Example: Row - Column
Column Access
for(j=0;j<N;j++)
  for(i=0;i<N;i++)
    A[i][j]+=2.0;
Strided memory access in C
Contiguous memory access in Fortran
Row Access
for(i=0;i<N;i++)
  for(j=0;j<N;j++)
    A[i][j]+=2.0;
Contiguous memory access in C
Strided memory access in Fortran
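To see the difference, a minimal timing sketch (matrix size and timer choice are illustrative; compile without aggressive optimization so the compiler does not interchange the loops):

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

#define N 4000
static double A[N][N];

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void)
{
    double t = seconds();
    for (int j = 0; j < N; j++)        /* column access: stride N in C */
        for (int i = 0; i < N; i++)
            A[i][j] += 2.0;
    printf("column access: %.3f s\n", seconds() - t);

    t = seconds();
    for (int i = 0; i < N; i++)        /* row access: contiguous in C  */
        for (int j = 0; j < N; j++)
            A[i][j] += 2.0;
    printf("row access:    %.3f s\n", seconds() - t);
    return 0;
}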
Example: Blocking
//Three nested loops
N=10000;
for(i=0;i<N;i++)
  for(j=0;j<N;j++)
    for(k=0;k<N;k++)
      C[i][j]+=A[i][k]*B[k][j];
//Large N => cache thrashing

//Blocked access
l=48; N=10000;
for(i=0;i<N;i+=l)
  for(j=0;j<N;j+=l)
    for(k=0;k<N;k+=l)
      for(ii=i;ii<min(i+l,N);ii++)
        for(jj=j;jj<min(j+l,N);jj++)
          for(kk=k;kk<min(k+l,N);kk++)
            C[ii][jj]+=A[ii][kk]*B[kk][jj];
//Large N => reusing data in fast cache levels
Adapt parameter l to size of cache – separately for each cache level
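As a rough sizing sketch (assuming, hypothetically, a 32 KB L1 data cache): three l x l blocks of doubles have to fit at once, so 3 * l^2 * 8 B <= 32 KB gives l ≈ 36; repeating the calculation with the L2 and L3 sizes yields the block sizes for the outer levels.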
Even better:
double ONE = 1.0;
// Fortran BLAS interface: all arguments are passed by reference
// dgemm(TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC)
dgemm_("N", "N", &N, &N, &N, &ONE, A, &N, B, &N, &ONE, C, &N);
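Link against an optimized BLAS implementation, e.g. with OpenBLAS via gcc prog.c -lopenblas, or with MKL (the Intel MKL Link Line Advisor gives the exact flags for a given setup).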
Storage and Network
Small messages are costly
Example: Write 100MB
- in one block to storage: 100MB / 2GB/s + 2µs = 50.002ms
- 1000000 times 100B: 1000000 * 100B / 2GB/s + 1000000 * 2µs = 50ms + 2000ms = 2050ms (blocking the file server)
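A minimal sketch of the two variants in C (file names and record layout are made up; note that stdio already buffers a few kB internally, so the effect is most dramatic with unbuffered or parallel I/O):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NREC  1000000L   /* number of records                 */
#define RECSZ 100        /* bytes per record: 100 MB in total */

int main(void)
{
    char record[RECSZ];
    memset(record, 'x', RECSZ);

    /* Variant 1: one million small writes, each of which may end up as a
       separate request to the (shared) file system.                      */
    FILE *f = fopen("small_writes.dat", "wb");
    for (long i = 0; i < NREC; i++)
        fwrite(record, 1, RECSZ, f);
    fclose(f);

    /* Variant 2: assemble the data in memory and write one large block.  */
    char *buf = malloc(NREC * RECSZ);
    for (long i = 0; i < NREC; i++)
        memcpy(buf + i * RECSZ, record, RECSZ);
    f = fopen("one_write.dat", "wb");
    fwrite(buf, 1, NREC * RECSZ, f);
    fclose(f);
    free(buf);
    return 0;
}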
Tools
- Debugger
- gdb
- Allinea ddt
- Profiler
- gprof
- Allinea Performance Reports: Quick performance overview (Runtime - Memory - Threads - I/O - MPI)
- Instrumentation
- Allinea map
- Intel performance tools
$ ls /cm/shared/apps/intel
[..] advisor [..] inspector [..] vtune [..]
- VSC School Trainings upon request
https://wiki.vsc.ac.at/doku.php?id=doku:perf-report – https://wiki.vsc.ac.at/doku.php?id=doku:forge