====== Going for efficiency - how to get performance out of your code ======
* Article written by Dieter Kvasnicka (VSC Team)
(last update 2017-10-17 by dk).
===== Overview =====
* Performance?
* Know algorithmic complexity
* Libraries
* Interoperability
===== Performance? =====
What is performance?
^Type ^Theory ^Measured ^Well-tuned Application ^Untuned Application ^
|**Operations** per second (GFlops/s) |384 |340 |340 |30 / 2 |
|**Memory** throughput (GB/s) |120 |80 |80 |5 |
|**Network** throughput (GB/s) |2*4 |2*3.4 |2*2 |0.2 / 0.0002 |
|**Network** latency (µs) |1.4-1.8 |1.4-1.8 |1.4-1.8 |2 |
|**Storage** throughput (“/global”, GB/s) |20 |15 |10 |1 |
|**Storage** latency (ms) |0.03 |1 |1 |1000 |
“Operations”: Floating point / 16 cores @3GHz; Values per node\\
“Memory”: Mixture of three cache levels and main memory; Values per node\\
“Network”: Medium shared by all users (large jobs only); Values per node\\
“Storage”: Medium shared by all users; Values per application
Which one is the **limiting factor** for a given code?
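Example of a quick estimate: a dot product performs 2 floating point operations for every two doubles (16 bytes) it loads, i.e. 0.125 Flop/Byte. At the measured memory throughput of 80 GB/s this caps the rate at roughly 10 GFlop/s, far below the measured 340 GFlop/s, so the dot product is **memory bound**. A large matrix multiplication, in contrast, reuses each loaded value many times and can approach the compute limit.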
===== Know algorithmic complexity =====
* //Algorithmic complexity//: Scaling with varying input parameters
* HPC (High Performance Computing): typically, most of the execution time is spent in just **a few lines of code**
* Linear Algebra
* (Dense) matrix multiplication: O(n^3)
* (Dense) matrix vector multiplication: O(n^2)
* Dot product: O(n)
* (Dense) system of linear equations / least squares: O(n^3)
* (Dense) Eigensolver: O(n^3)
* FFT: O(n * log(n))
* Sparse algorithms: O(n) .. O(n^3)
* ..
===== Know algorithmic complexity (2) =====
* Other codes
* Identify **main parameters**
* Identify **scaling** (usually polynomial)
* Large problem size => subalgorithm with highest exponent of complexity becomes dominant
* **Measure** compute time, preferably with profiler
* Beware: **cache effects** can ‘change’ algorithmic complexity
* Often: algorithmic complexity is well known and documented
===== Know algorithmic complexity - Example =====
Example: program with subprograms, running with varying problem size
static void square(long n)           /* O(n^2) work */
{ printf("Inside square \n");
  for(long i=0;i<n;i++)
    for(long j=0;j<n;j++)
      dummy += 1.0;                  /* dummy: global accumulator, keeps the loop from being optimized away */
}
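A complete, compilable version of such a test program could look as follows (a minimal sketch: the cube routine and main shown here are assumptions, chosen so that the dominant routine changes with the problem size as in the profiles below):
<code c>
#include <stdio.h>
#include <stdlib.h>

static double dummy = 0.0;                /* global accumulator, prevents the loops from being optimized away */

static void square(long n)                /* O(n^2) work */
{ printf("Inside square \n");
  for(long i=0;i<n;i++)
    for(long j=0;j<n;j++)
      dummy += 1.0;
}

static void cube(long n)                  /* O(n^3) work, scaled down so that both routines matter */
{ printf("Inside cube \n");
  for(long i=0;i<n;i++)
    for(long j=0;j<n/100;j++)
      for(long k=0;k<n/100;k++)
        dummy += 1.0;
}

int main(int argc, char **argv)
{ long n = (argc > 1) ? atol(argv[1]) : 5000;   /* problem size from the command line */
  square(n);
  cube(n);
  printf("%f\n", dummy);
  return 0;
}
</code>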
===== Know algorithmic complexity - Example (2) =====
Makefile: the -pg option makes gcc instrument the executable; each run then writes a gmon.out profile that gprof analyses
short: a.out
time ./a.out 5000
gprof -Q -b a.out gmon.out
time ./a.out 20000
gprof -Q -b a.out gmon.out
long: gmon.out
gprof a.out gmon.out | less
gmon.out: a.out
time ./a.out
a.out: test_gprof.c
gcc -std=c99 -Wall -pg test_gprof.c
clean:
rm -f gmon.out a.out
===== Know algorithmic complexity - Example (3) =====
Execution: dominating subroutine changes with problem size
[profiling]$ make
gcc -std=c99 -Wall -pg test_gprof.c
time ./a.out 5000
[..]
% cumulative self self total
time seconds seconds calls ms/call ms/call name
70.75 0.07 0.07 1 70.75 70.75 square
30.32 0.10 0.03 1 30.32 30.32 cube
time ./a.out 20000
[..]
% cumulative self self total
time seconds seconds calls s/call s/call name
66.23 1.75 1.75 1 1.75 1.75 cube
34.84 2.67 0.92 1 0.92 0.92 square
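The scaling can be read directly from these numbers: the problem size grows by a factor of 4 (5000 → 20000). The self time of cube grows from 0.03 s to 1.75 s, a factor of about 58, close to 4^3 = 64, while square grows from 0.07 s to 0.92 s, a factor of about 13, close to 4^2 = 16. This identifies cube as O(n^3) and square as O(n^2), and explains why the dominant subroutine changes with the problem size.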
===== Libraries =====
Use libraries, for good reasons:
* execution performance
* efficient development
* well tested
* portability
* maintenance (by others!)
* ..
===== Very important libraries =====
* **Level 3 BLAS** (=“Matrix Multiplication”)
* O(n^3) operations with O(n^2) memory accesses
* portable performance (often faster by a factor of 10)
* including triangular and symmetric / hermitian matrices
* any triple-loop operation of the form $C=\alpha AB + \beta C$ can be written using Level 3 BLAS (see the sketch after this list)
* MKL: free **Math Kernel Library** from Intel
* Linear Algebra
* FFT
* Neural networks
* Statistics
* Sparse algorithms
* Parallel
* ..
* **HDF5**: Hierarchical Data Format: efficient and portable library for binary I/O
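To illustrate the Level 3 BLAS point above: a minimal sketch of replacing a hand-written triple loop by a GEMM call through the C interface (CBLAS, as provided e.g. by MKL or OpenBLAS; matrix size and row-major layout are arbitrary choices for this example):
<code c>
#include <stdlib.h>
#include <cblas.h>                 /* with MKL, mkl.h can be included instead */

int main(void)
{ int n = 1000;
  double *A = calloc((size_t)n*n, sizeof(double));
  double *B = calloc((size_t)n*n, sizeof(double));
  double *C = calloc((size_t)n*n, sizeof(double));

  /* Variant 1: naive triple loop, O(n^3) operations on O(n^2) data */
  for(int i=0;i<n;i++)
    for(int j=0;j<n;j++)
      for(int k=0;k<n;k++)
        C[i*n+j] += A[i*n+k]*B[k*n+j];

  /* Variant 2: the same update C = 1.0*A*B + 1.0*C via Level 3 BLAS,
     typically around an order of magnitude faster for large matrices */
  cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              n, n, n, 1.0, A, n, B, n, 1.0, C, n);

  free(A); free(B); free(C);
  return 0;
}
</code>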
===== Interoperability =====
Essentially any programming language can call subroutines written in any other programming language
* C + Fortran
* Matlab + C
* Octave + C++
* Java + MKL
* Python + R
* Haskell + Assembler
* JavaScript + NVIDIA CUDA
Combine your **best known** with the **best performing** tool
===== Interoperability (2) =====
==== Wrapper ====
In many cases wrapper routines are required, mostly in C (as the language of the operating system)
Difficulties can be
* Naming schemes
* UPPER vs. lower case
* added underscores
[test]$ grep -i dsbevk\( dsbevk.F90
SUBROUTINE DSBEVK(JOBZ, RANGE, UPLO, B2T, N, ABAND, LDABAND, VL, VU, IL, IU, ABSTOL, NE, W, Z, LDZ, &
[test]$ nm a.out | grep -i dsbevk
0000000000417504 T dsbevk_
000000000041abe1 t dsbevk_._omp_fn.0
===== Interoperability (2a) =====
==== Wrapper ====
In many cases wrapper routines are required, mostly in C (as the language of the operating system)
Difficulties can be
* Naming schemes
* Call by value / call by reference
* Order of parameters: the lengths of string parameters are sometimes appended at the very end of the parameter list (see the sketch after this list)
* I/O libraries
  * Hint: use I/O of only one language, or
  * link with the correct runtime libraries, or
  * link by calling the compiler (e.g. gfortran) instead of the linker (e.g. ld)
* Copy-in and/or copy-out: can affect performance
* Class/module information in objects ("*.o") only useful in one language
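A minimal sketch of the opposite direction, C calling a Fortran routine (LAPACK's DPOTRF), illustrating the appended underscore, call by reference and the hidden string-length argument at the end of the parameter list. The type of the hidden length is compiler dependent (size_t with recent gfortran, int elsewhere), so this prototype is an assumption, not a portable interface:
<code c>
#include <stdio.h>
#include <stddef.h>

/* LAPACK Cholesky factorization DPOTRF(UPLO, N, A, LDA, INFO):
   trailing underscore, every argument passed by reference, and the
   length of the character argument UPLO appended at the very end.   */
extern void dpotrf_(const char *uplo, const int *n, double *a,
                    const int *lda, int *info, size_t uplo_len);

int main(void)                       /* build e.g.: gcc call_dpotrf.c -llapack */
{ int n = 2, info;
  double a[4] = { 4.0, 2.0,          /* column-major 2x2 symmetric positive definite matrix */
                  2.0, 3.0 };
  dpotrf_("U", &n, a, &n, &info, (size_t)1);   /* 1 = length of "U" */
  printf("info = %d, U(1,1) = %f\n", info, a[0]);
  return 0;
}
</code>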
===== Interoperability (3) =====
==== Wrapper example ====
[C_Fortran]$ more f.f
program y
write(*,'(I1," + ",I1," = ",I1)') 3, 5, intadd(3,5)
end program
[C_Fortran]$ more c.c
int intadd_(int *a, int *b) { return (*a)+(*b); }
[C_Fortran]$ more makefile
run: a.out
./a.out
a.out: c.c f.f
gfortran c.c f.f
[C_Fortran]$ make
gfortran c.c f.f
./a.out
3 + 5 = 8
===== Memory optimization =====
To keep CPUs busy, data has to be kept near the execution units, e.g. in **Level 1 Cache**
Methods
* Optimized libraries
* Blocking
* Reordering of operations
* Reuse data as much as possible
* Reuse cache lines (64 bytes = 8 doubles): a cache line is the smallest unit of memory transfer
* Access by row / column (depending on chosen language)
* ..
===== Example: Row - Column =====
Column Access
for(j=0;j<N;j++)
  for(i=0;i<N;i++) a[i][j] += 1.0;
Strided memory access in C\\
Contiguous memory access in Fortran
Row Access
for(i=0;i<N;i++)
  for(j=0;j<N;j++) a[i][j] += 1.0;
Contiguous in memory in C\\
Strided memory access in Fortran
===== Example: Blocking =====
//Three nested loops
N=10000;
for(i=0;i<N;i++)
  for(j=0;j<N;j++)
    for(k=0;k<N;k++)
      C[i][j] += A[i][k]*B[k][j];                                 // => cache thrashing
//Blocked access
l=48; N=10000;
for(i=0;i<N;i+=l) for(j=0;j<N;j+=l) for(k=0;k<N;k+=l)            // loop over l x l blocks
  for(ii=i; ii<i+l && ii<N; ii++)
    for(jj=j; jj<j+l && jj<N; jj++) for(kk=k; kk<k+l && kk<N; kk++)
      C[ii][jj] += A[ii][kk]*B[kk][jj];                           // => reusing data in fast cache levels
Adapt parameter l to size of cache – separately for each cache level
Even better:
double ONE=1.0;
dgemm_("N","N",&N,&N,&N,&ONE,A,&N,B,&N,&ONE,C,&N);
// dgemm(TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC)
===== Storage and Network =====
Small messages are costly
Example: Write 100MB
* in one block to storage: 100MB / 2GB/s + 2µs = 50.002ms
* in 1,000,000 chunks of 100B each: 1,000,000 * (100B / 2GB/s) + 1,000,000 * 2µs = 50ms + 2000ms = 2050ms (and the many small requests block the file server for all users)
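A minimal sketch of the consequence for application code: collect small records in a memory buffer and write them in one large block (buffer size, record layout and file name are arbitrary choices for this example):
<code c>
#include <stdio.h>
#include <string.h>

#define RECORD   100                        /* one small record: 100 bytes      */
#define NRECORDS 1000000                    /* 1,000,000 records = 100 MB total */

int main(void)
{ static char buffer[RECORD * NRECORDS];    /* aggregate everything in memory first */
  char record[RECORD] = {0};

  for (long i = 0; i < NRECORDS; i++)       /* fill the buffer record by record */
    memcpy(buffer + i*RECORD, record, RECORD);

  FILE *f = fopen("output.dat", "wb");      /* one large write instead of 1,000,000 small ones */
  if (f) { fwrite(buffer, RECORD, NRECORDS, f); fclose(f); }
  return 0;
}
</code>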
===== Tools =====
* Debugger
* gdb
* Allinea ddt
* Profiler
* gprof
* Allinea Performance Reports: Quick performance overview (Runtime - Memory - Threads - I/O - MPI)
* Instrumentation
* Allinea MAP
* Intel performance tools
$ ls /cm/shared/apps/intel
[..] advisor [..] inspector [..] vtune [..]
* VSC School Trainings upon request
https://wiki.vsc.ac.at/doku.php?id=doku:perf-report – https://wiki.vsc.ac.at/doku.php?id=doku:forge