Going for efficiency - how to get out performance

Article written by Dieter Kvasnicka (VSC Team) <html><br></html>(last update 2017-10-17 by dk).

Overview

Performance ?
Know algorithmic complexity
Libraries
Interoperability

Performance ?

What is performance?

Type	Theory	Measured	Well-tuned Application	Untuned Application
Operations per second (GFlops/s)	384	340	340	30 / 2
Memory throughput (GB/s)	120	80	80	5
Network throughput (GB/s)	2*4	2*3.4	2*2	0.2 / 0.0002
Network latency (µs)	1.4-1.8	1.4-1.8	1.4-1.8	2
Storage throughput (“/global”, GB/s)	20	15	10	1
Storage latency (ms)	0.03	1	1	1000

“Operations”: Floating point / 16 cores @3GHz; Values per node
“Memory”: Mixture of three cache levels and main memory; Values per node
“Network”: Medium shared by all users (large jobs only); Values per node
“Storage”: Medium shared by all users; Values per application

Which one is the limiting factor for a given code?

Know algorithmic complexity

Algorithmic complexity: Scaling with varying input parameters
HPC (High Performance Computing): execution time spent on just a few lines of code

Linear Algebra
- (Dense) matrix multiplication: O(n^3)
- (Dense) matrix vector multiplication: (O(n^2)
- Dot product: O(n)
- (Dense) system of linear equations / least squares: O(n^3)
- (Dense) Eigensolver: O(n^3)
- FFT: O(n * log(n))
- Sparse algorithms: O(n) .. O(n^3)
- ..

Know algorithmic complexity (2)

Other codes
- Identify main parameters
- Identify scaling (usually polynomial)

Large problem size ⇒ subalgorithm with highest exponent of complexity becomes dominant
- Measure compute time, preferably with profiler
- Beware: cache effects can ‘change’ algorithmic complexity
Often: algorithmic complexity is well known and documented

Know algorithmic complexity - Example

Example: program with subprograms, running with varying problem size

static void square(long n)
{ printf("Inside square \n");
  for(long i=0;i<n*n;i++);
}
void cube(long n)
{ printf("Inside cube \n");
  for(long i=0;i<n*n/10000*n;i++);
}
int main(int argc, char *argv[])
{ long n;
  if(argc == 2) n=atoi(argv[1]); else n=4000;
  printf("Inside main\n");
  square(n);
  cube(n);
  return 0;
}

Know algorithmic complexity - Example (2)

Makefile

short:  a.out
        time ./a.out 5000
        gprof -Q -b a.out gmon.out
        time ./a.out 20000
        gprof -Q -b a.out gmon.out

long:   gmon.out
        gprof a.out gmon.out | less

gmon.out:       a.out
        time ./a.out

a.out:  test_gprof.c
        gcc -std=c99 -Wall -pg test_gprof.c

clean:
        rm gmon.out a.out 2>/dev/null

Know algorithmic complexity - Example (3)

Execution: dominating subroutine changes with problem size

[profiling]$ make
gcc -std=c99 -Wall -pg test_gprof.c
time ./a.out 5000
[..]
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 70.75      0.07     0.07        1    70.75    70.75  square
 30.32      0.10     0.03        1    30.32    30.32  cube
time ./a.out 20000
[..]
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 66.23      1.75     1.75        1     1.75     1.75  cube
 34.84      2.67     0.92        1     0.92     0.92  square

Libraries

use libraries for good reasons

execution performance
efficient development
well tested
portability
maintenance (by others!)
..

Very important libraries

Level 3 BLAS (=“Matrix Multiplication”)
- O(n^3) operations with O(n^2) memory accesses
- portable performance (often faster by a factor of 10)
- including triangular and symmetric / hermitian matrices
- any three-loop operation with $C=\alpha AB + \beta C$ can be written using Level 3 BLAS

MKL: free Math Kernel Library from Intel
- Linear Algebra
- FFT
- Neural networks
- Statistics
- Sparse algorithms
- Parallel
- ..
HDF5: Hierarchical Data Format: efficient and portable library

Interoperability

Any programming language can call subroutines of any other programming language

C + Fortran
Matlab + C
Octave + C++
Java + MKL
Python + R
Haskell + Assembler
Javascript + Nvidia Cuda

Combine your best known with the best performing tool

Interoperability (2)

Wrapper

In many cases wrapper routines are required, mostly in C (as the language of the operating system)

Difficulties can be

Naming schemes
- UPPER vs. lower case
- added underscores

[test]$ grep -i dsbevk\( dsbevk.F90
SUB R O U T I N E DSBEVK(JOBZ, RANGE, UPLO, B2T, N, ABAND, LDABAND, VL, VU, IL, IU, ABSTOL, NE, W, Z, LDZ, &
[test]$ nm a.out | grep -i dsbevk
0000000000417504 T dsbevk_
000000000041abe1 t dsbevk_._omp_fn.0

Interoperability (2a)

Wrapper

In many cases wrapper routines are required, mostly in C (as the language of the operating system)

Difficulties can be

Naming schemes
Call by value / call by reference
Order of parameters: size of string parameters sometimes at the very end of parameter list
I/O libraries * Hint: use I/O of only one language, or * link with correct runtime libraries, or * link by calling compiler (e.g. gfortran) instead of linker (e.g. ld)
Copy-in and/or copy-out: can affect performance
Class/module information in objects (“*.o”) only useful in one language

Interoperability (3)

Wrapper example

[C_Fortran]$ more f.f
        program y
        write(*,'(I1," + ",I1," = ",I1)') 3, 5, intadd(3,5)
        end program

[C_Fortran]$ more c.c
int intadd_(int *a, int *b) { return (*a)+(*b); }

[C_Fortran]$ more makefile
run:    a.out
        ./a.out

a.out:  c.c f.f
        gfortran c.c f.f

[C_Fortran]$ make
gfortran c.c f.f
./a.out
3 + 5 = 8

Memory optimization

To keep CPUs busy, data has to be kept near the execution units, e.g. in Level 1 Cache

Methods

Optimized libraries
Blocking
- Reordering of operations
- Reuse data as much as possible
- Reuse cache lines (64 byte = 8 double): smallest memory access
Access by row / column (depending on chosen language)
..

Example: Row - Column

Column Access

for(j=0;i<N;i++)
  for(i=0;j<N;j++)
    A[i][j]+=2.0;

Strided memory access in C
Contiguous memory access in Fortran

Row Access

for(i=0;i<N;i++)
  for(j=0;j<N;j++)
    A[i][j]+=2.0;

Contiguous in memory in C
Strided memory access in Fortran

Example: Blocking

//Three nested loops
N=10000;
for(i=0;i<N;i++)
  for(j=0;j<N;j++)
    for(k=0;k<N;k++)
      C[i][j]+=A[i][k]*B[k][j];
//Large N => cache thrashing

//Blocked access
l=48; N=10000;
for(i=0;i<N;i+=l)
  for(j=0;j<N;j+=l)
    for(k=0;k<N;k+=l)
      for(ii=i;ii<min(i+l,N);ii++)
        for(jj=j;jj<min(j+l,N);jj++)
          for(kk=k;kk<min(k+l,N);kk++)
            C[ii][jj]+=A[ii][kk]*B[kk][jj];
//Large N => reusing data in fast cache levels

Adapt parameter l to size of cache – separately for each cache level

Even better:

double ONE=1.0D0;
dgemm("N","N",N,N,N,&ONE,A,N,B,N,&ONE,C,N)
// dgemm (TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC)

Storage and Network

Small messages are costly

Example: Write 100MB

in one block to storage: 100MB * 2GB/s + 2µs = 50.002ms

1000000 times 100B: 1000000 * 100B * 2GB/s + 1000000*2µs = 2050ms (blocking the file server)

Tools

Debugger
- gdb
- Allinea ddt
Profiler
- gprof
- Allinea Performance Reports: Quick performance overview (Runtime - Memory - Threads - I/O - MPI)
Instrumentation
- Allinea map
Intel performance tools

$ ls /cm/shared/apps/intel
[..] advisor [..] inspector [..] vtune [..]

VSC School Trainings upon request

https://wiki.vsc.ac.at/doku.php?id=doku:perf-report – https://wiki.vsc.ac.at/doku.php?id=doku:forge

Table of Contents

Going for efficiency - how to get out performance

Overview

Performance ?

Know algorithmic complexity

Know algorithmic complexity (2)

Know algorithmic complexity - Example

Know algorithmic complexity - Example (2)

Know algorithmic complexity - Example (3)

Libraries

Very important libraries

Interoperability

Interoperability (2)

Wrapper

Interoperability (2a)

Wrapper

Interoperability (3)

Wrapper example

Memory optimization

Example: Row - Column

Example: Blocking

Storage and Network

Tools