====== Going for efficiency - how to get performance out ======

  * Article written by Dieter Kvasnicka (VSC Team)
  * (last update 2017-10-17 by dk)

===== Overview =====

  * Performance?
  * Know algorithmic complexity
  * Libraries
  * Interoperability

===== Performance? =====

What is performance?

^ Type ^ Theory ^ Measured ^ Well-tuned application ^ Untuned application ^
| **Operations** per second (GFlop/s) | 384 | 340 | 340 | 30 / 2 |
| **Memory** throughput (GB/s) | 120 | 80 | 80 | 5 |
| **Network** throughput (GB/s) | 2*4 | 2*3.4 | 2*2 | 0.2 / 0.0002 |
| **Network** latency (µs) | 1.4-1.8 | 1.4-1.8 | 1.4-1.8 | 2 |
| **Storage** throughput ("/global", GB/s) | 20 | 15 | 10 | 1 |
| **Storage** latency (ms) | 0.03 | 1 | 1 | 1000 |

"Operations": floating point, 16 cores @ 3 GHz; values per node\\
"Memory": mixture of three cache levels and main memory; values per node\\
"Network": medium shared by all users (large jobs only); values per node\\
"Storage": medium shared by all users; values per application

Which one is the **limiting factor** for a given code?

===== Know algorithmic complexity =====

  * //Algorithmic complexity//: scaling with varying input parameters
  * HPC (High Performance Computing): execution time is spent on just **a few lines of code**
  * Linear algebra
    * (Dense) matrix multiplication: O(n^3)
    * (Dense) matrix-vector multiplication: O(n^2)
    * Dot product: O(n)
    * (Dense) system of linear equations / least squares: O(n^3)
    * (Dense) eigensolver: O(n^3)
    * FFT: O(n * log(n))
    * Sparse algorithms: O(n) .. O(n^3)
    * ..

===== Know algorithmic complexity (2) =====

  * Other codes
    * Identify **main parameters**
    * Identify **scaling** (usually polynomial)
  * Large problem size => the subalgorithm with the highest exponent of complexity becomes dominant
  * **Measure** compute time, preferably with a profiler
  * Beware: **cache effects** can 'change' the observed algorithmic complexity
  * Often: the algorithmic complexity is well known and documented

===== Know algorithmic complexity - Example =====

Example: program with subprograms, running with varying problem size

<code c>
#include <stdio.h>
#include <stdlib.h>

static volatile double sink;        /* keeps the compiler from removing the loops */

static void square(long n)          /* O(n^2) dummy work */
{
    printf("Inside square \n");
    for (long i = 0; i < n; i++)
        for (long j = 0; j < n; j++)
            sink += (double)i * j;
}

static void cube(long n)            /* O(n^3) dummy work on a reduced size */
{
    printf("Inside cube \n");
    for (long i = 0; i < n/20; i++)
        for (long j = 0; j < n/20; j++)
            for (long k = 0; k < n/20; k++)
                sink += (double)i * j * k;
}

int main(int argc, char **argv)
{
    long n = (argc > 1) ? atol(argv[1]) : 5000;
    square(n);
    cube(n);
    return 0;
}
</code>

===== Know algorithmic complexity - Example (2) =====

Makefile

<code make>
short: a.out
	time ./a.out 5000
	gprof -Q -b a.out gmon.out
	time ./a.out 20000
	gprof -Q -b a.out gmon.out

long: gmon.out
	gprof a.out gmon.out | less

gmon.out: a.out
	time ./a.out

a.out: test_gprof.c
	gcc -std=c99 -Wall -pg test_gprof.c

clean:
	rm gmon.out a.out 2>/dev/null
</code>

===== Know algorithmic complexity - Example (3) =====

Execution: the dominating subroutine changes with the problem size

<code>
[profiling]$ make
gcc -std=c99 -Wall -pg test_gprof.c
time ./a.out 5000
[..]
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 70.75      0.07     0.07        1    70.75    70.75  square
 30.32      0.10     0.03        1    30.32    30.32  cube
time ./a.out 20000
[..]
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 66.23      1.75     1.75        1     1.75     1.75  cube
 34.84      2.67     0.92        1     0.92     0.92  square
</code>

===== Libraries =====

Use libraries - for good reasons:

  * execution performance
  * efficient development
  * well tested
  * portability
  * maintenance (by others!)
  * ..

===== Very important libraries =====

  * **Level 3 BLAS** (= "matrix multiplication")
    * O(n^3) operations with O(n^2) memory accesses
    * portable performance (often faster by a factor of 10)
    * including triangular and symmetric / hermitian matrices
    * any three-loop operation of the form $C=\alpha AB + \beta C$ can be written using Level 3 BLAS
  * **MKL**: free **Math Kernel Library** from Intel
    * Linear algebra
    * FFT
    * Neural networks
    * Statistics
    * Sparse algorithms
    * Parallel
    * ..
  * **HDF5**: Hierarchical Data Format: efficient and portable I/O library (see the sketch below)
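The HDF5 C API writes self-describing binary files that other languages and tools can read directly. A minimal sketch of writing a double array, assuming HDF5 is installed; the file name ''matrix.h5'' and dataset name ''/A'' are arbitrary, and the program can be compiled with the ''h5cc'' wrapper:

<code c>
#include "hdf5.h"

int main(void)
{
    double data[100][200];                    /* example array to be stored */
    hsize_t dims[2] = {100, 200};

    for (int i = 0; i < 100; i++)             /* fill with some example values */
        for (int j = 0; j < 200; j++)
            data[i][j] = i + 0.001 * j;

    /* create the file, describe the array shape, create the dataset */
    hid_t file  = H5Fcreate("matrix.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "/A", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* write the whole array in one call */
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}
</code>

The resulting file can afterwards be read from Fortran, Python (h5py) or other HDF5-aware tools without a conversion step.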
===== Interoperability =====

Any programming language can call subroutines of any other programming language

  * C + Fortran
  * Matlab + C
  * Octave + C++
  * Java + MKL
  * Python + R
  * Haskell + Assembler
  * Javascript + Nvidia Cuda

Combine your **best known** with the **best performing** tool

===== Interoperability (2) =====

==== Wrapper ====

In many cases wrapper routines are required, mostly written in C (as the language of the operating system)

Difficulties can be

  * Naming schemes
    * UPPER vs. lower case
    * added underscores

<code>
[test]$ grep -i dsbevk\( dsbevk.F90
SUBROUTINE DSBEVK(JOBZ, RANGE, UPLO, B2T, N, ABAND, LDABAND, VL, VU, IL, IU, ABSTOL, NE, W, Z, LDZ, &
[test]$ nm a.out | grep -i dsbevk
0000000000417504 T dsbevk_
000000000041abe1 t dsbevk_._omp_fn.0
</code>

===== Interoperability (2a) =====

==== Wrapper ====

In many cases wrapper routines are required, mostly written in C (as the language of the operating system)

Difficulties can be

  * Naming schemes
  * Call by value / call by reference
  * Order of parameters: the sizes of string parameters are sometimes placed at the very end of the parameter list
  * I/O libraries
    * Hint: use the I/O of only one language, or
    * link with the correct runtime libraries, or
    * link by calling the compiler (e.g. gfortran) instead of the linker (e.g. ld)
  * Copy-in and/or copy-out: can affect performance
  * Class/module information in object files ("*.o") is only useful in one language

===== Interoperability (3) =====

==== Wrapper example ====

<code>
[C_Fortran]$ more f.f
      program y
      write(*,'(I1," + ",I1," = ",I1)') 3, 5, intadd(3,5)
      end program
[C_Fortran]$ more c.c
int intadd_(int *a, int *b) {
  return (*a)+(*b);
}
[C_Fortran]$ more makefile
run: a.out
	./a.out
a.out: c.c f.f
	gfortran c.c f.f
[C_Fortran]$ make
gfortran c.c f.f
./a.out
3 + 5 = 8
</code>

===== Memory optimization =====

To keep the CPUs busy, data has to be kept near the execution units, e.g. in the **Level 1 cache**

Methods

  * Optimized libraries
  * Blocking
  * Reordering of operations
  * Reuse data as much as possible
  * Reuse cache lines (64 byte = 8 doubles): the smallest unit of memory access
  * Access by row / column (depending on the chosen language)
  * ..

===== Example: Row - Column =====

Column access

<code c>
for (j = 0; j < N; j++)
    for (i = 0; i < N; i++)
        sum += a[i][j];          /* first index varies fastest */
</code>

Strided memory access in C\\
Contiguous memory access in Fortran

Row access

<code c>
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        sum += a[i][j];          /* last index varies fastest */
</code>

Contiguous memory access in C\\
Strided memory access in Fortran

===== Example: Blocking =====

Three nested loops

<code c>
N = 10000;
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
            c[i][j] += a[i][k] * b[k][j];
</code>

=> cache thrashing

Blocked access

<code c>
l = 48; N = 10000;
for (i = 0; i < N; i += l)
  for (j = 0; j < N; j += l)
    for (k = 0; k < N; k += l)
      for (ii = i; ii < i+l && ii < N; ii++)
        for (jj = j; jj < j+l && jj < N; jj++)
          for (kk = k; kk < k+l && kk < N; kk++)
            c[ii][jj] += a[ii][kk] * b[kk][jj];
</code>

=> reusing data in the fast cache levels

Adapt the block size l to the size of the cache – separately for each cache level

Even better:

<code c>
double ONE = 1.0;
dgemm_("N", "N", &N, &N, &N, &ONE, A, &N, B, &N, &ONE, C, &N);
/* dgemm(TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC) */
</code>

===== Storage and Network =====

Small messages are costly

Example: write 100 MB

  * in one block to storage: 100 MB / 2 GB/s + 2 µs = 50.002 ms
  * in 1000000 pieces of 100 B: 1000000 * (100 B / 2 GB/s) + 1000000 * 2 µs = 2050 ms (blocking the file server)

===== Tools =====

  * Debugger
    * gdb
    * Allinea DDT
  * Profiler
    * gprof
    * Allinea Performance Reports: quick performance overview (Runtime - Memory - Threads - I/O - MPI)
  * Instrumentation
    * Allinea MAP
  * Intel performance tools

<code>
$ ls /cm/shared/apps/intel
[..] advisor [..] inspector [..] vtune [..]
</code>

  * VSC School Trainings upon request

https://wiki.vsc.ac.at/doku.php?id=doku:perf-report – https://wiki.vsc.ac.at/doku.php?id=doku:forge