
Going for efficiency - how to get the performance out

  • Article written by Dieter Kvasnicka (VSC Team) (last update 2017-10-17 by dk).
  • Performance?
  • Know algorithmic complexity
  • Libraries
  • Interoperability

What is performance?

Type                                    Theory     Measured   Well-tuned    Untuned
                                                              Application   Application
Operations per second (GFlops/s)        384        340        340           30 / 2
Memory throughput (GB/s)                120        80         80            5
Network throughput (GB/s)               2*4        2*3.4      2*2           0.2 / 0.0002
Network latency (µs)                    1.4-1.8    1.4-1.8    1.4-1.8       2
Storage throughput (“/global”, GB/s)    20         15         10            1
Storage latency (ms)                    0.03       1          1             1000

“Operations”: Floating point / 16 cores @3GHz; Values per node
“Memory”: Mixture of three cache levels and main memory; Values per node
“Network”: Medium shared by all users (large jobs only); Values per node
“Storage”: Medium shared by all users; Values per application

Which one is the limiting factor for a given code?
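
Measuring decides. Below is a minimal sketch (the array size and the serial triad loop are illustrative assumptions, not a VSC benchmark) that estimates sustained memory bandwidth and floating-point rate for comparison with the table above.

/* triad.c - estimate memory bandwidth with a STREAM-like triad
   (a sketch; compile e.g. with: gcc -O2 triad.c -o triad)      */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
  const long N = 50000000;                /* 3 arrays x 400 MB */
  double *a = malloc(N*sizeof(double));
  double *b = malloc(N*sizeof(double));
  double *c = malloc(N*sizeof(double));
  if(!a || !b || !c) return 1;
  for(long i=0;i<N;i++) { b[i]=1.0; c[i]=2.0; }

  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for(long i=0;i<N;i++)
    a[i] = b[i] + 3.0*c[i];               /* 2 flops, 3 memory accesses */
  clock_gettime(CLOCK_MONOTONIC, &t1);

  double s = (t1.tv_sec-t0.tv_sec) + 1e-9*(t1.tv_nsec-t0.tv_nsec);
  printf("%.2f GB/s  %.2f GFlops/s\n",
         3.0*N*sizeof(double)/s/1e9, 2.0*N/s/1e9);
  return 0;
}

A single thread typically reaches only a fraction of the per-node numbers in the table; comparing the measured rate against them shows whether the loop is memory-bound.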

  • Algorithmic complexity: Scaling with varying input parameters
  • HPC (High Performance Computing): execution time is typically spent on just a few lines of code
  • Linear Algebra
    • (Dense) matrix multiplication: O(n^3)
    • (Dense) matrix vector multiplication: O(n^2)
    • Dot product: O(n)
    • (Dense) system of linear equations / least squares: O(n^3)
    • (Dense) Eigensolver: O(n^3)
    • FFT: O(n * log(n))
    • Sparse algorithms: O(n) .. O(n^3)
    • ..
  • Other codes
    • Identify main parameters
    • Identify scaling (usually polynomial)
  • Large problem size ⇒ subalgorithm with highest exponent of complexity becomes dominant
    • Measure compute time, preferably with a profiler (see the sketch after this list)
    • Beware: cache effects can ‘change’ algorithmic complexity
  • Often: algorithmic complexity is well known and documented
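
If the complexity is unknown, the exponent p in t ≈ c·n^p can be estimated from two timings: p = log(t2/t1) / log(n2/n1). A minimal sketch, with the numbers taken from the cumulative gprof times shown further below:

#include <math.h>
#include <stdio.h>

/* estimate the exponent p in t ~ n^p from two measured runtimes
   (compile with -lm); values from the gprof output below        */
int main(void)
{
  double n1=5000,  t1=0.10;
  double n2=20000, t2=2.67;
  printf("p = %.2f\n", log(t2/t1)/log(n2/n1));  /* ~2.4: a mixture of n^2 and n^3 */
  return 0;
}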

Example: program with subprograms, running with varying problem size

#include <stdio.h>
#include <stdlib.h>

/* Empty loops stand in for real work of a given complexity; compile
   without optimization (as in the Makefile below), otherwise they
   are optimized away. */
static void square(long n)                /* O(n^2) */
{ printf("Inside square\n");
  for(long i=0;i<n*n;i++);
}
static void cube(long n)                  /* O(n^3), scaled by 1/10000 */
{ printf("Inside cube\n");
  for(long i=0;i<n*n/10000*n;i++);
}
int main(int argc, char *argv[])
{ long n;
  if(argc == 2) n=atol(argv[1]); else n=4000;
  printf("Inside main\n");
  square(n);
  cube(n);
  return 0;
}

Makefile

short:  a.out
        time ./a.out 5000
        gprof -Q -b a.out gmon.out
        time ./a.out 20000
        gprof -Q -b a.out gmon.out

long:   gmon.out
        gprof a.out gmon.out | less

gmon.out:       a.out
        time ./a.out

a.out:  test_gprof.c
        gcc -std=c99 -Wall -pg test_gprof.c

clean:
        rm -f gmon.out a.out

Execution: dominating subroutine changes with problem size

[profiling]$ make
gcc -std=c99 -Wall -pg test_gprof.c
time ./a.out 5000
[..]
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 70.75      0.07     0.07        1    70.75    70.75  square
 30.32      0.10     0.03        1    30.32    30.32  cube
time ./a.out 20000
[..]
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 66.23      1.75     1.75        1     1.75     1.75  cube
 34.84      2.67     0.92        1     0.92     0.92  square

Use libraries for good reasons

  • execution performance
  • efficient development
  • well tested
  • portability
  • maintenance (by others!)
  • ..
  • Level 3 BLAS (=“Matrix Multiplication”)
    • O(n^3) operations with O(n^2) memory accesses
    • portable performance (often faster by a factor of 10)
    • including triangular and symmetric / hermitian matrices
    • any three-loop operation of the form C = αAB + βC can be written using Level 3 BLAS
  • MKL: free Math Kernel Library from Intel
    • Linear Algebra
    • FFT
    • Neural networks
    • Statistics
    • Sparse algorithms
    • Parallel
    • ..
  • HDF5: Hierarchical Data Format: efficient and portable I/O library (see the sketch below)
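
As a sketch of how little code such libraries require, writing a 2-D dataset takes only a few calls of the HDF5 C API (file and dataset names here are hypothetical; compile with the h5cc wrapper):

/* write a 4 x 6 matrix to example.h5 (a minimal sketch) */
#include "hdf5.h"

int main(void)
{
  double data[4][6];
  for (int i = 0; i < 4; i++)
    for (int j = 0; j < 6; j++)
      data[i][j] = i + 0.1*j;

  hsize_t dims[2] = {4, 6};
  hid_t file  = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
  hid_t space = H5Screate_simple(2, dims, NULL);
  hid_t dset  = H5Dcreate2(file, "/matrix", H5T_NATIVE_DOUBLE, space,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
  H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);
  H5Dclose(dset); H5Sclose(space); H5Fclose(file);
  return 0;
}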

Any programming language can call subroutines of any other programming language

  • C + Fortran
  • Matlab + C
  • Octave + C++
  • Java + MKL
  • Python + R
  • Haskell + Assembler
  • Javascript + Nvidia Cuda

Combine the tool you know best with the best performing tool

In many cases wrapper routines are required, mostly in C (as the language of the operating system)

Difficulties can be

  • Naming schemes
    • UPPER vs. lower case
    • added underscores
[test]$ grep -i dsbevk\( dsbevk.F90
SUBROUTINE DSBEVK(JOBZ, RANGE, UPLO, B2T, N, ABAND, LDABAND, VL, VU, IL, IU, ABSTOL, NE, W, Z, LDZ, &
[test]$ nm a.out | grep -i dsbevk
0000000000417504 T dsbevk_
000000000041abe1 t dsbevk_._omp_fn.0

Further difficulties can be

  • Call by value / call by reference
  • Order of parameters: size of string parameters sometimes at the very end of the parameter list (see the sketch after the example below)
  • I/O libraries
    • Hint: use I/O of only one language, or
    • link with the correct runtime libraries, or
    • link by calling the compiler (e.g. gfortran) instead of the linker (e.g. ld)
  • Copy-in and/or copy-out: can affect performance
  • Class/module information in objects (“*.o”) only useful in one language
[C_Fortran]$ more f.f
        program y
        write(*,'(I1," + ",I1," = ",I1)') 3, 5, intadd(3,5)
        end program
[C_Fortran]$ more c.c
int intadd_(int *a, int *b) { return (*a)+(*b); }
[C_Fortran]$ more makefile
run:    a.out
        ./a.out

a.out:  c.c f.f
        gfortran c.c f.f
[C_Fortran]$ make
gfortran c.c f.f
./a.out
3 + 5 = 8
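
For illustration, the hidden string-length arguments mentioned in the list above look like this when calling a Fortran routine with CHARACTER arguments from C (gfortran convention; the routine 'solve' is hypothetical):

#include <stddef.h>

/* gfortran appends one hidden length argument per CHARACTER dummy
   at the end of the argument list (size_t in recent gfortran,
   int in older versions); 'solve_' is a hypothetical routine      */
extern void solve_(const char *jobz, const char *uplo, const int *n,
                   size_t jobz_len, size_t uplo_len);

void call_solve(int n)
{
  solve_("V", "U", &n, 1, 1);   /* lengths of "V" and "U" */
}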

To keep CPUs busy, data has to be kept near the execution units, e.g. in Level 1 Cache

Methods

  • Optimized libraries
  • Blocking
    • Reordering of operations
    • Reuse data as much as possible
    • Reuse cache lines (64 bytes = 8 doubles): the smallest unit of memory access
  • Access by row / column (depending on chosen language)
  • ..

Column Access

for(j=0;j<N;j++)
  for(i=0;i<N;i++)
    A[i][j]+=2.0;

Strided memory access in C
Contiguous memory access in Fortran

Row Access

for(i=0;i<N;i++)
  for(j=0;j<N;j++)
    A[i][j]+=2.0;

Contiguous in memory in C
Strided memory access in Fortran
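
Both traversal orders can be timed directly. A minimal sketch (N and the static allocation are illustrative assumptions); on typical hardware the strided version is several times slower:

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

#define N 10000
static double A[N][N];           /* ~800 MB; static keeps the sketch short */

static double seconds(void)
{ struct timespec t; clock_gettime(CLOCK_MONOTONIC, &t);
  return t.tv_sec + 1e-9*t.tv_nsec; }

int main(void)
{
  double t = seconds();
  for (int i = 0; i < N; i++)    /* row access: contiguous in C */
    for (int j = 0; j < N; j++)
      A[i][j] += 2.0;
  printf("row access:    %.2f s\n", seconds() - t);

  t = seconds();
  for (int j = 0; j < N; j++)    /* column access: stride of N doubles */
    for (int i = 0; i < N; i++)
      A[i][j] += 2.0;
  printf("column access: %.2f s\n", seconds() - t);
  return 0;
}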

// Three nested loops (naive matrix multiplication);
// A, B, C are N x N double arrays, indices declared elsewhere
N=10000;
for(i=0;i<N;i++)
  for(j=0;j<N;j++)
    for(k=0;k<N;k++)
      C[i][j]+=A[i][k]*B[k][j];
// Large N => cache thrashing

// Blocked access: process l x l tiles that fit into cache
#define min(a,b) ((a)<(b)?(a):(b))
l=48; N=10000;
for(i=0;i<N;i+=l)
  for(j=0;j<N;j+=l)
    for(k=0;k<N;k+=l)
      for(ii=i;ii<min(i+l,N);ii++)
        for(jj=j;jj<min(j+l,N);jj++)
          for(kk=k;kk<min(k+l,N);kk++)
            C[ii][jj]+=A[ii][kk]*B[kk][jj];
// Large N => reusing data in fast cache levels

Adapt parameter l to size of cache – separately for each cache level

Even better:

double ONE=1.0;
// Fortran BLAS convention: arguments by reference, matrices column-major
dgemm_("N","N",&N,&N,&N,&ONE,A,&N,B,&N,&ONE,C,&N);
// dgemm(TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC)
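
The speedup comes from the optimized BLAS that is linked in, e.g. (assuming OpenBLAS is installed) gcc mm.c -lopenblas, or the link line suggested by Intel's MKL link-line advisor; the exact flags depend on the installed library.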

Small messages are costly

Example: Write 100MB

  • in one block to storage: 100MB / 2GB/s + 2µs = 50.002ms
  • 1000000 times 100B: 1000000 * 100B / 2GB/s + 1000000 * 2µs = 2050ms (blocking the file server); see the sketch below
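
The effect can be reproduced with a sketch like the following (file names are hypothetical; the unbuffered stream is an assumption that makes every fwrite an individual system call, mimicking an application that writes many tiny records). Run each variant under time:

#include <stdio.h>

static char buf[100000000];               /* 100 MB, zero-initialized */

int main(void)
{
  FILE *f = fopen("big.dat", "wb");       /* one block of 100 MB */
  if(!f) return 1;
  fwrite(buf, sizeof buf, 1, f);
  fclose(f);

  f = fopen("small.dat", "wb");           /* 1000000 writes of 100 B */
  if(!f) return 1;
  setvbuf(f, NULL, _IONBF, 0);            /* each fwrite -> one system call */
  for(long i=0;i<1000000;i++)
    fwrite(buf + 100*i, 100, 1, f);
  fclose(f);
  return 0;
}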

Tools

  • Debugger
    • gdb
    • Allinea ddt
  • Profiler
    • gprof
    • Allinea Performance Reports: Quick performance overview (Runtime - Memory - Threads - I/O - MPI)
  • Instrumentation
    • Allinea map
  • Intel performance tools
$ ls /cm/shared/apps/intel
[..] advisor [..] inspector [..] vtune [..]
  • VSC School Trainings upon request

https://wiki.vsc.ac.at/doku.php?id=doku:perf-report
https://wiki.vsc.ac.at/doku.php?id=doku:forge
