Table of Contents

Going for efficiency - how to get out performance

Overview

Performance ?

What is performance?

Type Theory Measured Well-tuned Application Untuned Application
Operations per second (GFlops/s) 384 340 340 30 / 2
Memory throughput (GB/s) 120 80 80 5
Network throughput (GB/s) 2*4 2*3.4 2*2 0.2 / 0.0002
Network latency (µs) 1.4-1.8 1.4-1.8 1.4-1.8 2
Storage throughput (“/global”, GB/s) 20 15 10 1
Storage latency (ms) 0.03 1 1 1000

“Operations”: Floating point / 16 cores @3GHz; Values per node
“Memory”: Mixture of three cache levels and main memory; Values per node
“Network”: Medium shared by all users (large jobs only); Values per node
“Storage”: Medium shared by all users; Values per application

Which one is the limiting factor for a given code?

Know algorithmic complexity

Know algorithmic complexity (2)

Know algorithmic complexity - Example

Example: program with subprograms, running with varying problem size

static void square(long n)
{ printf("Inside square \n");
  for(long i=0;i<n*n;i++);
}
void cube(long n)
{ printf("Inside cube \n");
  for(long i=0;i<n*n/10000*n;i++);
}
int main(int argc, char *argv[])
{ long n;
  if(argc == 2) n=atoi(argv[1]); else n=4000;
  printf("Inside main\n");
  square(n);
  cube(n);
  return 0;
}

Know algorithmic complexity - Example (2)

Makefile

short:  a.out
        time ./a.out 5000
        gprof -Q -b a.out gmon.out
        time ./a.out 20000
        gprof -Q -b a.out gmon.out

long:   gmon.out
        gprof a.out gmon.out | less

gmon.out:       a.out
        time ./a.out

a.out:  test_gprof.c
        gcc -std=c99 -Wall -pg test_gprof.c

clean:
        rm gmon.out a.out 2>/dev/null

Know algorithmic complexity - Example (3)

Execution: dominating subroutine changes with problem size

[profiling]$ make
gcc -std=c99 -Wall -pg test_gprof.c
time ./a.out 5000
[..]
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 70.75      0.07     0.07        1    70.75    70.75  square
 30.32      0.10     0.03        1    30.32    30.32  cube
time ./a.out 20000
[..]
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 66.23      1.75     1.75        1     1.75     1.75  cube
 34.84      2.67     0.92        1     0.92     0.92  square

Libraries

use libraries for good reasons

Very important libraries

Interoperability

Any programming language can call subroutines of any other programming language

Combine your best known with the best performing tool

Interoperability (2)

Wrapper

In many cases wrapper routines are required, mostly in C (as the language of the operating system)

Difficulties can be

[test]$ grep -i dsbevk\( dsbevk.F90
SUB R O U T I N E DSBEVK(JOBZ, RANGE, UPLO, B2T, N, ABAND, LDABAND, VL, VU, IL, IU, ABSTOL, NE, W, Z, LDZ, &
[test]$ nm a.out | grep -i dsbevk
0000000000417504 T dsbevk_
000000000041abe1 t dsbevk_._omp_fn.0

Interoperability (2a)

Wrapper

In many cases wrapper routines are required, mostly in C (as the language of the operating system)

Difficulties can be

Interoperability (3)

Wrapper example

[C_Fortran]$ more f.f
        program y
        write(*,'(I1," + ",I1," = ",I1)') 3, 5, intadd(3,5)
        end program
[C_Fortran]$ more c.c
int intadd_(int *a, int *b) { return (*a)+(*b); }
[C_Fortran]$ more makefile
run:    a.out
        ./a.out

a.out:  c.c f.f
        gfortran c.c f.f
[C_Fortran]$ make
gfortran c.c f.f
./a.out
3 + 5 = 8

Memory optimization

To keep CPUs busy, data has to be kept near the execution units, e.g. in Level 1 Cache

Methods

Example: Row - Column

Column Access

for(j=0;i<N;i++)
  for(i=0;j<N;j++)
    A[i][j]+=2.0;

Strided memory access in C
Contiguous memory access in Fortran

Row Access

for(i=0;i<N;i++)
  for(j=0;j<N;j++)
    A[i][j]+=2.0;

Contiguous in memory in C
Strided memory access in Fortran

Example: Blocking

//Three nested loops
N=10000;
for(i=0;i<N;i++)
  for(j=0;j<N;j++)
    for(k=0;k<N;k++)
      C[i][j]+=A[i][k]*B[k][j];
//Large N => cache thrashing
//Blocked access
l=48; N=10000;
for(i=0;i<N;i+=l)
  for(j=0;j<N;j+=l)
    for(k=0;k<N;k+=l)
      for(ii=i;ii<min(i+l,N);ii++)
        for(jj=j;jj<min(j+l,N);jj++)
          for(kk=k;kk<min(k+l,N);kk++)
            C[ii][jj]+=A[ii][kk]*B[kk][jj];
//Large N => reusing data in fast cache levels

Adapt parameter l to size of cache – separately for each cache level

Even better:

double ONE=1.0D0;
dgemm("N","N",N,N,N,&ONE,A,N,B,N,&ONE,C,N)
// dgemm (TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC)

Storage and Network

Small messages are costly

Example: Write 100MB

Tools

$ ls /cm/shared/apps/intel
[..] advisor [..] inspector [..] vtune [..]

https://wiki.vsc.ac.at/doku.php?id=doku:perf-reporthttps://wiki.vsc.ac.at/doku.php?id=doku:forge