What is performance?
| Type | Theory | Measured | Well-tuned Application | Untuned Application |
|---|---|---|---|---|
| Operations per second (GFlop/s) | 384 | 340 | 340 | 30 / 2 |
| Memory throughput (GB/s) | 120 | 80 | 80 | 5 |
| Network throughput (GB/s) | 2×4 | 2×3.4 | 2×2 | 0.2 / 0.0002 |
| Network latency (µs) | 1.4-1.8 | 1.4-1.8 | 1.4-1.8 | 2 |
| Storage throughput ("/global", GB/s) | 20 | 15 | 10 | 1 |
| Storage latency (ms) | 0.03 | 1 | 1 | 1000 |
“Operations”: floating point; 16 cores @ 3 GHz × 8 Flop/cycle/core = 384 GFlop/s peak; values per node
“Memory”: mixture of three cache levels and main memory; values per node
“Network”: medium shared by all users (large jobs only); values per node
“Storage”: medium shared by all users; values per application
Which one is the limiting factor for a given code?
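One quick check is the machine balance: from the measured per-node numbers above, 340 GFlop/s ÷ 80 GB/s ≈ 4 Flop per byte. A streaming loop such as a[i] = b[i] + s*c[i] performs 2 Flop per 24 bytes moved (≈ 0.08 Flop/byte) and is therefore memory-bound, while dense matrix-matrix multiplication reaches many Flop per byte and can be compute-bound. (The loop is an illustrative example; the ratios use the table's numbers.)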
Example: a program with subprograms, run with varying problem size
```c
#include <stdio.h>
#include <stdlib.h>

static void square(long n) {
    printf("Inside square\n");
    for (long i = 0; i < n*n; i++);          /* O(n^2) busy loop */
}

void cube(long n) {
    printf("Inside cube\n");
    for (long i = 0; i < n*n/10000*n; i++);  /* O(n^3/10^4) busy loop */
}

int main(int argc, char *argv[]) {
    long n;
    if (argc == 2)
        n = atoi(argv[1]);
    else
        n = 4000;
    printf("Inside main\n");
    square(n);
    cube(n);
    return 0;
}
```
Makefile
```make
short: a.out
	time ./a.out 5000
	gprof -Q -b a.out gmon.out
	time ./a.out 20000
	gprof -Q -b a.out gmon.out

long: gmon.out
	gprof a.out gmon.out | less

gmon.out: a.out
	time ./a.out

a.out: test_gprof.c
	gcc -std=c99 -Wall -pg test_gprof.c

clean:
	rm gmon.out a.out 2>/dev/null
```
Execution: the dominating subroutine changes with problem size (square does n² loop iterations, cube does n³/10⁴; they cross over at n = 10⁴)
```
[profiling]$ make
gcc -std=c99 -Wall -pg test_gprof.c
time ./a.out 5000
[..]
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
70.75       0.07      0.07        1    70.75    70.75  square
30.32       0.10      0.03        1    30.32    30.32  cube
time ./a.out 20000
[..]
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
66.23       1.75      1.75        1     1.75     1.75  cube
34.84       2.67      0.92        1     0.92     0.92  square
```
Use libraries for good reasons
Any programming language can call subroutines of any other programming language
Combine the language you know best with the best-performing tool
In many cases wrapper routines are required, mostly in C (as the language of the operating system)
Difficulties include compiler-specific name mangling, e.g. the trailing underscore that Fortran compilers append to symbol names:
```
[test]$ grep -i dsbevk\( dsbevk.F90
  SUBROUTINE DSBEVK(JOBZ, RANGE, UPLO, B2T, N, ABAND, LDABAND, VL, VU, IL, IU, ABSTOL, NE, W, Z, LDZ, &
[test]$ nm a.out | grep -i dsbevk
0000000000417504 T dsbevk_
000000000041abe1 t dsbevk_._omp_fn.0
```
Another difficulty is the calling convention: Fortran passes all arguments by reference, so a C function callable from Fortran must take pointer arguments (and carry the trailing underscore):
```
[C_Fortran]$ more f.f
      program y
      write(*,'(I1," + ",I1," = ",I1)') 3, 5, intadd(3,5)
      end program
[C_Fortran]$ more c.c
int intadd_(int *a, int *b) { return (*a)+(*b); }
[C_Fortran]$ more makefile
run: a.out
	./a.out

a.out: c.c f.f
	gfortran c.c f.f
[C_Fortran]$ make
gfortran c.c f.f
./a.out
3 + 5 = 8
```
To keep CPUs busy, data has to be kept near the execution units, e.g. in the Level 1 cache
Methods
Column Access
```c
for (j = 0; j < N; j++)
    for (i = 0; i < N; i++)
        A[i][j] += 2.0;
```
Strided memory access in C
Contiguous memory access in Fortran
Row Access
```c
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        A[i][j] += 2.0;
```
Contiguous memory access in C
Strided memory access in Fortran
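To make the difference visible, a minimal timing sketch based on the two loops above (matrix size, timer, and printout are my additions):

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

#define N 4096
static double A[N][N];          /* 128 MB, far larger than any cache */

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void) {
    double t;

    t = now();                  /* row access: contiguous in C */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] += 2.0;
    printf("row-wise:    %.3f s\n", now() - t);

    t = now();                  /* column access: stride of N doubles */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            A[i][j] += 2.0;
    printf("column-wise: %.3f s\n", now() - t);

    printf("checksum: %f\n", A[N-1][N-1]);  /* keep the loops alive */
    return 0;
}
```

Note that an optimizing compiler may interchange such simple loops, so compare the two variants at the same, modest optimization level.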
```c
/* Three nested loops */
N = 10000;
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
            C[i][j] += A[i][k] * B[k][j];
/* Large N => cache thrashing */
```
```c
/* Blocked access */
l = 48; N = 10000;
for (i = 0; i < N; i += l)
    for (j = 0; j < N; j += l)
        for (k = 0; k < N; k += l)
            for (ii = i; ii < min(i+l, N); ii++)
                for (jj = j; jj < min(j+l, N); jj++)
                    for (kk = k; kk < min(k+l, N); kk++)
                        C[ii][jj] += A[ii][kk] * B[kk][jj];
/* Large N => reusing data in fast cache levels */
```
Adapt the block size l to the cache size, separately for each cache level
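A rough sizing rule (an assumption, not from the original slide): the inner kernel touches three l×l tiles of doubles, so 3·l²·8 bytes should fit in the target cache. A 32 KiB L1 cache gives l ≈ √(32768/24) ≈ 36, while the l = 48 above (≈ 54 KiB working set) targets the next cache level.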
Even better: let an optimized BLAS routine do the blocking:
```c
double ONE = 1.0;
/* dgemm(TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC) */
dgemm_("N", "N", &N, &N, &N, &ONE, A, &N, B, &N, &ONE, C, &N);
```
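As a self-contained illustration, a minimal driver calling the Fortran BLAS dgemm from C; the hand-written prototype and the link line (e.g. gcc dgemm_test.c -lopenblas) are assumptions about the local BLAS installation, with the trailing underscore and pass-by-reference convention discussed above:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hand-written prototype for Fortran dgemm (assumption: trailing
   underscore, all arguments passed by reference) */
void dgemm_(const char *transa, const char *transb,
            const int *m, const int *n, const int *k,
            const double *alpha, const double *a, const int *lda,
            const double *b, const int *ldb,
            const double *beta, double *c, const int *ldc);

int main(void) {
    int N = 1000;
    double ONE = 1.0;
    double *A = malloc((size_t)N * N * sizeof *A);
    double *B = malloc((size_t)N * N * sizeof *B);
    double *C = calloc((size_t)N * N, sizeof *C);   /* C starts at 0 */
    for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; }

    /* C := 1.0*A*B + 1.0*C (BLAS uses column-major storage) */
    dgemm_("N", "N", &N, &N, &N, &ONE, A, &N, B, &N, &ONE, C, &N);

    printf("C[0] = %.1f\n", C[0]);   /* expect 2000.0 */
    free(A); free(B); free(C);
    return 0;
}
```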
Small messages are costly
Example: write 100 MB
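The original slide only names the example; the sketch below is a reconstruction of the usual demonstration, comparing one large request against many tiny ones (file names and sizes are illustrative):

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE 100000000L         /* 100 MB */

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void) {
    char *buf = calloc(SIZE, 1);
    double t;

    t = now();                            /* one large request: */
    FILE *f = fopen("big.dat", "wb");     /* limited by throughput */
    fwrite(buf, SIZE, 1, f);
    fclose(f);
    printf("1 write of 100 MB:  %.2f s\n", now() - t);

    t = now();                            /* many tiny requests: */
    f = fopen("small.dat", "wb");         /* dominated by per-call */
    for (long i = 0; i < SIZE; i++)       /* overhead and latency */
        fwrite(buf + i, 1, 1, f);
    fclose(f);
    printf("10^8 writes of 1 B: %.2f s\n", now() - t);

    free(buf);
    return 0;
}
```

The same reasoning applies to network messages: aggregate many small transfers into few large ones.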
```
$ ls /cm/shared/apps/intel
[..]
advisor
[..]
inspector
[..]
vtune
[..]
```
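For example, a command-line hotspots collection with VTune might look as follows (an illustrative invocation; the result-directory name and program arguments are my choices, and older releases provide the same functionality as amplxe-cl):

```
$ vtune -collect hotspots -r r001hs -- ./a.out 20000
```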
https://wiki.vsc.ac.at/doku.php?id=doku:perf-report
https://wiki.vsc.ac.at/doku.php?id=doku:forge