====== Going for efficiency - how to get performance out ======

  * Article written by Dieter Kvasnicka (VSC Team) \\ (last update 2017-10-17 by dk).

===== Overview =====

  * Performance?
  * Know algorithmic complexity
  * Libraries
  * Interoperability

===== Performance? =====

What is performance?

^Type                                      ^Theory   ^Measured  ^Well-tuned Application  ^Untuned Application  ^
|**Operations** per second (GFlop/s)       |384      |340       |340                     |30 / 2               |
|**Memory** throughput (GB/s)              |120      |80        |80                      |5                    |
|**Network** throughput (GB/s)             |2*4      |2*3.4     |2*2                     |0.2 / 0.0002         |
|**Network** latency (µs)                  |1.4-1.8  |1.4-1.8   |1.4-1.8                 |2                    |
|**Storage** throughput (“/global”, GB/s)  |20       |15        |10                      |1                    |
|**Storage** latency (ms)                  |0.03     |1         |1                       |1000                 |

“Operations”: floating point, 16 cores @ 3 GHz; values per node\\
“Memory”: mixture of three cache levels and main memory; values per node\\
“Network”: medium shared by all users (large jobs only); values per node\\
“Storage”: medium shared by all users; values per application

Which one is the **limiting factor** for a given code?
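
A rough answer comes from comparing the machine balance of a node from the table above (≈ 340 GFlop/s ÷ 80 GB/s ≈ 4 flops per byte, using the measured values) with the arithmetic intensity of the hot loop. A minimal sketch with an illustrative kernel (the kernel and its numbers are not from the course material):

<code c>
/* Vector triad: per iteration 2 flops and 24 bytes of traffic
 * (load b[i] and c[i], store a[i]; write-allocate ignored).
 * Arithmetic intensity: 2/24 ≈ 0.08 flops/byte -- far below the
 * machine balance of ~4 flops/byte, so this loop is memory-bound:
 * at 80 GB/s it can reach at most ~6.7 GFlop/s, not 340. */
void triad(long n, double *a, const double *b, const double *c, double s)
{
    for (long i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];
}
</code>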

===== Know algorithmic complexity =====

  * //Algorithmic complexity//: scaling with varying input parameters
  * HPC (High Performance Computing): execution time is spent on just **a few lines of code**

  * Linear Algebra
    * (Dense) matrix multiplication: O(n^3)
    * (Dense) matrix-vector multiplication: O(n^2)
    * Dot product: O(n)
    * (Dense) system of linear equations / least squares: O(n^3)
    * (Dense) eigensolver: O(n^3)
    * FFT: O(n * log(n))
    * Sparse algorithms: O(n) .. O(n^3)
    * ..

===== Know algorithmic complexity (2) =====

  * Other codes
    * Identify the **main parameters**
    * Identify the **scaling** (usually polynomial)

  * Large problem size => the subalgorithm with the highest exponent becomes dominant
    * **Measure** the compute time, preferably with a profiler
    * Beware: **cache effects** can ‘change’ the apparent algorithmic complexity
  * Often: the algorithmic complexity is well known and documented

===== Know algorithmic complexity - Example =====

Example: program with subprograms, running with varying problem size

<code c>
#include <stdio.h>
#include <stdlib.h>

/* Note: compiled without -O, so the empty loops are not removed. */

/* O(n^2): empty busy loop */
static void square(long n)
{ printf("Inside square \n");
  for(long i=0;i<n*n;i++);
}
/* O(n^3), scaled down by a factor of 10000 */
void cube(long n)
{ printf("Inside cube \n");
  for(long i=0;i<n*n/10000*n;i++);
}
int main(int argc, char *argv[])
{ long n;
  if(argc == 2) n=atoi(argv[1]); else n=4000;
  printf("Inside main\n");
  square(n);
  cube(n);
  return 0;
}
</code>
===== Know algorithmic complexity - Example (2) =====

Makefile

<code>
short:  a.out
        time ./a.out 5000
        gprof -Q -b a.out gmon.out
        time ./a.out 20000
        gprof -Q -b a.out gmon.out

long:   gmon.out
        gprof a.out gmon.out | less

gmon.out:       a.out
        time ./a.out

a.out:  test_gprof.c
        gcc -std=c99 -Wall -pg test_gprof.c

clean:
        rm -f gmon.out a.out
</code>
===== Know algorithmic complexity - Example (3) =====

Execution: the dominating subroutine changes with problem size

<code>
[profiling]$ make
gcc -std=c99 -Wall -pg test_gprof.c
time ./a.out 5000
[..]
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 70.75      0.07     0.07        1    70.75    70.75  square
 30.32      0.10     0.03        1    30.32    30.32  cube
time ./a.out 20000
[..]
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 66.23      1.75     1.75        1     1.75     1.75  cube
 34.84      2.67     0.92        1     0.92     0.92  square
</code>
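These timings match the expected exponents: the problem size grows by a factor of 4 (5000 → 20000), square grows from 0.07 s to 0.92 s and cube from 0.03 s to 1.75 s, i.e. empirical exponents of log(0.92/0.07)/log(4) ≈ 1.9 ≈ 2 and log(1.75/0.03)/log(4) ≈ 2.9 ≈ 3.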
===== Libraries =====

Use libraries, for good reasons:

  * execution performance
  * efficient development
  * well tested
  * portability
  * maintenance (by others!)
  * ..

===== Very important libraries =====

  * **Level 3 BLAS** (= “matrix multiplication”)
    * O(n^3) operations with only O(n^2) memory accesses
    * portable performance (often faster by a factor of 10)
    * including triangular and symmetric / hermitian matrices
    * any three-loop operation of the form $C=\alpha AB + \beta C$ can be written using Level 3 BLAS (see the sketch below)

  * MKL: Intel's free **Math Kernel Library**
    * Linear Algebra
    * FFT
    * Neural networks
    * Statistics
    * Sparse algorithms
    * Parallel
    * ..
  * **HDF5**: Hierarchical Data Format: an efficient and portable I/O library
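
A minimal sketch of such a Level 3 BLAS call through the CBLAS interface (available in MKL and OpenBLAS); the matrix size is a placeholder:

<code c>
#include <stdlib.h>
#include <cblas.h>              /* with MKL, mkl.h also works */

int main(void)
{
    int n = 1000;
    double *A = malloc((size_t)n*n*sizeof(double));
    double *B = malloc((size_t)n*n*sizeof(double));
    double *C = calloc((size_t)n*n, sizeof(double));
    /* ... fill A and B ... */

    /* C = 1.0*A*B + 0.0*C in one call, row-major layout */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    free(A); free(B); free(C);
    return 0;
}
</code>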

===== Interoperability =====

Any programming language can call subroutines written in any other programming language

  * C + Fortran
  * Matlab + C
  * Octave + C++
  * Java + MKL
  * Python + R
  * Haskell + Assembler
  * Javascript + Nvidia Cuda

Combine the tool you **know best** with the tool that **performs best**


===== Interoperability (2) =====

==== Wrapper ====

In many cases wrapper routines are required, mostly written in C (as the language of the operating system)

Difficulties can be

  * Naming schemes
    * UPPER vs. lower case
    * added underscores

<code>
[test]$ grep -i dsbevk\( dsbevk.F90
SUBROUTINE DSBEVK(JOBZ, RANGE, UPLO, B2T, N, ABAND, LDABAND, VL, VU, IL, IU, ABSTOL, NE, W, Z, LDZ, &
[test]$ nm a.out | grep -i dsbevk
0000000000417504 T dsbevk_
000000000041abe1 t dsbevk_._omp_fn.0
</code>
===== Interoperability (2a) =====

==== Wrapper ====

In many cases wrapper routines are required, mostly written in C (as the language of the operating system)

Difficulties can be

  * Naming schemes
  * Call by value / call by reference
  * Order of parameters: the sizes of string parameters sometimes go at the very end of the parameter list (see the sketch after this list)
  * I/O libraries
    * Hint: use the I/O of only one language, or
    * link with the correct runtime libraries, or
    * link by calling the compiler (e.g. gfortran) instead of the linker (e.g. ld)
  * Copy-in and/or copy-out: can affect performance
  * Class/module information in objects (“*.o”) is only useful in one language
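
As a sketch of the naming and string-length points above, a plausible C prototype for calling a Fortran routine with a character argument via gfortran's conventions (the routine name is hypothetical, and the exact type of the hidden length argument differs between compiler versions):

<code c>
/* Fortran side (hypothetical):
 *   SUBROUTINE SOLVE(JOBZ, N, A)   ! JOBZ is CHARACTER, N INTEGER
 * C side: lowercase name plus trailing underscore, all arguments
 * passed by reference, and one hidden string-length argument per
 * CHARACTER parameter appended at the end of the list. */
void solve_(const char *jobz, const int *n, double *a,
            long jobz_len);   /* int on older gfortran, size_t on newer */

void call_solve(int n, double *a)
{
    solve_("V", &n, a, 1L);   /* length of "V" passed explicitly */
}
</code>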

===== Interoperability (3) =====

==== Wrapper example ====

<code fortran>
[C_Fortran]$ more f.f
        program y
        write(*,'(I1," + ",I1," = ",I1)') 3, 5, intadd(3,5)
        end program
</code>
<code c>
[C_Fortran]$ more c.c
int intadd_(int *a, int *b) { return (*a)+(*b); }
</code>
<code>
[C_Fortran]$ more makefile
run:    a.out
        ./a.out

a.out:  c.c f.f
        gfortran c.c f.f
</code>
<code>
[C_Fortran]$ make
gfortran c.c f.f
./a.out
3 + 5 = 8
</code>
===== Memory optimization =====

To keep the CPUs busy, data has to be kept near the execution units, e.g. in the **Level 1 Cache**

Methods

  * Optimized libraries
  * Blocking
    * Reordering of operations
    * Reuse data as much as possible
    * Reuse cache lines (64 bytes = 8 doubles): the smallest unit of memory access
  * Access by row / column (depending on the chosen language)
  * ..

===== Example: Row - Column =====

Column Access

<code c>
for(j=0;j<N;j++)
  for(i=0;i<N;i++)
    A[i][j]+=2.0;
</code>
Strided memory access in C\\
Contiguous memory access in Fortran


Row Access

<code c>
for(i=0;i<N;i++)
  for(j=0;j<N;j++)
    A[i][j]+=2.0;
</code>
Contiguous memory access in C\\
Strided memory access in Fortran

===== Example: Blocking =====

<code c>
//Three nested loops
N=10000;
for(i=0;i<N;i++)
  for(j=0;j<N;j++)
    for(k=0;k<N;k++)
      C[i][j]+=A[i][k]*B[k][j];
//Large N => cache thrashing
</code>

<code c>
//Blocked access
#define min(a,b) ((a)<(b)?(a):(b))
l=48; N=10000;
for(i=0;i<N;i+=l)
  for(j=0;j<N;j+=l)
    for(k=0;k<N;k+=l)
      for(ii=i;ii<min(i+l,N);ii++)
        for(jj=j;jj<min(j+l,N);jj++)
          for(kk=k;kk<min(k+l,N);kk++)
            C[ii][jj]+=A[ii][kk]*B[kk][jj];
//Large N => reuses data in the fast cache levels
</code>
Adapt the block size l to the cache size, separately for each cache level



Even better: call an optimized library routine

<code c>
double ONE=1.0;
// Fortran BLAS expects all arguments by reference
dgemm_("N","N",&N,&N,&N,&ONE,A,&N,B,&N,&ONE,C,&N);
// dgemm(TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC)
</code>

===== Storage and Network =====

Small messages are costly

Example: write 100 MB

  * in one block to storage: 100MB / 2GB/s + 2µs = 50.002ms
  * in 1000000 blocks of 100B each: 1000000 * (100B / 2GB/s) + 1000000 * 2µs = 2050ms (blocking the file server)
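
A minimal sketch of the same idea in code (file name and sizes are placeholders): issue one large request instead of a million small ones, so the per-access latency is paid once.

<code c>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    size_t total = 100UL*1000*1000;                 /* 100 MB */
    char *buf = calloc(total, 1);
    int fd = open("out.dat", O_WRONLY|O_CREAT|O_TRUNC, 0644);

    /* Costly: a million system calls of 100 bytes each,
     * each paying the per-access latency:
     *   for (size_t i = 0; i < total; i += 100)
     *       write(fd, buf + i, 100);
     */

    /* Cheap: one large request, latency paid once */
    write(fd, buf, total);

    close(fd);
    free(buf);
    return 0;
}
</code>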

===== Tools =====

  * Debugger
    * gdb
    * Allinea DDT
  * Profiler
    * gprof
    * Allinea Performance Reports: quick performance overview (Runtime - Memory - Threads - I/O - MPI)
  * Instrumentation
    * Allinea MAP
  * Intel performance tools

<code>
$ ls /cm/shared/apps/intel
[..] advisor [..] inspector [..] vtune [..]
</code>
  * VSC School Trainings upon request

https://wiki.vsc.ac.at/doku.php?id=doku:perf-report – https://wiki.vsc.ac.at/doku.php?id=doku:forge
  