====== Going for efficiency - how to get performance out ======

  * Article written by Dieter Kvasnicka (VSC Team) \\ (last update 2017-10-17 by dk).

===== Overview =====

  * Performance?
  * Know algorithmic complexity
  * Libraries
  * Interoperability

===== Performance? =====

What is performance?

^Type                                      ^Theory   ^Measured  ^Well-tuned Application  ^Untuned Application  ^
|**Operations** per second (GFlop/s)       |384      |340       |340                     |30 / 2               |
|**Memory** throughput (GB/s)              |120      |80        |80                      |5                    |
|**Network** throughput (GB/s)             |2*4      |2*3.4     |2*2                     |0.2 / 0.0002         |
|**Network** latency (µs)                  |1.4-1.8  |1.4-1.8   |1.4-1.8                 |2                    |
|**Storage** throughput (“/global”, GB/s)  |20       |15        |10                      |1                    |
|**Storage** latency (ms)                  |0.03     |1         |1                       |1000                 |

“Operations”: floating point, 16 cores @ 3GHz; values per node\\
“Memory”: mixture of three cache levels and main memory; values per node\\
“Network”: medium shared by all users (large jobs only); values per node\\
“Storage”: medium shared by all users; values per application

Which one is the **limiting factor** for a given code?

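One way to find out is to compare the throughput a code actually achieves with what the hardware can sustain. Below is a minimal sketch in the spirit of the STREAM triad benchmark (not part of the original material) that measures sustainable memory bandwidth; compile with optimization, e.g. gcc -O2 -std=c99 triad.c.

<code>
/* triad.c - minimal memory bandwidth sketch (in the spirit of STREAM) */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 20000000L   /* 3 arrays x 160 MB: far larger than any cache */

int main(void)
{ double *a = malloc(N*sizeof *a), *b = malloc(N*sizeof *b), *c = malloc(N*sizeof *c);
  for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (long i = 0; i < N; i++)
    a[i] = b[i] + 3.0*c[i];                /* triad: 2 loads + 1 store */
  clock_gettime(CLOCK_MONOTONIC, &t1);

  double s = (t1.tv_sec - t0.tv_sec) + 1e-9*(t1.tv_nsec - t0.tv_nsec);
  /* print one element so the loop cannot be optimized away */
  printf("a[%ld]=%.1f, triad: %.1f GB/s\n",
         N/2, a[N/2], 3.0*N*sizeof(double)/s/1e9);
  free(a); free(b); free(c);
  return 0;
}
</code>
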
===== Know algorithmic complexity =====

  * //Algorithmic complexity//: scaling with varying input parameters
  * HPC (High Performance Computing): execution time is spent on just **a few lines of code**

  * Linear Algebra
    * (Dense) matrix multiplication: O(n^3)
    * (Dense) matrix vector multiplication: O(n^2)
    * Dot product: O(n)
    * (Dense) system of linear equations / least squares: O(n^3)
    * (Dense) Eigensolver: O(n^3)
    * FFT: O(n * log(n))
    * Sparse algorithms: O(n) .. O(n^3)
    * ..

===== Know algorithmic complexity (2) =====

  * Other codes
    * Identify **main parameters**
    * Identify **scaling** (usually polynomial)

  * Large problem size => the subalgorithm with the highest exponent of complexity becomes dominant
    * **Measure** compute time, preferably with a profiler (see the rule of thumb below)
    * Beware: **cache effects** can ‘change’ algorithmic complexity
  * Often: algorithmic complexity is well known and documented

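A quick empirical check of the exponent: time two runs with problem sizes $n$ and $2n$. If the dominant term scales like $n^k$, then $t(2n)/t(n) \approx 2^k$, i.e. $k \approx \log_2(t(2n)/t(n))$. This matches the gprof example below: going from n=5000 to n=20000 (a factor of 4), the O(n^3) routine grows by roughly $4^3=64$ (0.03s to 1.75s), the O(n^2) routine only by roughly $4^2=16$ (0.07s to 0.92s).
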
===== Know algorithmic complexity - Example =====

Example: a program with subprograms, run with varying problem sizes

<code>
#include <stdio.h>
#include <stdlib.h>

/* O(n^2) busy loop */
static void square(long n)
{ printf("Inside square \n");
  for(long i=0;i<n*n;i++);
}
/* O(n^3) busy loop (scaled down by 10000 to keep the runtime manageable) */
void cube(long n)
{ printf("Inside cube \n");
  for(long i=0;i<n*n/10000*n;i++);
}
int main(int argc, char *argv[])
{ long n;
  if(argc == 2) n=atoi(argv[1]); else n=4000;
  printf("Inside main\n");
  square(n);
  cube(n);
  return 0;
}
</code>
===== Know algorithmic complexity - Example (2) =====

Makefile

<code>
short:  a.out
        time ./a.out 5000
        gprof -Q -b a.out gmon.out
        time ./a.out 20000
        gprof -Q -b a.out gmon.out

long:   gmon.out
        gprof a.out gmon.out | less

gmon.out:       a.out
        time ./a.out

a.out:  test_gprof.c
        gcc -std=c99 -Wall -pg test_gprof.c

clean:
        rm -f gmon.out a.out
</code>
===== Know algorithmic complexity - Example (3) =====

Execution: dominating subroutine changes with problem size

<code>
[profiling]$ make
gcc -std=c99 -Wall -pg test_gprof.c
time ./a.out 5000
[..]
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 70.75      0.07     0.07        1    70.75    70.75  square
 30.32      0.10     0.03        1    30.32    30.32  cube
time ./a.out 20000
[..]
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 66.23      1.75     1.75        1     1.75     1.75  cube
 34.84      2.67     0.92        1     0.92     0.92  square
</code>
===== Libraries =====

There are good reasons to use libraries:

  * execution performance
  * efficient development
  * well tested
  * portability
  * maintenance (by others!)
  * ..

===== Very important libraries =====

  * **Level 3 BLAS** (= “Matrix Multiplication”)
    * O(n^3) operations with O(n^2) memory accesses
    * portable performance (often faster by a factor of 10)
    * including triangular and symmetric / hermitian matrices
    * any three-loop operation of the form $C=\alpha AB + \beta C$ can be written using Level 3 BLAS

  * MKL: free **Math Kernel Library** from Intel
    * Linear Algebra
    * FFT
    * Neural networks
    * Statistics
    * Sparse algorithms
    * Parallel
    * ..
  * **HDF5**: Hierarchical Data Format: an efficient and portable I/O library (see the sketch below)

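As an illustration of the HDF5 C API, a minimal sketch that writes one small matrix to a dataset (file name, dataset name and sizes are made up for this example; compile with the h5cc wrapper):

<code>
/* write_h5.c - minimal HDF5 sketch: store a 4x6 matrix in one dataset;
   compile e.g. with: h5cc write_h5.c */
#include <hdf5.h>

int main(void)
{ double data[4][6];
  for (int i = 0; i < 4; i++)
    for (int j = 0; j < 6; j++) data[i][j] = i + 0.1*j;

  hid_t file  = H5Fcreate("test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
  hsize_t dims[2] = {4, 6};
  hid_t space = H5Screate_simple(2, dims, NULL);
  hid_t dset  = H5Dcreate2(file, "/matrix", H5T_NATIVE_DOUBLE, space,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
  H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);
  H5Dclose(dset); H5Sclose(space); H5Fclose(file);
  return 0;
}
</code>
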
===== Interoperability =====

Any programming language can call subroutines of any other programming language:

  * C + Fortran
  * Matlab + C
  * Octave + C++
  * Java + MKL
  * Python + R
  * Haskell + Assembler
  * Javascript + Nvidia Cuda

Combine the tool you **know best** with the **best performing** tool.

===== Interoperability (2) =====

==== Wrapper ====

In many cases wrapper routines are required, mostly in C (as the language of the operating system).

Difficulties can be:

  * Naming schemes
    * UPPER vs. lower case
    * added underscores

<code>
[test]$ grep -i dsbevk\( dsbevk.F90
SUBROUTINE DSBEVK(JOBZ, RANGE, UPLO, B2T, N, ABAND, LDABAND, VL, VU, IL, IU, ABSTOL, NE, W, Z, LDZ, &
[test]$ nm a.out | grep -i dsbevk
0000000000417504 T dsbevk_
000000000041abe1 t dsbevk_._omp_fn.0
</code>
===== Interoperability (2a) =====

==== Wrapper ====

In many cases wrapper routines are required, mostly in C (as the language of the operating system).

Difficulties can be:

  * Naming schemes
  * Call by value / call by reference
  * Order of parameters: sizes of string parameters sometimes at the very end of the parameter list
  * I/O libraries
    * Hint: use I/O of only one language, or
    * link with the correct runtime libraries, or
    * link by calling the compiler (e.g. gfortran) instead of the linker (e.g. ld)
  * Copy-in and/or copy-out: can affect performance
  * Class/module information in objects ("*.o") is only useful in one language

===== Interoperability (3) =====

==== Wrapper example ====

<code fortran>
[C_Fortran]$ more f.f
        program y
        write(*,'(I1," + ",I1," = ",I1)') 3, 5, intadd(3,5)
        end program
</code>
<code>
[C_Fortran]$ more c.c
int intadd_(int *a, int *b) { return (*a)+(*b); }
</code>
<code>
[C_Fortran]$ more makefile
run:    a.out
        ./a.out

a.out:  c.c f.f
        gfortran c.c f.f
</code>
<code>
[C_Fortran]$ make
gfortran c.c f.f
./a.out
3 + 5 = 8
</code>
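The opposite direction works analogously (a minimal sketch, with made-up file names): C calls the Fortran function by appending the underscore and passing all arguments by reference.

<code>
/* main.c - call the Fortran function intadd from C:
   gfortran's default mangling is lower case plus a trailing underscore,
   and Fortran passes all arguments by reference */
#include <stdio.h>
extern int intadd_(int *a, int *b);
int main(void)
{ int a = 3, b = 5;
  printf("%d + %d = %d\n", a, b, intadd_(&a, &b));
  return 0;
}
</code>
<code fortran>
c intadd.f - Fortran implementation of intadd
      integer function intadd(a, b)
      integer a, b
      intadd = a + b
      end
</code>
Build and run as above, e.g. with gfortran main.c intadd.f && ./a.out.
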
===== Memory optimization =====

To keep the CPUs busy, data has to be kept near the execution units, e.g. in the **Level 1 Cache**.

Methods:

  * Optimized libraries
  * Blocking
    * Reordering of operations
    * Reuse data as much as possible
    * Reuse cache lines (64 byte = 8 doubles): the smallest unit of memory access
  * Access by row / column (depending on the chosen language)
  * ..

===== Example: Row - Column =====

Column Access

<code>
for(j=0;j<N;j++)
  for(i=0;i<N;i++)
    A[i][j]+=2.0;
</code>
Strided memory access in C\\
Contiguous memory access in Fortran


Row Access

<code>
for(i=0;i<N;i++)
  for(j=0;j<N;j++)
    A[i][j]+=2.0;
</code>
Contiguous memory access in C\\
Strided memory access in Fortran


===== Example: Blocking =====

<code>
//Three nested loops
N=10000;
for(i=0;i<N;i++)
  for(j=0;j<N;j++)
    for(k=0;k<N;k++)
      C[i][j]+=A[i][k]*B[k][j];
//Large N => cache thrashing
</code>

<code>
//Blocked access, e.g. with #define min(a,b) ((a)<(b)?(a):(b))
l=48; N=10000;
for(i=0;i<N;i+=l)
  for(j=0;j<N;j+=l)
    for(k=0;k<N;k+=l)
      for(ii=i;ii<min(i+l,N);ii++)
        for(jj=j;jj<min(j+l,N);jj++)
          for(kk=k;kk<min(k+l,N);kk++)
            C[ii][jj]+=A[ii][kk]*B[kk][jj];
//Large N => reusing data in fast cache levels
</code>
Adapt the parameter l to the size of the cache – separately for each cache level.

Even better: call Level 3 BLAS, here the Fortran interface from C (trailing underscore, all arguments passed by reference):

<code>
double ONE=1.0;
dgemm_("N","N",&N,&N,&N,&ONE,A,&N,B,&N,&ONE,C,&N);
// dgemm(TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC)
</code>

===== Storage and Network =====

Small messages are costly.

Example: write 100MB

  * in one block to storage: 100MB / 2GB/s + 2µs = 50.002ms
  * in 1000000 pieces of 100B: 1000000 * (100B / 2GB/s + 2µs) = 50ms + 2000ms = 2050ms (blocking the file server)

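A minimal sketch to observe this effect (plain POSIX I/O; the file name and the exact timings are illustrative and depend heavily on the file system and its caches):

<code>
/* write_sizes.c - one large write() vs. many small ones */
#define _POSIX_C_SOURCE 199309L
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define SIZE (100L*1024*1024)   /* 100 MB */

static double now(void)
{ struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + 1e-9*ts.tv_nsec;
}

int main(void)
{ char *buf = calloc(SIZE, 1);
  double t;

  int fd = open("out.dat", O_WRONLY|O_CREAT|O_TRUNC, 0644);
  t = now();
  write(fd, buf, SIZE);                    /* one 100 MB request */
  close(fd);
  printf("one block:    %.3f s\n", now()-t);

  fd = open("out.dat", O_WRONLY|O_CREAT|O_TRUNC, 0644);
  t = now();
  for (long i = 0; i < SIZE/100; i++)      /* 10^6 requests of 100 B */
    write(fd, buf+i*100, 100);
  close(fd);
  printf("100 B pieces: %.3f s\n", now()-t);

  free(buf);
  return 0;
}
</code>
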
===== Tools =====

  * Debugger
    * gdb
    * Allinea DDT
  * Profiler
    * gprof
    * Allinea Performance Reports: quick performance overview (Runtime - Memory - Threads - I/O - MPI)
  * Instrumentation
    * Allinea MAP
  * Intel performance tools

<code>
$ ls /cm/shared/apps/intel
[..] advisor [..] inspector [..] vtune [..]
</code>
  * VSC School Trainings upon request

https://wiki.vsc.ac.at/doku.php?id=doku:perf-report – https://wiki.vsc.ac.at/doku.php?id=doku:forge