====== Going for efficiency - how to get performance out ======

  * Article written by Dieter Kvasnicka (VSC Team)

===== Overview =====

  * Performance?
  * Know algorithmic complexity
  * Libraries
  * Interoperability

===== Performance? =====

What is performance?

^ Type ^ Theory ^
| **Operations** per second (GFlop/s) | |
| **Memory** throughput (GB/s) | |
| **Network** throughput (GB/s) | |
| **Network** latency (µs) | 1.4-1.8 |
| **Storage** throughput (GB/s) | |
| **Storage** latency (ms) | 0.03 |

"Operations": [..]\\
"Memory": [..]\\
"Network": [..]\\
"Storage": [..]

Which one is the **limiting factor** for a given code?

+ | |||
+ | ===== Know algorithmic complexity ===== | ||
+ | |||
+ | * // | ||
+ | * HPC (High Performance Computing): execution time spent on just **a few lines of code** | ||
+ | |||
+ | * Linear Algebra | ||
+ | * (Dense) matrix multiplication: | ||
+ | * (Dense) matrix vector multiplication: | ||
+ | * Dot product: O(n) | ||
+ | * (Dense) system of linear equations / least squares: O(n^3) | ||
+ | * (Dense) Eigensolver: | ||
+ | * FFT: O(n * log(n)) | ||
+ | * Sparse algorithms: O(n) .. O(n^3) | ||
+ | * .. | ||
+ | |||
+ | |||
===== Know algorithmic complexity (2) =====

  * Other codes
    * Identify the **main parameters**
    * Identify the **scaling** (usually polynomial)

  * Large problem size => the subalgorithm with the highest exponent of complexity becomes dominant
  * **Measure** compute time, preferably with a profiler
  * Beware: **cache effects** can 'change' the apparent algorithmic complexity
  * Often the algorithmic complexity is well known and documented

===== Know algorithmic complexity - Example =====

Example: program with subprograms of different algorithmic complexity, O(n^2) and O(n^3)

<code c>
#include <stdio.h>
#include <stdlib.h>

static double sink;   /* keeps the compiler from removing the loops */

static void square(long n)                 /* O(n^2) iterations */
{ printf("square\n");
  for(long i=0; i<n*n; i++) sink+=1.0;
}
void cube(long n)                          /* O(n^3) iterations, scaled down by a constant */
{ printf("cube\n");
  for(long i=0; i<n*n*n/10000; i++) sink+=1.0;
}
int main(int argc, char *argv[])
{ long n=0;
  if(argc == 2) n=atoi(argv[1]);
  printf("n=%ld\n", n);
  square(n);
  cube(n);
  return 0;
}
</code>
===== Know algorithmic complexity - Example (2) =====

Makefile

<code make>
short: a.out
	time ./a.out 5000
	gprof -Q -b a.out gmon.out
	time ./a.out 20000
	gprof -Q -b a.out gmon.out

long: gmon.out
	gprof a.out gmon.out | less

gmon.out: a.out
	time ./a.out 5000

a.out: test_gprof.c
	gcc -std=c99 -Wall -pg test_gprof.c

clean:
	rm gmon.out a.out 2>/dev/null
</code>
===== Know algorithmic complexity - Example (3) =====

Execution: the dominating subroutine changes with the problem size

<code>
[profiling]$ make
gcc -std=c99 -Wall -pg test_gprof.c
time ./a.out 5000
[..]
  %   cumulative   self
 time   seconds   seconds   calls   name
 [..]                               square
 [..]                               cube
time ./a.out 20000
[..]
  %   cumulative   self
 time   seconds   seconds   calls   name
 [..]                               cube
 [..]                               square
</code>
===== Libraries =====

Use libraries for good reasons:

  * execution performance
  * efficient development
  * well tested
  * portability
  * maintenance (by others!)
  * ..

===== Very important libraries =====

  * **Level 3 BLAS** (= "Matrix Multiplication")
    * O(n^3) operations with O(n^2) memory accesses
    * portable performance (often faster by a factor of 10)
    * including triangular and symmetric / hermitian matrices
    * any three-loop operation of the form $C=\alpha AB + \beta C$ can be written using Level 3 BLAS

  * MKL: free **Math Kernel Library** from Intel
    * Linear Algebra
    * FFT
    * Neural networks
    * Statistics
    * Sparse algorithms
    * Parallel
    * ..
  * **HDF5**: Hierarchical Data Format: efficient and portable I/O library

===== Interoperability =====

Any programming language can call subroutines written in any other programming language:

  * C + Fortran
  * Matlab + C
  * Octave + C++
  * Java + MKL
  * Python + R
  * Haskell + Assembler
  * Javascript + Nvidia Cuda

Combine your **best known** tool with the **best performing** tool

===== Interoperability (2) =====

==== Wrapper ====

In many cases wrapper routines are required, mostly written in C (as the language of the operating system).

Difficulties can be:

  * Naming schemes
    * UPPER vs. lower case
    * added underscores

<code>
[test]$ grep -i dsbevk\( dsbevk.F90
      SUBROUTINE DSBEVK(JOBZ, [..]
[test]$ nm a.out | grep -i dsbevk
0000000000417504 T dsbevk_
000000000041abe1 t dsbevk_._omp_fn.0
</code>
===== Interoperability (2a) =====

==== Wrapper ====

In many cases wrapper routines are required, mostly written in C (as the language of the operating system).

Difficulties can be:

  * Naming schemes
  * Call by value / call by reference
  * Order of parameters: the size of string parameters is sometimes passed at the very end of the parameter list
  * I/O libraries
    * Hint: use the I/O of only one language, or
    * link with the correct runtime libraries, or
    * link by calling the compiler (e.g. gfortran) instead of the linker (e.g. ld)
  * Copy-in and/or copy-out: can affect performance
  * Class/ [..]

===== Interoperability (3) =====

==== Wrapper example ====

<code fortran>
[C_Fortran]$ more f.f
      program y
      write(*,'(I1," + ",I1," = ",I1)') 3, 5, intadd(3,5)
      end program
</code>
<code c>
[C_Fortran]$ more c.c
int intadd_(int *a, int *b) { return (*a)+(*b); }
</code>
<code make>
[C_Fortran]$ more makefile
run: a.out
	./a.out

a.out: c.c f.f
	gfortran c.c f.f
</code>
<code>
[C_Fortran]$ make
gfortran c.c f.f
./a.out
3 + 5 = 8
</code>
===== Memory optimization =====

To keep the CPUs busy, data has to be kept near the execution units, e.g. in the **Level 1 Cache**

Methods:

  * Optimized libraries
  * Blocking
  * Reordering of operations
  * Reuse data as much as possible
  * Reuse cache lines (64 byte = 8 doubles): the smallest unit of memory access
  * Access by row / column (depending on the chosen language)
  * ..

===== Example: Row - Column =====

Column Access

<code c>
for(j=0; j<N; j++)
  for(i=0; i<N; i++)
    A[i][j]+=2.0;
</code>
Strided memory access in C\\
Contiguous memory access in Fortran

Row Access

<code c>
for(i=0; i<N; i++)
  for(j=0; j<N; j++)
    A[i][j]+=2.0;
</code>
Contiguous memory access in C\\
Strided memory access in Fortran

===== Example: Blocking =====

<code c>
//Three nested loops
N=10000;
for(i=0; i<N; i++)
  for(j=0; j<N; j++)
    for(k=0; k<N; k++)
      C[i][j]+=A[i][k]*B[k][j];
//Large N => cache thrashing
</code>

<code c>
//Blocked access
l=48; N=10000;
for(i=0; i<N; i+=l)
  for(j=0; j<N; j+=l)
    for(k=0; k<N; k+=l)
      for(ii=i; ii<i+l && ii<N; ii++)
        for(jj=j; jj<j+l && jj<N; jj++)
          for(kk=k; kk<k+l && kk<N; kk++)
            C[ii][jj]+=A[ii][kk]*B[kk][jj];
//Large N => reuses data in the fast cache levels
</code>
Adapt the parameter l to the size of the cache, separately for each cache level.

Even better: use **dgemm** (Level 3 BLAS)

<code c>
double ONE=1.0;
dgemm_("N", "N", &N, &N, &N, &ONE, A, &N, B, &N, &ONE, C, &N);
// dgemm (TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC)
</code>

===== Storage and Network =====

Small messages are costly.

Example: Write 100MB

  * in one block to storage: 100MB / 2GB/s + 2µs = 50.002ms
  * in 1000000 pieces of 100B each: 1000000 * 100B / 2GB/s + 1000000 * 2µs = 2050ms (blocking the file server)

===== Tools =====

  * Debugger
    * gdb
    * Allinea ddt
  * Profiler
    * gprof
    * Allinea Performance Reports: quick performance overview (Runtime - Memory - Threads - I/O - MPI)
  * Instrumentation
    * Allinea map
    * Intel performance tools

<code>
$ ls /
[..] advisor [..] inspector [..] vtune [..]
</code>

  * VSC School Trainings upon request

https://