===== Performance Tests =====

This section presents the results of several performance tests.

==== Matrix diagonalisation ====

=== Libraries ===
The following MPI versions and libraries were used:

**qlogic:** 
  * vsc1: qlogicmpi_intel-0.1.0 (compiled with intel ifort)
  * vsc2: qlogicmpi_intel-3.0.1 (compiled with intel ifort)

**mvapich2:**
  * vsc1: mvapich2_intel_qlc-1.6  (compiled with intel ifort for qlogic)
  * vsc2: mvapich2_1.8a2_intel_limic (compiled with ifort and limic module)

**impi:**
  * vsc2: impi_4.0.1.007 (Intel MPI)


**sca:**
  * scalapack 2.0.0 compiled using Intel ifort, GotoBLAS2 and lapack-3.3.1
  
**mkl:**
  * vsc1: Intel mkl libraries version 11.1/046
  * vsc2: Intel mkl libraries version 2011_sp1.9.293
  
  
**elpa:**

ELPA was compiled using **sca** from above plus the mkl libraries; when only the mkl libraries were used, BLACS errors occurred.


=== Timings Blocksize 64 ===

In a small test program, a matrix of size N x N with N = 512, 1024, 2048, 4096 was set up randomly and diagonalized using 
PZHEEVX from ScaLAPACK and solve_evp_complex_2stage from ELPA. The timings cover only the diagonalization part.

For each number of cores, all possible processor row/column combinations with rows, columns = 1, 2, 4, 8, 16, 32, 64 were run. The plots 
show only the lowest time obtained for each number of cores.
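
The enumeration of process grids described above can be sketched as follows. This is a minimal illustration, not the test program itself: the function names ''grid_combinations'' and ''fastest_grid'' are hypothetical, and the timing dictionary is assumed to be filled by the actual benchmark runs.

```python
# Allowed numbers of process rows and columns, as stated in the text.
ALLOWED = (1, 2, 4, 8, 16, 32, 64)


def grid_combinations(cores):
    """Return all (rows, cols) pairs with rows * cols == cores,
    where both rows and cols are taken from ALLOWED."""
    return [
        (rows, cores // rows)
        for rows in ALLOWED
        if cores % rows == 0 and cores // rows in ALLOWED
    ]


def fastest_grid(timings):
    """Given a dict {(rows, cols): seconds}, return the (grid, time)
    pair with the lowest time. Only this minimum enters the plots."""
    grid = min(timings, key=timings.get)
    return grid, timings[grid]


if __name__ == "__main__":
    for cores in (16, 32, 64, 128, 256):
        print(cores, grid_combinations(cores))
```

For 16 cores this yields the five grids 1x16, 2x8, 4x4, 8x2 and 16x1; for 256 cores only grids between 4x64 and 64x4 remain, because neither dimension may exceed 64.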


Absolute timings of the different subroutines:

{{:doku:sca:performance_sca_absolute_512.ps.jpg?300|}}
{{:doku:sca:performance_sca_absolute_1024.ps.jpg?300|}}

{{:doku:sca:performance_sca_absolute_2048.ps.jpg?300|}}
{{:doku:sca:performance_sca_absolute_4096.ps.jpg?300|}}


Scaling of the runtimes relative to the calculation with 16 cores:

{{:doku:sca:performance_sca_scaling_512.ps.jpg?300|}}
{{:doku:sca:performance_sca_scaling_1024.ps.jpg?300|}}

{{:doku:sca:performance_sca_scaling_2048.ps.jpg?300|}}
{{:doku:sca:performance_sca_scaling_4096.ps.jpg?300|}}


=== Timings Blocksize optimized ===

For qlogic MPI we also tested the influence of different blocksizes on VSC-1 and 
VSC-2. The runs were performed as above, but with blocksizes = 2, 4, 8, 16, 32, 64. 
The plots and tables show the lowest timing obtained for each matrix size and number 
of cores. 

{{:doku:sca:optimized_blocksize_512.ps.jpg?300|}}
{{:doku:sca:optimized_blocksize_1024.ps.jpg?300|}}

{{:doku:sca:optimized_blocksize_2048.ps.jpg?300|}}
{{:doku:sca:optimized_blocksize_4096.ps.jpg?300|}}
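
As background on what the blocksize controls, the sketch below reproduces in Python the arithmetic of ScaLAPACK's ''NUMROC'' helper, which gives the number of rows (or columns) of a block-cyclically distributed matrix held by one process row (or column). It is an illustration of the data layout only; whether this layout effect accounts for the measured differences is an assumption that the timings here do not confirm.

```python
def numroc(n, nb, iproc, isrcproc, nprocs):
    """Number of rows/columns of an n-sized dimension, distributed
    block-cyclically with block size nb over nprocs processes,
    that land on process iproc (isrcproc holds the first block)."""
    mydist = (nprocs + iproc - isrcproc) % nprocs
    nblocks = n // nb                      # number of complete blocks
    local = (nblocks // nprocs) * nb       # blocks every process gets
    extrablks = nblocks % nprocs           # leftover complete blocks
    if mydist < extrablks:
        local += nb                        # one extra complete block
    elif mydist == extrablks:
        local += n % nb                    # the trailing partial block
    return local


if __name__ == "__main__":
    n, nprow = 512, 16
    for nb in (2, 64):
        rows = [numroc(n, nb, p, 0, nprow) for p in range(nprow)]
        print(f"N={n}, blocksize={nb}: rows per process row = {rows}")
```

With N = 512 and 16 process rows, a blocksize of 64 gives only 8 blocks, so half of the process rows hold no rows at all, while a blocksize of 2 gives each process row 32 rows.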

<code>
#Data obtained from VSC-1 with qlogic MPI:


cores       time         blocksize
        SCA      ELPA    SCA   ELPA
------------------------------------
Matrix Size 512:
  16    0.081    0.072    16     8 
  32    0.087    0.059    32     4 
  64    0.085    0.049    32     2 
 128    0.093    0.043     4     4 
 256    0.114    0.040    32     8 
------------------------------------
Matrix Size 1024:
  16    0.320    0.402    16     2 
  32    0.274    0.263    32     2 
  64    0.245    0.187    32     4 
 128    0.249    0.153    32     8 
 256    0.273    0.120    32     2 
------------------------------------
Matrix Size 2048:
  16    1.699    2.565    16     2 
  32    1.148    1.498    32     4 
  64    0.856    0.907    32     4 
 128    0.749    0.613    32     4 
 256    0.666    0.442    32     8 
------------------------------------
Matrix Size 4096:
  16   11.921   17.662    32     8 
  32    6.559    9.710    32    16 
  64    4.101    5.549    16     2 
 128    2.837    3.264    32    16 
 256    2.136    2.066    16     4 
</code>
<code>
#Data obtained from VSC-2 with qlogic MPI:


cores       time         blocksize
        SCA      ELPA    SCA   ELPA
------------------------------------
Matrix Size 512:  
  16    0.101    0.097    16     4 
  32    0.096    0.077    16     2 
  64    0.090    0.066     8     4 
 128    0.109    0.058    16     4 
 256    0.126    0.054     4     4 
------------------------------------
Matrix Size 1024:
  16    0.423    0.525    16     4 
  32    0.312    0.341    16     4 
  64    0.249    0.254    16     4 
 128    0.266    0.189     8     4 
 256    0.251    0.148     8     8
------------------------------------
Matrix Size 2048:
  16    2.448    3.264    32     4 
  32    1.460    1.974    16     4 
  64    0.987    1.173    16    16 
 128    0.848    0.777    16     8 
 256    0.671    0.545     4     4 
------------------------------------
Matrix Size 4096:
  16   19.075   22.678    32     2 
  32   10.114   12.827    32     8 
  64    5.705    7.059    32     8 
 128    3.463    4.288    16    16 
 256    2.461    2.624    16     2 
</code>
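
The scaling relative to 16 cores, as used in the scaling plots, can also be computed from the tabulated values. The sketch below uses the VSC-1 timings for matrix size 4096 from the first table; the helper names ''speedup'' and ''efficiency'' are chosen for illustration, and the efficiency definition (speedup divided by the increase in core count) is a common convention assumed here rather than taken from the measurements.

```python
# VSC-1, qlogic MPI, matrix size 4096 (values copied from the table above).
CORES = (16, 32, 64, 128, 256)
SCA_4096 = (11.921, 6.559, 4.101, 2.837, 2.136)
ELPA_4096 = (17.662, 9.710, 5.549, 3.264, 2.066)


def speedup(times):
    """Speedup of each run relative to the first entry (16 cores)."""
    return [times[0] / t for t in times]


def efficiency(cores, times):
    """Parallel efficiency relative to the first entry:
    speedup divided by the factor by which the core count grew."""
    return [s * cores[0] / c for s, c in zip(speedup(times), cores)]


if __name__ == "__main__":
    for name, times in (("SCA", SCA_4096), ("ELPA", ELPA_4096)):
        for c, s, e in zip(CORES, speedup(times), efficiency(CORES, times)):
            print(f"{name:5s} {c:4d} cores: speedup {s:5.2f}, efficiency {e:4.2f}")
```

Replacing the two timing tuples with another block of either table gives the corresponding figures for the other matrix sizes or for VSC-2.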