===== Performance Tests =====
This section presents the results of several performance tests.
==== Matrix diagonalisation ====
=== Libraries ===
The following MPI versions and libraries were used:
**qlogic:**
* vsc1: qlogicmpi_intel-0.1.0 (compiled with intel ifort)
* vsc2: qlogicmpi_intel-3.0.1 (compiled with intel ifort)
**mvapich2:**
* vsc1: mvapich2_intel_qlc-1.6 (compiled with intel ifort for qlogic)
* vsc2: mvapich2_1.8a2_intel_limic (compiled with ifort and limic module)
**impi:**
* vsc2: impi_4.0.1.007 (Intel MPI)
**sca:**
* ScaLAPACK 2.0.0, compiled with Intel ifort, GotoBLAS2, and lapack-3.3.1
**mkl:**
* vsc1: Intel mkl libraries, version 11.1/046
* vsc2: Intel mkl libraries, version 2011_sp1.9.293
**elpa:**
ELPA was compiled using **sca** from above plus the mkl libraries; with the mkl libraries alone, BLACS errors occurred.
=== Timings Blocksize 64 ===
In a small test program, a random N x N matrix with N = 512, 1024, 2048, 4096 was set up and diagonalized using
PZHEEVX from ScaLAPACK and solve_evp_complex_2stage from ELPA. The timings cover only the diagonalization step.
For each number of cores, all possible processor row/column combinations with rows, columns = 1, 2, 4, 8, 16, 32, 64 were run.
The plots show only the lowest time obtained for each case.
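The grid search described above can be sketched as follows. This is only an illustration, not the original test program: ''process_grids'' and ''best_grid'' are hypothetical helper names, and the ''run'' callback stands in for whatever launches and times one diagonalization on a given process grid.

```python
from typing import Callable, List, Tuple

# Grid dimensions tried in the tests (taken from the text above).
ALLOWED_DIMS = (1, 2, 4, 8, 16, 32, 64)


def process_grids(cores: int) -> List[Tuple[int, int]]:
    """Return every (rows, cols) pair from ALLOWED_DIMS with rows * cols == cores."""
    return [(r, c) for r in ALLOWED_DIMS for c in ALLOWED_DIMS if r * c == cores]


def best_grid(cores: int, run: Callable[[int, int], float]) -> Tuple[float, Tuple[int, int]]:
    """Time every grid with the caller-supplied `run` (hypothetical) and keep the fastest."""
    return min((run(r, c), (r, c)) for r, c in process_grids(cores))


if __name__ == "__main__":
    print(process_grids(16))   # [(1, 16), (2, 8), (4, 4), (8, 2), (16, 1)]
    print(process_grids(256))  # [(4, 64), (8, 32), (16, 16), (32, 8), (64, 4)]
```

Because each grid dimension is capped at 64, a run on 256 cores has five candidate grids, from 4 x 64 to 64 x 4.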
Absolute timings of the different subroutines:
{{:doku:sca:performance_sca_absolute_512.ps.jpg?300|}}
{{:doku:sca:performance_sca_absolute_1024.ps.jpg?300|}}
{{:doku:sca:performance_sca_absolute_2048.ps.jpg?300|}}
{{:doku:sca:performance_sca_absolute_4096.ps.jpg?300|}}
Scaling of the runtimes relative to the calculation with 16 cores:
{{:doku:sca:performance_sca_scaling_512.ps.jpg?300|}}
{{:doku:sca:performance_sca_scaling_1024.ps.jpg?300|}}
{{:doku:sca:performance_sca_scaling_2048.ps.jpg?300|}}
{{:doku:sca:performance_sca_scaling_4096.ps.jpg?300|}}
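As a rough guide to reading the scaling plots, the sketch below computes a speedup relative to the 16-core run. It assumes that "scaling" means t(16 cores) / t(p cores); the page does not state the exact definition. The sample input is the ScaLAPACK column for N = 4096 from the VSC-1 table further down, which was measured with optimized blocksizes rather than blocksize 64, so it serves only as example data. The function names are illustrative.

```python
from typing import Dict


def scaling(times: Dict[int, float], base_cores: int = 16) -> Dict[int, float]:
    """Speedup of each run relative to the run on `base_cores` cores."""
    base = times[base_cores]
    return {cores: base / t for cores, t in times.items()}


def efficiency(times: Dict[int, float], base_cores: int = 16) -> Dict[int, float]:
    """Speedup divided by the ideal speedup cores / base_cores."""
    return {cores: s * base_cores / cores
            for cores, s in scaling(times, base_cores).items()}


# Sample input: ScaLAPACK times for N = 4096 on VSC-1 (optimized blocksize table).
sca_4096 = {16: 11.921, 32: 6.559, 64: 4.101, 128: 2.837, 256: 2.136}

if __name__ == "__main__":
    speedup = scaling(sca_4096)
    eff = efficiency(sca_4096)
    for cores in sorted(sca_4096):
        print(f"{cores:4d} cores: speedup {speedup[cores]:5.2f}, efficiency {eff[cores]:4.2f}")
```

An efficiency of 1.0 would correspond to ideal scaling, where doubling the cores halves the runtime.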
=== Timings Blocksize optimized ===
For qlogic MPI we also tested the influence of different blocksizes on VSC-1 and
VSC-2. The runs were performed as above, but with blocksizes of 2, 4, 8, 16, 32, and 64.
The plots and tables show the lowest timing obtained for each matrix size and number
of cores, together with the blocksize at which it was obtained.
{{:doku:sca:optimized_blocksize_512.ps.jpg?300|}}
{{:doku:sca:optimized_blocksize_1024.ps.jpg?300|}}
{{:doku:sca:optimized_blocksize_2048.ps.jpg?300|}}
{{:doku:sca:optimized_blocksize_4096.ps.jpg?300|}}
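To make the role of the blocksize concrete: ScaLAPACK distributes matrices block-cyclically, so the blocksize determines which process owns which rows and columns. The sketch below shows that mapping for one dimension. It assumes 0-based indices and that the first block sits on process 0; ''owner'' and ''local_count'' are illustrative names, not library routines.

```python
def owner(global_index: int, blocksize: int, nprocs: int) -> int:
    """Process coordinate owning a 0-based global row (or column) index.

    Assumes the first block is stored on process 0.
    """
    return (global_index // blocksize) % nprocs


def local_count(n: int, blocksize: int, proc: int, nprocs: int) -> int:
    """Number of the n rows (or columns) that land on process `proc`."""
    return sum(1 for i in range(n) if owner(i, blocksize, nprocs) == proc)


if __name__ == "__main__":
    # 10 rows, blocksize 2, 2 process rows
    print([owner(i, 2, 2) for i in range(10)])  # [0, 0, 1, 1, 0, 0, 1, 1, 0, 0]
    print(local_count(10, 2, 0, 2), local_count(10, 2, 1, 2))  # 6 4
```

In general, smaller blocks interleave the matrix more finely across processes, while larger blocks give each process bigger contiguous pieces to work on. Which effect dominates depends on matrix size and core count, which is presumably why the best blocksize varies in the tables below.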
Data obtained on VSC-1 with qlogic MPI:
^ Matrix size ^ Cores ^ Time SCA ^ Time ELPA ^ Blocksize SCA ^ Blocksize ELPA ^
| 512 | 16 | 0.081 | 0.072 | 16 | 8 |
| 512 | 32 | 0.087 | 0.059 | 32 | 4 |
| 512 | 64 | 0.085 | 0.049 | 32 | 2 |
| 512 | 128 | 0.093 | 0.043 | 4 | 4 |
| 512 | 256 | 0.114 | 0.040 | 32 | 8 |
| 1024 | 16 | 0.320 | 0.402 | 16 | 2 |
| 1024 | 32 | 0.274 | 0.263 | 32 | 2 |
| 1024 | 64 | 0.245 | 0.187 | 32 | 4 |
| 1024 | 128 | 0.249 | 0.153 | 32 | 8 |
| 1024 | 256 | 0.273 | 0.120 | 32 | 2 |
| 2048 | 16 | 1.699 | 2.565 | 16 | 2 |
| 2048 | 32 | 1.148 | 1.498 | 32 | 4 |
| 2048 | 64 | 0.856 | 0.907 | 32 | 4 |
| 2048 | 128 | 0.749 | 0.613 | 32 | 4 |
| 2048 | 256 | 0.666 | 0.442 | 32 | 8 |
| 4096 | 16 | 11.921 | 17.662 | 32 | 8 |
| 4096 | 32 | 6.559 | 9.710 | 32 | 16 |
| 4096 | 64 | 4.101 | 5.549 | 16 | 2 |
| 4096 | 128 | 2.837 | 3.264 | 32 | 16 |
| 4096 | 256 | 2.136 | 2.066 | 16 | 4 |
Data obtained on VSC-2 with qlogic MPI:
^ Matrix size ^ Cores ^ Time SCA ^ Time ELPA ^ Blocksize SCA ^ Blocksize ELPA ^
| 512 | 16 | 0.101 | 0.097 | 16 | 4 |
| 512 | 32 | 0.096 | 0.077 | 16 | 2 |
| 512 | 64 | 0.090 | 0.066 | 8 | 4 |
| 512 | 128 | 0.109 | 0.058 | 16 | 4 |
| 512 | 256 | 0.126 | 0.054 | 4 | 4 |
| 1024 | 16 | 0.423 | 0.525 | 16 | 4 |
| 1024 | 32 | 0.312 | 0.341 | 16 | 4 |
| 1024 | 64 | 0.249 | 0.254 | 16 | 4 |
| 1024 | 128 | 0.266 | 0.189 | 8 | 4 |
| 1024 | 256 | 0.251 | 0.148 | 8 | 8 |
| 2048 | 16 | 2.448 | 3.264 | 32 | 4 |
| 2048 | 32 | 1.460 | 1.974 | 16 | 4 |
| 2048 | 64 | 0.987 | 1.173 | 16 | 16 |
| 2048 | 128 | 0.848 | 0.777 | 16 | 8 |
| 2048 | 256 | 0.671 | 0.545 | 4 | 4 |
| 4096 | 16 | 19.075 | 22.678 | 32 | 2 |
| 4096 | 32 | 10.114 | 12.827 | 32 | 8 |
| 4096 | 64 | 5.705 | 7.059 | 32 | 8 |
| 4096 | 128 | 3.463 | 4.288 | 16 | 16 |
| 4096 | 256 | 2.461 | 2.624 | 16 | 2 |
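One way to read the tables above is to look for the core count at which ELPA overtakes ScaLAPACK for a given matrix size. The sketch below does this for two of the data blocks, copied from the tables; ''elpa_crossover'' is a hypothetical helper written for this illustration.

```python
from typing import List, Optional, Tuple

Row = Tuple[int, float, float]  # (cores, time SCA, time ELPA)


def elpa_crossover(rows: List[Row]) -> Optional[int]:
    """Smallest core count at which ELPA is faster than ScaLAPACK, or None."""
    for cores, t_sca, t_elpa in sorted(rows):
        if t_elpa < t_sca:
            return cores
    return None


# Values copied from the tables above.
vsc1_2048 = [(16, 1.699, 2.565), (32, 1.148, 1.498), (64, 0.856, 0.907),
             (128, 0.749, 0.613), (256, 0.666, 0.442)]
vsc2_4096 = [(16, 19.075, 22.678), (32, 10.114, 12.827), (64, 5.705, 7.059),
             (128, 3.463, 4.288), (256, 2.461, 2.624)]

if __name__ == "__main__":
    print(elpa_crossover(vsc1_2048))  # 128
    print(elpa_crossover(vsc2_4096))  # None
```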