In this section the results for different performance tests are presented.

Libraries

The following MPI versions and libraries were used:

qlogic:

  • vsc1: qlogicmpi_intel-0.1.0 (compiled with Intel ifort)
  • vsc2: qlogicmpi_intel-3.0.1 (compiled with Intel ifort)

mvapich2:

  • vsc1: mvapich2_intel_qlc-1.6 (compiled with Intel ifort for qlogic)
  • vsc2: mvapich2_1.8a2_intel_limic (compiled with Intel ifort and the limic module)

impi:

  • vsc2: impi_4.0.1.007 (Intel MPI)

sca:

  • ScaLAPACK 2.0.0, compiled using Intel ifort with GotoBLAS2 and lapack-3.3.1

mkl:

  • vsc1: Intel MKL libraries, version 11.1/046
  • vsc2: Intel MKL libraries, version 2011_sp1.9.293

elpa:

ELPA was compiled using the sca build from above together with the mkl libraries; when only the mkl libraries were used, BLACS errors occurred.

Timings Blocksize 64

In a small test program, a random matrix of size N x N with N = 512, 1024, 2048, 4096 was set up and diagonalized using PZHEEVX from ScaLAPACK and solve_evp_complex_2stage from ELPA. The timings cover only the diagonalization step.

For each number of cores, all possible processor grids with rows and columns taken from 1, 2, 4, 8, 16, 32, 64 were run. The plotted data show only the lowest time obtained for each case.
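
The grid enumeration described above can be sketched as follows. This is only an illustration, not the original test program: the function names `process_grids` and `best_grid` are hypothetical, and the timings passed to `best_grid` are assumed to come from actual runs.

```python
# Allowed numbers of process rows / columns, as listed in the text.
ALLOWED = (1, 2, 4, 8, 16, 32, 64)


def process_grids(ncores, allowed=ALLOWED):
    """Return all (rows, cols) pairs from `allowed` with rows * cols == ncores."""
    return [(r, c) for r in allowed for c in allowed if r * c == ncores]


def best_grid(timings):
    """Given {(rows, cols): time}, return the (grid, time) pair with the lowest time."""
    grid = min(timings, key=timings.get)
    return grid, timings[grid]


if __name__ == "__main__":
    for ncores in (16, 32, 64, 128, 256):
        print(ncores, process_grids(ncores))
```

For 16 cores this yields the five grids 1x16, 2x8, 4x4, 8x2 and 16x1; for 256 cores the grids range from 4x64 to 64x4, because neither dimension may exceed 64.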

Absolute timings of the different subroutines:

Scaling of the runtimes relative to the calculation with 16 cores:

Timings Blocksize optimized

For qlogic MPI we also tested the influence of different blocksizes on VSC-1 and VSC-2. The runs were performed as above, but with blocksizes of 2, 4, 8, 16, 32 and 64. The data in the plots and tables represent the lowest timings obtained for a given matrix size and number of cores.
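
To illustrate what the blocksize controls, here is a minimal sketch of a one-dimensional block-cyclic distribution, assuming the standard layout used by ScaLAPACK with the first block on process 0. The function names are hypothetical; `local_count` follows the logic of ScaLAPACK's NUMROC for that special case.

```python
def owner_and_local_index(i, nb, nprocs):
    """Map global index i (0-based) to (owning process, local index)."""
    block = i // nb                      # global block that contains i
    owner = block % nprocs               # blocks are dealt out round-robin
    local = (block // nprocs) * nb + i % nb
    return owner, local


def local_count(n, nb, iproc, nprocs):
    """Number of the n global rows (or columns) stored on process iproc."""
    nblocks = n // nb                    # number of complete blocks
    count = (nblocks // nprocs) * nb     # full rounds every process receives
    extra = nblocks % nprocs             # leftover complete blocks
    if iproc < extra:
        count += nb                      # one additional complete block
    elif iproc == extra:
        count += n % nb                  # the trailing partial block, if any
    return count


if __name__ == "__main__":
    # N = 512 spread over 16 process rows, for two of the tested blocksizes.
    for nb in (64, 4):
        print(nb, [local_count(512, nb, p, 16) for p in range(16)])
```

With N = 512 and 16 process rows, a blocksize of 64 gives only 8 blocks, so half of the process rows hold no data at all, whereas a blocksize of 4 gives every process row 32 rows.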

#Data obtained from VSC-1 with qlogic MPI:


cores       time         blocksize
        SCA      ELPA    SCA   ELPA
------------------------------------
Matrix Size 512:
  16    0.081    0.072    16     8 
  32    0.087    0.059    32     4 
  64    0.085    0.049    32     2 
 128    0.093    0.043     4     4 
 256    0.114    0.040    32     8 
------------------------------------
Matrix Size 1024:
  16    0.320    0.402    16     2 
  32    0.274    0.263    32     2 
  64    0.245    0.187    32     4 
 128    0.249    0.153    32     8 
 256    0.273    0.120    32     2 
------------------------------------
Matrix Size 2048:
  16    1.699    2.565    16     2 
  32    1.148    1.498    32     4 
  64    0.856    0.907    32     4 
 128    0.749    0.613    32     4 
 256    0.666    0.442    32     8 
------------------------------------
Matrix Size 4096:
  16   11.921   17.662    32     8 
  32    6.559    9.710    32    16 
  64    4.101    5.549    16     2 
 128    2.837    3.264    32    16 
 256    2.136    2.066    16     4 
#Data obtained from VSC-2 with qlogic MPI:


cores       time         blocksize
        SCA      ELPA    SCA   ELPA
------------------------------------
Matrix Size 512:  
  16    0.101    0.097    16     4 
  32    0.096    0.077    16     2 
  64    0.090    0.066     8     4 
 128    0.109    0.058    16     4 
 256    0.126    0.054     4     4 
------------------------------------
Matrix Size 1024:
  16    0.423    0.525    16     4 
  32    0.312    0.341    16     4 
  64    0.249    0.254    16     4 
 128    0.266    0.189     8     4 
 256    0.251    0.148     8     8
------------------------------------
Matrix Size 2048:
  16    2.448    3.264    32     4 
  32    1.460    1.974    16     4 
  64    0.987    1.173    16    16 
 128    0.848    0.777    16     8 
 256    0.671    0.545     4     4 
------------------------------------
Matrix Size 4096:
  16   19.075   22.678    32     2 
  32   10.114   12.827    32     8 
  64    5.705    7.059    32     8 
 128    3.463    4.288    16    16 
 256    2.461    2.624    16     2 
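
The scaling relative to the 16-core run can be recomputed from the tables. The following sketch uses the blocksize-optimized VSC-1 timings for matrix size 4096 copied from the first table above; the function name `relative_scaling` and the efficiency measure (speedup divided by the increase in core count) are illustrative choices, not part of the original evaluation.

```python
# Timings copied from the VSC-1 table above (matrix size 4096).
CORES = [16, 32, 64, 128, 256]
SCA = [11.921, 6.559, 4.101, 2.837, 2.136]
ELPA = [17.662, 9.710, 5.549, 3.264, 2.066]


def relative_scaling(cores, times, base_cores=16):
    """Return {cores: (speedup, efficiency)} relative to the base_cores run."""
    base_time = times[cores.index(base_cores)]
    result = {}
    for p, t in zip(cores, times):
        speedup = base_time / t
        efficiency = speedup / (p / base_cores)   # 1.0 would be ideal scaling
        result[p] = (speedup, efficiency)
    return result


if __name__ == "__main__":
    for name, times in (("SCA", SCA), ("ELPA", ELPA)):
        for p, (s, e) in relative_scaling(CORES, times).items():
            print(f"{name:4s} {p:4d} cores: speedup {s:5.2f}, efficiency {e:4.2f}")
```

With these numbers, going from 16 to 256 cores gives a speedup of about 5.6 for SCA and about 8.5 for ELPA, compared with an ideal value of 16.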
  • Last modified: 2024/10/24 10:28