===== Performance Tests =====

This section presents the results of different performance tests.

==== Matrix diagonalisation ====

=== Libraries ===

The following MPI versions and libraries were used:

**qlogic:**
  * vsc1: qlogicmpi_intel-0.1.0 (compiled with Intel ifort)
  * vsc2: qlogicmpi_intel-3.0.1 (compiled with Intel ifort)

**mvapich2:**
  * vsc1: mvapich2_intel_qlc-1.6 (compiled with Intel ifort for qlogic)
  * vsc2: mvapich2_1.8a2_intel_limic (compiled with ifort and the limic module)

**impi:**
  * vsc2: impi_4.0.1.007 (Intel MPI)

**sca:**
  * scalapack 2.0.0, compiled using Intel ifort with GotoBLAS2 and lapack-3.3.1

**mkl:**
  * vsc1: Intel mkl libraries, version 11.1/046
  * vsc2: Intel mkl libraries, version 2011_sp1.9.293

**elpa:**
  * ELPA was compiled using **sca** from above plus the mkl libraries; when only the mkl libraries were used, BLACS errors occurred.

=== Timings Blocksize 64 ===

In a small test program, a random matrix of size N x N with N = 512, 1024, 2048, 4096 was set up and diagonalized using PZHEEVX from ScaLAPACK and solve_evp_complex_2stage from ELPA. The timings cover only the diagonalization part. For each number of cores, all possible processor row/column combinations with rows, columns = 1, 2, 4, 8, 16, 32, 64 were calculated. The plots show only the lowest time obtained for each number of cores.

Absolute timings of the different subroutines:

{{:doku:sca:performance_sca_absolute_512.ps.jpg?300|}}
{{:doku:sca:performance_sca_absolute_1024.ps.jpg?300|}}
{{:doku:sca:performance_sca_absolute_2048.ps.jpg?300|}}
{{:doku:sca:performance_sca_absolute_4096.ps.jpg?300|}}

Scaling of the runtimes relative to the calculation with 16 cores:

{{:doku:sca:performance_sca_scaling_512.ps.jpg?300|}}
{{:doku:sca:performance_sca_scaling_1024.ps.jpg?300|}}
{{:doku:sca:performance_sca_scaling_2048.ps.jpg?300|}}
{{:doku:sca:performance_sca_scaling_4096.ps.jpg?300|}}

=== Timings Blocksize optimized ===

For qlogic MPI we also tested the influence of different blocksizes on VSC-1 and VSC-2.
The runs were performed as above, but the calculations were repeated for blocksizes = 2, 4, 8, 16, 32, 64. The data in the plots and the tables represent the lowest timings obtained for a given matrix size and number of cores, together with the blocksize that produced them.

{{:doku:sca:optimized_blocksize_512.ps.jpg?300|}}
{{:doku:sca:optimized_blocksize_1024.ps.jpg?300|}}
{{:doku:sca:optimized_blocksize_2048.ps.jpg?300|}}
{{:doku:sca:optimized_blocksize_4096.ps.jpg?300|}}

Data obtained from VSC-1 with qlogic MPI:

^ Matrix size ^ Cores ^ Time SCA ^ Time ELPA ^ Blocksize SCA ^ Blocksize ELPA ^
| 512 | 16 | 0.081 | 0.072 | 16 | 8 |
| 512 | 32 | 0.087 | 0.059 | 32 | 4 |
| 512 | 64 | 0.085 | 0.049 | 32 | 2 |
| 512 | 128 | 0.093 | 0.043 | 4 | 4 |
| 512 | 256 | 0.114 | 0.040 | 32 | 8 |
| 1024 | 16 | 0.320 | 0.402 | 16 | 2 |
| 1024 | 32 | 0.274 | 0.263 | 32 | 2 |
| 1024 | 64 | 0.245 | 0.187 | 32 | 4 |
| 1024 | 128 | 0.249 | 0.153 | 32 | 8 |
| 1024 | 256 | 0.273 | 0.120 | 32 | 2 |
| 2048 | 16 | 1.699 | 2.565 | 16 | 2 |
| 2048 | 32 | 1.148 | 1.498 | 32 | 4 |
| 2048 | 64 | 0.856 | 0.907 | 32 | 4 |
| 2048 | 128 | 0.749 | 0.613 | 32 | 4 |
| 2048 | 256 | 0.666 | 0.442 | 32 | 8 |
| 4096 | 16 | 11.921 | 17.662 | 32 | 8 |
| 4096 | 32 | 6.559 | 9.710 | 32 | 16 |
| 4096 | 64 | 4.101 | 5.549 | 16 | 2 |
| 4096 | 128 | 2.837 | 3.264 | 32 | 16 |
| 4096 | 256 | 2.136 | 2.066 | 16 | 4 |

Data obtained from VSC-2 with qlogic MPI:

^ Matrix size ^ Cores ^ Time SCA ^ Time ELPA ^ Blocksize SCA ^ Blocksize ELPA ^
| 512 | 16 | 0.101 | 0.097 | 16 | 4 |
| 512 | 32 | 0.096 | 0.077 | 16 | 2 |
| 512 | 64 | 0.090 | 0.066 | 8 | 4 |
| 512 | 128 | 0.109 | 0.058 | 16 | 4 |
| 512 | 256 | 0.126 | 0.054 | 4 | 4 |
| 1024 | 16 | 0.423 | 0.525 | 16 | 4 |
| 1024 | 32 | 0.312 | 0.341 | 16 | 4 |
| 1024 | 64 | 0.249 | 0.254 | 16 | 4 |
| 1024 | 128 | 0.266 | 0.189 | 8 | 4 |
| 1024 | 256 | 0.251 | 0.148 | 8 | 8 |
| 2048 | 16 | 2.448 | 3.264 | 32 | 4 |
| 2048 | 32 | 1.460 | 1.974 | 16 | 4 |
| 2048 | 64 | 0.987 | 1.173 | 16 | 16 |
| 2048 | 128 | 0.848 | 0.777 | 16 | 8 |
| 2048 | 256 | 0.671 | 0.545 | 4 | 4 |
| 4096 | 16 | 19.075 | 22.678 | 32 | 2 |
| 4096 | 32 | 10.114 | 12.827 | 32 | 8 |
| 4096 | 64 | 5.705 | 7.059 | 32 | 8 |
| 4096 | 128 | 3.463 | 4.288 | 16 | 16 |
| 4096 | 256 | 2.461 | 2.624 | 16 | 2 |
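
The scaling relative to 16 cores can be recomputed from the tabulated times. The following Python sketch is only an illustration and not part of the original test program. It assumes that scaling means the ratio of the 16-core time to the time on p cores, and that efficiency is this ratio divided by p/16. The helper name ''relative_scaling'' is hypothetical, and the input values are the VSC-1 timings for matrix size 4096 listed above.

```python
# Timings copied from the VSC-1 results (matrix size 4096, qlogic MPI).
# Keys are core counts, values are the lowest measured times.
SCA_4096 = {16: 11.921, 32: 6.559, 64: 4.101, 128: 2.837, 256: 2.136}
ELPA_4096 = {16: 17.662, 32: 9.710, 64: 5.549, 128: 3.264, 256: 2.066}

BASE_CORES = 16  # reference core count used for the scaling plots


def relative_scaling(times, base=BASE_CORES):
    """Return {cores: (speedup, efficiency)} relative to the `base` run.

    Assumed definitions (not stated in the original text):
      speedup    = t(base) / t(cores)
      efficiency = speedup / (cores / base)
    """
    t_base = times[base]
    result = {}
    for cores, t in sorted(times.items()):
        speedup = t_base / t
        efficiency = speedup / (cores / base)
        result[cores] = (speedup, efficiency)
    return result


if __name__ == "__main__":
    for name, times in (("SCA", SCA_4096), ("ELPA", ELPA_4096)):
        for cores, (s, e) in relative_scaling(times).items():
            print(f"{name:4s} {cores:4d} cores: speedup {s:5.2f}, efficiency {e:4.2f}")
```

Calling ''relative_scaling'' with the timings of any other matrix size or cluster gives the corresponding values, which makes it easy to compare how far each library is from ideal scaling at a given core count.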