mmult3_c.exe - Performance Report

Summary: mmult3_c.exe is CPU-bound in this configuration

CPU

79.7%

Time spent running application code. High values are usually good.

This is high; check the CPU performance section for optimization advice.

MPI

19.8%

Time spent in MPI calls. High values are usually bad.

This is low; this code may benefit from increasing the process count.

I/O

0.5%

Time spent in filesystem I/O. High values are usually bad.

This is very low; however, single-process I/O often causes large MPI wait times.

This application run was CPU-bound. A breakdown of this time and advice for investigating further is in the CPU section below.

As little time is spent in MPI calls, this code may also benefit from running at larger scales.
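To put the summary percentages in absolute terms, the following minimal sketch converts them into approximate wall-clock seconds. It uses the 65-second total time reported in the run details; that figure is rounded to whole seconds, so the results are estimates only. The function name `seconds_per_category` is illustrative and not part of the report tool.

```python
# Convert the summary percentages into approximate wall-clock seconds.
# TOTAL_SECONDS is the rounded "Total time" from the run details, so the
# results below are estimates, not measurements.
TOTAL_SECONDS = 65.0
SHARES_PERCENT = {"CPU": 79.7, "MPI": 19.8, "I/O": 0.5}


def seconds_per_category(total_seconds, shares_percent):
    """Return the approximate seconds spent in each category."""
    return {name: total_seconds * pct / 100.0
            for name, pct in shares_percent.items()}


breakdown = seconds_per_category(TOTAL_SECONDS, SHARES_PERCENT)
for name, secs in breakdown.items():
    print(f"{name}: about {secs:.1f} s")
```

On these numbers, roughly 52 of the 65 seconds are application code, about 13 seconds are MPI calls, and well under one second is filesystem I/O.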

CPU

A breakdown of the 79.7% CPU time:

Scalar numeric ops	16.0%
Vector numeric ops	10.2%
Memory accesses	73.8%

The per-core performance is memory-bound. Use a profiler to identify time-consuming loops and check their cache performance.

Little time is spent in vectorized instructions. Check the compiler's vectorization advice to see why key loops could not be vectorized.
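One common way to improve cache behavior in a memory-bound matrix multiply is loop tiling (blocking). The sketch below is not taken from the profiled program, whose source is not shown in this report; it only illustrates the loop restructuring, in Python for readability. The function names and the tile size `block` are hypothetical, and any real speedup would have to be measured in the application's own language with a tuned tile size.

```python
def matmul_naive(a, b, n):
    """Reference triple loop: c[i][j] = sum over k of a[i][k] * b[k][j]."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for k in range(n):
                total += a[i][k] * b[k][j]
            c[i][j] = total
    return c


def matmul_blocked(a, b, n, block=4):
    """Same product computed tile by tile.

    Each (block x block) tile of a, b and c is reused several times
    before moving on, which is the property that helps cache reuse in
    a compiled implementation. `block` is an illustrative value.
    """
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                # min(...) handles sizes that are not a multiple of block.
                for i in range(ii, min(ii + block, n)):
                    row_c = c[i]
                    for k in range(kk, min(kk + block, n)):
                        a_ik = a[i][k]
                        row_b = b[k]
                        # Innermost loop walks rows of b and c contiguously.
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += a_ik * row_b[j]
    return c
```

The innermost loop runs over `j` so that rows of `b` and `c` are read and written contiguously, which is also the access pattern compilers tend to vectorize most readily.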

MPI

A breakdown of the 19.8% MPI time:

Time in collective calls	89.2%
Time in point-to-point calls	10.8%
Effective process collective rate	0.00e+00
Effective process point-to-point rate	1.08e+09

Most of the time is spent in collective calls with a very low transfer rate. This suggests load imbalance is causing synchronization overhead; use an MPI profiler to investigate further.
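The link between load imbalance and time in collectives can be seen with a simple model: at a synchronizing collective, every rank waits until the slowest rank arrives. The sketch below assumes that model and uses hypothetical per-rank compute times, not measurements from this run; the function names are illustrative.

```python
def collective_wait_times(compute_seconds):
    """Idle time of each rank at a synchronizing collective.

    Simplified model: every rank waits for the slowest one, so a rank's
    wait is the slowest rank's compute time minus its own.
    """
    slowest = max(compute_seconds)
    return [slowest - t for t in compute_seconds]


def imbalance_fraction(compute_seconds):
    """Fraction of total rank-time spent waiting rather than computing."""
    waits = collective_wait_times(compute_seconds)
    slowest = max(compute_seconds)
    return sum(waits) / (slowest * len(compute_seconds))


# Hypothetical compute times in seconds for 6 ranks (not measured data):
# five ranks finish in 10 s, one rank needs 16 s.
example = [10.0, 10.0, 10.0, 10.0, 10.0, 16.0]
waits = collective_wait_times(example)
lost = imbalance_fraction(example)
print(waits, f"{lost:.1%} of rank-time spent waiting")
```

In this made-up case a single slow rank makes the other five wait 6 seconds each, and that waiting is accounted as time inside the collective call even though almost no data is being moved.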

I/O

A breakdown of the 0.5% I/O time:

Time in reads	0.0%
Time in writes	100.0%
Effective process read rate	0.00e+00
Effective process write rate	1.22e+08

Most of the time is spent in write operations with an average effective transfer rate. It may be possible to achieve faster effective transfer rates using asynchronous file operations.
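As an illustration of the idea behind asynchronous file operations, the sketch below overlaps each write with the next compute step by handing the write to a single background thread. It is a generic pattern under stated assumptions, not the profiled program's I/O code: `compute_step`, `write_block`, `run` and the file naming are all hypothetical. In an MPI application the analogous tools would be nonblocking MPI-IO calls such as `MPI_File_iwrite`.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def write_block(path, data):
    """Blocking write; runs on the background thread."""
    with open(path, "wb") as f:
        f.write(data)
    return len(data)


def compute_step(step):
    """Stand-in for one compute phase (hypothetical workload)."""
    return sum(i * i for i in range(1000)) + step


def run(steps, out_dir):
    """Compute `steps` results, writing each one while the next is computed."""
    results = []
    pending = None  # Future for the write currently in flight, if any.
    with ThreadPoolExecutor(max_workers=1) as pool:
        for step in range(steps):
            value = compute_step(step)  # overlaps with the previous write
            results.append(value)
            if pending is not None:
                pending.result()  # wait for the previous write; re-raises errors
            path = os.path.join(out_dir, f"step{step}.bin")
            pending = pool.submit(write_block, path, str(value).encode())
        if pending is not None:
            pending.result()  # make sure the final write has completed
    return results
```

Only one write is kept in flight at a time, which bounds the extra buffer memory and keeps the output files in step order. With only 0.5% of the run time spent in I/O here, the potential gain for this particular run is small.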

Threads

A breakdown of how multiple threads were used:

Computation	0.0%
Synchronization	0.0%
Physical core utilization	49.7%
Involuntary context switches per second	248.4

No measurable time is spent in multithreaded code.
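The physical core utilization figure is close to what the run configuration alone would predict. The sketch below assumes one single-threaded process per physical core, which matches the `--bind-to-core` option and the absence of multithreaded time, and uses the 6 processes and 12 physical cores from the run details. The function name is illustrative.

```python
def expected_core_utilization(processes, threads_per_process, physical_cores):
    """Percentage of physical cores that can be busy at once.

    Assumes each thread occupies its own physical core until the cores
    run out; hyperthreads are ignored.
    """
    busy = min(processes * threads_per_process, physical_cores)
    return 100.0 * busy / physical_cores


# 6 single-threaded MPI processes on a node with 12 physical cores.
expected = expected_core_utilization(6, 1, 12)
print(f"expected about {expected:.1f}% (report shows 49.7%)")
```

Under that assumption, half of the node's physical cores are idle for the whole run, so the node has room for more processes or for threads within each process.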

Memory

Per-process memory usage may also affect scaling:

Mean process memory usage	2.76e+08
Peak process memory usage	5.55e+08
Peak node memory usage	15.0%

There is significant variation between peak and mean memory usage. This may be a sign of workload imbalance or a memory leak.

The peak node memory usage is very low. Running with fewer MPI processes and more data on each process may be more efficient.
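A rough sanity check on the memory figures is to compare them with the size of one matrix. The sketch below rests on three assumptions that the report does not confirm: that the memory values are in bytes, that the command-line argument 4608 is the matrix dimension, and that elements are 8-byte double-precision values.

```python
# All three constants below encode assumptions, not facts from the report.
N = 4608                 # assumed: command-line argument is the matrix dimension
BYTES_PER_ELEMENT = 8    # assumed: double-precision elements

MEAN_PROCESS_MEMORY = 2.76e8   # from the report, units assumed to be bytes
PEAK_PROCESS_MEMORY = 5.55e8   # from the report, units assumed to be bytes

matrix_bytes = N * N * BYTES_PER_ELEMENT
peak_to_mean = PEAK_PROCESS_MEMORY / MEAN_PROCESS_MEMORY
matrices_at_peak = PEAK_PROCESS_MEMORY / matrix_bytes
matrices_at_mean = MEAN_PROCESS_MEMORY / matrix_bytes

print(f"one matrix: {matrix_bytes:.3e} bytes")
print(f"peak/mean ratio: {peak_to_mean:.2f}")
print(f"peak = {matrices_at_peak:.1f} matrices, mean = {matrices_at_mean:.1f} matrices")
```

If those assumptions hold, the peak is a little over three full matrices and the mean is well under two. That pattern could arise if one process holds complete copies of the matrices while the others hold only parts, which would be a distribution choice rather than a leak; confirming this would require inspecting the source or a per-process memory profile.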

Command:	mpirun -n 6 --bind-to-core ./mmult3_c.exe 4608
Resources:	1 node (12 physical, 24 logical cores per node)
Tasks:	6 processes
Machine:	mic2
Start time:	Fri Feb 20 21:27:25 2015
Total time:	65 seconds (1 minute)
Full path:	/scratch/allinea/mmult/3_fix
Input file:
Notes: