PAPI
Performance Application Programming Interface
Synopsis
PAPI (Performance Application Programming Interface) has been developed at the University of Tennessee's Innovative Computing Laboratory in the Computer Science Department. PAPI is an event-based profiling library providing access to hardware performance counters of critical events, e.g., cache misses, number of cycles, number of floating point instructions (flpins), or floating point operations per second (FLOPS). These events can be monitored for selected sections of the code, allowing for an analysis of how efficiently the code maps onto the underlying hardware architecture.
Types of events
Native events are those countable by the CPU itself; they can be accessed directly or through the PAPI low-level interface and are delivered as a CPU-specific bit pattern.
Preset or predefined events are software abstractions of architecture-dependent native events that are accessible through the PAPI interface. The header file papiStdEventDefs.h includes a collection of about 100 preset events covering, e.g., the memory hierarchy, cache coherence protocol events, cycle and instruction counts, functional unit and pipeline status.
However, due to differences in hardware implementations, it is sometimes only possible to count similar (not identical) types of events on different platforms, which implies that a direct comparison of particular PAPI event counts across platforms is not always meaningful.
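Whether a given preset event is actually countable on the CPU at hand can be checked at run time with PAPI_query_event(). A minimal sketch in C (the choice of PAPI_FP_OPS is only an example):

<code c>
#include <stdio.h>
#include <papi.h>

int main(void)
{
    /* initialize the PAPI library */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI_library_init failed\n");
        return 1;
    }

    /* check whether the preset event maps onto the hardware counters of this CPU */
    if (PAPI_query_event(PAPI_FP_OPS) == PAPI_OK)
        printf("PAPI_FP_OPS is available on this platform\n");
    else
        printf("PAPI_FP_OPS is NOT available on this platform\n");

    return 0;
}
</code>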
Usage of PAPI
The user will have to modify the source code and insert PAPI calls (see below: Interfacing with PAPI). Invocation and usage simply require loading the PAPI module,
module purge
module load papi/5.4.3
and compiling the user's code while linking against the PAPI library (-lpapi), i.e., for C users,
gcc my_program.c -lpapi
./a.out
or for Fortran users,
gfortran my_program.f -I/opt/sw/x86_64/glibc-2.12/ivybridge-ep/papi/5.4.3/gnu-4.4.7/include -lpapi
./a.out
Interfacing with PAPI : Low level interface
In general, a code section to be analyzed with PAPI needs to be wrapped in a sequence of standard PAPI calls. Code examples can be found here for
- C users and
- Fortran users.
As stated in the comment of the C code above, it is best to analyze one particular event at a time. This advice is given because the CPU can only combine a limited selection of hardware counters simultaneously.
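For orientation, a minimal low-level sketch in C might look as follows (the loop merely stands in for the code section of interest, and only one preset event, PAPI_TOT_CYC, is counted here):

<code c>
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void)
{
    int event_set = PAPI_NULL;
    long long count;
    double s = 0.0;
    int i;

    /* initialize the library and create an empty event set */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);
    if (PAPI_create_eventset(&event_set) != PAPI_OK)
        exit(1);

    /* add one preset event at a time (see the remark above) */
    if (PAPI_add_event(event_set, PAPI_TOT_CYC) != PAPI_OK)
        exit(1);

    /* start counting, run the code section of interest, stop counting */
    if (PAPI_start(event_set) != PAPI_OK)
        exit(1);
    for (i = 0; i < 10000000; i++)
        s += (double)i * 0.5;
    if (PAPI_stop(event_set, &count) != PAPI_OK)
        exit(1);

    printf("result = %f, PAPI_TOT_CYC = %lld\n", s, count);
    return 0;
}
</code>

Compile and run as shown above, e.g., gcc my_program.c -lpapi.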
Interfacing with PAPI : High level interface
The high-level API combines the counters for a specified list of PAPI preset events only. The set of implemented high-level functions is quite limited and can be found in the section The High Level API near the end of papi.h. The high-level API can also be used in conjunction with the low-level API. An example of the usage of the high-level API can be found here for C users.
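As an illustration, a minimal high-level sketch in C using PAPI_flops(), one of the high-level calls of the 5.x API (the loop is only a placeholder for the code section of interest):

<code c>
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void)
{
    float rtime, ptime, mflops;
    long long flpops;
    double s = 0.0;
    int i;

    /* the first call implicitly initializes PAPI and starts the counters */
    if (PAPI_flops(&rtime, &ptime, &flpops, &mflops) != PAPI_OK)
        exit(1);

    /* code section of interest */
    for (i = 0; i < 10000000; i++)
        s += (double)i * 0.5;

    /* the second call reports values accumulated since the first call */
    if (PAPI_flops(&rtime, &ptime, &flpops, &mflops) != PAPI_OK)
        exit(1);

    printf("result = %f\n", s);
    printf("real time = %.3f s, floating point operations = %lld, MFLOPS = %.1f\n",
           rtime, flpops, mflops);
    return 0;
}
</code>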
Practical tips:
- A quick overview of supported events and corresponding PAPI variables for a particular type of CPU is obtained by executing the command papi_avail.
- Measuring the specific event PAPI_TOT_CYC can differ significantly from the result obtained by calling PAPI_get_real_cyc(). This is particularly true for the PAPI analysis of very small code sections that are executed frequently (e.g., hotspot functions/routines that were ranked high during time-based profiling). Although far from accurate in absolute terms, PAPI_TOT_CYC remains a useful reference time for relative comparisons.
- Evaluating floating point performance on Intel Ivy Bridge: https://icl.cs.utk.edu/projects/papi/wiki/PAPITopics:SandyFlops
- Useful notes on Intel's CPI metric: https://software.intel.com/en-us/node/544403
- Occasionally, it is useful to PAPI-analyze an application in two steps: in the first step, selected characteristic events of the outermost code region are collected; in the second step, a set of subroutines/functions consuming major fractions of the execution time is analyzed with respect to the same events. In other words, we first obtain overall counts for main() of some application, e.g., PAPI_TOT_CYC, PAPI_FP_OPS, PAPI_L1_DCM and PAPI_L2_DCM. Subsequently, the analogous set of event counters is measured for suspicious subroutines or functions; relative fractions of these event counters can also be useful. Based on these measures, the performance characteristics of the subroutines or functions can be compared to those of the initial evaluation of the complete code. In this way, the parts that best match the overall performance (w.r.t. cache misses, flops, etc.) can be identified (see the sketch below this list).
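A sketch of how such a two-step measurement could be organized in C with the low-level API, counting one preset event at a time as recommended above (suspicious_kernel() is just a placeholder for a routine of the application):

<code c>
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

/* placeholder for a subroutine/function suspected to dominate the run time */
static double suspicious_kernel(int n)
{
    double s = 0.0;
    int i;
    for (i = 0; i < n; i++)
        s += (double)i * 0.5;
    return s;
}

/* count a single preset event around one invocation of the kernel */
static long long count_event(int event, int n)
{
    int event_set = PAPI_NULL;
    long long value;

    if (PAPI_create_eventset(&event_set) != PAPI_OK ||
        PAPI_add_event(event_set, event) != PAPI_OK ||
        PAPI_start(event_set) != PAPI_OK)
        exit(1);

    suspicious_kernel(n);

    if (PAPI_stop(event_set, &value) != PAPI_OK)
        exit(1);
    PAPI_cleanup_eventset(event_set);
    PAPI_destroy_eventset(&event_set);
    return value;
}

int main(void)
{
    /* the same event list as used for the overall run of the application */
    int events[4] = { PAPI_TOT_CYC, PAPI_FP_OPS, PAPI_L1_DCM, PAPI_L2_DCM };
    const char *names[4] = { "PAPI_TOT_CYC", "PAPI_FP_OPS",
                             "PAPI_L1_DCM", "PAPI_L2_DCM" };
    int i;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);

    /* one event at a time, to avoid restrictions on combining hardware counters */
    for (i = 0; i < 4; i++)
        printf("%-12s %lld\n", names[i], count_event(events[i], 10000000));

    return 0;
}
</code>

The per-routine counts obtained in this way can then be related to the overall counts of the application to see which routines dominate, e.g., the cache misses or the floating point work.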