This version is outdated by a newer approved version. This version (2016/06/16 13:46) is a draft.
Approvals: 0/1
VSC-3: papi version 5.4.3
Synopsis:
papi is an event-based profiling library that reads the hardware performance counters of the CPU and can thus provide useful information about critical events, e.g. cache misses, number of FLOPs, number of cycles, etc.
Usage of papi:
The user will have to modify the source code and insert papi
calls (see below). Invocation and usage is then as simple as,
    module purge
    module load papi/5.4.3
    gcc my_program.c -lpapi
    ./a.out
or for Fortran users,
    module purge
    module load papi/5.4.3
    gfortran my_program.f -I/opt/sw/x86_64/glibc-2.12/ivybridge-ep/papi/5.4.3/gnu-4.4.7/include -lpapi
    ./a.out
Interfacing with papi:
In general, some code section to be analyzed with papi
needs to be wrapped into a sequence of standard papi
calls, e.g.
    #include <stdio.h>    // printf
    #include <stdlib.h>   // exit
    #include "papi.h"

    // PAPI variables
    // best is to analyze one particular event at a time
    int eventset;
    long long value, time0, time1, cyc0, cyc1;

    // PAPI Initialization
    eventset = PAPI_NULL;
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        printf("PAPI init error !\n");
        exit(993);
    }

    // PAPI Event Set Creation
    if (PAPI_create_eventset(&eventset) != PAPI_OK) {
        printf("PAPI event set creation error !\n");
        exit(994);
    }

    // PAPI Specify a Particular Target Event to Analyze
    //   PAPI_TOT_CYC   Total cycles executed
    //   PAPI_FP_OPS    Floating point operations executed
    //   PAPI_L1_DCM    Level 1 data cache misses
    //   PAPI_L2_DCM    Level 2 data cache misses
    // for other events see /opt/sw/x86_64/glibc-2.12/ivybridge-ep/papi/5.4.3/gnu-4.4.7/include/papiStdEventDefs.h
    //
    if (PAPI_add_event(eventset, PAPI_FP_OPS) != PAPI_OK) {
        printf("PAPI event set adding error !\n");
        exit(995);
    }

    // PAPI Time Estimators Initialization
    time0 = PAPI_get_real_usec();
    cyc0 = PAPI_get_real_cyc();

    // PAPI Counting Start
    if (PAPI_start(eventset) != PAPI_OK) {
        printf("PAPI start error !\n");
        exit(996);
    }

    //*** Here follows the original code section to be analyzed ***

    // PAPI Counting Stop
    if (PAPI_stop(eventset, &value) != PAPI_OK) {
        printf("PAPI stop error !\n");
        exit(997);
    }

    // PAPI Time Estimators Stop
    time1 = PAPI_get_real_usec();
    cyc1 = PAPI_get_real_cyc();

    // PAPI Results
    printf("PAPI event count %lld\n", value);
    printf("PAPI time passed in usec %lld\n", time1 - time0);
    printf("PAPI cycles passed %lld\n", cyc1 - cyc0);

    // PAPI Free Event Set
    if (PAPI_cleanup_eventset(eventset) != PAPI_OK) {
        printf("PAPI event set cleanup error !\n");
        exit(998);
    }
    if (PAPI_destroy_eventset(&eventset) != PAPI_OK) {
        printf("PAPI event set destruction error !\n");
        exit(999);
    }

    // PAPI Finalize
    PAPI_shutdown();
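The three numbers printed at the end of the sequence above (event count, elapsed microseconds, elapsed cycles) are often easier to interpret as derived rates. The following is a minimal sketch, assuming the event added was PAPI_FP_OPS so that the count can be read as floating point operations. The helper names derived_mflops, derived_mhz and derived_flops_per_cycle are hypothetical and not part of papi; they use plain arithmetic only and make no papi calls.

```c
#include <stdio.h>

/* Hypothetical helpers (not part of papi): plain arithmetic on the
   values printed by the instrumented section. */

/* Operations per microsecond is numerically equal to MFLOP/s.
   Returns 0.0 if no time has elapsed. */
double derived_mflops(long long fp_ops, long long usec)
{
    if (usec <= 0) return 0.0;
    return (double)fp_ops / (double)usec;
}

/* Real cycles per microsecond, i.e. the average clock rate in MHz
   seen over the measured section. Returns 0.0 if no time has elapsed. */
double derived_mhz(long long cycles, long long usec)
{
    if (usec <= 0) return 0.0;
    return (double)cycles / (double)usec;
}

/* Floating point operations per elapsed cycle.
   Returns 0.0 if no cycles have elapsed. */
double derived_flops_per_cycle(long long fp_ops, long long cycles)
{
    if (cycles <= 0) return 0.0;
    return (double)fp_ops / (double)cycles;
}

/* Print a short summary. The arguments correspond to
   value, time1 - time0 and cyc1 - cyc0 in the sequence above. */
void derived_report(long long value, long long usec, long long cycles)
{
    printf("MFLOP/s          %.2f\n", derived_mflops(value, usec));
    printf("average MHz      %.2f\n", derived_mhz(cycles, usec));
    printf("FLOPs per cycle  %.3f\n", derived_flops_per_cycle(value, cycles));
}
```

A call such as derived_report(value, time1 - time0, cyc1 - cyc0) could be placed right after the three printf lines of the results step. For an event other than PAPI_FP_OPS the first and third figures become "events per microsecond" and "events per cycle" respectively.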
Practical tips:
- A quick overview of supported events and corresponding papi variables for a particular type of CPU is obtained by executing the command papi_avail.
- Measuring the specific event PAPI_TOT_CYC can give a result that differs significantly from the one obtained by calling PAPI_get_real_cyc(). This is particularly true for papi analysis of very small code sections that are executed frequently (e.g. hotspot functions/routines that were ranked high during time-based profiling). Although off in absolute terms, PAPI_TOT_CYC remains a useful reference time for relative comparisons.
- Evaluating floating point performance on Intel Ivy Bridge: https://icl.cs.utk.edu/projects/papi/wiki/PAPITopics:SandyFlops
- Useful notes on Intel's CPI metric: https://software.intel.com/en-us/node/544403
- Occasionally it is useful to papi-analyze an application in two steps: first the outermost code region with a selection of characteristic events, then in similar fashion a set of subroutines/functions that consume major fractions of the execution time. Say, for example, we obtain overall counts for the main() part of some application, e.g. PAPI_TOT_CYC, PAPI_FP_OPS, PAPI_L1_DCM and PAPI_L2_DCM. It is then quite useful to determine the analogous set of event counters for suspicious subroutines/functions, look into the relative fractions of these event counts, and identify those which best match the initial (i.e. overall) evaluation. In so doing, specific subroutines/functions can be detected that determine overall performance with respect to cache misses, flops, etc.
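The two-step comparison described in the last tip can be sketched as a small bookkeeping helper. This is only an illustration under stated assumptions: the counts are taken to have been collected beforehand in separate papi runs (one event at a time, as recommended above), and the type region_counts as well as the functions event_fraction, dominant_region and print_fractions are hypothetical names, not part of papi.

```c
#include <stdio.h>

#define N_EVENTS 4

/* Hypothetical container (not part of papi): event counts for one
   code region, in the order
   PAPI_TOT_CYC, PAPI_FP_OPS, PAPI_L1_DCM, PAPI_L2_DCM. */
typedef struct {
    const char *name;
    long long count[N_EVENTS];
} region_counts;

/* Fraction of the overall count of event e that falls into region r.
   Returns 0.0 if the overall count is not positive. */
double event_fraction(const region_counts *r,
                      const region_counts *overall, int e)
{
    if (overall->count[e] <= 0) return 0.0;
    return (double)r->count[e] / (double)overall->count[e];
}

/* Index of the region holding the largest share of event e,
   or -1 if no regions are given. */
int dominant_region(const region_counts *regions, int n,
                    const region_counts *overall, int e)
{
    int i, best = -1;
    double best_frac = -1.0;
    for (i = 0; i < n; i++) {
        double f = event_fraction(&regions[i], overall, e);
        if (f > best_frac) {
            best_frac = f;
            best = i;
        }
    }
    return best;
}

/* Print, for every region, its share of each overall event count. */
void print_fractions(const region_counts *regions, int n,
                     const region_counts *overall)
{
    static const char *event_name[N_EVENTS] = {
        "PAPI_TOT_CYC", "PAPI_FP_OPS", "PAPI_L1_DCM", "PAPI_L2_DCM"
    };
    int i, e;
    for (i = 0; i < n; i++) {
        printf("%s\n", regions[i].name);
        for (e = 0; e < N_EVENTS; e++)
            printf("  %-13s %6.1f %%\n", event_name[e],
                   100.0 * event_fraction(&regions[i], overall, e));
    }
}
```

A function whose share of PAPI_L1_DCM is much larger than its share of PAPI_TOT_CYC, for instance, would be a natural first candidate when looking for the source of the application's cache misses.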