doku:papi [VSC Wiki]

This version (2016/06/16 13:46) is a draft.

papi is an event-based profiling library that reads the hardware performance counters of the CPU and can thus provide useful information about critical events, e.g. cache misses, the number of floating point operations (FLOPs), the number of cycles etc.

The user has to modify the source code and insert papi calls (see below). Compiling and running the instrumented program is then as simple as:

 module purge
 module load papi/5.4.3
 gcc my_program.c -lpapi    
 ./a.out

or, for Fortran users:

 module purge
 module load papi/5.4.3
 gfortran  my_program.f -I/opt/sw/x86_64/glibc-2.12/ivybridge-ep/papi/5.4.3/gnu-4.4.7/include -lpapi
 ./a.out

In general, the code section to be analyzed with papi needs to be wrapped in a sequence of standard papi calls, e.g.

 #include <stdio.h>    // printf
 #include <stdlib.h>   // exit
 #include "papi.h"
 // PAPI variables
 // best is to analyze one particular event at a time
 int eventset;
 long long value, time0, time1, cyc0, cyc1;
     
 // PAPI Initialization
 eventset = PAPI_NULL;
 if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
    printf("PAPI init error !\n");
    exit(993);
 }
 // PAPI Event Set Creation
 if (PAPI_create_eventset(&eventset) != PAPI_OK) {
    printf("PAPI event set creation error !\n");
    exit(994);
 }
 // PAPI Specify a Particular Target Event to Analyze
 //   PAPI_TOT_CYC         Total cycles executed
 //   PAPI_FP_OPS          Floating point operations executed
 //   PAPI_L1_DCM          Level 1 data cache misses
 //   PAPI_L2_DCM          Level 2 data cache misses
 //   for other events see /opt/sw/x86_64/glibc-2.12/ivybridge-ep/papi/5.4.3/gnu-4.4.7/include/papiStdEventDefs.h
 //
 if (PAPI_add_event(eventset, PAPI_FP_OPS) != PAPI_OK) {
    printf("PAPI event set adding error !\n");
    exit(995);
 }
 // PAPI Time Estimators Initialization
 time0 = PAPI_get_real_usec();
 cyc0 = PAPI_get_real_cyc();
 // PAPI Counting Start
 if (PAPI_start(eventset) != PAPI_OK) {
    printf("PAPI start error !\n");
    exit(996);
 }
 
 //*** Here follows the original code section to be analyzed ***
 
 // PAPI Counting Stop
 if (PAPI_stop(eventset, &value) != PAPI_OK) {
    printf("PAPI stop error !\n");
    exit(997);
 }
 // PAPI Time Estimators Stop
 time1 = PAPI_get_real_usec();
 cyc1 = PAPI_get_real_cyc();
 // PAPI Results
 printf("PAPI event count %lld\n", value);
 printf("PAPI time passed in usec %lld\n", time1 - time0);
 printf("PAPI cycles passed %lld\n", cyc1 - cyc0);
 // PAPI Free Event Set
 if (PAPI_cleanup_eventset(eventset) != PAPI_OK) {
    printf("PAPI event set cleanup error !\n");
    exit(998);
 }
 if (PAPI_destroy_eventset(&eventset) != PAPI_OK) {
    printf("PAPI event set destruction error !\n");
    exit(999);
 }
 // PAPI Finalize
 PAPI_shutdown();

A quick overview of the supported events and the corresponding papi variables for a particular type of CPU can be obtained by running the command papi_avail.
The count measured for the specific event PAPI_TOT_CYC can differ significantly from the result obtained by calling PAPI_get_real_cyc(). This is particularly true for papi analysis of very small code sections that are executed frequently (e.g. hotspot functions/routines that were ranked high during time-based profiling). Although off in absolute terms, PAPI_TOT_CYC remains a useful reference time for relative comparisons.
Evaluating floating point performance on Intel Ivy Bridge: https://icl.cs.utk.edu/projects/papi/wiki/PAPITopics:SandyFlops
Useful notes on Intel's CPI metric: https://software.intel.com/en-us/node/544403
Occasionally it is useful to papi-analyze an application in two steps: first the outermost code region with a selection of characteristic events, then, in similar fashion, the set of subroutines/functions that consume major fractions of the execution time. Say, for example, we obtain overall counts for the main() part of some application, e.g. PAPI_TOT_CYC, PAPI_FP_OPS, PAPI_L1_DCM and PAPI_L2_DCM. It is then quite useful to determine the same set of event counts for suspicious subroutines/functions, look at their fractions relative to the overall counts, and identify those that best match the overall evaluation. In this way, the specific subroutines/functions that determine overall performance with respect to cache misses, FLOPs etc. can be detected.

VSC-3: papi version 5.4.3
