Differences

This shows you the differences between two versions of the page.

doku:papi [2016/06/15 13:23]
sh
doku:papi [2016/07/06 12:27]
ir [Usage of PAPI]
Line 1: Line 1:
-===== VSC-3: papi version 5.4.=====+====== PAPI ======   
 +<html> <h3> <span style="color:red;font-size:100%;">P</span>erformance <span style="color:red;font-size:100%;">A</span>pplication <span style="color:red;font-size:100%;">P</span>rogramming <span style="color:red;font-size:100%;">I</span>nterface </h3> 
 +</html> 
 +===== Synopsis ===== 
 +[[http://icl.cs.utk.edu/projects/papi/files/documentation/PAPI_USER_GUIDE_23.htm#WHAT_IS_PAPI |PAPI ]] (**P**erformance **A**pplication **P**rogramming **I**nterface) has been developed at the University of Tennessee’s Innovative Computing Laboratory in the Computer Science Department. 
 +PAPI is an event-based profiling library providing access to hardware performance counters of critical events, e.g., cache misses, number of cycles, number of floating point instructions (flpins), or floating point operations per second (FLOPS). These events can be monitored for selected sections of the code, which allows analyzing how efficiently the code maps to the underlying hardware architecture.  
 +==== Types of events ==== 
 +**Native events** 
 +are countable by the CPU and can thus be accessed directly or through the PAPI low-level interface, which delivers a CPU-specific bit pattern. 
  
 +**Preset or predefined events**
 +are software abstractions of architecture-dependent native events that are accessible through the PAPI interface. 
 +''papiStdEventDefs.h'' includes a collection of about 100 preset events, e.g., memory hierarchy, cache coherence protocol events, cycle and instruction counts, functional unit, and pipeline status.
 +However, due to differences in hardware implementation, it is sometimes only possible to count similar (not identical) types of events on different platforms, so PAPI event counts obtained on different platforms are not necessarily directly comparable.
  
-==== Synopsis: ====  +===== Usage of PAPI ===== 
-[[http://icl.cs.utk.edu/projects/papi/files/documentation/PAPI_USER_GUIDE_23.htm#WHAT_IS_PAPI papi]] is an event-based profiling library that reads out hardware performance counters from the CPU and thus can provide useful information about critical events, e.g. cache misses, number of FLOPs, number of CYCLES etc.+The user will have to modify the source code and insert PAPI calls (see below: [[doku:papi&#interfacing_with_papi|Interfacing with PAPI]]). To use PAPI, simply load the PAPI module,
  
- 
- 
-==== Usage of papi: ==== 
-The user will have to modify the source code and insert ''papi'' calls (see below). Invocation and usage is then as simple as 
- 
-    
    module purge    module purge
    module load papi/5.4.3    module load papi/5.4.3
-   gcc my_program.c -lpapi    gfortran my_program.f -lpapi )+and then compile the user's code, linking the PAPI library (''-lpapi''), e.g., for C users, 
 +   gcc my_program.c -lpapi     
 +   ./a.out    
 +or for Fortran users, 
 +   gfortran  my_program.f -I/opt/sw/x86_64/glibc-2.12/ivybridge-ep/papi/5.4.3/gnu-4.4.7/include -lpapi
    ./a.out    ./a.out
-    
-    
-==== Interfacing with papi : ==== 
-In general, some code section to be analyzed with ''papi'' needs to be wrapped into a sequence of standard ''papi'' calls, e.g.  
  
-   #include "papi.h" +==== Interfacing with PAPI : Low level interface ==== 
-   // PAPI variables +In general, a code section to be analyzed with PAPI needs to be wrapped into a sequence of standard PAPI calls, e.g., as in the following examples for  
-   // best is to analyze one particular event at a time +  * [[doku:papi_ll_c|C]] or 
-   int eventset; +  * [[doku:papi_ll_Fortran|Fortran]]. 
-   long long value, time0, time1, cyc0, cyc1; +As stated in the comments of the C example, it is best to analyze one particular event at a time. This advice is given because the CPU cannot count arbitrary combinations of events at the same time.  
-        +==== Interfacing with PAPI : High level interface ====
-   // PAPI Initialization +The high level API combines the counters for a specified list of PAPI preset events only. The set of implemented high level functions is quite limited and can be found in the section ''The High Level API'' next to the last paragraph of ''papi.h''. 
-   eventset = PAPI_NULL; +The high level API can also be used in conjunction with the low level API. 
-   if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) { + 
-      printf("PAPI init error !\n"); +Example code:  
-      exit(993); +  * [[doku:papi_hl_c|C]]  
-   } +  * [[doku:papi_hl_fortran|Fortran]]
-   // PAPI Event Set Creation +
-   if (PAPI_create_eventset(&eventset) != PAPI_OK) { +
-      printf("PAPI event set creation error !\n"); +
-      exit(994); +
-   } +
-   // PAPI Specify a Particular Target Event to Analyze +
-   //   PAPI_TOT_CYC         Total cycles executed +
-   //   PAPI_FP_OPS          Floating point operations executed +
-   //   PAPI_L1_DCM          Level 1 data cache misses +
-   //   PAPI_L2_DCM          Level 2 data cache misses +
-   //   for other events see /opt/sw/x86_64/glibc-2.12/ivybridge-ep/papi/5.4.3/gnu-4.4.7/include/papiStdEventDefs.h +
-   // +
-   if (PAPI_add_event(eventset, PAPI_FP_OPS) != PAPI_OK) { +
-      printf("PAPI event set adding error !\n"); +
-      exit(995); +
-   } +
-   // PAPI Time Estimators Initialization +
-   time0 = PAPI_get_real_usec(); +
-   cyc0 = PAPI_get_real_cyc(); +
-   // PAPI Counting Start +
-   if (PAPI_start(eventset) != PAPI_OK) { +
-      printf("PAPI start error !\n"); +
-      exit(996); +
-   } +
-    +
-   //*** Here follows the original code section to be analyzed *** +
-    +
-   // PAPI Counting Stop +
-   if (PAPI_stop(eventset, &value) != PAPI_OK) { +
-      printf("PAPI stop error !\n"); +
-      exit(997); +
-   } +
-   // PAPI Time Estimators Stop +
-   time1 = PAPI_get_real_usec(); +
-   cyc1 = PAPI_get_real_cyc(); +
-   // PAPI Results +
-   printf("PAPI event count %lld\n", value); +
-   printf("PAPI time passed in usec %lld\n", time1 - time0); +
-   printf("PAPI cycles passed %lld\n", cyc1 - cyc0); +
-   // PAPI Free Event Set +
-   if (PAPI_cleanup_eventset(eventset) != PAPI_OK) { +
-      printf("PAPI event set cleanup error !\n"); +
-      exit(998); +
-   } +
-   if (PAPI_destroy_eventset(&eventset) != PAPI_OK) { +
-      printf("PAPI event set destruction error !\n"); +
-      exit(999); +
-   } +
-   // PAPI Finalize +
-   PAPI_shutdown(); +
-    +
-   +
  
 ==== Practical tips: ==== ==== Practical tips: ====
-  * A quick overview of supported events and corresponding ''papi'' variables for a particular type of CPU is obtained from executing command ''papi_avail''.  +  * A quick overview of supported events and corresponding PAPI variables for a particular type of CPU is obtained from executing command ''papi_avail''.  
-  * Measuring the specific event ''PAPI_TOT_CYC'' can differ significantly from the result obtained by calling ''PAPI_get_real_cyc()''. This is particularly true for ''papi'' analysis of very small code sections that are executed frequently (e.g. hotspot functions/routines that were ranked high during time based profiling). Although off in absolute terms, ''PAPI_TOT_CYC'' remains a useful reference time for relative comparisons. +  * The count of the specific event ''PAPI_TOT_CYC'' can differ significantly from the result obtained by calling ''PAPI_get_real_cyc()''. This is particularly true for the PAPI analysis of very small code sections that are executed frequently (e.g. hotspot functions/routines that were ranked high during time-based profiling). Although not accurate in absolute terms, ''PAPI_TOT_CYC'' remains a useful reference time for relative comparisons. 
   * Evaluating floating point performance on Intel Ivy Bridge: [[https://icl.cs.utk.edu/projects/papi/wiki/PAPITopics:SandyFlops]]   * Evaluating floating point performance on Intel Ivy Bridge: [[https://icl.cs.utk.edu/projects/papi/wiki/PAPITopics:SandyFlops]]
   * Useful notes on Intel's CPI metric: [[https://software.intel.com/en-us/node/544403]]   * Useful notes on Intel's CPI metric: [[https://software.intel.com/en-us/node/544403]]
-  * Occasionally it is useful to ''papi''-analyze an application within two steps: at first the outermost code region by a selection of characteristic events, then in similar fashion a set of subroutines/functions that consume major fractions of the execution time. Say for example we obtain overall counts for the ''main()'' part of some application, e.g. ''PAPI_TOT_CYC'', ''PAPI_FP_OPS'', ''PAPI_L1_DCM'' and ''PAPI_L2_DCM'', then it is quite useful to determine the analogous set of event counters for suspicious subroutines/functions and look into relative fractions of these event counts and identify those which best match the initial (i.e. overall) evaluation. In so doing specific subroutines/functions can be detected that determine overall performance with respect to cache misses, flops etc.+  * Occasionally, it is useful to PAPI-analyze an application in two steps: in the first step, selected characteristic events of the outermost code region are collected. In the second step, a set of subroutines/functions consuming major fractions of the execution time is analyzed with respect to the same events. In other words, we first obtain overall counts for ''main()'' of some application, e.g. ''PAPI_TOT_CYC'', ''PAPI_FP_OPS'', ''PAPI_L1_DCM'' and ''PAPI_L2_DCM''. Subsequently, the analogous set of event counters is measured for suspicious subroutines or functions. The relative fractions of these event counts can also be useful. Based on these measures, the performance characteristics of the subroutines or functions can be compared to those of the initial evaluation of the complete code. In this way, the parts that best match the overall performance (w.r.t. cache misses, flops etc.) can be identified.
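
The relative fractions mentioned in the last tip can be computed with a few lines of plain C once the counts have been collected. The following is only a minimal sketch: the type ''event_counts'' and the functions ''fraction()'' and ''report_fractions()'' are hypothetical helpers introduced here for illustration and are not part of PAPI. The counter values themselves are assumed to come from the user's own PAPI measurements of ''main()'' and of the routine of interest.

```c
#include <stdio.h>

/* Hypothetical container (not part of PAPI) for the four preset events
 * named in the text; to be filled in by the user's measurement code. */
typedef struct {
    long long tot_cyc;  /* count measured for PAPI_TOT_CYC */
    long long fp_ops;   /* count measured for PAPI_FP_OPS  */
    long long l1_dcm;   /* count measured for PAPI_L1_DCM  */
    long long l2_dcm;   /* count measured for PAPI_L2_DCM  */
} event_counts;

/* Share of an overall count that is attributable to one routine.
 * Returns 0.0 if the overall count is not positive, which avoids
 * a division by zero for events that never occurred. */
double fraction(long long part, long long whole)
{
    return (whole > 0) ? (double)part / (double)whole : 0.0;
}

/* Print, for one routine, the percentage of each overall event count. */
void report_fractions(const char *name,
                      const event_counts *routine,
                      const event_counts *overall)
{
    printf("%-16s cycles %5.1f%%  fp_ops %5.1f%%  L1_DCM %5.1f%%  L2_DCM %5.1f%%\n",
           name,
           100.0 * fraction(routine->tot_cyc, overall->tot_cyc),
           100.0 * fraction(routine->fp_ops,  overall->fp_ops),
           100.0 * fraction(routine->l1_dcm,  overall->l1_dcm),
           100.0 * fraction(routine->l2_dcm,  overall->l2_dcm));
}
```

Calling ''report_fractions()'' once per suspicious routine, with the counts for ''main()'' passed as ''overall'', gives one line per routine. A routine whose share of cache misses is much larger than its share of cycles is then a natural candidate for a closer look.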
            
 +
 +
  
  • doku/papi.txt
  • Last modified: 2016/07/06 12:28
  • by ir