PAPI The Performance Application Programming Interface

Page 1: PAPI The Performance Application Programming Interface

PAPI: The Performance Application Programming Interface

Kevin London [email protected]

Nathan Garner [email protected]

Page 2: PAPI The Performance Application Programming Interface

Purpose

The purpose of the PAPI project is to design, standardize and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors.

Page 3: PAPI The Performance Application Programming Interface

Motivation

• To leverage existing and future performance tool development

• To increase application and system performance

• To characterize application and system workload

• To stimulate run-time optimization research

Page 4: PAPI The Performance Application Programming Interface

Goals

• Provide a solid foundation for cross platform performance analysis tools.

• Loose standardization between vendors, academics and users.

• Provide a number of implementations for HPC architectures.

• Well documented, easy to use.

Page 5: PAPI The Performance Application Programming Interface

Why PAPI is needed

• No common performance tools except prof and gprof.

• Most commercial tools are based on time.

• HPC workloads are memory and floating point intensive, so they depend on good instruction scheduling (pipelining).

Page 6: PAPI The Performance Application Programming Interface

Implementation

• Supports native events and 103 “preset” events: commonly available metrics, some of which are derived.

• Query to see if a preset exists

• Fully programmable, thread safe, low level interface directed towards the tool developer and the sophisticated user

• The EventSet is the underlying abstraction

• Hardware events are used in conjunction with one another to provide meaningful information.

Page 7: PAPI The Performance Application Programming Interface

PAPI Presets

Test case 8: Available events and hardware information.
-------------------------------------------------------------------------
Vendor string and code   : GenuineIntel (-1)
Model string and code    : Celeron (Mendocino) (6)
CPU revision             : 10.000000
CPU Megahertz            : 366.504944
-------------------------------------------------------------------------
Name         Code        Avail  Deriv  Description (Note)
PAPI_L1_DCM  0x80000000  Yes    No     Level 1 data cache misses
PAPI_L1_ICM  0x80000001  Yes    No     Level 1 instruction cache misses
PAPI_L2_DCM  0x80000002  No     No     Level 2 data cache misses
PAPI_L2_ICM  0x80000003  No     No     Level 2 instruction cache misses
PAPI_L3_DCM  0x80000004  No     No     Level 3 data cache misses
PAPI_L3_ICM  0x80000005  No     No     Level 3 instruction cache misses
PAPI_L1_TCM  0x80000006  Yes    Yes    Level 1 cache misses
PAPI_L2_TCM  0x80000007  Yes    No     Level 2 cache misses
PAPI_L3_TCM  0x80000008  No     No     Level 3 cache misses
PAPI_CA_SNP  0x80000009  No     No     Requests for a snoop
PAPI_CA_SHR  0x8000000a  No     No     Requests for shared cache line
PAPI_CA_CLN  0x8000000b  No     No     Requests for clean cache line
PAPI_CA_INV  0x8000000c  No     No     Requests for cache line inv.
...
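Every code in the listing above has the high bit set and the low bits count upward from zero. The sketch below parses a row of that listing and splits the code into those two parts. Treating 0x80000000 as a "preset" flag is an assumption read off the listing, and the helper names `split_preset_code` and `parse_row` are hypothetical, not part of PAPI.

```python
# Sketch only: interpret rows of the "Available events" listing.
# Assumption: the high bit (0x80000000) marks a preset event and the
# remaining bits are a running index. This is inferred from the listing.
PRESET_FLAG = 0x80000000


def split_preset_code(code):
    """Return (is_preset, index) for a 32-bit event code."""
    is_preset = bool(code & PRESET_FLAG)
    index = code & ~PRESET_FLAG & 0xFFFFFFFF
    return is_preset, index


def parse_row(row):
    """Parse one 'Name Code Avail Deriv Description' row."""
    name, code, avail, deriv, description = row.split(None, 4)
    return {
        "name": name,
        "code": int(code, 16),
        "available": avail == "Yes",
        "derived": deriv == "Yes",
        "description": description,
    }


# Three rows copied from the listing.
rows = [
    "PAPI_L1_DCM 0x80000000 Yes No Level 1 data cache misses",
    "PAPI_L1_TCM 0x80000006 Yes Yes Level 1 cache misses",
    "PAPI_CA_SHR 0x8000000a No No Requests for shared cache line",
]

for row in rows:
    event = parse_row(row)
    is_preset, index = split_preset_code(event["code"])
    print(event["name"], is_preset, index, event["available"], event["derived"])
```

A tool could use the "available" column the same way the slides describe querying a preset: skip any event the platform reports as unavailable before trying to count it.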

Page 8: PAPI The Performance Application Programming Interface

PAPI High Level API

• PAPI high level is meant for application programmers wanting coarse-grained measurements.

• Not tuned for efficiency.

• Calls the lower level API.

• Not thread safe. (may change)

• Only allows PAPI Presets. (may change)

Page 9: PAPI The Performance Application Programming Interface

PAPI High Level Functions

PAPI_num_counters()

PAPI_start_counters()

PAPI_stop_counters()

PAPI_read_counters()

Page 10: PAPI The Performance Application Programming Interface

Implementation

PAPI contains functions to:

• Obtain accurate time.

• Obtain information about the executable and the hardware.

• Register callbacks on counter overflow of a user threshold.

• Provide an SVR4 compatible profil() call that uses hardware counters.
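A profil()-style call builds a histogram of sampled program-counter values. With hardware counters, a sample would presumably be taken each time a counter crosses its overflow threshold rather than on a timer tick. The sketch below shows only the bucketing idea. The addresses, bucket size and function names are hypothetical, and the fixed-point scale factor of the real profil() call is not modeled.

```python
# Sketch only: bucket sampled program-counter values into a histogram.
# All addresses and sizes below are made-up illustration values.

def bucket_index(pc, text_start, bucket_bytes):
    """Map a sampled address to the histogram bucket that covers it."""
    return (pc - text_start) // bucket_bytes


def build_histogram(samples, text_start, text_size, bucket_bytes):
    """Count samples per bucket; samples outside the range are dropped."""
    n_buckets = (text_size + bucket_bytes - 1) // bucket_bytes
    histogram = [0] * n_buckets
    for pc in samples:
        if text_start <= pc < text_start + text_size:
            histogram[bucket_index(pc, text_start, bucket_bytes)] += 1
    return histogram


TEXT_START = 0x1000   # hypothetical start of the profiled code region
TEXT_SIZE = 0x80      # hypothetical region length in bytes
BUCKET_BYTES = 0x10   # hypothetical bucket width in bytes

# One entry per overflow event: the address that was executing at the time.
samples = [0x1004, 0x1008, 0x1010, 0x1044, 0x1048, 0x104C, 0x2000]

hist = build_histogram(samples, TEXT_START, TEXT_SIZE, BUCKET_BYTES)
print(hist)
```

Buckets with high counts mark the code regions where the monitored event (for example, cache misses) occurred most often.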

Page 11: PAPI The Performance Application Programming Interface

Implementation

Page 12: PAPI The Performance Application Programming Interface

49 PAPI Functions

PAPI_accum
PAPI_add_event
PAPI_add_events
PAPI_add_pevent
PAPI_cleanup_eventset
PAPI_create_eventset
PAPI_create_eventset_r
PAPI_destroy_eventset
PAPI_get_executable_info
PAPI_get_hardware_info
PAPI_get_opt
PAPI_get_overflow_address
PAPI_get_real_cyc
PAPI_get_real_usec
PAPI_get_virt_cyc
PAPI_get_virt_usec
PAPI_library_init
PAPI_thread_init
PAPI_list_events
PAPI_lock
PAPI_num_counters
PAPI_overflow
PAPI_perror
PAPI_profil
PAPI_query_all_events_verbose
PAPI_query_event
PAPI_query_event_verbose
PAPI_read
PAPI_read_counters
PAPI_rem_event
PAPI_rem_events
PAPI_reset
PAPI_restore
PAPI_save
PAPI_set_debug
PAPI_set_domain
PAPI_set_granularity
PAPI_set_opt
PAPI_shutdown
PAPI_start
PAPI_start_counters
PAPI_state
PAPI_stop
PAPI_stop_counters
PAPI_unlock
PAPI_write

Page 13: PAPI The Performance Application Programming Interface

#include "fpapi.h"

      program fmatrixlowpapi
** USER DECLARATIONS **

      call PAPIf_library_init( check )
      call PAPIf_thread_init( handle, handle, check )
      call PAPIf_num_counters( numevents )
      print *, 'number of hardware counters supported: ', numevents
      call PAPIf_add_event(EventSet,PAPI_FLOPS,check)
      call PAPIf_add_event(EventSet,PAPI_L1_TCM,check)
      call PAPIf_add_event(EventSet,PAPI_L2_TCM,check)
      call PAPIf_get_hardware_info( ncpu, nnodes, totalcpus, vendor,
     .   vstring, model, mstring, revision, mhz )
      print *, 'A', totalcpus, ' CPU ', mstring, ' at', mhz, 'Mhz.'
      print *, ncpu, nnodes, totalcpus, vendor, vstring, model,
     .   mstring, revision, mhz
      call PAPIf_get_real_usec( starttime )
      call PAPIf_start( EventSet, check )
** USER CODE **

Page 14: PAPI The Performance Application Programming Interface

      call PAPIf_stop(EventSet,values,check)
      call PAPIf_get_real_usec( stoptime )
      finaltime = (stoptime/1000000.0) - (starttime/1000000.0)

      print *, 'Time: ', finaltime
      print *, 'FLOPS: ', values(1)
      print *, 'Total Level 1 Data cache misses: ', values(2)
      print *, 'Total Level 2 Data cache misses: ', values(3)
      return
      end

Page 15: PAPI The Performance Application Programming Interface

 number of hardware counters supported: 32
 A 2 CPU R12000 at 270.0000 Mhz.
 MIPS 30 R12000 2.300000 270.0000
 Time: 1.547424316406250
 FLOPS: 4258753
 Total Level 1 Data cache misses: 1539918
 Total Level 2 Data cache misses: 6936
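Raw counts like these are usually easier to read once they are combined with each other or with the elapsed time. The sketch below repeats the microsecond-to-second conversion from the Fortran example and derives two figures from the sample run: Level 1 misses per second and the ratio of Level 2 to Level 1 misses. The function names and the start/stop timestamps are hypothetical, and the choice of derived metrics is an illustration rather than something the slides prescribe.

```python
# Sketch only: derive simple metrics from the sample run's output.

def elapsed_seconds(start_usec, stop_usec):
    """Convert two microsecond timestamps into elapsed seconds."""
    return (stop_usec - start_usec) / 1_000_000.0


def rate_per_second(count, seconds):
    """Events per second over the measured interval."""
    return count / seconds


def miss_ratio(l2_misses, l1_misses):
    """Fraction of Level 1 misses that also missed in Level 2."""
    return l2_misses / l1_misses


# Values copied from the sample output.
seconds = 1.547424316406250
l1_misses = 1539918
l2_misses = 6936

print("L1 misses per second:", round(rate_per_second(l1_misses, seconds)))
print("L2/L1 miss ratio: {:.4f}".format(miss_ratio(l2_misses, l1_misses)))

# Hypothetical timestamps, shown only to illustrate the conversion.
print("elapsed:", elapsed_seconds(2_000_000, 3_547_424), "s")
```

A low Level 2 to Level 1 ratio, as in this run, suggests that most data that missed in the first-level cache was still found in the second-level cache.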

Page 16: PAPI The Performance Application Programming Interface

Threads and PAPI

• PAPI must be able to support both explicit (library) and implicit (compiler) threading models.

• However, this can only happen if the threads are ‘bound’.

• A ‘bound’ thread is one that has a scheduling entity known and handled by the OS kernel.

Page 17: PAPI The Performance Application Programming Interface

The 1.0 Release

• Platforms

  • Linux/x86

  • Solaris/Ultra

  • AIX/Power

  • Tru64/Alpha

  • IRIX/MIPS

• Fortran wrappers

• Thread support

• Remote CVS access

• Updated Web Site

• Documentation

• Tool integration

Page 18: PAPI The Performance Application Programming Interface

UTK Tools

• Perfometer

  • Real time trace based visualization of metrics at the subroutine level. (Java/Swing)

• Profometer (planned)

  • Real time sample based visualization at the line level. (Java/Swing)

• Hwprof (planned)

  • Back end to generate performance data to be fed into the above tools. Possible integration with DynInst.

Page 19: PAPI The Performance Application Programming Interface

Perfometer Features

• Platform independent visualization of PAPI metrics

• Graphical display may run remotely, freeing the compute node of the drawing overhead

• Flexible interface (internal drawing classes are reused for other tools)

• Quick interpretation of complex results

• Color coding to highlight selected procedures

Page 20: PAPI The Performance Application Programming Interface

Perfometer Screenshot

Page 21: PAPI The Performance Application Programming Interface

Perfometer Usage

• Application is instrumented with a single call to perfometer()

• Sections of code that are of interest can be distinguished in the graph with specific colors using a call to mark_perfometer(COLOR)

• #include "papicolorcodes.h"

• call perfometer

• call mark_perfometer(RED)

Page 22: PAPI The Performance Application Programming Interface

Perfometer Future Development

• Allow runtime selection of multiple PAPI metrics for simultaneous display

• Integration with Dyninst to eliminate need for recompiling user codes

• Dump trace data to file for post-mortem study

• Additional graph display types

Page 23: PAPI The Performance Application Programming Interface

Profometer Features

• Visual representation of the quantity of a given metric spent in a particular code segment

• Color coding of user selected code segments

• Zoom in and out to emphasize sections of interest

• Reuse of the Perfometer engine

Page 24: PAPI The Performance Application Programming Interface

Profometer Screenshot

Profometer – Histogram of a given metric per code segment

Page 25: PAPI The Performance Application Programming Interface

Profometer Future Development

• Run time modification of metric being monitored

• Hooks into debugging interface to allow GDB style interaction with source code

Page 26: PAPI The Performance Application Programming Interface

UTK hwprof Screenshot

rusage child
============
user time sec                 1.000    num of swap operations         0
sys time sec                  0.010    block input operations         0
real time sec                 1.010    block output operations        0
maximum resident set size     0        messages sent                  0
(ru_ixrss) currently null     0        messages received              0
integral resident set size    0        signals received               0
(ru_ixrss) currently null     0        voluntary context switches     0
page faults without I/O       29       involuntary context switches   0
page faults with I/O          78

local platform
==============
num hw counters: 3
clock tick: 100 Hz
PAPI clock rate: 199.00 MHz
PAPI cycle time: 0.00502513 usec/cycle
CPU name for this node: redwood.cs.utk.edu

PAPI counts
===========
PAPI_TOT_CYC: 4419
PAPI_INT_INS: 4451
PAPI_TOT_INS: 102034
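The "PAPI cycle time" in the screenshot follows directly from the "PAPI clock rate": a rate in MHz is cycles per microsecond, so the cycle time in microseconds is its reciprocal. The sketch below reproduces the 0.00502513 usec/cycle figure and converts a cycle count into time. The function names are hypothetical.

```python
# Sketch only: relate clock rate, cycle time and cycle counts.

def cycle_time_usec(clock_mhz):
    """Microseconds per cycle; MHz means cycles per microsecond."""
    return 1.0 / clock_mhz


def cycles_to_usec(cycles, clock_mhz):
    """Time in microseconds taken by a number of cycles at a clock rate."""
    return cycles * cycle_time_usec(clock_mhz)


CLOCK_MHZ = 199.00   # "PAPI clock rate" from the screenshot
TOT_CYC = 4419       # "PAPI_TOT_CYC" from the screenshot

print("{:.8f} usec/cycle".format(cycle_time_usec(CLOCK_MHZ)))
print("{:.2f} usec".format(cycles_to_usec(TOT_CYC, CLOCK_MHZ)))
```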

Page 27: PAPI The Performance Application Programming Interface

Other Tools using PAPI

Page 28: PAPI The Performance Application Programming Interface

U. Illinois: SvPablo

• Source code instrumentation based profiling of F77, F90, C and C++.

• Color coded key next to source code indicating severity of metric.

• MPI aware.

• Statistics at the function, loop and line level.

Page 29: PAPI The Performance Application Programming Interface

U. Illinois: SvPablo

Page 30: PAPI The Performance Application Programming Interface

U. Oregon: TAU

• Source code based instrumentation of C, C++, F77, F90, HPF and pC++.

• Maintains a program database in which to store and localize performance data.

• Multiple lightweight tools and a launcher

  • Including call graph/control flow browser, a class browser, a remote debugger, MPI trace analysis and a profiler.

• Integrated with PAPI.

Page 31: PAPI The Performance Application Programming Interface

TAU: Racy/PAPI

Page 32: PAPI The Performance Application Programming Interface

TAU: Racy

Page 33: PAPI The Performance Application Programming Interface

Visual Profiler: vprof

• Developed by Curtis Janssen at Sandia Livermore

• Creates and visualizes line level execution profiles obtained with PC-sampling.

• Data usually generated with the profil()/monitor() library/system call or done by hand with interval timers and signal information.

• Ported to use PAPI_profil() in a day.

Page 34: PAPI The Performance Application Programming Interface

Sandia Livermore: vprof

Page 35: PAPI The Performance Application Programming Interface

Pacific Sierra Research DEEP/MPI

• Source code instrumentation based profiling at the basic block level (regions of code with one entry and one exit, on the order of tens of instructions).

• Comprehensive visualization and analysis.

• Integrated source code browser with highlighting.

• Works now with MPI, soon with OpenMP.

• Integrated with PAPI.

Page 36: PAPI The Performance Application Programming Interface

Pacific Sierra Research DEEP/MPI

Page 37: PAPI The Performance Application Programming Interface

Web Resources

• Mailing list

  • send “subscribe ptools-perfapi” to [email protected]

  • [email protected] is the reflector

• Web page

  • http://icl.cs.utk.edu/projects/papi

• Post RISC paper by Richard Enbody et al.

  • http://www.cps.msu.edu/~crs/cps920/

Page 38: PAPI The Performance Application Programming Interface

Web Resources 2

• PCL
  http://www.fz-juelich.de/zam/PCL/

• Vprof
  http://aros.ca.sandia.gov/~cljanss/perf/vprof/

• Paradyn
  http://www.cs.wisc.edu/paradyn/libhrtime/

• DynInst
  http://www.cs.umd.edu/projects/dyninstAPI/

• Libhrtime
  http://www.cs.wisc.edu/paradyn/libhrtime/

• TAU
  http://www.cs.uoregon.edu/research/paracomp/tau/

• SvPablo
  http://www-pablo.cs.uiuc.edu/Project/SVPablo/SvPabloOverview.htm

Page 39: PAPI The Performance Application Programming Interface

The Future

• x86/Alpha Linux kernel

  • Implementation under /proc

  • Merge with libhrtime patch from U. Wisc

• Support for signal dispatch on hardware counter overflow

• Support for 21064, HP PA 8000, Cray Inc. SV, IBM P2SC

Page 40: PAPI The Performance Application Programming Interface

Source Code Access

• Every 24 hours, snapshot of source tree at:
  http://icl.cs.utk.edu/projects/papi/snapshot.cgi

• Remote read-only access to the CVS source tree:

  > (csh) setenv CVSROOT [email protected]:/cvs/homes/papi
  or
  % (sh) export CVSROOT=[email protected]:/cvs/homes/papi

  cvs login
  password: <cr>
  cvs checkout papi or cvs update
  cd papi/src
  make -f Makefile.<arch>
  cvs logout

Page 41: PAPI The Performance Application Programming Interface

The Future

• Dynamic Instrumentation of Running Applications via Dyninst

• Support of gathering performance data of Applications using MPI

• Support for 21064, HP PA 8000, Cray Inc. SV, IBM P2SC