NUG Meeting
Performance Profiling Using hpmcount, poe+ & libhpm
Richard Gerber, NERSC User Services
510-486-6820
Introduction
• How to obtain performance numbers
• Tools based on IBM’s PMAPI
• Relevant for FY2003 ERCAP
Agenda
• Low Level PAPI Interface
• HPM Toolkit
– hpmcount
– poe+
• libhpm : hardware performance library
Overview
• These tools measure application performance
• All can be used to tune applications
• Performance numbers are needed for FY 2003 ERCAP applications
Vocabulary
• PMAPI – IBM’s low-level interface
• PAPI – Performance API (portable)
• hpmcount, poe+ report overall code performance
• libhpm can be used to instrument portions of code
PAPI
• Standard application programming interface
• Portable; don't confuse with IBM's low-level PMAPI interface
• Can access hardware counter info
• V2.1 at NERSC
• See
– http://hpcf.nersc.gov/software/papi.html
– http://icl.cs.utk.edu/projects/papi/
Using PAPI
• PAPI is available through a module
– module load papi
• You place calls in source code
– xlf -O3 source.F $PAPI
#include "fpapi.h"
…
integer*8 values(2)
integer counters(2), ncounters, irc
…
! Initialize the PAPI library
irc = PAPI_VER_CURRENT
CALL papif_library_init(irc)
! Count FMA instructions and floating point instructions
counters(1) = PAPI_FMA_INS
counters(2) = PAPI_FP_INS
ncounters = 2
CALL papif_start_counters(counters, ncounters, irc)
… ! section of code to be measured
call papif_stop_counters(values, ncounters, irc)
write(6,*) 'Total FMA ', values(1), ' Total FP ', values(2)
…
hpmcount
• Easy to use
• Does not affect code performance
• Profiles entire code
• Uses hardware counters
• Reports flip (floating point instruction) rate and many other quantities
hpmcount usage
• Serial
– %hpmcount executable
• Parallel
– %poe hpmcount executable -nodes n -procs np
• Gives performance numbers for each task
• Prints output to STDOUT (or use -o filename)
• Beware! These profile the poe command itself:
– hpmcount poe executable
– hpmcount executable (if compiled with mp* compilers)
hpmcount example
ex1.f - Unoptimized matrix-matrix multiply
% xlf90 -o ex1 -O3 -qstrict ex1.f
% hpmcount ./ex1

hpmcount (V 2.3.1) summary

Total execution time (wall clock time): 17.258385 seconds

######## Resource Usage Statistics ########

Total amount of time in user mode          : 17.220000 seconds
Total amount of time in system mode        : 0.040000 seconds
Maximum resident set size                  : 3116 Kbytes
Average shared memory use in text segment  : 6900 Kbytes*sec
Average unshared memory use in data segment: 5344036 Kbytes*sec
Number of page faults without I/O activity : 785
Number of page faults with I/O activity    : 1
Number of times process was swapped out    : 0
Number of times file system performed INPUT : 0
Number of times file system performed OUTPUT: 0
Number of IPC messages sent                : 0
Number of IPC messages received            : 0
Number of signals delivered                : 0
Number of voluntary context switches       : 1
Number of involuntary context switches     : 1727

####### End of Resource Statistics ########
hpmcount output
ex1.f - Unoptimized matrix-matrix multiply
% xlf90 -o ex1 -O3 -qstrict ex1.f
% hpmcount ./ex1

PM_CYC (Cycles)                      : 6428126205
PM_INST_CMPL (Instructions completed): 693651174
PM_TLB_MISS (TLB misses)             : 122468941
PM_ST_CMPL (Stores completed)        : 125758955
PM_LD_CMPL (Loads completed)         : 250513627
PM_FPU0_CMPL (FPU 0 instructions)    : 249691884
PM_FPU1_CMPL (FPU 1 instructions)    : 3134223
PM_EXEC_FMA (FMAs executed)          : 126535192

Utilization rate                          : 99.308 %
Avg number of loads per TLB miss          : 2.046
Load and store operations                 : 376.273 M
Instructions per load/store               : 1.843
MIPS                                      : 40.192
Instructions per cycle                    : 0.108
HW Float points instructions per Cycle    : 0.039
Floating point instructions + FMAs        : 379.361 M
Float point instructions + FMA rate       : 21.981 Mflip/s
FMA percentage                            : 66.710 %
Computation intensity                     : 1.008
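The derived quantities in this report are simple functions of the raw counters and the wall-clock time. A sketch of the arithmetic in Python, using the numbers printed above (the formulas are reconstructed from the definitions on the following slides, not taken from hpmcount itself):

```python
# Raw counters and wall-clock time from the hpmcount report above.
wall_s = 17.258385
pm_inst_cmpl = 693_651_174   # instructions completed
pm_st_cmpl = 125_758_955     # stores completed
pm_ld_cmpl = 250_513_627     # loads completed
pm_fpu0 = 249_691_884        # FPU 0 instructions
pm_fpu1 = 3_134_223          # FPU 1 instructions
pm_fma = 126_535_192         # FMAs executed

# "Floating point instructions + FMAs": each FMA counts once as an
# FPU instruction and once more for its second flop.
flips = pm_fpu0 + pm_fpu1 + pm_fma

flip_rate_m = flips / wall_s / 1e6             # Mflip/s
mips = pm_inst_cmpl / wall_s / 1e6             # MIPS
fma_pct = 100.0 * 2 * pm_fma / flips           # FMA percentage
intensity = flips / (pm_ld_cmpl + pm_st_cmpl)  # computation intensity

print(f"{flips / 1e6:.3f} M flips, {flip_rate_m:.3f} Mflip/s")
print(f"MIPS {mips:.3f}, FMA {fma_pct:.3f} %, intensity {intensity:.3f}")
```

Running this reproduces the derived lines of the report (379.361 M flips, 21.981 Mflip/s, and so on), which confirms how the report's summary numbers relate to the raw counters.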
Floating point measures
• PM_FPU0_CMPL (FPU 0 instructions)
• PM_FPU1_CMPL (FPU 1 instructions)
– The POWER3 processor has two Floating Point Units (FPUs) which operate in parallel.
– Each FPU can start a new instruction at every cycle.
– This is the number of floating point instructions (add, multiply, subtract, divide, multiply+add) that have been executed by each FPU.
• PM_EXEC_FMA (FMAs executed)
– The POWER3 can execute a computation of the form x=s*a+b with one instruction. This is known as a Floating point Multiply & Add (FMA).
Total flop rate
• Float point instructions + FMA rate
– Float point instructions + FMAs gives the number of floating point operations. The two are added together since an FMA instruction yields 2 floating point operations.
– The rate gives the code's Mflops.
– The POWER3 has a peak rate of 1500 Mflops (375 MHz clock x 2 FPUs x 2 flops/FMA instruction).
– Our example: 22 Mflops.
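The peak-rate arithmetic on this slide is easy to check; a minimal sketch, where the percent-of-peak figure is my own derived number rather than one from the slide:

```python
clock_mhz = 375        # POWER3 clock
fpus = 2               # two floating point units per processor
flops_per_fma = 2      # an FMA does a multiply and an add

peak_mflops = clock_mhz * fpus * flops_per_fma
print(peak_mflops)     # theoretical peak in Mflops -> 1500

measured_mflops = 22   # the unoptimized ex1.f from the example
print(f"{100 * measured_mflops / peak_mflops:.1f} % of peak")
```

The unoptimized loop thus runs at roughly 1.5% of the machine's peak, which is why the memory-access analysis on the next slides matters.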
Memory access
• Average number of loads per TLB miss
– Memory addresses that are in the Translation Lookaside Buffer can be accessed quickly.
– Each time a TLB miss occurs, a new page (4KB, 512 8-byte elements) is brought into the buffer.
– A value of ~500 means each element is accessed ~1 time while the page is in the buffer.
– A small value indicates that needed data is stored in widely separated places in memory, and a redesign of data structures may help performance significantly.
– Our example: 2.0
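A sketch of the page arithmetic behind the ~500 rule of thumb, using the values from the slide (the "fraction of the page used" framing is my own, not hpmcount output):

```python
page_bytes = 4096      # a 4 KB page is mapped per TLB entry
element_bytes = 8      # one 8-byte (double precision) array element

elements_per_page = page_bytes // element_bytes
print(elements_per_page)   # elements available per mapped page -> 512

# Our example measures only ~2 loads per TLB miss, so each freshly
# mapped page is touched ~2 times out of a possible ~512.
loads_per_miss = 2.046
print(f"{100 * loads_per_miss / elements_per_page:.1f} % of each page used")
```

A ratio this far below 512 is the signature of large-stride access: almost every load walks to a new page.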
Cache hits
• The -sN option to hpmcount specifies a different statistics set
• -s2 will include the L1 data cache hit rate
– 33.4% for our example
– See http://hpcf.nersc.gov/software/ibm/hpmcount/HPM_README.html for more options and descriptions.
Optimizing the code
• Original code fragment
DO I=1,N
DO K=1,N
DO J=1,N
Z(I,J) = Z(I,J) + X(I,K) * Y(K,J)
END DO
END DO
END DO
Optimizing the code
• “Optimized” code: move I to inner loop
DO J=1,N
DO K=1,N
DO I=1,N
Z(I,J) = Z(I,J) + X(I,K) * Y(K,J)
END DO
END DO
END DO
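The interchange only changes the order in which the sum is accumulated, not the result (in Fortran's column-major storage, the inner I loop makes all three array accesses stride-1). A quick Python check of that equivalence, with plain nested lists standing in for the Fortran arrays (an assumption of this sketch):

```python
N = 4
X = [[(i + 2 * k) % 7 for k in range(N)] for i in range(N)]
Y = [[(3 * k + j) % 5 for j in range(N)] for k in range(N)]

def matmul(order):
    # Accumulate Z[i][j] += X[i][k] * Y[k][j] with the loop nest in the
    # requested order: "ikj" mirrors the original slide, "jki" the
    # "optimized" version with I moved to the inner loop.
    Z = [[0] * N for _ in range(N)]
    loops = {"i": range(N), "k": range(N), "j": range(N)}
    for a in loops[order[0]]:
        for b in loops[order[1]]:
            for c in loops[order[2]]:
                idx = dict(zip(order, (a, b, c)))
                i, j, k = idx["i"], idx["j"], idx["k"]
                Z[i][j] += X[i][k] * Y[k][j]
    return Z

assert matmul("ikj") == matmul("jki")  # same product, different order
print("both loop orders give the same Z")
```

Any permutation of the three loops is legal here because each Z(I,J) is a sum of independent terms; only the memory access pattern, and hence the speed, differs.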
Optimized results
• Float point instructions + FMA rate
– 461 vs. 22 Mflip/s (ESSL: 933)
• Avg number of loads per TLB miss
– 20,877 vs. 2.0 (ESSL: 162)
• L1 cache hit rate
– 98.9% vs. 33.4%
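The improvement factors implied by these numbers, as derived ratios of my own (they are not printed on the slide):

```python
# Flip rates before and after the loop interchange, plus ESSL's
# library routine, all in Mflip/s (numbers from the slide above).
flip_rate = {"original": 22.0, "interchanged": 461.0, "essl": 933.0}

print(f"{flip_rate['interchanged'] / flip_rate['original']:.0f}x speedup")
print(f"ESSL reaches {100 * flip_rate['essl'] / 1500:.0f} % of the "
      "1500 Mflip/s peak")
```

A three-line loop interchange buys roughly a 21x speedup, and the tuned ESSL routine still doubles that again, a common pattern favoring vendor libraries for dense linear algebra.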
Using libhpm
• libhpm can instrument code sections
• Embed calls into source code
– Fortran, C, C++
• Contained in hpmtoolkit module
– module load hpmtoolkit
• Compile with $HPMTOOLKIT
– xlf -O3 source.F $HPMTOOLKIT
• Execute program normally
libhpm example
…
#include "f_hpm.h"
…
CALL f_hpminit(0, "someid")
CALL f_hpmstart(1, "matrix-matrix multiply")
DO J=1,N
DO K=1,N
DO I=1,N
Z(I,J) = Z(I,J) + X(I,K) * Y(K,J)
END DO
END DO
END DO
CALL f_hpmstop(1)
CALL f_hpmterminate(0)
…
Parallel programs
• poe hpmcount executable -nodes n -procs np
– Will print output to STDOUT separately for each task
• poe+ executable -nodes n -procs np
– Will print aggregate numbers to STDOUT
• libhpm
– Writes output to a separate file for each task
• Do not do these!
– hpmcount poe executable …
– hpmcount executable (if compiled with mp* compiler)
Summary
• Utilities to measure performance
– PAPI
– hpmcount
– poe+
– libhpm
• You need to quote performance data in your ERCAP application
Where to Get More Information
• NERSC Website: hpcf.nersc.gov
• PAPI
– http://hpcf.nersc.gov/software/tools/papi.html
• hpmcount, poe+
– http://hpcf.nersc.gov/software/ibm/hpmcount/
– http://hpcf.nersc.gov/software/ibm/hpmcount/counter.html
• libhpm
– http://hpcf.nersc.gov/software/ibm/hpmcount/HPM_README.html