NUG Meeting
Performance Profiling Using hpmcount, poe+ & libhpm
Richard Gerber, NERSC User Services
510-486-6820
Introduction
• How to obtain performance numbers
• Tools based on IBM’s PMAPI
• Relevant for FY2003 ERCAP
Agenda
• Low Level PAPI Interface
• HPM Toolkit
– hpmcount
– poe+
• libhpm : hardware performance library
Overview
• These tools measure application performance
• All can be used to tune applications
• Performance numbers are needed for FY 2003 ERCAP applications
Vocabulary
• PMAPI – IBM’s low-level interface
• PAPI – Performance API (portable)
• hpmcount, poe+ report overall code performance
• libhpm can be used to instrument portions of code
PAPI
• Standard application programming interface
• Portable; don't confuse with IBM's low-level PMAPI interface
• Can access hardware counter info
• V2.1 at NERSC
• See
– http://hpcf.nersc.gov/software/papi.html
– http://icl.cs.utk.edu/projects/papi/
Using PAPI
• PAPI is available through a module
– module load papi
• You place calls in source code
– xlf -O3 source.F $PAPI
#include "fpapi.h"
…
integer*8 values(2)
integer counters(2), ncounters, irc
…
! Initialize the PAPI library
irc = PAPI_VER_CURRENT
CALL papif_library_init(irc)
! Count FMA instructions and floating point instructions
counters(1) = PAPI_FMA_INS
counters(2) = PAPI_FP_INS
ncounters = 2
CALL papif_start_counters(counters, ncounters, irc)
… ! section of code to be measured
call papif_stop_counters(values, ncounters, irc)
write(6,*) 'Total FMA ', values(1), ' Total FP ', values(2)
…
hpmcount
• Easy to use
• Does not affect code performance
• Profiles entire code
• Uses hardware counters
• Reports flip (floating point instruction) rate and many other quantities
hpmcount usage
• Serial
– %hpmcount executable
• Parallel
– %poe hpmcount executable -nodes n -procs np
• Gives performance numbers for each task
• Prints output to STDOUT (or use -o filename)
• Beware! These profile the poe command itself:
– hpmcount poe executable
– hpmcount executable (if compiled with mp* compilers)
hpmcount example
ex1.f - Unoptimized matrix-matrix multiply
% xlf90 -o ex1 -O3 -qstrict ex1.f
% hpmcount ./ex1

hpmcount (V 2.3.1) summary

Total execution time (wall clock time): 17.258385 seconds

######## Resource Usage Statistics ########

Total amount of time in user mode          : 17.220000 seconds
Total amount of time in system mode        : 0.040000 seconds
Maximum resident set size                  : 3116 Kbytes
Average shared memory use in text segment  : 6900 Kbytes*sec
Average unshared memory use in data segment: 5344036 Kbytes*sec
Number of page faults without I/O activity : 785
Number of page faults with I/O activity    : 1
Number of times process was swapped out    : 0
Number of times file system performed INPUT : 0
Number of times file system performed OUTPUT: 0
Number of IPC messages sent                : 0
Number of IPC messages received            : 0
Number of signals delivered                : 0
Number of voluntary context switches       : 1
Number of involuntary context switches     : 1727

####### End of Resource Statistics ########
hpmcount output
ex1.f - Unoptimized matrix-matrix multiply
% xlf90 -o ex1 -O3 -qstrict ex1.f
% hpmcount ./ex1

PM_CYC (Cycles)                      : 6428126205
PM_INST_CMPL (Instructions completed): 693651174
PM_TLB_MISS (TLB misses)             : 122468941
PM_ST_CMPL (Stores completed)        : 125758955
PM_LD_CMPL (Loads completed)         : 250513627
PM_FPU0_CMPL (FPU 0 instructions)    : 249691884
PM_FPU1_CMPL (FPU 1 instructions)    : 3134223
PM_EXEC_FMA (FMAs executed)          : 126535192

Utilization rate                          : 99.308 %
Avg number of loads per TLB miss          : 2.046
Load and store operations                 : 376.273 M
Instructions per load/store               : 1.843
MIPS                                      : 40.192
Instructions per cycle                    : 0.108
HW Float points instructions per Cycle    : 0.039
Floating point instructions + FMAs        : 379.361 M
Float point instructions + FMA rate       : 21.981 Mflip/s
FMA percentage                            : 66.710 %
Computation intensity                     : 1.008
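The derived quantities in this report are simple functions of the raw counters and the wall-clock time. A sketch of the arithmetic in Python, using the numbers printed above (the formulas are reconstructed from the definitions on the following slides, not taken from hpmcount itself):

```python
# Raw counters and wall-clock time from the hpmcount report above.
wall_s = 17.258385
pm_inst_cmpl = 693_651_174   # instructions completed
pm_st_cmpl = 125_758_955     # stores completed
pm_ld_cmpl = 250_513_627     # loads completed
pm_fpu0 = 249_691_884        # FPU 0 instructions
pm_fpu1 = 3_134_223          # FPU 1 instructions
pm_fma = 126_535_192         # FMAs executed

# "Floating point instructions + FMAs": each FMA counts once as an
# FPU instruction and once more for its second flop.
flips = pm_fpu0 + pm_fpu1 + pm_fma

flip_rate_m = flips / wall_s / 1e6             # Mflip/s
mips = pm_inst_cmpl / wall_s / 1e6             # MIPS
fma_pct = 100.0 * 2 * pm_fma / flips           # FMA percentage
intensity = flips / (pm_ld_cmpl + pm_st_cmpl)  # computation intensity

print(f"{flips / 1e6:.3f} M flips, {flip_rate_m:.3f} Mflip/s")
print(f"MIPS {mips:.3f}, FMA {fma_pct:.3f} %, intensity {intensity:.3f}")
```

Running this reproduces the derived lines of the report (379.361 M flips, 21.981 Mflip/s, and so on), which confirms how the report's summary numbers relate to the raw counters.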
Floating point measures
• PM_FPU0_CMPL (FPU 0 instructions)
• PM_FPU1_CMPL (FPU 1 instructions)
– The POWER3 processor has two Floating Point Units (FPUs) which operate in parallel.
– Each FPU can start a new instruction at every cycle.
– This is the number of floating point instructions (add, multiply, subtract, divide, multiply+add) that have been executed by each FPU.
• PM_EXEC_FMA (FMAs executed)
– The POWER3 can execute a computation of the form x=s*a+b with one instruction. This is known as a Floating point Multiply & Add (FMA).
Total flop rate
• Float point instructions + FMA rate
– Float point instructions + FMAs gives the number of floating point operations. The two are added together since an FMA instruction yields 2 floating point operations.
– The rate gives the code's Mflops.
– The POWER3 has a peak rate of 1500 Mflops (375 MHz clock x 2 FPUs x 2 flops/FMA instruction).
– Our example: 22 Mflops.
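The peak-rate arithmetic on this slide is easy to check; a minimal sketch, where the percent-of-peak figure is my own derived number rather than one from the slide:

```python
clock_mhz = 375        # POWER3 clock
fpus = 2               # two floating point units per processor
flops_per_fma = 2      # an FMA does a multiply and an add

peak_mflops = clock_mhz * fpus * flops_per_fma
print(peak_mflops)     # theoretical peak in Mflops -> 1500

measured_mflops = 22   # the unoptimized ex1.f from the example
print(f"{100 * measured_mflops / peak_mflops:.1f} % of peak")
```

The unoptimized loop thus runs at roughly 1.5% of the machine's peak, which is why the memory-access analysis on the next slides matters.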
Memory access
• Average number of loads per TLB miss
– Memory addresses that are in the Translation Lookaside Buffer can be accessed quickly.
– Each time a TLB miss occurs, a new page (4KB, 512 8-byte elements) is brought into the buffer.
– A value of ~500 means each element is accessed ~1 time while the page is in the buffer.
– A small value indicates that needed data is stored in widely separated places in memory, and a redesign of data structures may help performance significantly.
– Our example: 2.0
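A sketch of the page arithmetic behind the ~500 rule of thumb, using the values from the slide (the "fraction of the page used" framing is my own, not hpmcount output):

```python
page_bytes = 4096      # a 4 KB page is mapped per TLB entry
element_bytes = 8      # one 8-byte (double precision) array element

elements_per_page = page_bytes // element_bytes
print(elements_per_page)   # elements available per mapped page -> 512

# Our example measures only ~2 loads per TLB miss, so each freshly
# mapped page is touched ~2 times out of a possible ~512.
loads_per_miss = 2.046
print(f"{100 * loads_per_miss / elements_per_page:.1f} % of each page used")
```

A ratio this far below 512 is the signature of large-stride access: almost every load walks to a new page.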
Cache hits
• The -sN option to hpmcount specifies a different statistics set
• -s2 will include the L1 data cache hit rate
– 33.4% for our example
– See http://hpcf.nersc.gov/software/ibm/hpmcount/HPM_README.html for more options and descriptions.
Optimizing the code
• Original code fragment
DO I=1,N
DO K=1,N
DO J=1,N
Z(I,J) = Z(I,J) + X(I,K) * Y(K,J)
END DO
END DO
END DO
Optimizing the code
• “Optimized” code: move I to inner loop
DO J=1,N
DO K=1,N
DO I=1,N
Z(I,J) = Z(I,J) + X(I,K) * Y(K,J)
END DO
END DO
END DO
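The interchange only changes the order in which the sum is accumulated, not the result (in Fortran's column-major storage, the inner I loop makes all three array accesses stride-1). A quick Python check of that equivalence, with plain nested lists standing in for the Fortran arrays (an assumption of this sketch):

```python
N = 4
X = [[(i + 2 * k) % 7 for k in range(N)] for i in range(N)]
Y = [[(3 * k + j) % 5 for j in range(N)] for k in range(N)]

def matmul(order):
    # Accumulate Z[i][j] += X[i][k] * Y[k][j] with the loop nest in the
    # requested order: "ikj" mirrors the original slide, "jki" the
    # "optimized" version with I moved to the inner loop.
    Z = [[0] * N for _ in range(N)]
    loops = {"i": range(N), "k": range(N), "j": range(N)}
    for a in loops[order[0]]:
        for b in loops[order[1]]:
            for c in loops[order[2]]:
                idx = dict(zip(order, (a, b, c)))
                i, j, k = idx["i"], idx["j"], idx["k"]
                Z[i][j] += X[i][k] * Y[k][j]
    return Z

assert matmul("ikj") == matmul("jki")  # same product, different order
print("both loop orders give the same Z")
```

Any permutation of the three loops is legal here because each Z(I,J) is a sum of independent terms; only the memory access pattern, and hence the speed, differs.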
Optimized results
• Float point instructions + FMA rate
– 461 vs. 22 Mflip/s (ESSL: 933)
• Avg number of loads per TLB miss
– 20,877 vs. 2.0 (ESSL: 162)
• L1 cache hit rate
– 98.9% vs. 33.4%
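The improvement factors implied by these numbers, as derived ratios of my own (they are not printed on the slide):

```python
# Flip rates before and after the loop interchange, plus ESSL's
# library routine, all in Mflip/s (numbers from the slide above).
flip_rate = {"original": 22.0, "interchanged": 461.0, "essl": 933.0}

print(f"{flip_rate['interchanged'] / flip_rate['original']:.0f}x speedup")
print(f"ESSL reaches {100 * flip_rate['essl'] / 1500:.0f} % of the "
      "1500 Mflip/s peak")
```

A three-line loop interchange buys roughly a 21x speedup, and the tuned ESSL routine still doubles that again, a common pattern favoring vendor libraries for dense linear algebra.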
Using libhpm
• libhpm can instrument code sections
• Embed calls into source code
– Fortran, C, C++
• Contained in hpmtoolkit module
– module load hpmtoolkit
• Compile with $HPMTOOLKIT
– xlf -O3 source.F $HPMTOOLKIT
• Execute program normally
libhpm example
…
#include "f_hpm.h"
…
CALL f_hpminit(0, "someid")
CALL f_hpmstart(1, "matrix-matrix multiply")
DO J=1,N
DO K=1,N
DO I=1,N
Z(I,J) = Z(I,J) + X(I,K) * Y(K,J)
END DO
END DO
END DO
CALL f_hpmstop(1)
CALL f_hpmterminate(0)
…
Parallel programs
• poe hpmcount executable -nodes n -procs np
– Will print output to STDOUT separately for each task
• poe+ executable -nodes n -procs np
– Will print aggregate numbers to STDOUT
• libhpm
– Writes output to a separate file for each task
• Do not do these!
– hpmcount poe executable …
– hpmcount executable (if compiled with mp* compiler)
Summary
• Utilities to measure performance
– PAPI
– hpmcount
– poe+
– libhpm
• You need to quote performance data in your ERCAP application
Where to Get More Information
• NERSC Website: hpcf.nersc.gov
• PAPI
– http://hpcf.nersc.gov/software/tools/papi.html
• hpmcount, poe+
– http://hpcf.nersc.gov/software/ibm/hpmcount/
– http://hpcf.nersc.gov/software/ibm/hpmcount/counter.html
• libhpm
– http://hpcf.nersc.gov/software/ibm/hpmcount/HPM_README.html