Project F2: Application Performance Analysis
Seth Koehler, John Curreri, Rafael Garcia


Transcript of "Project F2: Application Performance Analysis" by Seth Koehler, John Curreri, and Rafael Garcia.

Page 1: Project F2: Application Performance Analysis Seth Koehler John Curreri Rafael Garcia.

Project F2: Application Performance Analysis

Seth Koehler

John Curreri

Rafael Garcia

Page 2:

Outline
- Introduction
- Performance analysis overview
  - Historical background
  - Performance analysis today
  - Related research and tools
- RC performance analysis
  - Motivation, instrumentation, framework, visualization, user's perspective
- Case studies
  - N-Queens
  - Collatz (3x+1) conjecture
- Conclusions & references

Page 3:

Introduction
Goals for performance analysis in RC:
- Productively identify and remedy performance bottlenecks in RC applications (CPUs and FPGAs)

Motivations:
- Complex systems are difficult to analyze by hand; manual instrumentation is unwieldy, and it is difficult to make sense of a large volume of raw data
- Tools can help quickly locate performance problems: collect and view performance data with little effort, and analyze that data to indicate potential bottlenecks
- Such tools are a staple in HPC, limited in HPEC, and virtually non-existent in RC

Challenges:
- How do we expand the notion of software performance analysis into the software-hardware realm of RC?
- What are common bottlenecks for dual-paradigm applications?
- What techniques are necessary to detect performance bottlenecks?
- How do we analyze and present these bottlenecks to a user?

Page 4:

Historical Background
- gettimeofday and printf: very cumbersome, repetitive, and manual; not optimized for speed
- Profilers date back to the 70s with "prof" (gprof, 1982); they provide the user with information about application behavior: the percentage of time spent in a function and how often a function calls another function
- Simulators / emulators: too slow or too inaccurate, and require significant development time
- PAPI (Performance Application Programming Interface): a portable interface to hardware performance counters on modern CPUs; provides information about caches, CPU functional units, main memory, and more

Hardware performance counters per processor*:

Processor     | HW counters
UltraSparc II | 2
Pentium 3     | 2
AMD Athlon    | 4
IA-64         | 4
POWER4        | 8
Pentium 4     | 18

* Source: Wikipedia

Page 5:


Performance Analysis Today

[Diagram: performance analysis cycle. The original application is instrumented, producing an instrumented application; it executes in the execution environment and is measured, producing a measured data file; the data is analyzed (automatically) into potential bottlenecks and presented as visualizations; the user analyzes these (manually), modifies the application, and optimizes, yielding an optimized application]

What does performance analysis look like today?

Goals:
- Low impact on application behavior
- High-fidelity performance data
- Flexible, portable, automated
- Concise visualization

Techniques:
- Event-based vs. sample-based
- Profile vs. trace

Above all, we want to understand application behavior in order to locate performance problems!

Page 6:

Related Research and Tools: Parallel Performance Wizard (PPW)
- Open-source tool developed by the UPC Group at the University of Florida
- Performance analysis and optimization (PGAS* systems and MPI support)
- Performance data can be analyzed for bottlenecks
- Offers several ways of exploring performance data:
  - Graphs and charts to quickly view high-level performance information at a glance [right, top]
  - In-depth execution statistics for identifying communication and computational bottlenecks
  - Interacts with popular trace viewers (e.g., Jumpshot [right, bottom]) for detailed analysis of trace data
- Comprehensive support for correlating performance back to original source code

* Partitioned Global Address Space languages allow partitioned memory to be treated as global shared memory by software.

Page 7:

Motivation for RC Performance Analysis
- Dual-paradigm applications are gaining traction in HPC and HPEC
- Design flexibility allows the best use of FPGAs and traditional processors
- Drawback: it is more challenging to design applications for dual-paradigm systems; parallel application tuning and FPGA core debugging are hard enough!

[Chart: debug and performance-tuning difficulty rises from sequential to parallel to dual-paradigm applications]

- No existing holistic solutions for analyzing dual-paradigm applications: software-only views leave out low-level details, and hardware-only views provide incomplete performance information
- A complete system view is needed for effective tuning of the entire application

Page 8:

Motivation for RC Performance Analysis

Q: Is my runtime load-balancing strategy working?
A: ???

[Figure: ChipScope waveform]

Page 9:

Motivation for RC Performance Analysis
Q: How well is my core's pipelining strategy working?
A: ???

gprof output (×N, one for each node!)

Flat profile (each sample counts as 0.01 seconds):

  %     cumulative   self              self     total
 time    seconds    seconds   calls   ms/call  ms/call  name
 51.52     2.55       2.55       5     510.04   510.04  USURP_Reg_poll
 29.41     4.01       1.46      34      42.82    42.82  USURP_DMA_write
 11.97     4.60       0.59      14      42.31    42.31  USURP_DMA_read
  4.06     4.80       0.20       1     200.80   200.80  USURP_Finalize
  2.23     4.91       0.11       5      22.09    22.09  localp
  1.22     4.97       0.06       5      12.05    12.05  USURP_Load
  0.00     4.97       0.00      10       0.00     0.00  USURP_Reg_write
  0.00     4.97       0.00       5       0.00     0.00  USURP_Set_clk
  0.00     4.97       0.00       5       0.00   931.73  rcwork
  0.00     4.97       0.00       1       0.00     0.00  USURP_Init

Page 10:


What to Instrument in Hardware?
- Control: watch state machines, pipelines, etc.
- Replicated cores: understand distribution and parallelism inside the FPGA

[Diagram: system hierarchy from machine to system to node board to FPGA board to FPGA/device. Nodes contain CPUs, main memory, a network/device interface, and primary interconnect; FPGA boards hang off a secondary interconnect and contain FPGAs, on-board memory, embedded CPU(s), and a board interface; each FPGA holds replicated app cores. Legend distinguishes FPGA communication from traditional-processor communication]

On-board FPGA communication:
- On-chip (components, Block RAMs, embedded processors)
- On-board (on-board memory, other on-board FPGAs or processors)
- Off-board (CPUs, off-board FPGAs, main memory)

Page 11:

Instrumentation Modifications

[Diagram: instrumentation additions to an RC application. On the CPU(s), a hardware measurement thread/process is added beside the user application (HLL) and the FPGA access methods (wrapper). On the FPGA(s), a new top-level file wraps the original top-level file (modules and submodules of the user application in HDL) and adds a Hardware Measurement Module (HMM), a data transfer module, and a lock. Legend distinguishes the original RC application, additions by instrumentation, and modified component interfaces]

Process is automatable!

Additions are temporary!

Page 12:

Performance Analysis Framework
- Instrument VHDL source (vs. binary or intermediate levels): portable across devices, flexible (access to signals), low change in area/speed (optimized), and relatively easy; however, it must pass through place-and-route and is language-specific (VHDL vs. Verilog)
- Store data with CPU-initiated transfers (vs. CPU-assisted or FPGA-initiated): universally supported, but not portable across APIs, inefficient (lock contention, wasteful), and lower fidelity

[Diagram: the CPU issues a request to the FPGA, which returns data]

Page 13:

Hardware Measurement Extraction Module
- A separate thread (HMM_Main) periodically transfers data from the FPGA to memory
- An adaptive polling frequency can be employed to balance fidelity and overhead
- Measurement can be stopped and restarted (similar to a stopwatch)
- API: HMM_Init, HMM_Start, HMM_Stop, HMM_Finalize; HMM_Main runs as the measurement thread


Page 14:


Instrumentation Modifications (cont.)
- The new top-level file arbitrates between the application and the performance framework for off-chip communication
- Splice into the communication scheme: acquire address space in the memory map; acquire a network address or other unique identifier
- Connect hardware together; signal analysis

[Flow diagram: starting from user HLL/HDL source and a "what/how to instrument" specification, the instrumentation framework modifies the HLL main file and the HDL files, creates a new "top-level" file and modifies HMM_Main, then the design is synthesized/implemented and compiled, and finally executed. Legend distinguishes automated instrumentation from steps performed by the user]

Challenges in automation:
- Custom APIs for FPGAs
- Custom user schemes for communication
- Application knowledge not available

Page 15:


Hardware Measurement Module
Tracing, profiling, & sampling with signal analysis.

[Diagram: application signals and data enter a signal analysis module, where each signal is compared (comp) against a value to raise triggers; triggers drive profile counters (0 ... P-1), trace-data buffers, and a cycle counter. A module control unit services requests for performance data and sample control and maintains module statistics; trace data is buffered in Block RAM and can spill to on-board memory (DDR/QDR)]

Page 16:

Visualization
- Need unified visualizations that accentuate important statistics
- Must be scalable to many nodes

[Figure: example system-level visualization. CPUs 0-5 and FPGAs 0-2 are connected by interconnects and a network, with each link annotated with throughput and utilization (e.g., 904 MB/s / 88%, 2.50 GB/s / 100%, 0 MB/s / 0%); insets show throughput (MB/s) over time (sec) and a state breakdown (IDLE 75%, PHASE 1 9%, PHASE 2 16%); a potential bottleneck is flagged on the CPU interconnect]

Page 17:

Analysis
- Instrument and measure to locate common or expected bottlenecks
- Provide potential solutions or other aid to mitigate these bottlenecks: best practices, common pitfalls, etc.
- Hardware/platform-specific checks and solutions

Bottleneck pattern                                       | Possible solution
FPGA idle waiting for data                               | Employ double-buffering
Frequent, small communication packets between CPU & FPGA | Buffer data on CPU or FPGA side
Some cores busy while others idle                        | Improve distribution scheme / load balancing
Cray XD1 reads slow on CPU                               | Use FPGA to write data
Heavy CPU/FPGA communication                             | Modify partitioning of CPU and FPGA work/data
Excessive time spent in miscellaneous states             | Combine states

Page 18:

Performance Flow (User's Perspective)
- Instrument hardware through the VHDL Instrumenter GUI: a Java/Perl program to simplify modifications to VHDL for performance analysis
- Must resynthesize & implement hardware: the instrumented HDL file is added via the standard tool flow
- Instrument software through PPW compiler scripts: run software with ppwupcc instead of the standard compiler, using the -fpga-nallatech and -inst-functions command-line options

[Flow diagram: VHDL files and C/UPC files, plus a configuration selecting what and how to record, are instrumented to produce instrumented VHDL files and an instrumented executable; the instrumented VHDL yields an instrumented FPGA binary, and execution produces performance data files]

Page 19:


Case Study: N-Queens*
Overview:
- Find the number of distinct ways n queens can be placed on an n×n board without attacking each other
- Performance analysis overhead: sixteen 32-bit profile counters, one 96-bit trace buffer (completed cores)
- Main state machine optimized based on the data; improved speedup (from 34 to 37 vs. Xeon code)

N-Queens results for board size of 16:

                                   XD1                     Xeon-H101
                                   Original  Instr.        Original  Instr.
Slices (% rel. to device)          9,041     9,901 (+4%)   23,086    26,218 (+6%)
Block RAM (% rel. to device)       11        15 (+2%)      21        22 (0%)
Frequency (MHz) (% rel. to orig.)  124       123 (-1%)     101       101 (0%)
Communication (KB/s)               <1        33            <1        30

[Chart: application speedup over a single 3.2GHz Xeon. 8-node 3.2GHz Xeon: 7.9; 8-node H101: 33.9; optimized 8-node H101: 37.1]

* Standard backtracking algorithm employed


Page 20:

Case study: Collatz conjecture (3x+1)

[Diagram: computation partitioned into FPGA read, FPGA write, and FPGA data processing]

Application: search for sequences that do not reach 1 under the function
h(n) = n/2 if n is even, (3n+1)/2 if n is odd

Setup:
- 3.2GHz P4-Xeon CPU with a Virtex-4 LX100 FPGA over PCI-X
- Uses 88% of FPGA slices and 22% (53) of Block RAMs; runs at 100MHz
- 17 counters monitored 3 state machines; no frequency degradation observed

Results:
- Frequent, small FPGA communication: a 31% performance improvement was achieved by buffering data before sending it to the FPGA; this was unexpected, since the hardware had been tuned to work longer to eliminate communication problems
- Distribution of data inside the FPGA: the expected performance increase was not large enough to merit implementation

Conclusions:
- Buffering data achieved a 31% increase in speed

Page 21:

Conclusions
- RC performance analysis is critical to understanding RC application behavior
- Unified instrumentation, measurement, and visualization are needed to handle diverse and massively parallel RC systems
- Automated analysis can be useful for locating common RC bottlenecks (though difficult to do)
- Framework developed: the first RC performance concept and tool framework (per extensive literature review), with automated instrumentation and measurement via tracing, profiling, & sampling
- Application case studies: observed minimal overhead from the tool; speedup achieved due to performance analysis

Page 22:

References

R. DeVille, I. Troxel, and A. George. Performance monitoring for run-time management of reconfigurable devices. In Proc. of International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA), pages 175-181, June 2005.

Paul Graham, Brent Nelson, and Brad Hutchings. Instrumenting bitstreams for debugging FPGA circuits. In Proc. of the 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 41-50, Washington, DC, USA, Apr. 2001. IEEE Computer Society.

Sameer S. Shende and Allen D. Malony. The Tau parallel performance system. International Journal of High Performance Computing Applications (HPCA), 20(2):287-311, May 2006.

C. Eric Wu, Anthony Bolmarcich, Marc Snir, David Wootton, Farid Parpia, Anthony Chan, Ewing Lusk, and William Gropp. From trace generation to visualization: a performance framework for distributed parallel systems. In Proc. of the 2000 ACM/IEEE Conference on Supercomputing (SC), page 50, Washington, DC, USA, Nov. 2000. IEEE Computer Society.

Adam Leko and Max Billingsley, III. Parallel performance wizard user manual. http://ppw.hcs.ufl.edu/docs/pdf/manual.pdf, 2007.

S. Koehler, J. Curreri, and A. George, "Challenges for Performance Analysis in High-Performance Reconfigurable Computing," Proc. of Reconfigurable Systems Summer Institute 2007 (RSSI), Urbana, IL, July 17-20, 2007.