DeadSpy: A Tool to Pinpoint Program Inefficiencies

Milind Chabbi and John Mellor-Crummey, Department of Computer Science, Rice University. Funded by DARPA through the Air Force Research Lab (AFRL) and the TI Fellowship.


Page 1: DeadSpy A Tool to Pinpoint  Program Inefficiencies

DeadSpy: A Tool to Pinpoint Program Inefficiencies

Milind Chabbi and John Mellor-Crummey

Department of Computer Science

Rice University

Funded by DARPA through the Air Force Research Lab (AFRL) and the TI Fellowship

Page 2: DeadSpy A Tool to Pinpoint  Program Inefficiencies

Impact of Code Efficiency

Figure: PFLOTRAN code to model the distribution of uranium in soil (credit: ornl.gov)

Figure credit: Apple.com

Programs need to be efficient at all scales.

Page 3: DeadSpy A Tool to Pinpoint  Program Inefficiencies

Causes of Inefficiencies

Program design
• Algorithms
• Data structures

Programming practice
• Lack of design for performance

Compiler optimizations
• Sometimes optimizations can do more harm than good
• Code must be tailored to enable some optimizations

One tool to pinpoint inefficiencies: DeadSpy

Page 4: DeadSpy A Tool to Pinpoint  Program Inefficiencies

Classical Performance Analysis

Focus on code regions with “high resource utilization”

Look for “hot spots” in the program
• Time
• Cache misses
• Energy

Improve code in these “hot spots”

Hot spot analysis is indispensable, but …

Can’t identify if resources were “well spent”

Hot spots may be symptoms of problems

The onus is on the programmer to investigate causes

Page 5: DeadSpy A Tool to Pinpoint  Program Inefficiencies

DeadSpy Philosophy

Shift focus from “resource usage” to “resource wastage”

Identify causes of wasteful resource consumption
• E.g., wasted memory accesses

Dead write: two writes happen to the same memory location without an intervening read

Dead write:
int x = 0;   // dead write
x = 20;      // killing write

Not a dead write:
int x = 0;
Print(x);
x = 20;

Page 6: DeadSpy A Tool to Pinpoint  Program Inefficiencies

DeadSpy: Motivating Example

Chombo: framework for Adaptive Mesh Refinement (AMR) for solving partial differential equations

3-level nested loop; spends 20% of time:

C KERNEL:RIEMANNF 8 0
      do k = iboxlo2,iboxhi2
        do j = iboxlo1,iboxhi1
          do i = iboxlo0,iboxhi0
            …
            Wgdnv(i,j,k,0) = ro + frac*(rstar - ro)
            Wgdnv(i,j,k,inorm) = uno + frac*(ustar - uno)
            Wgdnv(i,j,k,4) = po + frac*(pstar - po)
            if (spout.le.(0.0d0)) then
              Wgdnv(i,j,k,0) = ro
              Wgdnv(i,j,k,inorm) = uno
              Wgdnv(i,j,k,4) = po
            endif
            if (spin.gt.(0.0d0)) then
              Wgdnv(i,j,k,0) = rstar
              Wgdnv(i,j,k,inorm) = ustar
              Wgdnv(i,j,k,4) = pstar
            endif

Compilers can’t eliminate all dead writes because of:
• Aliasing / ambiguity
• Aggregate variables
• Function boundaries
• Late binding
• Partial deadness

Page 7: DeadSpy A Tool to Pinpoint  Program Inefficiencies

DeadSpy: Motivating Example

3-level nested loop; spends 20% of time. Code lacks design for performance:

C KERNEL:RIEMANNF 8 0
      do k = iboxlo2,iboxhi2
        do j = iboxlo1,iboxhi1
          do i = iboxlo0,iboxhi0
            …
            Wgdnv(i,j,k,0) = ro + frac*(rstar - ro)
            Wgdnv(i,j,k,inorm) = uno + frac*(ustar - uno)
            Wgdnv(i,j,k,4) = po + frac*(pstar - po)
            if (spout.le.(0.0d0)) then
              Wgdnv(i,j,k,0) = ro
              Wgdnv(i,j,k,inorm) = uno
              Wgdnv(i,j,k,4) = po
            endif
            if (spin.gt.(0.0d0)) then
              Wgdnv(i,j,k,0) = rstar
              Wgdnv(i,j,k,inorm) = ustar
              Wgdnv(i,j,k,4) = pstar
            endif

Correct code:
• Else-if nesting

            if (spin.gt.(0.0d0)) then
              Wgdnv(i,j,k,0) = rstar
              Wgdnv(i,j,k,inorm) = ustar
              Wgdnv(i,j,k,4) = pstar
            else if (spout.le.(0.0d0)) then
              Wgdnv(i,j,k,0) = ro
              Wgdnv(i,j,k,inorm) = uno
              Wgdnv(i,j,k,4) = po
            else
              Wgdnv(i,j,k,0) = ro + frac*(rstar - ro)
              Wgdnv(i,j,k,inorm) = uno + frac*(ustar - uno)
              Wgdnv(i,j,k,4) = po + frac*(pstar - po)
            endif
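The effect of else-if nesting on the number of stores can be sketched with a small C analogue of the Fortran kernel, reduced to a single output element. The function names `select_naive` and `select_nested` and the `store_count` counter are hypothetical and exist only to make the stores countable; the condition order in the nested version is chosen so that the final value matches the original code.

```c
#include <stddef.h>

/* Hypothetical counter so the number of stores can be compared. */
static size_t store_count;

static void store(double *dst, double v) { *dst = v; ++store_count; }

/* Original shape: unconditional store, then up to two overwrites. */
void select_naive(double *w, double ro, double rstar, double frac,
                  double spout, double spin) {
    store(w, ro + frac * (rstar - ro));
    if (spout <= 0.0) store(w, ro);
    if (spin > 0.0)   store(w, rstar);
}

/* Else-if nesting: exactly one store per call, same final value. */
void select_nested(double *w, double ro, double rstar, double frac,
                   double spout, double spin) {
    if (spin > 0.0)        store(w, rstar);
    else if (spout <= 0.0) store(w, ro);
    else                   store(w, ro + frac * (rstar - ro));
}
```

With `spout <= 0` and `spin > 0`, the naive version stores three times to the same location and the first two stores are dead; the nested version stores once.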

Page 8: DeadSpy A Tool to Pinpoint  Program Inefficiencies

DeadSpy: A Tool to Pinpoint Dead Writes

Monitor every load and store operation in a program

Maintain state information for each memory byte

Detect every dead write in an execution with an automaton

Figure: per-byte state automaton with states V, R, and W; edges are labeled Read and Write, and a Write in state W is labeled “Report dead”.
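A minimal sketch of such a per-byte automaton in C follows. It assumes that V denotes a byte that has not been accessed yet, R a byte whose last access was a read, and W a byte whose last access was a write; the type and function names are hypothetical and not taken from DeadSpy.

```c
#include <stddef.h>

typedef enum { STATE_V, STATE_R, STATE_W } byte_state;   /* untouched, read, written */
typedef enum { ACCESS_READ, ACCESS_WRITE } access_kind;

/* Advance one byte's state; returns 1 when this access kills a dead write. */
int step(byte_state *s, access_kind a) {
    if (a == ACCESS_READ) { *s = STATE_R; return 0; }
    int dead = (*s == STATE_W);   /* write after write, no read in between */
    *s = STATE_W;
    return dead;
}

/* Count dead writes in a trace of accesses to a single byte. */
size_t count_dead(const access_kind *trace, size_t n) {
    byte_state s = STATE_V;
    size_t dead = 0;
    for (size_t i = 0; i < n; ++i) dead += (size_t)step(&s, trace[i]);
    return dead;
}
```

For the trace write, write the sketch reports one dead write; for write, read, write it reports none, matching the two `int x` examples earlier in the deck.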

Page 9: DeadSpy A Tool to Pinpoint  Program Inefficiencies

DeadSpy: Implementation

Intel Pin instruments code to monitor each load and store
• Pin: dynamic binary rewriting tool

Shadow memory maintains state
• 1-1 mapping from a program address to the corresponding state information

Instrumentation code identifies dead writes
• Check and update the state before performing any memory operation
• Record necessary information to report W → W state transitions
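One common way to realize a 1-1 mapping from program addresses to shadow state is a page-granular table that allocates shadow pages on first touch. The sketch below is an assumption about how such a mapping could look, not DeadSpy's actual layout; the page size, directory size, and all names are hypothetical, and only one state byte is kept per program byte.

```c
#include <stdint.h>
#include <stdlib.h>

#define SHADOW_PAGE_BITS 16
#define SHADOW_PAGE_SIZE (1u << SHADOW_PAGE_BITS)
#define SHADOW_DIR_SIZE  4096   /* hypothetical: small chained directory */

typedef struct shadow_page {
    uintptr_t page_no;
    struct shadow_page *next;
    uint8_t state[SHADOW_PAGE_SIZE];   /* one state byte per program byte */
} shadow_page;

static shadow_page *shadow_dir[SHADOW_DIR_SIZE];

/* Return the shadow byte for addr, allocating its page on first touch. */
uint8_t *shadow_of(uintptr_t addr) {
    uintptr_t page_no = addr >> SHADOW_PAGE_BITS;
    size_t slot = (size_t)(page_no % SHADOW_DIR_SIZE);
    for (shadow_page *p = shadow_dir[slot]; p; p = p->next)
        if (p->page_no == page_no)
            return &p->state[addr & (SHADOW_PAGE_SIZE - 1)];
    shadow_page *p = calloc(1, sizeof *p);   /* zeroed: every byte starts untouched */
    if (!p) abort();
    p->page_no = page_no;
    p->next = shadow_dir[slot];
    shadow_dir[slot] = p;
    return &p->state[addr & (SHADOW_PAGE_SIZE - 1)];
}
```

Instrumentation would call `shadow_of` on the effective address of each load and store, then check and update the returned state before the memory operation executes.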

Page 10: DeadSpy A Tool to Pinpoint  Program Inefficiencies

DeadSpy: Measurement and Attribution

Instrumentation ensures precise measurement
• No false positives and no false negatives

DeadSpy provides useful (source-level) feedback with calling context for dead and killing writes
• Build Calling Context Tree (CCT) on-the-fly by instrumenting function entry and exit

int x = 0;   // dead write
x = 20;      // killing write
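Building a calling context tree from function entry and exit events can be sketched as follows. This is an illustrative structure under simple assumptions (single thread, callee identified by an opaque pointer); the names `on_entry`, `on_exit_fn`, and `current_context` are hypothetical and not DeadSpy's API.

```c
#include <stdlib.h>

typedef struct cct_node {
    const void *callee;   /* hypothetical id, e.g. the function's address */
    struct cct_node *parent, *first_child, *next_sibling;
} cct_node;

static cct_node root;                /* stands for the program entry */
static cct_node *current = &root;

/* On function entry: descend to the matching child, creating it if new. */
void on_entry(const void *callee) {
    cct_node *c = current->first_child;
    while (c && c->callee != callee) c = c->next_sibling;
    if (!c) {
        c = calloc(1, sizeof *c);
        if (!c) abort();
        c->callee = callee;
        c->parent = current;
        c->next_sibling = current->first_child;
        current->first_child = c;
    }
    current = c;
}

/* On function exit: return to the caller's node. */
void on_exit_fn(void) { if (current->parent) current = current->parent; }

/* The context a write would be tagged with at this point. */
cct_node *current_context(void) { return current; }
```

Because the same call path always reaches the same node, a write can be tagged with a single node pointer, and walking `parent` links recovers the full call chain for reporting.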

Page 11: DeadSpy A Tool to Pinpoint  Program Inefficiencies

DeadSpy in Action

Figure: calling context tree (CCT) containing Main().

Page 12: DeadSpy A Tool to Pinpoint  Program Inefficiencies

DeadSpy in Action

Figure: calling context tree (CCT) containing Main() and A().

Page 13: DeadSpy A Tool to Pinpoint  Program Inefficiencies

DeadSpy in Action

Figure: calling context tree (CCT) containing Main(), A(), B(), and C(); memory M and shadow memory M’, with writes (W) marked.

Page 14: DeadSpy A Tool to Pinpoint  Program Inefficiencies

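DeadSpy in Action

Figure: calling context tree (CCT) containing Main(), A(), B(), and C(); memory M and shadow memory M’, with writes (W) marked; dead table with columns Dead, Killing, and F, with an entry incremented (+1).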

Page 15: DeadSpy A Tool to Pinpoint  Program Inefficiencies

DeadSpy in Action

Figure: calling context tree (CCT) containing Main(), A(), B(), and C(); memory M and shadow memory M’, with writes (W) marked; dead table with columns Dead (dead context), Killing (killing context), and F, showing the values 2e10 and 25080.

• Implementation challenges due to limitations of binary analysis

• Cost of instrumentation

Page 16: DeadSpy A Tool to Pinpoint  Program Inefficiencies

Formalizing Dead Writes in an Execution

Dead table: columns Dead (context i), Killing (context j), and F; values shown: 2e10 and 25080.

$\sum_i \sum_j F(C_{ij})$

Higher → poor performance
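The sum over the dead table can be sketched in C as below. The struct layout and names are hypothetical, and expressing the result as a percentage of all writes is an assumption made here to connect the sum to the "% Deadness" charts; the slide itself shows only the double sum.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int dead_ctx;       /* i: context of the overwritten (dead) write */
    int killing_ctx;    /* j: context of the overwriting (killing) write */
    uint64_t freq;      /* F(C_ij): frequency recorded for this pair */
} dead_entry;

/* Sum F(C_ij) over every (dead, killing) context pair in the table. */
uint64_t total_dead(const dead_entry *table, size_t n) {
    uint64_t sum = 0;
    for (size_t k = 0; k < n; ++k) sum += table[k].freq;
    return sum;
}

/* Assumption: the percentage is taken relative to all writes observed. */
double deadness_percent(const dead_entry *table, size_t n, uint64_t total_writes) {
    if (total_writes == 0) return 0.0;
    return 100.0 * (double)total_dead(table, n) / (double)total_writes;
}
```

Sorting the table by `freq` would surface the context pairs that contribute most to the total.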

Page 17: DeadSpy A Tool to Pinpoint  Program Inefficiencies

Deadness in SPEC CPU2006 Integer Reference Benchmarks

Figure: bar chart of % deadness (y-axis, 0.0 to 80.0) per benchmark (x-axis): astar, bzip2, gcc, gobmk, h264ref, hmmer, perlbench, libquantum, mcf, omnetpp, sjeng, xalan, Average. Bar labels in extraction order: 3.3, 9.8, 60.5, 19.2, 27.6, 14.0, 19.1, 6.0, 39.3, 19.7, 16.3, 6.6, 20.1, 76.2.

Compiled with gcc 4.1.2 with -O2

The number of dead writes is surprisingly high!

Across compilers and optimization levels

Page 18: DeadSpy A Tool to Pinpoint  Program Inefficiencies

Case Study: GCC
Use of Inappropriate Data Structure

static void loop_regs_scan (const struct loop *loop, int extra_size)
{
  /* Dead: calloc of a 16937-element (132KB) array */
  last_set = (rtx *) xcalloc (regs->num, sizeof (rtx));
  /* Scan the loop, recording register usage. */
  for (Instruction insn in loop) {
    if (INSN_P (insn)) {
      if (PATTERN (insn) sets a register)
        count_one_set (regs, insn, PATTERN (insn), last_set);
    }
    if (Label (insn)) // new BBL
      /* Killing, and dead in turn: reinitialize 16937 elements (132KB) each time */
      memset (last_set, 0, regs->num * sizeof (rtx));
  }
}

• Basic blocks are short
• Median use: 2 unique elements
• Dense array is a poor data structure choice

Performance implications on a JIT compiler

> 28% running time improvement
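One way to avoid reinitializing the whole dense array at every basic-block boundary is to remember which slots were written and clear only those. The sketch below illustrates that idea and is not necessarily the change the authors made to gcc; the `last_set_table` type and its functions are hypothetical, and only non-NULL values are recorded.

```c
#include <stdlib.h>

typedef struct {
    void **slot;        /* dense array, indexed as in the original */
    size_t *touched;    /* indices written since the last reset */
    size_t n_touched, capacity;
} last_set_table;

int table_init(last_set_table *t, size_t capacity) {
    t->slot = calloc(capacity, sizeof *t->slot);
    t->touched = malloc(capacity * sizeof *t->touched);
    t->n_touched = 0;
    t->capacity = capacity;
    return (t->slot && t->touched) ? 0 : -1;
}

/* Record a value; remember the index the first time the slot is filled. */
void table_set(last_set_table *t, size_t i, void *v) {
    if (v == NULL) return;               /* sketch: only non-NULL values are recorded */
    if (t->slot[i] == NULL) t->touched[t->n_touched++] = i;
    t->slot[i] = v;
}

/* Reset only the touched slots instead of a memset over the whole array. */
void table_reset(last_set_table *t) {
    for (size_t k = 0; k < t->n_touched; ++k) t->slot[t->touched[k]] = NULL;
    t->n_touched = 0;
}
```

With a median of 2 unique elements used per basic block, a reset would then touch a handful of slots rather than all 16937.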

Page 19: DeadSpy A Tool to Pinpoint  Program Inefficiencies

Case Study: Bzip2
Overly Aggressive Compiler Optimization

Original code

Bool mainGtU ( UInt32 i1, UInt32 i2, UChar* block, …) {
  Int32 k; UChar c1, c2; UInt16 s1, s2;
  /* 1 */
  c1 = block[i1]; c2 = block[i2];
  if (c1 != c2) return (c1 > c2);
  i1++; i2++;
  /* 2 */
  c1 = block[i1]; c2 = block[i2];
  if (c1 != c2) return (c1 > c2);
  i1++; i2++;
  /* 3 */
  c1 = block[i1]; c2 = block[i2];
  …
  /* 12 */
  c1 = block[i1]; c2 = block[i2];
  if (c1 != c2) return (c1 > c2);
  i1++; i2++;

  REST OF THE FUNCTION

Optimized code (gcc 4.1.2)

Bool mainGtU ( UInt32 i1, UInt32 i2, UChar* block …) {
  Int32 k; UChar c1, c2; UInt16 s1, s2;
  tmp1 = i1 + 1;
  tmp2 = i1 + 2;
  …
  tmp12 = i1 + 12;
  /* 1 */
  c1 = block[i1]; c2 = block[i2];
  if (c1 != c2) return (c1 > c2);
  i1++; i2++;
  /* 2 */
  c1 = block[tmp1]; c2 = block[i2];
  if (c1 != c2) return (c1 > c2);
  i1++; i2++;
  /* 3 */
  c1 = block[tmp2]; c2 = block[i2];
  …
  /* 12 */
  c1 = block[tmp11]; c2 = block[i2];
  if (c1 != c2) return (c1 > c2);
  i1++; i2++;

Figure annotations: Dead, Killing, Early return.

>14% running time improvement

Page 20: DeadSpy A Tool to Pinpoint  Program Inefficiencies

Case Study: HMMER
Lack of Design for Performance

HMMER: code for sequence alignment

Unoptimized (2-level nested loop):

for (i = 1; i <= L; i++) {
  for (k = 1; k <= M; k++) {
    ...
    ic[k] = mpp[k] + tpmi[k];
    if ((sc = ip[k] + tpii[k]) > ic[k])
      ic[k] = sc;

Optimized:

for (i = 1; i <= L; i++) {
  for (k = 1; k <= M; k++) {
    ...
    R1 = mpp[k] + tpmi[k];
    ic[k] = R1;                          // dead
    if ((sc = ip[k] + tpii[k]) > R1)
      ic[k] = sc;                        // killing

Restructured store: else ic[k] = R1;

The arrays never alias. Declare them as “restrict” pointers. The loop can then be vectorized.

Deadness in unoptimized: 0.3%
Deadness in optimized: 30%

54 times more dead writes (absolute count) in O2

>16% running time improvement. >40% with vectorization.
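The combination of a single store per element and `restrict`-qualified pointers can be sketched in C99 as follows. This is a reduced illustration of the inner loop only; the function name `update_ic` and the parameter list are hypothetical, and the array names follow the slide.

```c
#include <stddef.h>

/* One inner-loop pass; restrict promises the arrays do not overlap. */
void update_ic(int *restrict ic, const int *restrict mpp,
               const int *restrict tpmi, const int *restrict ip,
               const int *restrict tpii, size_t M) {
    for (size_t k = 1; k <= M; ++k) {
        int r1 = mpp[k] + tpmi[k];
        int sc = ip[k] + tpii[k];
        ic[k] = (sc > r1) ? sc : r1;   /* single store per element */
    }
}
```

Each `ic[k]` is written exactly once per iteration, so no store can be killed by a later store in the same iteration, and the no-alias promise leaves the compiler free to vectorize the loop.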

Page 21: DeadSpy A Tool to Pinpoint  Program Inefficiencies

Speedup Based on DeadSpy’s Advice

Figure: bar chart of % speedup (y-axis, 0 to 18) per benchmark (x-axis): gcc, hmmer, bzip2, chombo. Bar labels in extraction order: 14.3, 15.7, 7.2, 6.6. Annotation: > 40% on Intel compiler.

Causes, in benchmark order:
• gcc: poor data structure choice
• hmmer: lack of design for performance
• bzip2: overly aggressive compiler optimization
• chombo: lack of design for performance

AMD Opteron [email protected], 128GB of 1333MHz DDR3, CentOS 5.

Page 22: DeadSpy A Tool to Pinpoint  Program Inefficiencies

Conclusions

Dead writes are a common symptom of inefficiency
• DeadSpy pinpoints inefficiencies
• Improvements in tuned codes

Lessons for compiler writers
• Pay attention to code generation algorithms
  ‐ Understand when optimizations do more harm than good (bzip2)
  ‐ Eliminate dead writes when possible
• Choose compiler data structures prudently (gcc)

Focus on wasteful resource consumption
• New avenue for performance tuning

DeadSpy: http://www.cs.rice.edu/~mc29/deadspy.tgz (Linux x86_64 only)

HPCToolkit: http://hpctoolkit.org

Page 23: DeadSpy A Tool to Pinpoint  Program Inefficiencies

Backup: DeadSpy Overhead

Figure: bar chart of slowdown in times (y-axis, 0 to 80) per benchmark (x-axis): astar, bzip2, gcc, h264ref, hmmer, libquantum, mcf, omnetpp, perlbench, sjeng, xalan, Average; series: CCT only, CCT + IP. Bar labels in extraction order: 12, 17, 39, 44, 22, 7, 7, 17, 36, 20, 24, 22, 23, 33, 59, 72, 36, 26, 9, 31, 66, 45, 56, 41.

Page 24: DeadSpy A Tool to Pinpoint  Program Inefficiencies

Deadness in Multi-Threaded Codes

Figure: deadness in % (y-axis labeled “log base 2 scale”, ticks from 1.00E-03 to 1.00E+02) for the OpenMP NAS Parallel Benchmarks (x-axis): bt.S, cg.S, dc.S, ep.S, ft.S, is.S, lu.S, mg.S, sp.S, ua.S, Average; series: Inter-Thread, Intra-Thread, Total.

Page 25: DeadSpy A Tool to Pinpoint  Program Inefficiencies

Deadness Across Compilers and Optimization Levels