Transcript of The Truth, the Whole Truth, and Nothing but the Truth: A Pragmatic Guide to Assessing Empirical Evaluations

The Truth, the Whole Truth, and Nothing but the Truth

A Pragmatic Guide to Assessing Empirical Evaluations

Stephen M. Blackburn, Amer Diwan, Matthias Hauswirth, Peter F. Sweeney

José Nelson Amaral, Tim Brecht, Lubomír Bulej, Cliff Click, Lieven Eeckhout, Sebastian Fischmeister, Daniel Frampton, Laurie J. Hendren, Michael Hind,

Antony L. Hosking, Richard E. Jones, Tomas Kalibera, Nathan Keynes, Nathaniel Nystrom, and Andreas Zeller

Programming Languages Mentoring Workshop, Santa Barbara, June 2016

Rob Walsh

Anneke

Kathryn McKinley

Ron Morrison

Eliot Moss

Robin Stanton

Dave Thomas

Generosity

Perseverance

Mr Beach

Graham Hellistrand


[Figure 5: OOR vs. Class-Oblivious Traversals [jython, db & javac]. Twelve panels plot Depth First, Breadth First, Partial DF 2 Children, and OOR against heap size relative to minimum heap size (1x to 3x, with absolute heap sizes on a second axis): (a) jython Total Time, (b) db Total Time, (c) javac Total Time, (d) jython Mutator Time, (e) db Mutator Time, (f) javac Mutator Time, (g) jython L2 Mutator Misses, (h) db L2 Mutator Misses, (i) javac L2 Mutator Misses, (j) jython GC Time, (k) db GC Time, (l) javac GC Time. Each panel shows normalized values on the left axis and absolute values (seconds or 10^6 misses) on the right axis.]

OOR’s dynamism allows it to adapt to changes within the execution of a given application. Section 4.2 describes how the decay model ensures that field heat metrics adapt to changes in application behavior. We now examine the sensitivity of this approach.

We use a synthetic benchmark, phase, which exhibits two distinct phases. The phase benchmark repeatedly constructs and then traverses large trees of arity 11. The traversals favor a particular child. Each phase creates and destroys many trees and performs a large number of traversals. The first phase traverses only the 4th child, and the second phase traverses the 7th child.

Figure 6 compares the default depth first traversal in Jikes RVM against OOR and OOR without phase change detection on the phase benchmark. Phase change detection improves OOR total time by 25% and improves over the default depth first traversal by 55%. Mutator performance is improved by 37% and 70% respectively (Figure 6(b)). Much of this difference is explained by reductions in L2 misses of 50% and 61% (Figure 6(c)). Figure 7 compares OOR with and without phase change detection on jess, jython, javac, and db. These and the other benchmarks are insensitive to OOR’s phase change adaptivity, which indicates that they have few, if any, traversal order phases.

6.4 Hot Space

In order to improve locality further, OOR groups objects with hot fields together in a separate copy space within the mature space, as described in Section 4.3. Figure 8 shows results from four representative benchmarks for OOR with and without a hot space. On average,
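To make the phase benchmark described above concrete, here is a minimal sketch, in Python rather than the original Java, of such a workload: wide trees of arity 11 are built and traversed, with each phase favoring a different child so that the hot pointer field changes between phases. The class names, tree depth, and iteration counts are illustrative assumptions, not the paper's actual benchmark.

ARITY = 11

class Node:
    def __init__(self, depth):
        # build a complete tree of the given depth with ARITY children per node
        self.children = [Node(depth - 1) for _ in range(ARITY)] if depth > 0 else []

def traverse(root, hot_child):
    # follow only the favored child, so a single pointer field is "hot" in this phase
    node = root
    while node.children:
        node = node.children[hot_child]

def run_phase(hot_child, trees=50, traversals=200, depth=4):
    for _ in range(trees):
        root = Node(depth)            # each iteration creates (and then drops) a large tree
        for _ in range(traversals):
            traverse(root, hot_child)

run_phase(hot_child=3)   # first phase: favor the 4th child (index 3)
run_phase(hot_child=6)   # second phase: favor the 7th child (index 6)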

hypotheses, evaluations, claims

a block-based heap structure could allow us to get the best of compacting, free list, and copying collectors

hypothesis

Immix: A Mark-Region Garbage Collector with Space Efficiency, Fast Collection, and Mutator Performance

Stephen M. Blackburn, Australian National University
[email protected]

Kathryn S. McKinley, The University of Texas at Austin
[email protected]

Abstract
Programmers are increasingly choosing managed languages for modern applications, which tend to allocate many short-to-medium lived small objects. The garbage collector therefore directly determines program performance by making a classic space-time trade-off that seeks to provide space efficiency, fast reclamation, and mutator performance. The three canonical tracing garbage collectors: semi-space, mark-sweep, and mark-compact each sacrifice one objective. This paper describes a collector family, called mark-region, and introduces opportunistic defragmentation, which mixes copying and marking in a single pass. Combining both, we implement immix, a novel high performance garbage collector that achieves all three performance objectives. The key insight is to allocate and reclaim memory in contiguous regions, at a coarse block grain when possible and otherwise in groups of finer grain lines. We show that immix outperforms existing canonical algorithms, improving total application performance by 7 to 25% on average across 20 benchmarks. As the mature space in a generational collector, immix matches or beats a highly tuned generational collector, e.g. it improves jbb2000 by 5%. These innovations and the identification of a new family of collectors open new opportunities for garbage collector design.

Categories and Subject Descriptors: D.3.4 [Programming Languages]: Processors—Memory management (garbage collection)
General Terms: Algorithms, Experimentation, Languages, Performance, Measurement
Keywords: Fragmentation, Free-List, Compact, Mark-Sweep, Semi-Space, Mark-Region, Immix, Sweep-To-Region, Sweep-To-Free-List

1. Introduction
Modern applications are increasingly written in managed languages and make conflicting demands on their underlying memory managers. For example, real-time applications demand pause-time guarantees, embedded systems demand space efficiency, and servers demand high throughput. In seeking to satisfy these demands, the literature includes reference counting collectors and three canonical tracing collectors: semi-space, mark-sweep, and mark-compact. These collectors are typically used as building blocks for more sophisticated algorithms. Since reference counting is incomplete, we omit it from further consideration here. Unfortunately, the tracing collectors each achieve only two of: space


efficiency, fast collection, and mutator performance through contiguous allocation of contemporaneous objects.

Figure 1 starkly illustrates this dichotomy for full heap versions of mark-sweep (MS), semi-space (SS), and mark-compact (MC) implemented in MMTk [12], running on a Core 2 Duo. It plots the geometric mean of total time, collection time, mutator time, and mutator cache misses as a function of heap size, normalized to the best, for 20 DaCapo, SPECjvm98, and SPECjbb2000 benchmarks, and shows 99% confidence intervals. The crossing lines in Figure 1(a) illustrate the classic space-time trade-off at the heart of garbage collection. Mark-compact is uncompetitive in this setting due to its overwhelming collection costs. In smaller heap sizes, the space and collector efficiency of mark-sweep perform best since the overheads of garbage collection dominate total performance. Figures 1(c) and 1(d) show that the primary advantage for semi-space is 10% better mutator time compared with mark-sweep, due to better cache locality. Once the heap size is large enough, garbage collection time reduces, and the locality of the mutator dominates total performance so semi-space performs best.

To explain this tradeoff, we need to introduce and slightly expand memory management terminology. A tracing garbage collector performs allocation of new objects, identification of live objects, and reclamation of free memory. The canonical collectors all identify live objects the same way, by marking objects during a transitive closure over the object graph.
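The identification step described in the previous paragraph is just a transitive closure over the object graph. As a rough illustration (a Python sketch with a hypothetical Node type that lists its outgoing references; this is not code from the paper or from MMTk), it can be written as a worklist loop:

def mark(roots):
    # transitive closure over the object graph: everything reachable from the roots is live
    marked = set()
    worklist = list(roots)
    while worklist:
        obj = worklist.pop()
        if id(obj) in marked:
            continue
        marked.add(id(obj))
        worklist.extend(obj.references)   # hypothetical: each object exposes its outgoing references
    return marked

The three canonical collectors share this step and differ only in what happens afterwards: sweeping to a free list, evacuating survivors, or compacting them.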

Reclamation strategy dictates allocation strategy, and the literature identifies just three strategies: (1) sweep-to-free-list, (2) evacuation, and (3) compaction. For example, mark-sweep collectors allocate from a free list, mark live objects, and then sweep-to-free-list

[Figure 1. Performance Tradeoffs For Canonical Collectors: Geometric Mean for 20 DaCapo and SPEC Benchmarks. Four panels plot MS, MC, and SS against heap size relative to minimum heap size (1x to 6x): (a) Total Time (normalized), (b) Garbage Collection Time (normalized, log scale), (c) Mutator Time (normalized), (d) Mutator L2 Misses (normalized, log scale).]

claim evaluation



evaluation claim?

You are doing it completely wrong!

sins of exposition

sins of reasoning

inscrutability irreproducibility

ignorance inappropriateness inconsistency

sins of reasoning

evaluation claim

ignorance

• claim ignores elements of the evaluation
• may be benign
• not benign when the ignored elements counter the claim
• common idioms
  • ignoring data points
  • ignoring data distribution

evaluation claim

ignorance: ignoring data points [Georges et al 2007]

with the non-determinism in the experimental setup. In a Java system, or managed runtime system in general, there are a number of sources of non-determinism that affect overall performance. One potential source of non-determinism is Just-In-Time (JIT) compilation. A virtual machine (VM) that uses timer-based sampling to drive the VM compilation and optimization subsystem may lead to non-determinism and execution time variability: different executions of the same program may result in different samples being taken and, by consequence, different methods being compiled and optimized to different levels of optimization. Another source of non-determinism comes from thread scheduling in time-shared and multiprocessor systems. Running multithreaded workloads, as is the case for most Java programs, requires thread scheduling in the operating system and/or virtual machine. Different executions of the same program may introduce different thread schedules, and may result in different interactions between threads, affecting overall performance. The non-determinism introduced by JIT compilation and thread scheduling may affect the points in time where garbage collections occur. Garbage collection in its turn may affect program locality, and thus memory system performance as well as overall system performance. Yet another source of non-determinism is various system effects, such as system interrupts — this is not specific to managed runtime systems though as it is a general concern when running experiments on real hardware.

From an extensive literature survey, we found that there are a plethora of prevalent approaches, both in experimental design and data analysis for benchmarking Java performance. Prevalent data analysis approaches for dealing with non-determinism are not statistically rigorous though. Some report the average performance number across multiple runs of the same experiments; others report the best performance number, others report the second best performance number and yet others report the worst. In this paper, we argue that not appropriately specifying the experimental design and not using a statistically rigorous data analysis can be misleading and can even lead to incorrect conclusions. This paper advocates using statistics theory as a rigorous data analysis approach for dealing with the non-determinism in managed runtime systems.

The pitfall in using a prevalent method is illustrated in Figure 1 which compares the execution time for running Jikes RVM with five garbage collectors (CopyMS, GenCopy, GenMS, MarkSweep and SemiSpace) for the SPECjvm98 db benchmark with a 120MB heap size — the experimental setup will be detailed later. This graph compares the prevalent ‘best’ method which reports the best performance number (or smallest execution time) among 30 measurements against a statistically rigorous method which reports 95% confidence intervals; the ‘best’ method does not control non-determinism, and corresponds to the SPEC reporting rules [23]. Based on the best method, one would

[Figure 1. An example illustrating the pitfall of prevalent Java performance data analysis methods: the ‘best’ method (best of 30) is shown on the left and the statistically rigorous method (mean with 95% confidence interval) is shown on the right. Both panels plot execution time (s), from 9.0 to 12.5, for CopyMS, GenCopy, GenMS, MarkSweep and SemiSpace. This is for db and a 120MB heap size.]

conclude that the performance for the CopyMS and GenCopy collectors is about the same. The statistically rigorous method though shows that GenCopy significantly outperforms CopyMS. Similarly, based on the best method, one would conclude that SemiSpace clearly outperforms GenCopy. The reality though is that the confidence intervals for both garbage collectors overlap and, as a result, the performance difference seen between both garbage collectors is likely due to the random performance variations in the system under measurement. In fact, we observe a large performance variation for SemiSpace, and at least one really good run along with a large number of less impressive runs. The ‘best’ method reports the really good run whereas a statistically rigorous approach reliably reports that the average scores for GenCopy and SemiSpace are very close to each other.

This paper makes the following contributions:

• We demonstrate that there is a major pitfall associated with today’s prevalent Java performance evaluation methodologies, especially in terms of data analysis. The pitfall is that they may yield misleading and even incorrect conclusions. The reason is that the data analysis employed by these methodologies is not statistically rigorous.

• We advocate adding statistical rigor to performance evaluation studies of managed runtime systems, and in particular Java systems. The motivation for statistically rigorous data analysis is that statistics, and in particular confidence intervals, enable one to determine whether differences observed in measurements are due to random fluctuations in the measurements or due to actual differences in the alternatives compared against each other. We discuss how to compute confidence intervals and discuss techniques to compare multiple alternatives.

• We survey existing performance evaluation methodologies for start-up and steady-state performance, and advocate the following methods. For start-up performance, we advise to: (i) take multiple measurements where each
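To make the statistically rigorous analysis advocated in this excerpt concrete, here is a minimal sketch in Python (not from the paper) that contrasts the ‘best of N’ number with a mean and an approximate 95% confidence interval, and checks whether two alternatives’ intervals overlap. The measurement values and the normal-approximation z of 1.96 are illustrative assumptions; for small sample sizes a Student t value should be used instead.

import math
import statistics

def mean_ci(samples, z=1.96):
    # mean and approximate 95% confidence interval half-width (normal approximation)
    m = statistics.mean(samples)
    half = z * statistics.stdev(samples) / math.sqrt(len(samples))
    return m, half

def overlap(a, b):
    # True if two (mean, half-width) intervals overlap
    return abs(a[0] - b[0]) <= a[1] + b[1]

# hypothetical execution times (seconds) for 30 runs of two collectors
gencopy = [9.6 + 0.05 * (i % 5) for i in range(30)]
semispace = [9.4 + 0.03 * i for i in range(30)]

for name, runs in (("GenCopy", gencopy), ("SemiSpace", semispace)):
    m, h = mean_ci(runs)
    print(f"{name}: best of 30 = {min(runs):.2f}s, mean = {m:.2f}s +/- {h:.2f}s")

print("overlapping confidence intervals?", overlap(mean_ci(gencopy), mean_ci(semispace)))

With these made-up numbers, the ‘best of 30’ favors SemiSpace while the non-overlapping confidence intervals show GenCopy’s mean is actually lower, which is exactly the kind of reversal the excerpt warns about.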


ignorance: ignoring bimodality of a distribution
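As an illustration of this idiom (a Python sketch with made-up numbers, not data from any of the papers shown), reporting the mean alone can produce a value that no run ever exhibits when the measurements are bimodal; splitting the runs around a crude threshold exposes the two modes.

import statistics

# hypothetical measurements: most runs take ~10s, but some runs hit a ~14s mode
runs = [10.1, 9.9, 10.0, 10.2, 9.8, 14.1, 13.9, 10.0, 14.2, 10.1]

print(f"mean = {statistics.mean(runs):.2f}s")  # ~11.2s: a time no run actually took

cut = (min(runs) + max(runs)) / 2              # crude split between the two modes
low = [r for r in runs if r < cut]
high = [r for r in runs if r >= cut]
print(f"low mode:  ~{statistics.mean(low):.1f}s ({len(low)} runs)")
print(f"high mode: ~{statistics.mean(high):.1f}s ({len(high)} runs)")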

‘In our experience, while the sin of ignorance seems obvious and easy to avoid, in reality it is far from that. Many factors in the evaluation that seem irrelevant to a claim may actually be critical to the soundness of the claim.

As a community we need to work towards identifying these factors.’

inappropriateness

• claim extends beyond the evaluation
• may be benign
  • assume laws of physics
  • assume prior (cited) work
• not benign when extending beyond these
• common idioms
  • inappropriate metrics (energy != performance)
  • inappropriate use of independent variables (don’t control for space in GC eval)
  • inappropriate use of tools (such as biased sampling)

evaluation claim

inappropriateness: misuse of independent variables (ignoring heap size) [Blackburn et al 2008]

[Figure: normalized time against heap size relative to minimum heap size (1x to 6x) for SemiSpace and MarkSweep, shown twice with two heap sizes marked A and B; because the curves cross, which collector appears faster depends on which heap size is chosen.]

inappropriateness: inappropriate (biased) sampling: hotness of methods under four profilers [Mytkowicz et al 2010]

[Figure 1. Disagreement in the hottest method for benchmark pmd across four popular Java profilers. For hprof, jprofile, xprof, and yourkit, bars show the percent of overall execution time (0-20%) attributed to JavaParser.jj_scan_token, NodeIterator.getPositionFromParent, and DefaultNameStep.evaluate.]

This paper is organized as follows: Section 2 presents a motivating example. Section 3 presents our experimental methodology. Section 4 illustrates how profiler disagreement can be used to demonstrate that profiles are incorrect. Section 5 uses causality analysis to determine if a profiler is actionable. Section 6 explores why profilers often produce non-actionable data. Section 7 introduces a proof-of-concept profiler that addresses the bias problems with existing profilers and produces actionable profiles. Finally, Section 8 discusses related work and Section 9 concludes.

2. Motivation

Figure 1 illustrates the amount of time that four popular Java profilers (hprof, jprofile, xprof, and yourkit) attribute to three methods from the pmd DaCapo benchmark [3]. There are three bars for each profiler, and each bar gives data for one of the three methods: jj_scan_token, getPositionFromParent, and evaluate. These are the methods that one of the four profilers identified as the hottest method. For a given profiler, P, and method, M, the height of the bar is the percentage of overall execution time spent in M according to P. The error bars (which are tight enough to be nearly invisible) denote 95% confidence interval of the mean of 30 runs.

Figure 1 illustrates that the four profilers disagree dramatically about which method is the hottest method. For example, two of the profilers, hprof and yourkit, identify the jj_scan_token method as the hottest method; however, the other two profilers indicate that this method is irrelevant to performance as they attribute 0% of execution time to it.

Figure 1 also illustrates that even when two profilers agree on the hottest method, they disagree in the percentage of time spent in the method. For example, hprof attributes 6.2% of overall execution time to the jj_scan_token method and yourkit attributes 8.5% of overall execution time to this method.

Clearly, when two profilers disagree, they cannot both be correct. Thus, if a performance analyst uses a profiler, she may or may not get a correct profile; in the case of an incorrect profile, the performance analyst may waste her time optimizing a cold method that will not improve performance. This paper demonstrates that the above inaccuracies are not corner cases but occur for the majority of commonly studied benchmarks.

3. Experimental methodology

This section describes profilers we use in this study, the benchmark programs we use in our experiments, the metrics we use to evaluate profilers, and our experimental setup.

B.mark    Description           Time [sec.]   Overhead
                                              hprof   xprof   jprof.  y.kit
antlr     parser generator      21.02         1.1x    1.2x    1.2x    1.2x
bloat     bytecode optimizer    74.26         1.1x    1.3x    1.0x    1.2x
chart     plot and render PDF   75.70         1.1x    1.1x    1.1x    1.1x
fop       print formatter       27.68         1.5x    1.1x    1.0x    1.8x
jython    python interpreter    68.12         1.1x    1.3x    1.1x    1.7x
luindex   text indexing tool    85.98         1.1x    1.2x    1.0x    1.1x
pmd       source analyzer       62.75         1.9x    1.3x    1.0x    2.2x
mean                                          1.3x    1.2x    1.1x    1.5x

Table 1. Overhead for the four profilers. We calculate “Overhead” as the total execution time with the profiler divided by execution time without the profiler.

3.1 Profilers

We study four state-of-the-art Java profilers that are widely used in both academia and industry:

hprof: is an open-source profiler that ships with Sun’s Hotspot and IBM’s J9.

xprof: is the internal profiler in Sun’s Hotspot JVM.

jprofile: is an award-winning [2] commercial product from EJ technologies.

yourkit: is an award-winning [3] commercial product from YourKit.

To collect data with minimal overhead, all four profilers use sampling. Sampling approximates the time spent in an application’s methods by periodically stopping a program and recording the currently executing method (a “sample”). These profilers all assume that the number of samples for a method is proportional to the time spent in the method. We used a sampling rate of 10ms for the experiments in this paper (this is the default rate for most profilers).
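For intuition about what such a sampling profiler does, here is a toy sketch in Python (not the JVM setting of the paper, and not any of the four profilers above): a background thread wakes every 10 ms, records the function currently executing on the main thread, and reports each function's share of samples as its share of execution time. Any systematic skew in where the samples land biases that estimate, which is the kind of problem the paper investigates. The workload function is a made-up placeholder.

import collections
import sys
import threading
import time

samples = collections.Counter()

def sampler(main_ident, stop, interval=0.01):
    # every `interval` seconds, note which function the main thread is running
    while not stop.is_set():
        frame = sys._current_frames().get(main_ident)
        if frame is not None:
            samples[frame.f_code.co_name] += 1
        time.sleep(interval)

def profile(workload):
    stop = threading.Event()
    t = threading.Thread(target=sampler,
                         args=(threading.main_thread().ident, stop),
                         daemon=True)
    t.start()
    workload()
    stop.set()
    total = sum(samples.values()) or 1
    for name, n in samples.most_common(5):
        print(f"{name:30s} {100 * n / total:5.1f}% of samples")

def workload():
    # hypothetical CPU-bound work to sample
    total = 0
    for i in range(3_000_000):
        total += i * i
    return total

profile(workload)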

3.2 Benchmarks

We evaluated the profilers using the single-threaded DaCapo Java benchmarks [3] (Table 1) with their default inputs.

We did not use the multi-threaded benchmarks (eclipse, lusearch, xalan, and hsqldb), because each profiler handles threads differently, which complicates comparisons across profilers.

The “Overhead” columns in Table 1 give the overhead of each profiler. Specifically, they give the end-to-end execution time with profiling divided by the end-to-end execution time without profiling. We see that profiler overhead is relatively low, usually 1.2x or better for all profilers except yourkit, which has more overhead than the other profilers because it also injects bytecodes into classes to count the number of calls to each method, in addition to sampling.

3.3 How to evaluate profilers

If we knew the “correct” profile for a program run, we could evaluate the profiler with respect to this correct profile. Unfortunately, there is no “correct” profile most of the time and thus we cannot definitively determine if a profiler is producing correct results.

For this reason, we relax the notion of “correctness” into “actionable”. By saying that a “profile is actionable” we mean that we do not know if the profile is “correct”; however, acting on the profile yields the expected outcome. For example, optimizing the hot methods identified by the profile will yield a measurable benefit. Thus, unlike “correctness” which is an absolute characterization (a profile is either correct or incorrect), actionable is necessarily a fuzzy characterization.

[2] Java Developer’s Journal Readers Choice Award for Best Java Profiling (2005-2007).
[3] Java Developer’s Journal Editors Choice Award (2005).

‘In our experience, while the sin of inappropriateness seems obvious and easy to avoid, in reality it is far from that. Many factors that may be unaccounted for in an evaluation may actually be important to deriving a sound claim.

As a community, we need to work towards identifying these factors.’

inconsistency

• claim and evaluation are about different things
• compare two incompatible things in the evaluation, but claim ignores the incompatibility (apples v oranges)
• common idioms
  • inconsistent measurement contexts (night | day, 32 | 64 machines)
  • inconsistent metrics (wall clock | cycles, retired instructions | issued instructions)
  • inconsistent workloads
  • inconsistent data analysis

evaluation claim

inconsistency: time of day/week makes a large difference to workload [gmail]

[Figure: Gmail requests per second plotted over time, showing the large variation in load with time of day and day of week.]

‘In our experience, while the sin of inconsistency seems obvious and easy to avoid, in reality it is far from that. Many artifacts that seem comparable may actually be inconsistent.

As a community we need to work towards identifying these artifacts.’

sins of exposition

inscrutability

• description of the claim is inadequate
• authors often fail to explicitly identify their claim
• three common forms:
  • omission (no claim: when this happens, the work is literally meaningless)
  • ambiguity (‘improved performance’: latency? throughput? …)
  • distortion (‘improved by 10%’: the GC? the whole program? …)

• claim is synthesized by the authors, so in principle, inscrutability should never occur

claim

irreproducibility

• the description of the evaluation is inadequate
• three common forms:
  • omission (space constraints, not realizing a factor is relevant, confidentiality)
  • ambiguity (imprecise language, lack of detail, missing units)
  • distortion (unambiguous, but incorrect, such as missing units)

evaluation

using this framework

evaluation claim?

sins of exposition

sins of reasoning

inscrutability irreproducibility

ignorance inappropriateness inconsistency

our hope: we’re moving up Phil Armour’s orders of ignorance…

OOI3: I don’t know something and I do not have a process to find out that I don’t know it

OOI2: I don’t know something but I do have a process to find out that I don’t know it

call for cultural change

[Diagram: papers plotted by Novelty against Quality of evaluation, with regions labeled Rejected, Often rejected, Safe bet, Often rejected, and Rare.]

PLDI 2015: 27/33 artifacts accepted from 58 accepted papers

assert(scope_of_claim == scope_of_eval)

Questions?

With special thanks to my co-authors:Amer Diwan, Matthias Hauswirth, Peter F. Sweeney

José Nelson Amaral, Tim Brecht, Lubomír Bulej, Cliff Click, Lieven Eeckhout, Sebastian Fischmeister, Daniel Frampton, Laurie J. Hendren, Michael Hind,

Antony L. Hosking, Richard E. Jones, Tomas Kalibera, Nathan Keynes, Nathaniel Nystrom, and Andreas Zeller