Evaluation and Methodology For Experimental Computer Science


Transcript of Evaluation and Methodology For Experimental Computer Science

Page 1: Evaluation and Methodology For Experimental Computer Science

Evaluation and Methodology For Experimental Computer Science

Steve Blackburn, Research School of Computer Science, Australian National University

Page 2: Evaluation and Methodology For Experimental Computer Science


Research: Solving problems without known answers

Page 3: Evaluation and Methodology For Experimental Computer Science


Page 4: Evaluation and Methodology For Experimental Computer Science


Quantitative Experimentation


• Experiment
  – Measure A and B in context of C

• Claim
  – “A is better than B”

Does the experiment support the claim?

[Blackburn, Diwan, Hauswirth, Sweeney et al 2012]

Page 5: Evaluation and Methodology For Experimental Computer Science


Scope of Claim & Experiment

• Claim with broad scope is hard to satisfy
  – “We improve Java programs by 10%”
  – Implicitly all Java programs in all circumstances
  – Scope of experiment limited by resources

• Claim with narrow scope is uninteresting
  – “We improve Java on lusearch on an i7 on … by 10%”

Scope of claim is the key tension

[Blackburn, Diwan, Hauswirth, Sweeney et al 2012]

Quantitative Experimentation

Page 6: Evaluation and Methodology For Experimental Computer Science


Components of an Experiment

• Measurement context
  – Software and hardware components varied or held constant

• Workloads
  – Benchmarks and their inputs used in the experiment

• Metrics
  – The properties to measure and how to measure them

• Data analysis and interpretation
  – How to analyze the data and how to interpret the results

[Blackburn, Diwan, Hauswirth, Sweeney et al 2012]

Control / independent variables: the measurement context and the workloads. Dependent variables: the metrics. (A sketch of an experiment written down in these terms follows below.)

Quantitative Experimentation
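To make the four components concrete, here is a minimal sketch of how one experiment might be written down as an explicit specification, in Python for scriptability. Everything in it (benchmark names, heap policy, iteration counts) is illustrative, not prescribed by the slides.

```python
# Sketch: a declarative description of one experiment, covering the four
# components named above. All field names and values are illustrative only.
EXPERIMENT = {
    "measurement_context": {
        "hardware": "one fixed machine model (held constant)",
        "os": "same OS image on every machine (held constant)",
        "system_under_test": ["system-A", "system-B"],   # the independent variable
        "heap_size": "2x minimum heap (held constant)",
    },
    "workloads": [
        # (benchmark, input)
        ("lusearch", "default"),
        ("avrora", "default"),
    ],
    "metrics": [
        "wall-clock execution time",        # dependent variables
        "peak resident set size",
    ],
    "analysis": "mean of 20 invocations, reported with 95% confidence intervals",
}

if __name__ == "__main__":
    for component, value in EXPERIMENT.items():
        print(f"{component}: {value}")
```

Writing the experiment down this way forces each component to be stated explicitly, which makes the pitfalls on the next slide easier to spot.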

Page 7: Evaluation and Methodology For Experimental Computer Science


Experimental Pitfalls (the four “I”s)

• Inappropriate
  – Experiments that are inappropriate (or surplus) to the claim

• Ignored
  – Elements relevant to the claim, but omitted

• Inconsistent
  – Elements are treated inconsistently

• Irreproducible
  – Others cannot reproduce the experiment

[Blackburn, Diwan, Hauswirth, Sweeney et al 2012]

Quantitative Experimentation

Page 8: Evaluation and Methodology For Experimental Computer Science


Components × Pitfalls [Blackburn, Diwan, Hauswirth, Sweeney et al 2012]

Quantitative Experimentation

A measurement context is inappropriate when it is flawed or does not reflect the measurement context that is implicit in the claim. This may become manifest as an error or as a distraction (a “red herring”).

An aspect of the measurement context is ignored when an experiment design does not consider it even when it is necessary to support the claim.

A measurement context is inconsistent when an experiment compares two systems and uses different measurement contexts for each system. The different contexts may produce incomparable results for the two systems. Unfortunately, the more disparate the objects of comparison, the more difficult it is to ensure consistent measurement contexts. Even a benign-looking difference in contexts can introduce bias and make measurement results incomparable. For this reason, it may be important to randomize experimental parameters (e.g., memory layout in experiments that measure performance).
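One way to act on that advice is to interleave the systems being compared and to randomize layout-affecting parameters across invocations. A minimal sketch, assuming a Linux host; the harness command, configuration names, and the padding variable are hypothetical, and environment-size padding is just one example of a layout-randomizing parameter.

```python
# Sketch: randomize experimental parameters to avoid layout-induced bias.
# "./run_benchmark" and the configuration names are placeholders.
import os
import random
import subprocess

CONFIGS = ["system-A", "system-B"]
TRIALS = 20

runs = [(cfg, trial) for cfg in CONFIGS for trial in range(TRIALS)]
random.shuffle(runs)          # interleave A and B rather than running in blocks

for cfg, trial in runs:
    env = dict(os.environ)
    # Pad the environment with a random-length variable so the initial
    # stack placement (and hence memory layout) varies between invocations.
    env["LAYOUT_PADDING"] = "x" * random.randint(0, 4096)
    subprocess.run(["./run_benchmark", "--config", cfg],
                   env=env, check=True)
```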

If the measurement context is irreproducible then the experiment is also irreproducible. Measurement contexts may be irreproducible because either they are not public or they are not documented.


Page 9: Evaluation and Methodology For Experimental Computer Science


Advice


Page 10: Evaluation and Methodology For Experimental Computer Science


Ingrained, Systematic Skepticism


Too good to be true? Probably.

• Is the result repeatable?
  – If it is not, it’s nothing more than noise (a quick check is sketched below)

• Is the result plausible?
  – You need to possess a clear idea of what is plausible

• Can you explain the result?
  – Plausible support of the hypothesis is essential

Street-Fighting Mathematics, MIT OpenCourseWare 18.098 / 6.099
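A quick repeatability check is to repeat the measurement and compare the run-to-run spread against the size of the claimed effect. A minimal sketch with placeholder timings; substitute your own repeated runs.

```python
# Sketch: is the observed difference repeatable, or just run-to-run noise?
# The timings below are placeholders for real repeated measurements.
from statistics import mean, stdev

baseline = [10.2, 10.4, 10.1, 10.3, 10.5]   # seconds, system A
candidate = [9.7, 9.9, 10.6, 9.8, 10.4]     # seconds, system B

for name, xs in [("A", baseline), ("B", candidate)]:
    cv = stdev(xs) / mean(xs)               # coefficient of variation
    print(f"{name}: mean = {mean(xs):.2f}s  cv = {cv:.1%}")

# If the run-to-run variation (cv) is comparable in size to the claimed
# improvement, the result is not yet distinguishable from noise.
```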

Page 11: Evaluation and Methodology For Experimental Computer Science


Clean Environment


Just as essential as a clean lab is for a biologist

• Clean OS & distro
  – All machines run the same image of the same distro

• Clean hardware
  – Buy machines in pairs (redundancy & sanity checks)

• Know what is running
  – No NFS mounts, no non-essential daemons (an automated check is sketched below)

• Machine reservation system
  – Ensure only you are using the machine
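Some of these checks can be automated so they run before every experiment. A minimal sketch, assuming a Linux machine; what counts as a non-essential daemon is site-specific, so only the NFS-mount and other-user checks are shown.

```python
# Sketch: sanity-check the machine before running an experiment (Linux only).
import subprocess


def nfs_mounts():
    """Return any NFS entries from /proc/mounts."""
    with open("/proc/mounts") as f:
        return [line.strip() for line in f if " nfs" in line]


def logged_in_users():
    """Return the distinct users currently logged in, via `who`."""
    out = subprocess.run(["who"], capture_output=True, text=True).stdout
    return sorted({line.split()[0] for line in out.splitlines()})


if nfs_mounts():
    raise SystemExit("NFS mounts present; refusing to run.")
if len(logged_in_users()) > 1:
    raise SystemExit(f"Other users logged in: {logged_in_users()}")
print("Environment looks clean; proceeding.")
```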

Page 12: Evaluation and Methodology For Experimental Computer Science


Repeatability and Accountability


Disk is cheap. Don’t throw anything away.

• All experiments should be scripted (see the sketch below)

• Log every experiment
  – Capture the environment and output in the log
  – Keep logs (forever)

• Publish your raw data
  – Downloadable from your web site
  – If you’re not comfortable with this, you probably should not be publishing
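A minimal sketch of a scripted, logged run; the benchmark command is hypothetical, and the log layout (one timestamped directory per run, environment captured as JSON) is illustrative rather than prescribed.

```python
# Sketch: script every experiment; log the environment and output; keep the logs.
import datetime
import json
import os
import platform
import subprocess
from pathlib import Path


def run_logged(cmd, log_root="logs"):
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    log_dir = Path(log_root) / stamp
    log_dir.mkdir(parents=True, exist_ok=True)

    # Capture the environment alongside the result.
    (log_dir / "environment.json").write_text(json.dumps({
        "hostname": platform.node(),
        "kernel": platform.release(),
        "cmd": cmd,
        "env": dict(os.environ),
    }, indent=2))

    result = subprocess.run(cmd, capture_output=True, text=True)
    (log_dir / "stdout.log").write_text(result.stdout)
    (log_dir / "stderr.log").write_text(result.stderr)
    return result


if __name__ == "__main__":
    run_logged(["./run_benchmark", "--config", "system-A"])
```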

Page 13: Evaluation and Methodology For Experimental Computer Science


Statistics


Lies, damn lies, and statistics

• Understand basic statistics

• Are your results statistically significant?

• Report confidence intervals (see the sketch below)
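A minimal sketch of reporting a 95% confidence interval on the mean of repeated runs, using Student's t distribution via SciPy; the timings are placeholders for real measurements.

```python
# Sketch: mean and 95% confidence interval from repeated runs.
# Requires SciPy; the timings below are placeholders.
from math import sqrt
from statistics import mean, stdev

from scipy import stats

times = [10.2, 10.4, 10.1, 10.3, 10.5, 10.2, 10.6, 10.3]  # seconds
n = len(times)
m, s = mean(times), stdev(times)

# Student's t interval on the mean (appropriate for small samples).
t = stats.t.ppf(0.975, df=n - 1)
half_width = t * s / sqrt(n)
print(f"mean = {m:.2f}s ± {half_width:.2f}s (95% CI, n={n})")
```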

Page 14: Evaluation and Methodology For Experimental Computer Science


Good Tools


Good evaluation infrastructure gives you an edge

• Good data management system
  – Easy manipulation and recovery of data

• Good data analysis tools
  – See results that others can’t, and share them with your collaborators

• Good workloads
  – Realistic workloads are key to credibility

• Good teamwork
  – Resist the temptation to write your own. Work as a team.

Page 15: Evaluation and Methodology For Experimental Computer Science


Good Tools


Good evaluation infrastructure gives you an edge

Page 16: Evaluation and Methodology For Experimental Computer Science


Questions?
