Evaluation and Methodology For Experimental Computer Science
Steve Blackburn, Research School of Computer Science, Australian National University
Research: Solving problems without known answers
Quantitative Experimentation
• Experiment
  – Measure A and B in context of C
• Claim
  – “A is better than B”
Does the experiment support the claim?
[Blackburn, Diwan, Hauswirth, Sweeney et al 2012]
Scope of Claim & Experiment
• Claim with broad scope is hard to satisfy
  – “We improve Java programs by 10%”
  – Implicitly all Java programs in all circumstances
  – Scope of experiment limited by resources
• Claim with narrow scope is uninteresting
  – “We improve Java on lusearch on an i7 on … by 10%”
Scope of claim is the key tension
[Blackburn, Diwan, Hauswirth, Sweeney et al 2012]
Components of an Experiment
• Measurement context
  – Software and hardware components varied or held constant
• Workloads
  – Benchmarks and their inputs used in the experiment
• Metrics
  – The properties to measure and how to measure them
• Data analysis and interpretation
  – How to analyze the data and how to interpret the results
[Blackburn, Diwan, Hauswirth, Sweeney et al 2012]
The measurement context and workloads act as the control and independent variables; the metrics are the dependent variables.
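As an illustration only, these components can be written down explicitly before anything is run. The Python sketch below separates what is held constant, what is varied, and what is measured; every name and value in it is hypothetical, not something from the slides:

    # Illustrative sketch only: making the components of an experiment
    # explicit as data. Every name and value here is hypothetical.
    experiment = {
        # Measurement context: components held constant vs. varied
        "context": {
            "constant": {"cpu": "i7", "os": "Linux", "jvm": "OpenJDK"},
            "varied": {"heap_size_mb": [64, 128, 256, 512]},  # independent variable
        },
        # Workloads: benchmarks and their inputs
        "workloads": [("lusearch", "default"), ("avrora", "default")],
        # Metrics: the dependent variables and how to measure them
        "metrics": ["wall_clock_time_ms", "gc_pause_ms"],
        # Analysis: how the data will be analyzed and interpreted
        "analysis": "mean of 20 runs with 95% confidence intervals",
    }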
Experimental Pitfalls (the four “I”s)
• Inappropriate
  – Experiments that are inappropriate (or surplus) to the claim
• Ignored
  – Elements relevant to the claim, but omitted
• Inconsistent
  – Elements are treated inconsistently
• Irreproducible
  – Others cannot reproduce the experiment
[Blackburn, Diwan, Hauswirth, Sweeney et al 2012]
Components × Pitfalls
[Blackburn, Diwan, Hauswirth, Sweeney et al 2012]
Applied to the measurement context, the four pitfalls look like this:

Inappropriate: A measurement context is inappropriate when it is flawed or does not reflect the measurement context that is implicit in the claim. This may become manifest as an error or as a distraction (a “red herring”).

Ignored: An aspect of the measurement context is ignored when an experiment design does not consider it even when it is necessary to support the claim.

Inconsistent: A measurement context is inconsistent when an experiment compares two systems and uses different measurement contexts for each system. The different contexts may produce incomparable results for the two systems. Unfortunately, the more disparate the objects of comparison, the more difficult it is to ensure consistent measurement contexts. Even a benign-looking difference in contexts can introduce bias and make measurement results incomparable. For this reason, it may be important to randomize experimental parameters (e.g., memory layout in experiments that measure performance); a sketch of this idea follows below.

Irreproducible: If the measurement context is irreproducible, then the experiment is also irreproducible. Measurement contexts may be irreproducible because either they are not public or they are not documented.
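One way to guard against such bias, sketched below in Python as an assumption-laden illustration (not from the slides), is to randomize and interleave the runs of the two systems so that drifting machine state (temperature, background activity, memory layout) cannot systematically favor either; run_benchmark is a hypothetical stand-in for a real harness:

    import random

    # Sketch only: interleave and shuffle runs of systems A and B so that
    # drifting machine state cannot systematically favor one system.

    def run_benchmark(system, trial):
        """Hypothetical harness; replace with a real benchmark invocation."""
        return 0.0  # e.g., wall-clock time in seconds

    trials = [(system, t) for system in ("A", "B") for t in range(20)]
    random.shuffle(trials)  # randomize run order across both systems

    results = {"A": [], "B": []}
    for system, t in trials:
        results[system].append(run_benchmark(system, t))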
Advice
Ingrained, Systematic Skepticism
Too good to be true? Probably.
• Is the result repeatable?
  – If it is not, it’s nothing more than noise
• Is the result plausible?
  – You need to possess a clear idea of what is plausible
• Can you explain the result?
  – Plausible support of the hypothesis is essential
Street-Fighting Mathematics, MIT OpenCourseWare 18.098 / 6.099
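One way to make the repeatability question concrete is to look at run-to-run variation. The Python sketch below is an illustration only; the timings are invented:

    import statistics

    # Hypothetical timings from five runs of the same configuration.
    times = [1.92, 1.95, 1.90, 2.01, 1.94]
    mean = statistics.mean(times)
    stdev = statistics.stdev(times)
    cv = stdev / mean  # coefficient of variation
    print(f"mean={mean:.3f}s  stdev={stdev:.3f}s  CV={cv:.1%}")
    # If CV is comparable to the improvement you are claiming (say, a 5%
    # speedup against a 4% CV), the "result" may be nothing more than noise.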
Clean Environment
Just as essential as a clean lab is for a biologist
• Clean OS & distro
  – All machines run same image of same distro
• Clean hardware
  – Buy machines in pairs (redundancy & sanity checks)
• Know what is running
  – No NFS mounts, no non-essential daemons
• Machine reservation system
  – Ensure only you are using the machine
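A sketch of a pre-run sanity check in this spirit (Python, assuming a Unix-like system with ps; the list of “non-essential” daemons is illustrative only, and what belongs on it depends on your setup):

    import platform
    import subprocess

    # Sketch: record the machine state and flag obvious contamination
    # before running an experiment. Assumes a Unix-like system with `ps`.
    NON_ESSENTIAL = {"nfsd", "cron", "updatedb"}  # illustrative list only

    def check_environment(logfile="env.log"):
        procs = subprocess.run(["ps", "-eo", "comm"], capture_output=True,
                               text=True).stdout.split()
        with open(logfile, "w") as f:
            f.write(f"{platform.uname()}\n")
            f.write("\n".join(procs))
        running = NON_ESSENTIAL.intersection(procs)
        if running:
            raise RuntimeError(f"Non-essential daemons running: {running}")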
Repeatability and Accountability
Disk is cheap. Don’t throw anything away.
• All experiments should be scripted
• Log every experiment
  – Capture the environment and output in the log (sketched below)
  – Keep logs (forever)
• Publish your raw data
  – Downloadable from your web site
  – If you’re not comfortable with this, you probably should not be publishing
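A minimal sketch of the “script everything, log everything” discipline, in Python; the directory layout, record format, and the example command are assumptions, not something from the slides:

    import json
    import os
    import subprocess
    import time

    # Sketch: run a scripted experiment and keep a permanent record of
    # the command, environment, and output.

    def run_and_log(cmd, logdir="logs"):
        os.makedirs(logdir, exist_ok=True)
        stamp = time.strftime("%Y%m%d-%H%M%S")
        result = subprocess.run(cmd, capture_output=True, text=True)
        record = {
            "timestamp": stamp,
            "command": cmd,
            "environment": dict(os.environ),  # capture the environment
            "stdout": result.stdout,
            "stderr": result.stderr,
            "returncode": result.returncode,
        }
        with open(os.path.join(logdir, f"{stamp}.json"), "w") as f:
            json.dump(record, f, indent=2)
        return record

    # Hypothetical example: run_and_log(["java", "-jar", "dacapo.jar", "lusearch"])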
Statistics
Lies, damn lies, and statistics
• Understand basic statistics
• Are your results statistically significant?
• Report confidence intervals
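For instance, a 95% confidence interval for the mean of repeated runs can be computed as follows (a Python sketch; the timings are invented):

    import math
    import statistics

    # Hypothetical timings from eight runs of one configuration.
    times = [1.92, 1.95, 1.90, 2.01, 1.94, 1.97, 1.89, 1.93]
    n = len(times)
    mean = statistics.mean(times)
    sem = statistics.stdev(times) / math.sqrt(n)  # standard error of the mean
    t_crit = 2.365  # two-tailed 95% Student's t critical value for df = 7
    lo, hi = mean - t_crit * sem, mean + t_crit * sem
    print(f"mean {mean:.3f}s, 95% CI [{lo:.3f}, {hi:.3f}]")
    # If the intervals for two systems overlap substantially, the measured
    # difference may not be statistically significant.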
Good Tools
Good evaluation infrastructure gives you an edge
• Good data management system
  – Easy manipulation of and recovery of data
• Good data analysis tools
  – See results that others can’t and share with your collaborators
• Good workloads
  – Realistic workloads key to credibility
• Good teamwork
  – Resist the temptation to write your own. Work as a team.
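As a sketch of how the pieces fit together, the JSON logs written by the hypothetical logger above could be recovered and summarized with a geometric mean, the usual summary for normalized performance ratios; the speedup field is a hypothetical derived value:

    import glob
    import json
    import math

    # Sketch: load logged runs and report a geometric mean speedup.
    speedups = []
    for path in glob.glob("logs/*.json"):
        with open(path) as f:
            record = json.load(f)
        if "speedup" in record:  # hypothetical derived field
            speedups.append(record["speedup"])

    if speedups:
        geomean = math.exp(sum(map(math.log, speedups)) / len(speedups))
        print(f"geomean speedup over {len(speedups)} runs: {geomean:.3f}")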
Questions?