Lies, Damn Lies and Benchmarks


  • Lies, Damn Lies and Benchmarks: Are your benchmark tests reliable?

  • Typical Computer Systems Paper
    Abstract: what this paper contains. Most readers will read just this.
    Introduction: present a problem. The universe cannot go on if the problem persists.
    Related Work: show the work of competitors. They stink.
    Solution: present the suggested solution. We are the best.

  • Typical Paper (Cont.)
    Technique: go into the details. Many drawings and figures.
    Experiments: prove our point; evaluation methodology. Which benchmarks adhere to my assumptions?
    Results: show how great the enhancement is. The objective benchmarks agree that we are the best.
    Conclusions: highlights of the paper. Some readers will read this in addition to the abstract.

  • SPEC
    SPEC is the Standard Performance Evaluation Corporation. Legally, SPEC is a non-profit corporation registered in California.
    SPEC's mission: to establish, maintain, and endorse a standardized set of relevant benchmarks and metrics for performance evaluation of modern computer systems.
    "SPEC CPU2000 is the next-generation industry-standardized CPU-intensive benchmark suite."
    Composed of 12 integer (CINT2000) and 14 floating-point (CFP2000) benchmarks.

  • Some Conference Statistics
    Number of papers published: 209
    Papers that used a version of SPEC: 138 (66%)
    Earliest conference deadline: December 2000
    SPEC CPU2000 announced: December 1999

  • Partial use of CINT2000

    Chart: percents of papers by number of CINT2000 benchmarks used per paper. The underlying counts per conference:

    Benchmarks used  ISCA 2001  Micro 2001  HPCA 2002  ISCA 2002  Micro 2002  HPCA 2003  ISCA 2003  Total
    0                7          7           7          7          4           2          0          34
    1-6              4          5           5          6          5           4          9          38
    7-11             4          2           3          5          11          4          7          36
    12               2          3           2          4          7           5          7          30

  • Why not use it all?
    It seems that many papers do not use all the benchmarks of the suite. Selected excuses were:
    "The chosen benchmarks stress the problem"
    "Several benchmarks couldn't be simulated"
    "A subset of CINT2000 was chosen" / "select benchmarks from CPU2000"
    "More benchmarks wouldn't fit into our displays"

  • Omission Explanation
    Only roughly a third of the papers (34/108) present any reason at all.
    Many of the reasons are not very convincing. Are the claims on the previous slide persuasive?

  • What has been omitted
    Possible reasons for the omissions:
    eon is written in C++.
    gap calls the ioctl system call, which is device-specific.
    crafty uses a 64-bit word.
    perlbmk has problems on 64-bit processors.

    Chart: percents of usage of each CINT2000 benchmark across the surveyed papers:
    gzip 83%, vpr 75%, parser 74%, gcc 73%, mcf 72%, vortex 68%, twolf 68%, bzip2 68%, perlbmk 57%, crafty 57%, gap 49%, eon 43%.

  • CINT95
    Still widespread even though it was retired in June 2000.
    A smaller suite (8 benchmarks vs. 12).
    Over 50% full use, but the suite had been around for at least 3 years already.
    Only 5 papers out of 36 explain the partial use.

    Chart: percents of papers by number of benchmarks used per paper, CINT95 (1999-2000) vs. CINT2000 (2001-2002); the CINT95 bins are in parentheses:

    Benchmarks used  CINT95 (1999-2000)  CINT2000 (2001-2002)
    0                5                   34
    1-6 (1-4)        12                  29
    7-11 (5-7)       19                  29
    12 (8)           35                  23

  • Use of CINT2000
    The use of CINT has been increasing over the years.
    The benchmarking of new systems is done with old tests.

  • Amdahl's Law
    F_enhanced is the fraction of the benchmarks that were enhanced. The speedup is:

    Speedup = CPU_Time_old / CPU_Time_new
            = CPU_Time_old / [CPU_Time_old * (1 - F_enhanced) + CPU_Time_old * F_enhanced / Speedup_enhanced]
            = 1 / [(1 - F_enhanced) + F_enhanced / Speedup_enhanced]

    Example: if we have a way to improve just the gzip benchmark by a factor of 10, what fraction of the usage must gzip be to achieve a 300% speedup?

    3 = 1 / [(1 - F_enhanced) + F_enhanced / 10]  =>  F_enhanced = 20/27 ≈ 74%
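    A minimal Python sketch of this calculation, assuming nothing beyond the slide's own numbers (the function names are mine):

    def amdahl_speedup(f_enhanced, speedup_enhanced):
        """Overall speedup when a fraction f_enhanced of the time is enhanced."""
        return 1.0 / ((1.0 - f_enhanced) + f_enhanced / speedup_enhanced)

    def fraction_needed(target_speedup, speedup_enhanced):
        """Solve target = 1 / ((1 - F) + F / s) for F."""
        return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / speedup_enhanced)

    # Slide example: gzip improved by a factor of 10, target overall speedup of 3 (300%).
    f = fraction_needed(3.0, 10.0)
    print(f)                        # 0.7407... = 20/27, i.e. about 74%
    print(amdahl_speedup(f, 10.0))  # ~3.0, sanity check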

  • Breaking Amdahl's Law
    "The performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used."
    Only the full suite can accurately gauge the enhancement.
    It is possible that the other benchmarks:
    produce similar results,
    degrade performance, or
    are invariant to the enhancement. Even in this case, the published results are too high according to Amdahl's Law.
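    An illustrative Python sketch (the per-benchmark numbers are invented) of how a speedup reported on a subset shrinks over the full suite when the omitted benchmarks are merely invariant to the enhancement:

    # Hypothetical speedups: 8 reported benchmarks improve by 30%,
    # 4 omitted benchmarks are unaffected (speedup of 1.0).
    reported = [1.30] * 8
    omitted = [1.00] * 4

    def suite_speedup(speedups):
        """Suite speedup assuming each benchmark had equal run time before the change
        (ratio of total old time to total new time)."""
        old_time = float(len(speedups))            # one time unit per benchmark
        new_time = sum(1.0 / s for s in speedups)
        return old_time / new_time

    print(suite_speedup(reported))             # 1.30: what the paper would report
    print(suite_speedup(reported + omitted))   # ~1.18: speedup over the full suite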

  • Tradeoffs
    What about papers that offer performance tradeoffs? More than 40% of the papers present a performance tradeoff.
    An average paper reports just 8 of the 12 tests. What do we assume about the missing results?

  • Besides SPEC
    Categories of benchmarks:
    Official benchmarks, like SPEC; there are also official benchmarks from non-vendor sources. They will not always concentrate on the points important for your usage.
    Traces: real users whose activities are logged and kept. An improved (or worsened) system may change the users' behavior.

  • Besides SPEC (Cont.)
    Microbenchmarks: test just an isolated component of a system. Even using multiple microbenchmarks will not test the interaction between the components.
    Ad-hoc benchmarks: run a bunch of programs that seem interesting. If you suggest a way to compile Linux faster, Linux compilation can be a good benchmark.
    Synthetic benchmarks: write a program yourself to test with. You can stress your point.
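    As a hedged illustration of the microbenchmark idea (not any standard benchmark), the Python sketch below times one isolated operation, an in-memory copy; it deliberately says nothing about how that component interacts with the rest of a real workload, which is the limitation noted above:

    import time

    def copy_microbenchmark(size_mb=64, repeats=10):
        """Time only an in-memory buffer copy, isolated from any real workload."""
        data = bytearray(size_mb * 1024 * 1024)
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            _ = bytes(data)                              # the isolated operation under test
            best = min(best, time.perf_counter() - start)
        return size_mb / best                            # MB/s of the best run

    print(f"{copy_microbenchmark():.0f} MB/s")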

  • Whetstone Benchmark
    Historically, it is the first synthetic microbenchmark. The original Whetstone benchmark was designed in the 60's; the first practical implementation was in 1972. It was named after the small town of Whetstone, where it was designed.
    Designed to measure the execution speed of a variety of FP instructions (+, *, sin, cos, atan, sqrt, log, exp); it contains a small loop of FP instructions.
    The majority of its variables are global; hence it will not show the RISC advantage, where a large number of registers speeds up the handling of local variables.
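    A Whetstone-style loop in Python, purely illustrative and not the official benchmark: it mixes the FP operations listed above and keeps its state in module-level (global) variables to echo the point about globals:

    import math, time

    # Global state, echoing Whetstone's heavy use of global variables.
    x1, x2, x3, x4 = 1.0, -1.0, -1.0, -1.0
    T = 0.499975

    def fp_loop(iterations=500_000):
        """Small loop of mixed FP arithmetic and transcendental calls."""
        global x1, x2, x3, x4
        for _ in range(iterations):
            x1 = (x1 + x2 + x3 - x4) * T
            x2 = math.sqrt(abs(x1)) + math.sin(x2) * math.cos(x2)
            x3 = math.exp(min(x3, 1.0)) + math.log(abs(x4) + 1.0)
            x4 = math.atan(x1) * T

    start = time.perf_counter()
    fp_loop()
    print(f"{time.perf_counter() - start:.2f} s")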

  • The Andrew Benchmark
    The Andrew benchmark was suggested in 1988. In the early 90's it was one of the popular non-vendor benchmarks for file system efficiency.
    The Andrew benchmark:
    Copies a directory hierarchy containing the source code of a large program.
    "stat"s every file in the hierarchy.
    Reads every byte of every copied file.
    Compiles the code in the copied hierarchy.
    Does this reflect reality? Who does work like this?
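    A rough Python sketch of those four phases; the source tree, destination path, and build command are placeholders, not part of the original Andrew harness:

    import os, shutil, subprocess

    SRC = "/path/to/source/tree"   # placeholder: hierarchy holding a large program's source
    DST = "/tmp/andrew-copy"       # placeholder: where the copy is made

    # Phase 1: copy the directory hierarchy.
    shutil.copytree(SRC, DST)

    # Phase 2: stat every file and directory in the copied hierarchy.
    for root, dirs, files in os.walk(DST):
        for name in dirs + files:
            os.stat(os.path.join(root, name))

    # Phase 3: read every byte of every copied file.
    for root, _, files in os.walk(DST):
        for name in files:
            with open(os.path.join(root, name), "rb") as f:
                f.read()

    # Phase 4: compile the code in the copied hierarchy (build command is a placeholder).
    subprocess.run(["make", "-C", DST], check=True)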

  • Kernel Compilation
    Maybe a "real" job can be more representative? Measure the compilation of the Linux kernel.
    The compilation reads large memory areas only once, which reduces the influence of cache efficiency; the influence of the L2 cache in particular is drastically reduced.
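    A hedged sketch of timing such a build from Python; the kernel tree path and the -j value are placeholders, and the tree is assumed to be already configured:

    import subprocess, time

    KERNEL_TREE = "/path/to/linux"   # placeholder: an already-configured kernel source tree

    subprocess.run(["make", "-C", KERNEL_TREE, "clean"], check=True)
    start = time.perf_counter()
    subprocess.run(["make", "-C", KERNEL_TREE, "-j4"], check=True)
    print(f"kernel build took {time.perf_counter() - start:.1f} s")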

  • Benchmarks' Contribution
    In 1999, Mogul presented statistics showing that while hardware is usually measured by SPEC, when it comes to operating-system code no standard is popular.
    Distributed systems are commonly benchmarked with NAS.
    In 1993, Chen & Patterson wrote: "Benchmarks do not help in understanding system performance".