
Benchmarking: Science? Art? Neither?

J. E. Smith

University of Wisconsin-Madison

Copyright (C) 2006 by James E. Smith


Everyone has an opinion…

Like most people working in research and development, I’m a great benchmark “hobbyist”

My perspective on the benchmark process
• Emphasis on the "scientific" (or non-scientific) aspects
• Accumulated observations and opinions


Benchmarking Process

Steps:

1) Define workload

2) Extract Benchmarks from applications

3) Choose performance metric

4) Execute benchmarks on target machine(s)

5) Project workload performance for target machine(s) and summarize results


The Benchmarking Process (Science)

[Figure: the benchmarking process as a pipeline. Define Workload: jobs with works w1…w5 and times t1…t5. Extract Benchmarks: benchmark works w1'…w5' and times t1'…t5'. Run Benchmarks: measured benchmark times on the target machine, t̂1'…t̂5'. Project Performance: projected workload works ŵ1…ŵ5 and times t̂1…t̂5 on the target machine.]


Extracting Benchmarks

Total work in application environment: $W = \sum_i w_i$

Total work in benchmarks: $W' = \sum_i w_i'$

Fraction of work from each job type: $f_i = w_i / W$

Fraction of work from each benchmark: $f_i' = w_i' / W'$

Perfect scaling assumption: $t_i / t_i' = w_i / w_i'$

Perfect scaling is often implicitly assumed, but does not always hold.
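As a concrete illustration (a minimal sketch; the job names and work values below are invented, not from the talk), the work fractions above can be computed directly from per-job measurements:

```python
# Hypothetical work measurements (arbitrary units); not from the talk.
app_work = {"db": 600.0, "web": 300.0, "batch": 100.0}     # w_i: full application environment
bench_work = {"db": 60.0, "web": 24.0, "batch": 16.0}      # w_i': cut-down benchmarks

W = sum(app_work.values())           # total work in application environment
W_prime = sum(bench_work.values())   # total work in benchmarks

f = {i: w / W for i, w in app_work.items()}                 # f_i  = w_i  / W
f_prime = {i: w / W_prime for i, w in bench_work.items()}   # f_i' = w_i' / W'

print(f)        # {'db': 0.6, 'web': 0.3, 'batch': 0.1}
print(f_prime)  # {'db': 0.6, 'web': 0.24, 'batch': 0.16}
# The two fraction vectors differ: this benchmark suite over-weights 'batch'
# relative to the real workload, which the weighting step must account for.
```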


Projecting Performance

Define scale factor: $m_i = t_i / t_i' = w_i / w_i' = f_i W / (f_i' W')$

Constant work model:
• Work done is the same regardless of machine used: $\hat{w}_i = w_i$ for all $i$
• What is projected time (assuming perfect scaling)? $\hat{T} = \sum_i \hat{t}_i = \sum_i \hat{t}_i'\, m_i$

Constant time model:
• Time spent in each program is the same regardless of machine used: $\hat{t}_i = t_i$ for all $i$
• What is projected work (assuming perfect scaling)? $\hat{W} = \sum_i \hat{w}_i = \sum_i w_i' (\hat{t}_i / \hat{t}_i')$
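Under the perfect-scaling assumption both projections reduce to a few lines of arithmetic. A minimal sketch for a single job i (all numbers below are invented for illustration):

```python
# Per-job quantities for one job i (invented numbers).
w_i = 600.0       # work of job i in the real workload
w_ip = 60.0       # w_i': work of the extracted benchmark (cut down 10x)
t_ip_hat = 3.0    # measured benchmark time on the target machine (seconds)
t_i = 40.0        # time job i takes in the real environment (seconds)

m_i = w_i / w_ip                   # scale factor m_i = w_i / w_i' (= t_i / t_i' under perfect scaling)

# Constant work model: the target must do the same work w_i; project its time.
t_i_hat = t_ip_hat * m_i           # projected time on target = 3.0 * 10 = 30.0 s

# Constant time model: the target runs for the same time t_i; project the work done.
w_i_hat = w_ip * (t_i / t_ip_hat)  # projected work = 60 * (40 / 3) = 800.0

print(t_i_hat, w_i_hat)
```

Summing these per-job projections over all i gives the T-hat and W-hat totals in the formulas above.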


Performance Measured as a Rate

Rate = work/time: $r_i = w_i / t_i$

For constant work model: $\hat{R} = \sum_i \hat{w}_i / \sum_i \hat{t}_i = \sum_i f_i W / \sum_i (f_i W / \hat{r}_i') = 1 / \sum_i (f_i / \hat{r}_i')$ (i.e., the weighted harmonic mean rate)

For constant time model: $\hat{R} = \sum_i \hat{w}_i / \sum_i \hat{t}_i = \sum_i \hat{r}_i' \hat{t}_i / \sum_i \hat{t}_i$. Because the $\hat{t}_i$ are fixed, this is essentially a weighted arithmetic mean.

What about geometric mean?
• Neither science nor art
• It has a nice (but mostly useless) mathematical property.
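A quick sketch of the two summary rates (the benchmark rates, weights, and times below are invented):

```python
# Measured benchmark rates on the target machine (work units per second), plus weights and times.
rates = {"db": 100.0, "web": 50.0, "batch": 10.0}   # r_i' on the target (invented values)
f = {"db": 0.6, "web": 0.3, "batch": 0.1}           # f_i: fraction of workload work per job type
t = {"db": 40.0, "web": 30.0, "batch": 30.0}        # t_i: fixed per-job times (constant time model)

# Constant work model: weighted harmonic mean of the rates.
R_const_work = 1.0 / sum(f[i] / rates[i] for i in rates)

# Constant time model: time-weighted arithmetic mean of the rates.
R_const_time = sum(rates[i] * t[i] for i in rates) / sum(t[i] for i in rates)

print(round(R_const_work, 1), round(R_const_time, 1))   # 45.5 vs. 58.0
```

The gap between the two numbers comes entirely from the weighting: the harmonic mean is dominated by the slow 'batch' rate, the arithmetic mean by the fast 'db' rate.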


Defining the Workload

Who is the benchmark designed for?

Purchaser
• In the best position (theoretically): workload is well known
• In theory, should develop own benchmarks; in practice, often does not
• Problem: matching standard benchmarks to workload

Developer
• Typically uses both internal and standard benchmarks (standard benchmarks are used more often than is admitted)
• Deals with markets (application domains): needs to know the market (the designer's paradox)
• Only needs to satisfy decision makers in the organization


Designer's Paradox

Consider multiple application domains and multiple computer designs

Computer 3 gives best overall performance
• BUT WON'T SELL
• Customers in domain 1 will choose Computer 1, and customers in domain 2 will choose Computer 2

Application Domain   Computer 1 time (sec.)   Computer 2 time (sec.)   Computer 3 time (sec.)
Domain 1             10                       100                      20
Domain 2             100                      10                       20
Total Time           110                      110                      40
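The paradox falls straight out of the table: minimizing total time across domains picks a machine that no single-domain customer would buy. A small sketch of that reasoning, using the numbers above:

```python
# Times (seconds) from the table above: rows are domains, columns are computers.
times = {
    "Domain 1": {"Computer 1": 10, "Computer 2": 100, "Computer 3": 20},
    "Domain 2": {"Computer 1": 100, "Computer 2": 10, "Computer 3": 20},
}

# Each single-domain customer picks the machine fastest for their own domain.
for domain, row in times.items():
    print(domain, "buys", min(row, key=row.get))   # Domain 1 -> Computer 1, Domain 2 -> Computer 2

# A benchmark that sums time across domains picks Computer 3 (total 40 vs. 110 and 110),
# the machine neither customer would actually choose.
totals = {c: sum(times[d][c] for d in times) for c in ["Computer 1", "Computer 2", "Computer 3"]}
print("best total:", min(totals, key=totals.get), totals)
```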


Defining the Workload

Who is the benchmark designed for?

Researcher
• Faces the biggest problems: no market (or all markets); can easily fall prey to the designer's paradox
• Must satisfy anonymous reviewers: often fall prey to conventional wisdom, e.g., "execution-driven" simulation == good, trace-driven simulation == bad


Program Space

Choosing the benchmarks from a program space (art)
• Where the main difference lies among user, designer, researcher

User can use "scenario-based" or "day in the life" choice, e.g., Sysmark, Winstone

Designer can combine multiple scenarios based on marketing input

Researcher has a problem: the set of all programs is not a well-defined space to choose from
• All possible programs? Put them in alphabetical order and choose randomly?
• Modeling may have a role to play (later)


Extracting Benchmarks (scaling)

Cutting real applications down to size
• May change the relative time spent in different parts of the program

Changing the data set can be risky
• Data-dependent optimizations in SW and HW
• Generating a special data set is even riskier

A bigger problem with multi-threaded benchmarks (more later)


Metrics

Constant work model: harmonic mean

Gmean gives equal reward for speeding up all benchmarks
• Easier to speed up programs with more inherent parallelism: the already-fast programs get faster

Hmean gives greater reward for speeding up the slow benchmarks
• Consistent with Amdahl's law
• You can pay for bandwidth: Hmean at a "disadvantage"
• Will become a greater issue with parallel benchmarks

Arithmetic mean gives greater reward for speeding up the already-fast benchmarks
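The next three charts plot how much each summary statistic improves when either the slower or the faster of two benchmarks is sped up by 2x, as a function of their initial speed ratio. A sketch that reproduces the same comparison numerically (two benchmarks with rates 1 and `ratio`; this is my reconstruction of the computation, not the original chart data):

```python
from statistics import geometric_mean, harmonic_mean, mean

def improvement(summary, ratio, speed_up_slower=True, factor=2.0):
    """Ratio of the summary statistic after/before speeding up one of two benchmarks.
    The two benchmarks start with rates 1.0 (slower) and `ratio` (faster)."""
    before = [1.0, ratio]
    after = [1.0 * factor, ratio] if speed_up_slower else [1.0, ratio * factor]
    return summary(after) / summary(before)

for ratio in range(1, 11):
    print(ratio,
          round(improvement(harmonic_mean, ratio, True), 2),    # hmean: rewards speeding up the slow one
          round(improvement(harmonic_mean, ratio, False), 2),
          round(improvement(mean, ratio, True), 2),             # amean: rewards speeding up the fast one
          round(improvement(mean, ratio, False), 2),
          round(improvement(geometric_mean, ratio, True), 2),   # gmean: ~1.41 either way
          round(improvement(geometric_mean, ratio, False), 2))
```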


Reward for Speeding Up Slow Benchmark (Gmean)

[Chart: Gmean improvement (0.0 to 2.0) vs. faster:slower ratio (1 to 10), with one curve for "speed up slower by 2" and one for "speed up faster by 2"]


Reward for Speeding Up Slow Benchmark (Hmean)

[Chart: Hmean improvement (0.0 to 2.0) vs. faster:slower ratio (1 to 10), with one curve for "speed up slower by 2" and one for "speed up faster by 2"]


Reward for Speeding Up Slow Benchmark (Amean)

[Chart: Amean improvement (0.0 to 2.5) vs. faster:slower ratio (1 to 10), with one curve for "speed up slower by 2" and one for "speed up faster by 2"]


Defining "Work"

Some benchmarks (SPEC) work with speedup rather than rate
• This leads to a rather odd "work" metric: that which the base machine can do in a fixed amount of time
• Changing the base machine therefore changes the amount of work

Here is where geometric mean would seem to have an advantage, but it is hiding the lack of good weights

Weights can be adjusted if the baseline changes

How do we solve this?
• Maybe use a non-optimizing compiler for a RISC ISA and count dynamic instructions (or run on a non-pipelined processor??)
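The geometric mean's apparent advantage here is baseline-independence: a geometric mean of speedup ratios ranks machines the same no matter which reference machine is used, while an arithmetic mean can flip. A small sketch of that property (all times invented):

```python
from statistics import geometric_mean, mean

# Invented per-benchmark times (seconds) for two machines and two candidate baselines.
times = {"A": [2.0, 8.0, 5.0], "B": [4.0, 4.0, 4.0]}
baselines = {"base1": [10.0, 10.0, 10.0], "base2": [1.0, 20.0, 10.0]}

for name, base in baselines.items():
    for machine, t in times.items():
        speedups = [b / x for b, x in zip(base, t)]   # per-benchmark speedup vs. the baseline
        print(name, machine,
              "gmean", round(geometric_mean(speedups), 2),
              "amean", round(mean(speedups), 2))
# Gmean ranks A vs. B identically under both baselines (the baseline cancels out of the ratio);
# amean prefers A under base1 but B under base2. The ranking stability says nothing about
# whether the weights reflect any real workload, which is the point of the slide.
```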


Weights

Carefully chosen weights can play an important role in making the benchmark process more scientific

Using unweighted means ignores relative importance in a real workload

Give a set of weights along with the benchmarks
• Ideally give several sets of weights for different application domains

With "scenario-based" benchmarking, weights can be assigned during scenario creation

Application Domain   Benchmark1 weight   Benchmark2 weight   Benchmark3 weight
Domain 1             .5                  .3                  .2
Domain 2             .1                  .6                  .3
Domain 3             .7                  0                   .3
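With weights in hand, the harmonic-mean rate from the constant work model simply picks up the domain-specific weights. A sketch using the table above (the per-benchmark rates are invented):

```python
weights = {   # weights from the table above
    "Domain 1": [0.5, 0.3, 0.2],
    "Domain 2": [0.1, 0.6, 0.3],
    "Domain 3": [0.7, 0.0, 0.3],
}
rates = [120.0, 40.0, 80.0]   # invented per-benchmark rates on some target machine

for domain, f in weights.items():
    # Weighted harmonic mean rate; a zero weight simply contributes nothing to the sum.
    R = 1.0 / sum(fi / r for fi, r in zip(f, rates))
    print(domain, round(R, 1))   # each domain gets its own summary number
```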


Reducing Simulation Time

For the researcher, cycle-accurate simulation can be very time consuming
• Especially for realistic benchmarks
• Need to cut down benchmarks

Sampling
• Simulate only a portion of the benchmark
• May require warm-up
• Difficult to do with the perfect scaling property
• Significant recent study in this direction


Reducing Simulation Time (Sampling)

Random sampling
• More scientifically based
• Relies on the Central Limit Theorem
• Can provide confidence measures

Phase-based sampling
• Analyze program phase behavior to select representative samples
• More of an art… if that
• Phase characterization is not as scientific as one might think
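A minimal sketch of the random-sampling idea: measure per-sample CPI at randomly chosen points, then use the Central Limit Theorem to attach a confidence interval to the estimate. The "trace" here is synthetic, standing in for a run too long to simulate in full:

```python
import random
from statistics import mean, stdev

random.seed(42)

# Synthetic full run: per-sample CPI values, with occasional expensive phases.
full_trace = [1.0 + 0.5 * random.random() + (2.0 if random.random() < 0.05 else 0.0)
              for _ in range(1_000_000)]

n = 200                                    # number of randomly chosen simulation samples
samples = random.sample(full_trace, n)

est = mean(samples)
se = stdev(samples) / n ** 0.5             # standard error of the mean
lo, hi = est - 1.96 * se, est + 1.96 * se  # ~95% confidence interval (normal approximation)

print(f"estimated CPI = {est:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
print(f"true mean CPI = {mean(full_trace):.3f}")
```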


Role of Modeling

Receiving significant research interest

In general, can be used to cut a large space to a small space
• Can be used to select "representative" benchmarks
• Can be used by researchers for subsetting
• Based on quantifiable, key characteristics

Empirical models vs. mechanistic models
• Empirical models "guess" at the significant characteristics
• Mechanistic models derive them
• Modeling superscalar-based computers is not as complex as it seems


Interval Analysis

Superscalar execution can be divided into intervals separated by miss events
• Branch mispredictions
• I-cache misses
• Long d-cache misses
• TLB misses, etc.

[Figure: IPC vs. time, divided into interval 0 through interval 3 by miss events: branch mispredicts, an i-cache miss, and a long d-cache miss]


Branch Misprediction Interval

total time = N/D + cdr + cfe

N = number of instructions in the interval
D = decode/dispatch width
cdr = window drain cycles
cfe = front-end pipeline length

[Figure: timeline of a branch misprediction interval: useful execution (time = N/D), window drain time, branch latency, then front-end pipeline refill (time = pipeline length)]


Overall Performance

Total Cycles = Ntotal/D + ((D-1)/2) * (miL1 + mbr + mL2)
               + miL1 * ciL1
               + mbr * (cdr + cfe)
               + mL2 * (-W/D + clr + cL2)

Ntotal – total number of instructions
D – pipeline decode/issue/retire width
W – window size
miL1 – i-cache misses; ciL1 – i-cache miss latency
mbr – branch mispredictions; cdr – window drain time; cfe – pipeline front-end latency
mL2 – L2 cache misses (non-overlapped); clr – load resolution time; cL2 – L2 cache miss latency
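Written out as code, the model is a handful of multiply-adds. A sketch using the symbols above; the miss counts and latencies plugged in at the bottom are invented, purely to show the shape of the calculation:

```python
def interval_model_cycles(N_total, D, W, m_iL1, c_iL1, m_br, c_dr, c_fe, m_L2, c_lr, c_L2):
    """Total cycles predicted by the interval model on the slide."""
    return (N_total / D
            + ((D - 1) / 2.0) * (m_iL1 + m_br + m_L2)   # partially filled issue groups at interval edges
            + m_iL1 * c_iL1                             # i-cache miss penalty
            + m_br * (c_dr + c_fe)                      # branch misprediction: drain + front-end refill
            + m_L2 * (-W / D + c_lr + c_L2))            # long d-cache miss, less what the window hides

# Invented example numbers, purely illustrative.
cycles = interval_model_cycles(N_total=1_000_000, D=4, W=64,
                               m_iL1=2_000, c_iL1=10,
                               m_br=5_000, c_dr=8, c_fe=12,
                               m_L2=1_000, c_lr=4, c_L2=200)
print(cycles, cycles / 1_000_000)   # total cycles and the implied CPI
```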


Model Accuracy

Average error 3%; worst error 6.5%

[Chart: model error (difference) per benchmark, ranging roughly from -0.08 to 0.08, for bzip2, crafty, vpr, perl, parser, gzip, gap, vortex, gcc, mcf, twolf, and eon]


Modeling Instruction Dependences

Slide a window over the dynamic instruction stream and compute the average critical path, K
• Unit latency: avg. IPC = W/K
• Non-unit latency: avg. IPC = W/(K × avg. latency)

[Figure: a window of W instructions sliding over the dynamic instruction stream]
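A sketch of the critical-path computation for one window position. Each instruction lists the in-window indices of the instructions it depends on; K is the longest dependence chain, and IPC is approximately W/K for unit latencies. The toy window below is invented:

```python
def critical_path_length(deps):
    """Longest dependence chain (in instructions) within one window.

    deps[i] is the list of earlier in-window indices that instruction i depends on."""
    depth = [1] * len(deps)
    for i, srcs in enumerate(deps):        # producers always precede consumers in program order
        for s in srcs:
            depth[i] = max(depth[i], depth[s] + 1)
    return max(depth) if deps else 0

# Toy 8-instruction window: e.g., instruction 3 depends on instructions 1 and 2.
window = [[], [0], [0], [1, 2], [3], [], [5], [4, 6]]
W = len(window)
K = critical_path_length(window)
print(K, W / K)   # K = 5, so unit-latency IPC = 1.6 for this window
```

Averaging K (or W/K) over all window positions in the dynamic stream gives the per-benchmark characteristic the slide refers to.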


Important Workload Characteristics

[Chart: CPI breakdown (0 to 2.5) per benchmark (bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr), split into ideal CPI, L1 I-cache misses, L2 I-cache misses, L2 D-cache misses, and branch mispredictions]


Challenges: Multiprocessors

Multiple cores are with us (and will be pushed hard by manufacturers)

Are SPLASH benchmarks the best we can do?

Problems
• Performance determines the dynamic workload: thread self-scheduling, spinning on locks, barriers
• Sampling: must respect lock structure
• Scaling: must respect Amdahl's law (intra-benchmark)

Get out in front of the problems
• As opposed to what happened with uniprocessors


Challenges: Multiprocessors

Metrics
• Should account for Amdahl's law (inter-benchmark); harmonic mean is even more important than with ILP

Programming models
• Standardized thread libraries
• Automatic parallelization? SPEC CPU has traditionally been a compiler test as well as a CPU test

Consider NAS-type benchmarks
• Benchmarker codes the application


Challenges: Realistic Workloads

Originally SPEC was RISC workstation based
• Unix, compiled C

What about Windows apps?
• No source code
• OS-specific
• BUT used all the time
• Consider Winstone/Sysmark

In our research, we are weaning ourselves off SPEC and toward Winstone/Sysmark; it does make a difference, e.g., the IA-32 EL study

Scenario-based benchmarks
• Include context switches, OS code, etc.
• Essential for sound system architecture research


Conclusions

A scientific benchmarking process should be a foundation
• Program space
• Weights
• Defining "work"
• Scaling
• Metrics

Modeling has a role to play

Many multiprocessor (multi-core) challenges

Scenario-based benchmarking