Benchmarking: Science? Art? Neither?
J. E. Smith
University of Wisconsin-Madison
Copyright (C) 2006 by James E. Smith
01/06 copyright J. E. Smith, 2006
Everyone has an opinion…

Like most people working in research and development, I'm a great benchmark "hobbyist"

My perspective on the benchmark process
• Emphasis on the "scientific" (or non-scientific) aspects
• Accumulated observations and opinions
Benchmarking Process

Steps:
1) Define workload
2) Extract Benchmarks from applications
3) Choose performance metric
4) Execute benchmarks on target machine(s)
5) Project workload performance for target machine(s) and summarize results
The Benchmarking Process (Science)
[Figure: the process as a pipeline (Define Workload, Extract Benchmarks, Run Benchmarks, Project Performance) mapping workload jobs (times t1…t5, work w1…w5) to benchmarks (t1'…t5', w1'…w5') and then to projected values (t̂1…t̂5, ŵ1…ŵ5) on the target machine.]
Extracting Benchmarks

Total work in application environment: W = Σ_i w_i
Total work in benchmarks: W' = Σ_i w_i'
Fraction of work from each job type: f_i = w_i / W
Fraction of work from each benchmark: f_i' = w_i' / W'
Perfect scaling assumption: t_i / t_i' = w_i / w_i'

Perfect scaling is often implicitly assumed, but does not always hold.
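The definitions above can be sketched in a few lines of Python; all numbers are made up for illustration.

```python
# Work fractions and the perfect-scaling check (illustrative numbers only).
w = [40.0, 60.0]          # w_i: work per job type in the real workload
w_b = [4.0, 6.0]          # w_i': work per benchmark extracted from each job type
t = [20.0, 30.0]          # t_i: time per job type on some machine
t_b = [2.0, 3.0]          # t_i': time per benchmark on the same machine

W = sum(w)                # total work in the application environment
W_b = sum(w_b)            # total work in the benchmarks
f = [wi / W for wi in w]          # f_i  = w_i  / W
f_b = [wi / W_b for wi in w_b]    # f_i' = w_i' / W'

# Perfect scaling holds when t_i / t_i' == w_i / w_i' for every i.
scales_perfectly = all(
    abs(ti / tbi - wi / wbi) < 1e-9
    for ti, tbi, wi, wbi in zip(t, t_b, w, w_b)
)
print(f, f_b, scales_perfectly)
```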
Projecting Performance

Define the scale factor: m_i = t_i / t_i' = w_i / w_i' = (f_i W) / (f_i' W')

Constant work model:
• Work done is the same regardless of machine used: ŵ_i = w_i for all i
• Projected time (assuming perfect scaling): T̂ = Σ_i t̂_i = Σ_i t_i' m_i

Constant time model:
• Time spent in each program is the same regardless of machine used: t̂_i = t_i for all i
• Projected work (assuming perfect scaling): Ŵ = Σ_i ŵ_i = Σ_i w_i' (t̂_i / t_i')
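A minimal sketch of the two projection models, again with hypothetical work and time values:

```python
# Scale factors and the two projection models (hypothetical numbers).
w = [40.0, 60.0]       # w_i: workload work per job type
w_b = [4.0, 6.0]       # w_i': benchmark work
t_target = [1.0, 4.0]  # t_i': measured benchmark times on the target machine
t = [10.0, 40.0]       # t_i: workload times held fixed under the constant-time model

m = [wi / wbi for wi, wbi in zip(w, w_b)]   # scale factor m_i = w_i / w_i'

# Constant-work model: keep w_hat_i = w_i, project total time.
T_hat = sum(tbi * mi for tbi, mi in zip(t_target, m))

# Constant-time model: keep t_hat_i = t_i, project total work.
W_hat = sum(wbi * (ti / tbi) for wbi, ti, tbi in zip(w_b, t, t_target))

print(T_hat, W_hat)
```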
Performance Measured as a Rate

Rate = work / time: r_i = w_i / t_i

For the constant work model:
R̂ = Ŵ / T̂ = W / Σ_i (f_i W / r̂_i') = 1 / Σ_i (f_i / r̂_i')
(i.e., a weighted harmonic mean of the rates)

For the constant time model:
R̂ = Σ_i ŵ_i / Σ_i t̂_i = Σ_i r̂_i' t_i / Σ_i t_i
Because the t_i are fixed, this is essentially a weighted arithmetic mean.

What about the geometric mean?
• Neither science nor art
• It has a nice (but mostly useless) mathematical property.
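A small numeric check of the two rate summaries, with made-up fractions, rates, and times:

```python
# Weighted harmonic mean (constant work) vs. weighted arithmetic mean (constant time).
f = [0.4, 0.6]        # f_i: work fractions
r = [2.0, 8.0]        # r_hat_i': measured benchmark rates on the target machine
t = [10.0, 40.0]      # t_i: per-program times under the constant-time model

# Constant work: weighted harmonic mean of the rates.
R_hmean = 1.0 / sum(fi / ri for fi, ri in zip(f, r))

# Constant time: time-weighted arithmetic mean of the rates.
R_amean = sum(ri * ti for ri, ti in zip(r, t)) / sum(t)

print(R_hmean, R_amean)
```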
Defining the Workload

Who is the benchmark designed for?

Purchaser
• In the best position (theoretically): workload is well known
• In theory, should develop own benchmarks; in practice, often does not
• Problem: matching standard benchmarks to the workload

Developer
• Typically uses both internal and standard benchmarks (standard benchmarks more often than is admitted)
• Deals with markets (application domains); needs to know the market: the designer's paradox
• Only needs to satisfy decision makers in the organization
Designer's Paradox

Consider multiple application domains and multiple computer designs.

Computer 3 gives best overall performance
• BUT WON'T SELL
• Customers in domain 1 will choose Computer 1, and customers in domain 2 will choose Computer 2

Application   Computer 1    Computer 2    Computer 3
Domain        time (sec.)   time (sec.)   time (sec.)
Domain 1          10           100            20
Domain 2         100            10            20
Total Time       110           110            40
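The paradox can be checked directly in code: pick the machine with the best total time, then pick each domain's favorite (times copied from the table above):

```python
# Each domain picks its own fastest machine, so the best-overall
# Computer 3 attracts no buyers (times taken from the table).
times = {  # seconds per domain on each computer
    "Computer 1": {"Domain 1": 10, "Domain 2": 100},
    "Computer 2": {"Domain 1": 100, "Domain 2": 10},
    "Computer 3": {"Domain 1": 20, "Domain 2": 20},
}
best_overall = min(times, key=lambda c: sum(times[c].values()))
choice = {d: min(times, key=lambda c: times[c][d]) for d in ("Domain 1", "Domain 2")}
print(best_overall, choice)
```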
Defining the Workload

Who is the benchmark designed for?

Researcher
• Faces the biggest problems: no market (or all markets); can easily fall prey to the designer's paradox
• Must satisfy anonymous reviewers; often falls prey to conventional wisdom, e.g., "execution-driven" simulation == good, trace-driven simulation == bad
Program Space

Choosing the benchmarks from a program space (art)
• This is where the main difference lies among user, designer, and researcher

User can make a "scenario-based" or "day in the life" choice, e.g., Sysmark, Winstone
Designer can combine multiple scenarios based on marketing input
Researcher has a problem: the set of all programs is not a well-defined space to choose from
• All possible programs? Put them in alphabetical order and choose randomly?
• Modeling may have a role to play (later)
Extracting Benchmarks (Scaling)

Cutting real applications down to size may change the relative time spent in different parts of the program
Changing the data set can be risky
• Data-dependent optimizations in SW and HW
• Generating a special data set is even riskier
A bigger problem with multi-threaded benchmarks (more later)
Metrics

Constant work model → harmonic mean

Gmean gives equal reward for speeding up all benchmarks
• It is easier to speed up programs with more inherent parallelism, so the already-fast programs get faster

Hmean gives greater reward for speeding up the slow benchmarks
• Consistent with Amdahl's law
• You can pay for bandwidth, which puts Hmean at a "disadvantage"
• Will become a greater issue with parallel benchmarks

Arithmetic mean gives greater reward for speeding up the already-fast benchmarks
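The reward asymmetry can be reproduced with a tiny experiment: take two benchmarks whose speedups stand in some ratio s:1, double either one, and compare how each mean moves (s = 4 is an arbitrary choice):

```python
from math import sqrt

# Two-benchmark experiment: speedups (s, 1); double either one and
# measure the improvement in each summary mean.
def gmean(a, b): return sqrt(a * b)
def amean(a, b): return (a + b) / 2.0
def hmean(a, b): return 2.0 / (1.0 / a + 1.0 / b)

s = 4.0  # the faster benchmark's speedup; the slower one has speedup 1
results = {}
for mean in (gmean, amean, hmean):
    base = mean(s, 1.0)
    results[mean.__name__] = (
        mean(s, 2.0) / base,      # improvement from doubling the slow benchmark
        mean(2 * s, 1.0) / base,  # improvement from doubling the fast benchmark
    )
for name, (slow, fast) in results.items():
    print(name, round(slow, 3), round(fast, 3))
```

Gmean rewards both changes identically, Hmean favors speeding up the slow benchmark, and Amean favors speeding up the fast one, matching the three plots that follow.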
Reward for Speeding Up Slow Benchmark (Gmean)

[Figure: Gmean improvement (0.00 to 2.00) vs. faster:slower ratio (1 to 10), with curves for "speedup slower by 2" and "speedup faster by 2".]
Reward for Speeding Up Slow Benchmark (Hmean)

[Figure: Hmean improvement (0.00 to 2.00) vs. faster:slower ratio (1 to 10), with curves for "speedup slower by 2" and "speedup faster by 2".]
Reward for Speeding Up Slow Benchmark (Amean)

[Figure: Amean improvement (0.00 to 2.50) vs. faster:slower ratio (1 to 10), with curves for "speedup slower by 2" and "speedup faster by 2".]
Defining "Work"

Some benchmarks (SPEC) work with speedup rather than rate
• This leads to a rather odd "work" metric: that which the base machine can do in a fixed amount of time
• Changing the base machine therefore changes the amount of work

Here is where the geometric mean would seem to have an advantage, but it is hiding the lack of good weights
• Weights can be adjusted if the baseline changes

How do we solve this?
• Maybe use a non-optimizing compiler for a RISC ISA and count dynamic instructions (or run on a non-pipelined processor??)
Weights

Carefully chosen weights can play an important role in making the benchmark process more scientific
Using unweighted means ignores relative importance in a real workload
Give a set of weights along with benchmarks
• Ideally, give several sets of weights for different application domains
With "scenario-based" benchmarking, weights can be assigned during scenario creation

Application   Benchmark 1   Benchmark 2   Benchmark 3
Domain          weight        weight        weight
Domain 1          .5            .3            .2
Domain 2          .1            .6            .3
Domain 3          .7            0             .3
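With the weight table above, per-domain summaries are straightforward; the per-benchmark rates here are invented purely for illustration:

```python
# One weighted-harmonic-mean summary per application domain
# (weights from the table; rates are hypothetical).
weights = {
    "Domain 1": [0.5, 0.3, 0.2],
    "Domain 2": [0.1, 0.6, 0.3],
    "Domain 3": [0.7, 0.0, 0.3],
}
rates = [2.0, 4.0, 1.0]  # hypothetical per-benchmark rates on the target machine

summary = {}
for domain, f in weights.items():
    # Weighted harmonic mean of the rates (constant-work model).
    summary[domain] = 1.0 / sum(fi / ri for fi, ri in zip(f, rates))
print(summary)
```

The same machine ranks differently depending on the domain's weights, which is exactly why a single unweighted number is misleading.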
Reducing Simulation Time

For the researcher, cycle-accurate simulation can be very time consuming
• Especially for realistic benchmarks
• Need to cut down benchmarks

Sampling
• Simulate only a portion of the benchmark
• May require warm-up
• Difficult to do with the perfect scaling property
• Significant recent study in this direction
Reducing Simulation Time (Sampling)

Random sampling
• More scientifically based
• Relies on the Central Limit Theorem
• Can provide confidence measures

Phase-based sampling
• Analyze program phase behavior to select representative samples
• More of an art… if that
• Phase characterization is not as scientific as one might think
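A sketch of the random-sampling idea: draw sampled intervals from a synthetic per-interval CPI population, estimate the mean, and attach a Central Limit Theorem confidence interval (z = 1.96 for 95%):

```python
import random
import statistics

# Synthetic population of per-interval CPI values standing in for a full run.
random.seed(1)
population = [1.0 + 0.5 * random.random() for _ in range(100_000)]

# Simulate only a random sample of intervals and estimate the mean CPI.
sample = random.sample(population, 1000)
mean = statistics.mean(sample)
sem = statistics.stdev(sample) / len(sample) ** 0.5   # standard error of the mean
ci = (mean - 1.96 * sem, mean + 1.96 * sem)           # 95% confidence interval
print(round(mean, 3), tuple(round(x, 3) for x in ci))
```

This confidence measure is exactly what phase-based sampling gives up: a representative sample chosen by phase analysis carries no comparable statistical guarantee.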
Role of Modeling

Receiving significant research interest
In general, can be used to cut a large space down to a small one
• Can be used to select "representative" benchmarks
• Can be used by researchers for subsetting
• Based on quantifiable, key characteristics

Empirical models vs. mechanistic models
• Empirical models "guess" at the significant characteristics
• Mechanistic models derive them
• Modeling superscalar-based computers is not as complex as it seems
Interval Analysis

Superscalar execution can be divided into intervals separated by miss events
• Branch mispredictions
• I-cache misses
• Long D-cache misses
• TLB misses, etc.
[Figure: IPC vs. time, with intervals 0 to 3 separated by branch mispredicts, an i-cache miss, and a long d-cache miss.]
Branch Misprediction Interval

total time = N/D + cdr + cfe

N = number of instructions in the interval
D = decode/dispatch width
cdr = window drain cycles
cfe = front-end pipeline length

[Figure: timeline of the interval, with labeled segments "time = N/D", "time = branch latency", "window drain time", and "time = pipeline length".]
Overall Performance

Total Cycles = Ntotal/D + ((D-1)/2)*(miL1 + mbr + mL2)
             + mic * ciL1
             + mbr * (cdr + cfe)
             + mL2 * (-W/D + clr + cL2)

Ntotal – total number of instructions
D – pipeline decode/issue/retire width
W – window size
mic – i-cache misses; ciL1 – i-cache miss latency
mbr – branch mispredictions; cdr – window drain time
mL2 – L2 cache misses (non-overlapped); cfe – pipeline front-end latency
clr – load resolution time
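The model can be evaluated directly. The counts and latencies below are invented, miL1 is treated as the same i-cache miss count as mic, and cL2 is taken to be the (non-overlapped) L2 miss latency:

```python
# Evaluate the interval-analysis cycle model with hypothetical event counts.
def total_cycles(N_total, D, W, m_ic, m_br, m_L2,
                 c_iL1, c_dr, c_fe, c_lr, c_L2):
    return (N_total / D                               # steady-state issue time
            + ((D - 1) / 2) * (m_ic + m_br + m_L2)    # dispatch ramp per miss event
            + m_ic * c_iL1                            # i-cache miss penalty
            + m_br * (c_dr + c_fe)                    # mispredict: drain + refill
            + m_L2 * (-W / D + c_lr + c_L2))          # long miss, window overlap credit

cycles = total_cycles(N_total=1_000_000, D=4, W=64,
                      m_ic=1000, m_br=5000, m_L2=2000,
                      c_iL1=10, c_dr=3, c_fe=8, c_lr=2, c_L2=200)
print(cycles)
```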
Model Accuracy

Average error: 3%
Worst error: 6.5%

[Figure: per-benchmark error (difference), ranging roughly from -0.08 to +0.08, for bzip2, crafty, vpr, perl, parser, gzip, gap, vortex, gcc, mcf, twolf, and eon.]
Modeling Instruction Dependences

Slide a window over the dynamic instruction stream and compute the average critical path, K
• Unit latency: avg. IPC = W/K
• Non-unit latency: avg. IPC = W/(K × avg. latency)

[Figure: a window of size W sliding over the dynamic instruction stream.]
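A sketch of the sliding-window measure: for each window of W instructions, the critical path is the longest dependence chain whose producers fall inside the window (unit latency; the dependence stream is synthetic):

```python
# Sliding-window critical path over a synthetic dynamic instruction stream.
# deps[i] lists the producers of instruction i; producers outside the
# current window are treated as already complete.
W = 4
deps = {0: [], 1: [0], 2: [1], 3: [], 4: [2], 5: [4], 6: [], 7: [6]}

def window_critical_path(start):
    depth = {}  # longest chain ending at each instruction, within the window
    for i in range(start, start + W):
        in_window = [depth[p] for p in deps[i] if p in depth]
        depth[i] = 1 + max(in_window, default=0)
    return max(depth.values())

paths = [window_critical_path(s) for s in range(0, len(deps) - W + 1)]
K = sum(paths) / len(paths)       # average critical path
print(paths, W / K)               # avg. IPC = W / K (unit latency)
```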
Important Workload Characteristics

[Figure: stacked CPI (0 to 2.5) per benchmark (bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr), broken into ideal CPI plus L1 I-cache misses, L2 I-cache misses, L2 D-cache misses, and branch mispredictions.]
Challenges: Multiprocessors

Multiple cores are with us (and will be pushed hard by manufacturers)
Are the SPLASH benchmarks the best we can do?

Problems
• Performance determines the dynamic workload: thread self-scheduling, spinning on locks, barriers
• Sampling: must respect lock structure
• Scaling: must respect Amdahl's law (intra-benchmark)

Get out in front of the problems
• As opposed to what happened with uniprocessors
Challenges: Multiprocessors

Metrics
• Should account for Amdahl's law (inter-benchmark): the harmonic mean is even more important than with ILP

Programming models
• Standardized thread libraries
• Automatic parallelization? SPEC CPU has traditionally been a compiler test as well as a CPU test

Consider NAS-type benchmarks
• The benchmarker codes the application
Challenges: Realistic Workloads

Originally, SPEC was RISC-workstation based
• Unix, compiled C

What about Windows apps?
• No source code
• OS-specific
• BUT used all the time
• Consider Winstone/Sysmark

In our research, we are weaning ourselves off SPEC and toward Winstone/Sysmark
• It does make a difference, e.g., the IA-32 EL study

Scenario-based benchmarks
• Include context switches, OS code, etc.
• Essential for sound system architecture research
Conclusions

A scientific benchmarking process should be a foundation
• Program space
• Weights
• Defining "work"
• Scaling
• Metrics

Modeling has a role to play
Many multiprocessor (multi-core) challenges
Scenario-based benchmarking