Benchmarking: Science? Art? Neither?
J. E. Smith
University of Wisconsin-Madison
Copyright (C) 2006 by James E. Smith
01/06 copyright J. E. Smith, 2006
Everyone has an opinion…

Like most people working in research and development, I'm a great benchmark "hobbyist"

My perspective on the benchmark process
• Emphasis on the "scientific" (or non-scientific) aspects
• Accumulated observations and opinions
Benchmarking Process

Steps:
1) Define workload
2) Extract Benchmarks from applications
3) Choose performance metric
4) Execute benchmarks on target machine(s)
5) Project workload performance for target machine(s) and summarize results
The Benchmarking Process (Science)
[Figure: the process as a pipeline (Define Workload, Extract Benchmarks, Run Benchmarks, Project Performance) mapping workload jobs (times t1…t5, work w1…w5) to benchmarks (t1'…t5', w1'…w5') and then to projected values (t̂1…t̂5, ŵ1…ŵ5) on the target machine.]
Extracting Benchmarks

Total work in application environment: W = Σ_i w_i
Total work in benchmarks: W' = Σ_i w_i'
Fraction of work from each job type: f_i = w_i / W
Fraction of work from each benchmark: f_i' = w_i' / W'
Perfect scaling assumption: t_i / t_i' = w_i / w_i'

Perfect scaling is often implicitly assumed, but does not always hold.
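The definitions above can be sketched in a few lines of Python; all numbers are made up for illustration.

```python
# Work fractions and the perfect-scaling check (illustrative numbers only).
w = [40.0, 60.0]          # w_i: work per job type in the real workload
w_b = [4.0, 6.0]          # w_i': work per benchmark extracted from each job type
t = [20.0, 30.0]          # t_i: time per job type on some machine
t_b = [2.0, 3.0]          # t_i': time per benchmark on the same machine

W = sum(w)                # total work in the application environment
W_b = sum(w_b)            # total work in the benchmarks
f = [wi / W for wi in w]          # f_i  = w_i  / W
f_b = [wi / W_b for wi in w_b]    # f_i' = w_i' / W'

# Perfect scaling holds when t_i / t_i' == w_i / w_i' for every i.
scales_perfectly = all(
    abs(ti / tbi - wi / wbi) < 1e-9
    for ti, tbi, wi, wbi in zip(t, t_b, w, w_b)
)
print(f, f_b, scales_perfectly)
```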
Projecting Performance

Define the scale factor: m_i = t_i / t_i' = w_i / w_i' = (f_i W) / (f_i' W')

Constant work model:
• Work done is the same regardless of machine used: ŵ_i = w_i for all i
• Projected time (assuming perfect scaling): T̂ = Σ_i t̂_i = Σ_i t_i' m_i

Constant time model:
• Time spent in each program is the same regardless of machine used: t̂_i = t_i for all i
• Projected work (assuming perfect scaling): Ŵ = Σ_i ŵ_i = Σ_i w_i' (t̂_i / t_i')
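A minimal sketch of the two projection models, again with hypothetical work and time values:

```python
# Scale factors and the two projection models (hypothetical numbers).
w = [40.0, 60.0]       # w_i: workload work per job type
w_b = [4.0, 6.0]       # w_i': benchmark work
t_target = [1.0, 4.0]  # t_i': measured benchmark times on the target machine
t = [10.0, 40.0]       # t_i: workload times held fixed under the constant-time model

m = [wi / wbi for wi, wbi in zip(w, w_b)]   # scale factor m_i = w_i / w_i'

# Constant-work model: keep w_hat_i = w_i, project total time.
T_hat = sum(tbi * mi for tbi, mi in zip(t_target, m))

# Constant-time model: keep t_hat_i = t_i, project total work.
W_hat = sum(wbi * (ti / tbi) for wbi, ti, tbi in zip(w_b, t, t_target))

print(T_hat, W_hat)
```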
Performance Measured as a Rate

Rate = work / time: r_i = w_i / t_i

For the constant work model:
R̂ = Ŵ / T̂ = W / Σ_i (f_i W / r̂_i') = 1 / Σ_i (f_i / r̂_i')
(i.e., a weighted harmonic mean of the rates)

For the constant time model:
R̂ = Σ_i ŵ_i / Σ_i t̂_i = Σ_i r̂_i' t_i / Σ_i t_i
Because the t_i are fixed, this is essentially a weighted arithmetic mean.

What about the geometric mean?
• Neither science nor art
• It has a nice (but mostly useless) mathematical property.
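A small numeric check of the two rate summaries, with made-up fractions, rates, and times:

```python
# Weighted harmonic mean (constant work) vs. weighted arithmetic mean (constant time).
f = [0.4, 0.6]        # f_i: work fractions
r = [2.0, 8.0]        # r_hat_i': measured benchmark rates on the target machine
t = [10.0, 40.0]      # t_i: per-program times under the constant-time model

# Constant work: weighted harmonic mean of the rates.
R_hmean = 1.0 / sum(fi / ri for fi, ri in zip(f, r))

# Constant time: time-weighted arithmetic mean of the rates.
R_amean = sum(ri * ti for ri, ti in zip(r, t)) / sum(t)

print(R_hmean, R_amean)
```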
Defining the Workload

Who is the benchmark designed for?

Purchaser
• In the best position (theoretically): workload is well known
• In theory, should develop own benchmarks; in practice, often does not
• Problem: matching standard benchmarks to the workload

Developer
• Typically uses both internal and standard benchmarks (standard benchmarks more often than is admitted)
• Deals with markets (application domains); needs to know the market: the designer's paradox
• Only needs to satisfy decision makers in the organization
Designer's Paradox

Consider multiple application domains and multiple computer designs.

Computer 3 gives best overall performance
• BUT WON'T SELL
• Customers in domain 1 will choose Computer 1, and customers in domain 2 will choose Computer 2

Application   Computer 1    Computer 2    Computer 3
Domain        time (sec.)   time (sec.)   time (sec.)
Domain 1          10           100            20
Domain 2         100            10            20
Total Time       110           110            40
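The paradox can be checked directly in code: pick the machine with the best total time, then pick each domain's favorite (times copied from the table above):

```python
# Each domain picks its own fastest machine, so the best-overall
# Computer 3 attracts no buyers (times taken from the table).
times = {  # seconds per domain on each computer
    "Computer 1": {"Domain 1": 10, "Domain 2": 100},
    "Computer 2": {"Domain 1": 100, "Domain 2": 10},
    "Computer 3": {"Domain 1": 20, "Domain 2": 20},
}
best_overall = min(times, key=lambda c: sum(times[c].values()))
choice = {d: min(times, key=lambda c: times[c][d]) for d in ("Domain 1", "Domain 2")}
print(best_overall, choice)
```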
Defining the Workload

Who is the benchmark designed for?

Researcher
• Faces the biggest problems: no market (or all markets); can easily fall prey to the designer's paradox
• Must satisfy anonymous reviewers; often falls prey to conventional wisdom, e.g., "execution-driven" simulation == good, trace-driven simulation == bad
Program Space

Choosing the benchmarks from a program space (art)
• This is where the main difference lies among user, designer, and researcher

User can make a "scenario-based" or "day in the life" choice, e.g., Sysmark, Winstone
Designer can combine multiple scenarios based on marketing input
Researcher has a problem: the set of all programs is not a well-defined space to choose from
• All possible programs? Put them in alphabetical order and choose randomly?
• Modeling may have a role to play (later)
Extracting Benchmarks (Scaling)

Cutting real applications down to size may change the relative time spent in different parts of the program
Changing the data set can be risky
• Data-dependent optimizations in SW and HW
• Generating a special data set is even riskier
A bigger problem with multi-threaded benchmarks (more later)
Metrics

Constant work model → harmonic mean

Gmean gives equal reward for speeding up all benchmarks
• It is easier to speed up programs with more inherent parallelism, so the already-fast programs get faster

Hmean gives greater reward for speeding up the slow benchmarks
• Consistent with Amdahl's law
• You can pay for bandwidth, which puts Hmean at a "disadvantage"
• Will become a greater issue with parallel benchmarks

Arithmetic mean gives greater reward for speeding up the already-fast benchmarks
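The reward asymmetry can be reproduced with a tiny experiment: take two benchmarks whose speedups stand in some ratio s:1, double either one, and compare how each mean moves (s = 4 is an arbitrary choice):

```python
from math import sqrt

# Two-benchmark experiment: speedups (s, 1); double either one and
# measure the improvement in each summary mean.
def gmean(a, b): return sqrt(a * b)
def amean(a, b): return (a + b) / 2.0
def hmean(a, b): return 2.0 / (1.0 / a + 1.0 / b)

s = 4.0  # the faster benchmark's speedup; the slower one has speedup 1
results = {}
for mean in (gmean, amean, hmean):
    base = mean(s, 1.0)
    results[mean.__name__] = (
        mean(s, 2.0) / base,      # improvement from doubling the slow benchmark
        mean(2 * s, 1.0) / base,  # improvement from doubling the fast benchmark
    )
for name, (slow, fast) in results.items():
    print(name, round(slow, 3), round(fast, 3))
```

Gmean rewards both changes identically, Hmean favors speeding up the slow benchmark, and Amean favors speeding up the fast one, matching the three plots that follow.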
Reward for Speeding Up Slow Benchmark (Gmean)

[Figure: Gmean improvement (0.00 to 2.00) vs. faster:slower ratio (1 to 10), with curves for "speedup slower by 2" and "speedup faster by 2".]
Reward for Speeding Up Slow Benchmark (Hmean)

[Figure: Hmean improvement (0.00 to 2.00) vs. faster:slower ratio (1 to 10), with curves for "speedup slower by 2" and "speedup faster by 2".]
Reward for Speeding Up Slow Benchmark (Amean)

[Figure: Amean improvement (0.00 to 2.50) vs. faster:slower ratio (1 to 10), with curves for "speedup slower by 2" and "speedup faster by 2".]
Defining "Work"

Some benchmarks (SPEC) work with speedup rather than rate
• This leads to a rather odd "work" metric: that which the base machine can do in a fixed amount of time
• Changing the base machine therefore changes the amount of work

Here is where the geometric mean would seem to have an advantage, but it is hiding the lack of good weights
• Weights can be adjusted if the baseline changes

How do we solve this?
• Maybe use a non-optimizing compiler for a RISC ISA and count dynamic instructions (or run on a non-pipelined processor??)
Weights

Carefully chosen weights can play an important role in making the benchmark process more scientific
Using unweighted means ignores relative importance in a real workload
Give a set of weights along with benchmarks
• Ideally, give several sets of weights for different application domains
With "scenario-based" benchmarking, weights can be assigned during scenario creation

Application   Benchmark 1   Benchmark 2   Benchmark 3
Domain          weight        weight        weight
Domain 1          .5            .3            .2
Domain 2          .1            .6            .3
Domain 3          .7            0             .3
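With the weight table above, per-domain summaries are straightforward; the per-benchmark rates here are invented purely for illustration:

```python
# One weighted-harmonic-mean summary per application domain
# (weights from the table; rates are hypothetical).
weights = {
    "Domain 1": [0.5, 0.3, 0.2],
    "Domain 2": [0.1, 0.6, 0.3],
    "Domain 3": [0.7, 0.0, 0.3],
}
rates = [2.0, 4.0, 1.0]  # hypothetical per-benchmark rates on the target machine

summary = {}
for domain, f in weights.items():
    # Weighted harmonic mean of the rates (constant-work model).
    summary[domain] = 1.0 / sum(fi / ri for fi, ri in zip(f, rates))
print(summary)
```

The same machine ranks differently depending on the domain's weights, which is exactly why a single unweighted number is misleading.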
Reducing Simulation Time

For the researcher, cycle-accurate simulation can be very time consuming
• Especially for realistic benchmarks
• Need to cut down benchmarks

Sampling
• Simulate only a portion of the benchmark
• May require warm-up
• Difficult to do with the perfect scaling property
• Significant recent study in this direction
Reducing Simulation Time (Sampling)

Random sampling
• More scientifically based
• Relies on the Central Limit Theorem
• Can provide confidence measures

Phase-based sampling
• Analyze program phase behavior to select representative samples
• More of an art… if that
• Phase characterization is not as scientific as one might think
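A sketch of the random-sampling idea: draw sampled intervals from a synthetic per-interval CPI population, estimate the mean, and attach a Central Limit Theorem confidence interval (z = 1.96 for 95%):

```python
import random
import statistics

# Synthetic population of per-interval CPI values standing in for a full run.
random.seed(1)
population = [1.0 + 0.5 * random.random() for _ in range(100_000)]

# Simulate only a random sample of intervals and estimate the mean CPI.
sample = random.sample(population, 1000)
mean = statistics.mean(sample)
sem = statistics.stdev(sample) / len(sample) ** 0.5   # standard error of the mean
ci = (mean - 1.96 * sem, mean + 1.96 * sem)           # 95% confidence interval
print(round(mean, 3), tuple(round(x, 3) for x in ci))
```

This confidence measure is exactly what phase-based sampling gives up: a representative sample chosen by phase analysis carries no comparable statistical guarantee.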
Role of Modeling

Receiving significant research interest
In general, can be used to cut a large space down to a small one
• Can be used to select "representative" benchmarks
• Can be used by researchers for subsetting
• Based on quantifiable, key characteristics

Empirical models vs. mechanistic models
• Empirical models "guess" at the significant characteristics
• Mechanistic models derive them
• Modeling superscalar-based computers is not as complex as it seems
Interval Analysis

Superscalar execution can be divided into intervals separated by miss events
• Branch mispredictions
• I-cache misses
• Long D-cache misses
• TLB misses, etc.
[Figure: IPC vs. time, with intervals 0 to 3 separated by branch mispredicts, an i-cache miss, and a long d-cache miss.]
Branch Misprediction Interval

total time = N/D + cdr + cfe

N = number of instructions in the interval
D = decode/dispatch width
cdr = window drain cycles
cfe = front-end pipeline length

[Figure: timeline of the interval, with labeled segments "time = N/D", "time = branch latency", "window drain time", and "time = pipeline length".]
Overall Performance

Total Cycles = Ntotal/D + ((D-1)/2)*(miL1 + mbr + mL2)
             + mic * ciL1
             + mbr * (cdr + cfe)
             + mL2 * (-W/D + clr + cL2)

Ntotal – total number of instructions
D – pipeline decode/issue/retire width
W – window size
mic – i-cache misses; ciL1 – i-cache miss latency
mbr – branch mispredictions; cdr – window drain time
mL2 – L2 cache misses (non-overlapped); cfe – pipeline front-end latency
clr – load resolution time
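The model can be evaluated directly. The counts and latencies below are invented, miL1 is treated as the same i-cache miss count as mic, and cL2 is taken to be the (non-overlapped) L2 miss latency:

```python
# Evaluate the interval-analysis cycle model with hypothetical event counts.
def total_cycles(N_total, D, W, m_ic, m_br, m_L2,
                 c_iL1, c_dr, c_fe, c_lr, c_L2):
    return (N_total / D                               # steady-state issue time
            + ((D - 1) / 2) * (m_ic + m_br + m_L2)    # dispatch ramp per miss event
            + m_ic * c_iL1                            # i-cache miss penalty
            + m_br * (c_dr + c_fe)                    # mispredict: drain + refill
            + m_L2 * (-W / D + c_lr + c_L2))          # long miss, window overlap credit

cycles = total_cycles(N_total=1_000_000, D=4, W=64,
                      m_ic=1000, m_br=5000, m_L2=2000,
                      c_iL1=10, c_dr=3, c_fe=8, c_lr=2, c_L2=200)
print(cycles)
```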
Model Accuracy

Average error: 3%
Worst error: 6.5%

[Figure: per-benchmark error (difference), ranging roughly from -0.08 to +0.08, for bzip2, crafty, vpr, perl, parser, gzip, gap, vortex, gcc, mcf, twolf, and eon.]
Modeling Instruction Dependences

Slide a window over the dynamic instruction stream and compute the average critical path, K
• Unit latency: avg. IPC = W/K
• Non-unit latency: avg. IPC = W/(K × avg. latency)

[Figure: a window of size W sliding over the dynamic instruction stream.]
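A sketch of the sliding-window measure: for each window of W instructions, the critical path is the longest dependence chain whose producers fall inside the window (unit latency; the dependence stream is synthetic):

```python
# Sliding-window critical path over a synthetic dynamic instruction stream.
# deps[i] lists the producers of instruction i; producers outside the
# current window are treated as already complete.
W = 4
deps = {0: [], 1: [0], 2: [1], 3: [], 4: [2], 5: [4], 6: [], 7: [6]}

def window_critical_path(start):
    depth = {}  # longest chain ending at each instruction, within the window
    for i in range(start, start + W):
        in_window = [depth[p] for p in deps[i] if p in depth]
        depth[i] = 1 + max(in_window, default=0)
    return max(depth.values())

paths = [window_critical_path(s) for s in range(0, len(deps) - W + 1)]
K = sum(paths) / len(paths)       # average critical path
print(paths, W / K)               # avg. IPC = W / K (unit latency)
```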
Important Workload Characteristics

[Figure: stacked CPI (0 to 2.5) per benchmark (bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr), broken into ideal CPI plus L1 I-cache misses, L2 I-cache misses, L2 D-cache misses, and branch mispredictions.]
Challenges: Multiprocessors

Multiple cores are with us (and will be pushed hard by manufacturers)
Are the SPLASH benchmarks the best we can do?

Problems
• Performance determines the dynamic workload: thread self-scheduling, spinning on locks, barriers
• Sampling: must respect lock structure
• Scaling: must respect Amdahl's law (intra-benchmark)

Get out in front of the problems
• As opposed to what happened with uniprocessors
Challenges: Multiprocessors

Metrics
• Should account for Amdahl's law (inter-benchmark): the harmonic mean is even more important than with ILP

Programming models
• Standardized thread libraries
• Automatic parallelization? SPEC CPU has traditionally been a compiler test as well as a CPU test

Consider NAS-type benchmarks
• The benchmarker codes the application
Challenges: Realistic Workloads

Originally, SPEC was RISC-workstation based
• Unix, compiled C

What about Windows apps?
• No source code
• OS-specific
• BUT used all the time
• Consider Winstone/Sysmark

In our research, we are weaning ourselves off SPEC and toward Winstone/Sysmark
• It does make a difference, e.g., the IA-32 EL study

Scenario-based benchmarks
• Include context switches, OS code, etc.
• Essential for sound system architecture research
Conclusions

A scientific benchmarking process should be a foundation
• Program space
• Weights
• Defining "work"
• Scaling
• Metrics

Modeling has a role to play
Many multiprocessor (multi-core) challenges
Scenario-based benchmarking