David J. Lilja, Department of Electrical and Computer Engineering, University of Minnesota

Description: Transcript of a talk by David J. Lilja, Department of Electrical and Computer Engineering, University of Minnesota.

Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture
(Plus a Few Thoughts on Simulation Methodology)

David J. Lilja
Department of Electrical and Computer Engineering
University of Minnesota
[email protected]
Acknowledgements

Graduate students (who did the real work):
- Ying Chen
- Resit Sendag
- Joshua Yi

Faculty collaborator:
- Douglas Hawkins (School of Statistics)

Funders:
- National Science Foundation
- IBM
- HP/Compaq
- Minnesota Supercomputing Institute
Problem #1

Speculative execution is becoming more popular:
- Branch prediction
- Value prediction
- Speculative multithreading

It offers potentially higher performance. But what about its impact on the memory system?
- Does it pollute the cache/memory hierarchy?
- Does it lead to more misses?
Problem #2

Computer architecture research relies on simulation.

Simulation is slow:
- Years to simulate the SPEC CPU2000 benchmarks

Simulation can be wildly inaccurate:
- Did I really mean to build that system?

Results are difficult to reproduce. We need statistical rigor.
Outline (Part 1)

- The Superthreaded Architecture
- The Wrong Execution Cache (WEC)
- Experimental Methodology
- Performance of the WEC

[Chen, Sendag, Lilja, IPDPS, 2003]
Hard-to-Parallelize Applications

- Early exit loops
- Pointers and aliases
- Complex branching behaviors
- Small basic blocks
- Small loop counts

→ Hard to parallelize with conventional techniques.
Introduce "Maybe" Dependences

Data dependence? Pointer aliasing?
- Yes
- No
- Maybe

"Maybe" allows aggressive compiler optimizations:
- When in doubt, parallelize

A run-time check corrects wrong assumptions.
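The run-time check for a "maybe" dependence can be sketched as follows. This is a deliberately simplified model, not the superthreaded hardware (which uses a per-thread-unit dependence buffer and target-store forwarding); the names `spec_load` and `check_store` are invented for illustration:

```c
#include <assert.h>

#define MAX_SPEC 64

/* Addresses read speculatively by a successor thread under a
 * "maybe" (no-dependence) assumption. */
static unsigned long spec_loads[MAX_SPEC];
static int n_spec;

/* Record a speculative load issued before the dependence is resolved. */
void spec_load(unsigned long addr) {
    spec_loads[n_spec++] = addr;
}

/* Predecessor-thread store: if its address matches a recorded speculative
 * load, the "maybe" dependence was real and the successor thread must be
 * squashed and re-executed. Returns 1 on a violation. */
int check_store(unsigned long addr) {
    for (int i = 0; i < n_spec; i++)
        if (spec_loads[i] == addr)
            return 1;
    return 0;
}
```

When `check_store` reports a violation, the compiler's optimistic parallelization is undone at run time, which is exactly what makes the "when in doubt, parallelize" policy safe.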
Thread Pipelining Execution Model

Threads i, i+1, and i+2 execute in a pipelined fashion. Each thread forks its successor and proceeds through four stages, with sync points between adjacent threads:

- CONTINUATION: values needed to fork the next thread
- TARGET STORE: forward addresses of maybe dependences
- COMPUTATION: forward addresses and computed data as needed
- WRITE-BACK
The Superthreaded Architecture

[Figure: four superscalar cores (thread units) sharing an instruction cache and a data cache. Each core contains an execution unit, a dependence buffer, registers, a PC, and a communication unit linking the cores.]
Wrong Path Execution Within a Superscalar Core

[Figure: a branch tree containing loads Ld A through Ld E. The predicted path (WP) is executed speculatively; when the prediction result turns out to be wrong, execution resumes on the correct path (CP). Loads already issued down the mispredicted path constitute wrong-path execution; loads further down that path were not ready to be executed.]
Wrong Thread Execution

[Figure: thread units TU0-TU3 forking and aborting (FORK ... ABORT) across two parallel regions separated by a sequential region.]

- In a parallel region, speculatively forked successor threads that turn out to be unneeded are marked as wrong threads (WTH).
- All wrong threads from the previous parallel region are killed when the next parallel region begins.
- A wrong thread can also kill itself.
How Could Wrong Thread Execution Help Improve Performance?

Consider a parallelized nested loop:

    for (i = 0; i < 10; i++) {
        ...
        for (j = 0; j < i; j++) {
            ...
            x = y[j];
            ...
        }
        ...
    }

When i=4, the inner loop runs j=0,1,2,3, so thread units TU1-TU4 load y[0], y[1], y[2], y[3]; wrong threads speculatively run ahead and touch y[4], y[5], ...

When i=5, the inner loop runs j=0,...,4 and needs y[0], ..., y[4]; the blocks the wrong threads already fetched (y[4], y[5], ...) are close to the processor when the correct threads ask for them.
Operation of the WEC

Wrong execution (a wrong-path or wrong-thread load):
- On an L1 data cache hit, do nothing special.
- On an L1 data cache miss, check the WEC:
  - WEC miss: bring the block from the next level of memory into the WEC.
  - WEC hit: update the LRU information for the WEC.

Correct execution:
- On an L1 data cache hit, update the LRU information for the L1 data cache.
- On an L1 data cache miss, check the WEC:
  - WEC hit: swap the victim block and the WEC block, prefetch the next line into the WEC, and update the LRU information for the L1 data cache.
  - WEC miss: bring the block from the next level of memory into the L1 data cache, and put the victim block into the WEC.
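The decision flow above can be modeled, very roughly, with small fully associative tag arrays. This is a toy sketch under simplifying assumptions (one-word blocks, tag 0 reserved to mean an empty slot, invented helper names like `wec_access`), not the hardware implementation:

```c
#include <string.h>

#define NBLOCKS 8   /* the WEC in the talk also holds 8 entries */

static unsigned long l1[NBLOCKS], wec[NBLOCKS];   /* tag arrays; 0 = empty */

static int in(const unsigned long *c, unsigned long tag) {
    for (int i = 0; i < NBLOCKS; i++)
        if (c[i] == tag) return 1;
    return 0;
}

/* Insert at the MRU slot; return the evicted LRU tag (0 if none). */
static unsigned long put(unsigned long *c, unsigned long tag) {
    unsigned long victim = c[NBLOCKS - 1];
    memmove(c + 1, c, (NBLOCKS - 1) * sizeof *c);
    c[0] = tag;
    return victim;
}

static void del(unsigned long *c, unsigned long tag) {
    for (int i = 0; i < NBLOCKS; i++)
        if (c[i] == tag) c[i] = 0;
}

void wec_access(int wrong, unsigned long tag) {
    if (wrong) {                        /* wrong-path / wrong-thread load */
        if (!in(l1, tag)) {
            if (!in(wec, tag))
                put(wec, tag);          /* fill the WEC, never the L1 */
            /* on a WEC hit the real design just updates WEC LRU state */
        }
    } else {                            /* correct execution */
        if (in(l1, tag)) return;        /* L1 hit: only LRU bookkeeping */
        unsigned long victim;
        if (in(wec, tag)) {             /* WEC hit: swap the block into L1 */
            del(wec, tag);
            victim = put(l1, tag);
            if (victim) put(wec, victim);
            put(wec, tag + 1);          /* prefetch the next line into WEC */
        } else {                        /* miss everywhere */
            victim = put(l1, tag);
            if (victim) put(wec, victim);  /* the victim goes to the WEC */
        }
    }
}
```

The key property the sketch preserves is the one on the slide: wrongly executed loads are serviced, but their blocks land in the WEC rather than polluting the L1 data cache, and correct execution can later promote them into the L1.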
Processor Configurations for Simulations

Simulated with SIMCA (the SIMulator for the Superthreaded Architecture).

Features: baseline (orig), wrong path (wp), wrong thread (wth), wrong execution cache (wec), prefetch into WEC, victim cache (vc), next-line prefetch (nlp).

Configurations tested: orig, vc, wth-wp-vc, wth-wp-wec, nlp.
Parameters for Each Thread Unit

Issue rate 8 instrs/cycle per thread unit
branch target buffer 4-way associative, 1024 entries
speculative memory buffer fully associative, 128 entries
round-trip memory latency 200 cycles
fork delay 4 cycles
unidirectional communication ring 2 requests/cycle bandwidth
Load/store queue 64 entries
Reorder buffer 64 entries
INT ALU, INT multiply/divide units 8, 4
FP adders, FP multiply/divide units 4, 4
WEC 8 entries (same block size as L1 cache), fully associative
L1 data cache distributed, 8KB, 1-way associative, block size of 64 bytes
L1 instruction caches distributed, 32KB, 2-way associative, block size of 64 bytes
L2 cache unified, 512KB, 4-way associative, block size of 128 bytes
Characteristics of the Parallelized SPEC 2000 Benchmarks

Benchmark    Type  Input set          Fraction parallelized
175.vpr      INT   SPEC test                  8.6%
164.gzip     INT   MinneSPEC large           15.7%
181.mcf      INT   MinneSPEC large           36.1%
197.parser   INT   MinneSPEC medium          17.2%
183.equake   FP    MinneSPEC large           21.3%
177.mesa     FP    SPEC test                 17.3%

Parallelization used loop coalescing, loop unrolling, and statement reordering to increase overlap.
Performance of the Superthreaded Architecture for the Parallelized Portions of the Benchmarks

[Chart: speedup (0 to 14) for configurations with 1, 2, 4, 8, and 16 thread units (TUs).]

Baseline configuration (resources per thread unit for each TU count):

Parameter                base  1TU  2TU  4TU  8TU  16TU
Issue rate                 1    16    8    4    2    1
Reorder buffer size        8   128   64   32   16    8
INT ALU                    1    16    8    4    2    1
INT MULT                   1     8    4    2    1    1
FP ALU                     1    16    8    4    2    1
FP MULT                    1     8    4    2    1    1
L1 data cache size (KB)    2    32   16    8    4    2
Performance of the wth-wp-wec Configuration on Top of the Parallel Execution

[Chart: relative speedup (%), ranging from -10 to 40, for 175.vpr, 164.gzip, 181.mcf, 197.parser, 183.equake, 177.mesa, and the average. Series: 2TU, 4TU, 8TU, and 16TU orig; 1TU, 2TU, 4TU, 8TU, and 16TU wth-wp-wec.]
Performance Improvements Due to the WEC

[Chart: relative speedup (%), ranging from -1 to 19, for the vc, wth-wp-vc, wth-wp-wec, and nlp configurations.]
Sensitivity to L1 Data Cache Size

[Chart: relative speedup (%), 0 to 25, for orig 32K, wth-wp-wec 4K, and wth-wp-wec 32K.]
Sensitivity to WEC Size Compared to a Victim Cache

[Chart: relative speedup (%), 0 to 25, for wth-wp-vc with 4 and 16 entries and wth-wp-wec with 4 and 16 entries.]
Sensitivity to WEC Size Compared to Next-Line Prefetching (NLP)

[Chart: relative speedup (%), 0 to 24, for nlp with 8 and 32 entries and wth-wp-wec with 8 and 32 entries.]
Additional Loads and Reduction of Misses

[Chart: percentages, 0 to 80, comparing the additional loads against the reduction of misses for each benchmark.]
Conclusions for the WEC

Allow loads to continue executing even after they are known to be incorrectly issued:
- Do not let them change state

Benefits:
- 45.5% average reduction in the number of misses
- 9.7% average improvement on top of parallel execution
- 4% average improvement over a victim cache
- 5.6% average improvement over next-line prefetching

Cost:
- 14% additional loads
- Minor hardware complexity
Typical Computer Architecture Study

1. Find an interesting problem/performance bottleneck (e.g., memory delays).
2. Invent a clever idea for solving it. This is the hard part.
3. Implement the idea in a processor/system simulator. This is the part grad students usually like best.
4. Run simulations on n "standard" benchmark programs. This is time-consuming and boring.
5. Compare performance with and without your change: execution time, clocks per instruction (CPI), etc.
Problem #2 – Simulation in Computer Architecture Research

Simulators are an important tool for computer architecture research and design:
- Low cost
- Faster than building a new system
- Very flexible
Performance Evaluation Techniques Used in ISCA Papers

[Chart: for 1973, 1985, 1993, 1997, and 2001, the percentage of papers (0-100%) using each evaluation technique: measurement, modeling, SimpleScalar simulation, other simulation, and other.]

* Some papers used more than one evaluation technique.
Simulation is Very Popular, But …

Current simulation methodology is not:
- Formal
- Rigorous
- Statistically based

There are never enough simulations:
- We design a new processor based on a few seconds of actual execution time
- What are the benchmark programs really exercising?
An Example: Sensitivity Analysis

- Which parameters should be varied? Which should be fixed?
- What range of values should be used for each variable parameter?
- What values should be used for the constant parameters?
- Are there interactions between variable and fixed parameters?
- What is the magnitude of those interactions?
Let’s Introduce Some Statistical Rigor

Decreases the number of errors in:
- Modeling
- Implementation
- Set up
- Analysis

Helps find errors more quickly. Provides greater insight:
- Into the processor
- Into the effects of an enhancement

Provides objective confidence in results, and statistical support for conclusions.
Outline (Part 2)

A statistical technique for:
- Examining the overall impact of an architectural change
- Classifying benchmark programs
- Ranking the importance of processor/simulation parameters
- Reducing the total number of simulation runs

[Yi, Lilja, Hawkins, HPCA, 2003]
A Technique to Limit the Number of Simulations

Plackett and Burman designs (1946):
- Multifactorial designs
- Originally proposed for mechanical assemblies

Estimates the effects of main factors only:
- Logically minimal number of experiments to estimate the effects of m input parameters (factors)
- Ignores interactions

Requires O(m) experiments:
- Instead of O(2^m) or O(v^m)
Plackett and Burman Designs

- PB designs exist only in sizes that are multiples of 4
- Requires X experiments for m parameters
  - X = next multiple of 4 > m
- PB design matrix:
  - Rows = configurations
  - Columns = parameters' values in each configuration
    - High/low = +1/-1
  - First row = from the P&B paper
  - Subsequent rows = circular right shift of the preceding row
  - Last row = all -1
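The row-construction rules can be sketched in code. This is a minimal illustration for X = 8 runs and m = 7 parameters; `build_pb_matrix` is an invented name, and the seed row is the one used in the example matrix that follows:

```c
/* Plackett-Burman design-matrix construction: seed row, circular right
 * shifts, and an all -1 final row. */
#define X 8   /* number of configurations (a multiple of 4) */
#define M 7   /* number of parameters (X - 1) */

void build_pb_matrix(int design[X][M]) {
    /* First row: taken from the P&B paper (here, the X = 8 seed) */
    int first[M] = { +1, +1, +1, -1, +1, -1, -1 };
    for (int j = 0; j < M; j++)
        design[0][j] = first[j];

    /* Subsequent rows: circular right shift of the preceding row */
    for (int i = 1; i < X - 1; i++)
        for (int j = 0; j < M; j++)
            design[i][j] = design[i - 1][(j + M - 1) % M];

    /* Last row: all -1 (every parameter at its low value) */
    for (int j = 0; j < M; j++)
        design[X - 1][j] = -1;
}
```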
PB Design Matrix

Config    A    B    C    D    E    F    G    Response
  1      +1   +1   +1   -1   +1   -1   -1       9
  2      -1   +1   +1   +1   -1   +1   -1      11
  3      -1   -1   +1   +1   +1   -1   +1       2
  4      +1   -1   -1   +1   +1   +1   -1       1
  5      -1   +1   -1   -1   +1   +1   +1       9
  6      +1   -1   +1   -1   -1   +1   +1      74
  7      +1   +1   -1   +1   -1   -1   +1       7
  8      -1   -1   -1   -1   -1   -1   -1       4

Effect   65  -45   75  -75  -75   73   67

(The effect of a parameter is the sum of the responses weighted by that parameter's +1/-1 value in each configuration; e.g., for A: 9 - 11 - 2 + 1 - 9 + 74 + 7 - 4 = 65.)
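The Effect row can be reproduced mechanically; this small sketch (`pb_effect` is an invented name) sums each configuration's response weighted by the parameter's +1/-1 setting:

```c
/* Effect of one parameter: sum of responses weighted by the parameter's
 * +1/-1 setting in each configuration of the PB design. */
int pb_effect(int n_configs, const int *column, const int *response) {
    int effect = 0;
    for (int i = 0; i < n_configs; i++)
        effect += column[i] * response[i];
    return effect;
}
```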
PB Design

Only the magnitude of an effect is important:
- The sign is meaningless

In the example, the most → least important effects are:
- [C, D, E] → F → G → A → B
Case Study #1
Determine the most significant parameters in a processor simulator.
Determine the Most Significant Processor Parameters

Problem:
- So many parameters in a simulator
- How to choose parameter values?
- How to decide which parameters are most important?

Approach:
- Choose reasonable upper/lower bounds
- Rank parameters by their impact on total execution time
Simulation Environment

- SimpleScalar simulator: sim-outorder 3.0
- Selected SPEC 2000 benchmarks: gzip, vpr, gcc, mesa, art, mcf, equake, parser, vortex, bzip2, twolf
- MinneSPEC reduced input sets
- Compiled with gcc (PISA) at -O3
Functional Unit Values

Parameter Low Value High Value
Int ALUs 1 4
Int ALU Latency 2 Cycles 1 Cycle
Int ALU Throughput 1
FP ALUs 1 4
FP ALU Latency 5 Cycles 1 Cycle
FP ALU Throughput 1
Int Mult/Div Units 1 4
Int Mult Latency 15 Cycles 2 Cycles
Int Div Latency 80 Cycles 10 Cycles
Int Mult Throughput 1
Int Div Throughput Equal to Int Div Latency
FP Mult/Div Units 1 4
FP Mult Latency 5 Cycles 2 Cycles
FP Div Latency 35 Cycles 10 Cycles
FP Sqrt Latency 35 Cycles 15 Cycles
FP Mult Throughput Equal to FP Mult Latency
FP Div Throughput Equal to FP Div Latency
FP Sqrt Throughput Equal to FP Sqrt Latency
Memory System Values, Part I

Parameter Low Value High Value
L1 I-Cache Size 4 KB 128 KB
L1 I-Cache Assoc 1-Way 8-Way
L1 I-Cache Block Size 16 Bytes 64 Bytes
L1 I-Cache Repl Policy Least Recently Used
L1 I-Cache Latency 4 Cycles 1 Cycle
L1 D-Cache Size 4 KB 128 KB
L1 D-Cache Assoc 1-Way 8-Way
L1 D-Cache Block Size 16 Bytes 64 Bytes
L1 D-Cache Repl Policy Least Recently Used
L1 D-Cache Latency 4 Cycles 1 Cycle
L2 Cache Size 256 KB 8192 KB
L2 Cache Assoc 1-Way 8-Way
L2 Cache Block Size 64 Bytes 256 Bytes
Memory System Values, Part II

Parameter Low Value High Value
L2 Cache Repl Policy Least Recently Used
L2 Cache Latency 20 Cycles 5 Cycles
Mem Latency, First 200 Cycles 50 Cycles
Mem Latency, Next 0.02 * Mem Latency, First
Mem Bandwidth 4 Bytes 32 Bytes
I-TLB Size 32 Entries 256 Entries
I-TLB Page Size 4 KB 4096 KB
I-TLB Assoc 2-Way Fully Assoc
I-TLB Latency 80 Cycles 30 Cycles
D-TLB Size 32 Entries 256 Entries
D-TLB Page Size Same as I-TLB Page Size
D-TLB Assoc 2-Way Fully-Assoc
D-TLB Latency Same as I-TLB Latency
Processor Core Values

Parameter Low Value High Value
Fetch Queue Entries 4 32
Branch Predictor 2-Level Perfect
Branch MPred Penalty 10 Cycles 2 Cycles
RAS Entries 4 64
BTB Entries 16 512
BTB Assoc 2-Way Fully-Assoc
Spec Branch Update In Commit In Decode
Decode/Issue Width 4-Way
ROB Entries 8 64
LSQ Entries 0.25 * ROB 1.0 * ROB
Memory Ports 1 4
Determining the Most Significant Parameters

1. Run simulations to find the response, with the input parameters at their high/low (on/off) values:

Config    A    B    C    D    E    F    G    Response
  1      +1   +1   +1   -1   +1   -1   -1       9
  2      -1   +1   +1   +1   -1   +1   -1
  3      -1   -1   +1   +1   +1   -1   +1
  …       …    …    …    …    …    …    …

Effect
Determining the Most Significant Parameters

2. Calculate the effect of each parameter across the configurations:

Config    A    B    C    D    E    F    G    Response
  1      +1   +1   +1   -1   +1   -1   -1       9
  2      -1   +1   +1   +1   -1   +1   -1
  3      -1   -1   +1   +1   +1   -1   +1
  …       …    …    …    …    …    …    …

Effect   65
Determining the Most Significant Parameters

3. For each benchmark, rank the parameters in descending order of effect (1 = most important, …):

Parameter   Benchmark 1   Benchmark 2   Benchmark 3
A                3             12            8
B               29              4           22
C                2              6            7
…                …              …            …
Determining the Most Significant Parameters

4. For each parameter, average the ranks:

Parameter   Benchmark 1   Benchmark 2   Benchmark 3   Average
A                3             12            8          7.67
B               29              4           22         18.3
C                2              6            7          5
…                …              …            …          …
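Steps 3 and 4 amount to a simple average over the per-benchmark ranks; a minimal sketch (`average_rank` is an invented name), using the values from the table above:

```c
/* Average a parameter's ranks across benchmarks (step 4 of the
 * procedure): a smaller average rank means a more important parameter. */
double average_rank(int n_benchmarks, const int *ranks) {
    int sum = 0;
    for (int i = 0; i < n_benchmarks; i++)
        sum += ranks[i];
    return (double)sum / n_benchmarks;
}
```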
Most Significant Parameters

Number  Parameter                  gcc  gzip  art  Average
1       ROB Entries                 4     1    2     2.77
2       L2 Cache Latency            2     4    4     4.00
3       Branch Predictor Accuracy   5     2   27     7.69
4       Number of Integer ALUs      8     3   29     9.08
5       L1 D-Cache Latency          7     7    8    10.00
6       L1 I-Cache Size             1     6   12    10.23
7       L2 Cache Size               6     9    1    10.62
8       L1 I-Cache Block Size       3    16   10    11.77
9       Memory Latency, First       9    36    3    12.31
10      LSQ Entries                10    12   39    12.62
11      Speculative Branch Update  28     8   16    18.23
General Procedure

- Determine upper/lower bounds for the parameters
- Simulate the configurations to find the responses
- Compute the effect of each parameter from the responses
- Rank the parameters for each benchmark based on the effects
- Average the ranks across benchmarks
- Focus on the top-ranked parameters for subsequent analysis
Case Study #2
Determine the “big picture” impact of a system enhancement.
Determining the Overall Effect of an Enhancement

Problem:
- Performance analysis is typically limited to single metrics
  - Speedup, power consumption, miss rate, etc.
- Such simple analysis discards a lot of good information
Determining the Overall Effect of an Enhancement

- Find the most important parameters without the enhancement, using a Plackett and Burman design
- Find the most important parameters with the enhancement, again using Plackett and Burman
- Compare the parameter ranks
Example: Instruction Precomputation

- Profile to find the most common operations (0+1, 1+1, etc.)
- Insert the results of the common operations into a table when the program is loaded into memory
- Query the table when an instruction is issued
- Don't execute the instruction if its result is already in the table
- Reduces contention for function units

[Yi, Sendag, Lilja, Europar, 2002]
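As a hedged illustration of the mechanism (not the paper's hardware organization; the table size, hash function, and the names `pt_insert`/`pt_lookup` are invented), the precomputation table could look like:

```c
/* Hypothetical precomputation table: maps (opcode, operand pair) to a
 * result preloaded at program-load time from profiling data. */
#define PT_SIZE 2048

typedef struct { int valid, op; long a, b, result; } pt_entry;
static pt_entry pt[PT_SIZE];

static unsigned pt_hash(int op, long a, long b) {
    return (unsigned)(op * 31L + a * 17 + b) % PT_SIZE;
}

/* Loader: preload the result of a profiled common operation. */
void pt_insert(int op, long a, long b, long result) {
    pt_entry *e = &pt[pt_hash(op, a, b)];
    e->valid = 1; e->op = op; e->a = a; e->b = b; e->result = result;
}

/* Issue stage: returns 1 and fills *result if the operation was preloaded,
 * so the instruction can skip the functional unit entirely. */
int pt_lookup(int op, long a, long b, long *result) {
    pt_entry *e = &pt[pt_hash(op, a, b)];
    if (e->valid && e->op == op && e->a == a && e->b == b) {
        *result = e->result;
        return 1;
    }
    return 0;   /* not in the table: execute normally */
}
```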
The Effect of Instruction Precomputation

                           Average Rank
Parameter                  Before   After   Difference
ROB Entries                  2.77    2.77      0.00
L2 Cache Latency             4.00    4.00      0.00
Branch Predictor Accuracy    7.69    7.92     -0.23
Number of Integer ALUs       9.08   10.54     -1.46
L1 D-Cache Latency          10.00    9.62      0.38
L1 I-Cache Size             10.23   10.15      0.08
L2 Cache Size               10.62   10.54      0.08
L1 I-Cache Block Size       11.77   11.38      0.39
Memory Latency, First       12.31   11.62      0.69
LSQ Entries                 12.62   13.00     -0.38
Case Study #3
Benchmark program classification.
Benchmark Classification

By application type:
- Scientific and engineering applications
- Transaction processing applications
- Multimedia applications

By use of processor function units:
- Floating-point code
- Integer code
- Memory-intensive code

Etc., etc.
Another Point of View

Classify benchmarks by their overall impact on the processor. Define:
- Two benchmark programs are similar if they stress the same components of a system to similar degrees

How to measure this similarity?
- Use a Plackett and Burman design to find the ranks
- Then compare the ranks
Similarity Metric

Use the rank of each parameter as the elements of a vector.

For benchmark program X, let:
- X = (x1, x2, …, xn-1, xn)
- x1 = rank of parameter 1
- x2 = rank of parameter 2
- …
Vector Defines a Point in n-Space

[Figure: points (x1, x2, x3) and (y1, y2, y3) plotted in the space spanned by Param #1, Param #2, and Param #3, separated by a distance D.]
Similarity Metric

Euclidean distance between the points:

D = [(x1 - y1)^2 + (x2 - y2)^2 + … + (xn-1 - yn-1)^2 + (xn - yn)^2]^(1/2)
Most Significant Parameters

Number  Parameter                  gcc  gzip  art
1       ROB Entries                 4     1    2
2       L2 Cache Latency            2     4    4
3       Branch Predictor Accuracy   5     2   27
4       Number of Integer ALUs      8     3   29
5       L1 D-Cache Latency          7     7    8
6       L1 I-Cache Size             1     6   12
7       L2 Cache Size               6     9    1
8       L1 I-Cache Block Size       3    16   10
9       Memory Latency, First       9    36    3
10      LSQ Entries                10    12   39
11      Speculative Branch Update  28     8   16
Distance Computation

Rank vectors:
- gcc = (4, 2, 5, 8, …)
- gzip = (1, 4, 2, 3, …)
- art = (2, 4, 27, 29, …)

Euclidean distances:
- D(gcc, gzip) = [(4-1)^2 + (2-4)^2 + (5-2)^2 + …]^(1/2)
- D(gcc, art) = [(4-2)^2 + (2-4)^2 + (5-27)^2 + …]^(1/2)
- D(gzip, art) = [(1-2)^2 + (4-4)^2 + (2-27)^2 + …]^(1/2)
Euclidean Distances for Selected Benchmarks
gcc gzip art mcf
gcc 0 81.9 92.6 94.5
gzip 0 113.5 109.6
art 0 98.6
mcf 0
Dendrogram of Distances Showing (Dis-)Similarity

[Figure: dendrogram clustering the benchmarks by their Euclidean distances.]
Final Benchmark Groupings

Group   Benchmarks
I       gzip, mesa
II      vpr-Place, twolf
III     vpr-Route, parser, bzip2
IV      gcc, vortex
V       art
VI      mcf
VII     equake
VIII    ammp
Conclusion

Multifactorial (Plackett and Burman) design:
- Requires only O(m) experiments
- Determines the effects of main factors only
- Ignores interactions
- Logically minimal number of experiments to estimate the effects of m input parameters

A powerful technique for obtaining a big-picture view of a lot of simulation data.
Conclusion

Demonstrated for:
- Ranking the importance of simulation parameters
- Finding the overall impact of a processor enhancement
- Classifying benchmark programs

Current work compares simulation strategies:
- Reduced input sets (e.g., MinneSPEC)
- Sampling (e.g., SimPoint)
Goals

- Develop/understand tools for interpreting large quantities of data
- Increase insight into processor design
- Improve rigor in computer architecture research