Page 1:

Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology)

David J. Lilja
Department of Electrical and Computer Engineering
University of Minnesota

Page 2:

Acknowledgements

Graduate students (who did the real work)
o Ying Chen
o Resit Sendag
o Joshua Yi

Faculty collaborator
o Douglas Hawkins (School of Statistics)

Funders
o National Science Foundation
o IBM
o HP/Compaq
o Minnesota Supercomputing Institute

Page 3:

Problem #1

Speculative execution is becoming more popular
o Branch prediction
o Value prediction
o Speculative multithreading

Potentially higher performance

What about impact on the memory system?
o Pollute cache/memory hierarchy?
o Leads to more misses?

Page 4:

Problem #2

Computer architecture research relies on simulation

Simulation is slow
o Years to simulate SPEC CPU2000 benchmarks

Simulation can be wildly inaccurate
o Did I really mean to build that system?

Results are difficult to reproduce

Need statistical rigor

Page 5:

Outline (Part 1)

The Superthreaded Architecture
The Wrong Execution Cache (WEC)
Experimental Methodology
Performance of the WEC

[Chen, Sendag, Lilja, IPDPS, 2003]

Page 6:

Hard-to-Parallelize Applications

Early exit loops
Pointers and aliases
Complex branching behaviors
Small basic blocks
Small loop counts
→ Hard to parallelize with conventional techniques.

Page 7:

Introduce Maybe dependences

Data dependence? Pointer aliasing?
o Yes
o No
o Maybe

Maybe allows aggressive compiler optimizations
o When in doubt, parallelize

Run-time check to correct wrong assumption.
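The run-time check can be pictured with a small software model. This is only a sketch under stated assumptions, not the hardware mechanism: the function name `speculative_load`, the set of upstream target-store addresses, and the dictionary of forwarded values are all hypothetical names introduced for illustration. A load from a speculatively forked thread first checks whether an upstream thread announced a store to the same address; only then does the "maybe" dependence become a real one.

```python
def speculative_load(addr, memory, upstream_target_stores, forwarded):
    """Resolve one load issued by a speculatively forked thread (toy model)."""
    if addr in upstream_target_stores:
        # The maybe dependence is real: an upstream thread will store here.
        if addr in forwarded:
            return forwarded[addr], "forwarded"   # value already sent downstream
        return None, "stall"                      # wait for the upstream store
    # No upstream store to this address: the optimistic assumption held.
    return memory[addr], "memory"


memory = {0x10: 1, 0x20: 2}
print(speculative_load(0x10, memory, {0x20}, {}))          # independent load
print(speculative_load(0x20, memory, {0x20}, {}))          # must wait
print(speculative_load(0x20, memory, {0x20}, {0x20: 99}))  # uses forwarded data
```

The point of the sketch is that the compiler may parallelize without proving independence, because a wrong guess is caught by the address check rather than producing a stale value.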

Page 8:

Thread Pipelining Execution Model

[Figure: Thread i, Thread i+1, and Thread i+2, each with Fork and Sync points and the stages listed below.]

Stages of each thread:
o CONTINUATION: values needed to fork next thread
o TARGET STORE: forward addresses of maybe dependences
o COMPUTATION: forward addresses and computed data as needed
o WRITE-BACK

Page 9:

The Superthreaded Architecture

[Figure components: instruction cache; data cache; four superscalar cores, each with an execution unit, dependence buffer, registers, PC, and communication (Comm) unit.]

Page 10:

Wrong Path Execution Within Superscalar Core

[Figure: loads Ld A, Ld B, Ld C, Ld D, and Ld E along a predicted path, the correct path (CP), and the wrong path (WP), at the point where the prediction result is wrong. Legend: speculative execution; wrong path execution; not ready to be executed.]

Page 11:

Wrong Thread Execution

[Figure: thread units TU0 to TU3 running BEGIN / FORK ... ABORT sequences in two parallel regions separated by a sequential region, with wrong threads marked WTH.]

Figure annotations:
o Mark the successor threads as wrong threads
o Sequential region between two parallel regions
o Kill all the wrong threads from the previous parallel region
o Wrong thread kills itself

Page 12:

How Could Wrong Thread Execution Help Improve Performance?

for (i=0; i<10; i++) {
    ……
    for (j=0; j<i; j++) {
        ……
        x=y[j];
        ……
    }
    ……
}

Parallelized

When i=4, j=0,1,2,3 => y[0], y[1], y[2], y[3], y[4]…

When i=5, j=0,1,2,3,4 => y[0], y[1], y[2], y[3], y[4], y[5]…

[Figure: for i=4 and i=5, the elements y[0] through y[6] touched by thread units TU1 to TU4, with the accesses made by wrong threads marked.]

Page 13:

Operation of the WEC

Wrong execution (wrong thread execution or wrong path execution):
o L1 data cache hit: update LRU info for the L1 data cache
o L1 data cache miss, WEC hit: update LRU info for the WEC
o L1 data cache miss, WEC miss: bring the block from the next level memory into the WEC

Correct execution:
o L1 data cache hit: update LRU info for the L1 data cache
o L1 data cache miss, WEC hit: swap the victim block and the WEC block; prefetch the next line into the WEC
o L1 data cache miss, WEC miss: bring the block from the next level memory into the L1 data cache; put the victim block into the WEC
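The decision flow for the WEC can be sketched as a toy cache model. This is an illustrative sketch rather than the SIMCA implementation: the class name `WecModel`, the tiny cache sizes, and the use of block numbers instead of byte addresses are assumptions made for the example. The L1 data cache is modeled as direct-mapped and the WEC as a small fully associative LRU buffer.

```python
from collections import OrderedDict


class WecModel:
    """Toy direct-mapped L1 data cache backed by a fully associative WEC."""

    def __init__(self, l1_sets=4, wec_entries=2):
        self.l1_sets = l1_sets
        self.l1 = {}                 # set index -> block number
        self.wec = OrderedDict()     # block number -> None, kept in LRU order
        self.wec_entries = wec_entries

    def _wec_insert(self, block):
        self.wec[block] = None
        self.wec.move_to_end(block)           # most recently used at the end
        while len(self.wec) > self.wec_entries:
            self.wec.popitem(last=False)      # evict the least recently used

    def access(self, block, wrong_execution):
        index = block % self.l1_sets
        if self.l1.get(index) == block:
            return "l1_hit"          # direct-mapped: no LRU state to update
        in_wec = block in self.wec
        if wrong_execution:
            # Wrong-path or wrong-thread load: never changes the L1 contents.
            self._wec_insert(block)  # fill on a miss, refresh LRU on a hit
            return "wec_hit" if in_wec else "wec_fill"
        victim = self.l1.get(index)
        if in_wec:
            del self.wec[block]      # the WEC block moves into the L1
        self.l1[index] = block
        if victim is not None:
            self._wec_insert(victim)          # the L1 victim moves into the WEC
        if in_wec:
            self._wec_insert(block + 1)       # prefetch the next line
            return "swap"
        return "l1_fill"


cache = WecModel()
print(cache.access(8, wrong_execution=True))    # wrong execution fills the WEC
print(cache.access(8, wrong_execution=False))   # correct execution finds it there
```

In the usage lines, the wrongly executed load leaves the L1 untouched, and the later correct load to the same block is served from the WEC instead of the next memory level.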

Page 14:

Processor Configurations for Simulations

SIMCA (the SIMulator for the Superthreaded Architecture)

Configurations: orig (baseline), vc, wth-wp-vc, wth-wp-wec, nlp

Features:
o wrong path (wp)
o wrong thread (wth)
o wrong execution cache (wec)
o prefetch into WEC
o victim cache (vc)
o next line prefetch (nlp)

[Table: features enabled in each configuration; the cell marks were not preserved in this transcript.]

Page 15:

Parameters for Each Thread Unit

Issue rate: 8 instrs/cycle per thread unit
Branch target buffer: 4-way associative, 1024 entries
Speculative memory buffer: fully associative, 128 entries
Round-trip memory latency: 200 cycles
Fork delay: 4 cycles
Unidirectional communication ring: 2 requests/cycle bandwidth
Load/store queue: 64 entries
Reorder buffer: 64 entries
INT ALU, INT multiply/divide units: 8, 4
FP adders, FP multiply/divide units: 4, 4
WEC: 8 entries (same block size as L1 cache), fully associative
L1 data cache: distributed, 8KB, 1-way associative, block size of 64 bytes
L1 instruction caches: distributed, 32KB, 2-way associative, block size of 64 bytes
L2 cache: unified, 512KB, 4-way associative, block size of 128 bytes

Page 16:

Characteristics of the Parallelized SPEC2000 Benchmarks

Benchmark    SPEC 2000 type  Input set         Fraction parallelized
175.vpr      INT             SPEC test         8.6%
164.gzip     INT             MinneSPEC large   15.7%
181.mcf      INT             MinneSPEC large   36.1%
197.parser   INT             MinneSPEC medium  17.2%
183.equake   FP              MinneSPEC large   21.3%
177.mesa     FP              SPEC test         17.3%

[The original table also has columns for loop coalescing, loop unrolling, and statement reordering to increase overlap; their per-benchmark marks were not preserved in this transcript.]

Page 17:

Performance of the Superthreaded Architecture for the Parallelized Portions of the Benchmarks

[Figure: speedup (axis 0 to 14) for the 1TU, 2TU, 4TU, 8TU, and 16TU configurations.]

                         Baseline configuration  1TU  2TU  4TU  8TU  16TU
# of TUs                 1                       1    2    4    8    16
Issue rate               1                       16   8    4    2    1
Reorder buffer size      8                       128  64   32   16   8
INT ALU                  1                       16   8    4    2    1
INT MULT                 1                       8    4    2    1    1
FP ALU                   1                       16   8    4    2    1
FP MULT                  1                       8    4    2    1    1
L1 data cache size (KB)  2                       32   16   8    4    2

Page 18:

Performance of the wth-wp-wec Configuration on Top of the Parallel Execution

[Figure: relative speedup (%), axis -10 to 40, for 175.vpr, 164.gzip, 181.mcf, 197.parser, 183.equake, 177.mesa, and the average. Series: 2TU orig, 4TU orig, 8TU orig, 16TU orig, 1TU wth-wp-wec, 2TU wth-wp-wec, 4TU wth-wp-wec, 8TU wth-wp-wec, 16TU wth-wp-wec.]

Page 19:

Performance Improvements Due to the WEC

[Figure: relative speedup (%), axis -1 to 19. Series: vc, wth-wp-vc, wth-wp-wec, nlp.]

Page 20:

Sensitivity to L1 Data Cache Size

[Figure: relative speedup (%), axis 0 to 25. Series: orig 32k, wth-wp-wec 4k, wth-wp-wec 32k.]

Page 21:

Sensitivity to WEC Size Compared to a Victim Cache

[Figure: relative speedup (%), axis 0 to 25. Series: wth-wp-vc 4, wth-wp-vc 16, wth-wp-wec 4, wth-wp-wec 16.]

Page 22:

Sensitivity to WEC Size Compared to Next-Line Prefetching (NLP)

[Figure: relative speedup (%), axis 0 to 24. Series: nlp 8, nlp 32, wth-wp-wec 8, wth-wp-wec 32.]

Page 23:

Additional Loads and Reduction of Misses %

[Figure: percentage (%), axis 0 to 80. Series: additional loads, reduction of misses.]

Page 24:

Conclusions for the WEC

Allow loads to continue executing even after they are known to be incorrectly issued
o Do not let them change state

45.5% average reduction in number of misses
o 9.7% average improvement on top of parallel execution
o 4% average improvement over victim cache
o 5.6% average improvement over next-line prefetching

Cost
o 14% additional loads
o Minor hardware complexity

Page 25:

Typical Computer Architecture Study

1. Find an interesting problem/performance bottleneck
   E.g. memory delays
2. Invent a clever idea for solving it
   This is the hard part
3. Implement the idea in a processor/system simulator
   This is the part grad students usually like best
4. Run simulations on n “standard” benchmark programs
   This is time-consuming and boring
5. Compare performance with and without your change
   Execution time, clocks per instruction (CPI), etc.

Page 26:

Problem #2: Simulation in Computer Architecture Research

Simulators are an important tool for computer architecture research and design
o Low cost
o Faster than building a new system
o Very flexible

Page 27:

Performance Evaluation Techniques Used in ISCA Papers

[Figure: share of papers (0% to 100%) using each evaluation technique in 1973, 1985, 1993, 1997, and 2001. Categories: Other, Modeling, Measurement, Other sim, SimpleScalar.]

* Some papers used more than one evaluation technique.

Page 28:

Simulation is Very Popular, But …

Current simulation methodology is not
o Formal
o Rigorous
o Statistically-based

Never enough simulations
o Design a new processor based on a few seconds of actual execution time

What are benchmark programs really exercising?

Page 29:

An Example -- Sensitivity Analysis

Which parameters should be varied? Fixed?
What range of values should be used for each variable parameter?
What values should be used for the constant parameters?
Are there interactions between variable and fixed parameters?
What is the magnitude of those interactions?

Page 30:

Let’s Introduce Some Statistical Rigor

Decreases the number of errors
o Modeling
o Implementation
o Set up
o Analysis

Helps find errors more quickly

Provides greater insight
o Into the processor
o Effects of an enhancement

Provides objective confidence in results

Provides statistical support for conclusions

Page 31:

Outline (Part 2)

A statistical technique for
o Examining the overall impact of an architectural change
o Classifying benchmark programs
o Ranking the importance of processor/simulation parameters
o Reducing the total number of simulation runs

[Yi, Lilja, Hawkins, HPCA, 2003]

Page 32:

A Technique to Limit the Number of Simulations

Plackett and Burman designs (1946)
o Multifactorial designs
o Originally proposed for mechanical assemblies

Effects of main factors only
o Logically minimal number of experiments to estimate effects of m input parameters (factors)
o Ignores interactions

Requires O(m) experiments
o Instead of O(2^m) or O(v^m)
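To make the gap between O(m) and O(2^m) concrete, the run counts can be computed directly. This is a small sketch with hypothetical function names; it assumes a design with X runs has X - 1 columns, so the Plackett and Burman run count is taken as the smallest multiple of 4 strictly greater than m.

```python
def full_factorial_runs(m, levels=2):
    """Runs needed to try every combination of m parameters."""
    return levels ** m


def pb_runs(m):
    """Smallest multiple of 4 strictly greater than m (assumed PB run count)."""
    return 4 * (m // 4 + 1)


for m in (7, 11, 43):
    print(m, pb_runs(m), full_factorial_runs(m))
```

For the seven-factor example used in this talk, that is 8 simulations instead of 128.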

Page 33:

Plackett and Burman Designs

PB designs exist only in sizes that are multiples of 4

Requires X experiments for m parameters
o X = next multiple of 4 > m

PB design matrix
o Rows = configurations
o Columns = parameters’ values in each config
  • High/low = +1/-1
o First row = from P&B paper
o Subsequent rows = circular right shift of preceding row
o Last row = all (-1)
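The construction rules above can be sketched in a few lines. The function name `pb_design` is hypothetical; the first row used here is the one shown for the 8-run design matrix in this talk.

```python
def pb_design(first_row):
    """Build a PB design matrix from its first row by circular right shifts."""
    rows = [list(first_row)]
    for _ in range(len(first_row) - 1):
        prev = rows[-1]
        rows.append([prev[-1]] + prev[:-1])   # circular right shift
    rows.append([-1] * len(first_row))        # last row: every factor low
    return rows


FIRST_ROW_X8 = [+1, +1, +1, -1, +1, -1, -1]   # first row of the 8-run example

for config, row in enumerate(pb_design(FIRST_ROW_X8), start=1):
    print(config, " ".join(f"{v:+d}" for v in row))
```

Each column ends up with as many +1 entries as -1 entries, so every factor is simulated at its high value in half of the configurations.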

Page 34:

PB Design Matrix

Input parameters (factors): A to G

Config A B C D E F G Response

1 +1 +1 +1 -1 +1 -1 -1 9

2 -1 +1 +1 +1 -1 +1 -1

3 -1 -1 +1 +1 +1 -1 +1

4 +1 -1 -1 +1 +1 +1 -1

5 -1 +1 -1 -1 +1 +1 +1

6 +1 -1 +1 -1 -1 +1 +1

7 +1 +1 -1 +1 -1 -1 +1

8 -1 -1 -1 -1 -1 -1 -1

Effect

Page 35:

PB Design Matrix

Input parameters (factors): A to G

Config A B C D E F G Response

1 +1 +1 +1 -1 +1 -1 -1 9

2 -1 +1 +1 +1 -1 +1 -1 11

3 -1 -1 +1 +1 +1 -1 +1

4 +1 -1 -1 +1 +1 +1 -1

5 -1 +1 -1 -1 +1 +1 +1

6 +1 -1 +1 -1 -1 +1 +1

7 +1 +1 -1 +1 -1 -1 +1

8 -1 -1 -1 -1 -1 -1 -1

Effect

Page 36:

PB Design Matrix

Input parameters (factors): A to G

Config A B C D E F G Response

1 +1 +1 +1 -1 +1 -1 -1 9

2 -1 +1 +1 +1 -1 +1 -1 11

3 -1 -1 +1 +1 +1 -1 +1 2

4 +1 -1 -1 +1 +1 +1 -1 1

5 -1 +1 -1 -1 +1 +1 +1 9

6 +1 -1 +1 -1 -1 +1 +1 74

7 +1 +1 -1 +1 -1 -1 +1 7

8 -1 -1 -1 -1 -1 -1 -1 4

Effect

Page 37:

PB Design Matrix

Input parameters (factors): A to G

Config A B C D E F G Response

1 +1 +1 +1 -1 +1 -1 -1 9

2 -1 +1 +1 +1 -1 +1 -1 11

3 -1 -1 +1 +1 +1 -1 +1 2

4 +1 -1 -1 +1 +1 +1 -1 1

5 -1 +1 -1 -1 +1 +1 +1 9

6 +1 -1 +1 -1 -1 +1 +1 74

7 +1 +1 -1 +1 -1 -1 +1 7

8 -1 -1 -1 -1 -1 -1 -1 4

Effect 65

Page 38:

PB Design Matrix

Input parameters (factors): A to G

Config A B C D E F G Response

1 +1 +1 +1 -1 +1 -1 -1 9

2 -1 +1 +1 +1 -1 +1 -1 11

3 -1 -1 +1 +1 +1 -1 +1 2

4 +1 -1 -1 +1 +1 +1 -1 1

5 -1 +1 -1 -1 +1 +1 +1 9

6 +1 -1 +1 -1 -1 +1 +1 74

7 +1 +1 -1 +1 -1 -1 +1 7

8 -1 -1 -1 -1 -1 -1 -1 4

Effect 65 -45

Page 39:

PB Design Matrix

Input parameters (factors): A to G

Config   A    B    C    D    E    F    G    Response
1       +1   +1   +1   -1   +1   -1   -1    9
2       -1   +1   +1   +1   -1   +1   -1    11
3       -1   -1   +1   +1   +1   -1   +1    2
4       +1   -1   -1   +1   +1   +1   -1    1
5       -1   +1   -1   -1   +1   +1   +1    9
6       +1   -1   +1   -1   -1   +1   +1    74
7       +1   +1   -1   +1   -1   -1   +1    7
8       -1   -1   -1   -1   -1   -1   -1    4
Effect   65  -45   75  -75  -75   73   67
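The Effect row can be reproduced by multiplying each response by the +1 or -1 in that factor's column and summing over the eight configurations. A minimal sketch using the numbers from the table (the names `DESIGN`, `RESPONSES`, and `effects` are chosen for this example):

```python
DESIGN = [
    [+1, +1, +1, -1, +1, -1, -1],
    [-1, +1, +1, +1, -1, +1, -1],
    [-1, -1, +1, +1, +1, -1, +1],
    [+1, -1, -1, +1, +1, +1, -1],
    [-1, +1, -1, -1, +1, +1, +1],
    [+1, -1, +1, -1, -1, +1, +1],
    [+1, +1, -1, +1, -1, -1, +1],
    [-1, -1, -1, -1, -1, -1, -1],
]
RESPONSES = [9, 11, 2, 1, 9, 74, 7, 4]


def effects(design, responses):
    """Signed sum of the responses for each factor column."""
    return [
        sum(row[col] * y for row, y in zip(design, responses))
        for col in range(len(design[0]))
    ]


print(dict(zip("ABCDEFG", effects(DESIGN, RESPONSES))))
```

For factor A this is 9 - 11 - 2 + 1 - 9 + 74 + 7 - 4 = 65, matching the first entry of the Effect row.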

Page 40:

PB Design

Only magnitude of effect is important
o Sign is meaningless

In example, most → least important effects:
o [C, D, E] → F → G → A → B
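The ordering follows from sorting the factors by the absolute value of their effects and grouping ties. A short sketch using the effect values from the example (the helper name `rank_by_magnitude` is an assumption of this illustration):

```python
EFFECTS = {"A": 65, "B": -45, "C": 75, "D": -75, "E": -75, "F": 73, "G": 67}


def rank_by_magnitude(effects):
    """Group factors by |effect| and list the groups from largest to smallest."""
    groups = {}
    for name, value in effects.items():
        groups.setdefault(abs(value), []).append(name)
    return [sorted(groups[mag]) for mag in sorted(groups, reverse=True)]


print(rank_by_magnitude(EFFECTS))
```

C, D, and E share the magnitude 75, which is why they appear as one tied group at the top.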

Page 41:

Case Study #1

Determine the most significant parameters in a processor simulator.

Page 42:

Determine the Most Significant Processor Parameters

Problem
o So many parameters in a simulator
o How to choose parameter values?
o How to decide which parameters are most important?

Approach
o Choose reasonable upper/lower bounds.
o Rank parameters by impact on total execution time.

Page 43:

Simulation Environment

SimpleScalar simulator
o sim-outorder 3.0

Selected SPEC 2000 Benchmarks
o gzip, vpr, gcc, mesa, art, mcf, equake, parser, vortex, bzip2, twolf

MinneSPEC Reduced Input Sets

Compiled with gcc (PISA) at O3

Page 44:

Functional Unit Values

Parameter Low Value High Value

Int ALUs 1 4

Int ALU Latency 2 Cycles 1 Cycle

Int ALU Throughput 1

FP ALUs 1 4

FP ALU Latency 5 Cycles 1 Cycle

FP ALU Throughput 1

Int Mult/Div Units 1 4

Int Mult Latency 15 Cycles 2 Cycles

Int Div Latency 80 Cycles 10 Cycles

Int Mult Throughput 1

Int Div Throughput Equal to Int Div Latency

FP Mult/Div Units 1 4

FP Mult Latency 5 Cycles 2 Cycles

FP Div Latency 35 Cycles 10 Cycles

FP Sqrt Latency 35 Cycles 15 Cycles

FP Mult Throughput Equal to FP Mult Latency

FP Div Throughput Equal to FP Div Latency

FP Sqrt Throughput Equal to FP Sqrt Latency

Page 45:

Memory System Values, Part I

Parameter Low Value High Value

L1 I-Cache Size 4 KB 128 KB

L1 I-Cache Assoc 1-Way 8-Way

L1 I-Cache Block Size 16 Bytes 64 Bytes

L1 I-Cache Repl Policy Least Recently Used

L1 I-Cache Latency 4 Cycles 1 Cycle

L1 D-Cache Size 4 KB 128 KB

L1 D-Cache Assoc 1-Way 8-Way

L1 D-Cache Block Size 16 Bytes 64 Bytes

L1 D-Cache Repl Policy Least Recently Used

L1 D-Cache Latency 4 Cycles 1 Cycle

L2 Cache Size 256 KB 8192 KB

L2 Cache Assoc 1-Way 8-Way

L2 Cache Block Size 64 Bytes 256 Bytes

Page 46:

Memory System Values, Part II

Parameter Low Value High Value

L2 Cache Repl Policy Least Recently Used

L2 Cache Latency 20 Cycles 5 Cycles

Mem Latency, First 200 Cycles 50 Cycles

Mem Latency, Next 0.02 * Mem Latency, First

Mem Bandwidth 4 Bytes 32 Bytes

I-TLB Size 32 Entries 256 Entries

I-TLB Page Size 4 KB 4096 KB

I-TLB Assoc 2-Way Fully Assoc

I-TLB Latency 80 Cycles 30 Cycles

D-TLB Size 32 Entries 256 Entries

D-TLB Page Size Same as I-TLB Page Size

D-TLB Assoc 2-Way Fully-Assoc

D-TLB Latency Same as I-TLB Latency

Page 47:

Processor Core Values

Parameter Low Value High Value

Fetch Queue Entries 4 32

Branch Predictor 2-Level Perfect

Branch MPred Penalty 10 Cycles 2 Cycles

RAS Entries 4 64

BTB Entries 16 512

BTB Assoc 2-Way Fully-Assoc

Spec Branch Update In Commit In Decode

Decode/Issue Width 4-Way

ROB Entries 8 64

LSQ Entries 0.25 * ROB 1.0 * ROB

Memory Ports 1 4

Page 48:

Determining the Most Significant Parameters

1. Run simulations to find response
   • With input parameters at high/low, on/off values

Input parameters (factors): A to G

Config A B C D E F G Response

1 +1 +1 +1 -1 +1 -1 -1 9

2 -1 +1 +1 +1 -1 +1 -1

3 -1 -1 +1 +1 +1 -1 +1

… … … … … … … …

Effect

Page 49:

Determining the Most Significant Parameters

2. Calculate the effect of each parameter
   • Across configurations

Input parameters (factors): A to G

Config A B C D E F G Response

1 +1 +1 +1 -1 +1 -1 -1 9

2 -1 +1 +1 +1 -1 +1 -1

3 -1 -1 +1 +1 +1 -1 +1

… … … … … … … …

Effect 65

Page 50:

Determining the Most Significant Parameters

3. For each benchmark
   Rank the parameters in descending order of effect (1 = most important, …)

Parameter Benchmark 1 Benchmark 2 Benchmark 3

A 3 12 8

B 29 4 22

C 2 6 7

… … … …

Page 51:

Determining the Most Significant Parameters

4. For each parameter
   Average the ranks

Parameter Benchmark 1 Benchmark 2 Benchmark 3 Average

A 3 12 8 7.67

B 29 4 22 18.3

C 2 6 7 5

… … … … …
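Step 4 is a plain arithmetic mean of each parameter's ranks across the benchmarks. A minimal sketch with the three example rows from the table (the names `RANKS` and `average_ranks` are chosen for this illustration):

```python
RANKS = {          # parameter -> rank in Benchmark 1, 2, 3
    "A": [3, 12, 8],
    "B": [29, 4, 22],
    "C": [2, 6, 7],
}


def average_ranks(ranks):
    """Mean rank of each parameter across all benchmarks."""
    return {param: sum(r) / len(r) for param, r in ranks.items()}


averages = average_ranks(RANKS)
for param in sorted(averages, key=averages.get):   # most significant first
    print(param, round(averages[param], 2))
```

A lower average rank means the parameter matters more across the benchmark set, so among these three rows C comes first, then A, then B.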

Page 52:

Most Significant Parameters

Number Parameter gcc gzip art Average

1 ROB Entries 4 1 2 2.77

2 L2 Cache Latency 2 4 4 4.00

3 Branch Predictor Accuracy 5 2 27 7.69

4 Number of Integer ALUs 8 3 29 9.08

5 L1 D-Cache Latency 7 7 8 10.00

6 L1 I-Cache Size 1 6 12 10.23

7 L2 Cache Size 6 9 1 10.62

8 L1 I-Cache Block Size 3 16 10 11.77

9 Memory Latency, First 9 36 3 12.31

10 LSQ Entries 10 12 39 12.62

11 Speculative Branch Update 28 8 16 18.23

Page 53:

General Procedure

Determine upper/lower bounds for parameters
Simulate configurations to find response
Compute effects of each parameter for each configuration
Rank the parameters for each benchmark based on effects
Average the ranks across benchmarks
Focus on top-ranked parameters for subsequent analysis

Page 54:

Case Study #2

Determine the “big picture” impact of a system enhancement.

Page 55:

Determining the Overall Effect of an Enhancement

Problem:
o Performance analysis is typically limited to single metrics
  • Speedup, power consumption, miss rate, etc.
o Simple analysis
  • Discards a lot of good information

Page 56:

Determining the Overall Effect of an Enhancement

Find most important parameters without enhancement
o Using Plackett and Burman

Find most important parameters with enhancement
o Again using Plackett and Burman

Compare parameter ranks

Page 57:

Example: Instruction Precomputation

Profile to find the most common operations
o 0+1, 1+1, etc.

Insert the results of common operations in a table when the program is loaded into memory

Query the table when an instruction is issued

Don’t execute the instruction if it is already in the table

Reduces contention for function units

[Yi, Sendag, Lilja, Europar, 2002]
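The idea can be illustrated with a small software model. This is a hedged sketch and not the hardware design from the cited paper: the operation encoding `(opcode, a, b)`, the table size, and the function names `build_precomputation_table` and `issue` are all assumptions made for the example.

```python
import operator
from collections import Counter

ALU = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def build_precomputation_table(profile, size):
    """Store results for the `size` most frequent (opcode, a, b) operations."""
    common = Counter(profile).most_common(size)
    return {op: ALU[op[0]](op[1], op[2]) for op, _count in common}


def issue(op, table, stats):
    """Return the result of op, skipping the function unit on a table hit."""
    if op in table:
        stats["table_hits"] += 1
        return table[op]
    stats["executed"] += 1
    return ALU[op[0]](op[1], op[2])


# Profiling run: 0+1 and 1+1 dominate, as in the slide's example.
profile = [("add", 0, 1)] * 3 + [("add", 1, 1)] * 2 + [("mul", 3, 4)]
table = build_precomputation_table(profile, size=2)

stats = {"table_hits": 0, "executed": 0}
print(issue(("add", 1, 1), table, stats), issue(("mul", 3, 4), table, stats))
print(stats)
```

Every table hit is one instruction that never occupies a function unit, which is the source of the reduced contention.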

Page 58:

The Effect of Instruction Precomputation

Average Rank

Parameter Before After Difference

ROB Entries 2.77

L2 Cache Latency 4.00

Branch Predictor Accuracy 7.69

Number of Integer ALUs 9.08

L1 D-Cache Latency 10.00

L1 I-Cache Size 10.23

L2 Cache Size 10.62

L1 I-Cache Block Size 11.77

Memory Latency, First 12.31

LSQ Entries 12.62

Page 59:

The Effect of Instruction Precomputation

Average Rank

Parameter Before After Difference

ROB Entries 2.77 2.77

L2 Cache Latency 4.00 4.00

Branch Predictor Accuracy 7.69 7.92

Number of Integer ALUs 9.08 10.54

L1 D-Cache Latency 10.00 9.62

L1 I-Cache Size 10.23 10.15

L2 Cache Size 10.62 10.54

L1 I-Cache Block Size 11.77 11.38

Memory Latency, First 12.31 11.62

LSQ Entries 12.62 13.00

Page 60:

The Effect of Instruction Precomputation

Average Rank

Parameter Before After Difference

ROB Entries 2.77 2.77 0.00

L2 Cache Latency 4.00 4.00 0.00

Branch Predictor Accuracy 7.69 7.92 -0.23

Number of Integer ALUs 9.08 10.54 -1.46

L1 D-Cache Latency 10.00 9.62 0.38

L1 I-Cache Size 10.23 10.15 0.08

L2 Cache Size 10.62 10.54 0.08

L1 I-Cache Block Size 11.77 11.38 0.39

Memory Latency, First 12.31 11.62 0.69

LSQ Entries 12.62 13.00 -0.38
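The Difference column is Before minus After for each parameter. A short sketch that recomputes it from the table and picks out the parameter whose average rank moved the most (the names `BEFORE`, `AFTER`, and `rank_shift` are chosen for this example):

```python
BEFORE = {
    "ROB Entries": 2.77, "L2 Cache Latency": 4.00,
    "Branch Predictor Accuracy": 7.69, "Number of Integer ALUs": 9.08,
    "L1 D-Cache Latency": 10.00, "L1 I-Cache Size": 10.23,
    "L2 Cache Size": 10.62, "L1 I-Cache Block Size": 11.77,
    "Memory Latency, First": 12.31, "LSQ Entries": 12.62,
}
AFTER = {
    "ROB Entries": 2.77, "L2 Cache Latency": 4.00,
    "Branch Predictor Accuracy": 7.92, "Number of Integer ALUs": 10.54,
    "L1 D-Cache Latency": 9.62, "L1 I-Cache Size": 10.15,
    "L2 Cache Size": 10.54, "L1 I-Cache Block Size": 11.38,
    "Memory Latency, First": 11.62, "LSQ Entries": 13.00,
}


def rank_shift(before, after):
    """Before minus After average rank, rounded to two decimals."""
    return {p: round(before[p] - after[p], 2) for p in before}


shift = rank_shift(BEFORE, AFTER)
largest = max(shift, key=lambda p: abs(shift[p]))
print(largest, shift[largest])
```

The largest change belongs to the number of integer ALUs, whose average rank goes from 9.08 to 10.54.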

Page 61:

Case Study #3

Benchmark program classification.

Page 62:

Benchmark Classification

By application type
o Scientific and engineering applications
o Transaction processing applications
o Multimedia applications

By use of processor function units
o Floating-point code
o Integer code
o Memory intensive code

Etc., etc.

Page 63:

Another Point-of-View

Classify by overall impact on processor

Define:
o Two benchmark programs are similar if they stress the same components of a system to similar degrees

How to measure this similarity?
o Use Plackett and Burman design to find ranks
o Then compare ranks

Page 64:

Similarity metric

Use rank of each parameter as elements of a vector

For benchmark program X, let
o X = (x1, x2, …, xn-1, xn)
o x1 = rank of parameter 1
o x2 = rank of parameter 2
o …

Page 65:

Vector Defines a Point in n-space

[Figure: points (x1, x2, x3) and (y1, y2, y3) plotted on axes Param #1, Param #2, and Param #3, separated by distance D.]

Page 66:

Similarity Metric

Euclidean Distance Between Points

$$D = \left[ (x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_{n-1} - y_{n-1})^2 + (x_n - y_n)^2 \right]^{1/2}$$

Page 67:

Most Significant Parameters

Number Parameter gcc gzip art

1 ROB Entries 4 1 2

2 L2 Cache Latency 2 4 4

3 Branch Predictor Accuracy 5 2 27

4 Number of Integer ALUs 8 3 29

5 L1 D-Cache Latency 7 7 8

6 L1 I-Cache Size 1 6 12

7 L2 Cache Size 6 9 1

8 L1 I-Cache Block Size 3 16 10

9 Memory Latency, First 9 36 3

10 LSQ Entries 10 12 39

11 Speculative Branch Update 28 8 16

Page 68:

Distance Computation

Rank vectors
o Gcc = (4, 2, 5, 8, …)
o Gzip = (1, 4, 2, 3, …)
o Art = (2, 4, 27, 29, …)

Euclidean distances
o D(gcc - gzip) = [(4-1)^2 + (2-4)^2 + (5-2)^2 + … ]^(1/2)
o D(gcc - art) = [(4-2)^2 + (2-4)^2 + (5-27)^2 + … ]^(1/2)
o D(gzip - art) = [(1-2)^2 + (4-4)^2 + (2-27)^2 + … ]^(1/2)
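The distance computation is easy to sketch in code. The vectors below contain only the eleven ranks listed in the Most Significant Parameters table, so the printed distances are partial and will not match the full-vector distances reported in the talk, which use the rank of every parameter. The names `RANKS` and `distance` are chosen for this example.

```python
import math

# Ranks of the 11 listed parameters only (ROB Entries ... Speculative Branch Update).
RANKS = {
    "gcc":  [4, 2, 5, 8, 7, 1, 6, 3, 9, 10, 28],
    "gzip": [1, 4, 2, 3, 7, 6, 9, 16, 36, 12, 8],
    "art":  [2, 4, 27, 29, 8, 12, 1, 10, 3, 39, 16],
}


def distance(x, y):
    """Euclidean distance between two rank vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


for a, b in (("gcc", "gzip"), ("gcc", "art"), ("gzip", "art")):
    print(a, b, round(distance(RANKS[a], RANKS[b]), 1))
```

A smaller distance means the two benchmarks rank the processor parameters in a more similar order.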

Page 69:

Euclidean Distances for Selected Benchmarks

       gcc   gzip   art     mcf
gcc    0     81.9   92.6    94.5
gzip         0      113.5   109.6
art                 0       98.6
mcf                         0

Page 70:

Dendrogram of Distances Showing (Dis-)Similarity
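A dendrogram records the order in which benchmarks are merged as the distance threshold grows. The sketch below applies single-linkage agglomerative clustering to the four pairwise distances given for gcc, gzip, art, and mcf. The choice of single linkage is an assumption of this example; the talk does not say which linkage rule produced its dendrogram, and the final groupings use the full benchmark set.

```python
DIST = {
    ("gcc", "gzip"): 81.9, ("gcc", "art"): 92.6, ("gcc", "mcf"): 94.5,
    ("gzip", "art"): 113.5, ("gzip", "mcf"): 109.6, ("art", "mcf"): 98.6,
}


def d(a, b):
    """Symmetric lookup into the pairwise distance table."""
    return DIST[(a, b)] if (a, b) in DIST else DIST[(b, a)]


def single_linkage(names):
    """Merge the two closest clusters until one remains; return the merges."""
    clusters = [frozenset([n]) for n in names]
    merges = []
    while len(clusters) > 1:
        dist, c1, c2 = min(
            ((min(d(a, b) for a in c1 for b in c2), c1, c2)
             for i, c1 in enumerate(clusters) for c2 in clusters[i + 1:]),
            key=lambda t: t[0],
        )
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        merges.append((sorted(c1 | c2), dist))
    return merges


for members, dist in single_linkage(["gcc", "gzip", "art", "mcf"]):
    print(dist, members)
```

Cutting the merge sequence at a chosen distance threshold yields groups of similar benchmarks, which is how a dendrogram is turned into a classification.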

Page 71:

Final Benchmark Groupings

Group  Benchmarks
I      Gzip, mesa
II     Vpr-Place, twolf
III    Vpr-Route, parser, bzip2
IV     Gcc, vortex
V      Art
VI     Mcf
VII    Equake
VIII   ammp

Page 72:

Conclusion

Multifactorial (Plackett and Burman) design
o Requires only O(m) experiments
o Determines effects of main factors only
o Ignores interactions

Logically minimal number of experiments to estimate effects of m input parameters

Powerful technique for obtaining a big-picture view of a lot of simulation data

Page 73:

Conclusion

Demonstrated for
o Ranking importance of simulation parameters
o Finding overall impact of processor enhancement
o Classifying benchmark programs

Current work comparing simulation strategies
o Reduced input sets (e.g. MinneSPEC)
o Sampling (e.g. SimPoints, sampling)

Page 74:

Goals

Develop/understand tools for interpreting large quantities of data

Increase insights into processor design

Improve rigor in computer architecture research