VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 ·...

66
Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future TAU university January 24, 2006 Uri Weiser

Transcript of VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 ·...

Page 1: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 1

VLSI Trends in MicroarchitecturePast, present and future

TAU universityJanuary 24, 2006

Uri Weiser

Page 2: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 2

Agenda

Microarchitecture– VLSI– Trends Past and present:

» Pipeline, superpipeline» Out of Order» Branch prediction» Caches» Trace cache» Threads and Chip Multiprocessing

– Future» Asymmetric» Accelerators

Page 3: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 3

“[In the beginning] we had little idea of what we had started. ...I remember... saying, ‘Okay, we’ve done integrated circuit. What do we do next?”

Gordon E. Moore

Page 4: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 4

TRENDSIN

VLSI

Sources:Shekhar BorkarUri Weiser

Page 5: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 5

Process TechnologyProcess Technology1.51.5µµ 1.01.0µµ 0.80.8µµ 0.60.6µµ 0.350.35µµ 0.250.25µµ 0.180.18µµ 0.130.13µµ

Intel386Intel386™™ DX DX ProcessorProcessor

Intel486Intel486™™ DX DX ProcessorProcessor

PentiumPentium®®ProcessorProcessor

PentiumPentium®® Pro Pro ProcessorProcessor

PentiumPentium®® II II ProcessorProcessor

PentiumPentium®® 4 4 ProcessorProcessor

PentiumPentium®® III III ProcessorProcessor

ProcessorProcessor

Technology trendTrends

Page 6: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 6

Performance History

10

100

1,000

10,000

1.0μ 0.7μ 0.5μ 0.35μ 0.25μ 0.18μ

Freq

(Mhz

)

Freq (uArch)Freq (Process)

18.3X

3.3X

i486Pentium® proc

Pentium® II & III proc

1.0u-0.18u, 1989-2001

Frequency increased 61X

1. 18.3X due to process technology

2. Additional 3.3X due to uArch

1

10

100

1.0μ 0.7μ 0.5μ 0.35μ 0.25μ 0.18μ

Rel

ativ

e Pe

rfor

man

ce Perf due to uArchitecture

Perf due to Freq

14X

7X

i486Pentium® proc

Pentium® II & III proc

Pentium® 4 proc

Performance increased ~100X

1. 14X due to process tech

2. Additional 7X due to uArch & design

Trends

Page 7: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 7

Process Technology: Minimum Feature Size

Source: Intel, SIA Technology RoadmapSource: Intel, SIA Technology Roadmap0.01

Feature SizeFeature Size(microns)(microns)

0.1

1

10

’68 ’71 ’76 ’80 ’84 ’88 ’92 ’96 ’00 ’04 ’08

IntelSIA

Trends

Page 8: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 8

Transistors on a Chip

400480088080

8085 8086286

386486 Pentium® proc

P6

0.001

0.01

0.1

1

10

100

1000

1970 1980 1990 2000 2010Year

Tran

sist

ors

(MT)

2X growth in 1.96 years!

Transistors on a chip doubled every two yearsTransistors on a chip doubled every two years

Pentium® 4 proc

Trends

Page 9: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 9

Die Size Growth

40048008

80808085

8086 286386

486 Pentium ® procP6

1

10

100

1970 1980 1990 2000 2010Year

Die

sid

e (m

m) =

Are

a1/2

Die size grows? Is it saturated?Die size grows? Is it saturated?

Trends

Page 10: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 10

Frequency

P6Pentium ® proc

48638628680868085

8080800840040.1

1

10

100

1000

10000

1970 1980 1990 2000 2010Year

Freq

uenc

y (M

hz)

Doubles every 2 years

Lead Microprocessors frequency doubles every 2 yearsLead Microprocessors frequency doubles every 2 years

Trends

Page 11: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 11

Frequency of Operation

PPro-300PPro-225

486DX2-66

386DX-33 486DX-33

8080

80868085

8088

80286

386DX-16486DX-25

486DX-50PP-66

PP-90

1

10

100

1000

1970 1975 1980 1985 1990 1995 2000Years

Freq

uenc

yM

HZ

IntelPPCOther

Alpha

Trends

Page 12: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 12

Frequency of Operation (cont.)

PPro-300

486DX2 PP-66PP-90

PP-100PP-120

PPro-225

PPro-180PPro-150/PPC604-150

PP-133/PPC604-133

PPC601-100

PPC620C-200

A21064A-275A21164-300

M-R4400-200

0

50

100

150

200

250

300

350

400

1992 1993 1994 1995 1996 1997 1998Years

Freq

uenc

y (M

Hz)

Intel

PPC

Other

A21164-333

A21164A-417

Trends

Page 13: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 13

Brainiacs and Speed demons

Source: ISCA 95, p. 174

1.6

0.57 0.6

0

0.5

1

1.5

2

0 50 100 150 200 250 300 350

Speed demons

Brai

niac

sSPECInt92 = 400

300

200

SPECInt92 = 10050

MHz

ALPHA

X86

PowerPC

PENTIUM

PENTIUM PRO

21164

21064

1.0

0.5

1.5

2.0

50 100 150 200 250 300 350 400

1

SPEC

Int9

2 / M

Hz

Trends

Page 14: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 14

Trends of Future Processors

0.0

1.0

2.0

3.0

4.0

5.0

6.0

7.0

8.0

0 500 1000 1500 2000 2500

M HZ

SPEC

int/1

00M

HZ

PII, PIIISPECInt = 60SPECint = 40

SPECint = 27= 18

= 12= 8

= 4

SPECInt = 90

P IV

P M

Trends

Page 15: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 15

1

10

100

1000

1.5μ 1μ 0.7μ 0.5μ 0.35μ 0.25μ 0.18μ 0.13μ 0.1μ 0.07μ

Wat

ts/c

m2

i386i486

Pentium ® processor

Pentium Pro ® processor

Pentium II ® processor Pentium III ® processor

Power density continues to get worse

Hot plate

Nuclear ReactorRocketNozzleSun’s

Surface

Trends

Page 16: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 16

On Die Cache Memory

Core Logic

Cache

Core Logic

Cache

Core Logic

Cache

0.13μ

0.18μ

0.10μ

Pentium III & 4

Pentium IIIPentium IIPentium

ProPentium

0%

20%

40%

60%

80%

100%

0.7μ 0.5μ 0.35μ 0.25μ 0.18μ 0.13μ 0.10μ

Cache % offull chiparea ?

Larger % of die area will be memoryLarger % of die area will be memory

Trends

Page 17: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 17

Process trend – the theory (cont)Performance driven era vs. Power aware era

Processes

“Old” processes

Pow

er ra

tioProcess generation p vs. p-1

@ same die area

n1µ

n+1 n+2 n+30.35µ

“new” processes

m0.25µ

m+1 m+2 m+30.09µ

1.4X Frequency0.9X voltage0.7X capacity/transistor1X area2X transistors

Power increase/generation: 1.6X

Power aware era(Performance within power envelop)

1.4X Frequency0.75X voltage0.7X capacity/transistor1X area2X transistor

Power increase/generation: 1.1X

Performance driven era(no power constrains)

Reference: Eric Spargle

1.0

2.0

3.0

4.0

Trends

Page 18: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 18

Processor roadmap trend – real life (cont)Extension of Pollack’s Rule (Micro32, 1999)

Processor generation k vs. k-1 compacted @ the same process technology

Power

0

1

2

3

4

1.5 1 0.7 0.5 0.35 0.18Technology Generation

Gro

wth

(X)

Perf/power delta ratio

1 : >3Performance

!Perf/power delta

ratio3 : 1

Trends

Page 19: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 19

Microarchitecture

Page 20: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 20

The Generic Processor

Datasupply

Instructionsupply

Executionengine

Sophisticated organization to “service” instructionsInstruction supply– Instruction cache– Branch prediction– Instruction decoder– ...

Execution engine– Instruction scheduler– Register files– Execution units– ...

Data supply– Data cache– TLB’s– …

Goal - Maximum throughput – balanced design

Microarchitecture

Page 21: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 21

“The Core” - A Block Diagram

FPU ALU1 ALU2AGU

+MEM

n

2

DataCache

D-TLBReg.File

Scheduler

Instruction flowRead SourcesWrite DestinationWrite Bypass

Write Back

Microarchitecture

Page 22: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 22

Parallelism Evolution

ProcessorElement

Instruction

Basic configuration Pipeline

VLIW

Superscalar - In order

Superscalar - Out of Order

a

...

PE

a ...

PE

b

PE

c

PE

n

PE

a ...

PE

d

PE

f

PE

n

b

c

e

PE

a a

PE

b c n

PE PE PE PE

Parallelism

Page 23: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 23

PipelineBreak the work to smaller pieces

Increased throughput– increased # of completed instructions per cycle and reduces cycle time– Number of stages varies

– Small: 4-5 (Pentium), “Superpipeline” ~14 (Pentium Pro), “ultra-pipeline” ~25 (PIV)

Calls for good balancing among stages

I1I2

I3

F D E WF D E W

F D E WF D E W

F D E WF D E W

F: FetchD: DecodeE: ExecuteW: Write Back

0 1 2 3 4 5 6 7 8 9 10 11 12 (Cycles)

1/4 IPC = 4 CPI

1 IPC = 1 CPIIPC = Instructions Per CycleCPI = Cycles Per Instructions

ExamplesIntel 486NS 32532

t

Parallelism

Page 24: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 24

Pipeline Stalls

But there are “stalls” in the pipeline– “Data Hazards”: Data flow dependency (instructions output/input)

» Solved by: bypasses, renaming– “Control Hazards”: Control flow dependencies

» Solved by branch prediction– “Structural Hazards”: Limited resources– Other (Cache misses, long latency instructions, page faults….)

F D E WF D stall W

F D E WF E W

F D E W

F: FetchD: DecodeE: ExecuteW: Write Back

Estall

D

Data Flow stall(w/o bypass)

Control Flow stall

stallstallstallstall

Address Generation Interlock

stall

0 1 2 3 4 5 6 7 8 9 10 11 12 (Cycles)t

Parallelism

Page 25: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 25

Super ScalarPerforms more in a single cycle

Ideally, can multiply the throughput– But stall occurs more frequently

0 1 2 3 4 5 6 7 8 9 10 11 12F D E W

F D E W

F D E W

F D E WF D E W

F D E W

F D E W

F D E W

F D E W

F D E W

F D E WF D E WStall

2 IPC = 1/2 CPI

ExamplesIntel Pentium® Proc. Alpha 21164

t

Parallelism

Page 26: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 26

Super PipelineSplit to shorter stages - allows higher frequency

Ideally, can (again) multiply the throughput, but– Stall penalties do not scale (e.g., control flow stall, cache misses)– Clock setup/hold reduces net cycle time - each instruction takes longer!

In the example above: 2X stages, but performance gain is <33%

0 1 2 3 4 5 6 7 8 9 10 11 12

F: FetchD: DecodeE: ExecuteW: Write Back

1 IPC = 1 CPI33% higher freq!

0 1 2 3 4 5 6 7 8 9 10 11 12

F1 F2 D1 D2 D3 E1 W1 W2F1 F2 D1 D2 D3 E1 W1 W2

F1 F2 D1 D2 D3 E1 W1 W2F1 F2 D1 D2 D3 E1 W1 W2

F1 F2 D1 D2 D3 E1 W1 W2F1 F2 D1 D2 D3 E1 W1 W2

F1 F2 D1 D2 D3 E1 W1 W2F1 F2 D1 D2 D3 E1 W1 W2

Old clk =

New clk =

Examples:Intel Pentium® II/III/4

Parallelism

Page 27: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 27

1 2 95 6 73 4 8

1 2 95 6 73 4 8

Out Of Order ExecutionIn Order Execution: instructions are processed in their program order.

– Limitation to potential Parallelism.OOO: Instructions are executed based on “data flow” rather than program orderBefore: src -> dest(1) load (r10), r21(2) mov r21, r31 (2 depends on 1)(3) load a, r11(4) mov r11, r22 (4 depends on 3)(5) mov r22, r23 (5 depends on 4)

After:(1) load (r10), r21; (3) load a, r11;

<wait for loads to complete>(2) mov r21,r31; (4) mov r11,r22;

(5) mov r22,r23;Usually highly superscalar

1F

2F

3F

4F

1D

2D

3D

4D

5F 5D 5W

1W

2E

3w

4E

5E

2W

3W

4W

5E

1E

2E

3E

4E

5E

1E

2E

3E

4E

5E

1F

2F

3F

4F

1D

2D

3D

4D

5F 5D 5W

1W

2E

3E

4E

5E

2W

3E

4E

5E

4E

5E

1E

2E

3E

4E

5E

1E

2E

3E

4E

5E

4W

5E

3W

In Order Processing

Out of Order Processing

In Order vs. OOO execution.Assuming:- Unlimited resources- 2 cycles load latency

Examples:Intel Pentium® II/III/4Compaq Alpha 21264

t

t

OOO

Page 28: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 28

Out Of Order (cont.)Advantages

– Help exploit Instruction Level Parallelism (ILP)– Help cover latencies (e.g., cache miss, divide)– Artificially increase the Register file size (i.e. number of registers)– Superior/complementary to compiler scheduler

» Dynamic instruction window» Make usage of more registers than the Architecture Registers

Complex microarchitecture– Complex scheduler. Involves also

» Large instruction window» Speculative execution

– Requires reordering back-end mechanism (retirement) for:» Precise interrupt resolution» Misprediction/speculation recovery» Memory ordering

OOO

Page 29: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 29

Speculation

Page 30: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 30

Branch PredictionGoal - ensure enough instruction supply by correct prefetchingIn the past - prefetcher assumed fall-through

– Lose on unconditional branch (e.g., call)– Lose on frequently taken branches (e.g., loops)

Branch prediction– Predicts whether a branch is taken/not taken– Predicts the branch target address

Misprediction cost varies (higher w/ increased pipeline length)Typical Branch prediction rates: ~90%-96%

4%-10% misprediction,10-25 branches between mispredictions50-125 instructions between mispredictions

Misprediction cost increased with– Pipeline depth– Machine width

» e.g. 3 width x 10 stages = 30 inst flushed!

?

Speculation - BP

Page 31: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 31

Target Array + Direction PredictionTarget and direction are predicted separately Tag may be partial

Branch IP

tag predicted target

predicted Target

Address

predicted direction(taken/not-taken)

hit / miss(indicates a branch)

DirectionPrediction

(for conditionalbranches only)

TargetPrediction

Speculation - BP

Page 32: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 32

Speculative ExecutionExecution of instructions from a predicted (yet unsure) pathEventually, path may turn wrong.Advantages:

– Ensure instruction supply– Allow large scheduling window (for out of order)

Issues:– Misprediction cost– Misprediction recovery

Speculation - execution

Page 33: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 33

Cache - Motivation & PrincipleMemory consumption is growing about 2X every 2 years

– Typical size: (Y2000) 64M-128M, (Y2002) 128M-256MCPU speed grows faster than memory and buses

– CPU/Bus grew from 1:1 to 6:1, and still growing486 Pentium P-II P-III P4 25-66MHz 66-233MHz 200-450MH 0.5-1.33GHz 1.4-2.4GHz33MHz 66MHZ 66-100MHz 133-200MHz 400MHz

– Memory: DRAM: 60-100ns (“10-16MHz”), Cost: <10$ per 1MSRAM is faster but much more expensiveMemory becomes the bottleneck for both instructions and data!SlowSlow or expensive

Solution: Cache - A Small,Small, Fast,Fast, CloseClose memory– Serves as a buffer between CPU and main memory

Contains copy of a portionof the main memory

– Small in size – Dynamically changed

Exploit space and time locality: – Code is fetched sequentially (Space)– Code is re-executed (loops, procedures) (Time)– Access close or previous data (Space, Time)

Cache MemoryCPU

Small, fastClose, Expensive

Large, SlowFast, Cheap

Speculation - Cache

Page 34: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 34

The Generic Processor

Executionengine

Datasupply

Instruction Cache

BTB Instruction fetch/decode

Rename

Scheduler

Trace/decodedcache

Speculation – Trace Cache

Page 35: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

35Vlsi_03_2005.ppt/April_2005 35

Fetch bandwidthexample

A CB B CAtime

Dynamic instruction stream

o o o o

A

B

CD

B E

Control flow graphA, B, C are instruction blocks

Speculation – Trace Cache

Page 36: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

36Vlsi_03_2005.ppt/April_2005 36

Trace Cache Concept

• Hold in the “instruction”cache the dynamic stream of the executed instructions

=> Trace cache acts as “branch predictor” + wide instructions supplier

Speculation – Trace Cache

Page 37: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

37Vlsi_03_2005.ppt/April_2005 37

Trace Cache Overview

I Cache

Decoder

Trace Cache

FillUnit

address

A B C

A B CA B C

To Execution Core

Build Mode

Stream Mode

Speculation – Trace Cache

Page 38: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

38Vlsi_03_2005.ppt/April_2005 38

Trace cache line

• Tag: identifies starting address of trace

• N instructions (potentially decoded)• Next address: next fetch address• path info: branch flags (T, NT), number

of branches, trace ends w/ branch?,…)

N Instructions Next address path infoPC Tag

Speculation – Trace Cache

Page 39: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 39

Threads

Page 40: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 40

Scalar Execution

Dependencies reduce throughput/utilization

Time

Threads

Page 41: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 41

Superscalar Execution

Generally increases throughput, but decreases utilization

Time

Threads

Page 42: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 42

Predication

Generally increases utilization, increases throughput less(much of the utilization is thrown away)

Time

Threads

Page 43: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 43

CMP – Chip Multi-ProcessorTime

Low utilization / higher throughput

Threads

Page 44: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 44

Blocked Multithreading

May increase utilization and throughput, but must switch when currentthread goes to low utilization/throughput section (e.g. L2 cache miss)

Time

Threads

Page 45: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 45

Fine Grained Multithreading

Increases utilization/throughput by reducing impact of dependences

Time

Threads

Page 46: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 46

Simultaneous Multithreading

Time

Increases utilization/throughput

Threads

Page 47: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 47

Future

Page 48: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 48

Gain

Frequency

Analog Circuit Paradigm

GBWP = Gain Bandwidth Product = constant @ a given technology

BW1

Gain1

BW2Gain2

e.g. Gain1*BW1= Gain2*BW2

Future

Page 49: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 49

Gain

Frequency

Analog Circuit Paradigm (cont.)

BW1 BWn Gain

Future

Page 50: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 50

“Theory”Analog Gain Bandwidth Product (GBWP) is constant for a specific technology, this is also true for other “environments”…A computer structure can excel in performance for a specific application set but not at all applications (also true for benchmarks)a person can excel in several areas but not at all...…...

examples: benchmarks, application in coming foils people….

Future

Page 51: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 51

Tuning for ApplicationsPerformance

“Applications”

Apps1 Appsn

Future

Page 52: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 52

Provide Specialized “efficient”MIPS

Find a way to support the new performance requirements via an efficient “mechanism”A tailored solutions (to a specific application set) can provide an “efficient” MIPS via INTEGRATION, how?

Future

Page 53: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 53

The Needthe environment

These days is the PC's 20th birthday – 835 Million PC sold 1981-2001– 138 million PCs in year 2001(IDC), 10X number of cars, 1.5X of

television sold annually– 2.2 Billion Email a day, 10X of the first class mail– 400 million on line users (200 in Sep99)– CPU performance improved ~8000X !!!

What will be the need for performance in the coming 20 years?What will be the technology progress in the coming 20 years? 10 years? 5 years?

Statistics courtesy of Gartner Dataquest, U.S. News & World Report, Jupiter Internet Population Model, and NUA Internet Surveys

Future

Page 54: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 54

Windows XP examples that needs excessive performance:

- Movie Maker Video Indexing- Video smoothing

Example 1:Movie Maker Video Indexing

320*240, 30fps4X slower thanreal Time on CentrinoTM @1.6Ghz

1980x1080, 30fps~100X over CenrinoTM @1.6Mhz

Future

Page 55: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 55

30 FPS (Enhanced)5 FPS (Jerky)

Example 2:Emulation of:Video smoothingVideo Enhancement

Video smoothing

352*240 pixelsCPU usage:70% of CentrinoTM @1.6Ghz

1980x1080, 30fps~21X over CenrinoTM @1.6Mhz

Future

Page 56: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 56

The Need

Future

Page 57: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 57

The need: Build a Panorama

M. Brown and D. G. Lowe. Recognising Panoramas. ICCV 2003Performance: >30min P4 3GHz

Simplified capabilities at Microsoft Digital Image Suite 10 ($129.95)

Future

Page 58: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 58

Future

Page 59: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 59

CPU Usagean example(measured on IBM X20)

Streaming vs. General purpose

General purpose usageExcel and Outlook -->Burst need for MIPS

Streaming Processing6 low resolution videos-->Continuous need for MIPS

Avg.

Avg.

5min

~5min

Future

Page 60: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 60

Process trend – the theory (cont)Performance driven era vs. Power aware era

Processes

“Old” processes

Pow

er ra

tioProcess generation p vs. p-1

@ same die area

n1µ

n+1 n+2 n+30.35µ

“new” processes

m0.25µ

m+1 m+2 m+30.09µ

1.4X Frequency0.9X voltage0.7X capacity/transistor1X area2X transistorsLeakage 30%

Power increase/generation: 1.6X

Power aware era(Performance within power envelop)

1.4X Frequency0.75X voltage0.7X capacity/transistor1X area2X transistorLeakage <5%

Power increase/generation: 1.1X

Performance driven era(no power constrains)

Reference: Eric Spargle

1.0

2.0

3.0

4.0

Future

Page 61: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 61

Processor roadmap trend – real life (cont)Extension of Pollack’s Rule (Micro32, 1999)

Processor generation k vs. k-1 compacted @ the same process technology

Power

0

1

2

3

4

1.5 1 0.7 0.5 0.35 0.18Technology Generation

Gro

wth

(X)

Perf/power delta ratio

1 : >3Performance

!Perf/power delta

ratio3 : 1

Future

Page 62: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 62

solution 1: CMP (Chip Multi-Processor)

Per

form

ance

Power

penalty: MP

Conventional design 1% performancefor 3% in power

One processor

2 CMP

3 CMP

4 CMP

*CMP = Symmetric General Purpose (GP) cores

CMP*

Future

Page 63: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 63

solution 2: ACCMP (Asymmetric Cluster CMP)

Per

form

ance

Power

penalty: Specialized MIPS

AC

CMP

~1% performancefor 1% in power

>3% performancefor 1% in power

Future

Page 64: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 64

ACCMPWhat is the ACCMP?– On Die Asymmetric Clusters of cores– Efficient specialized MIPS clusters with

>3-4X performance/power over GP cores– Compatible ISA?

Penalties– Multi-Processing (tasks or threads)

Specialized MIPS

ACCMP is a solution that enables to continue (for a while) Moore’s performance law within the power envelop

Future

Page 65: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 65

ACCMP

Specialized MIPS A Cluster

Host coreHost core

L2 $

Host ClusterGeneral Purpose

MIPS

Interconnect

Specialized MIPS B Cluster

Ext.Bus

Future

Page 66: VLSI Trends in Microarchitecture Past, present and futureafek/Intel2006.pdf · 2006-01-24 · Vlsi_03_2005.ppt/April_2005 1 VLSI Trends in Microarchitecture Past, present and future

Vlsi_03_2005.ppt/April_2005 66

Future - Processors

• applications need• Specialized MIPS• Detached from the CPU core• Different engines• Mixture of Programmable and fixed function

• ?