Lecture 9 Outline


Lecture 9 ECE/CSC 506 - Summer 2006 - E. F. Gehringer, based on slides by Yan Solihin

Lecture 9 Outline

MESI protocol
Dragon update-based protocol
Impact of protocol optimizations


Lower-Level Protocol Choices

BusRd observed in M state: what transition to make?
  Change to S: assume I'll read again soon
    Good for mostly-read data
    What about "migratory" data?
  Change to I: assume the other processor will write to it (Synapse)
    Suits migratory data: I read and write, then you read and write, then X reads and writes...
  Sequent Symmetry and MIT Alewife use adaptive protocols


MESI (4-state) Invalidation Protocol

Problem with MSI protocol
  A Rd, Wr sequence incurs 2 transactions even when no one is sharing (e.g., a serial program!)
    BusRd (I → S) followed by BusRdX or BusUpgr (S → M)
  In general, penalizing serial programs is unacceptable

Add an Exclusive state, giving four states:
  Invalid
  Modified (dirty)
  Shared (two or more caches may have copies)
  Exclusive (only this cache has a clean copy, same value as in memory)

How to decide I → E or I → S?
  Need to check whether someone else has a copy
  "Shared" signal on bus: wired-OR line asserted in response to BusRd
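The saving from the Exclusive state can be sketched in a few lines of Python. This is only an illustrative model, not part of the original slides: the function names are hypothetical, and the MSI variant is assumed to use BusUpgr rather than BusRdX for the S → M step.

```python
def read_miss_state(shared_line_asserted):
    """State a MESI cache enters on a read miss, chosen by the wired-OR Shared signal."""
    return "S" if shared_line_asserted else "E"


def bus_transactions_read_then_write(protocol, shared_line_asserted=False):
    """Bus transactions for a read miss followed by a write to the same block."""
    transactions = ["BusRd"]              # the read miss always goes on the bus
    if protocol == "MSI":
        state = "S"                       # MSI has no E state: I -> S
    elif protocol == "MESI":
        state = read_miss_state(shared_line_asserted)
    else:
        raise ValueError(protocol)
    if state == "S":
        transactions.append("BusUpgr")    # S -> M needs a second transaction (assumed BusUpgr)
    # In E, the write upgrades to M silently: no bus transaction.
    return transactions


print(bus_transactions_read_then_write("MSI"))    # two transactions
print(bus_transactions_read_then_write("MESI"))   # one transaction when no one shares
```

When the Shared line is asserted, the MESI case falls back to the same two transactions as MSI, so the saving applies only to unshared data.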


MESI: Processor-Initiated Transactions

[Figure: MESI state diagram showing processor-initiated transitions only. States: M, E, S, I. Arc labels: PrRd/-, PrWr/-, PrRd/BusRd(~S), PrRd/BusRd(S), PrWr/BusRdX.]
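The processor-initiated transitions can also be written as a lookup table. The sketch below assumes the usual MESI assignment of the labels above to arcs (the flattened diagram does not show which arc carries which label); the names `PROC_TRANSITIONS` and `on_processor_event` are hypothetical.

```python
# (current state, processor event) -> (next state, bus transaction or None)
# Assumed mapping of arc labels to arcs, following the usual MESI protocol.
PROC_TRANSITIONS = {
    ("M", "PrRd"): ("M", None),
    ("M", "PrWr"): ("M", None),
    ("E", "PrRd"): ("E", None),
    ("E", "PrWr"): ("M", None),        # silent upgrade, no bus transaction
    ("S", "PrRd"): ("S", None),
    ("S", "PrWr"): ("M", "BusRdX"),
    ("I", "PrWr"): ("M", "BusRdX"),
}


def on_processor_event(state, event, shared=False):
    """Return (next_state, bus_transaction) for a processor read or write."""
    if state == "I" and event == "PrRd":
        # BusRd(S) -> S, BusRd(~S) -> E: the Shared line decides.
        return ("S" if shared else "E"), "BusRd"
    return PROC_TRANSITIONS[(state, event)]


print(on_processor_event("I", "PrRd", shared=False))
print(on_processor_event("E", "PrWr"))
```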


MESI: Bus-Initiated Transactions

[Figure: MESI state diagram showing bus-initiated transitions only. States: M, E, S, I. Arc labels: BusRd/-, BusRdX/-, BusRd/Flush, BusRdX/Flush, BusRd/Flush1, BusRdX/Flush1.]
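A snooper's reaction to observed bus transactions can be sketched the same way. The assignment of labels to arcs here is an assumption based on the usual MESI protocol (M and E respond with Flush, S with the optional Flush1, I ignores the transaction); `SNOOP_TRANSITIONS` and `on_bus_event` are hypothetical names.

```python
# (current state, observed bus transaction) -> (next state, snooper action or None)
# Assumed mapping: the flattened diagram lists the labels but not their arcs.
SNOOP_TRANSITIONS = {
    ("M", "BusRd"): ("S", "Flush"),
    ("M", "BusRdX"): ("I", "Flush"),
    ("E", "BusRd"): ("S", "Flush"),
    ("E", "BusRdX"): ("I", "Flush"),
    ("S", "BusRd"): ("S", "Flush1"),
    ("S", "BusRdX"): ("I", "Flush1"),
    ("I", "BusRd"): ("I", None),
    ("I", "BusRdX"): ("I", None),
}


def on_bus_event(state, transaction):
    """Return (next_state, action) for a transaction observed on the bus."""
    return SNOOP_TRANSITIONS[(state, transaction)]


print(on_bus_event("M", "BusRd"))
print(on_bus_event("S", "BusRdX"))
```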


MESI State Transition Diagram

BusRd(S) means shared line asserted on BusRd transaction

[Figure: complete MESI state transition diagram over states M, E, S, I, combining the processor-initiated arcs (PrRd/-, PrWr/-, PrRd/BusRd(S), PrWr/BusRdX) with the bus-initiated arcs (BusRd/Flush, BusRdX/Flush, BusRd/Flush′, BusRdX/Flush′).]


Flush vs. Flush1 (Flush' in textbook)

Flush: mandatory
Flush' (Flush1): happens only when
  cache-to-cache sharing is used, and
  only one cache flushes the data


MESI Visualization

Setup: P1, P2, and P3 each have a cache and a snooper on a shared bus; a memory controller fronts main memory, which initially holds X=1.

P1: rd &X. BusRd goes on the bus; memory supplies X=1.
P1 now holds X=1 in E.
P1: wr &X (X=2). P1 goes E → M with no bus transaction. One less bus request due to the Exclusive state, especially for serial programs.
P3: rd &X. BusRd goes on the bus while P1 holds X=2 in M and memory still holds X=1.
P1 flushes X=2: P1 goes M → S, P3 loads X=2 in S, and memory is updated to X=2.
P3: wr &X (X=3). BusUpgr goes on the bus: P1 goes S → I and P3 goes S → M with X=3. Note: BusUpgr instead of BusRdX.
P1: rd &X. BusRd goes on the bus; P3 flushes X=3 and goes M → S, P1 loads X=3 in S, and memory is updated to X=3.
P3: rd &X. Hit in S; no bus transaction.
P2: rd &X. BusRd goes on the bus; P2 loads X=3 in S, supplied by a Flush1 from a sharing cache. This is referred to as cache-to-cache transfer in the Illinois MESI protocol.


MESI Example (Cache-to-Cache Transfer)

Proc Action   State P1   State P2   State P3   Bus Action      Data From
R1            E          -          -          BusRd           Mem
W1            M          -          -          -               Own cache
R3            S          -          S          BusRd/Flush     P1 cache
W3            I          -          M          BusRdX          Mem
R1            S          -          S          BusRd/Flush     P3 cache
R3            S          -          S          -               Own cache
R2            S          S          S          BusRd/Flush1    P1/P3 cache*

* Data from memory if no cache-to-cache transfer (BusRd/-)


MESI Example (Cache-to-Cache Transfer+BusUpgr)

Proc Action   State P1   State P2   State P3   Bus Action      Data From
R1            E          -          -          BusRd           Mem
W1            M          -          -          -               Own cache
R3            S          -          S          BusRd/Flush     P1 cache
W3            I          -          M          BusUpgr         Own cache
R1            S          -          S          BusRd/Flush     P3 cache
R3            S          -          S          -               Own cache
R2            S          S          S          BusRd/Flush1    P1/P3 cache*

* Data from memory if no cache-to-cache transfer (BusRd/-)
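The state and bus-action columns of this example can be reproduced with a small simulation. This is a minimal sketch under stated assumptions, not the course's simulator: `mesi_step` is a hypothetical helper, it tracks a single block, it assumes cache-to-cache transfer with BusUpgr, and it treats "-" (never cached) like Invalid.

```python
INVALID = ("-", "I")   # "-" = never cached, "I" = invalidated; both behave as Invalid


def mesi_step(states, proc, op):
    """Apply a read ('R') or write ('W') by `proc`; mutate `states`, return the bus action."""
    mine = states[proc]
    holders = [p for p in states if p != proc and states[p] not in INVALID]
    if op == "R":
        if mine not in INVALID:
            return "-"                                # read hit: no bus transaction
        owner_flush = any(states[p] in ("M", "E") for p in holders)
        for p in holders:
            states[p] = "S"                           # every valid copy ends up Shared
        states[proc] = "S" if holders else "E"        # Shared line decides S vs. E
        if not holders:
            return "BusRd"
        return "BusRd/Flush" if owner_flush else "BusRd/Flush1"
    if op == "W":
        if mine in ("M", "E"):
            states[proc] = "M"                        # silent upgrade
            return "-"
        for p in holders:
            states[p] = "I"                           # invalidate all other copies
        states[proc] = "M"
        return "BusUpgr" if mine == "S" else "BusRdX"
    raise ValueError(op)


states = {"P1": "-", "P2": "-", "P3": "-"}
accesses = [("P1", "R"), ("P1", "W"), ("P3", "R"), ("P3", "W"),
            ("P1", "R"), ("P3", "R"), ("P2", "R")]
trace = []
for proc, op in accesses:
    bus = mesi_step(states, proc, op)
    trace.append((op + proc[1], states["P1"], states["P2"], states["P3"], bus))
for row in trace:
    print(*row)
```

Each printed row gives the action, the three cache states, and the bus action, in the same column order as the table.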


Lower-Level Protocol Choices

Who supplies data on a miss when the block is not in M state: memory or cache?
Original Illinois MESI: cache
  Assumes cache is faster than memory (cache-to-cache transfer)
  Not necessarily true
Adds complexity
  How does memory know it should supply data? (It must wait for the caches.)
  Selection algorithm needed if multiple caches have valid data
Valuable for distributed memory
  May be cheaper to obtain from a nearby cache than from distant memory
  Especially when constructed out of SMP nodes (Stanford DASH)



Dragon Writeback Update Protocol

Four states
  Exclusive-clean (E): I and memory have it
  Shared-clean (Sc): I, others, and maybe memory, but I'm not the owner
  Shared-modified (Sm): I and others but not memory, and I'm the owner
    Sm and Sc can coexist in different caches, with at most one Sm
  Modified or dirty (M): I and no one else
  On replacement: Sc can silently drop, Sm has to flush
No invalid state
  If in cache, cannot be invalid
  If not present in cache, can view as being in a not-present or invalid state
New processor events: PrRdMiss, PrWrMiss
  Introduced to specify actions when the block is not present in the cache
New bus transaction: BusUpd
  Broadcasts the single word written on the bus; updates other relevant caches


Dragon State Transition Diagram

[Figure: complete Dragon state transition diagram over states E, Sc, Sm, M. Arc labels as extracted: PrRd/-, PrWr/-, PrRdMiss/BusRd(S), PrWrMiss/BusRd(S), PrWrMiss/(BusRd(S); BusUpd), PrWr/BusUpd(S), BusRd/-, BusRd/Flush, BusUpd/Update.]


Dragon: Processor-Initiated Transactions

[Figure: Dragon state diagram showing processor-initiated transitions only. States: E, M, Sc, Sm. Arc labels: PrRd/-, PrWr/-, PrRdMiss/BusRd(~S), PrRdMiss/BusRd(S), PrWrMiss/(BusRd(S);BusUpd), PrWr/BusUpd(S), PrWr/BusUpd(~S).]
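The processor side of Dragon can be sketched as a function of the current state, the event, and the Shared signal. The arc assignment is an assumption based on the usual description of the protocol, in particular that a write miss with the Shared line deasserted goes to M; `dragon_processor_event` is a hypothetical name.

```python
def dragon_processor_event(state, event, shared):
    """Return (next_state, [bus transactions]) for one processor event.

    `state` is ignored for PrRdMiss/PrWrMiss, where the block is not in the cache.
    `shared` is the value of the Shared line sampled during the bus transaction.
    """
    if event == "PrRdMiss":
        return ("Sc" if shared else "E"), ["BusRd"]
    if event == "PrWrMiss":
        # With sharers, the written word is also broadcast with BusUpd.
        return ("Sm", ["BusRd", "BusUpd"]) if shared else ("M", ["BusRd"])
    if event == "PrRd":
        return state, []                          # read hits never use the bus
    if event == "PrWr":
        if state in ("E", "M"):
            return "M", []                        # no other copies: silent write
        if state in ("Sc", "Sm"):
            # BusUpd(S): others still hold it -> Sm; BusUpd(~S): last copy -> M
            return ("Sm" if shared else "M"), ["BusUpd"]
    raise ValueError((state, event))


print(dragon_processor_event(None, "PrRdMiss", shared=False))
print(dragon_processor_event("Sc", "PrWr", shared=True))
```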


Dragon: Bus-Initiated Transactions

[Figure: Dragon state diagram showing bus-initiated transitions only. States: E, M, Sc, Sm. Arc labels: BusRd/-, BusRd/Flush, BusUpd/Update.]


Dragon Visualization

Setup: P1, P2, and P3 each have a cache and a snooper on a shared bus; a memory controller fronts main memory, which initially holds X=1.

P1: rd &X. BusRd goes on the bus; memory supplies X=1.
P1 now holds X=1 in E.
P1: wr &X (X=2). P1 goes E → M with no bus transaction. One less bus request due to the Exclusive state, especially for serial programs.
P3: rd &X. BusRd goes on the bus while P1 holds X=2 in M.
P1 supplies X=2 and goes M → Sm; P3 loads X=2 in Sc. Memory still holds X=1.
P3: wr &X (X=3). BusUpd goes on the bus: P3 goes Sc → Sm with X=3, and P1 updates its copy to X=3 and goes Sm → Sc. Note: BusUpd instead of BusUpgr (no invalidation is performed).
P1: rd &X. Hit in Sc with X=3. This is a miss in the MESI and MSI protocols.
P3: rd &X. Hit in Sm.
P2: rd &X. BusRd goes on the bus; P2 loads X=3 in Sc. Note: only the cache holding the block in Sm is responsible for the cache-to-cache transfer.
P1 replaces X: its Sc copy is dropped without a write-back.
P3 replaces X: the owner is responsible for writing back to memory, which becomes X=3. Compare MSI or MESI, where a write-back happens only when the line is in M state.


Dragon Example

Proc Action   State P1   State P2   State P3   Bus Action     Data From
R1            E          -          -          BusRd          Mem
W1            M          -          -          -              Own cache
R3            Sm         -          Sc         BusRd/Flush    P1 cache
W3            Sc         -          Sm         BusUpd/Upd     Own cache
R1            Sc         -          Sm         -              Own cache
R3            Sc         -          Sm         -              Own cache
R2            Sc         Sc         Sm         BusRd/Flush    P3 cache
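The same access sequence can be replayed in a small single-block model of Dragon. This is a sketch under simplifying assumptions rather than a full protocol model: `dragon_step` is a hypothetical helper, "-" stands for not-present, replacements are not modeled, and on a write miss the data supply by an owner is not shown in the returned bus action.

```python
def dragon_step(states, proc, op):
    """Apply a read ('R') or write ('W') by `proc`; mutate `states`, return the bus action."""
    mine = states[proc]
    holders = [p for p in states if p != proc and states[p] != "-"]
    if op == "R":
        if mine != "-":
            return "-"                            # read hit: no bus transaction
        owner = any(states[p] in ("M", "Sm") for p in holders)
        for p in holders:
            # M becomes the owner state Sm, E becomes Sc; Sc and Sm are unchanged.
            states[p] = {"M": "Sm", "E": "Sc"}.get(states[p], states[p])
        states[proc] = "Sc" if holders else "E"
        return "BusRd/Flush" if owner else "BusRd"
    if op == "W":
        if mine in ("E", "M"):
            states[proc] = "M"                    # no sharers: silent write
            return "-"
        for p in holders:
            states[p] = "Sc"                      # sharers take the update; writer becomes owner
        states[proc] = "Sm" if holders else "M"
        if mine == "-":
            return "BusRd;BusUpd" if holders else "BusRd"
        return "BusUpd/Upd" if holders else "BusUpd"
    raise ValueError(op)


states = {"P1": "-", "P2": "-", "P3": "-"}
accesses = [("P1", "R"), ("P1", "W"), ("P3", "R"), ("P3", "W"),
            ("P1", "R"), ("P3", "R"), ("P2", "R")]
trace = []
for proc, op in accesses:
    bus = dragon_step(states, proc, op)
    trace.append((op + proc[1], states["P1"], states["P2"], states["P3"], bus))
for row in trace:
    print(*row)
```

Unlike the MESI run of the same sequence, the R1 after W3 is a hit here, because the write updated P1's copy instead of invalidating it.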


Lower-Level Protocol Choices

Can the shared-modified state be eliminated?
  If memory is updated as well on BusUpd transactions (DEC Firefly)
  Dragon protocol doesn't (assumes DRAM memory is slow to update)
Should replacement of an Sc block be broadcast?
  Would allow last copy to go to Exclusive state and not generate updates
  Replacement bus transaction is not in critical path, later update may be
Shouldn't update local copy on write hit before controller gets bus
  Can mess up serialization
Coherence, consistency considerations much like write-through case
In general, many subtle race conditions in protocols
But first, let's illustrate quantitative assessment at logical level



Assessing Protocol Tradeoffs

Methodology:
  Use simulator; choose parameters per earlier methodology (default 1 MB, 4-way cache, 64-byte block, 16 processors; 64K cache for some)
  Focus on frequencies, not end performance, for now
    Transcends architectural details, but not what we're really after
  Use idealized memory performance model to avoid changes of reference interleaving across processors with machine parameters
    Cheap simulation: no need to model contention
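For orientation, the default cache parameters translate into the following geometry. The helper below is a hypothetical sketch; it assumes power-of-two sizes, and it assumes the "64K" configuration means 64 KB with the same associativity and block size, which the slide does not state.

```python
def cache_geometry(size_bytes, ways, block_bytes):
    """Derive block count, set count, and address bit widths for a set-associative cache."""
    blocks = size_bytes // block_bytes
    sets = blocks // ways
    return {
        "blocks": blocks,
        "sets": sets,
        "offset_bits": block_bytes.bit_length() - 1,   # log2 for powers of two
        "index_bits": sets.bit_length() - 1,
    }


default = cache_geometry(1 << 20, ways=4, block_bytes=64)   # 1 MB, 4-way, 64-byte blocks
small = cache_geometry(64 << 10, ways=4, block_bytes=64)    # assumed 64 KB variant
print(default)
print(small)
```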


Impact of Protocol Optimizations

MSI ≈ MESI: the two generate about the same bus traffic for these workloads
Upgrades instead of read-exclusive helps
Same story when working sets don't fit for Ocean, Radix, Raytrace

[Figure: two bar charts titled "MESI vs. MSI (w/ BusUpgr) vs. MSI (w/ BusRdX)", showing traffic (MB/s) split into address-bus and data-bus traffic. Bars are labeled workload/protocol, with Ill for Illinois MESI, 3St for three-state MSI with BusUpgr, and 3St-RdEx for three-state MSI with BusRdX. Workloads: Barnes, LU, Ocean, Radiosity, Radix, Raytrace, Appl-Code, Appl-Data, OS-Code, OS-Data.]


Impact of Cache-Block Size

Multiprocessors add a new kind of miss to cold, capacity, conflict
  Coherence misses: due to invalidations
    True sharing: write to the same word
    False sharing: write to different words in the same block
Reducing misses architecturally in an invalidation protocol
  Capacity: enlarge cache; increase block size (if spatial locality)
  Conflict: increase associativity
  Cold and coherence: only block size
Increasing block size has advantages and disadvantages
  Can reduce misses if spatial locality is good
  Can hurt too
    Increases misses due to false sharing if spatial locality is not good
    Increases misses due to conflicts in a fixed-size cache
    Increases traffic due to fetching unnecessary data and due to false sharing
    Can increase miss penalty and perhaps hit cost
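The distinction between true and false sharing, and why larger blocks create more false sharing, can be illustrated with a toy classifier. Everything here is hypothetical: the function name, the 4-byte word size, and the example addresses are chosen only for illustration.

```python
def classify_coherence_miss(written_addr, read_addr, block_bytes, word_bytes=4):
    """Classify the miss a reader takes after another processor's write.

    The reader is assumed to have cached the block containing `read_addr`
    before the other processor wrote to `written_addr`.
    """
    if written_addr // block_bytes != read_addr // block_bytes:
        return "no coherence miss"     # different blocks: the write invalidated nothing here
    if written_addr // word_bytes == read_addr // word_bytes:
        return "true sharing"          # the reader needs the word that was written
    return "false sharing"             # same block, different word


# Two words 32 bytes apart: whether they collide depends only on the block size.
for block in (16, 32, 64, 128):
    print(block, classify_coherence_miss(0x100, 0x120, block))
print(classify_coherence_miss(0x100, 0x100, 64))
```

With 16- and 32-byte blocks the two words fall in different blocks; from 64 bytes up they share a block, so the same pair of accesses starts to produce false-sharing misses.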


Impact of Block Size on Miss Rate

For default problem size: vary block/line size from 8 to 256 bytes

• Decreases with larger lines: cold, capacity (due to spatial locality), true sharing (due to spatial locality)
• Increases with larger lines: false sharing
• Working set doesn't fit: impact of capacity misses large (Ocean, Radix)

[Figure: bar charts of miss rate (%) versus block size (8, 16, 32, 64, 128, 256 bytes) for Barnes, LU, Radiosity, Ocean, Radix, and Raytrace, broken down into cold, capacity, true-sharing, false-sharing, and upgrade misses.]


Impact of Block Size on Traffic

Results different than for miss rate: traffic almost always increases
When working set fits, overall traffic still small, except for Radix
Fixed overhead is a significant component
  So total traffic is often minimized at a 16-32 byte block, not smaller
Working set doesn't fit: even a 128-byte block is good for Ocean due to capacity
Address bus traffic behaves in the opposite way from data bus traffic

Traffic (bytes/inst) affects performance indirectly through contention

[Figure: bar charts of traffic versus block size (8 to 256 bytes), split into address-bus and data-bus traffic. Axis labels: Traffic (bytes/instruction) and Traffic (bytes/FLOP). Workloads: Barnes, Radiosity, Raytrace, Radix, LU, and Ocean.]