Cache Coherence Techniques for Multicore Processors Dissertation Defense Mike Marty 12/19/2007.


Cache Coherence Techniques for Multicore Processors

Dissertation Defense

Mike Marty

12/19/2007


Key Contributions

Trend: Multicore ring interconnects emerging

Challenge: Order of ring != order of bus

Contribution: New protocol exploits ring order

Trend: Multicore now the basic building block

Challenge: Hierarchical coherence for Multiple-CMP is complex

Contribution: DirectoryCMP and TokenCMP

Trend: Workload consolidation w/ space sharing

Challenge: Physical hierarchies often do not match workloads

Contribution: Virtual Hierarchies


Outline

Introduction and Motivation
• Multicore Trends

Virtual Hierarchies
• Focus of presentation

Multiple-CMP Coherence

Ring-based Coherence

Conclusion


Is SMP + On-chip Integration == Multicore?

[Figure: one multicore chip with four cores (P0-P3), each with a private cache ($), connected by a bus to a memory controller]


Multicore Trends

Trend: On-chip Interconnect
• Competes for same resources as cores, caches
• Ring an emerging multicore interconnect

[Figure: multicore chip with four cores (P0-P3), each with a private cache ($), connected by a bus to a memory controller]


Multicore Trends

Trend: latency/bandwidth tradeoffs
• Increasing on-chip wire delay, memory latency
• Coherence protocol interacts with shared-cache hierarchy

[Figure: multicore chip with cores P0-P3, their caches ($), and two shared caches (Shared $), connected by a bus to a memory controller]


Multicore Trends

Trend: Multicore is the basic building block
• Multiple-CMP systems instead of SMPs
• Hierarchical systems required

[Figure: four multicore chips, each with four cores (P0-P3), private caches ($), a bus, and a memory controller]


Multicore Trends

Trend: Workload Consolidation w/ Space Sharing
• More cores, more workload consolidation
• Space sharing instead of time sharing
• Opportunities to optimize caching, coherence

[Figure: multicore chip with four cores (P0-P3), a bus, and a memory controller, space-shared among VM 1, VM 2, and VM 3]


Outline

Introduction and Motivation

Virtual Hierarchies
• Focus of presentation

Multiple-CMP Coherence

Ring-based Coherence

Conclusion

[ISCA 2007, IEEE Micro Top Pick 2008]


Virtual Hierarchy Motivations

Space-sharing

Server (workload) consolidation

Tiled architectures

[Figure: tiled chip space-shared among APP 1, APP 2, APP 3, and APP 4]

Motivation: Server Consolidation

[Figure: 64-core CMP built from tiles (each with a core, L1, and L2 cache) and space-shared by five consolidated workloads: www server, database server #1, database server #2, middleware server #1, middleware server #1. The slide is shown in several builds, each highlighting one goal.]

Optimize Performance
[Figure: blocks labeled "data" highlighted on the chip]

Isolate Performance

Dynamic Partitioning

Inter-VM Sharing
[Figure: a block labeled "data" highlighted on the chip]
• VMware’s content-based page sharing: up to 60% reduced memory


Outline

Introduction and Motivation

Virtual Hierarchies
• Expanded Motivation
• Non-hierarchical approaches
• Proposed Virtual Hierarchies
• Evaluation
• Related Work

Ring-based and Multiple-CMP Coherence

Conclusion


Tiled Architecture Memory System

[Figure: tiled chip; each tile holds a core, L1, and L2 cache; a memory controller is attached]

global broadcast too expensive


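TAG-DIRECTORY

[Figure: one tile performs Read A while another tile holds A. Numbered messages: (1) getM A to the duplicate tag directory, (2) fwd to the tile holding A, (3) data to the requester]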


STATIC-BANK-DIRECTORY

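[Figure: one tile performs Read A while another tile holds A. Numbered messages: (1) getM A to the home tile, (2) fwd to the tile holding A, (3) data to the requester]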


STATIC-BANK-DIRECTORY

with hypervisor-managed cache

[Figure: one tile performs Read A while another tile holds A. Numbered messages: (1) getM A, (2) fwd, (3) data]


Goals

Protocols compared:
(1) {STATIC-BANK, TAG}-DIRECTORY
(2) STATIC-BANK-DIRECTORY w/ hypervisor-managed cache

Goal                          (1)   (2)
Optimize Performance          No    Yes
Isolate Performance           No    Yes
Allow Dynamic Partitioning    Yes   ?
Support Inter-VM Sharing      Yes   Yes
Hypervisor/OS Simplicity      Yes   No


Outline

Introduction and Motivation

Virtual Hierarchies
• Expanded Motivation
• Non-hierarchical approaches
• Proposed Virtual Hierarchies
• Evaluation
• Related Work

Ring-based and Multiple-CMP Coherence

Conclusion


Virtual Hierarchies

Key Idea: Overlay 2-level Cache & Coherence Hierarchy

- First level harmonizes with VM/Workload

- Second level allows inter-VM sharing, migration, reconfig


VH: First-Level Protocol

Goals:
• Exploit locality from space affinity
• Isolate resources

Strategy: Directory protocol
• Interleave directories across first-level tiles
• Store L2 block at first-level directory tile

Questions:
• How to name directories?
• How to name sharers?

[Figure: getM and INV messages between tiles]


VH: Naming First-level Directory

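Select Dynamic Home Tile with VM Config Table
• Hardware VM Config Table at each tile
• Set by hypervisor during scheduling

Example:
[Figure: six address bits above the block offset (here 000101) index the per-tile, 64-entry VM Config Table. Entries 0, 1, 2, 3, 4, 5, ..., 63 hold p12, p13, p14, p12, p13, p14, ..., p12, so index 5 selects Dynamic Home Tile p14. The VM occupies tiles p12, p13, and p14; each tile holds a core, L1, and L2 cache.]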
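The lookup in the example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 6-bit block offset (64-byte blocks) and the helper names `make_vm_config_table` and `dynamic_home_tile` are hypothetical and do not come from the slides; only the 64-entry table, the 6-bit index, and the p12/p13/p14 pattern do.

```python
NUM_ENTRIES = 64        # one table entry per 6-bit index value, as in the example
BLOCK_OFFSET_BITS = 6   # assumption: 64-byte cache blocks


def make_vm_config_table(vm_tiles):
    """Fill the table by cycling through the tiles assigned to the VM."""
    return [vm_tiles[i % len(vm_tiles)] for i in range(NUM_ENTRIES)]


def dynamic_home_tile(address, table):
    """Index the table with the six address bits just above the block offset."""
    index = (address >> BLOCK_OFFSET_BITS) & (NUM_ENTRIES - 1)
    return table[index]


# A VM scheduled onto tiles p12, p13, p14 (hypothetical hypervisor setup).
table = make_vm_config_table([12, 13, 14])
address = (0b000101 << BLOCK_OFFSET_BITS) | 0x2A  # index bits 000101, any offset
print(dynamic_home_tile(address, table))
```

With index bits 000101 the lookup returns tile 14, matching "Home Tile: p14" in the example. Because the table is per-tile hardware state written by the hypervisor, rescheduling a VM only requires rewriting the table entries.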


VH: Dynamic Home Tile Actions

Dynamic Home Tile either:
• Returns data cached at L2 bank
• Generates forwards/invalidates
• Issues second-level request

Stable First-level States (a subset):
• Typical: M, E, S, I
• Atypical:
  ILX: L2 Invalid, points to exclusive tile
  SLS: L2 Shared, other tiles share
  SLSX: L2 Shared, other tiles share, exclusive to first level

[Figure: a getM request arriving at the dynamic home tile]


VH: Naming First-level Sharers

Any tile can share the block

Solution: full bit-vector
• 64 bits for 64-tile system
• Names multiple sharers or single exclusive

Alternatives:
• First-level broadcast
• (Dynamic) coarse granularity

[Figure: getM and INV messages between tiles]
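A full bit-vector of sharers can be sketched as one 64-bit integer per block. The class and method names below are hypothetical and chosen for illustration; the slides specify only that there is one bit per tile.

```python
NUM_TILES = 64


class SharerVector:
    """One bit per tile; bit t is set when tile t holds a copy of the block."""

    def __init__(self):
        self.bits = 0

    def add(self, tile):
        self.bits |= 1 << tile

    def remove(self, tile):
        self.bits &= ~(1 << tile)

    def sharers(self):
        return [t for t in range(NUM_TILES) if (self.bits >> t) & 1]

    def invalidation_targets(self, requester):
        """Tiles that must receive INV when `requester` issues a getM."""
        return [t for t in self.sharers() if t != requester]


vector = SharerVector()
for tile in (12, 14, 63):
    vector.add(tile)
print(vector.invalidation_targets(requester=14))
```

The same vector names a single exclusive tile when exactly one bit is set, which is why the slide lists it as covering both cases.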


Virtual Hierarchies

Two Solutions for Global Coherence: VHA and VHB

[Figure: tiled chip with memory controller(s)]


Protocol VHA

Directory as Second-level Protocol
• Any tile can act as first-level directory
• How to track and name first-level directories?

Full bit-vector of sharers to name any tile
• State stored in DRAM
• Possibly cache on-chip

+ Maximum scalability, message efficiency
- DRAM State (~12.5% overhead)
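One way to arrive at the roughly 12.5% figure is to divide the bits of directory state by the bits of data per block. The sketch below assumes 64-byte memory blocks, which the slide does not state; the function name is hypothetical.

```python
NUM_TILES = 64     # one sharer bit per tile in the full bit-vector
BLOCK_BYTES = 64   # assumption: 64-byte memory blocks (not stated on the slide)


def directory_overhead(num_tiles, block_bytes):
    """Directory state bits divided by data bits for one block."""
    return num_tiles / (block_bytes * 8)


print(f"{directory_overhead(NUM_TILES, BLOCK_BYTES):.1%}")
```

Under this assumption a 64-bit vector per 512-bit block gives 64/512 = 12.5%; any additional state bits would raise the figure slightly.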


VHA Example

[Figure: VHA example with six numbered steps involving tiles holding A and the directory/memory controller: getM A requests, Fwd messages, and data returned to the requester]


VHA: Handling Races

Blocking Directories
• Handles races within same protocol
• Requires blocking buffer + wakeup/replay logic

Inter-Intra Races
• Naïve blocking leads to deadlock!

[Figure: directories blocked on A while getM A and FWD A messages wait on each other]


VHA: Handling Races (cont.)

Possible Solution:
• Always handle second-level message at first-level
• But this causes explosion of state space

Second-level may interrupt first-level actions:
• First-level indirections, invalidations, writebacks

[Figure: directories blocked on A while getM A and FWD A messages wait on each other]


VHA: Handling Races (cont.)

Reduce the state-space explosion w/ Safe States:
• Subset of transient states
• Immediately handle second-level message
• Limit concurrency between protocols

Algorithm:
• Level-one requests either complete, or enter a safe state before issuing a level-two request
• Level-one directories handle level-two forwards when a safe state is reached (they may stall)
• Level-two requests eventually handled by level-two directory
• Completion messages unblock directories


Virtual Hierarchies

Two Solutions for Global Coherence: VHA and VHB

[Figure: tiled chip with memory controller(s)]


Protocol VHB

Broadcast as Second-level Protocol
• Locate first-level directory tiles
• Memory controller tracks outstanding second-level requestor

Attach token count for each block
• T tokens for each block. One token to read, all to write
• Allows 1-bit at memory per block
• Eliminates system-wide ACK responses
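The token rules above reduce to two permission checks plus a one-bit summary at memory. The sketch below is illustrative only: the value T = 64 and all function names are assumptions, since the slides say only "T tokens".

```python
T = 64  # assumption: tokens per block; the slides only say "T tokens"


def can_read(tokens_held):
    """A holder may read the block with at least one token."""
    return tokens_held >= 1


def can_write(tokens_held, total=T):
    """A holder may write only when it holds every token for the block."""
    return tokens_held == total


def memory_bit(tokens_at_memory, total=T):
    """Memory holds all tokens or none, so a single bit per block suffices."""
    if tokens_at_memory not in (0, total):
        raise ValueError("memory must hold all tokens or none")
    return 1 if tokens_at_memory == total else 0


print(can_read(1), can_write(1), can_write(T), memory_bit(T))
```

Because a writer must hold all T tokens, no other copy can exist at that moment, which is how token counting removes the need for system-wide acknowledgements.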


Protocol VHB: Token Coalescing

Memory logically holds all or none of the tokens:
• Enables 1-bit token count

Replacing tile sends tokens to memory controller:
• Message usually contains all tokens

Process:
• Tokens held in Token Holding Buffer (THB)
• FIND broadcast initiated to locate other first-level directory with tokens
• First-level directories respond to THB, tokens sent
• Repeat for race
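The coalescing process can be sketched as a loop that keeps collecting tokens until the buffer holds all of them. This is a simplified model under assumptions: the function name, the dictionary of per-directory token counts, and the `max_rounds` bound are hypothetical, and message timing is not modeled.

```python
def coalesce_tokens(evicted_tokens, directory_tokens, total, max_rounds=4):
    """Collect all `total` tokens for a block in a Token Holding Buffer (THB).

    evicted_tokens: tokens sent to the memory controller by the replacing tile.
    directory_tokens: dict of first-level directory tile -> tokens it holds.
    Returns the number of FIND broadcast rounds that were needed.
    """
    thb = evicted_tokens
    rounds = 0
    while thb < total and rounds < max_rounds:
        rounds += 1
        # FIND broadcast: every first-level directory sends the tokens it holds.
        for tile in directory_tokens:
            thb += directory_tokens[tile]
            directory_tokens[tile] = 0
    if thb != total:
        # Tokens may still be in flight; the real protocol repeats on a race.
        raise RuntimeError("tokens still outstanding")
    return rounds


holders = {3: 2, 7: 1}
print(coalesce_tokens(evicted_tokens=5, directory_tokens=holders, total=8))
```

In the common case the replacing tile already returns every token, so no FIND round is needed and memory can set its one-bit count immediately.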


VHB Example

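[Figure: VHB example with five numbered steps involving tiles holding A and the memory controller: getM A requests, a global getM A broadcast, a Fwd, and Data+tokens returned to the requester]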


Goals

Protocols compared:
(1) {DRAM, STATIC-BANK, TAG}-DIRECTORY
(2) STATIC-BANK-DIRECTORY w/ hypervisor-managed cache
(3) Virtual Hierarchies: VHA and VHB

Goal                          (1)   (2)   (3)
Optimize Performance          No    Yes   Yes
Isolate Performance           No    Yes   Yes
Allow Dynamic Partitioning    Yes   ?     Yes
Support Inter-VM Sharing      Yes   Yes   Yes
Hypervisor/OS Simplicity      Yes   No    Yes


VHNULL

Are two levels really necessary?

VHNULL: first level only

Implications:
• Many OS modifications for single-OS environment
• Dynamic Partitioning requires cache flushes
• Inter-VM Sharing difficult
• Hypervisor complexity increases
• Requires atomic updates of VM Config Tables
• Limits optimized placement policies


VH: Capacity/Latency Trade-off

Maximize Capacity
• Store only L2 copy at dynamic home tile
• But, L2 access time penalized
  • Especially for large VMs

Minimize L2 access latency/bandwidth:
• Replicate data in local L2 slice
• Selective/Adaptive Replication well-studied
  ASR [Beckmann et al.], CC [Chang et al.]
• But, dynamic home tile still needed for first-level

Can we exploit virtual hierarchy for placement?


VH: Data Placement Optimization Policy

Data from memory placed in tile’s local L2 bank
• Tag not allocated at dynamic home tile

Use second-level coherence on first sharing miss
• Then allocate tag at dynamic home tile for future sharing misses

Benefits:
• Private data allocates in tile’s local L2 bank
• Overhead of replicating data reduced
• Fast, first-level sharing for widely shared data


Outline

Introduction and Motivation

Virtual Hierarchies
• Expanded Motivation
• Non-hierarchical approaches
• Proposed Virtual Hierarchies
• Evaluation
• Related Work

Ring-based and Multiple-CMP Coherence

Conclusion


VH Evaluation Methods

Wisconsin GEMS

Target System: 64-core tiled CMP
• In-order SPARC cores
• 1 MB, 16-way L2 cache per tile, 10-cycle access
• 2D mesh interconnect, 16-byte links, 5-cycle link latency
• Eight on-chip memory controllers, 275-cycle DRAM latency


VH Evaluation: Simulating Consolidation

Challenge: bring-up of consolidated workloads

Solution: approximate virtualization
• Combine existing Simics checkpoints

[Figure: a script combines 8p checkpoints (Memory0; P0-P7; PCI0, DISK0) into one 64p checkpoint (P0-P63) whose devices are renamed per VM: VM0_Memory0, VM0_PCI0, VM0_DISK0; VM1_Memory0, VM1_PCI0, VM1_DISK0]


VH Evaluation: Simulating Consolidation

At simulation-time, Ruby handles mapping:
• Converts <Processor ID, 32-bit Address> to <36-bit address>
• Schedules VMs to adjacent cores by sending Simics requests to appropriate L1 controllers
• Memory controllers evenly interleaved

Bottom-line:
• Static scheduling
• No hypervisor execution simulated
• No content-based page sharing
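The slides do not say how the <Processor ID, 32-bit Address> pair becomes a 36-bit address. One plausible scheme, shown here purely as an assumption, places a VM number derived from the processor ID in the four bits above the guest address. The constant `PROCS_PER_VM` and the function name are hypothetical.

```python
PROCS_PER_VM = 8      # assumption: the 8x8p configuration (8 VMs of 8 processors)
GUEST_ADDR_BITS = 32  # each VM's checkpoint uses 32-bit physical addresses


def to_global_address(processor_id, guest_address):
    """Place the VM number in the bits above the 32-bit guest address."""
    if not 0 <= guest_address < (1 << GUEST_ADDR_BITS):
        raise ValueError("guest address must fit in 32 bits")
    vm_id = processor_id // PROCS_PER_VM
    return (vm_id << GUEST_ADDR_BITS) | guest_address


# Processor 9 belongs to VM 1 under this assumed numbering.
print(hex(to_global_address(9, 0x1000)))
```

Four extra bits are enough to keep up to 16 VMs disjoint, so identical guest addresses from different VMs never alias in the simulated memory system.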


VH Evaluation: Workloads

OLTP, SpecJBB, Apache, Zeus
• Separate instance of Solaris for each VM

Homogenous Consolidation
• Simulate same-size workload N times
• Unit of work identical across all workloads
• (each workload staggered by 1,000,000+ instructions)

Heterogeneous Consolidation
• Simulate different-size, different workloads
• Cycles-per-Transaction for each workload


VH Evaluation: Baseline Protocols

DRAM-DIRECTORY:
• 1 MB directory cache per controller
• Each tile nominally private, but replication limited

TAG-DIRECTORY:
• 3-cycle central tag directory (1024 ways). Non-pipelined
• Replication limited

STATIC-BANK-DIRECTORY
• Home tiles interleave by frame address
• Home tile stores only L2 copy


VH Evaluation: VHA and VHB Protocols

VHA
• Based on DirectoryCMP implementation
• Dynamic Home Tile stores only L2 copy

VHB with optimizations
• Private data placement optimization policy (shared data stored at home tile, private data is not)
• Can violate inclusiveness (evict L2 tag w/ sharers)
• Memory data returned directly to requestor


Micro-benchmark: Sharing Latency

[Figure: average sharing latency (cycles, 0 to 120) versus processors per VM (0 to 60) for Dram-Dir, Static-Bank-Dir, Tag-Dir, VH_A, and VH_B]


Result: Runtime for 8x8p Homogenous Consolidation

[Figure: bar chart (y-axis 0 to 1.2) for OLTP, Apache, Zeus, and SpecJBB, each with bars for DRAM-DIR, STATIC-BANK-DIR, TAG-DIR, VH A, and VH B]


Result: Memory Stall Cycles for 8x8p Homogenous Consolidation

[Figure: stacked bar chart of Normalized Memory Stall Cycles (0 to 1.2), broken into Off-chip, Local L2, Remote L1, and Remote L2, for OLTP, Apache, Zeus, and SpecJBB, each with bars for DRAM-DIR, STATIC-BANK-DIR, TAG-DIR, VH A, and VH B]


Result: Runtime for 16x4p Homogenous Consolidation

[Figure: bar chart (y-axis 0 to 1.2) for OLTP, Apache, Zeus, and SpecJBB, each with bars for DRAM-DIR, STATIC-BANK-DIR, TAG-DIR, VH A, and VH B]


Result: Runtime for 4x16p Homogenous Consolidation

[Figure: bar chart (y-axis 0 to 1.2) for OLTP, Apache, Zeus, and SpecJBB, each with bars for DRAM-DIR, STATIC-BANK-DIR, TAG-DIR, VH A, and VH B]


Result: Heterogeneous Consolidation mixed1 configuration

[Figure: bar chart of cycles-per-transaction (CPT, y-axis 0 to 1.2) for VM0: Apache, VM1: Apache, VM2: OLTP, VM3: OLTP, VM4: JBB, VM5: JBB, and VM6: JBB, each with bars for Dram-Dir, Static-Bank-Dir, Tag-Dir, VH_A, and VH_B]


Result: Heterogeneous Consolidation mixed2 configuration

[Figure: bar chart of cycles-per-transaction (CPT, y-axis 0 to 1.2) for VM0-VM3: Apache and VM4-VM7: OLTP, each with bars for Dram-Dir, Static-Bank-Dir, Tag-Dir, VH_A, and VH_B]


Effect of Replication

                   Apache 8x8p   OLTP 8x8p   Zeus 8x8p   JBB 8x8p
DRAM-DIR           19.7%         14.4%       9.29%       0.06%
STATIC-BANK-DIR    -33.0%        3.31%       -7.02%      -11.2%
TAG-DIR            1.27%         3.91%       1.63%       -0.22%
VHA                n/a           n/a         n/a         n/a
VHB                -11.0%        -5.22%      -0.98%      -0.12%

Treat tile’s L2 bank as private


Outline

Introduction and Motivation

Virtual Hierarchies
• Expanded Motivation
• Non-hierarchical approaches
• Proposed Virtual Hierarchies
• Evaluation
• Related Work

Ring-based and Multiple-CMP Coherence

Conclusion


Virtual Hierarchies: Related Work

Commercial systems usually support partitioning

Sun (Starfire and others)
• Physical partitioning
• No coherence between partitions

IBM’s LPAR
• Logical partitions, time-slicing of processors
• Global coherence, but doesn’t optimize space-sharing


Virtual Hierarchies: Related Work

Systems Approaches to Space Affinity:
• Cellular Disco, Managing L2 via OS [Cho et al.]

Shared L2 Cache Partitioning
• Way-based, replacement-based
• Molecular Caches (~VHNULL)

Cache Organization and Replication
• D-NUCA, NuRapid, Cooperative Caching, ASR

Quality-of-Service
• Virtual Private Caches [Nesbit et al.]
• More


Virtual Hierarchies: Related Work

Coherence protocol implementations
• Token coherence w/ multicast
• Multicast snooping

Two-level directory
• Compaq Piranha
• Pruning caches [Scott et al.]


Summary: Virtual Hierarchies

Contribution: Virtual Hierarchy Idea
• Alternative to physical hard-wired hierarchies
• Optimize for space sharing and workload consolidation

Contribution: VHA and VHB implementations
• Two-level virtual hierarchy implementations

Published in ISCA 2007 and IEEE Micro Top Picks 2008


Outline

Introduction and Motivation

Virtual Hierarchies

Ring-based Coherence [MICRO 2006]
• Skip, 5-minute versions, or 15-minute versions?

Multiple-CMP Coherence [HPCA 2005]
• Skip, 5-minute versions, or 15-minute versions?

Conclusion


Contribution: Ring-based Coherence

Problem: Order of Bus != Order of Ring
• Cannot apply bus-based snooping protocols

Existing Solutions
• Use unbounded retries to handle contention
• Use a performance-costly ordering point

Contribution: RING-ORDER
• Exploits round-robin order of ring
• Fast and stable performance

Appears in MICRO 2006


Contribution: Multiple-CMP Coherence

Hierarchy now the default, increases complexity
• Most prior hierarchical protocols use bus-based nodes

Contribution: DirectoryCMP
• Two-level directory protocol

Contribution: TokenCMP
• Extend token coherence to Multiple-CMPs
• Flat for correctness, hierarchical for performance

Appears in HPCA 2005


Other Research and Contributions

Wisconsin GEMS
• ISCA ’05 tutorial, CMP development, release, support

Amdahl’s Law in the Multicore Era
• Mark D. Hill and Michael R. Marty, to appear in IEEE Computer

ASR: Adaptive Selective Replication for CMP Caches
• Beckmann et al., MICRO 2006

LogTM-SE: Decoupling Hardware Transactional Memory from Caches
• Yen et al., HPCA 2007


Key Contributions

Trend: Multicore ring interconnects emerging

Challenge: Order of ring != order of bus

Contribution: New protocol exploits ring order

Trend: Multicore now the basic building block

Challenge: Hierarchical coherence for Multiple-CMP is complex

Contribution: DirectoryCMP and TokenCMP

Trend: Workload consolidation w/ space sharing

Challenge: Physical hierarchies often do not match workloads

Contribution: Virtual Hierarchies


Backup Slides


What about Physical Hierarchy / Clusters?

[Figure: sixteen processors (P), each with a private L1 cache (L1 $), grouped around four shared L2 caches]


Physical Hierarchy / Clusters

[Figure: sixteen processors (P), each with a private L1 cache (L1 $), grouped around four shared L2 caches; the www server, database server #1, and middleware server #1 workloads are mapped onto the clusters]

Interference between workloads in shared caches

Lots of prior work on partitioning single Shared L2 Cache


Protocol VHNULL

Example: Steps for VM Migration from Tiles {M} to {N}

1. Stop all threads on {M}

2. Flush {M} caches

3. Update {N} VM Config Tables

4. Start threads on {N}
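The four steps can be sketched as a small routine over a toy tile model. Everything here is hypothetical scaffolding for illustration (the `Tile` class, the dictionary standing in for memory, the round-robin table fill); the slides specify only the order of the steps.

```python
class Tile:
    """Toy model of a tile: a run flag, a cache, and a VM Config Table."""

    def __init__(self, tile_id):
        self.tile_id = tile_id
        self.running = False
        self.cache = {}            # address -> data
        self.vm_config_table = []


def migrate_vm(tiles, old_ids, new_ids, memory, table_entries=64):
    # 1. Stop all threads on {M}.
    for t in old_ids:
        tiles[t].running = False
    # 2. Flush {M} caches back to memory.
    for t in old_ids:
        memory.update(tiles[t].cache)
        tiles[t].cache.clear()
    # 3. Update {N} VM Config Tables so home tiles fall inside {N}.
    table = [new_ids[i % len(new_ids)] for i in range(table_entries)]
    for t in new_ids:
        tiles[t].vm_config_table = list(table)
    # 4. Start threads on {N}.
    for t in new_ids:
        tiles[t].running = True


tiles = {i: Tile(i) for i in range(4)}
tiles[0].running = tiles[1].running = True
tiles[0].cache[0x40] = "dirty block"
memory = {}
migrate_vm(tiles, old_ids=[0, 1], new_ids=[2, 3], memory=memory)
print(memory, tiles[2].vm_config_table[:4])
```

The flush in step 2 is what makes migration expensive without a second level: with only first-level coherence, no protocol exists to find blocks left behind on {M}.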


Protocol VHNULL

Example: Inter-VM Content-based Page Sharing
• Is read-only sharing possible with VHNULL?

VMware’s Implementation:
• Global hash table to store hashes of pages
• Guest pages scanned by VMM, hashes computed
• Full comparison of pages on hash match

Potential VHNULL Implementation:
• How does hypervisor scan guest pages? Are they modified in cache?
• Even read-only pages must initially be written at some point
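The hash-then-compare scheme listed under VMware's implementation can be sketched as follows. This is not VMware's code: the function name, the page representation, and the choice of SHA-1 are assumptions made for illustration, since the slides do not name a hash function.

```python
import hashlib


def share_pages(pages):
    """Map each page number to the canonical page that holds identical content.

    pages: dict of page number -> bytes.
    """
    table = {}    # global hash table: digest -> canonical page numbers
    mapping = {}
    for number, content in pages.items():
        digest = hashlib.sha1(content).digest()
        for candidate in table.get(digest, []):
            # A hash match is only a hint; confirm with a full comparison.
            if pages[candidate] == content:
                mapping[number] = candidate
                break
        else:
            table.setdefault(digest, []).append(number)
            mapping[number] = number
    return mapping


pages = {0: b"a" * 4096, 1: b"b" * 4096, 2: b"a" * 4096}
print(share_pages(pages))
```

The scan reads every guest page, which is where the VHNULL questions above come from: the hypervisor needs a coherent view of pages that may be modified in another VM's caches.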


5-minute Ring Coherence


Ring Interconnect

• Why?

Short, fast point-to-point links

Fewer (data) ports

Less complex than packet-switched

Simple, distributed arbitration

Exploitable ordering for coherence


Cache Coherence for a Ring

• Ring is broadcast and offers ordering

• Apply existing bus-based snooping protocols?

• NO!

• Order properties of ring are different


Ring Order != Bus Order

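[Figure: four-node ring (P12, P3, P6, P9) carrying messages A and B; one node observes the order {A, B} while another observes {B, A}]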
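A few lines of Python reproduce the effect in the figure. The model is an assumption made for illustration: a unidirectional ring, one hop per time step, both messages injected at the same time, and nodes numbered 0 to 3 rather than by the P labels in the figure.

```python
def observation_order(num_nodes, injections):
    """Return, for each node, the order in which it sees the messages.

    injections: list of (message, source_node), all injected at time 0 on a
    unidirectional ring that advances one hop per time step.
    """
    seen = {}
    for node in range(num_nodes):
        arrivals = []
        for message, source in injections:
            hops = (node - source) % num_nodes  # hops from source to this node
            arrivals.append((hops, message))
        seen[node] = [message for _, message in sorted(arrivals)]
    return seen


order = observation_order(4, [("A", 0), ("B", 2)])
print(order[1], order[3])
```

Node 1 sees A before B while node 3 sees B before A, so the ring delivers every message to every node but provides no single total order the way a bus does.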


Ring-based Coherence

Existing Solutions:

1. ORDERING-POINT
• Establishes total order
• Extra latency and control message overhead

2. GREEDY-ORDER
• Fast in common case
• Unbounded retries

Ideal Solution
• Fast for average case
• Stable for worst case (no retries)


New Approach: RING-ORDER

+ Requests complete in order of ring position
• Fully exploits ring ordering

+ Initial requests always succeed
• No retries, no ordering point
• Fast, stable, predictable performance

Key: Use token counting
• All tokens to write, one token to read

RING-ORDER Example

[Figure: 12-node ring (P1 through P12) with a legend for tokens and the priority token, shown in four builds:
1. P9 issues a getM for a store.
2. P9's getM circulates; FurthestDest = P9.
3. P6 issues a getM for its own store; FurthestDest = P9.
4. Both stores complete.]


Ring-based Coherence: Results Summary

System: 8-core with private L2s and shared L3

Key Results:

• RING-ORDER outperforms ORDERING-POINT by 7-86% with in-order cores

• RING-ORDER offers similar, or slightly better, performance than GREEDY-ORDER

• Pathological starvation did occur with GREEDY-ORDER


5-minute Multiple-CMP Coherence


Problem: Hierarchical Coherence

Inter-CMP Coherence

Intra-CMP Coherence

Intra-CMP protocol for coherence within CMP

Inter-CMP protocol for coherence between CMPs

Interactions between protocols increase complexity
• explodes state space, especially without bus

[Figure: four CMPs (CMP 1 through CMP 4) joined by an interconnect]


Hierarchical Coherence Example: Sun Wildfire

• First-level bus-based snooping protocol
• Second-level directory protocol
• Interface is key:
  • Accesses directory state
  • Asserts “bus ignore” signal if necessary
  • Replays bus request when second-level completes

[Figure: node with three CPUs and their caches ($) on a bus, together with memory and the interface]


Solution #1: DirectoryCMP

Two-level directory protocol for Multiple-CMPs
• Arbitrary interconnect ordering for on- and off-chip
• Non-nacking. Safe States to help resolve races
• Design of DirectoryCMP led to VHA

Advantages:
• Powerful, scalable, solid baseline

Disadvantages:
• Complex (~63 states at interface), not model-checked?
• Second-level indirections slow without directory cache


Improving Multiple CMP Systems with Token Coherence

• Token Coherence allows Multiple-CMP systems to be...
  • Flat for correctness, but
  • Hierarchical for performance

[Figure: four CMPs (CMP 1 through CMP 4) on an interconnect, with labels Correctness Substrate, Performance Protocol, Low Complexity, and Fast]


Solution #2: TokenCMP

Extend token coherence to Multiple-CMPs
• Flat for correctness, hierarchical for performance
• Enables model-checkable solution

Flat Correctness:
• Global set of T tokens, pass to individual caches
• End-to-end token counting
• Keep flat persistent request scheme


TokenCMP Performance Policies

TokenCMPA:
• Two-level broadcast
• L2 broadcasts off-chip on miss
• Local cache responds if it has extra tokens
• Responses from off-chip carry extra tokens

TokenCMPB:
• On-chip broadcast on L2 miss only (local indirection)

TokenCMPC: Extra states for further filtering

TokenCMPA-PRED: persistent request prediction


M-CMP Coherence: Summary of Results

System: Four 4-core CMPs

Notable Results:
• TokenCMP 2-32% faster than DirectoryCMP w/ in-order cores
• TokenCMPA, TokenCMPB, TokenCMPC all perform similarly
• Persistent request prediction greatly helps Zeus
• TokenCMP gains diminished with out-of-order cores