Cache Coherence Techniques for Multicore Processors Dissertation Defense Mike Marty 12/19/2007.

Cache Coherence Techniques for Multicore Processors

Dissertation Defense

Mike Marty

12/19/2007

2

Key Contributions

Trend: Multicore ring interconnects emerging

Challenge: Order of ring != order of bus

Contribution: New protocol exploits ring order

Trend: Multicore now the basic building block

Challenge: Hierarchical coherence for Multiple-CMP is complex

Contribution: DirectoryCMP and TokenCMP

Trend: Workload consolidation w/ space sharing

Challenge: Physical hierarchies often do not match workloads

Contribution: Virtual Hierarchies

3

Outline

Introduction and Motivation
• Multicore Trends

Virtual Hierarchies
• Focus of presentation

Multiple-CMP Coherence

Ring-based Coherence

Conclusion

4

Is SMP + On-chip Integration == Multicore?

[Diagram: multicore chip with cores P0-P3, each with a private cache ($), sharing a bus and a memory controller]

5

Multicore Trends

[Diagram: multicore chip with cores P0-P3, private caches ($), a shared bus, and a memory controller]

Trend: On-chip Interconnect
• Competes for same resources as cores, caches
• Ring an emerging multicore interconnect

6

Multicore Trends

[Diagram: multicore chip with cores P0-P3, caches ($), two "Shared $" caches, a bus, and a memory controller]

Trend: Latency/bandwidth tradeoffs
• Increasing on-chip wire delay, memory latency
• Coherence protocol interacts with shared-cache hierarchy

7

Multicore Trends

[Diagram: four multicore chips, each with cores P0-P3, private caches ($), a bus, and a memory controller]

Trend: Multicore is the basic building block
• Multiple-CMP systems instead of SMPs
• Hierarchical systems required

8

Multicore Trends

[Diagram: multicore chip (cores P0-P3, caches, bus, memory controller) space-shared by VM 1, VM 2, and VM 3]

Trend: Workload Consolidation w/ Space Sharing
• More cores, more workload consolidation
• Space sharing instead of time sharing
• Opportunities to optimize caching, coherence

9

Outline

Introduction and Motivation

Virtual Hierarchies
• Focus of presentation

Multiple-CMP Coherence


Conclusion

[ISCA 2007, IEEE Micro Top Pick 2008]

10

Virtual Hierarchy Motivations

Space-sharing

Server (workload) consolidation

Tiled architectures

[Diagram: tiled chip space-shared by applications APP 1 through APP 4]

11

64-core CMP

Motivation: Server Consolidation

www server

database server #1

database server #2

middleware server #1


Core

L2 Cache

L1

12

64-core CMP


www server

database server #1

database server #2



13

64-core CMP


www server

database server #1

database server #2



data

data

Optimize Performance

14

64-core CMP


www server

database server #1

database server #2



Isolate Performance

15

64-core CMP


www server

database server #1

database server #2



Dynamic Partitioning

16

64-core CMP


www server

database server #1

database server #2



data

Inter-VM Sharing

VMware’s content-based page sharing: up to 60% reduced memory

17

Outline


Virtual Hierarchies
• Expanded Motivation
• Non-hierarchical approaches
• Proposed Virtual Hierarchies
• Evaluation
• Related Work

Ring-based and Multiple-CMP Coherence

Conclusion

18

Tiled Architecture Memory System

[Diagram legend: Core, L1, L2 Cache, Memory Controller]

global broadcast too expensive

19

TAG-DIRECTORY

[Diagram: Read A is resolved in three messages through the duplicate tag directory: (1) getM A, (2) fwd, (3) data]

20

STATIC-BANK-DIRECTORY

[Diagram: Read A is resolved in three messages through the block's static home bank: (1) getM A, (2) fwd, (3) data]

21


STATIC-BANK-DIRECTORY with hypervisor-managed cache

[Diagram: Read A is resolved in three messages: (1) getM A, (2) fwd, (3) data]

22

Goals

                              {STATIC-BANK, TAG}-DIRECTORY   STATIC-BANK-DIRECTORY w/ hypervisor-managed cache
Optimize Performance          No                             Yes
Isolate Performance           No                             Yes
Allow Dynamic Partitioning    Yes                            ?
Support Inter-VM Sharing      Yes                            Yes
Hypervisor/OS Simplicity      Yes                            No

23

Outline




Conclusion

24

Virtual Hierarchies

Key Idea: Overlay 2-level Cache & Coherence Hierarchy

- First level harmonizes with VM/Workload

- Second level allows inter-VM sharing, migration, reconfig

25

VH: First-Level Protocol

Goals:
• Exploit locality from space affinity
• Isolate resources

Strategy: Directory protocol
• Interleave directories across first-level tiles
• Store L2 block at first-level directory tile

Questions:
• How to name directories?
• How to name sharers?

[Diagram: getM and INV messages among first-level tiles]

26

VH: Naming First-level Directory

Select Dynamic Home Tile with VM Config Table
• Hardware VM Config Table at each tile
• Set by hypervisor during scheduling

Example (VM on tiles p12, p13, p14):
Address: ……000101 | offset
The 6 index bits (000101) select an entry of the per-tile VM Config Table:
  entry:  0    1    2    3    4    5    …  63
  tile:   p12  p13  p14  p12  p13  p14  …  p12
Dynamic Home Tile: p14

[Diagram legend: Core, L1, L2 Cache]
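The lookup in this example can be sketched in a few lines of Python. This is an illustration, not the hardware design: the 64-entry table and the p12/p13/p14 contents come from the slide, while the 6-bit block offset (64-byte blocks), the position of the index bits directly above the offset, and the names `home_tile`, `OFFSET_BITS`, and `INDEX_BITS` are assumptions.

```python
OFFSET_BITS = 6   # assumption: 64-byte cache blocks
INDEX_BITS = 6    # 64-entry VM Config Table, as on the slide


def home_tile(address: int, vm_config_table: list[str]) -> str:
    """Return the dynamic home tile for an address (hypothetical helper)."""
    index = (address >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    return vm_config_table[index]


# The hypervisor fills the table by repeating the VM's tiles, so
# consecutive blocks interleave across the tiles assigned to the VM.
vm_tiles = ["p12", "p13", "p14"]
table = [vm_tiles[i % len(vm_tiles)] for i in range(1 << INDEX_BITS)]

address = (0b000101 << OFFSET_BITS) | 0x2A  # index bits 000101, arbitrary offset
print(home_tile(address, table))  # p14
```

Because every tile of the VM holds the same table, any requester in the VM computes the same home tile without consulting a global structure.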

27

VH: Dynamic Home Tile Actions

Dynamic Home Tile either:
• Returns data cached at L2 bank
• Generates forwards/invalidates
• Issues second-level request

Stable First-level States (a subset):
• Typical: M, E, S, I
• Atypical:
  - ILX: L2 Invalid, points to exclusive tile
  - SLS: L2 Shared, other tiles share
  - SLSX: L2 Shared, other tiles share, exclusive to first level

[Diagram: getM arriving at the dynamic home tile]

28

VH: Naming First-level Sharers

Any tile can share the block

Solution: full bit-vector
• 64 bits for 64-tile system
• Names multiple sharers or single exclusive

Alternatives:
• First-level broadcast
• (Dynamic) coarse granularity

[Diagram: getM and INV messages]
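A full bit-vector can be modeled as a plain integer with one bit per tile. The sketch below is illustrative only: the helper names (`add_sharer`, `sharers`, `invalidation_targets`) are hypothetical, and it ignores how the directory state distinguishes "many sharers" from "single exclusive".

```python
NUM_TILES = 64  # one bit per tile, as on the slide


def add_sharer(vector: int, tile: int) -> int:
    """Set the bit for `tile` in the sharer vector."""
    return vector | (1 << tile)


def remove_sharer(vector: int, tile: int) -> int:
    """Clear the bit for `tile` in the sharer vector."""
    return vector & ~(1 << tile)


def sharers(vector: int) -> list[int]:
    """List the tiles whose bits are set."""
    return [t for t in range(NUM_TILES) if (vector >> t) & 1]


def invalidation_targets(vector: int, requestor: int) -> list[int]:
    """Tiles that must receive INV when `requestor` issues getM."""
    return [t for t in sharers(vector) if t != requestor]


vector = 0
for tile in (12, 13, 14):
    vector = add_sharer(vector, tile)
print(invalidation_targets(vector, requestor=13))  # [12, 14]
```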

29

Virtual Hierarchies

Two Solutions for Global Coherence: VHA and VHB

memory controller(s)

30

Protocol VHA

Directory as Second-level Protocol
• Any tile can act as first-level directory
• How to track and name first-level directories?

Full bit-vector of sharers to name any tile
• State stored in DRAM
• Possibly cache on-chip

+ Maximum scalability, message efficiency
- DRAM state (~12.5% overhead)
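The ~12.5% figure is consistent with a 64-bit sharer vector stored alongside each 64-byte memory block. The block size is an assumption here (the slide does not state it), and the function name `directory_overhead` is made up for this sketch; any extra state bits per block are ignored.

```python
def directory_overhead(num_tiles: int, block_bytes: int) -> float:
    """Fraction of extra DRAM for one sharer bit per tile per block."""
    return num_tiles / (block_bytes * 8)


# 64 tiles, assumed 64-byte blocks: 64 bits of state per 512 bits of data.
print(f"{directory_overhead(64, 64):.1%}")  # 12.5%
```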

31

VHA Example

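[Diagram: a getM A request resolved in six numbered steps across both levels, involving the directory/memory controller; messages shown: getM A (first level), getM A (second level), Fwd, Fwd, data, data]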

32

VHA: Handling Races

Blocking Directories
• Handles races within same protocol
• Requires blocking buffer + wakeup/replay logic

Inter-Intra Races
• Naïve blocking leads to deadlock!

[Diagram: concurrent getM A requests and a FWD A arrive at directories that are already blocked on A]

33

VHA: Handling Races(cont)

Possible Solution:
• Always handle second-level message at first-level
• But this causes explosion of state space

Second-level may interrupt first-level actions:
• First-level indirections, invalidations, writebacks

[Diagram: getM A and FWD A messages arriving at directories blocked on A]

34

VHA: Handling Races(cont)

Reduce the state-space explosion w/ Safe States:
• Subset of transient states
• Immediately handle second-level message
• Limit concurrency between protocols

Algorithm:
• Level-one requests either complete, or enter safe-state before issuing level-two request
• Level-one directories handle level-two forwards when a safe state reached (they may stall)
• Level-two requests eventually handled by Level-two directory
• Completion messages unblock directories

35

Virtual Hierarchies

Two Solutions for Global Coherence: VHA and VHB

memory controller(s)

36

Protocol VHB

Broadcast as Second-level Protocol
• Locate first-level directory tiles
• Memory controller tracks outstanding second-level requestor

Attach token count for each block
• T tokens for each block. One token to read, all to write
• Allows 1-bit at memory per block
• Eliminates system-wide ACK responses
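The token rule (one token to read, all T to write) can be illustrated with a toy model. This sketch is not the VHB protocol itself: the `Holder` class, the `transfer` helper, and the choice T = 64 are assumptions made for illustration, and no messages, races, or data movement are modeled.

```python
from dataclasses import dataclass

T = 64  # assumption: one token per tile in the 64-tile system


@dataclass
class Holder:
    """A cache or memory holding some tokens for one block (hypothetical model)."""
    tokens: int = 0

    def can_read(self) -> bool:
        return self.tokens >= 1      # one token to read

    def can_write(self) -> bool:
        return self.tokens == T      # all tokens to write


def transfer(src: Holder, dst: Holder, count: int) -> None:
    """Move tokens between holders; tokens are never created or destroyed."""
    if not 0 <= count <= src.tokens:
        raise ValueError("cannot send more tokens than held")
    src.tokens -= count
    dst.tokens += count


memory, tile_a, tile_b = Holder(T), Holder(), Holder()
transfer(memory, tile_a, T)   # tile_a fetches the block with all tokens
print(tile_a.can_write())     # True
transfer(tile_a, tile_b, 1)   # tile_b obtains a read copy
print(tile_a.can_write(), tile_a.can_read(), tile_b.can_read())  # False True True
```

Since a writer must hold every token, no reader can exist while a write is in progress, which is why no system-wide acknowledgements are needed in this model.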

37

Protocol VHB: Token Coalescing

Memory logically holds all or none of the tokens:
• Enables 1-bit token count

Replacing tile sends tokens to memory controller:
• Message usually contains all tokens

Process:
• Tokens held in Token Holding Buffer (THB)
• FIND broadcast initiated to locate other first-level directory with tokens
• First-level directories respond to THB, tokens sent
• Repeat for race

38

VHB Example

[Diagram: a getM A request resolved in five numbered steps: (1) getM A, (2) getM A to the memory controller, (3) global getM A, (4) Fwd, (5) Data+tokens]

39

Goals

                              {DRAM, STATIC-BANK, TAG}-DIRECTORY   STATIC-BANK-DIRECTORY w/ hypervisor-managed cache   Virtual Hierarchies: VHA and VHB
Optimize Performance          No                                   Yes                                                 Yes
Isolate Performance           No                                   Yes                                                 Yes
Allow Dynamic Partitioning    Yes                                  ?                                                   Yes
Support Inter-VM Sharing      Yes                                  Yes                                                 Yes
Hypervisor/OS Simplicity      Yes                                  No                                                  Yes

40

VHNULL

Are two levels really necessary?

VHNULL: first level only

Implications:
• Many OS modifications for single-OS environment
• Dynamic Partitioning requires cache flushes
• Inter-VM Sharing difficult
• Hypervisor complexity increases
• Requires atomic updates of VM Config Tables
• Limits optimized placement policies

41

VH: Capacity/Latency Trade-off

Maximize Capacity
• Store only L2 copy at dynamic home tile
• But, L2 access time penalized
  - Especially for large VMs

Minimize L2 access latency/bandwidth:
• Replicate data in local L2 slice
• Selective/Adaptive Replication well-studied
  - ASR [Beckmann et al.], CC [Chang et al.]
• But, dynamic home tile still needed for first-level

Can we exploit virtual hierarchy for placement?

42

VH: Data Placement Optimization Policy

Data from memory placed in tile’s local L2 bank
• Tag not allocated at dynamic home tile

Use second-level coherence on first sharing miss
• Then allocate tag at dynamic home tile for future sharing misses

Benefits:
• Private data allocates in tile’s local L2 bank
• Overhead of replicating data reduced
• Fast, first-level sharing for widely shared data

43

Outline




Conclusion

44

VH Evaluation Methods

Wisconsin GEMS

Target System: 64-core tiled CMP
• In-order SPARC cores
• 1 MB, 16-way L2 cache per tile, 10-cycle access
• 2D mesh interconnect, 16-byte links, 5-cycle link latency
• Eight on-chip memory controllers, 275-cycle DRAM latency

45

VH Evaluation: Simulating Consolidation

Challenge: bring-up of consolidated workloads

Solution: approximate virtualization
• Combine existing Simics checkpoints

[Diagram: a script combines 8p checkpoints (P0-P7, Memory0, PCI0, DISK0) into one 64p checkpoint (P0-P63) with per-VM devices: VM0_Memory0, VM0_PCI0, VM0_DISK0; VM1_Memory0, VM1_PCI0, VM1_DISK0; and so on]

46

VH Evaluation: Simulating Consolidation

At simulation-time, Ruby handles mapping:
• Converts <Processor ID, 32-bit Address> to <36-bit address>
• Schedules VMs to adjacent cores by sending Simics requests to appropriate L1 controllers
• Memory controllers evenly interleaved

Bottom-line:
• Static scheduling
• No hypervisor execution simulated
• No content-based page sharing
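One plausible way to turn a <Processor ID, 32-bit Address> pair into a 36-bit address is to place a VM number above the 32 address bits. The slides do not describe Ruby's actual mapping, so everything in this sketch is an assumption: the eight cores per VM (as in an 8x8p configuration), the concatenation scheme, and the name `to_global_address`.

```python
CORES_PER_VM = 8    # assumption: eight 8-processor VMs on the 64-core CMP
VM_ADDR_BITS = 32   # each VM's checkpoint uses 32-bit addresses


def to_global_address(processor_id: int, address32: int) -> int:
    """Map a per-VM 32-bit address to a global address (hypothetical scheme)."""
    if not 0 <= address32 < (1 << VM_ADDR_BITS):
        raise ValueError("address does not fit in 32 bits")
    vm_id = processor_id // CORES_PER_VM
    return (vm_id << VM_ADDR_BITS) | address32


print(hex(to_global_address(0, 0x1000)))   # 0x1000
print(hex(to_global_address(9, 0x1000)))   # 0x100001000
```

With eight VMs the VM number needs three bits, so every result fits within the 36-bit address the slide mentions.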

47

VH Evaluation: Workloads

OLTP, SpecJBB, Apache, Zeus
• Separate instance of Solaris for each VM

Homogeneous Consolidation
• Simulate same-size workload N times
• Unit of work identical across all workloads
• (each workload staggered by 1,000,000+ instructions)

Heterogeneous Consolidation
• Simulate different-size, different workloads
• Cycles-per-Transaction for each workload

48

VH Evaluation: Baseline Protocols

DRAM-DIRECTORY:
• 1 MB directory cache per controller
• Each tile nominally private, but replication limited

TAG-DIRECTORY:
• 3-cycle central tag directory (1024 ways). Non-pipelined
• Replication limited

STATIC-BANK-DIRECTORY:
• Home tiles interleave by frame address
• Home tile stores only L2 copy

49

VH Evaluation: VHA and VHB Protocols

VHA
• Based on DirectoryCMP implementation
• Dynamic Home Tile stores only L2 copy

VHB with optimizations
• Private data placement optimization policy (shared data stored at home tile, private data is not)
• Can violate inclusiveness (evict L2 tag w/ sharers)
• Memory data returned directly to requestor

50

Micro-benchmark: Sharing Latency

[Chart: average sharing latency (cycles, axis 0-120) vs. processors per VM (axis 0-60) for Dram-Dir, Static-Bank-Dir, Tag-Dir, VH_A, and VH_B]

51

Result: Runtime for 8x8p Homogeneous Consolidation

[Bar chart: runtime (axis 0-1.2) for OLTP, Apache, Zeus, and SpecJBB under DRAM-DIR, STATIC-BANK-DIR, TAG-DIR, VHA, and VHB]

52

Result: Memory Stall Cycles for 8x8p Homogeneous Consolidation

[Stacked bar chart: normalized memory stall cycles (axis 0-1.2), broken down into Off-chip, Local L2, Remote L1, and Remote L2, for OLTP, Apache, Zeus, and SpecJBB under DRAM-DIR, STATIC-BANK-DIR, TAG-DIR, VHA, and VHB]

53


[Bar chart: axis 0-1.2 for OLTP, Apache, Zeus, and SpecJBB under DRAM-DIR, STATIC-BANK-DIR, TAG-DIR, VHA, and VHB]

54


[Bar chart: axis 0-1.2 for OLTP, Apache, Zeus, and SpecJBB under DRAM-DIR, STATIC-BANK-DIR, TAG-DIR, VHA, and VHB]

55

Result: Heterogeneous Consolidation, mixed1 configuration

[Bar chart: cycles-per-transaction (CPT, axis 0-1.2) for VM0: Apache, VM1: Apache, VM2: OLTP, VM3: OLTP, VM4: JBB, VM5: JBB, and VM6: JBB under Dram-Dir, Static-Bank-Dir, Tag-Dir, VH_A, and VH_B]

56

Result: Heterogeneous Consolidation, mixed2 configuration

[Bar chart: cycles-per-transaction (CPT, axis 0-1.2) for VM0-VM3: Apache and VM4-VM7: OLTP under Dram-Dir, Static-Bank-Dir, Tag-Dir, VH_A, and VH_B]

57

Effect of Replication

Treat tile’s L2 bank as private

                   Apache 8x8p   OLTP 8x8p   Zeus 8x8p   JBB 8x8p
DRAM-DIR           19.7%         14.4%       9.29%       0.06%
STATIC-BANK-DIR    -33.0%        3.31%       -7.02%      -11.2%
TAG-DIR            1.27%         3.91%       1.63%       -0.22%
VHA                n/a           n/a         n/a         n/a
VHB                -11.0%        -5.22%      -0.98%      -0.12%

58

Outline




Conclusion

59

Virtual Hierarchies: Related Work

Commercial systems usually support partitioning

Sun (Starfire and others)
• Physical partitioning
• No coherence between partitions

IBM’s LPAR
• Logical partitions, time-slicing of processors
• Global coherence, but doesn’t optimize space-sharing

60


Systems Approaches to Space Affinity:
• Cellular Disco, Managing L2 via OS [Cho et al.]

Shared L2 Cache Partitioning
• Way-based, replacement-based
• Molecular Caches (~ VHNULL)

Cache Organization and Replication
• D-NUCA, NuRapid, Cooperative Caching, ASR

Quality-of-Service
• Virtual Private Caches [Nesbit et al.]
• More

61


Coherence protocol implementations
• Token coherence w/ multicast
• Multicast snooping

Two-level directory
• Compaq Piranha
• Pruning caches [Scott et al.]

62

Summary: Virtual Hierarchies

Contribution: Virtual Hierarchy Idea
• Alternative to physical hard-wired hierarchies
• Optimize for space sharing and workload consolidation

Contribution: VHA and VHB implementations
• Two-level virtual hierarchy implementations

Published in ISCA 2007; IEEE Micro Top Picks 2008

63

Outline


Virtual Hierarchies

Ring-based Coherence [MICRO 2006]
• Skip, 5-minute versions, or 15-minute versions?

Multiple-CMP Coherence [HPCA 2005]
• Skip, 5-minute versions, or 15-minute versions?

Conclusion

64

Contribution: Ring-based Coherence

Problem: Order of Bus != Order of Ring
• Cannot apply bus-based snooping protocols

Existing Solutions
• Use unbounded retries to handle contention
• Use a performance-costly ordering point

Contribution: RING-ORDER
• Exploits round-robin order of ring
• Fast and stable performance

Appears in MICRO 2006

65

Contribution: Multiple-CMP Coherence

Hierarchy now the default, increases complexity
• Most prior hierarchical protocols use bus-based nodes

Contribution: DirectoryCMP
• Two-level directory protocol

Contribution: TokenCMP
• Extend token coherence to Multiple-CMPs
• Flat for correctness, hierarchical for performance

Appears in HPCA 2005

66

Other Research and Contributions

Wisconsin GEMS
• ISCA ’05 tutorial, CMP development, release, support

Amdahl’s Law in the Multicore Era
• Mark D. Hill and Michael R. Marty, to appear in IEEE Computer

ASR: Adaptive Selective Replication for CMP Caches
• Beckmann et al., MICRO 2006

LogTM-SE: Decoupling Hardware Transactional Memory from Caches
• Yen et al., HPCA 2007

67

Key Contributions

Trend: Multicore ring interconnects emerging

Challenge: Order of ring != order of bus

Contribution: New protocol exploits ring order

Trend: Multicore now the basic building block

Challenge: Hierarchical coherence for Multiple-CMP is complex

Contribution: DirectoryCMP and TokenCMP

Trend: Workload consolidation w/ space sharing

Challenge: Physical hierarchies often do not match workloads

Contribution: Virtual Hierarchies

68

Backup Slides

69

What about Physical Hierarchy / Clusters?

[Diagram: clustered chip with sixteen processors (P), each with an L1 $, and four Shared L2 caches]

70

Physical Hierarchy / Clusters

[Diagram: clustered chip with sixteen processors (P), each with an L1 $, and four Shared L2 caches, running "www server" and "database server #1"]


Interference between workloads in shared caches

Lots of prior work on partitioning single Shared L2 Cache

71

Protocol VHNULL

Example: Steps for VM Migration from Tiles {M} to {N}

1. Stop all threads on {M}

2. Flush {M} caches

3. Update {N} VM Config Tables

4. Start threads on {N}
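The four steps can be walked through on a toy model. This is a sketch under stated assumptions, not hypervisor code: the `Tile` class, the `migrate_vm` function, the dictionary standing in for memory, and the round-robin table fill are all hypothetical.

```python
class Tile:
    """Toy tile: a run flag, a cache, and a VM Config Table (hypothetical)."""

    def __init__(self, name: str):
        self.name = name
        self.running = False
        self.cache = {}              # block address -> data
        self.vm_config_table = []


def migrate_vm(old_tiles, new_tiles, memory, table_size=64):
    # 1. Stop all threads on {M}
    for tile in old_tiles:
        tile.running = False
    # 2. Flush {M} caches (here: write every cached block back to memory)
    for tile in old_tiles:
        memory.update(tile.cache)
        tile.cache.clear()
    # 3. Update {N} VM Config Tables so home tiles fall within {N}
    names = [tile.name for tile in new_tiles]
    for tile in new_tiles:
        tile.vm_config_table = [names[i % len(names)] for i in range(table_size)]
    # 4. Start threads on {N}
    for tile in new_tiles:
        tile.running = True


old = [Tile("p0"), Tile("p1")]
new = [Tile("p4"), Tile("p5")]
old[0].running = True
old[0].cache[0x40] = "dirty data"
memory = {}
migrate_vm(old, new, memory)
print(memory, [t.running for t in new])
```

The flush in step 2 is the cost the slide points at: with only one level of coherence, nothing else can find blocks left behind in the old tiles.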

72

Protocol VHNULL

Example: Inter-VM Content-based Page Sharing
• Is read-only sharing possible with VHNULL?

VMware’s Implementation:
• Global hash table to store hashes of pages
• Guest pages scanned by VMM, hashes computed
• Full comparison of pages on hash match

Potential VHNULL Implementation:
• How does hypervisor scan guest pages? Are they modified in cache?
• Even read-only pages must initially be written at some point
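The hash-then-compare scheme listed for VMware’s implementation can be sketched as follows. This is an illustration of the general idea only: the function name `share_pages`, the page identifiers, and the use of SHA-256 are assumptions, and the sketch operates on in-memory byte strings rather than real guest pages.

```python
import hashlib


def share_pages(guest_pages: dict[str, bytes]) -> dict[str, str]:
    """Map each guest page id to the id of the canonical page with equal content."""
    hash_table: dict[bytes, list[str]] = {}   # global table: hash -> canonical page ids
    mapping: dict[str, str] = {}
    for page_id, content in guest_pages.items():
        digest = hashlib.sha256(content).digest()
        for candidate in hash_table.setdefault(digest, []):
            # Full comparison on hash match guards against hash collisions.
            if guest_pages[candidate] == content:
                mapping[page_id] = candidate
                break
        else:
            hash_table[digest].append(page_id)   # first page with this content
            mapping[page_id] = page_id
    return mapping


pages = {
    "vm0:page0": b"A" * 4096,
    "vm1:page7": b"A" * 4096,   # identical content in another VM
    "vm1:page8": b"B" * 4096,
}
print(share_pages(pages))
```

The scan reads the current contents of every guest page, which is exactly what the VHNULL questions above are about: with a single level of coherence, the hypervisor may not see data that is still modified in another VM's caches.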

73

5-minute Ring Coherence

74

Ring Interconnect

• Why?

Short, fast point-to-point links

Fewer (data) ports

Less complex than packet-switched

Simple, distributed arbitration

Exploitable ordering for coherence

75

Cache Coherence for a Ring

• Ring is broadcast and offers ordering

• Apply existing bus-based snooping protocols?

• NO!

• Order properties of ring are different

76

Ring Order != Bus Order

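[Diagram: ring with nodes P12, P3, P6, and P9; messages A and B travel the ring concurrently, so one node observes the order {A, B} while another observes {B, A}]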
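The {A, B} versus {B, A} observation can be reproduced with a small model of a unidirectional ring. The sketch assumes one hop per time step, both messages injected at the same time, and a particular choice of senders (A from P9, B from P3); none of these details are given on the slide, and `arrival_order` is a hypothetical helper.

```python
RING = ["P12", "P3", "P6", "P9"]   # unidirectional ring, in travel order


def arrival_order(observer: str, injections: dict[str, str]) -> list[str]:
    """Order in which `observer` sees messages injected at the same time.

    `injections` maps each message name to the node that sends it.
    """
    n = len(RING)
    position = RING.index(observer)
    hops = {msg: (position - RING.index(src)) % n for msg, src in injections.items()}
    return sorted(hops, key=hops.get)


injections = {"A": "P9", "B": "P3"}
print(arrival_order("P12", injections))  # ['A', 'B']
print(arrival_order("P6", injections))   # ['B', 'A']
```

On a bus every node would see the same order; on the ring the order depends on each node's distance from the senders, which is why bus-based snooping protocols cannot be applied directly.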

77


Existing Solutions:

1. ORDERING-POINT
• Establishes total order
• Extra latency and control message overhead

2. GREEDY-ORDER
• Fast in common case
• Unbounded retries

Ideal Solution
• Fast for average case
• Stable for worst-case (no retries)

78

New Approach: RING-ORDER

+ Requests complete in order of ring position
• Fully exploits ring ordering

+ Initial requests always succeed
• No retries, no ordering point
• Fast, stable, predictable performance

Key: Use token counting
• All tokens to write, one token to read

79

RING-ORDER Example

[Diagram: twelve-node ring (P1-P12) with tokens and a priority token held at nodes (legend: token, priority token); a Store is pending and P9 issues getM]

80

RING-ORDER Example

[Diagram: twelve-node ring (P1-P12) with tokens and a priority token (legend: token, priority token); P9's getM for its Store is in flight; FurthestDest = P9]

81

RING-ORDER Example

[Diagram: twelve-node ring (P1-P12); Stores are pending at two nodes; P6 issues getM; FurthestDest = P9]

82

RING-ORDER Example

[Diagram: twelve-node ring (P1-P12); both Stores complete ("Store Complete" shown at two nodes); FurthestDest = P9]

83

Ring-based Coherence: Results Summary

System: 8-core with private L2s and shared L3

Key Results:

• RING-ORDER outperforms ORDERING-POINT by 7-86% with in-order cores

• RING-ORDER offers similar, or slightly better, performance than GREEDY-ORDER

• Pathological starvation did occur with GREEDY-ORDER

84

5-minute Multiple-CMP Coherence

85

Problem: Hierarchical Coherence

[Diagram: CMP 1 through CMP 4 joined by an interconnect; Intra-CMP Coherence within each chip, Inter-CMP Coherence between chips]

Intra-CMP protocol for coherence within CMP

Inter-CMP protocol for coherence between CMPs

Interactions between protocols increase complexity
• Explodes state space, especially without bus

86

Hierarchical Coherence Example: Sun Wildfire

• First-level bus-based snooping protocol
• Second-level directory protocol
• Interface is key:
  - Accesses directory state
  - Asserts “bus ignore” signal if necessary
  - Replays bus request when second-level completes

[Diagram: CPUs with caches ($) on a bus, together with memory and the interface]

87

Solution #1: DirectoryCMP

Two-level directory protocol for Multiple-CMPs
• Arbitrary interconnect ordering for on- and off-chip
• Non-nacking. Safe States to help resolve races
• Design of DirectoryCMP led to VHA

Advantages:
• Powerful, scalable, solid baseline

Disadvantages:
• Complex (~63 states at interface), not model-checked?
• Second-level indirections slow without directory cache

88

Improving Multiple CMP Systems with Token Coherence

• Token Coherence allows Multiple-CMP systems to be...
  - Flat for correctness, but
  - Hierarchical for performance

[Diagram: Correctness Substrate (Low Complexity) and Performance Protocol (Fast) shown over CMP 1 through CMP 4 and the interconnect]

89

Solution #2: TokenCMP

Extend token coherence to Multiple-CMPs
• Flat for correctness, hierarchical for performance
• Enables model-checkable solution

Flat Correctness:
• Global set of T tokens, pass to individual caches
• End-to-end token counting
• Keep flat persistent request scheme

90

TokenCMP Performance Policies

TokenCMPA:
• Two-level broadcast
• L2 broadcasts off-chip on miss
• Local cache responds if it has extra tokens
• Responses from off-chip carry extra tokens

TokenCMPB:
• On-chip broadcast on L2 miss only (local indirection)

TokenCMPC: Extra states for further filtering

TokenCMPA-PRED: persistent request prediction

91

M-CMP Coherence: Summary of Results

System: Four, 4-core CMP

Notable Results:
• TokenCMP 2-32% faster than DirectoryCMP w/ in-order cores
• TokenCMPA, TokenCMPB, TokenCMPC all perform similarly
• Persistent request prediction greatly helps Zeus
• TokenCMP gains diminished with out-of-order cores
