Cache Coherence Techniques for Multicore Processors
Dissertation Defense
Mike Marty
12/19/2007
2
Key Contributions
Trend: Multicore ring interconnects emerging
Challenge: Order of ring != order of bus
Contribution: New protocol exploits ring order
Trend: Multicore now the basic building block
Challenge: Hierarchical coherence for Multiple-CMP is complex
Contribution: DirectoryCMP and TokenCMP
Trend: Workload consolidation w/ space sharing
Challenge: Physical hierarchies often do not match workloads
Contribution: Virtual Hierarchies
3
Outline
Introduction and Motivation
• Multicore Trends
Virtual Hierarchies
• Focus of presentation
Multiple-CMP Coherence
Ring-based Coherence
Conclusion
4
Is SMP + On-chip Integration == Multicore?
[Diagram: four cores (P0-P3) with private caches connected by a bus to a memory controller, labeled "Multicore"]
5
Multicore Trends
[Diagram: the same four-core multicore chip with bus and memory controller]
Trend: On-chip Interconnect
• Competes for same resources as cores, caches
• Ring an emerging multicore interconnect
6
Multicore Trends
[Diagram: multicore chip in which pairs of cores share an L2 cache]
Trend: latency/bandwidth tradeoffs
• Increasing on-chip wire delay, memory latency
• Coherence protocol interacts with shared-cache hierarchy
7
Multicore Trends
Trend: Multicore is the basic building block
• Multiple-CMP systems instead of SMPs
• Hierarchical systems required
[Diagram: four multicore chips, each with its own bus and memory controller, forming a Multiple-CMP system]
8
Multicore Trends
Trend: Workload Consolidation w/ Space Sharing
• More cores, more workload consolidation
• Space sharing instead of time sharing
• Opportunities to optimize caching, coherence
[Diagram: multicore chip space-shared among VM 1, VM 2, and VM 3]
9
Outline
Introduction and Motivation
Virtual Hierarchies
• Focus of presentation
Multiple-CMP Coherence
Ring-based Coherence
Conclusion
[ISCA 2007, IEEE Micro Top Pick 2008]
10
Virtual Hierarchy Motivations
Space-sharing
Server (workload) consolidation
Tiled architectures
[Diagram: tiled CMP space-shared among APP 1, APP 2, APP 3, and APP 4]
11
Motivation: Server Consolidation
[Diagram: 64-core tiled CMP, each tile with a core, L1, and L2 cache bank, hosting a www server, database servers #1 and #2, and two middleware servers]
12
Motivation: Server Consolidation
[Diagram: the five consolidated servers mapped onto groups of tiles of the 64-core CMP]
13
Motivation: Server Consolidation
Optimize Performance
[Diagram: each server's data cached within its own group of tiles on the 64-core CMP]
14
Motivation: Server Consolidation
Isolate Performance
[Diagram: the consolidated servers isolated on separate groups of tiles of the 64-core CMP]
15
Motivation: Server Consolidation
Dynamic Partitioning
[Diagram: the tile groups assigned to each server being repartitioned at runtime]
16
Motivation: Server Consolidation
Inter-VM Sharing
[Diagram: data shared between two VMs on the 64-core CMP]
VMware's Content-based Page Sharing: up to 60% reduced memory
17
Outline
Introduction and Motivation
Virtual Hierarchies
• Expanded Motivation
• Non-hierarchical approaches
• Proposed Virtual Hierarchies
• Evaluation
• Related Work
Ring-based and Multiple-CMP Coherence
Conclusion
18
Tiled Architecture Memory System
[Diagram: tiled architecture; each tile has a core, L1, and an L2 cache bank, with a memory controller at the chip edge]
Global broadcast too expensive
19
TAG-DIRECTORY
[Diagram: duplicate tag directory example: (1) a tile issues getM A to the duplicate tag directory, (2) the directory forwards the request to the tile reading A, (3) data is sent to the requester]
20
STATIC-BANK-DIRECTORY
[Diagram: (1) a tile issues getM A to A's statically assigned home bank, (2) the home bank forwards the request to the tile reading A, (3) data is sent to the requester]
21
STATIC-BANK-DIRECTORY with hypervisor-managed cache
[Diagram: the same getM A / forward / data sequence, but the hypervisor maps A's home bank to a tile within the requesting workload]
22
Goals
                              {STATIC-BANK, TAG}-DIRECTORY   STATIC-BANK-DIRECTORY w/ hypervisor-managed cache
Optimize Performance          No                             Yes
Isolate Performance           No                             Yes
Allow Dynamic Partitioning    Yes                            ?
Support Inter-VM Sharing      Yes                            Yes
Hypervisor/OS Simplicity      Yes                            No
23
Outline
Introduction and Motivation
Virtual Hierarchies
• Expanded Motivation
• Non-hierarchical approaches
• Proposed Virtual Hierarchies
• Evaluation
• Related Work
Ring-based and Multiple-CMP Coherence
Conclusion
24
Virtual Hierarchies
Key Idea: Overlay 2-level Cache & Coherence Hierarchy
- First level harmonizes with VM/Workload
- Second level allows inter-VM sharing, migration, reconfig
25
VH: First-Level Protocol
Goals:
• Exploit locality from space affinity
• Isolate resources
Strategy: Directory protocol
• Interleave directories across first-level tiles
• Store L2 block at first-level directory tile
Questions:
• How to name directories?
• How to name sharers?
[Diagram: first-level getM and INV messages within a VM]
26
VH: Naming First-level Directory
Select Dynamic Home Tile with VM Config Table
• Hardware VM Config Table at each tile (per-tile, dynamic)
• Set by hypervisor during scheduling
Example:
[Diagram: 6 address bits above the block offset, e.g., ...000101, index a 64-entry VM Config Table whose entries name the VM's tiles (p12, p13, p14, ...); here entry 5 selects dynamic home tile p14]
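A minimal C sketch of this lookup (block size, field widths, and names are illustrative assumptions, not the dissertation's actual hardware):

    #include <stdint.h>

    #define TABLE_ENTRIES     64   /* one entry per tile of the 64-tile CMP */
    #define BLOCK_OFFSET_BITS  6   /* assumes 64-byte cache blocks */

    /* Per-tile VM Config Table, written by the hypervisor when it schedules
     * the VM. Each entry names a tile that can serve as the dynamic home
     * (first-level directory) tile for this VM. */
    typedef struct {
        uint8_t tile[TABLE_ENTRIES];
    } vm_config_table_t;

    /* Select the dynamic home tile: the 6 address bits above the block offset
     * index the table, so consecutive blocks interleave across the VM's tiles. */
    static uint8_t dynamic_home_tile(const vm_config_table_t *t, uint64_t paddr)
    {
        unsigned idx = (unsigned)((paddr >> BLOCK_OFFSET_BITS) & (TABLE_ENTRIES - 1));
        return t->tile[idx];
    }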
27
VH: Dynamic Home Tile Actions
Dynamic Home Tile either:
• Returns data cached at its L2 bank
• Generates forwards/invalidates
• Issues a second-level request
Stable First-level States (a subset):
• Typical: M, E, S, I
• Atypical:
ILX: L2 Invalid, points to exclusive tile
SLS: L2 Shared, other tiles share
SLSX: L2 Shared, other tiles share, exclusive to first level
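A small sketch of how these first-level states might be encoded; the state names follow the slide, while the encoding itself is illustrative:

    /* First-level (intra-VM) directory states at the dynamic home tile. */
    typedef enum {
        FL_M,     /* Modified in this L2 bank */
        FL_E,     /* Exclusive in this L2 bank */
        FL_S,     /* Shared in this L2 bank */
        FL_I,     /* Invalid */
        FL_ILX,   /* ILX: L2 invalid, points to the exclusive tile */
        FL_SLS,   /* SLS: L2 shared, other tiles also share */
        FL_SLSX   /* SLSX: L2 shared, other tiles share, exclusive to this first level */
    } first_level_state_t;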
28
VH: Naming First-level Sharers
Any tile can share the block
Solution: full bit-vector
• 64 bits for a 64-tile system
• Names multiple sharers or single exclusive
Alternatives:
• First-level broadcast
• (Dynamic) coarse granularity
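A minimal C sketch of the full-bit-vector encoding (one bit per tile of a 64-tile system); the helper names are mine:

    #include <stdint.h>

    /* Per-block sharer state at a first-level directory tile: bit i set means
     * tile i holds a copy. A single set bit can also name the exclusive owner. */
    typedef uint64_t sharer_vector_t;

    static void add_sharer(sharer_vector_t *v, unsigned tile)  { *v |= 1ULL << tile; }
    static void drop_sharer(sharer_vector_t *v, unsigned tile) { *v &= ~(1ULL << tile); }
    static int  is_sharer(sharer_vector_t v, unsigned tile)    { return (v >> tile) & 1; }

    /* Tiles to invalidate on a getM: every sharer except the requester. */
    static sharer_vector_t invalidation_targets(sharer_vector_t v, unsigned requester)
    {
        return v & ~(1ULL << requester);
    }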
29
Virtual Hierarchies
Two Solutions for Global Coherence: VHA and VHB
[Diagram: memory controller(s)]
30
Protocol VHA
Directory as Second-level Protocol
• Any tile can act as a first-level directory
• How to track and name first-level directories?
Full bit-vector of sharers to name any tile
• State stored in DRAM
• Possibly cached on-chip
+ Maximum scalability, message efficiency
- DRAM state (~12.5% overhead, i.e., a 64-bit sharer vector per 64-byte block)
31
VHA Example
[Diagram: VHA example (steps 1-6): getM A goes to the dynamic home tile, escalates as a second-level getM A to the directory/memory controller, is forwarded to the owning tiles, and data returns to the requester]
32
VHA: Handling Races
Blocking Directories
• Handles races within the same protocol
• Requires a blocking buffer + wakeup/replay logic
Inter-Intra Races
• Naïve blocking leads to deadlock!
[Diagram: two first-level directories each blocked on A while second-level getM/FWD messages for A wait behind them, so naive blocking deadlocks]
33
VHA: Handling Races (cont.)
Possible Solution:
• Always handle second-level messages at the first level
• But this causes an explosion of the state space
Second-level may interrupt first-level actions:
• First-level indirections, invalidations, writebacks
[Diagram: the same blocked-directory race as on the previous slide]
34
VHA: Handling Races (cont.)
Reduce the state-space explosion w/ Safe States:
• Subset of transient states
• Immediately handle second-level messages
• Limit concurrency between protocols
Algorithm (sketched below):
• Level-one requests either complete, or enter a safe state before issuing a level-two request
• Level-one directories handle level-two forwards once a safe state is reached (they may stall)
• Level-two requests are eventually handled by the level-two directory
• Completion messages unblock directories
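A highly simplified C sketch of the safe-state rule; the state classes and names here are hypothetical stand-ins, not the protocol's actual state tables:

    #include <stdbool.h>

    /* Illustrative classes of first-level directory states. */
    typedef enum {
        STATE_STABLE,            /* e.g., M, E, S, I */
        STATE_SAFE_TRANSIENT,    /* transient, but safe: may accept level-two messages */
        STATE_UNSAFE_TRANSIENT   /* transient and not yet safe */
    } l1_state_class_t;

    /* Returns true if a level-two forward can be processed now; false means
     * the message stalls (it is never NACKed) until the directory reaches a
     * safe state. Because level-one requests complete, or enter a safe state,
     * before issuing a level-two request, this stall cannot deadlock. */
    static bool can_accept_l2_forward(l1_state_class_t s)
    {
        return s == STATE_STABLE || s == STATE_SAFE_TRANSIENT;
    }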
35
Virtual Hierarchies
Two Solutions for Global Coherence: VHA and VHB
[Diagram: memory controller(s)]
36
Protocol VHB
Broadcast as Second-level Protocol
• Locate first-level directory tiles
• Memory controller tracks the outstanding second-level requestor
Attach a token count to each block
• T tokens per block; one token to read, all to write
• Allows 1 bit at memory per block
• Eliminates system-wide ACK responses
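A minimal C sketch of the token-counting rules and the resulting one-bit state at memory (T and the field names are illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    #define T_TOKENS 64U   /* T tokens per block, e.g., one per tile */

    typedef struct {
        uint32_t tokens;   /* tokens this cache holds for the block: 0..T_TOKENS */
        bool     has_data;
    } block_tokens_t;

    /* One token (plus data) suffices to read; all T tokens are needed to write. */
    static bool can_read(const block_tokens_t *b)  { return b->has_data && b->tokens >= 1; }
    static bool can_write(const block_tokens_t *b) { return b->has_data && b->tokens == T_TOKENS; }

    /* Because memory logically holds either all tokens or none, its per-block
     * token count collapses to a single bit, and no system-wide ACK collection
     * is needed. */
    typedef struct { bool memory_holds_all_tokens; } memory_token_state_t;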
37
Protocol VHB: Token Coalescing
Memory logically holds all or none of the tokens:
• Enables a 1-bit token count
Replacing tile sends tokens to the memory controller:
• Message usually contains all tokens
Process:
• Tokens held in a Token Holding Buffer (THB)
• FIND broadcast initiated to locate other first-level directories with tokens
• First-level directories respond to the THB, tokens sent
• Repeat on a race
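A hedged sketch of how the Token Holding Buffer coalescing might proceed at the memory controller; the names and the FIND hook are hypothetical, and races are only noted in comments:

    #include <stdbool.h>
    #include <stdint.h>

    #define T_TOKENS 64U

    /* One Token Holding Buffer (THB) entry: tokens returned by a replacing tile
     * wait here until the memory controller has collected the full set. */
    typedef struct {
        uint64_t block_addr;
        uint32_t tokens_held;
        bool     find_in_flight;   /* FIND broadcast outstanding? */
    } thb_entry_t;

    /* Called when a replacement or a FIND response delivers tokens. Returns
     * true once all T tokens have been coalesced, at which point the 1-bit
     * token count at memory can be set and the entry freed. */
    static bool thb_receive_tokens(thb_entry_t *e, uint32_t tokens)
    {
        e->tokens_held += tokens;
        if (e->tokens_held == T_TOKENS) {
            e->find_in_flight = false;
            return true;
        }
        if (!e->find_in_flight) {
            /* Some tokens remain at other first-level directories: broadcast a
             * FIND to locate them (hypothetical hook); repeat if a race moves
             * the tokens again before they arrive. */
            e->find_in_flight = true;
            /* broadcast_find(e->block_addr); */
        }
        return false;
    }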
38
VHB Example
[Diagram: VHB example: (1) getM A to the dynamic home tile, (2) second-level getM A to the memory controller, (3) global getM A broadcast, (4) a first-level directory forwards, (5) data + tokens returned to the requester]
39
Goals
                              {DRAM, STATIC-BANK, TAG}-DIRECTORY   STATIC-BANK-DIRECTORY w/ hypervisor-managed cache   Virtual Hierarchies: VHA and VHB
Optimize Performance          No                                   Yes                                                  Yes
Isolate Performance           No                                   Yes                                                  Yes
Allow Dynamic Partitioning    Yes                                  ?                                                    Yes
Support Inter-VM Sharing      Yes                                  Yes                                                  Yes
Hypervisor/OS Simplicity      Yes                                  No                                                   Yes
40
VHNULL
Are two levels really necessary?
VHNULL: first level only
Implications:
• Many OS modifications for a single-OS environment
• Dynamic Partitioning requires cache flushes
• Inter-VM Sharing difficult
• Hypervisor complexity increases
• Requires atomic updates of VM Config Tables
• Limits optimized placement policies
41
VH: Capacity/Latency Trade-off
Maximize Capacity
• Store the only L2 copy at the dynamic home tile
• But L2 access time is penalized, especially for large VMs
Minimize L2 access latency/bandwidth:
• Replicate data in the local L2 slice
• Selective/Adaptive Replication is well-studied: ASR [Beckmann et al.], CC [Chang et al.]
• But the dynamic home tile is still needed for first-level coherence
Can we exploit virtual hierarchy for placement?
42
VH: Data Placement Optimization Policy
Data from memory placed in the tile's local L2 bank
• Tag not allocated at the dynamic home tile
Use second-level coherence on the first sharing miss
• Then allocate a tag at the dynamic home tile for future sharing misses
Benefits:
• Private data allocates in the tile's local L2 bank
• Overhead of replicating data reduced
• Fast, first-level sharing for widely shared data
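A rough C sketch of the placement decision, with hypothetical flag and function names (the real protocol folds this into its directory states):

    #include <stdbool.h>

    typedef enum { PLACE_LOCAL_L2_BANK, PLACE_DYNAMIC_HOME_TILE } placement_t;

    typedef struct {
        bool tag_at_home_tile;   /* tag allocated at the dynamic home tile? */
    } block_policy_state_t;

    /* Fill from memory: treat the data as private, place it in the requesting
     * tile's local L2 bank, and allocate no tag at the dynamic home tile. */
    static placement_t on_memory_fill(block_policy_state_t *s)
    {
        s->tag_at_home_tile = false;
        return PLACE_LOCAL_L2_BANK;
    }

    /* First sharing miss: second-level coherence locates the copy; a tag is then
     * allocated at the dynamic home tile so later sharing misses stay first-level. */
    static placement_t on_first_sharing_miss(block_policy_state_t *s)
    {
        s->tag_at_home_tile = true;
        return PLACE_DYNAMIC_HOME_TILE;
    }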
43
Outline
Introduction and Motivation
Virtual Hierarchies
• Expanded Motivation
• Non-hierarchical approaches
• Proposed Virtual Hierarchies
• Evaluation
• Related Work
Ring-based and Multiple-CMP Coherence
Conclusion
44
VH Evaluation Methods
Wisconsin GEMS
Target System: 64-core tiled CMP
• In-order SPARC cores
• 1 MB, 16-way L2 cache per tile, 10-cycle access
• 2D mesh interconnect, 16-byte links, 5-cycle link latency
• Eight on-chip memory controllers, 275-cycle DRAM latency
45
VH Evaluation: Simulating Consolidation
Challenge: bring-up of consolidated workloads
Solution: approximate virtualization
• Combine existing Simics checkpoints
[Diagram: a script combines multiple 8p checkpoints (Memory0, P0-P7, PCI0, DISK0) into one 64p checkpoint (P0-P63) with per-VM state: VM0_Memory0, VM0_PCI0, VM0_DISK0; VM1_Memory0, VM1_PCI0, VM1_DISK0; ...]
46
VH Evaluation: Simulating Consolidation
At simulation time, Ruby handles the mapping:
• Converts <Processor ID, 32-bit Address> to a <36-bit address>
• Schedules VMs to adjacent cores by sending Simics requests to the appropriate L1 controllers
• Memory controllers evenly interleaved
Bottom line:
• Static scheduling
• No hypervisor execution simulated
• No content-based page sharing
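A plausible C sketch of that conversion, assuming the VM number is derived from the processor ID and placed in the high address bits; the actual Ruby mapping may differ:

    #include <stdint.h>

    #define CORES_PER_VM    8U    /* e.g., the 8x8p configuration */
    #define GUEST_ADDR_BITS 32U

    /* Convert <processor ID, 32-bit guest physical address> into one 36-bit
     * simulated physical address by tagging it with the VM number, so each
     * combined checkpoint occupies a disjoint region of simulated memory. */
    static uint64_t to_simulated_paddr(unsigned processor_id, uint32_t guest_paddr)
    {
        uint64_t vm = processor_id / CORES_PER_VM;
        return (vm << GUEST_ADDR_BITS) | (uint64_t)guest_paddr;  /* 36 bits covers up to 16 VMs */
    }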
47
VH Evaluation: Workloads
OLTP, SpecJBB, Apache, Zeus
• Separate instance of Solaris for each VM
Homogeneous Consolidation
• Simulate the same-size workload N times
• Unit of work identical across all workloads
• (each workload staggered by 1,000,000+ instructions)
Heterogeneous Consolidation
• Simulate different-size, different workloads
• Cycles-per-Transaction reported for each workload
48
VH Evaluation: Baseline Protocols
DRAM-DIRECTORY:
• 1 MB directory cache per controller
• Each tile nominally private, but replication limited
TAG-DIRECTORY:
• 3-cycle central tag directory (1024 ways), non-pipelined
• Replication limited
STATIC-BANK-DIRECTORY:
• Home tiles interleave by frame address
• Home tile stores the only L2 copy
49
VH Evaluation: VHA and VHB Protocols
VHA
• Based on the DirectoryCMP implementation
• Dynamic Home Tile stores the only L2 copy
VHB with optimizations
• Private data placement optimization policy (shared data stored at the home tile, private data is not)
• Can violate inclusiveness (evict L2 tag w/ sharers)
• Memory data returned directly to the requestor
50
Micro-benchmark: Sharing Latency
[Plot: average sharing latency (cycles) vs. processors per VM for DRAM-DIR, STATIC-BANK-DIR, TAG-DIR, VHA, and VHB]
51
Result: Runtime for 8x8p Homogeneous Consolidation
[Chart: runtime, normalized, for OLTP, Apache, Zeus, and SpecJBB under DRAM-DIR, TAG-DIR, STATIC-BANK-DIR, VHA, and VHB]
52
Result: Memory Stall Cycles for 8x8p Homogeneous Consolidation
[Chart: normalized memory stall cycles for OLTP, Apache, Zeus, and SpecJBB under DRAM-DIR, TAG-DIR, STATIC-BANK-DIR, VHA, and VHB, broken down into Off-chip, Local L2, Remote L1, and Remote L2]
53
Result: Runtime for 16x4p Homogeneous Consolidation
[Chart: runtime, normalized, for OLTP, Apache, Zeus, and SpecJBB under DRAM-DIR, TAG-DIR, STATIC-BANK-DIR, VHA, and VHB]
54
Result: Runtime for 4x16p Homogeneous Consolidation
[Chart: runtime, normalized, for OLTP, Apache, Zeus, and SpecJBB under DRAM-DIR, TAG-DIR, STATIC-BANK-DIR, VHA, and VHB]
55
Result: Heterogeneous Consolidation, mixed1 configuration
[Chart: cycles-per-transaction (CPT) for VM0-VM1: Apache, VM2-VM3: OLTP, VM4-VM6: JBB, under Dram-Dir, Static-Bank-Dir, Tag-Dir, VHA, and VHB]
56
Result: Heterogeneous Consolidation, mixed2 configuration
[Chart: cycles-per-transaction (CPT) for VM0-VM3: Apache and VM4-VM7: OLTP, under Dram-Dir, Static-Bank-Dir, Tag-Dir, VHA, and VHB]
57
Effect of Replication
                    Apache 8x8p   OLTP 8x8p   Zeus 8x8p   JBB 8x8p
DRAM-DIR            19.7%         14.4%       9.29%       0.06%
STATIC-BANK-DIR     -33.0%        3.31%       -7.02%      -11.2%
TAG-DIR             1.27%         3.91%       1.63%       -0.22%
VHA                 n/a           n/a         n/a         n/a
VHB                 -11.0%        -5.22%      -0.98%      -0.12%
(Replication here means treating each tile's L2 bank as private.)
58
Outline
Introduction and Motivation
Virtual Hierarchies
• Expanded Motivation
• Non-hierarchical approaches
• Proposed Virtual Hierarchies
• Evaluation
• Related Work
Ring-based and Multiple-CMP Coherence
Conclusion
59
Virtual Hierarchies: Related Work
Commercial systems usually support partitioning
Sun (Starfire and others)
• Physical partitioning
• No coherence between partitions
IBM's LPAR
• Logical partitions, time-slicing of processors
• Global coherence, but doesn't optimize space-sharing
60
Virtual Hierarchies: Related Work
Systems Approaches to Space Affinity:
• Cellular Disco, managing L2 via the OS [Cho et al.]
Shared L2 Cache Partitioning:
• Way-based, replacement-based
• Molecular Caches (~ VHNULL)
Cache Organization and Replication:
• D-NUCA, NuRapid, Cooperative Caching, ASR
Quality-of-Service:
• Virtual Private Caches [Nesbit et al.]
• And more
61
Virtual Hierarchies: Related Work
Coherence protocol implementations
• Token coherence w/ multicast
• Multicast snooping
Two-level directory
• Compaq Piranha
• Pruning caches [Scott et al.]
62
Summary: Virtual Hierarchies
Contribution: Virtual Hierarchy Idea
• Alternative to physical, hard-wired hierarchies
• Optimized for space sharing and workload consolidation
Contribution: VHA and VHB implementations
• Two-level virtual hierarchy implementations
Published in ISCA 2007; IEEE Micro Top Pick 2008
63
Outline
Introduction and Motivation
Virtual Hierarchies
Ring-based Coherence
• Skip, 5-minute version, or 15-minute version?
Multiple-CMP Coherence
• Skip, 5-minute version, or 15-minute version?
Conclusion
[MICRO 2006]
[HPCA 2005]
64
Contribution: Ring-based Coherence
Problem: Order of Bus != Order of Ring
• Cannot apply bus-based snooping protocols
Existing Solutions
• Use unbounded retries to handle contention
• Use a performance-costly ordering point
Contribution: RING-ORDER
• Exploits the round-robin order of the ring
• Fast and stable performance
Appears in MICRO 2006
65
Contribution: Multiple-CMP Coherence
Hierarchy now the default, increases complexity
• Most prior hierarchical protocols use bus-based nodes
Contribution: DirectoryCMP
• Two-level directory protocol
Contribution: TokenCMP
• Extends token coherence to Multiple-CMPs
• Flat for correctness, hierarchical for performance
Appears in HPCA 2005
66
Other Research and Contributions
Wisconsin GEMS
• ISCA '05 tutorial, CMP development, release, support
Amdahl's Law in the Multicore Era
• Mark D. Hill and Michael R. Marty, to appear in IEEE Computer
ASR: Adaptive Selective Replication for CMP Caches
• Beckmann et al., MICRO 2006
LogTM-SE: Decoupling Hardware Transactional Memory from Caches
• Yen et al., HPCA 2007
67
Key Contributions
Trend: Multicore ring interconnects emerging
Challenge: Order of ring != order of bus
Contribution: New protocol exploits ring order
Trend: Multicore now the basic building block
Challenge: Hierarchical coherence for Multiple-CMP is complex
Contribution: DirectoryCMP and TokenCMP
Trend: Workload consolidation w/ space sharing
Challenge: Physical hierarchies often do not match workloads
Contribution: Virtual Hierarchies
68
Backup Slides
69
What about Physical Hierarchy / Clusters?
[Diagram: clustered CMP in which groups of cores with private L1 caches share an L2 cache]
70
Physical Hierarchy / Clusters
[Diagram: the same clustered CMP with the www server, database server #1, and middleware server #1 mapped onto it]
Interference between workloads in shared caches
Lots of prior work on partitioning a single shared L2 cache
71
Protocol VHNULL
Example: Steps for VM Migration from Tiles {M} to {N}
1. Stop all threads on {M}
2. Flush {M} caches
3. Update {N} VM Config Tables
4. Start threads on {N}
72
Protocol VHNULL
Example: Inter-VM Content-based Page Sharing
• Is read-only sharing possible with VHNULL?
VMware's Implementation:
• Global hash table to store hashes of pages
• Guest pages scanned by the VMM, hashes computed
• Full comparison of pages on a hash match
Potential VHNULL Implementation:
• How does the hypervisor scan guest pages? Are they modified in cache?
• Even read-only pages must initially be written at some point
73
5-minute Ring Coherence
74
Ring Interconnect
• Why?
Short, fast point-to-point links
Fewer (data) ports
Less complex than packet-switched
Simple, distributed arbitration
Exploitable ordering for coherence
75
Cache Coherence for a Ring
• Ring is broadcast and offers ordering
• Apply existing bus-based snooping protocols?
• NO!
• Order properties of ring are different
76
Ring Order != Bus Order
[Diagram: ring with P3, P6, P9, and P12; requests A and B circulate, and different nodes observe them as {A, B} or {B, A}]
77
Ring-based Coherence
Existing Solutions:
1. ORDERING-POINT
• Establishes a total order
• Extra latency and control-message overhead
2. GREEDY-ORDER
• Fast in the common case
• Unbounded retries
Ideal Solution
• Fast for the average case
• Stable for the worst case (no retries)
78
New Approach: RING-ORDER
+ Requests complete in order of ring position
• Fully exploits ring ordering
+ Initial request always succeeds
• No retries, no ordering point
• Fast, stable, predictable performance
Key: Use token counting
• All tokens to write, one token to read
79
RING-ORDER Example
[Diagram: 12-node ring (P1-P12); P9 issues a getM for a store; per-node tokens are shown, one marked as the priority token]
80
RING-ORDER Example
[Diagram: P9's getM circulates the ring; FurthestDest = P9 is recorded]
81
RING-ORDER Example
[Diagram: P6 also issues a getM for a store while P9's request is outstanding; FurthestDest = P9]
82
RING-ORDER Example
[Diagram: both stores complete, in ring-position order; FurthestDest = P9]
83
Ring-based Coherence: Results Summary
System: 8-core with private L2s and shared L3
Key Results:
• RING-ORDER outperforms ORDERING-POINT by 7-86% with in-order cores
• RING-ORDER offers similar, or slightly better, performance than GREEDY-ORDER
• Pathological starvation did occur with GREEDY-ORDER
84
5-minute Multiple-CMP Coherence
85
Problem: Hierarchical Coherence
Intra-CMP protocol for coherence within a CMP
Inter-CMP protocol for coherence between CMPs
Interactions between protocols increase complexity
• Explodes the state space, especially without a bus
[Diagram: four CMPs (CMP 1-CMP 4) connected by an interconnect, with intra-CMP and inter-CMP coherence levels]
86
Hierarchical Coherence Example: Sun Wildfire
• First-level bus-based snooping protocol
• Second-level directory protocol
• Interface is key:
  • Accesses directory state
  • Asserts "bus ignore" signal if necessary
  • Replays the bus request when the second level completes
[Diagram: CPUs with caches on a bus, plus a memory interface to the second level]
87
Solution #1: DirectoryCMP
Two-level directory protocol for Multiple-CMPs
• Arbitrary interconnect ordering on- and off-chip
• Non-nacking; Safe States help resolve races
• Design of DirectoryCMP led to VHA
Advantages:
• Powerful, scalable, solid baseline
Disadvantages:
• Complex (~63 states at the interface), not model-checked
• Second-level indirections slow without a directory cache
88
Improving Multiple CMP Systems with Token Coherence
• Token Coherence allows Multiple-CMP systems to be...
  • Flat for correctness, but
  • Hierarchical for performance
[Diagram: a low-complexity Correctness Substrate beneath a fast Performance Protocol, spanning CMP 1-CMP 4 over the interconnect]
89
Solution #2: TokenCMP
Extend token coherence to Multiple-CMPs
• Flat for correctness, hierarchical for performance
• Enables a model-checkable solution
Flat Correctness:
• Global set of T tokens, passed to individual caches
• End-to-end token counting
• Keep the flat persistent-request scheme
90
TokenCMP Performance Policies
TokenCMPA:
• Two-level broadcast
• L2 broadcasts off-chip on a miss
• Local cache responds if it has extra tokens
• Responses from off-chip carry extra tokens
TokenCMPB:
• On-chip broadcast on L2 miss only (local indirection)
TokenCMPC: Extra states for further filtering
TokenCMPA-PRED: persistent request prediction
91
M-CMP Coherence: Summary of Results
System: four 4-core CMPs
Notable Results:
• TokenCMP 2-32% faster than DirectoryCMP w/ in-order cores
• TokenCMPA, TokenCMPB, TokenCMPC all perform similarly
• Persistent request prediction greatly helps Zeus
• TokenCMP gains diminished with out-of-order cores