Cache Coherence “Can we do a better job of supporting cache coherence ?”
Power Efficient Cache Coherence
description
Transcript of Power Efficient Cache Coherence
Power Efficient Cache Coherence
Craig SaldanhaMikko Lipasti
Motivation Power consumption becoming a
serious design constraint. Market demand for faster and
more complex servers. Complex coherence protocol and
interconnect. fclock + Complexity = Pinterconnect
Motivation Traditional power saving
methodologies ineffective. Minimize number of transaction
packets. At the end-points: Jetty. At the source: Serial Snoop.
OUTLINE
Overview of Snoop based protocols and opportunity for power savings
Latency and Power consumption of parallel snooping techniques
Serial Snooping Results Conclusions Future Work
Snoop Based Coherence
Memory
Local Node Tag Lookup Bus Arbitration Snoop Transmission
Remote Node Tag Lookup Snoop Response Combination of Responses Data Fetch Data Transmit
Tag Array
Data Array
P1 P2 P3 P4
Degrees of Speculation 3 Degrees of Freedom to SpeculateSnooping Data Fetch Data Transmit
Snoop
DFetch
DFetch
D-Xmit
D-Xmit
D-Xmit
D-Xmit
Parallel Snoop,Spec Dfetch Spec DXmit
Parallel Snoop,Spec Dfetch Non-Spec DXmit
Parallel Snoop,Non-Spec Dfetch Non-Spec DXmit
Serial Snoop,Spec Dfetch Spec DXmit
Serial Snoop,Spec Dfetch Non-Spec DXmit
Serial Snoop,Non-Spec Dfetch Non-Spec DXmitX
X
Latency and Power assumptions
MEMORY
Consider only load misses
Tree of point-point connections.
Latency to traverse a link: 1Bus cycle (7ns)
Tag Look up :1 bus cycle (7ns)
D-Fetch: 2 bus cycles (14ns)
DRAM access: 10 bus cycles (70ns)
P1 P2 P3 P4
Root Node
Switch1 Switch2
Board1
Backplane
Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)
MEMORY
Snoop Broadcast
0 7
Latency
Power
Plink
Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)
MEMORY
Snoop Broadcast
0 7
Latency
Power
Plink+Pswitch
14
Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)
MEMORY
Snoop BroadcastLatency
Power
2Plink+Pswitch
0 7 14 21
Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)
MEMORY
Snoop BroadcastLatency
Power
2Plink+2Pswitch
0 7 14 21 28
Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)
MEMORYSnoop BroadcastLatency
Power
5Plink+2Pswitch
0 7 14 21 28 35
Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)
MEMORY
Snoop BroadcastLatency
Power
5Plink+4Pswitch
0 7 14 21 28 35 42
Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)
MEMORY
Snoop BroadcastLatency
Power
8Plink+4Pswitch
0 7 14 21 28 35 42 49
Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)
Memory Access :Data FetchLatency
Power
Pmem
35 91 105
MEMORY
Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)
Memory Access :Data TransmitLatency
Power
Pmem+3Plink+2Pswitch
35 91 105 140
MEMORY
Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)
Remote Node: Tag LookupLatency
Power
3Ptag
49 56
MEMORY
Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)
MEMORYRemote Node :Snoop Response Latency
Power
3Ptag+3*(Pswitch+2Plink)+Plink+2Pswitch+2Plink
49 56 105
Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)
Remote Node: Data FetchLatency
Power
3Pcache
49 63
MEMORY
Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)
MEMORYRemote Node :Data Transmit Latency
Power
3Ptag+3*(4Plink+3Pswitch)
49 63 112
Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)
Local Node
0TL Snoop BRDCSTTL49
105 14091Data XmitMemory Access
35Memory
105 14091Data XmitMemory Access
35
63DF
49Remote Node11263
DF Data Xmit49
10556 RSP + CMB
49Remote Node
10556 RSP + CMB
49TL
Latency
Power29Plink+18Pswitch+3Ptag+3Pcache+PmemRemote Node supplies the data
Memory supplies the data20Plink+11Pswitch+3Ptag+3Pcache+Pmem
Parallel Snoop Speculative Fetch Non-Speculative Transmit (PS/SF/NT)
0 49
63
105
105 14091
56
Snoop BRDCST
RSP
+ CMB
DF
Data XmitMemory Access35
TL
49
49
Local Node
Remote Node
Memory
Remote Node
105 154
TL
Data Xmit
Power21Plink+12Pswitch+3Ptag+Pcache+PmemRemote Node supplies the data
Memory supplies the data20Plink+11Pswitch+3Ptag+3Pcache+Pmem
Latency
Parallel Snoop Non-Speculative Fetch Non-Speculative Transmit (PS/NF/NT)
49
105
196161
Snoop BRDCST
RSP +
CMB
Data XmitMemory Access91
TL
49
Local Node
Remote Node
Memory
105
TL
119 168
Remote Node
DF Data Xmit
0
Latency
Power21Plink+12Pswitch+3Ptag+PcacheRemote Node supplies the data
Memory supplies the data 20Plink+11Pswitch+3Ptag+Pmem
MEMORY
Serial Snooping
Avoids Speculative transmission of Snoop packets.
MEMORY
Serial Snooping
Avoids Speculative transmission of Snoop packets.
Check the nearest neighbor
Data supplied with minimum latency and power
MEMORY
Serial Snooping
Forward snoop to next level
MEMORY
Serial Snooping
Forward snoop to next level
MEMORY
Serial Snooping
Search other half of tree
MEMORY
Serial Snooping
Search other half of tree
Search leaf nodes serially
MEMORY
Serial Snooping
Search other half of tree
Search leaf nodes serially
Serial Snooping : Features Latency to satisfy a request dependent on
distance from requestor. Data resident at the nearest neighbor
supplied with the lowest latency and power. Requests visible to memory controller only
at root node. Latency is adversely affected when
requested data present at the farthest node Worst case power consumption is still less
than the parallel snooping.
MEMORY
Serial Snooping :
Request satisfied by Nearest NodeLatency
Power
0 21 SNPTL
35 56DF Data Xmit
21
28 49 RSPTL
21
P1
P2
P2
Xmit Snoop: 2Plink + Pswitch
P2 Tag access and snoop response: Ptag + 2Plink + Pswitch
P2 Data Fetch and Xmit: Pcache +2Plink + Pswitch
Ptotal= 6Plink+3Pswitch+Ptag+Pcache
MEMORY
Serial Snooping :
Request satisfied by Next-Nearest NeighborLatency
Power
0 21 SNPTL
28 49
RSPTL21 63 77
77 84 133
77 91 140DF
TL Xmit
Xmit
XmitMemory Access63 133 168
P1
P2
P3
P3
Memory
16Plink+10Pswitch+2Ptag+Pcache
MEMORY
Serial Snooping :
Request satisfied by farthest nodeLatency
Power
0 21
XmitTL
28 49 RSPTL
21 63 77
77 84 10TL Xm
P1
P2
P3
Spec-Memory
P4
P4
Non-Spec-Memory
XmitMemory Access 133 168
RSPTL
105 161
DF Xmit
147
112
105 168119
XmitMemory Access217 252
147
XmitMemory Access133 182147
18Plink+11Pswitch+3Ptag+PcacheRemote Node supplies the data
If Memory supplies the data 17Plink+10Pswitch+3Ptag+Pmem
RESULTS : Load Miss Distributions
Load Miss Distribution
0
5
10
15
20
25
30
35
40
45
% o
f lo
ad
mis
se
s s
ati
sfi
ed
NearestNeighbourNext NearestNodeFarthestNodeMemory
raytrace specwebspecjbb tpc-w
RESULTS: Average Latencies to satisfy load misses
Average Latencies to satisfy load misses
0 20 40 60 80 100 120 140 160 180 200
Avg. Latencies (ns)
SSNFNT
SSSFNT
SSSFST
PSNFNT
PSSFNT
PSSFST
raytrace
specjbb
spe
cwe
btpc-w
RESULTS: Relative Power Savings
Factors contributing to power consumption
0
0.2
0.4
0.6
0.8
1
1.2
PS/SF/ST
SS/SF/ST
PS/SF/NT
SS/SF/NT
PS/NF/NT
SS/NF/NT
Plink Psw Ptag Pcache Pmem
CONCLUSIONS Reducing degree of speculation has
potential for significant power savings
Performance degradation is minimal for the set of benchmarks studied.
Serial Snooping with speculative memory fetch provides optimal latency and power consumption.
Future Work Develop detailed execution-driven
Power Model Explore different interconnect
topologies. Examine the viability of adaptive
mechanisms for protocol policy.
MEMORY
Serial Snooping
Nearest NeighborLatency
Power
2Plink+Pswitch
0 7 14 21
Questions
Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)
Local Node
0TL Snoop BRDCSTTL49
MEMORY
Memory Access35
Memory Access35 91