Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures
Rajeev Balasubramonian
School of Computing, University of Utah
July 1st 2004
University of Utah 2
Billion-Transistor Chips
• Partitioned architectures: small computational units connected by a communication fabric
Small computational units with limited functionality allow fast clocks, low design effort, and low power
Numerous computational units provide high parallelism
The Communication Bottleneck
• Wire delays do not scale down at the same rate as logic delays [Agarwal, ISCA’00][Ho, Proc. IEEE’01]
30-cycle delay to cross the chip, projected within 10 years
1-cycle inter-hop latency in the RAW prototype [Taylor, ISCA’04]
Cache Design
Centralized Cache (single L1D)
• Address transfer: 6 cyc
• RAM access: 6 cyc
• Data transfer: 6 cyc
• 18-cycle access (12 cycles for communication)
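The 18-cycle figure is the sum of the three 6-cycle components on the slide. A minimal sketch of that arithmetic follows; the constant and function names are illustrative and do not come from the talk.

```python
# Latencies in cycles, as given on the slide.
ADDRESS_TRANSFER = 6
RAM_ACCESS = 6
DATA_TRANSFER = 6


def centralized_access_latency(addr=ADDRESS_TRANSFER, ram=RAM_ACCESS,
                               data=DATA_TRANSFER):
    """Return (total cycles, cycles spent on wires) for one cache access."""
    communication = addr + data  # both transfers are pure wire delay
    return communication + ram, communication


total, communication = centralized_access_latency()
print(total, communication)  # 18 12
```

Two thirds of the access time in this configuration is spent moving the address and the data, not reading the RAM.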
Decentralized Cache (diagram: four L1D banks)
Research Goals
• Identify bottlenecks in cache access
• Design cluster prefetch, a latency hiding mechanism
• Evaluate and compare centralized and decentralized designs
Outline
• Motivation
• Evaluation platform
• Cluster prefetch
• Centralized vs. decentralized caches
• Conclusions
Clustered Microarchitectures
• Centralized front-end
• Dynamically steered (dependences & load)
• O-o-o issue and 1-cycle bypass within a cluster
• Hierarchical interconnect
(Diagram labels: Instr Fetch, L1D, LSQ, crossbar, ring)
Simulation Parameters
• Simplescalar-based simulator
• In-flight instruction window of 480
• 16 clusters, each with 60 registers, 30 issue queue entries, and one FU of each kind
• Inter-cluster latencies between 2 and 10 cycles
• Primary focus on SPEC-FP programs
Steps Involved in Cache Access
(Diagram labels: Instr Fetch, L1D, LSQ)
• Instr Dispatch
• Effective Address Computation
• Effective Address Transfer
• Memory Disambiguation
• RAM Access
• Data Transfer
Lifetime of a Load
Figure: average cycles per load, broken down by stage (y-axis: 0 to 120)
• Transfer of instr to cluster: 2
• Eff. addr. computation: 25
• Addr. transfer to LSQ: 7
• Memory dependence resolution: 34
• Cache access: 26
• Data transfer from LSQ to cluster: 5
• Total: 98
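The per-stage averages in this chart can be accumulated into a timeline. The sketch below assumes the bar values (2, 25, 7, 34, 26, 5) map to the stages in the order the labels appear; the resulting start cycles agree with the cycle numbers quoted on the Load Address Prediction slides (27, 68, 94). All identifiers are illustrative.

```python
from itertools import accumulate

# Average cycles per stage, read off the chart (assumed bar order).
STAGES = [
    ("Transfer of instr to cluster", 2),
    ("Eff. addr. computation", 25),
    ("Addr. transfer to LSQ", 7),
    ("Memory dependence resolution", 34),
    ("Cache access", 26),
    ("Data transfer from LSQ to cluster", 5),
]


def stage_start_cycles(stages):
    """Map each stage name to the cycle at which it begins (dispatch = 0)."""
    latencies = [cycles for _, cycles in stages]
    starts = [0] + list(accumulate(latencies))[:-1]
    return {name: start for (name, _), start in zip(stages, starts)}


starts = stage_start_cycles(STAGES)
for name, start in starts.items():
    print(f"{start:3d}  {name}")
print("sum of stages:", sum(cycles for _, cycles in STAGES))
```

The stage latencies sum to 99, one more than the charted total of 98; the difference is presumably rounding in the per-stage averages.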
Load Address Prediction
Without address prediction (diagram: Cluster, LSQ, L1D):
• Dispatch at cycle 0
• Eff. Addr. Transfer: cycle 27
• Cache Access: cycle 68
• Data Transfer: cycle 94
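Load Address Prediction (continued)
Baseline (diagram: Cluster, LSQ, L1D): Dispatch at cycle 0, Eff. Addr. Transfer at cycle 27, Cache Access at cycle 68, Data Transfer at cycle 94
With an Address Predictor (diagram: Cluster, LSQ, L1D, Address Predictor):
• Dispatch at cycle 0
• Cache Access: cycle 0
• Data Transfer: cycle 26
• Eff. Addr. Transfer: cycle 27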
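A rough way to see where the cycle-26 figure comes from: if the address is predicted at dispatch, the cache access no longer waits for address computation, address transfer, or memory dependence resolution. The sketch below uses the average stage latencies from the Lifetime of a Load chart; the function name and the assumption that a predicted access starts exactly at cycle 0 are illustrative simplifications.

```python
# Average stage latencies in cycles (from the "Lifetime of a Load" chart).
INSTR_TRANSFER, ADDR_COMPUTE, ADDR_TRANSFER = 2, 25, 7
MEM_DEP_RESOLUTION, CACHE_ACCESS, DATA_TRANSFER = 34, 26, 5


def data_transfer_start(address_predicted: bool) -> int:
    """Cycle (after dispatch) at which data leaves the cache for the cluster."""
    if address_predicted:
        # The predicted address is available at dispatch, so the cache
        # access starts at cycle 0 and overlaps the address computation.
        return CACHE_ACCESS
    return (INSTR_TRANSFER + ADDR_COMPUTE + ADDR_TRANSFER
            + MEM_DEP_RESOLUTION + CACHE_ACCESS)


base = data_transfer_start(False)
predicted = data_transfer_start(True)
print(base, predicted, base - predicted)  # 94 26 68
```

The real effective address still reaches the LSQ at around cycle 27, where it can be used to check the prediction.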
Memory Dependence Speculation
• To allow early cache access, loads must issue before resolving earlier store addresses
• High-confidence store address predictions are employed for disambiguation
• Stores that have never forwarded results within the LSQ are ignored
Cluster Prefetch: Combination of Load Address Prediction and Memory Dependence Speculation
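The bullets above can be read as a per-load decision rule. The following is a minimal sketch of one possible rule, not the mechanism as implemented in the talk: the `PendingStore` fields, the function name, and the policy of stalling when a conflicting store has no usable address are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class PendingStore:
    """An earlier, not-yet-committed store in the LSQ (hypothetical fields)."""
    address: Optional[int]            # resolved address, if already known
    predicted_address: Optional[int]  # high-confidence prediction, if any
    has_forwarded_before: bool        # has this store ever forwarded in the LSQ?


def may_access_cache_early(load_address: int,
                           earlier_stores: Iterable[PendingStore]) -> bool:
    """Decide whether a load may access the cache before stores resolve."""
    for store in earlier_stores:
        if not store.has_forwarded_before:
            continue  # stores that never forwarded are ignored
        known = (store.address if store.address is not None
                 else store.predicted_address)
        if known is None or known == load_address:
            return False  # possible or actual conflict: do not speculate
    return True


stores = [
    PendingStore(address=None, predicted_address=None, has_forwarded_before=False),
    PendingStore(address=None, predicted_address=0x2000, has_forwarded_before=True),
]
print(may_access_cache_early(0x1000, stores))  # True
print(may_access_cache_early(0x2000, stores))  # False
```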
Implementation Details
• Centralized table that maintains stride and last address; stride is determined by five consecutive accesses and cleared in case of five mispredicts
• Separate centralized table that maintains a single bit per entry to indicate stores that pose conflicts
• Each mispredict flushes all subsequent instrs
• Storage overhead: 18KB
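The stride table in the first bullet can be sketched as follows. This is an illustration under stated assumptions, not the design from the talk: the table is modeled as an unbounded dictionary keyed by load PC, "five consecutive accesses" is taken to mean five accesses in a row with a constant stride, and the five mispredicts are counted cumulatively once a stride is established. All class, field, and constant names are hypothetical.

```python
from dataclasses import dataclass

CONFIRM_THRESHOLD = 5  # consecutive accesses needed to establish a stride
MISPREDICT_LIMIT = 5   # mispredicts after which the stride is cleared


@dataclass
class Entry:
    last_address: int
    stride: int = 0
    run: int = 1           # consecutive accesses consistent with `stride`
    mispredicts: int = 0
    confident: bool = False


class StridePredictor:
    def __init__(self):
        self.table = {}  # load PC -> Entry (a real table is finite and tagged)

    def predict(self, pc):
        """Return the predicted next address, or None if not confident."""
        entry = self.table.get(pc)
        if entry is None or not entry.confident:
            return None
        return entry.last_address + entry.stride

    def update(self, pc, address):
        """Train the entry for `pc` with the load's actual address."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = Entry(last_address=address)
            return
        observed = address - entry.last_address
        entry.last_address = address
        if entry.confident:
            if observed != entry.stride:
                entry.mispredicts += 1
                if entry.mispredicts >= MISPREDICT_LIMIT:
                    # Clear the stride and relearn from this access.
                    entry.confident, entry.run, entry.mispredicts = False, 1, 0
            return
        if entry.run >= 2 and observed == entry.stride:
            entry.run += 1
        else:
            entry.stride, entry.run = observed, 2
        if entry.run >= CONFIRM_THRESHOLD:
            entry.confident, entry.mispredicts = True, 0


predictor = StridePredictor()
for addr in (100, 108, 116, 124):
    predictor.update(pc=0x40, address=addr)
print(predictor.predict(0x40))  # None: only four accesses seen so far
predictor.update(pc=0x40, address=132)
print(predictor.predict(0x40))  # 140
```

In the design described on the slide, a wrong prediction is costly because each mispredict flushes all subsequent instructions, which is why a confidence threshold is applied before any prediction is used.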
Performance Results
Figure: instructions per cycle (IPC), y-axis 0 to 3, for applu, apsi, art, equake, fma3d, galgel, lucas, mesa, mgrid, swim, wupwise, and the harmonic mean (HM)
Configurations compared:
• Base case
• Ld-addr pred only
• St-addr and mem-dep pred only
• Ld-addr, st-addr, and mem-dep pred
Overall IPC improvement: 21%
Results Analysis
• Roughly half the programs improved IPC by >8%
• Load address prediction rate: 65%
• Store address prediction rate: 79%
• Stores likely to not pose conflicts: 59%
• Avg. number of mispredicts: 12K per 100M instrs
Decentralized Cache
Replicated Cache Banks
• Loads do not travel far
• Stores & cache refills are broadcast
• Memory disambiguation is not accelerated
• Overheads: interconnect for broadcast and cache refill, power for redundant writes, distributed LRU, etc.
(Diagram: four L1D banks, each paired with its own LSQ)
Comparing Centralized & Decentralized
(Diagram: one L1D/LSQ pair for the centralized design, four L1D/LSQ pairs for the decentralized design)

                                Centralized   Decentralized
IPC without cluster prefetch    1.43          1.52
IPC with cluster prefetch       1.73          1.79
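The relative gains implied by these four IPC values can be worked out directly. The sketch below assumes the first number in each pair belongs to the centralized cache and the second to the decentralized cache, which matches the 21% improvement reported for the base (centralized) design; the dictionary keys and function name are illustrative.

```python
# IPC values from the slide; the column assignment is an assumption.
IPC = {
    ("centralized", "base"): 1.43,
    ("decentralized", "base"): 1.52,
    ("centralized", "prefetch"): 1.73,
    ("decentralized", "prefetch"): 1.79,
}


def gain_percent(new: float, old: float) -> float:
    """Relative IPC improvement of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


for cache in ("centralized", "decentralized"):
    g = gain_percent(IPC[(cache, "prefetch")], IPC[(cache, "base")])
    print(f"cluster prefetch on {cache}: {g:.1f}%")   # 21.0%, then 17.8%

for mode in ("base", "prefetch"):
    g = gain_percent(IPC[("decentralized", mode)], IPC[("centralized", mode)])
    print(f"decentralization ({mode}): {g:.1f}%")     # 6.3%, then 3.5%
```

Under this reading, cluster prefetch on a centralized cache gains more than decentralization alone, and decentralizing on top of cluster prefetch adds only a few percent.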
Sensitivity Analysis
• Results verified for processor models with varying resources and interconnect latencies
• Evaluations on SPEC-Int: address prediction rate is only 38%, so speedups are modest:
twolf (7%), parser (9%); crafty, gcc, vpr (3-4%); rest (< 2%)
Related Work
• Modest speedups with decentralized caches: Racunas and Patt [ICS ’03], for dynamic clustered processors; Gibert et al. [MICRO ’02], for VLIW clustered processors
• Gibert et al. [MICRO ’03]: compiler-managed L0 buffers for critical data
Conclusions
• Address prediction and memory dependence speculation can hide latency to cache banks; prediction rate of 66% for SPEC-FP and IPC improvement of 21%
• Additional benefits from decentralization are modest
• Future work: build better predictors; study the impact on power consumption [WCED ’04]