Self-Balancing Dispatch (SBD)
N_OffChip: # of requests already waiting for the same bank in the off-chip memory
L_OffChip: typical latency of one off-chip memory request, excluding queuing delays
E_OffChip = N_OffChip × L_OffChip (total expected queuing delay for off-chip)
N_DRAM$: # of requests already waiting for the same bank in the DRAM cache
L_DRAM$: typical latency of one DRAM cache request, excluding queuing delays
E_DRAM$ = N_DRAM$ × L_DRAM$ (total expected queuing delay for the DRAM cache)
E_OffChip < E_DRAM$: send request to off-chip
E_OffChip ≥ E_DRAM$: send request to DRAM$
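Under these definitions, the SBD decision reduces to comparing the two expected queuing delays. A minimal sketch (function and variable names are illustrative, not from the poster):

```python
# Sketch of the Self-Balancing Dispatch (SBD) decision.
# n_*: number of requests already waiting for the same bank
# l_*: typical latency of one request, excluding queuing delays

def sbd_dispatch(n_offchip: int, l_offchip: float,
                 n_dram_cache: int, l_dram_cache: float) -> str:
    """Return which memory source a hit request should be steered to."""
    e_offchip = n_offchip * l_offchip            # expected off-chip queuing delay
    e_dram_cache = n_dram_cache * l_dram_cache   # expected DRAM$ queuing delay
    # E_OffChip < E_DRAM$: send to off-chip; otherwise to the DRAM cache
    return "off-chip" if e_offchip < e_dram_cache else "DRAM$"

print(sbd_dispatch(n_offchip=2, l_offchip=100.0,
                   n_dram_cache=8, l_dram_cache=50.0))  # off-chip: 200 < 400
```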
A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch
Motivation:
• The cache-line tracking structure (MissMap) used to avoid a DRAM cache access on a miss is expensive
• Applying a conventional cache organization to DRAM caches leaves the aggregate system bandwidth under-utilized
• Dirty data in DRAM caches severely restricts the effectiveness of speculative techniques
The 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45)
Jaewoong Sim  Gabriel H. Loh  Hyesoon Kim  Mike O'Connor  Mithuna Thottethodi
Our Solution: HMP + SBD + DiRT
• Use a low-cost hit-miss predictor to avoid a DRAM cache access on a miss (HMP)
• Steer hit requests to either the DRAM cache or off-chip memory based on the expected latency of both memory sources (SBD)
• Maintain a mostly-clean DRAM cache via region-based WT/WB to guarantee the cleanliness of a memory request (DiRT)
[Diagram: a memory request enters as input to the DiRT (clean or dirty?) and is then handled by the HMP and SBD]
Hit/Miss phases within a 4KB region
Die-Stacked DRAM Cache Hit-Miss Predictor (HMP)
• Base predictor (default) => prediction for 4MB regions
• 2nd-level table (on tag match) => prediction for 256KB regions
• 3rd-level table (on tag match) => prediction for 4KB regions
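A toy sketch of the multi-granular lookup: finer-grained tables override coarser ones when they have a matching entry. The dict-based tables, 2-bit counters, and shift amounts are assumptions for illustration, not the poster's exact design:

```python
# Illustrative multi-granular hit-miss predictor: a default coarse
# (4MB-region) table refined by tag-matched 256KB- and 4KB-region tables.

class MultiGranularHMP:
    def __init__(self):
        # region number -> 2-bit saturating counter (0..3); finer tables
        # are filled on demand ("tag match" = an entry exists)
        self.tables = {22: {}, 18: {}, 12: {}}   # 4MB, 256KB, 4KB region shifts

    def predict(self, addr: int) -> bool:
        """Predict hit (True) or miss (False), preferring finer granularity."""
        for shift in (12, 18, 22):               # finest table first
            table = self.tables[shift]
            region = addr >> shift
            if region in table:
                return table[region] >= 2        # MSB of the 2-bit counter
        return False                             # no entry anywhere: predict miss

    def train(self, addr: int, hit: bool):
        """Update the finest (4KB-region) counter toward the observed outcome."""
        region = addr >> 12
        c = self.tables[12].get(region, 1)
        self.tables[12][region] = min(3, c + 1) if hit else max(0, c - 1)
```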
Dirty Region Tracker (DiRT)
• Region-based WT/WB policy
Self-Balancing Dispatch (SBD)
• Steer hit requests to the DRAM cache or off-chip memory based on the expected latency of both memory sources
Putting It All Together
• HMP, SBD, and the DiRT can be accessed in parallel
• Based on the outcomes of the mechanisms, memory requests are sent to either the DRAM$ or off-chip DRAM
[Figure: number of lines installed in the cache for a page vs. number of accesses to the page, showing alternating Miss Phase / Hit Phase behavior: the installed-line count increases during miss phases and stays flat during hit phases]
• 95%+ prediction accuracy at less than 1KB of storage cost
Mechanism
[Flowchart: START → DiRT: dirty request? YES → DRAM$ queue. NO → HMP: predicted hit? NO → DRAM queue. YES → SBD: E(DRAM$) < E(DRAM)? YES → DRAM$ queue, NO → DRAM queue]
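The flow above can be sketched as a single routing function (names and signature are illustrative):

```python
# Combined dispatch flow: DiRT -> HMP -> SBD. The boolean inputs stand in
# for the outcomes of the three mechanisms described above.

def dispatch(is_dirty: bool, predicted_hit: bool,
             e_dram_cache: float, e_dram: float) -> str:
    """Route a memory request to the DRAM$ queue or the off-chip DRAM queue."""
    if is_dirty:                    # DiRT: dirty data may live only in the DRAM$
        return "DRAM$ queue"
    if not predicted_hit:           # HMP: a predicted miss goes straight off-chip
        return "DRAM queue"
    # SBD: a clean, predicted-hit request goes to the less-loaded source
    return "DRAM$ queue" if e_dram_cache < e_dram else "DRAM queue"
```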
[Figure: speedup over no DRAM cache for WL-1 through WL-10 and AVG., comparing MM, HMP, HMP + DiRT, and HMP + DiRT + SBD]
Algorithm: Self-Balancing Dispatch
[Diagram: hit requests wait in separate request buffers for the stacked DRAM$ and off-chip DRAM; another incoming hit request is steered to the source with the lower expected delay]
[Figure: per-workload results for WL-1 through WL-10; legend: DiRT, CLEAN]
• DiRT = Counting Bloom Filters (CBFs) + Dirty List
• CBFs: track the number of writes to different 4KB pages (or segments)
• Dirty List: records the most write-intensive pages
• Write-back policy is applied to the pages in the Dirty List; all other pages are write-through, so their data stay clean
[Diagram: a memory request's TAG indexes the CBFs (per-page write counters, e.g. 2 0 0 9 0 8 0 1); write-intensive pages are installed in the Dirty List, a small set-associative table (TAG + NRU bits, sets A and B)]
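A minimal sketch of the DiRT structures, assuming two CBFs, a simple install threshold, and Python's built-in hash; these specifics are assumptions for illustration, not the poster's design:

```python
# Counting Bloom filters count writes per 4KB page; pages whose counts
# cross a threshold are installed in a small Dirty List and switch from
# write-through (WT) to write-back (WB).

class DiRT:
    def __init__(self, size=1024, threshold=16, dirty_capacity=64):
        self.cbf = [[0] * size for _ in range(2)]   # two CBFs, different hashes
        self.size, self.threshold = size, threshold
        self.dirty = set()                          # pages under write-back
        self.capacity = dirty_capacity

    def _hashes(self, page):
        return (hash(page) % self.size, hash((page, 17)) % self.size)

    def on_write(self, addr):
        page = addr >> 12                           # 4KB page number
        h0, h1 = self._hashes(page)
        self.cbf[0][h0] += 1
        self.cbf[1][h1] += 1
        # the minimum over the filters upper-bounds the true write count
        if min(self.cbf[0][h0], self.cbf[1][h1]) >= self.threshold \
                and len(self.dirty) < self.capacity:
            self.dirty.add(page)                    # install in the Dirty List

    def policy(self, addr):
        """Write-back only for Dirty List pages; write-through otherwise."""
        return "WB" if (addr >> 12) in self.dirty else "WT"
```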
[Figure: percentage of writebacks to DRAM for WL-1 through WL-10, comparing DiRT, WB, and WT]
[Diagram: a DRAM bank with row decoder and sense amplifier; each 2KB DRAM row holds 32 blocks for 64B lines, organized as 3 tag blocks and 29 data blocks]
MissMap: 4MB for a 1GB DRAM$! Where to architect this?
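The 4MB figure is roughly consistent with back-of-the-envelope arithmetic, assuming one presence bit per 64B line in each tracked 4KB segment plus an assumed ~36-bit tag per entry (the tag width is an illustration, not from the poster):

```python
# Rough MissMap sizing for a 1GB DRAM cache with 64B lines and 4KB segments.

cache_bytes   = 1 << 30          # 1GB DRAM$
line_bytes    = 64
segment_bytes = 4096

lines_per_segment = segment_bytes // line_bytes   # 64 presence bits per entry
segments_covered  = cache_bytes // segment_bytes  # 262,144 entries
entry_bits        = lines_per_segment + 36        # bit vector + assumed tag

missmap_bytes = segments_covered * entry_bits // 8
print(missmap_bytes / (1 << 20))  # 3.125 (MB) -> on the order of 4MB
```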
Should we send this to the DRAM cache?
[Figure: breakdown of requests for WL-1 through WL-10: predicted hits sent to the DRAM$ (PH To DRAM$), predicted hits sent to DRAM (PH To DRAM), and predicted misses]
Same Tech/Logic (DRAM Stack)
[Diagram: a DRAM stack integrated on top of a processor die (credit: IBM)]
Hundreds of MBs of on-chip stacked DRAM!
• Two main usages:
  • Use it as main memory
  • Use it as a large cache (DRAM cache)
This work is about the DRAM cache!
HMP-MG: Multi-Granular Hit-Miss Predictor
DRAM Cache Organization - Loh and Hill [MICRO'11]
HMP-region: Region-Based Hit-Miss Predictor
• Break up the memory space into coarse-grained regions (e.g. 4KB)
• Index into the HMP-region table with a hash of the region's base address
Provides the cache-line existence info!