
Transcript of: A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch

Self-Balancing Dispatch (SBD)

$N_{\text{off-chip}}$: number of requests already waiting for the same bank in the off-chip memory

$L_{\text{off-chip}}$: typical latency of one off-chip memory request, excluding queuing delays

$E_{\text{off-chip}} = N_{\text{off-chip}} \times L_{\text{off-chip}}$: total expected queuing delay for off-chip

$N_{\text{DRAM\$}}$: number of requests already waiting for the same bank in the DRAM cache

$L_{\text{DRAM\$}}$: typical latency of one DRAM cache request, excluding queuing delays

$E_{\text{DRAM\$}} = N_{\text{DRAM\$}} \times L_{\text{DRAM\$}}$: total expected queuing delay for the DRAM cache

If $E_{\text{off-chip}} < E_{\text{DRAM\$}}$: send the request to off-chip memory

If $E_{\text{off-chip}} \ge E_{\text{DRAM\$}}$: send the request to the DRAM$
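The decision rule above is simple enough to sketch in a few lines. The following is a minimal illustration, not the paper's hardware implementation; the latency constants and queue counts are assumed inputs from the memory controller:

    # Minimal sketch of Self-Balancing Dispatch (SBD). The bank queue
    # occupancies and per-request latencies are assumed inputs from the
    # memory controller; the names and values are illustrative.

    L_OFF_CHIP = 120    # typical off-chip service latency (cycles), assumed
    L_DRAM_CACHE = 60   # typical DRAM$ service latency (cycles), assumed

    def dispatch_hit_request(n_off_chip: int, n_dram_cache: int) -> str:
        """Steer a clean, predicted-hit request to whichever memory
        source has the lower expected queuing delay E = N x L."""
        e_off_chip = n_off_chip * L_OFF_CHIP        # E_off-chip
        e_dram_cache = n_dram_cache * L_DRAM_CACHE  # E_DRAM$
        # E_off-chip < E_DRAM$ : off-chip wins; otherwise use the DRAM$.
        return "off-chip" if e_off_chip < e_dram_cache else "DRAM$"

    # Example: 3 requests queued at the off-chip bank vs. 8 at the DRAM
    # cache bank gives E = 360 vs. 480, so the request goes off-chip.
    assert dispatch_hit_request(3, 8) == "off-chip"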

A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch

Motivation:
• The cache line tracking structure (MissMap) used to avoid a DRAM cache access on a miss is expensive
• Applying a conventional cache organization to DRAM caches leaves the aggregate system bandwidth under-utilized
• Dirty data in DRAM caches severely restricts the effectiveness of speculative techniques

The 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45)

Jaewoong Sim, Hyesoon Kim, Gabriel H. Loh, Mike O'Connor, Mithuna Thottethodi

[Diagram: a memory request is first checked against the DiRT; its clean/dirty status is the input to the HMP and SBD.]

Our Solution: HMP + SBD + DiRT
• Use a low-cost hit-miss predictor to avoid a DRAM cache access on a miss (HMP)
• Steer hit requests to either the DRAM cache or off-chip memory based on the expected latency of both memory sources (SBD)
• Maintain a mostly-clean DRAM cache via region-based WT/WB to guarantee the cleanliness of a memory request (DiRT)

Hit/Miss phases within a 4KB region

Die-Stacked DRAM Cache Hit-Miss Predictor (HMP)

• Base Predictor (default) => prediction for 4MB regions
• 2nd-Level Table (on tag match) => prediction for 256KB regions
• 3rd-Level Table (on tag match) => prediction for 4KB regions
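A rough software sketch of such a multi-granularity predictor (HMP_MG), assuming 2-bit saturating counters and dictionary-backed tables; the real HMP uses small tagged SRAM tables, so the structures here are stand-ins:

    # Hedged sketch of a multi-granularity hit-miss predictor (HMP_MG).
    # Dict-backed tables and the training policy are assumptions.

    REGION_SIZES = (4 << 20, 256 << 10, 4 << 10)  # 4MB, 256KB, 4KB levels
    CMAX = 3  # 2-bit saturating counter: 0..3; >= 2 predicts "hit"

    class HitMissPredictor:
        def __init__(self):
            # One table per granularity level; the base (4MB) level
            # always answers, finer levels refine it on a tag match.
            self.tables = [dict() for _ in REGION_SIZES]

        def _region(self, addr: int, level: int) -> int:
            return addr // REGION_SIZES[level]

        def predict(self, addr: int) -> bool:
            # Use the finest level that has an entry ("tag match").
            for level in (2, 1):
                ctr = self.tables[level].get(self._region(addr, level))
                if ctr is not None:
                    return ctr >= 2
            # Base predictor: default counter value predicts "miss".
            return self.tables[0].get(self._region(addr, 0), 1) >= 2

        def update(self, addr: int, was_hit: bool):
            # Train every level; counters rise on hits, fall on misses.
            for level in range(len(REGION_SIZES)):
                key = self._region(addr, level)
                ctr = self.tables[level].get(key, 1)
                ctr = min(ctr + 1, CMAX) if was_hit else max(ctr - 1, 0)
                self.tables[level][key] = ctr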

Dirty Region Tracker (DiRT)
• Region-based WT/WB policy

Self-Balancing Dispatch (SBD)
• Steer hit requests to a DRAM cache or off-chip memory based on the expected latency of both memory sources

Putting It All Together

β€’ HMP, SBD, and the DiRT can be accessed in parallel

β€’ Based on the outcomes of the mechanisms, memory requests are sent to either DRAM$ or off-chip DRAM


[Plot (hit/miss phases within a 4KB region): number of lines installed in the cache for a page (y-axis, 0–80) vs. number of accesses to the page (x-axis, 1–199). The curve alternates between a Miss Phase and a Hit Phase: increasing on misses, flat on hits.]

95%+ prediction accuracy at a less-than-1KB storage cost

[Flowchart (Mechanism): START → DiRT: dirty request? YES → DRAM$ queue. NO → HMP: predicted hit? NO → DRAM queue. YES → SBD: E(DRAM$) < E(DRAM)? YES → DRAM$ queue; NO → DRAM queue.]
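Putting the flow into code, a minimal sketch of the combined routing decision; the predicate functions and queue occupancies stand in for the three hardware structures, and the latency constants mirror the SBD sketch above:

    # Sketch of the combined DiRT + HMP + SBD routing, matching the
    # flowchart. All names here are illustrative, not from the paper.

    L_OFF_CHIP = 120    # assumed typical off-chip latency (cycles)
    L_DRAM_CACHE = 60   # assumed typical DRAM$ latency (cycles)

    def route_request(addr, dirt_is_dirty, hmp_predicts_hit,
                      n_dram_cache, n_off_chip):
        if dirt_is_dirty(addr):
            # Page is in the Dirty List: the DRAM$ may hold the only
            # up-to-date copy, so the request must use the DRAM$ queue.
            return "DRAM$ queue"
        if not hmp_predicts_hit(addr):
            # Guaranteed clean and predicted miss: go straight off-chip.
            return "DRAM queue"
        # Guaranteed clean, predicted hit: SBD compares expected delays.
        e_dram_cache = n_dram_cache * L_DRAM_CACHE
        e_off_chip = n_off_chip * L_OFF_CHIP
        return "DRAM queue" if e_off_chip < e_dram_cache else "DRAM$ queue"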

[Bar chart: speedup over no DRAM cache (0.8–1.6) for WL-1 through WL-10 and AVG., comparing MM (MissMap), HMP, HMP + DiRT, and HMP + DiRT + SBD.]

Algorithm: Self-Balancing Dispatch
[Diagram: another hit request is steered to either the stacked DRAM$ or the off-chip DRAM, each with its own request buffer.]

[Chart: 0–100% of requests for WL-1 through WL-10; series: DiRT, CLEAN.]

[Diagram: region-based WT/WB over 4KB pages (or segments). Example per-page write counts: 2, 0, 0, 9, 0, 8, 0, 1. The write-intensive pages (counts 9 and 8) use Write-Back; the rest use Write-Through, so their lines are guaranteed clean ("Clean!").]

• DiRT = Counting Bloom Filters (CBFs) + Dirty List
• CBFs: track the number of writes to different pages
• Dirty List: records the most write-intensive pages

[Diagram: a memory request's TAG is hashed into the Counting Bloom Filters; sufficiently write-intensive pages are installed in the Dirty List, a set-associative table (SET A, SET B) of TAG + NRU entries.]

Write-back policy is applied to the pages in the Dirty List
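A rough sketch of the DiRT's operation follows. The CBF size, hash count, and write threshold are assumptions, and an unbounded set stands in for the small NRU-managed Dirty List, to keep the sketch short:

    # Hedged sketch of the Dirty Region Tracker (DiRT).
    import hashlib

    CBF_SIZE = 4096      # counters per Bloom filter (assumption)
    NUM_HASHES = 3       # hash functions per CBF (assumption)
    WRITE_THRESHOLD = 4  # writes before a page counts as write-intensive

    class DirtyRegionTracker:
        def __init__(self):
            self.cbf = [0] * CBF_SIZE
            self.dirty_list = set()  # pages using write-back

        def _indexes(self, page: int):
            # Derive NUM_HASHES independent indexes by salting the hash.
            for i in range(NUM_HASHES):
                h = hashlib.blake2b(f"{i}:{page}".encode()).digest()
                yield int.from_bytes(h[:4], "little") % CBF_SIZE

        def on_write(self, addr: int):
            page = addr >> 12  # 4KB page number
            # Count the write in the CBF; the page's estimated write
            # count is the minimum of its counters (standard CBF query).
            counts = []
            for idx in self._indexes(page):
                self.cbf[idx] += 1
                counts.append(self.cbf[idx])
            if min(counts) >= WRITE_THRESHOLD:
                self.dirty_list.add(page)  # switch page to write-back

        def is_dirty(self, addr: int) -> bool:
            # Only Dirty List pages may hold dirty lines in the DRAM$;
            # all other pages are write-through and guaranteed clean.
            return (addr >> 12) in self.dirty_list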

[Chart: percentage of writebacks to DRAM (0–100%) for WL-1 through WL-10, comparing DiRT, WB, and WT.]

[Diagram: DRAM bank with row decoder and sense amplifier. A 2KB row holds 32 blocks of 64B each: 3 tag blocks and 29 data blocks.]

MissMap: 4MB for a 1GB DRAM$! Where to architect this?
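A back-of-envelope check of that size claim, assuming one MissMap entry per 4KB segment holding a ~36-bit tag plus a 64-bit presence vector (one bit per 64B line); the exact entry layout is an assumption:

    # Rough sizing of a MissMap for a 1GB DRAM$.
    cache_bytes = 1 << 30                      # 1GB DRAM$
    segment_bytes = 4 << 10                    # 4KB segments
    lines_per_segment = segment_bytes // 64    # 64 presence bits
    entry_bits = lines_per_segment + 36        # vector + ~36-bit tag
    segments_tracked = cache_bytes // segment_bytes  # 256K entries
    total_mb = segments_tracked * entry_bits / 8 / (1 << 20)
    print(f"{total_mb:.1f} MB")                # ~3.1 MB of SRAM, i.e. several MB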

Should we send this to the DRAM cache?

[Chart: request breakdown (0–100%) for WL-1 through WL-10; categories: PH (predicted hit, to DRAM$), PH (predicted hit, to DRAM), and Predicted Miss.]

[Image: die-stacked DRAM (same tech/logic in the DRAM stack) on top of a processor die. Credit: IBM.]


Hundreds of MBs of on-chip stacked DRAM!
• Two main uses:
  • Use it as main memory
  • Use it as a large cache (DRAM cache)
This work is about the DRAM cache!

HMP_MG: Multi-Granular Hit-Miss Predictor

DRAM Cache Organization - Loh and Hill [MICRO'11]
• MissMap: provides the cache-line existence info!

HMP_region: Region-Based Hit-Miss Predictor
• Break up the memory space into coarse-grained regions (e.g., 4KB)
• Index into the HMP_region table with a hash of the region's base address
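A small illustrative sketch of that indexing step; the table size and XOR-fold hash are assumptions, not the paper's exact design:

    # Index into the HMP_region table from a 4KB region's base address.
    TABLE_ENTRIES = 256  # small table => less-than-1KB predictor storage

    def hmp_region_index(addr: int) -> int:
        region_base = addr & ~0xFFF       # base of the 4KB region
        region_num = region_base >> 12
        # Cheap XOR-fold hash of the region number (assumed hash).
        h = region_num ^ (region_num >> 8) ^ (region_num >> 16)
        return h % TABLE_ENTRIES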
