
Transcript of: A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch

Self-Balancing Dispatch (SBD)

$N_{\text{off-chip}}$: number of requests already waiting for the same bank in the off-chip memory

$L_{\text{off-chip}}$: typical latency of one off-chip memory request, excluding queuing delays

$E_{\text{off-chip}} = N_{\text{off-chip}} \times L_{\text{off-chip}}$: total expected queuing delay for off-chip

$N_{\text{DRAM\$}}$: number of requests already waiting for the same bank in the DRAM cache

$L_{\text{DRAM\$}}$: typical latency of one DRAM cache request, excluding queuing delays

$E_{\text{DRAM\$}} = N_{\text{DRAM\$}} \times L_{\text{DRAM\$}}$: total expected queuing delay for the DRAM cache

If $E_{\text{off-chip}} < E_{\text{DRAM\$}}$: send the request to off-chip memory

If $E_{\text{off-chip}} \ge E_{\text{DRAM\$}}$: send the request to the DRAM$
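The decision rule above is simple enough to sketch in a few lines. The following is a minimal illustration, not the paper's hardware implementation; the latency constants and queue counts are assumed inputs from the memory controller:

    # Minimal sketch of Self-Balancing Dispatch (SBD). The bank queue
    # occupancies and per-request latencies are assumed inputs from the
    # memory controller; the names and values are illustrative.

    L_OFF_CHIP = 120    # typical off-chip service latency (cycles), assumed
    L_DRAM_CACHE = 60   # typical DRAM$ service latency (cycles), assumed

    def dispatch_hit_request(n_off_chip: int, n_dram_cache: int) -> str:
        """Steer a clean, predicted-hit request to whichever memory
        source has the lower expected queuing delay E = N x L."""
        e_off_chip = n_off_chip * L_OFF_CHIP        # E_off-chip
        e_dram_cache = n_dram_cache * L_DRAM_CACHE  # E_DRAM$
        # E_off-chip < E_DRAM$ : off-chip wins; otherwise use the DRAM$.
        return "off-chip" if e_off_chip < e_dram_cache else "DRAM$"

    # Example: 3 requests queued at the off-chip bank vs. 8 at the DRAM
    # cache bank gives E = 360 vs. 480, so the request goes off-chip.
    assert dispatch_hit_request(3, 8) == "off-chip"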

A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch

Motivation:
• The cache line tracking structure (MissMap) used to avoid a DRAM cache access on a miss is expensive
• Applying a conventional cache organization to DRAM caches leaves the aggregate system bandwidth under-utilized
• Dirty data in DRAM caches severely restricts the effectiveness of speculative techniques

The 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45)

Jaewoong Sim, Hyesoon Kim, Gabriel H. Loh, Mike O'Connor, Mithuna Thottethodi

[Diagram: a memory request is first checked against the DiRT; its clean/dirty status is the input to the HMP and SBD.]

Our Solution: HMP + SBD + DiRT
• Use a low-cost hit-miss predictor to avoid a DRAM cache access on a miss (HMP)
• Steer hit requests to either the DRAM cache or off-chip memory based on the expected latency of both memory sources (SBD)
• Maintain a mostly-clean DRAM cache via region-based WT/WB to guarantee the cleanliness of a memory request (DiRT)

Hit/Miss phases within a 4KB region

Die-Stacked DRAM Cache Hit-Miss Predictor (HMP)

• Base Predictor (default) => prediction for 4MB regions
• 2nd-Level Table (on tag match) => prediction for 256KB regions
• 3rd-Level Table (on tag match) => prediction for 4KB regions
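A rough software sketch of such a multi-granularity predictor (HMP_MG), assuming 2-bit saturating counters and dictionary-backed tables; the real HMP uses small tagged SRAM tables, so the structures here are stand-ins:

    # Hedged sketch of a multi-granularity hit-miss predictor (HMP_MG).
    # Dict-backed tables and the training policy are assumptions.

    REGION_SIZES = (4 << 20, 256 << 10, 4 << 10)  # 4MB, 256KB, 4KB levels
    CMAX = 3  # 2-bit saturating counter: 0..3; >= 2 predicts "hit"

    class HitMissPredictor:
        def __init__(self):
            # One table per granularity level; the base (4MB) level
            # always answers, finer levels refine it on a tag match.
            self.tables = [dict() for _ in REGION_SIZES]

        def _region(self, addr: int, level: int) -> int:
            return addr // REGION_SIZES[level]

        def predict(self, addr: int) -> bool:
            # Use the finest level that has an entry ("tag match").
            for level in (2, 1):
                ctr = self.tables[level].get(self._region(addr, level))
                if ctr is not None:
                    return ctr >= 2
            # Base predictor: default counter value predicts "miss".
            return self.tables[0].get(self._region(addr, 0), 1) >= 2

        def update(self, addr: int, was_hit: bool):
            # Train every level; counters rise on hits, fall on misses.
            for level in range(len(REGION_SIZES)):
                key = self._region(addr, level)
                ctr = self.tables[level].get(key, 1)
                ctr = min(ctr + 1, CMAX) if was_hit else max(ctr - 1, 0)
                self.tables[level][key] = ctr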

Dirty Region Tracker (DiRT)
• Region-based WT/WB policy

Self-Balancing Dispatch (SBD)
• Steer hit requests to a DRAM cache or off-chip memory based on the expected latency of both memory sources

Putting It All Together

β€’ HMP, SBD, and the DiRT can be accessed in parallel

β€’ Based on the outcomes of the mechanisms, memory requests are sent to either DRAM$ or off-chip DRAM


[Plot (hit/miss phases within a 4KB region): number of lines installed in the cache for a page (y-axis, 0–80) vs. number of accesses to the page (x-axis, 1–199). The curve alternates between a Miss Phase and a Hit Phase: increasing on misses, flat on hits.]

95%+ prediction accuracy at a less-than-1KB storage cost

[Flowchart (Mechanism): START → DiRT: dirty request? YES → DRAM$ queue. NO → HMP: predicted hit? NO → DRAM queue. YES → SBD: E(DRAM$) < E(DRAM)? YES → DRAM$ queue; NO → DRAM queue.]
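Putting the flow into code, a minimal sketch of the combined routing decision; the predicate functions and queue occupancies stand in for the three hardware structures, and the latency constants mirror the SBD sketch above:

    # Sketch of the combined DiRT + HMP + SBD routing, matching the
    # flowchart. All names here are illustrative, not from the paper.

    L_OFF_CHIP = 120    # assumed typical off-chip latency (cycles)
    L_DRAM_CACHE = 60   # assumed typical DRAM$ latency (cycles)

    def route_request(addr, dirt_is_dirty, hmp_predicts_hit,
                      n_dram_cache, n_off_chip):
        if dirt_is_dirty(addr):
            # Page is in the Dirty List: the DRAM$ may hold the only
            # up-to-date copy, so the request must use the DRAM$ queue.
            return "DRAM$ queue"
        if not hmp_predicts_hit(addr):
            # Guaranteed clean and predicted miss: go straight off-chip.
            return "DRAM queue"
        # Guaranteed clean, predicted hit: SBD compares expected delays.
        e_dram_cache = n_dram_cache * L_DRAM_CACHE
        e_off_chip = n_off_chip * L_OFF_CHIP
        return "DRAM queue" if e_off_chip < e_dram_cache else "DRAM$ queue"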

[Bar chart: speedup over no DRAM cache (0.8–1.6) for WL-1 through WL-10 and AVG., comparing MM (MissMap), HMP, HMP + DiRT, and HMP + DiRT + SBD.]

Algorithm: Self-Balancing Dispatch
[Diagram: another hit request is steered to either the stacked DRAM$ or the off-chip DRAM, each with its own request buffer.]

[Chart: 0–100% of requests for WL-1 through WL-10; series: DiRT, CLEAN.]

[Diagram: region-based WT/WB over 4KB pages (or segments). Example per-page write counts: 2, 0, 0, 9, 0, 8, 0, 1. The write-intensive pages (counts 9 and 8) use Write-Back; the rest use Write-Through, so their lines are guaranteed clean ("Clean!").]

• DiRT = Counting Bloom Filters (CBFs) + Dirty List
• CBFs: track the number of writes to different pages
• Dirty List: records the most write-intensive pages

[Diagram: a memory request's TAG is hashed into the Counting Bloom Filters; sufficiently write-intensive pages are installed in the Dirty List, a set-associative table (SET A, SET B) of TAG + NRU entries.]

Write-back policy is applied to the pages in the Dirty List
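A rough sketch of the DiRT's operation follows. The CBF size, hash count, and write threshold are assumptions, and an unbounded set stands in for the small NRU-managed Dirty List, to keep the sketch short:

    # Hedged sketch of the Dirty Region Tracker (DiRT).
    import hashlib

    CBF_SIZE = 4096      # counters per Bloom filter (assumption)
    NUM_HASHES = 3       # hash functions per CBF (assumption)
    WRITE_THRESHOLD = 4  # writes before a page counts as write-intensive

    class DirtyRegionTracker:
        def __init__(self):
            self.cbf = [0] * CBF_SIZE
            self.dirty_list = set()  # pages using write-back

        def _indexes(self, page: int):
            # Derive NUM_HASHES independent indexes by salting the hash.
            for i in range(NUM_HASHES):
                h = hashlib.blake2b(f"{i}:{page}".encode()).digest()
                yield int.from_bytes(h[:4], "little") % CBF_SIZE

        def on_write(self, addr: int):
            page = addr >> 12  # 4KB page number
            # Count the write in the CBF; the page's estimated write
            # count is the minimum of its counters (standard CBF query).
            counts = []
            for idx in self._indexes(page):
                self.cbf[idx] += 1
                counts.append(self.cbf[idx])
            if min(counts) >= WRITE_THRESHOLD:
                self.dirty_list.add(page)  # switch page to write-back

        def is_dirty(self, addr: int) -> bool:
            # Only Dirty List pages may hold dirty lines in the DRAM$;
            # all other pages are write-through and guaranteed clean.
            return (addr >> 12) in self.dirty_list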

[Chart: percentage of writebacks to DRAM (0–100%) for WL-1 through WL-10, comparing DiRT, WB, and WT.]

[Diagram: DRAM bank with row decoder and sense amplifier. A 2KB row holds 32 blocks of 64B each: 3 tag blocks and 29 data blocks.]

MissMap: 4MB for a 1GB DRAM$! Where to architect this?
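A back-of-envelope check of that size claim, assuming one MissMap entry per 4KB segment holding a ~36-bit tag plus a 64-bit presence vector (one bit per 64B line); the exact entry layout is an assumption:

    # Rough sizing of a MissMap for a 1GB DRAM$.
    cache_bytes = 1 << 30                      # 1GB DRAM$
    segment_bytes = 4 << 10                    # 4KB segments
    lines_per_segment = segment_bytes // 64    # 64 presence bits
    entry_bits = lines_per_segment + 36        # vector + ~36-bit tag
    segments_tracked = cache_bytes // segment_bytes  # 256K entries
    total_mb = segments_tracked * entry_bits / 8 / (1 << 20)
    print(f"{total_mb:.1f} MB")                # ~3.1 MB of SRAM, i.e. several MB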

Should we send this to the DRAM cache?

[Chart: request breakdown (0–100%) for WL-1 through WL-10; categories: PH (predicted hit, to DRAM$), PH (predicted hit, to DRAM), and Predicted Miss.]

[Image: die-stacked DRAM (same tech/logic in the DRAM stack) on top of a processor die. Credit: IBM.]


Hundreds of MBs of on-chip stacked DRAM!
• Two main uses:
  • Use it as main memory
  • Use it as a large cache (DRAM cache)
This work is about the DRAM cache!

HMP_MG: Multi-Granular Hit-Miss Predictor

DRAM Cache Organization - Loh and Hill [MICRO'11]
• MissMap: provides the cache-line existence info!

HMP_region: Region-Based Hit-Miss Predictor
• Break up the memory space into coarse-grained regions (e.g., 4KB)
• Index into the HMP_region table with a hash of the region's base address
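A small illustrative sketch of that indexing step; the table size and XOR-fold hash are assumptions, not the paper's exact design:

    # Index into the HMP_region table from a 4KB region's base address.
    TABLE_ENTRIES = 256  # small table => less-than-1KB predictor storage

    def hmp_region_index(addr: int) -> int:
        region_base = addr & ~0xFFF       # base of the 4KB region
        region_num = region_base >> 12
        # Cheap XOR-fold hash of the region number (assumed hash).
        h = region_num ^ (region_num >> 8) ^ (region_num >> 16)
        return h % TABLE_ENTRIES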
