Region-Centric Memory Design

Page 1: Region-Centric Memory Design

Region-Centric Memory Design

AENAO Research Group

Patrick Akl, M.A.Sc.

Ioana Burcea, Ph.D. C.

Myrto Papadopoulou, M.A.Sc. C.

Elham Safi, Ph.D. C.

Jason Zebchuk, M.A.Sc. C.

Andreas Moshovos

{pakl, ioana, myrto, elham, zebchuk, moshovos}@eecg.toronto.edu

Page 2: Region-Centric Memory Design

Sept. 11, 2007, JJPAR, Aenao Group/Toronto

Future On-Chip Caches: Just Larger?

[Figure: three CPUs, each with its own I$ and D$, attached through an interconnect to main memory]

Observe and Exploit Memory Access Behavior at a Coarse Grain

10s – 100s of MB

Page 3: Region-Centric Memory Design

Conventional Block-Centric Memory Hierarchy

Conventional Fine-Grain Tracking

“Small” Blocks Performance and Bandwidth

Several optimizations exist

Big picture is lost

Page 4: Region-Centric Memory Design

“Big Picture” View

Region: 2^n-sized, aligned memory area

Concept already in use: TLBs

Patterns emerge in space / time

Exploit for performance & power

Expose to software

Supplemental Coarse-Grain Tracking

Page 5: Region-Centric Memory Design

This Presentation

Examples of Coarse-Grain Optimizations: Snoop Coherence; Thread-Level Speculation Disambiguation

Region-Centric Memory Design: RegionTracker Cache; Snoop Coherence Revisited

Current Activities: Coherence Delegation; Predictor Virtualization

Page 6: Region-Centric Memory Design

An Example: Snoop Coherence

Conventional considerations: complexity and correctness, NOT power/bandwidth

Can we (1) reduce power/bandwidth and (2) still leverage snoop coherence?

Snoop coherence remains attractive: simple / design re-use

Yes: exploit program behavior to dynamically identify requests that do not need snooping

[Figure: three CPUs, each with its own I$ and D$, attached through an interconnect to main memory]

Page 7: Region-Centric Memory Design

Coherence Basics

Given a request for memory block X (address), detect where the current value resides

[Figure: three CPUs above main memory; the request for X is snooped at the other CPUs and hits in one of them]

Page 8: Region-Centric Memory Design

Conventional Coherence not Power-Aware/Bandwidth-Effective

All L2 tags see all accesses

Perf. & complexity: we have the L2 tags, so why not use them?

Power: all L2 tags consume power on all accesses

Bandwidth: broadcast all coherent requests

[Figure: three CPUs with L2 caches above main memory; a request misses in both remote L2s]

Page 9: Region-Centric Memory Design

RegionScout Motivation: Sharing is Coarse

Region: large contiguous memory area, power-of-2 size

CPU X asks for data block in region R

1. No one else has X

2. No one else has any block in R

RegionScout exploits this behavior

Layered extension over snoop coherence

[Figure: typical memory space snapshot, addresses colored by owner(s)]

Page 10: Region-Centric Memory Design

Optimization Opportunities

Power and bandwidth:

Originating node: avoid asking others

Remote node: avoid tag lookup

[Figure: three CPUs, each with its own I$ and D$, connected through a switch to memory]

Page 11: Region-Centric Memory Design

Potential: Region Miss Frequency

[Figure: global region misses as a percentage of all requests (0% to 100%) versus region size (256 to 16K); series: p4.512K, p4.1M, p8.512K, p8.1M; an arrow marks the better direction]

Even with a 16K region, ~45% of requests miss in all remote nodes

Page 12: Region-Centric Memory Design

RegionScout at Work: Non-Shared Region Discovery

First request detects a non-shared region

[Figure: three CPUs above main memory, steps 1 to 3; both remote CPUs report a Region Miss, giving a Global Region Miss]

Record: Non-Shared Regions

Record: Locally Cached Regions

Page 13: Region-Centric Memory Design

RegionScout at Work: Avoiding Snoops

Subsequent request avoids snoops

[Figure: three CPUs above main memory, steps 1 and 2; the recorded Global Region Miss lets the request skip the other CPUs]

Record: Non-Shared Regions

Record: Locally Cached Regions

Page 14: Region-Centric Memory Design

RegionScout is Self-Correcting

Request from another node invalidates non-shared record

[Figure: three CPUs above main memory, steps 1 and 2]

Record: Non-Shared Regions

Record: Locally Cached Regions
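The three RegionScout slides (discovery, snoop avoidance, self-correction) can be sketched as a small simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Node` class, its method names, the 16K region size, and the use of exact Python sets in place of the hardware tables are all hypothetical, and coherence states are not modeled.

```python
REGION_BITS = 14  # assumed 16K regions, only for this example


def region_of(addr):
    """Region number of an address (aligned, power-of-two regions)."""
    return addr >> REGION_BITS


class Node:
    """Hypothetical node: exact sets stand in for the hardware tables."""

    def __init__(self):
        self.cached_blocks = set()   # block addresses held locally
        self.non_shared = set()      # regions discovered to be non-shared

    def has_region(self, region):
        return any(region_of(b) == region for b in self.cached_blocks)

    def snoop(self, addr):
        """Remote side: drop any non-shared record, report region hit/miss."""
        region = region_of(addr)
        self.non_shared.discard(region)   # self-correction
        return self.has_region(region)

    def request(self, addr, others):
        """Originating side: returns True if a broadcast was needed."""
        region = region_of(addr)
        if region in self.non_shared:
            self.cached_blocks.add(addr)  # go straight to memory, no snoops
            return False
        hits = [other.snoop(addr) for other in others]
        if not any(hits):                 # global region miss
            self.non_shared.add(region)
        self.cached_blocks.add(addr)
        return True


a, b = Node(), Node()
first = a.request(0x10000, [b])    # broadcast; discovers a non-shared region
second = a.request(0x10040, [b])   # same region: broadcast avoided
third = b.request(0x10080, [a])    # b's broadcast invalidates a's record
fourth = a.request(0x100C0, [b])   # a has to broadcast again
```

In this run only the second request skips the broadcast; once `b` touches the region, `a` falls back to ordinary snooping.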

Page 15: Region-Centric Memory Design

Implementation: Requirements

Requesting node provides the address:

At originating node, from CPU: have I discovered that this region is not shared?

At remote nodes, from interconnect: do I have a block in the region?

[Figure: the CPU address splits into a Region Tag and an offset of lg(Region Size) bits]
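The address split on this slide can be written out directly. A minimal sketch; the function name `split_address` and the example address are illustrative and do not come from the slides.

```python
def split_address(addr, region_size):
    """Split an address into (region tag, offset) for a power-of-two region size."""
    if region_size <= 0 or region_size & (region_size - 1):
        raise ValueError("region size must be a power of two")
    offset_bits = region_size.bit_length() - 1   # lg(Region Size)
    return addr >> offset_bits, addr & (region_size - 1)


# 16K regions: the low 14 bits are the offset, the rest is the region tag.
tag, offset = split_address(0x12345678, 16 * 1024)
```

Because regions are aligned and power-of-two sized, both nodes can derive the region tag from the request address with a shift, without any extra information on the interconnect.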

Page 16: Region-Centric Memory Design

Remembering Non-Shared Regions

Records non-shared regions

Lookup by Region portion prior to issuing a request

Snoop requests and invalidate

[Figure: the Region Tag portion of the address indexes the Non-Shared Region Table; each entry has a valid bit]

Few entries: 16x4 in most experiments
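A table like this can be sketched as a small set-associative structure. The sketch below assumes that "16x4" means 16 sets of 4 ways and that replacement is LRU; neither is stated on the slide, and the class and method names are hypothetical.

```python
from collections import OrderedDict


class NonSharedRegionTable:
    """Sketch of a small set-associative table of non-shared region tags."""

    def __init__(self, num_sets=16, ways=4):
        self.num_sets = num_sets
        self.ways = ways
        # One ordered dict per set; insertion order doubles as LRU order.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def _set_for(self, region_tag):
        return self.sets[region_tag % self.num_sets]

    def record(self, region_tag):
        """Remember that a region was found to be non-shared."""
        entries = self._set_for(region_tag)
        entries.pop(region_tag, None)
        if len(entries) >= self.ways:
            entries.popitem(last=False)   # evict the least recently used tag
        entries[region_tag] = True        # presence plays the role of the valid bit

    def is_non_shared(self, region_tag):
        """Lookup before issuing a request; a hit means no snoop is needed."""
        entries = self._set_for(region_tag)
        if region_tag in entries:
            entries.move_to_end(region_tag)
            return True
        return False

    def invalidate(self, region_tag):
        """Called when a snooped request from another node touches the region."""
        self._set_for(region_tag).pop(region_tag, None)
```

Losing an entry to replacement is harmless in this scheme: the next request to that region is simply broadcast again.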

Page 17: Region-Centric Memory Design

What Regions are Locally Cached?

If we had as many counters as regions:

Block allocation: counter[region]++

Block eviction: counter[region]--

Region cached only if counter[region] is non-zero

Not practical: e.g., 16K regions and 4G memory would need 256K counters

[Figure: the Region Tag portion of the address selects a counter]

Page 18: Region-Centric Memory Design

What Regions are Locally Cached?

[Figure: the Region Tag passes through hash() to select a counter]

Imprecise: records a superset of locally cached regions

False positives: lost opportunity, correctness preserved

Small: e.g., 256 entries for a 1M cache

Power-Optimized structures described in the paper
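The hashed-counter idea can be sketched in a few lines. This is an illustration only: the class name, the modulo hash, and the error handling are assumptions, and the power-optimized structures mentioned above are not modeled.

```python
class HashedRegionCounters:
    """Sketch: hashed counters recording a superset of locally cached regions."""

    def __init__(self, entries=256):
        self.entries = entries
        self.counters = [0] * entries

    def _index(self, region_tag):
        # Assumed hash: low-order bits of the region tag.
        return region_tag % self.entries

    def block_allocated(self, region_tag):
        self.counters[self._index(region_tag)] += 1

    def block_evicted(self, region_tag):
        index = self._index(region_tag)
        if self.counters[index] == 0:
            raise ValueError("eviction without a matching allocation")
        self.counters[index] -= 1

    def may_be_cached(self, region_tag):
        """False means definitely not cached; True may be a false positive."""
        return self.counters[self._index(region_tag)] != 0
```

Two regions that hash to the same counter alias: while one is cached, the other also reports "may be cached". That costs a filtering opportunity but never hides a cached block, which is why correctness is preserved.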

Page 19: Region-Centric Memory Design

LFSR-Based Implementation

[Figure: the Region Tag passes through hash() to select an LFSR, followed by a zero detector]

Linear-feedback shift register array

Operations: increment / decrement / is zero?

130nm commercial technology (ISLPED ’06)

Faster: 1.6x to 3.7x

More energy efficient: 1.4x to 2.3x

But area: 3.2x
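To make the three operations concrete, here is a minimal sketch of a single LFSR used as an up/down counter. The 4-bit width, the feedback polynomial x^4 + x^3 + 1, and the choice of 0b0001 as the "zero" state are assumptions picked for illustration; the slide does not give the actual parameters.

```python
class LfsrCounter:
    """4-bit LFSR used as an up/down counter (15 distinct states)."""

    ZERO_STATE = 0b0001   # assumed encoding of the count "zero"

    def __init__(self):
        self.state = self.ZERO_STATE

    def increment(self):
        # Polynomial x^4 + x^3 + 1: XOR the two top bits, then shift left.
        feedback = ((self.state >> 3) ^ (self.state >> 2)) & 1
        self.state = ((self.state << 1) | feedback) & 0xF

    def decrement(self):
        # Exact inverse of increment: recover the bit that was shifted out.
        top = ((self.state >> 3) ^ self.state) & 1
        self.state = (self.state >> 1) | (top << 3)

    def is_zero(self):
        return self.state == self.ZERO_STATE


counter = LfsrCounter()
for _ in range(5):
    counter.increment()   # five block allocations
for _ in range(5):
    counter.decrement()   # five block evictions
```

The counter never needs the numeric count, only whether it is back at the starting state, so a shift and one XOR can stand in for a binary adder's carry chain.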

Page 20: Region-Centric Memory Design

Filter Rates: SPLASH-II

[Figure: identified global region misses (0% to 100%) versus CRH size (256 to 2K); series: p4.512K.R4K, p4.512K.R16K, p8.512K.R4K, p8.512K.R16K; an arrow marks the better direction]

Jason Cantin @ Wisconsin studied commercial workloads: 40% filter rate

Page 21: Region-Centric Memory Design

Region-Centric Disambiguation

Joint work w/

Greg Steffan and Mihai Burcea

Patrick Akl

Andreas Moshovos

Page 22: Region-Centric Memory Design

Speculative Parallelization Models

Thread-level speculation

Transactional memory

[Figure: original code versus speculative parallelization, with a time axis. Good scenario: write a, read b. Bad scenario: write a, read a.]

Need to Compare Addresses Across Code Pieces

Page 23: Region-Centric Memory Design

Ex #2: Region-Centric Disambiguation

Send digest at region level

On a region conflict: send block-level info

Reduced bandwidth, potential for performance and power

[Figure: memory space touched by Task 1 and Task 2, conventional versus region-centric]
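The two-step exchange can be sketched as follows. This is a simplified illustration: the function names, the 256-byte region size, and the traffic measure (one unit per region or block address sent) are assumptions, not the evaluated design.

```python
REGION_SIZE = 256   # assumed region size in bytes, for illustration only


def regions(addresses):
    """Region-level digest: the set of regions touched by the addresses."""
    return {addr // REGION_SIZE for addr in addresses}


def disambiguate(writes, reads):
    """Return (conflicting block addresses, number of items sent)."""
    digest = regions(writes)
    conflict_regions = digest & regions(reads)
    sent = len(digest)                       # step 1: region-level digest
    # Step 2: block-level info only for regions that conflicted.
    detail = {a for a in writes if a // REGION_SIZE in conflict_regions}
    sent += len(detail)
    return detail & set(reads), sent


# Six written blocks in two regions, none of them read by the other task:
# the digest alone (2 items) settles it, versus 6 block addresses.
no_conflict = disambiguate({0, 64, 128, 192, 1024, 1088}, {4096, 4160})

# A true conflict: the digest (1 item) is followed by the 4 block addresses.
conflict = disambiguate({0, 64, 128, 192}, {64})
```

The second case shows the trade-off: when regions do conflict, the digest is extra traffic on top of the block-level exchange, so the saving depends on how often tasks touch disjoint regions.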

Page 24: Region-Centric Memory Design

How Much Traffic Can We Save?

TLS benchmarks from STAMPEDE group (G. Steffan)

Approximate timing model

Potential for traffic reduction by 38%

[Figure: Disambiguation Traffic Reduction with TLS; traffic ratio (0 to 1) versus region size (32 to 1024 bytes); an arrow marks the better direction]

Page 25: Region-Centric Memory Design

Exploiting Region-Level Information

Region Coherence Arrays: Cantin, Lipasti and Smith

RegionScout: our work

Both of these reduce snoop lookups (and broadcasts) in snoop coherence protocols

Spatial Memory Prefetching: Impetus Group at CMU

Leverages spatial memory patterns for prefetching with commercial workloads

Stealth Prefetching: Cantin, Lipasti and Smith

Page 26: Region-Centric Memory Design

Coarse-Grain Techniques Today

Overhead

Storage: e.g., 60% of tags

Functionality: restrict placement, region evictions

Loss of information

Hard to justify for a commercial design

[Figure: conventional cache (TAGS, DATA) serving a CPU with I$ and D$, plus a separate auxiliary tracking structure]

Page 27: Region-Centric Memory Design

Rethinking Cache Design

Can we provide a common substrate for all these optimizations?

Redesign caches: regions as a first-class citizen

RegionTracker Cache

[Figure: CPU with I$ and D$ above a cache with dual-grain TAGS, embedded tracking, and DATA]

Page 28: Region-Centric Memory Design

RegionTracker Cache

Goals:

Expose region behavior: Is region X cached? Which blocks are?

Facilitate management at the region level: evict/migrate region X; do something with all blocks in X

Constraints:

Data movement only at the block level

No increase in area

No decrease in performance

Complexity

Associativity
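The region-level queries listed under the goals can be illustrated with a small software model. This sketch shows only the interface (is a region cached, which blocks, evict a region); it does not model the RegionTracker hardware organization, and the class name, region size, and block size are hypothetical.

```python
class RegionDirectory:
    """Sketch: per-region presence bits answering region-level queries."""

    def __init__(self, region_size=1024, block_size=64):
        self.region_size = region_size
        self.block_size = block_size
        self.present = {}   # region tag -> bitmask of cached blocks

    def _locate(self, addr):
        region = addr // self.region_size
        block = (addr % self.region_size) // self.block_size
        return region, block

    def allocate(self, addr):
        """Data still moves one block at a time."""
        region, block = self._locate(addr)
        self.present[region] = self.present.get(region, 0) | (1 << block)

    def is_region_cached(self, region):
        return region in self.present

    def cached_blocks(self, region):
        mask = self.present.get(region, 0)
        blocks_per_region = self.region_size // self.block_size
        return [i for i in range(blocks_per_region) if (mask >> i) & 1]

    def evict_region(self, region):
        """Drop the whole region; returns the block indices that were evicted."""
        blocks = self.cached_blocks(region)
        self.present.pop(region, None)
        return blocks


directory = RegionDirectory()
directory.allocate(0x400)   # region 1, block 0
directory.allocate(0x4C0)   # region 1, block 3
```

With this state, `directory.cached_blocks(1)` answers "which blocks are cached?" for region 1 in a single lookup rather than one tag probe per block.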

Page 29: Region-Centric Memory Design

Region-Based Caches

Start with a conventional 16-way cache and replace the tag array

Sector Caches: hit rate suffers (20% loss)

Sector Pool Caches: high associativity (48-way to match a 16-way cache)

Decoupled-Sector Caches: no coarse-grain info; replacements require searching

No previous design is adequate

RegionTracker: meets all requirements, but does not save as many tag resources

Page 30: Region-Centric Memory Design

Sector Cache

Reduced area and power

Increased miss rates (2.5% - 96% for 1kB sectors)

Replacement?

[Figure: D-way Region Tags (RVA) in front of a D-way Data Array]

Page 31: Region-Centric Memory Design

Sector Pool Cache

M > D

Requires a highly associative cache to achieve the same performance as RegionTracker (~48-way)

[Figure: M-way Region Tags (RVA) in front of a D-way Data Array; labels 1, D, SR]

Page 32: Region-Centric Memory Design

Decoupled-Sectored Cache

Has multiple block evictions

Requires scanning “status” array

No simple mechanism to avoid this

Does NOT expose region-level information

Page 33: Region-Centric Memory Design

RegionTracker

In practice L <= D

Decouple data and lookup organizations

Lower-associativity lookups with no hit-rate penalty

RegionTracker provides a complete solution

[Figure: L-way Region Tags (RVA) in front of a D-way Data Array; labels 1, D, SR]

Page 34: Region-Centric Memory Design

RegionTracker Cache

[Figure: four L1 caches and the RegionTracker structures: RVA, ERB, BST, Data Array]

Block and region lookups

Region tag + way per block

Evict region blocks lazily

Simplify replacement and reduce area

Status per block + RVA set backpointer

Can be banked and partitioned

Page 35: Region-Centric Memory Design

Region-Aware Cache: Performance vs. Area

Commercial workloads: DB2, Oracle, TPC-C and TPC-H, Apache, Zeus

SimICS + SimFlex, sampling, 2K regions

[Figure: Miss Rate vs. Size for 128MB Cache Designs; relative miss rate (1 to 1.1) versus relative tag array size (0 to 1); series: Sector Cache, Sector Pool, RA+, sqrt(2) Rule; an arrow marks the better direction]

Page 36: Region-Centric Memory Design

RegionTracker-RegionScout

One bit per Region tag: known to be not shared

1KB regions, commercial workloads

512KB L2 private caches

Filter 41% of snoops at “zero cost” compared to a conventional cache

[Figure: reduction in broadcasts (0% to 60%) for RS-1K CRH, RS-2K CRH, RS-4K CRH, and RSRT; BlockScout is also marked; an arrow marks the better direction]

Page 37: Region-Centric Memory Design

Directory Optimizations: Base Architecture

[Figure: Core, L2 Tags, L2 Data, Directory, L3 Tags, L3 Data, DRAM]

Page 38: Region-Centric Memory Design

Coherence Delegation

Eliminate 3-hop overhead

Attract directory tracking to nodes

[Figure: requesting node, directory lookup, and remote L2 containing the data, with the ideal path marked]

Page 39: Region-Centric Memory Design

Predictor Virtualization

[Figure: multiple CPUs, each with L1-D and L1-I caches, above an L2, an interconnect, and main memory]

Optimization Engines: Predictors

Page 40: Region-Centric Memory Design

Motivating Trends

Chip multiprocessors: space dedicated to predictors × #processors

Larger predictor table: increased performance

Memory hierarchies: increased capacities

Use conventional memory hierarchies to store predictor information

Page 41: Region-Centric Memory Design

PV Architecture

[Figure: Optimization Engine and Predictor Table; labels: entry, index, prediction]

Page 42: Region-Centric Memory Design

PV Architecture

[Figure: Optimization Engine with labels entry, index, prediction; Predictor Virtualization takes the place of the predictor table]

Page 43: Region-Centric Memory Design

PV Architecture

[Figure: Optimization Engine (entry, index, prediction); PVProxy containing the PVCache and MSHR, with index + PVStart forming the address; L2; main memory holding the PVTable]
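The lookup path in this figure can be sketched in software. The sketch takes `index + PVStart` from the slide; everything else is an assumption made for illustration: the 64-byte entry size that scales the index, the LRU replacement, the write-back on eviction, and the dict that stands in for the L2 and main memory. The MSHR is not modeled.

```python
from collections import OrderedDict


class PVProxy:
    """Sketch: a predictor table kept in memory, fronted by a tiny cache."""

    ENTRY_BYTES = 64   # assumed size of one predictor entry in memory

    def __init__(self, pv_start, memory, cache_entries=8):
        self.pv_start = pv_start          # base address of the PVTable
        self.memory = memory              # dict standing in for L2 + main memory
        self.cache_entries = cache_entries
        self.pv_cache = OrderedDict()     # address -> entry, in LRU order
        self.misses = 0

    def address_of(self, index):
        return self.pv_start + index * self.ENTRY_BYTES

    def lookup(self, index):
        """Return the prediction entry for a predictor index."""
        addr = self.address_of(index)
        if addr in self.pv_cache:
            self.pv_cache.move_to_end(addr)
            return self.pv_cache[addr]
        self.misses += 1
        entry = self.memory.get(addr)     # fetched through the memory hierarchy
        self._fill(addr, entry)
        return entry

    def update(self, index, entry):
        self._fill(self.address_of(index), entry)

    def _fill(self, addr, entry):
        self.pv_cache[addr] = entry
        self.pv_cache.move_to_end(addr)
        if len(self.pv_cache) > self.cache_entries:
            old_addr, old_entry = self.pv_cache.popitem(last=False)
            self.memory[old_addr] = old_entry   # write back on eviction
```

The optimization engine still sees a table indexed by `index`; only the small cache is dedicated storage, and the rest of the table competes for space in the regular memory hierarchy.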

Page 44: Region-Centric Memory Design

Virtualized Spatial Memory Streaming

[Figure: percentage speedup (-10 to 80) for Apache, Zeus, DB2, Oracle, Qry 1, Qry 2, Qry 16, Qry 17; series: SMS - 1K sets, SMS - 8 sets, SMS - PVCache 8 sets]

Original prefetcher: cost 80KB

Virtualized prefetcher: cost <1KB

Nearly identical performance

Page 45: Region-Centric Memory Design

Region-Centric Memory Design

AENAO Research Group

Patrick Akl, M.A.Sc. C.

Ioana Burcea, Ph.D. C.

Myrto Papadopoulou, M.A.Sc. C.

Elham Safi, Ph.D. C.

Jason Zebchuk, M.A.Sc. C.

Andreas Moshovos

{pakl, ioana, myrto, elham, zebchuk, moshovos}@eecg.toronto.edu

Page 46: Region-Centric Memory Design

Summary

Caches are getting larger: time to look at the “big picture”

Region-Centric Memory Design: expose region-level info; allow management at the region level

RegionScout: eliminate broadcasts for snoop coherence

Region-Centric Disambiguation: reduce bandwidth for TLS or TM

Region-Aware Memory: “same” area and performance as conventional + region info

Predictor Virtualization