Region-Centric Memory Design

AENAO Research Group

Patrick Akl, M.A.Sc.

Ioana Burcea, Ph.D. C.

Myrto Papadopoulou, M.A.Sc. C.

Elham Safi, Ph.D. C.

Jason Zebchuk, M.A.Sc. C.

Andreas Moshovos

{pakl, ioana, myrto, elham, zebchuk, moshovos}@eecg.toronto.edu

Sept. 11, 2007, JJPAR

Aenao Group/Toronto

Future On-Chip Caches: Just Larger?

[Figure: three CPUs, each with I$ and D$, connected by an interconnect to main memory; on-chip caches of 10s – 100s of MB]

Observe and Exploit Memory Access Behavior at a Coarse Grain


Conventional Block-Centric Memory Hierarchy

Conventional Fine-Grain Tracking

“Small” Blocks Performance and Bandwidth

Several optimizations exist

Big picture is lost


“Big Picture” View

Region: 2^n-sized, aligned memory area
  Concept already in use: TLBs

Patterns Emerge in Space / Time
  Exploit for performance & power
  Expose to software

Supplemental Coarse-Grain Tracking


This Presentation

Examples of Coarse-Grain Optimizations
  Snoop Coherence
  Thread-level speculation disambiguation

Region-Centric Memory Design
  RegionTracker Cache
  Snoop Coherence Revisited

Current Activities
  Coherence Delegation
  Predictor Virtualization


An Example: Snoop Coherence

Conventional Considerations: Complexity and Correctness, NOT Power/Bandwidth

Can we: (1) Reduce Power/bandwidth (2) Leverage snoop coherence?
  Remains Attractive: Simple / Design Re-use

Yes: Exploit Program Behavior to Dynamically Identify Requests that do not Need Snooping

[Figure: three CPUs, each with I$ and D$, connected by an interconnect to main memory]


Coherence Basics

Given request for memory block X (address)
  Detect where current value resides

[Figure: three CPUs and main memory; the request for X is snooped by the other CPUs and hits in one of them]


Conventional Coherence not Power-Aware/Bandwidth-Effective

All L2 tags see all accesses

Perf. & Complexity: Have L2 tags, why not use them

Power: All L2 tags consume power on all accesses

Bandwidth: broadcast all coherent requests

[Figure: three CPUs with L2 caches and main memory; a request misses in both remote L2s]


RegionScout Motivation: Sharing is Coarse

Region: large contiguous memory area, power-of-2 size

CPU X asks for data block in region R

1. No one else has X

2. No one else has any block in R

RegionScout Exploits this Behavior

Layered Extension over Snoop Coherence

[Figure: typical memory space snapshot, colored by owner(s); axis: addresses]


Optimization Opportunities

Power and Bandwidth
  Originating node: avoid asking others
  Remote node: avoid tag lookup

[Figure: three CPUs, each with I$ and D$, connected through a switch to memory]


Potential: Region Miss Frequency

[Figure: global region misses (% of all requests, 0% to 100%) vs. region size (256 to 16K); series: p4.512K, p4.1M, p8.512K, p8.1M]

Even with a 16K region, ~45% of requests miss in all remote nodes


RegionScout at Work: Non-Shared Region Discovery

First request detects a non-shared region

[Figure: three CPUs and main memory, steps 1 to 3; the request sees a Region Miss at each remote node, which together form a Global Region Miss]

Record: Non-Shared Regions

Record: Locally Cached Regions


RegionScout at Work: Avoiding Snoops

Subsequent request avoids snoops

[Figure: three CPUs and main memory, steps 1 and 2; the region is already recorded as a Global Region Miss, so the request is not snooped]



RegionScout is Self-Correcting

Request from another node invalidates non-shared record

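[Figure: three CPUs and main memory, steps 1 and 2; a request from another node is snooped and the non-shared record is invalidated]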
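The three "RegionScout at Work" slides can be condensed into a small behavioral model. The sketch below is a simplification under stated assumptions: the names `Node`, `Bus`, and `request` are hypothetical, both records are exact Python sets rather than small hardware tables, the region size is fixed at 4 KB, and block evictions are not modeled.

```python
REGION_SIZE = 4096  # bytes; assumed power-of-two region size


def region_of(addr):
    """Region number that a byte address falls in."""
    return addr // REGION_SIZE


class Node:
    def __init__(self):
        self.cached_regions = set()  # record: locally cached regions
        self.non_shared = set()      # record: regions found to be non-shared

    def snoop(self, region):
        """Remote side of a broadcast: self-correct, then report a region hit."""
        self.non_shared.discard(region)
        return region in self.cached_regions


class Bus:
    def __init__(self, n):
        self.nodes = [Node() for _ in range(n)]
        self.broadcasts = 0

    def request(self, node_id, addr):
        """Issue a block request; return True if it had to be broadcast."""
        me, region = self.nodes[node_id], region_of(addr)
        me.cached_regions.add(region)
        if region in me.non_shared:
            return False                       # snoop avoided
        self.broadcasts += 1
        # Build the full list so every remote node really sees the snoop.
        hits = [n.snoop(region) for n in self.nodes if n is not me]
        if not any(hits):
            me.non_shared.add(region)          # global region miss
        return True


bus = Bus(3)
first = bus.request(0, 0x1000)   # broadcast; global region miss is recorded
second = bus.request(0, 0x1040)  # same region: no broadcast needed
third = bus.request(1, 0x1080)   # node 1 broadcasts; node 0 drops its record
fourth = bus.request(0, 0x10C0)  # node 0 has to broadcast again
```

In this model a stale non-shared record can never survive a request from another node, because every broadcast clears the record at all remote nodes before they answer.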



Implementation: Requirements

Requesting Node provides address:

At Originating Node – from CPU: Have I discovered that this region is not shared?

At Remote Nodes – from Interconnect: Do I have a block in the region?

[Figure: CPU address split into Region Tag and offset; the offset is lg(Region Size) bits wide]
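As a concrete reading of the address split, the following sketch computes the region tag and offset for a power-of-two region size. The function name `split_address` and the example address are illustrative assumptions, not part of the original slides.

```python
def split_address(addr, region_size):
    """Return (region_tag, offset) for a power-of-two region size."""
    if region_size <= 0 or region_size & (region_size - 1):
        raise ValueError("region size must be a power of two")
    offset_bits = region_size.bit_length() - 1  # lg(region_size)
    return addr >> offset_bits, addr & (region_size - 1)


# Example with a 4 KB region: the low 12 bits form the offset.
tag, offset = split_address(0x12345, 4096)
```

Because regions are aligned and power-of-two sized, the split is a shift and a mask, the same way a TLB separates a page number from a page offset.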


Remembering Non-Shared Regions

Non-Shared Region Table
  Records non-shared regions
  Lookup by Region portion prior to issuing a request
  Snoop requests and invalidate
  Few entries: 16x4 in most experiments

[Figure: address split into Region Tag and offset; the Region Tag indexes a table of valid bits and region tags]
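A minimal sketch of such a table, assuming that "16x4" means 16 sets of 4 ways. The class name, the modulo set index, and the FIFO replacement policy are assumptions made for illustration; the slides do not specify them.

```python
class NonSharedRegionTable:
    """Small set-associative table of region tags known to be non-shared."""

    def __init__(self, sets=16, ways=4):
        self.sets, self.ways = sets, ways
        # Each set holds its valid region tags, oldest first.
        self.table = [[] for _ in range(sets)]

    def _set_for(self, region_tag):
        return self.table[region_tag % self.sets]  # assumed index function

    def lookup(self, region_tag):
        """Checked before a request is issued."""
        return region_tag in self._set_for(region_tag)

    def insert(self, region_tag):
        """Called after a global region miss."""
        entries = self._set_for(region_tag)
        if region_tag in entries:
            return
        if len(entries) == self.ways:
            entries.pop(0)  # FIFO replacement: an assumption
        entries.append(region_tag)

    def invalidate(self, region_tag):
        """Called when a snooped request from another node hits the region."""
        entries = self._set_for(region_tag)
        if region_tag in entries:
            entries.remove(region_tag)


nsrt = NonSharedRegionTable()
nsrt.insert(5)
```

Dropping an entry on replacement is always safe in this scheme: the only cost is that a later request to that region is broadcast again.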


What Regions are Locally Cached?

If we had as many counters as regions:
  Block Allocation: counter[region]++
  Block Eviction: counter[region]--
  Region cached only if counter[region] non-zero

Not Practical: e.g., 16K Regions and 4G Memory → 256K counters

[Figure: address split into Region Tag and offset; the Region Tag selects a counter]
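The exact-counter scheme and the size estimate can be checked in a few lines. This is a sketch only: the function names are made up for illustration, and a `Counter` stands in for the full array of 256K hardware counters.

```python
from collections import Counter

REGION_SIZE = 16 * 1024        # 16K regions
MEMORY_SIZE = 4 * 1024 ** 3    # 4G memory

# One counter per region: 2^32 / 2^14 = 2^18 = 256K counters.
counters_needed = MEMORY_SIZE // REGION_SIZE

block_count = Counter()        # region number -> locally cached blocks


def allocate(addr):
    block_count[addr // REGION_SIZE] += 1


def evict(addr):
    block_count[addr // REGION_SIZE] -= 1


def region_cached(addr):
    return block_count[addr // REGION_SIZE] != 0


allocate(0x4000)
allocate(0x4040)
evict(0x4000)
still_cached = region_cached(0x4080)   # one block of the region remains
evict(0x4040)
now_cached = region_cached(0x4080)     # the region is now empty
```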


What Regions are Locally Cached?

[Figure: address split into Region Tag and offset; hash() of the Region Tag selects a counter]

Imprecise: Records a superset of locally cached Regions
  False positives: lost opportunity, correctness preserved

Small: e.g., 256 entries for 1M cache

Power-Optimized structures described in the paper
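A minimal sketch of the hashed-counter idea, assuming a trivial hash (the low bits of the region tag). The class name, method names, and hash function are illustrative assumptions; the power-optimized structures from the paper are not modeled.

```python
class CachedRegionHash:
    """Hashed counters that record a superset of the locally cached regions."""

    def __init__(self, entries=256, region_size=4096):
        self.entries = entries
        self.region_size = region_size
        self.counters = [0] * entries

    def _index(self, addr):
        # Assumed hash: region tag modulo the table size.
        return (addr // self.region_size) % self.entries

    def allocate(self, addr):
        self.counters[self._index(addr)] += 1

    def evict(self, addr):
        self.counters[self._index(addr)] -= 1

    def may_cache_region(self, addr):
        """False means definitely not cached; True may be a false positive."""
        return self.counters[self._index(addr)] != 0


crh = CachedRegionHash(entries=4, region_size=4096)
crh.allocate(0x0000)                          # block in region 0
same_region = crh.may_cache_region(0x0100)    # region 0: truly cached
aliased = crh.may_cache_region(4 * 4096)      # region 4 aliases with region 0
other = crh.may_cache_region(4096)            # region 1: definitely not cached
```

The aliased lookup returns a false positive, which only costs a tag lookup that could have been skipped. A zero counter is never wrong, which is why correctness is preserved.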


LFSR-Based Implementation

[Figure: address split into Region Tag and offset; hash() of the Region Tag selects an LFSR from an array, followed by a zero detector]

Linear-Feedback Shift Register Array
  Increment / Decrement / Is Zero?

130nm commercial technology (ISLPED ’06)
  Faster: 1.6x to 3.7x
  More Energy Efficient: 1.4x to 2.3x
  But Area: 3.2x
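To illustrate how a shift register can act as an up/down counter, here is a sketch of a generic 4-bit maximal-length LFSR with the three operations named on the slide. The width, tap positions, and choice of "zero" state are assumptions for illustration and are not taken from the ISLPED ’06 design.

```python
class LfsrCounter:
    """4-bit LFSR (taps on bits 3 and 2) used as an up/down counter."""

    ZERO = 0b0001  # LFSR state chosen to represent a count of zero
    MASK = 0b1111

    def __init__(self):
        self.state = self.ZERO

    def increment(self):
        # Shift left and insert the feedback bit at the bottom.
        feedback = ((self.state >> 3) ^ (self.state >> 2)) & 1
        self.state = ((self.state << 1) | feedback) & self.MASK

    def decrement(self):
        # Undo one shift: the bit that fell off the top is recovered from
        # the feedback bit (now bit 0) and the old bit 2 (now bit 3).
        top = ((self.state >> 3) ^ self.state) & 1
        self.state = (self.state >> 1) | (top << 3)

    def is_zero(self):
        return self.state == self.ZERO


c = LfsrCounter()
for _ in range(9):
    c.increment()
after_increments = c.is_zero()   # nine blocks allocated
for _ in range(9):
    c.decrement()
after_decrements = c.is_zero()   # all nine evicted again
```

The states are visited in a fixed pseudo-random order rather than in binary order, which is sufficient here because only increment, decrement, and the zero test are needed. This 4-bit example cycles through 15 states, so it wraps around after 15 increments.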


Filter Rates: SPLASH-II

[Figure: identified global region misses (0% to 100%) vs. CRH size (256 to 2K); series: p4.512K.R4K, p4.512K.R16K, p8.512K.R4K, p8.512K.R16K]

Jason Cantin @ Wisconsin studied commercial workloads: 40% filter rate

Region-Centric Disambiguation

Joint work w/ Greg Steffan and Mihai Burcea

Patrick Akl

Andreas Moshovos


Speculative Parallelization Models

Thread level speculation
Transactional Memory

[Figure: original code vs. speculative parallelization over time. Good scenario: one piece writes a, the other reads b. Bad scenario: one piece writes a, the other reads a.]

Need to Compare Addresses Across Code Pieces


Ex #2: Region-Centric Disambiguation

Send digest at region level
On a region conflict: send block-level info

Reduced bandwidth, potential for performance and power

[Figure: memory space accessed by Task 1 and Task 2, conventional vs. region-centric]
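The two-step exchange can be sketched as follows. Everything here is an illustrative assumption: the function names, the use of plain address sets, the 256-byte region size, and the way traffic is counted (one unit per region id or block address sent).

```python
def regions(addrs, region_size):
    """Region-level digest of a set of block addresses."""
    return {a // region_size for a in addrs}


def disambiguate(writes, reads, region_size=256):
    """Check an earlier task's writes against a later task's reads.

    Returns (conflict, traffic), where traffic counts the region ids and
    block addresses that had to be sent.
    """
    write_digest = regions(writes, region_size)
    traffic = len(write_digest)               # step 1: region-level digest
    suspects = write_digest & regions(reads, region_size)
    if not suspects:
        return False, traffic                 # no region conflict: done
    # Step 2: block-level info, only for regions that conflicted.
    detail = {a for a in writes if a // region_size in suspects}
    traffic += len(detail)
    return bool(detail & set(reads)), traffic


# Disjoint regions: 2 region ids are sent instead of 4 block addresses.
no_conflict = disambiguate({0, 64, 128, 4096}, {8192, 8256})
# Same region, different blocks: the block-level step clears the conflict.
false_alarm = disambiguate({0}, {64})
# Same block: a true dependence violation.
real_conflict = disambiguate({0, 64}, {64})
```

The saving comes from the first case, where tasks touch disjoint regions. When regions do overlap, the region digest is extra traffic on top of the block-level exchange.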


How Much Traffic Can We Save?

TLS benchmarks from STAMPEDE group (G. Steffan)
Approximate timing model

Potential for traffic reduction by 38%

[Figure: Disambiguation Traffic Reduction with TLS; traffic ratio (0 to 1) vs. region size (32 to 1024 bytes)]


Exploiting Region-Level Information

Region Coherence Arrays: Cantin, Lipasti and Smith
RegionScout: Our work
  Both of these reduce snoop lookups (and broadcasts) in snoop coherence protocols

Spatial Memory Prefetching: Impetus Group at CMU
  Leverages spatial memory patterns for prefetching with commercial workloads

Stealth Prefetching: Cantin, Lipasti and Smith


Coarse-Grain Techniques Today

Overhead
  Storage: e.g., 60% of tags
  Functionality: Restrict placement, Region Evictions

Loss of Information

Hard to justify for a commercial design

[Figure: CPU with I$ and D$, backed by a conventional cache (TAGS and DATA) with a separate Auxiliary Tracking structure]


Rethinking Cache Design

Can we provide a common substrate for all these optimizations?

Redesign caches: Regions a first class citizen

RegionTracker Cache

[Figure: CPU with I$ and D$, backed by a cache with Dual-Grain TAGS (Embedded Tracking) and DATA]


RegionTracker Cache

Goals
  Expose region behavior
    Is region X cached? Which blocks are?
  Facilitate management at the region level
    Evict/migrate region X
    Do something with all blocks in X

Constraints:
  Data movement only at the block level
  No increase in area
  No decrease in performance
  Complexity
  Associativity
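The goals above amount to a small query interface. The sketch below models only that interface, using a per-region bit vector of cached blocks. It is not the RegionTracker organization itself: the class name, method names, and the 1 KB region and 64-byte block sizes are assumptions for illustration.

```python
class RegionView:
    """Behavioral model of region-level queries over block-level caching."""

    def __init__(self, region_size=1024, block_size=64):
        self.region_size, self.block_size = region_size, block_size
        self.present = {}  # region tag -> bit vector of cached blocks

    def _locate(self, addr):
        region = addr // self.region_size
        block = (addr % self.region_size) // self.block_size
        return region, block

    def fill(self, addr):
        """Data still moves one block at a time."""
        region, block = self._locate(addr)
        self.present[region] = self.present.get(region, 0) | (1 << block)

    def is_region_cached(self, addr):
        return self._locate(addr)[0] in self.present

    def cached_blocks(self, addr):
        """Which blocks of the region containing addr are cached?"""
        bits = self.present.get(self._locate(addr)[0], 0)
        per_region = self.region_size // self.block_size
        return [i for i in range(per_region) if (bits >> i) & 1]

    def evict_region(self, addr):
        """Drop the region; return the block indices that must be evicted."""
        blocks = self.cached_blocks(addr)
        self.present.pop(self._locate(addr)[0], None)
        return blocks


view = RegionView()
view.fill(0x400)   # region 1, block 0
view.fill(0x4C0)   # region 1, block 3
blocks = view.cached_blocks(0x400)
```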


Region-Based Caches

Start with conventional 16-way cache and replace tag array

Sector Caches
  Hit rate suffers: 20% loss

Sector Pool Caches
  High Associativity: 48-way for matching a 16-way cache

Decoupled-Sector Caches
  No coarse-grain info
  Replacements require searching

No previous design is adequate

RegionTracker: Meets all requirements
  But does not save as much tag resources


Sector Cache

Reduced Area and Power
Increased miss-rates (2.5% - 96% for 1kB sectors)
Replacement?

[Figure: sector cache; D-way Region Tags (RVA) alongside a D-way Data Array]


Sector Pool Cache

M > D
Requires a highly associative cache (~48-way) to achieve the same performance as RegionTracker

[Figure: sector pool cache; M-way Region Tags (RVA) alongside a D-way Data Array; other labels: 1, D, SR]


Decoupled-Sectored Cache

Has multiple block evictions
  Requires scanning “status” array
  No simple mechanism to avoid this

Does NOT expose region-level information


RegionTracker

In practice L <= D
Decouple Data and Lookup organizations
Lower Associativity lookups with no hit-rate penalty
RegionTracker provides complete solution

[Figure: RegionTracker; L-way Region Tags (RVA) alongside a D-way Data Array; other labels: 1, D, SR]


RegionTracker Cache

[Figure: four L1 caches above a RegionTracker cache made of RVA, ERB, BST, and Data Array]

Block and Region Lookups
  Region Tag + Way Per Block

Evict Region Blocks Lazily
  Simplify replacement and reduce area

Status per block + RVA set backpointer
  Can be banked and partitioned


Region-Aware Cache: Performance vs. Area

Commercial workloads: DB2, Oracle, TPC-C and TPC-H, Apache, Zeus
SimICS + SimFlex, Sampling, 2K Regions

[Figure: Miss Rate vs. Size for 128MB Cache Designs; relative miss rate (1 to 1.1) vs. relative tag array size (0 to 1); series: Sector Cache, Sector Pool, RA+, sqrt(2) Rule]


RegionTracker-RegionScout

One bit per Region tag: Known to be not shared
1KB Regions, Commercial workloads
512KB L2 private caches

Filter 41% of snoops at “Zero Cost” compared to conventional cache

[Figure: reduction in broadcasts (0% to 60%) for RS-1K CRH, RS-2K CRH, RS-4K CRH, and RSRT; additional label: BlockScout]


Directory Optimizations: Base Architecture

[Figure: Core, L2 Tags, L2 Data, Directory, L3 Tags, L3 Data, DRAM]


Coherence Delegation

Eliminate 3-hop overhead
Attract directory tracking to nodes

[Figure: Requesting Node, Directory Lookup, and Remote L2 containing data; the Ideal Path is also marked]


Predictor Virtualization

[Figure: four CPUs, each with L1-D and L1-I caches, an L2, an interconnect, and main memory]

Optimization Engines: Predictors


Motivating Trends

Chip multiprocessors
  Space dedicated to predictors × #processors

Larger predictor table
  Increased performance

Memory hierarchies
  Increased capacities

Use conventional memory hierarchies to store predictor information


PV Architecture

[Figure: Optimization Engine connected to a dedicated Predictor Table; labels: index, entry, prediction]


PV Architecture

[Figure: Optimization Engine backed by a PVProxy (PVCache, MSHR, and PVStart + index address computation), which goes through the L2 to the PVTable in Main Memory]
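A rough behavioral sketch of this arrangement follows. The names `PVProxy`, `PVCache`, `PVStart`, and `PVTable` come from the slide, but everything else is an assumption: the 64-byte entry size, the FIFO replacement in the PVCache, the write-back on eviction, and the dictionary that stands in for the L2 and main memory. The MSHR is not modeled.

```python
ENTRY_BYTES = 64  # assumed size of one predictor entry in memory


class PVProxy:
    """Small on-chip PVCache in front of a memory-resident PVTable."""

    def __init__(self, pv_start, memory, cache_entries=8):
        self.pv_start = pv_start
        self.memory = memory            # address -> entry (L2 + main memory)
        self.cache_entries = cache_entries
        self.pvcache = {}               # index -> entry, in insertion order
        self.memory_accesses = 0

    def address_of(self, index):
        """PVStart + scaled index gives the entry's memory address."""
        return self.pv_start + index * ENTRY_BYTES

    def lookup(self, index):
        if index in self.pvcache:
            return self.pvcache[index]
        self.memory_accesses += 1       # miss: fetch from the PVTable
        entry = self.memory.get(self.address_of(index))
        if len(self.pvcache) == self.cache_entries:
            victim = next(iter(self.pvcache))  # oldest entry (FIFO)
            self.memory[self.address_of(victim)] = self.pvcache.pop(victim)
        self.pvcache[index] = entry
        return entry

    def update(self, index, entry):
        self.lookup(index)              # bring the entry on chip first
        self.pvcache[index] = entry


memory = {}
pv = PVProxy(pv_start=0x10000, memory=memory, cache_entries=2)
pv.update(0, "a")
pv.update(1, "b")
pv.update(2, "c")        # evicts index 0 and writes it back to memory
refetched = pv.lookup(0)  # served from the memory-resident table
```

The optimization engine still sees an index-in, entry-out table. Only the proxy knows that most entries live in the conventional memory hierarchy.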


Virtualized Spatial Memory Streaming

[Figure: percentage speedup (-10 to 80) for Apache, Zeus, DB2, Oracle, Qry 1, Qry 2, Qry 16, Qry 17; series: SMS - 1K sets, SMS - 8 sets, SMS - PVCache 8 sets]

Original Prefetcher: Cost: 80KB

Virtualized Prefetcher: Cost: <1KB

Nearly Identical Performance


Summary

Caches are getting larger
  Time to look at the “big picture”

Region-Centric Memory Design
  Expose region-level info
  Allow management at the region-level

RegionScout
  Eliminate broadcasts for snoop coherence

Region-Centric Disambiguation
  Reduce bandwidth for TLS or TM

Region-Aware Memory
  “Same” area and performance as conventional + region info

