Region-Centric Memory Design
AENAO Research Group
Patrick Akl, M.A.Sc.
Ioana Burcea, Ph.D. C.
Myrto Papadopoulou, M.A.Sc. C.
Elham Safi, Ph.D. C.
Jason Zebchuk, M.A.Sc. C.
Andreas Moshovos
{pakl, ioana, myrto, elham, zebchuk, moshovos}@eecg.toronto.edu
Sept. 11, 2007, JJPAR, Aenao Group/Toronto
Future On-Chip Caches: Just Larger?
[Figure: three CPUs, each with I$ and D$, connected through an interconnect to main memory]
Observe and Exploit Memory Access Behavior at a Coarse Grain
10s – 100s of MB
Conventional Block-Centric Memory Hierarchy
Conventional Fine-Grain Tracking
“Small” Blocks: Performance and Bandwidth
Several optimizations exist
Big picture is lost
“Big Picture” View
Region: 2^n-sized, aligned memory area
  Concept already in use: TLBs
Patterns Emerge in Space / Time
  Exploit for performance & power
  Expose to software
Supplemental Coarse-Grain Tracking
This Presentation
Examples of Coarse-Grain Optimizations
  Snoop Coherence
  Thread-level speculation disambiguation
Region-Centric Memory Design
  RegionTracker Cache
  Snoop Coherence Revisited
Current Activities
  Coherence Delegation
  Predictor Virtualization
An Example: Snoop Coherence
Conventional Considerations: Complexity and Correctness, NOT Power/Bandwidth
Can we: (1) Reduce Power/Bandwidth (2) Leverage snoop coherence?
  Remains Attractive: Simple / Design Re-use
Yes: Exploit Program Behavior to Dynamically Identify Requests that do not Need Snooping
[Figure: three CPUs, each with I$ and D$, connected through an interconnect to main memory]
Coherence Basics
Given request for memory block X (address)
  Detect where current value resides
[Figure: three CPUs and main memory; a request for X is snooped by the other CPUs, one of which hits]
Conventional Coherence not Power-Aware/Bandwidth-Effective
All L2 tags see all accesses
  Perf. & Complexity: Have L2 tags, why not use them?
  Power: All L2 tags consume power on all accesses
  Bandwidth: broadcast all coherent requests
[Figure: three CPUs with L2 caches and main memory; a request misses in both remote L2s]
RegionScout Motivation: Sharing is Coarse
Region: large contiguous memory area, power-of-2 size
CPU X asks for data block in region R
  1. No one else has X
  2. No one else has any block in R
RegionScout Exploits this Behavior
Layered Extension over Snoop Coherence
[Figure: Typical memory space snapshot, addresses colored by owner(s)]
Optimization Opportunities
Power and Bandwidth
  Originating node: avoid asking others
  Remote node: avoid tag lookup
[Figure: three CPUs, each with I$ and D$, connected through a switch to memory]
Potential: Region Miss Frequency
[Figure: Global Region Misses as a percentage of all requests (0% to 100%, higher is better) vs. Region Size (256 to 16K); series: p4.512K, p4.1M, p8.512K, p8.1M]
Even with a 16K Region, ~45% of requests miss in all remote nodes
RegionScout at Work: Non-Shared Region Discovery
First request detects a non-shared region
[Figure: three CPUs and main memory; steps 1, 2, 2, 3: both remote CPUs report a Region Miss, giving a Global Region Miss]
Record: Non-Shared Regions
Record: Locally Cached Regions
RegionScout at Work: Avoiding Snoops
Subsequent request avoids snoops
[Figure: three CPUs and main memory; steps 1 and 2, region previously recorded as a Global Region Miss]
Record: Non-Shared Regions
Record: Locally Cached Regions
RegionScout is Self-Correcting
Request from another node invalidates non-shared record
[Figure: three CPUs and main memory; steps 1, 2, 2]
Record: Non-Shared Regions
Record: Locally Cached Regions
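The three RegionScout slides (discovery, avoiding snoops, self-correction) can be tied together in a small behavioral sketch. This is not the authors' implementation: the `Node` and `SnoopBus` names, the use of exact Python sets instead of small hardware tables, and the string return values are all assumptions made for illustration.

```python
class Node:
    """One CPU node. Sets stand in for the two hardware records (assumption)."""
    def __init__(self, name):
        self.name = name
        self.cached_blocks = set()  # block addresses held locally
        self.non_shared = set()     # regions discovered to be non-shared


class SnoopBus:
    def __init__(self, nodes, region_size=16 * 1024):
        self.nodes = nodes
        self.region_size = region_size
        self.broadcasts = 0

    def region(self, address):
        return address // self.region_size

    def request(self, requester, address):
        region = self.region(address)
        # Originating node: have I discovered that this region is not shared?
        if region in requester.non_shared:
            requester.cached_blocks.add(address)
            return "snoop avoided"
        # Otherwise broadcast; every remote node answers at region granularity.
        self.broadcasts += 1
        global_region_miss = True
        for node in self.nodes:
            if node is requester:
                continue
            # Self-correction: a remote request invalidates the non-shared record.
            node.non_shared.discard(region)
            if any(self.region(b) == region for b in node.cached_blocks):
                global_region_miss = False
        if global_region_miss:
            requester.non_shared.add(region)
        requester.cached_blocks.add(address)
        return "global region miss" if global_region_miss else "snooped"
```

In this sketch a first request to an untouched region returns "global region miss", later requests from the same node to that region avoid the broadcast, and a request from a second node forces the first node back to snooping.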
Implementation: Requirements
Requesting Node provides address:
  CPU address = Region Tag | offset, where the offset is lg(Region Size) bits
At Originating Node – from CPU: Have I discovered that this region is not shared?
At Remote Nodes – from Interconnect: Do I have a block in the region?
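The address split on this slide amounts to a shift and a mask. A minimal sketch, assuming a power-of-two region size; the function names and the 16 KB default are illustrative, not taken from the slides.

```python
def offset_bits(region_size):
    """lg(region size); region_size is assumed to be a power of two."""
    if region_size <= 0 or region_size & (region_size - 1):
        raise ValueError("region size must be a power of two")
    return region_size.bit_length() - 1


def region_tag(address, region_size=16 * 1024):
    """Upper address bits: identifies the region."""
    return address >> offset_bits(region_size)


def region_offset(address, region_size=16 * 1024):
    """Lower lg(region size) bits: position inside the region."""
    return address & (region_size - 1)
```

For example, with 16 KB regions the address `0x12345678` splits into region tag `0x48D1` and offset `0x1678`.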
Remembering Non-Shared Regions
Records non-shared regions
Lookup by Region portion prior to issuing a request
Snoop requests and invalidate
[Figure: the Region Tag of the address indexes the Non-Shared Region Table; each entry has a valid bit]
Few entries: 16x4 in most experiments
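A minimal sketch of such a table follows. It reads "16x4" as 16 sets of 4 ways, which is an assumption, as are the modulo set index and the LRU replacement; the slide specifies none of these.

```python
from collections import OrderedDict


class NonSharedRegionTable:
    """Small set-associative table of regions known to be non-shared."""

    def __init__(self, num_sets=16, ways=4):
        self.num_sets = num_sets
        self.ways = ways
        # One ordered dict per set; insertion order doubles as LRU order.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def _set_for(self, region):
        return self.sets[region % self.num_sets]

    def record(self, region):
        """Remember that a global region miss was observed for this region."""
        entries = self._set_for(region)
        entries.pop(region, None)
        if len(entries) >= self.ways:
            entries.popitem(last=False)  # evict the least recently used entry
        entries[region] = True

    def is_non_shared(self, region):
        """Lookup performed before issuing a request."""
        entries = self._set_for(region)
        if region in entries:
            entries.move_to_end(region)
            return True
        return False

    def snoop(self, region):
        """A request from another node invalidates the record."""
        self._set_for(region).pop(region, None)
```

Losing an entry to replacement only costs a missed filtering opportunity: the next request to that region is simply broadcast again.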
What Regions are Locally Cached?
If we had as many counters as regions:
  Block Allocation: counter[region]++
  Block Eviction: counter[region]--
  Region cached only if counter[region] non-zero
Not Practical: e.g., 16K Regions and 4G Memory require 256K counters
[Figure: the Region Tag of the address selects a counter]
What Regions are Locally Cached?
[Figure: hash() of the Region Tag selects a counter]
Imprecise: Records a superset of locally cached Regions
False positives: lost opportunity, correctness preserved
Small: e.g., 256 entries for 1M cache
Power-Optimized structures described in the paper
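The hashed counter array can be sketched as below. The class name, the modulo hash and the method names are assumptions for illustration; the slides only say that a hash of the region tag selects a counter. The constant at the top reproduces the arithmetic behind the "not practical" exact scheme.

```python
# Exact tracking: one counter per region of a 4 GB memory with 16 KB regions.
EXACT_COUNTERS = (4 << 30) // (16 << 10)  # 262144, i.e. 256K counters


class CachedRegionHash:
    """Hashed counters recording a superset of the locally cached regions."""

    def __init__(self, entries=256, region_size=16 * 1024):
        self.counters = [0] * entries
        self.offset_bits = region_size.bit_length() - 1  # power-of-two size assumed

    def _index(self, address):
        region = address >> self.offset_bits
        return region % len(self.counters)  # hypothetical hash function

    def block_allocated(self, address):
        self.counters[self._index(address)] += 1

    def block_evicted(self, address):
        self.counters[self._index(address)] -= 1

    def may_cache_region(self, address):
        # Zero means "definitely no block of this region is cached".
        # Non-zero may be a false positive caused by another region
        # hashing to the same counter.
        return self.counters[self._index(address)] != 0
```

A false positive makes the node perform a tag lookup it could have skipped, so it costs power but never correctness, which matches the slide's claim.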
LFSR-Based Implementation
[Figure: hash() of the Region Tag selects an LFSR counter, followed by a Zero Detector]
Linear-Feedback Shift Register Array
  Increment / Decrement / Is Zero?
130nm commercial technology (ISLPED ’06)
  Faster: 1.6x to 3.7x
  More Energy Efficient: 1.4x to 2.3x
  But Area: 3.2x
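The counter only needs increment, decrement and a zero test, so the stored value does not have to be a binary number. The behavioral sketch below uses a 4-bit LFSR as a counter; the polynomial (x^4 + x^3 + 1), the choice of `0b0001` as the "zero" state and the lookup table for stepping backwards are assumptions for illustration, not the ISLPED ’06 circuit.

```python
def lfsr_next(state):
    """One forward step of a 4-bit Fibonacci LFSR (taps on bits 3 and 2)."""
    feedback = ((state >> 3) ^ (state >> 2)) & 1
    return ((state << 1) | feedback) & 0xF


ZERO_STATE = 0b0001  # the LFSR state that stands for a count of zero


def _build_previous():
    """Walk the full cycle once to get the inverse (backward) step."""
    previous = {}
    state = ZERO_STATE
    while True:
        following = lfsr_next(state)
        previous[following] = state
        state = following
        if state == ZERO_STATE:
            return previous


LFSR_PREVIOUS = _build_previous()


class LfsrCounter:
    """Counts 0..14 by walking the 15-state LFSR sequence."""

    def __init__(self):
        self.state = ZERO_STATE

    def increment(self):
        self.state = lfsr_next(self.state)

    def decrement(self):
        self.state = LFSR_PREVIOUS[self.state]

    def is_zero(self):
        return self.state == ZERO_STATE
```

Each step is a shift plus one XOR, with no carry chain, which is presumably where the speed and energy advantage quoted on the slide comes from. Note that this 4-bit version wraps around after 15 increments.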
Filter Rates: SPLASH-II
[Figure: Identified Global Region Misses (0% to 100%, higher is better) vs. CRH Size (256 to 2K); series: p4.512K.R4K, p4.512K.R16K, p8.512K.R4K, p8.512K.R16K]
Jason Cantin @ Wisconsin studied commercial workloads: 40% filter rate
Region-Centric Disambiguation
Joint work w/
Greg Steffan and Mihai Burcea
Patrick Akl
Andreas Moshovos
Speculative Parallelization Models
Thread-level speculation
Transactional Memory
[Figure: Original vs. Speculative Parallelization over time. Good Scenario: "write a" and "read b". Bad Scenario: "write a" and "read a".]
Need to Compare Addresses Across Code Pieces
Ex #2: Region-Centric Disambiguation
Send digest at region level
On region conflict: send block-level info
Reduced bandwidth, potential for performance and power
[Figure: memory space touched by Task 1 and Task 2, Conventional vs. Region-Centric]
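The two-step check described above can be sketched as follows. The function names, the use of Python sets as "digests" and the 256-byte default region size are assumptions for illustration; the slides do not describe the digest encoding.

```python
def region_digest(addresses, region_size):
    """Coarse summary of an access set: the regions it touches."""
    return {address // region_size for address in addresses}


def find_conflicts(task1_writes, task2_reads, region_size=256):
    """Return the addresses written by task 1 and read by task 2.

    Step 1 compares region-level digests only. Step 2 falls back to
    block-level addresses, restricted to the regions that conflicted.
    """
    conflicting_regions = (region_digest(task1_writes, region_size)
                           & region_digest(task2_reads, region_size))
    if not conflicting_regions:
        return set()  # common case: no block-level traffic needed
    writes = {a for a in task1_writes if a // region_size in conflicting_regions}
    reads = {a for a in task2_reads if a // region_size in conflicting_regions}
    return writes & reads
```

A region-level conflict does not imply a real dependence: two tasks can touch different blocks of the same region, in which case the second step returns an empty set.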
How Much Traffic Can We Save?
TLS benchmarks from STAMPEDE group (G. Steffan)
Approximate timing model
Potential for traffic reduction by 38%
[Figure: Disambiguation Traffic Reduction with TLS; Traffic Ratio (0 to 1) vs. Region Size (32 to 1024 bytes)]
Exploiting Region-Level Information
Region Coherence Arrays
  Cantin, Lipasti and Smith
RegionScout
  Our work
  Both of these reduce snoop lookups (and broadcasts) in snoop coherence protocols
Spatial Memory Prefetching
  Leverages spatial memory patterns for prefetching with commercial workloads
  Impetus Group at CMU
Stealth Prefetching
  Cantin, Lipasti and Smith
Coarse-Grain Techniques Today
Overhead
  Storage: e.g., 60% of tags
  Functionality: Restrict placement, Region Evictions
Loss of Information
Hard to justify for a commercial design
[Figure: CPU with I$ and D$; a Conventional Cache (TAGS, DATA) plus a separate Auxiliary Tracking structure]
Rethinking Cache Design
Can we provide a common substrate for all these optimizations?
Redesign caches:
  Regions a first-class citizen
RegionTracker Cache
[Figure: CPU with I$ and D$; a cache with Dual-Grain TAGS and DATA, with Embedded Tracking]
RegionTracker Cache
Goals
  Expose region behavior
    Is region X cached? Which blocks are?
  Facilitate management at the region level
    Evict/migrate region X
    Do something with all blocks in X
Constraints:
  Data movement only at the block level
  No increase in area
  No decrease in performance
  Complexity
  Associativity
Region-Based Caches
Start with conventional 16-way cache and replace tag array
Sector Caches
  Hit rate suffers: 20% loss
Sector Pool Caches
  High Associativity: 48-way for matching a 16-way cache
Decoupled-Sectored Caches
  No coarse-grain info
  Replacements require searching
No previous design is adequate
RegionTracker: Meets all requirements
  But does not save as much tag resources
Sector Cache
Reduced Area and Power
Increased miss rates (2.5% - 96% for 1kB sectors)
Replacement?
[Figure: D-way Region Tags (RVA) next to a D-way Data Array]
Sector Pool Cache
M > D
Requires highly associative cache to achieve same performance as RegionTracker (~48-way)
[Figure: M-way Region Tags (RVA) next to a D-way Data Array; labels: 1..D, SR]
Decoupled-Sectored Cache
Has multiple block evictions
  Requires scanning “status” array
  No simple mechanism to avoid this
Does NOT expose region-level information
RegionTracker
In practice L <= D
Decouple Data and Lookup organizations
Lower Associativity lookups with no hit-rate penalty
RegionTracker provides complete solution
[Figure: L-way Region Tags (RVA) next to a D-way Data Array; labels: 1..D, SR]
RegionTracker Cache
[Figure: four L1 caches in front of the RegionTracker cache, which consists of the RVA, ERB, BST and Data Array]
Block and Region Lookups
Region Tag + Way Per Block
Evict Region Blocks Lazily
Simplify replacement and reduce area
Status per block + RVA set backpointer
Can be banked and partitioned
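The "region tag + way per block" idea, and the two questions the design is meant to answer (is region X cached, and which blocks are), can be illustrated with a small sketch. It is not the RegionTracker organization: a Python dictionary stands in for the set-associative region tag array, eviction is immediate rather than lazy, and all names and sizes are assumptions.

```python
class RegionVectorSketch:
    """Per-region entry: one data-array way (or None) for each block."""

    def __init__(self, region_size=1024, block_size=64):
        self.region_size = region_size
        self.block_size = block_size
        self.blocks_per_region = region_size // block_size
        self.entries = {}  # region tag -> list of ways, None if block absent

    def _split(self, address):
        region = address // self.region_size
        block = (address % self.region_size) // self.block_size
        return region, block

    def fill(self, address, way):
        region, block = self._split(address)
        ways = self.entries.setdefault(region, [None] * self.blocks_per_region)
        ways[block] = way

    def lookup_block(self, address):
        """Block lookup: returns the data-array way, or None on a miss."""
        region, block = self._split(address)
        ways = self.entries.get(region)
        return None if ways is None else ways[block]

    def is_region_cached(self, address):
        """Region lookup: is any block of this region cached?"""
        return self._split(address)[0] in self.entries

    def cached_blocks(self, address):
        """Which blocks of the region are cached?"""
        ways = self.entries.get(self._split(address)[0], [])
        return [i for i, way in enumerate(ways) if way is not None]

    def evict_region(self, address):
        """Region-level management: drop every block of the region."""
        ways = self.entries.pop(self._split(address)[0], [])
        return [(i, way) for i, way in enumerate(ways) if way is not None]
```

The point of the sketch is that a single lookup structure serves both granularities: a block lookup reads one slot of the region entry, while region-level questions read the entry as a whole.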
Region-Aware Cache: Performance vs. Area
Commercial workloads: DB2, Oracle, TPC-C and TPC-H, Apache, Zeus
SimICS + SimFlex, Sampling, 2K Regions
[Figure: Miss Rate vs. Size for 128MB Cache Designs; Relative Miss Rate (1 to 1.1) vs. Relative Tag Array Size (0 to 1); series: Sector Cache, Sector Pool, RA+, sqrt(2) Rule]
RegionTracker-RegionScout
One bit per Region tag: Known to be not shared
1KB Regions, Commercial workloads
512KB L2 private caches
Filter 41% of snoops at “Zero Cost” compared to conventional cache
[Figure: Reduction in Broadcasts (0% to 60%, higher is better); bars: RS-1K CRH, RS-2K CRH, RS-4K CRH, RSRT, BlockScout]
Directory Optimizations: Base Architecture
[Figure labels: Core, L2 Tags, L2 Data, Directory, L3 Tags, L3 Data, DRAM]
Coherence Delegation
Eliminate 3-hop overhead
Attract directory tracking to nodes
[Figure: Requesting Node, Directory Lookup and Remote L2 containing data, with the Ideal Path marked]
Predictor Virtualization
[Figure: multiple CPUs, each with L1-I and L1-D caches, an L2, an interconnect and main memory]
Optimization Engines: Predictors
Motivating Trends
Chip multiprocessors
  Space dedicated to predictors x #processors
Larger predictor table
  Increased performance
Memory hierarchies
  Increased capacities
Use conventional memory hierarchies to store predictor information
PV Architecture
[Figure, built up over three slides: (1) the Optimization Engine sends an index to a dedicated Predictor Table and receives a prediction; (2) Predictor Virtualization takes the place of the table; (3) the PVProxy, with a PVCache and MSHR, adds the index to PVStart to reach the PVTable held in the L2 and main memory]
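The final build of the figure suggests the following data path: an entry index is added to a base address (PVStart) to locate the entry in a table that lives in the ordinary memory hierarchy, with a small cache in front. The sketch below models that idea only; the class name, the write-through policy, the LRU replacement and the `bytearray` standing in for L2 and main memory are all assumptions.

```python
from collections import OrderedDict


class PVProxySketch:
    """Small cache in front of a predictor table stored in 'memory'."""

    def __init__(self, memory, pv_start, entry_size=4, cache_entries=8):
        self.memory = memory          # bytearray standing in for L2 / DRAM
        self.pv_start = pv_start      # base address of the in-memory table
        self.entry_size = entry_size
        self.cache_entries = cache_entries
        self.pv_cache = OrderedDict() # index -> entry bytes, in LRU order
        self.hits = 0
        self.misses = 0

    def _address(self, index):
        return self.pv_start + index * self.entry_size

    def _install(self, index, entry):
        self.pv_cache[index] = entry
        self.pv_cache.move_to_end(index)
        if len(self.pv_cache) > self.cache_entries:
            self.pv_cache.popitem(last=False)  # drop least recently used

    def read(self, index):
        if index in self.pv_cache:
            self.hits += 1
            self.pv_cache.move_to_end(index)
            return self.pv_cache[index]
        self.misses += 1
        start = self._address(index)
        entry = bytes(self.memory[start:start + self.entry_size])
        self._install(index, entry)
        return entry

    def write(self, index, entry):
        entry = bytes(entry)
        if len(entry) != self.entry_size:
            raise ValueError("entry has the wrong size")
        start = self._address(index)
        self.memory[start:start + self.entry_size] = entry  # write-through
        self._install(index, entry)
```

The optimization engine sees the same index-in, prediction-out interface as with a dedicated table; only the small cache consumes dedicated on-chip storage.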
Virtualized Spatial Memory Streaming
[Figure: Percentage Speedup (-10 to 80) for Apache, Zeus, DB2, Oracle, Qry 1, Qry 2, Qry 16, Qry 17; series: SMS - 1K sets, SMS - 8 sets, SMS - PVCache 8 sets]
Original Prefetcher: Cost: 80KB
Virtualized Prefetcher: Cost: <1KB
Nearly Identical Performance
Region-Centric Memory Design
AENAO Research Group
Patrick Akl, M.A.Sc. C.
Ioana Burcea, Ph.D. C.
Myrto Papadopoulou, M.A.Sc. C.
Elham Safi, Ph.D. C.
Jason Zebchuk, M.A.Sc. C.
Andreas Moshovos
{pakl, ioana, myrto, elham, zebchuk, moshovos}@eecg.toronto.edu
Summary
Caches are getting larger
  Time to look at the “big picture”
Region-Centric Memory Design
  Expose region-level info
  Allow management at the region level
RegionScout
  Eliminates broadcasts for snoop coherence
Region-Centric Disambiguation
  Reduces bandwidth for TLS or TM
Region-Aware Memory
  “Same” area and performance as conventional + region info
Predictor Virtualization