Impulse Project DARPA Review – July 2000
Transcript of Impulse Project DARPA Review – July 2000
Slide 1: Impulse Adaptable Memory System
Impulse Project
DARPA Review – July 2000
University of Utah
and
University of Massachusetts at Amherst
Slide 2
Technology Trends
Disturbing trends (for a memory architect):
– Memory gap widening (CPUs improving 60%/year, DRAM only 7%)
– Internal CPU parallelism is escalating
– Emerging applications with poor locality (multimedia, databases, …)
– Cache size growing much faster than TLB reach
– Ugly CPIs: Perl and Sites, OSDI 1996
Possible solutions:
– Bigger, deeper cache hierarchies
– Better latency-tolerating CPU features (non-blocking cache, OOO, …)
– Migrate computation to the DRAMs
– Let software control how data is managed (Impulse)
Slide 3
Simple Example Problem: Sum of diagonal elements of a dense matrix

for (i = 0; i < n; i++)
    sum += A[i][i];

Problems:
– Wasted bus bandwidth
– Low cache utilization
– Low cache hit ratio

[Figure: cache and physical memory connected over the memory bus through the memory controller]
Slide 4
The Impulse Idea: What if software could do the following?

Create diag[*] corresponding to A[*][*]
for (i = 0; i < n; i++)
    sum += diag[i];

Improvements:
– No wasted bus bandwidth
– Better cache utilization
– Higher cache and TLB hit ratios

[Figure: cache and physical memory connected over the memory bus through the memory controller]
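In software, the gather that the controller performs for this remapping can be sketched as follows (a minimal emulation with illustrative names; real Impulse hardware assembles the dense cache lines in the memory controller without making a copy):

```c
#include <assert.h>
#include <stddef.h>

#define N 4

/* Software emulation of the diag[*] remapping: gather A[i][i] into a
 * dense vector so that every word moved over the bus is useful. */
static void gather_diag(double A[N][N], double diag[N]) {
    for (size_t i = 0; i < N; i++)
        diag[i] = A[i][i];
}

/* The application's loop then streams through the dense vector. */
static double sum_diag(const double diag[N]) {
    double sum = 0.0;
    for (size_t i = 0; i < N; i++)
        sum += diag[i];
    return sum;
}
```

With the gather in place, the application loop touches N contiguous words instead of N widely spaced cache lines.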
Slide 5
How? Add an Extra Level of Mapping

Shadow address: “unused” physical address
MC maps shadow addresses to physical addresses
Applications configure the MC through the OS

[Figure: virtual space mapped by the MMU/TLB into physical space; shadow address space mapped by the Impulse MC onto real physical memory]
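The extra mapping level can be illustrated with the address arithmetic for the diagonal remapping (a sketch of the translation, not the hardware implementation; names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* For the diagonal of an n-by-n row-major matrix of esize-byte
 * elements, offset i within the shadow "diag" region resolves to the
 * real physical address base + i*(n+1)*esize: successive diagonal
 * elements are one full row plus one element apart. */
static uint64_t shadow_diag_to_phys(uint64_t base, uint64_t i,
                                    uint64_t n, uint64_t esize) {
    return base + i * (n + 1) * esize;
}
```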
Slide 6
Address Translations

Conventional system:
Virtual Memory → (MMU/TLB) → Physical Memory

Impulse system:
Virtual Memory → (MMU/TLB) → Shadow Memory → (word-grained remapping, e.g. diagonal) → Pseudo-Virtual Memory → (page-grained) → Physical Memory
Slide 7
Impulse Features
Base-stride scatter/gather data
– Walk columns or diagonals efficiently
– Remap matrix tiles to contiguous memory without copying
Indirection vector accesses
– Static vectors (e.g., perform A[index[i]] efficiently)
– Dynamic cacheline assembly
Remap pages
– Create superpages from disjoint base pages
– No-copy page coloring
Aggressive controller-based prefetching
– Prefetch data from DRAMs (sequential and pointer-directed)
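The indirection-vector feature can be emulated in software as an explicit gather (illustrative names; the hardware performs this during cache-line fill, without copying):

```c
#include <assert.h>
#include <stddef.h>

/* Software emulation of an indirection-vector remapping: the MC
 * gathers A[index[i]] so that the CPU sees a dense vector. */
static void gather_indirect(const int *A, const size_t *index,
                            int *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = A[index[i]];
}
```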
Slide 8
Exploiting Impulse
Setup:
1. Application asks OS to set up remapping
2. OS allocates free shadow configuration register
   • sets up dense “page table” that points to target data
   • downloads address of this page table to configuration register
3. OS allocates free shadow and virtual address space
   • maps application virtual addresses to shadow physical addresses
   • returns virtual address corresponding to remapped data to app

Use:
1. TLB translation (VA to shadow)
2. Fine-grained remapping (if any)
3. Remapped addresses pass through MC-TLB
4. DRAM scheduler “collects” data
5. Application accesses (dense) remapped data
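The setup and use steps can be modeled in miniature for a base-stride remapping (all names and structures here are hypothetical, chosen to mirror the steps above, not the actual Impulse interface):

```c
#include <assert.h>
#include <stdint.h>

/* Toy shadow configuration register for a base-stride remapping. */
typedef struct {
    uint64_t shadow_base;  /* start of the shadow region           */
    uint64_t phys_base;    /* physical address of the target data  */
    uint64_t stride;       /* bytes between remapped elements      */
    uint64_t esize;        /* element size in bytes                */
} shadow_reg;

/* "Setup": the OS fills a free configuration register. */
static void setup_remap(shadow_reg *r, uint64_t shadow_base,
                        uint64_t phys_base, uint64_t stride,
                        uint64_t esize) {
    r->shadow_base = shadow_base;
    r->phys_base   = phys_base;
    r->stride      = stride;
    r->esize       = esize;
}

/* "Use", step 2: fine-grained remapping of a shadow address into the
 * real physical address the MC's DRAM scheduler will fetch. */
static uint64_t remap(const shadow_reg *r, uint64_t shadow_addr) {
    uint64_t i = (shadow_addr - r->shadow_base) / r->esize;
    return r->phys_base + i * r->stride;
}
```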
Slide 9
Architecture Overview
[Figure: MMC block diagram: register file, shadow engines with MTLBs, shadow staging unit, prefetch unit, request queue, scoreboard, output buffer, writeback buffer, and per-bank DRAM controllers; connected to the system bus (data, coherence, address) and I/O]
Slide 11
Benchmarks

Fine-grained remapping benchmarks
– Conjugate gradient (core of DARPA vision benchmark)
– Ray tracing
Page-grained remapping benchmarks
– SPEC95 (dynamic superpage promotion)
– Compress (no-copy page coloring)
Prefetching benchmarks
– SPECint 95 suite (3-15% performance improvement)
– Synthetic tree microbenchmarks
Slide 12
Conjugate Gradient
[Figure: sparse matrix A times vector P producing B, with A stored as the Row, Data, and Column arrays]

Store logical sparse matrix A using the Yale storage scheme:
– Data stores non-zero elements (much larger than P)
– Row[i] indicates where the ith row begins in Data
– Column[i] is the column number of Data[i]
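A tiny worked example of the Yale scheme (values chosen here purely for illustration), for the 3x3 sparse matrix with nonzeros 5 at (0,0), 8 at (1,1), 3 at (1,2), and 6 at (2,2):

```c
#include <assert.h>

/* Data holds the nonzeros row by row, Column their column numbers,
 * and Row[i] the index in Data where row i begins; the final entry
 * of Row marks the end of the last row. */
static const double Data[]   = { 5.0, 8.0, 3.0, 6.0 };
static const int    Column[] = { 0, 1, 2, 2 };
static const int    Row[]    = { 0, 1, 3, 4 };
```

Row[i+1] - Row[i] gives the number of nonzeros in row i, which is how the inner loop bounds on the next slide are formed.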
Slide 13
Optimizing Conjugate Gradient
Original Code:

for i = 0 to n-1 do
    sum = 0;
    for j = Row[i] to Row[i+1]-1 do
        sum += Data[j] * P[Col[j]];
    b[i] = sum;

Optimized Code:

Pi = remap_indirect(P, Col, n, …);
for i = 0 to n-1 do
    sum = 0;
    for j = Row[i] to Row[i+1]-1 do
        sum += Data[j] * Pi[j];
    b[i] = sum;

Issues (original):
• Data and Col are large streams
• P reusable, but forced out of cache
• Poor L1 cache hit rates
• Interference in L2 cache

Issues (optimized):
• Indirect access to P[Col[j]] turned into sequential streaming access
• No reuse on P now
• Side effect: eliminates accesses to Col
• Significant improvement to hit rates (both L1 and TLB)
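The transformation can be checked with a software stand-in for remap_indirect, modeled here as an explicit gather Pi[j] = P[Col[j]] (the real controller gathers at cache-line fill time with no copy; function names are illustrative):

```c
#include <assert.h>

/* Original loop: indirect accesses P[Col[j]]. */
static void spmv_orig(const double *Data, const int *Row,
                      const int *Col, const double *P,
                      double *b, int n) {
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = Row[i]; j <= Row[i + 1] - 1; j++)
            sum += Data[j] * P[Col[j]];
        b[i] = sum;
    }
}

/* Optimized loop: Pi is the gathered (remapped) vector, so every
 * access is a sequential stream. */
static void spmv_remapped(const double *Data, const int *Row,
                          const double *Pi, double *b, int n) {
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = Row[i]; j <= Row[i + 1] - 1; j++)
            sum += Data[j] * Pi[j];
        b[i] = sum;
    }
}
```

Both loops compute the same product b = A*P; only the access pattern changes.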
Slide 14
Conjugate Gradient Results
Base Impulse
Time (cycles) 5.48B 1.77B
L1 hit ratio 63.4% 77.8%
L2 hit ratio 19.7% 15.9%
TLB cycles 10.1M 0.5M
Speedup --- 3.1X
Significant improvement in effective cache locality
Slide 15
Volume Rendering: Ray Tracing
Problem: Ray traversals are “random” memory accesses
Solution: Calculate addresses of rays as an “indirection vector”
Access rays via Impulse-remapped data structure
Slide 16
Volume Rendering Results
Orig (A) Impulse (A) Orig (B) Impulse (B)
Time 264M 185M 1440M 285M
L1 hit ratio 96.8% 96.6% 86.3% 91.7%
L2 hit ratio 0.8% 0.9% 0.4% 6.2%
TLB cycles 0.30M 0.31M 259M 0.13M
Speedup -- 1.4X -- 6.1X
A: rays follow natural memory layout (X axis)
B: rays perpendicular to natural memory layout (Z axis)
Slide 17
Coarse-Grained Remappings

Page-grained remapping: aggressive use of synthetic superpages
– modified kernel TLB miss handler to detect pages responsible for frequent TLB misses
– create superpages by page-grained remapping on the memory controller
– no copying, therefore can be far more aggressive
No-copy page coloring
– Problem: conflicts in the physically-indexed L2 cache
– Normal solution: copy to non-conflicting pages
– Impulse solution: remap to non-conflicting pages
Slide 18
Shadow-Backed Superpages

[Figure: scattered virtual pages 0x00004000–0x00007000 mapped to a contiguous run of shadow addresses 0x80240000–0x80243000, which the MC backs with scattered physical pages 0x40138000, 0x06155000, 0x04012000, 0x12011000]

SPECint95 improves 5-20%
MTLB increases effective reach of CPU TLB
Superpages can cover large and multiple arrays identified at compile time
– at allocation time (cheapest) or dynamically
Slide 19
MMC-Based Prefetching
Idea: Prefetch data from the DRAMs into SRAM on the MMC

Misprediction penalties significantly reduced, since mispredicted prefetches cost neither:
– conflict misses due to cache capacity limitations
– system bus bandwidth

Exploits “free” DRAM bandwidth at the MMC level
– higher aggregate DRAM bandwidth than cache or bus bandwidth

Reduces latency of accesses that hit in the prefetch cache
Slide 20
Pointer-based Microbenchmarks
Random walk down a tree with N children per node
– vary number of children from 1 (linked list) to 3 (trinary tree)
Baseline: compiler-directed prefetching
Impulse: MMC prefetches next nodes in tree (1-ahead)
– allocate nodes in shadow region
– tell MMC what offsets represent pointers

[Figure: root node with children Child1 … ChildN, each of which has children Child1 … ChildN of its own]
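A toy model of the 1-ahead pointer prefetch (node layout and names here are illustrative, not the actual MMC interface): once told which fields of a node hold pointers, the controller can dereference them as each node is fetched and stage the children in its prefetch buffer:

```c
#include <assert.h>
#include <stddef.h>

#define NKIDS 3

/* A tree node with up to NKIDS child pointers at known offsets. */
typedef struct node {
    int value;
    struct node *child[NKIDS];
} node;

/* Model of the MMC's 1-ahead prefetch: for one fetched node, collect
 * the children it points to into the staging buffer.  Returns how
 * many children were staged. */
static int prefetch_children(const node *n, const node *staged[]) {
    int count = 0;
    for (int i = 0; i < NKIDS; i++)
        if (n->child[i] != NULL)
            staged[count++] = n->child[i];
    return count;
}
```

Whichever child the random walk visits next is then already in the prefetch buffer, hiding the DRAM latency of the pointer chase.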
Slide 21
Pointer Prefetching Results
P1 (N) P1 (C) P1 (I) P3 (N) P3 (C) P3 (I)
Time 100M 99.7M 84.7M 124M 197M 109M
L1 hit ratio 67.5% 98.8% 67.5% 68.2% 97.9% 68.2%
L2 hit ratio 0.4% 0.1% 0.4% 0.4% 0.3% 0.5%
TLB cycles 1.6M 1.2M 1.6M 6.2M 6.2M 6.0M
Speedup --- 1.0X 1.2X --- 0.63X 1.14X
P1(N): singly-linked list, no prefetching
P3(C): triply-linked list, compiler-directed prefetching
P#(I): Impulse MMC-directed prefetching
Slide 22
Prototyping Status
Four-stage prototype strategy:
I: Slow conventional MMC
II: Fast conventional MMC
III: Impulse on an FPGA
IV: Impulse in an ASIC

Current Status:
Stage I complete (pictured)
Stage II imminent (final testing)
Stage III underway (3/01)
Stage IV next year (12/01)
Slide 23
Summary

Impulse Benefits
– Higher memory bus utilization
– Higher cache utilization
– Turns sparse memory operations into dense ones
Range of optimizations
– Fine-grained data remapping
– Page-grained data remapping
– Memory-based prefetching
Impact
– Performance increase for small increase in cost
– Does not require changes to CPUs, caches, or DRAMs
Slide 24
Questions?
http://www.cs.utah.edu/impulse