Impulse Project DARPA Review – July 2000
Transcript of Impulse Project DARPA Review – July 2000
Slide 1: Impulse Adaptable Memory System
Impulse Project
DARPA Review – July 2000
University of Utah
and
University of Massachusetts at Amherst
Slide 2
Technology Trends
Disturbing trends (for a memory architect):
– Memory gap widening (CPUs improving 60%/year, DRAM only 7%)
– Internal CPU parallelism is escalating
– Emerging applications with poor locality (multimedia, databases, …)
– Cache size growing much faster than TLB reach
– Ugly CPIs: Perl and Sites, OSDI 1996
Possible solutions:
– Bigger, deeper cache hierarchies
– Better latency-tolerating CPU features (non-blocking cache, OOO, …)
– Migrate computation to the DRAMs
– Let software control how data is managed (Impulse)
Slide 3
Simple Example Problem: Sum of diagonal elements of a dense matrix

for (i = 0; i < n; i++)
    sum += A[i][i];

Problems:
– Wasted bus bandwidth
– Low cache utilization
– Low cache hit ratio

[Figure: cache and physical memory connected over the memory bus through the memory controller]
Slide 4
The Impulse Idea: What if software could do the following?

Create diag[*] corresponding to A[*][*]
for (i = 0; i < n; i++)
    sum += diag[i];

Improvements:
– No wasted bus bandwidth
– Better cache utilization
– Higher cache and TLB hit ratios

[Figure: cache and physical memory connected over the memory bus through the memory controller]
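In software, the gather that the controller performs for this remapping can be sketched as follows (a minimal emulation with illustrative names; real Impulse hardware assembles the dense cache lines in the memory controller without making a copy):

```c
#include <assert.h>
#include <stddef.h>

#define N 4

/* Software emulation of the diag[*] remapping: gather A[i][i] into a
 * dense vector so that every word moved over the bus is useful. */
static void gather_diag(double A[N][N], double diag[N]) {
    for (size_t i = 0; i < N; i++)
        diag[i] = A[i][i];
}

/* The application's loop then streams through the dense vector. */
static double sum_diag(const double diag[N]) {
    double sum = 0.0;
    for (size_t i = 0; i < N; i++)
        sum += diag[i];
    return sum;
}
```

With the gather in place, the application loop touches N contiguous words instead of N widely spaced cache lines.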
Slide 5
How? Add an Extra Level of Mapping

Shadow address: “unused” physical address
MC maps shadow addresses to physical addresses
Applications configure the MC through the OS

[Figure: virtual space mapped by the MMU/TLB into physical space; shadow address space mapped by the Impulse MC onto real physical memory]
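The extra mapping level can be illustrated with the address arithmetic for the diagonal remapping (a sketch of the translation, not the hardware implementation; names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* For the diagonal of an n-by-n row-major matrix of esize-byte
 * elements, offset i within the shadow "diag" region resolves to the
 * real physical address base + i*(n+1)*esize: successive diagonal
 * elements are one full row plus one element apart. */
static uint64_t shadow_diag_to_phys(uint64_t base, uint64_t i,
                                    uint64_t n, uint64_t esize) {
    return base + i * (n + 1) * esize;
}
```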
Slide 6
Address Translations

Conventional system:
Virtual Memory → (MMU/TLB) → Physical Memory

Impulse system:
Virtual Memory → (MMU/TLB) → Shadow Memory → (word-grained remapping, e.g. diagonal) → Pseudo-Virtual Memory → (page-grained) → Physical Memory
Slide 7
Impulse Features
Base-stride scatter/gather data
– Walk columns or diagonals efficiently
– Remap matrix tiles to contiguous memory without copying
Indirection vector accesses
– Static vectors (e.g., perform A[index[i]] efficiently)
– Dynamic cacheline assembly
Remap pages
– Create superpages from disjoint base pages
– No-copy page coloring
Aggressive controller-based prefetching
– Prefetch data from DRAMs (sequential and pointer-directed)
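The indirection-vector feature can be emulated in software as an explicit gather (illustrative names; the hardware performs this during cache-line fill, without copying):

```c
#include <assert.h>
#include <stddef.h>

/* Software emulation of an indirection-vector remapping: the MC
 * gathers A[index[i]] so that the CPU sees a dense vector. */
static void gather_indirect(const int *A, const size_t *index,
                            int *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = A[index[i]];
}
```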
Slide 8
Exploiting Impulse
Setup:
1. Application asks OS to set up remapping
2. OS allocates free shadow configuration register
   • sets up dense “page table” that points to target data
   • downloads address of this page table to configuration register
3. OS allocates free shadow and virtual address space
   • maps application virtual addresses to shadow physical addresses
   • returns virtual address corresponding to remapped data to app

Use:
1. TLB translation (VA to shadow)
2. Fine-grained remapping (if any)
3. Remapped addresses pass through MC-TLB
4. DRAM scheduler “collects” data
5. Application accesses (dense) remapped data
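The setup and use steps can be modeled in miniature for a base-stride remapping (all names and structures here are hypothetical, chosen to mirror the steps above, not the actual Impulse interface):

```c
#include <assert.h>
#include <stdint.h>

/* Toy shadow configuration register for a base-stride remapping. */
typedef struct {
    uint64_t shadow_base;  /* start of the shadow region           */
    uint64_t phys_base;    /* physical address of the target data  */
    uint64_t stride;       /* bytes between remapped elements      */
    uint64_t esize;        /* element size in bytes                */
} shadow_reg;

/* "Setup": the OS fills a free configuration register. */
static void setup_remap(shadow_reg *r, uint64_t shadow_base,
                        uint64_t phys_base, uint64_t stride,
                        uint64_t esize) {
    r->shadow_base = shadow_base;
    r->phys_base   = phys_base;
    r->stride      = stride;
    r->esize       = esize;
}

/* "Use", step 2: fine-grained remapping of a shadow address into the
 * real physical address the MC's DRAM scheduler will fetch. */
static uint64_t remap(const shadow_reg *r, uint64_t shadow_addr) {
    uint64_t i = (shadow_addr - r->shadow_base) / r->esize;
    return r->phys_base + i * r->stride;
}
```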
Slide 9
Architecture Overview
[Figure: MMC block diagram: register file, shadow engines with MTLBs, shadow staging unit, prefetch unit, request queue, scoreboard, output buffer, writeback buffer, and per-bank DRAM controllers; connected to the system bus (data, coherence, address) and I/O]
Slide 11
Benchmarks

Fine-grained remapping benchmarks
– Conjugate gradient (core of DARPA vision benchmark)
– Ray tracing
Page-grained remapping benchmarks
– SPEC95 (dynamic superpage promotion)
– Compress (no-copy page coloring)
Prefetching benchmarks
– SPECint 95 suite (3-15% performance improvement)
– Synthetic tree microbenchmarks
Slide 12
Conjugate Gradient
[Figure: sparse matrix A times vector P producing B, with A stored as the Row, Data, and Column arrays]

Store logical sparse matrix A using the Yale storage scheme:
– Data stores non-zero elements (much larger than P)
– Row[i] indicates where the ith row begins in Data
– Column[i] is the column number of Data[i]
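A tiny worked example of the Yale scheme (values chosen here purely for illustration), for the 3x3 sparse matrix with nonzeros 5 at (0,0), 8 at (1,1), 3 at (1,2), and 6 at (2,2):

```c
#include <assert.h>

/* Data holds the nonzeros row by row, Column their column numbers,
 * and Row[i] the index in Data where row i begins; the final entry
 * of Row marks the end of the last row. */
static const double Data[]   = { 5.0, 8.0, 3.0, 6.0 };
static const int    Column[] = { 0, 1, 2, 2 };
static const int    Row[]    = { 0, 1, 3, 4 };
```

Row[i+1] - Row[i] gives the number of nonzeros in row i, which is how the inner loop bounds on the next slide are formed.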
Slide 13
Optimizing Conjugate Gradient
Original Code:

for i = 0 to n-1 do
    sum = 0;
    for j = Row[i] to Row[i+1]-1 do
        sum += Data[j] * P[Col[j]];
    b[i] = sum;

Optimized Code:

Pi = remap_indirect(P, Col, n, …);
for i = 0 to n-1 do
    sum = 0;
    for j = Row[i] to Row[i+1]-1 do
        sum += Data[j] * Pi[j];
    b[i] = sum;

Issues (original):
• Data and Col are large streams
• P reusable, but forced out of cache
• Poor L1 cache hit rates
• Interference in L2 cache

Issues (optimized):
• Indirect access to P[Col[j]] turned into sequential streaming access
• No reuse on P now
• Side effect: eliminates accesses to Col
• Significant improvement to hit rates (both L1 and TLB)
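The transformation can be checked with a software stand-in for remap_indirect, modeled here as an explicit gather Pi[j] = P[Col[j]] (the real controller gathers at cache-line fill time with no copy; function names are illustrative):

```c
#include <assert.h>

/* Original loop: indirect accesses P[Col[j]]. */
static void spmv_orig(const double *Data, const int *Row,
                      const int *Col, const double *P,
                      double *b, int n) {
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = Row[i]; j <= Row[i + 1] - 1; j++)
            sum += Data[j] * P[Col[j]];
        b[i] = sum;
    }
}

/* Optimized loop: Pi is the gathered (remapped) vector, so every
 * access is a sequential stream. */
static void spmv_remapped(const double *Data, const int *Row,
                          const double *Pi, double *b, int n) {
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = Row[i]; j <= Row[i + 1] - 1; j++)
            sum += Data[j] * Pi[j];
        b[i] = sum;
    }
}
```

Both loops compute the same product b = A*P; only the access pattern changes.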
Slide 14
Conjugate Gradient Results
Base Impulse
Time (cycles) 5.48B 1.77B
L1 hit ratio 63.4% 77.8%
L2 hit ratio 19.7% 15.9%
TLB cycles 10.1M 0.5M
Speedup --- 3.1X
Significant improvement in effective cache locality
Slide 15
Volume Rendering: Ray Tracing
Problem: Ray traversals are “random” memory accesses
Solution: Calculate addresses of rays as an “indirection vector”
Access rays via Impulse-remapped data structure
Slide 16
Volume Rendering Results
Orig (A) Impulse (A) Orig (B) Impulse (B)
Time 264M 185M 1440M 285M
L1 hit ratio 96.8% 96.6% 86.3% 91.7%
L2 hit ratio 0.8% 0.9% 0.4% 6.2%
TLB cycles 0.30M 0.31M 259M 0.13M
Speedup -- 1.4X -- 6.1X
A: rays follow natural memory layout (X axis)
B: rays perpendicular to natural memory layout (Z axis)
Slide 17
Coarse-Grained Remappings

Page-grained remapping: aggressive use of synthetic superpages
– modified kernel TLB miss handler to detect pages responsible for frequent TLB misses
– create superpages by page-grained remapping on the memory controller
– no copying, therefore can be far more aggressive
No-copy page coloring
– Problem: conflicts in the physically-indexed L2 cache
– Normal solution: copy to non-conflicting pages
– Impulse solution: remap to non-conflicting pages
Slide 18
Shadow-Backed Superpages

[Figure: scattered virtual pages 0x00004000–0x00007000 mapped to a contiguous run of shadow addresses 0x80240000–0x80243000, which the MC backs with scattered physical pages 0x40138000, 0x06155000, 0x04012000, 0x12011000]

SPECint95 improves 5-20%
MTLB increases effective reach of CPU TLB
Superpages can cover large and multiple arrays identified at compile time
– at allocation time (cheapest) or dynamically
Slide 19
MMC-Based Prefetching
Idea: Prefetch data from the DRAMs into SRAM on the MMC

Misprediction penalties significantly reduced, since mispredicted prefetches cost neither:
– conflict misses due to cache capacity limitations
– system bus bandwidth

Exploits “free” DRAM bandwidth at the MMC level
– higher aggregate DRAM bandwidth than cache or bus bandwidth

Reduces latency of accesses that hit in the prefetch cache
Slide 20
Pointer-based Microbenchmarks
Random walk down a tree with N children per node
– vary number of children from 1 (linked list) to 3 (trinary tree)
Baseline: compiler-directed prefetching
Impulse: MMC prefetches next nodes in tree (1-ahead)
– allocate nodes in shadow region
– tell MMC what offsets represent pointers

[Figure: root node with children Child1 … ChildN, each of which has children Child1 … ChildN of its own]
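A toy model of the 1-ahead pointer prefetch (node layout and names here are illustrative, not the actual MMC interface): once told which fields of a node hold pointers, the controller can dereference them as each node is fetched and stage the children in its prefetch buffer:

```c
#include <assert.h>
#include <stddef.h>

#define NKIDS 3

/* A tree node with up to NKIDS child pointers at known offsets. */
typedef struct node {
    int value;
    struct node *child[NKIDS];
} node;

/* Model of the MMC's 1-ahead prefetch: for one fetched node, collect
 * the children it points to into the staging buffer.  Returns how
 * many children were staged. */
static int prefetch_children(const node *n, const node *staged[]) {
    int count = 0;
    for (int i = 0; i < NKIDS; i++)
        if (n->child[i] != NULL)
            staged[count++] = n->child[i];
    return count;
}
```

Whichever child the random walk visits next is then already in the prefetch buffer, hiding the DRAM latency of the pointer chase.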
Slide 21
Pointer Prefetching Results
P1 (N) P1 (C) P1 (I) P3 (N) P3 (C) P3 (I)
Time 100M 99.7M 84.7M 124M 197M 109M
L1 hit ratio 67.5% 98.8% 67.5% 68.2% 97.9% 68.2%
L2 hit ratio 0.4% 0.1% 0.4% 0.4% 0.3% 0.5%
TLB cycles 1.6M 1.2M 1.6M 6.2M 6.2M 6.0M
Speedup --- 1.0X 1.2X --- 0.63X 1.14X
P1(N): singly-linked list, no prefetching
P3(C): triply-linked list, compiler-directed prefetching
P#(I): Impulse MMC-directed prefetching
Slide 22
Prototyping Status
Four-stage prototype strategy:
I: Slow conventional MMC
II: Fast conventional MMC
III: Impulse on an FPGA
IV: Impulse in an ASIC

Current Status:
Stage I complete (pictured)
Stage II imminent (final testing)
Stage III underway (3/01)
Stage IV next year (12/01)
Slide 23
Summary

Impulse Benefits
– Higher memory bus utilization
– Higher cache utilization
– Turns sparse memory operations into dense ones
Range of optimizations
– Fine-grained data remapping
– Page-grained data remapping
– Memory-based prefetching
Impact
– Performance increase for small increase in cost
– Does not require changes to CPUs, caches, or DRAMs
Slide 24
Questions?
http://www.cs.utah.edu/impulse