Design and Analysis of a Robust Pipelined Memory System

Hao Wang†, Haiquan (Chuck) Zhao*, Bill Lin†, and Jun (Jim) Xu*

†University of California, San Diego
*Georgia Institute of Technology

Infocom 2010, San Diego

Memory Wall

• Modern Internet routers need to manage large amounts of packet- and flow-level data at line rates

• e.g., need to maintain per-flow records during a monitoring period, but

– Core routers have millions of flows, translating to 100's of megabytes of storage

– On a 40 Gb/s OC-768 link, a new packet can arrive every 8 ns


Memory Wall

• SRAM/DRAM dilemma

• SRAM: access latency typically between 5 and 15 ns (fast enough for the 8 ns line rate)

• But SRAM capacity is substantially inadequate in many cases: typically 4 MB at most (much less than the 100's of MBs needed)


Memory Wall

• DRAM provides inexpensive bulk storage

• But random access latency is typically 50-100 ns (much slower than the 8 ns needed for a 40 Gb/s line rate)

• Conventional wisdom is that DRAMs are not fast enough to keep up with ever-increasing line rates


Memory Design Wish List

• Line rate memory bandwidth (like SRAM)

• Inexpensive bulk storage (like DRAM)

• Predictable performance

• Robustness to adversarial access patterns


Main Observation

• Modern DRAMs can be fast and cheap!

– Graphics, video games, and HDTV

– At commodity pricing, just $0.01/MB currently, $20 for 2GB!

Example: Rambus XDR Memory

• 16 internal banks


Memory Interleaving

• Performance is achieved through memory interleaving

– e.g., suppose we have B = 6 DRAM banks and the access pattern is sequential

– Effective memory bandwidth is B times faster

[Figure: six DRAM banks serving a sequential access pattern 1, 2, 3, ..., 12; addresses 1, 7, 13, ... map to bank 1, addresses 2, 8, 14, ... to bank 2, and so on, so consecutive requests land on different banks]
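The bank mapping in the figure can be sketched in Python (a toy model of my own; the real mapping lives in the memory controller):

```python
B = 6  # number of DRAM banks, as in the slide's example

def bank_of(addr):
    # addresses 1, 7, 13, ... -> bank 0; 2, 8, 14, ... -> bank 1; etc.
    return (addr - 1) % B

# A sequential pattern 1..12 visits each bank exactly twice, so the
# B banks can be kept busy in parallel (B-fold effective bandwidth).
hits = [bank_of(a) for a in range(1, 13)]
counts = {b: hits.count(b) for b in range(B)}
```

With a sequential stream, every window of B consecutive addresses touches every bank once, which is what makes the B-fold speedup possible.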

Memory Interleaving

• But suppose the access pattern is as follows:

• Memory bandwidth degrades to the worst-case DRAM latency

[Figure: the same six banks, with an access pattern 1, 7, 13, 19, 25, ... that hits bank 1 on every request]

Memory Interleaving

• One solution is to apply pseudo-randomization of memory locations

[Figure: the six banks again, with addresses scattered across banks by a pseudo-random permutation]

Adversarial Access Patterns

• However, memory bandwidth can still degrade to the worst-case DRAM latency even with randomization:

1. Lookups to the same global variable will trigger accesses to the same memory bank

2. An attacker can flood packets with the same TCP/IP header, triggering updates to the same memory location and memory bank, regardless of the randomization function
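Why randomization alone does not help can be shown with a toy Python sketch (the table-based permutation and addresses here are my own illustration, not the paper's):

```python
import random

B = 6
rng = random.Random(0)           # fixed seed: a toy pseudo-random permutation
perm = list(range(4096))
rng.shuffle(perm)

def bank_of(addr):
    return perm[addr] % B

# A sequential scan is spread across banks by the permutation...
seq_banks = {bank_of(a) for a in range(100)}

# ...but an attacker repeating ONE address (e.g. one flow's counter)
# always lands on the same bank, whatever the permutation is:
flood_banks = {bank_of(42) for _ in range(1000)}
```

The permutation only reshuffles which bank a given address lives in; repeated accesses to one address are deterministic, so the attacked bank's queue still grows at line rate.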

Outline

• Problem and Background

→Proposed Design

• Theoretical Analysis

• Evaluation


Pipelined Memory Abstraction: Emulates SRAM with Fixed Delay

[Figure: timing diagrams of an ideal SRAM and the SRAM emulation; the same sequence of reads (R) and writes (W) to locations a, b, c is issued at times 0 through 5, but the emulation returns each result exactly D cycles later, at times D through D+5]

Implications of Emulation

• Fixed pipeline delay: If a read operation is issued at time t to an emulated SRAM, the data is available from the memory controller at exactly t + D (instead of same cycle).

• Coherency: The read operations output the same results as an ideal SRAM system.


Proposed Solution: Basic Idea

• Keep an SRAM reservation table of the memory operations and data that occurred in the last C cycles

• Avoid issuing a new DRAM operation for memory references to the same location within C cycles


Details of Memory Architecture

[Figure: block diagram — input operations pass through a random address permutation into B request buffers, one per DRAM bank; a reservation table of C entries (op, addr, data, R-link) supplies the data out, with MRI and MRW CAM tables, each of size C, indexing it]

Merging of Operations

• Requests arrive from right to left (in "X + Y" below, Y is the earlier request).

1. READ + WRITE → WRITE: the read copies its data from the write

2. WRITE + WRITE → WRITE: the 2nd write overwrites the 1st write

3. READ + READ → READ: the 2nd read copies its data from the 1st read
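The three merge rules can be sketched in Python (my own toy encoding of operations, not the paper's hardware data structures):

```python
def merge(new_op, old_op):
    """Merge two operations to the same address issued within C cycles.

    Ops are ("R", None) for a read or ("W", data) for a write; old_op
    arrived first. Returns (pending_op, data_for_new_read):
      pending_op        -- the single operation left pending for DRAM
      data_for_new_read -- data served to the new read from the table, if any
    """
    new_kind, new_data = new_op
    old_kind, old_data = old_op
    if new_kind == "R" and old_kind == "W":
        return ("W", old_data), old_data   # 1. read copies data from the write
    if new_kind == "W" and old_kind == "W":
        return ("W", new_data), None       # 2. 2nd write overwrites the 1st
    if new_kind == "R" and old_kind == "R":
        return ("R", None), None           # 3. 2nd read copies the 1st read's result
    return new_op, None                    # no merge otherwise
```

In every merged case only one operation remains pending, which is what keeps repeated same-address traffic from multiplying DRAM accesses.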

Proposed Solution

• Rigorously prove that, with merging, the worst-case delay for a memory operation is bounded by some fixed D w.h.p.

• Provide a pipelined memory abstraction in which operations issued at time t complete exactly at time t + D (instead of in the same cycle).

• The reservation table, with C > D, is also used to implement the pipeline delay, as well as serving as a "cache".

Outline

• Problem and Background

• Proposed Design

→Theoretical Analysis

• Evaluation


Robustness

• At most one write operation to a particular memory address enters a request buffer every C cycles.

• At most one read operation to a particular memory address enters a request buffer every C cycles.

• At most one read operation followed by one write operation to a particular address enters a request buffer every C cycles.


Theoretical Analysis

• Worst case analysis

• Convex ordering

• Large deviation theory

• Prove: with a cache of size C, the best an attacker can do is to send repetitive requests every C+1 cycles.

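The "repeat every C + 1 cycles" claim can be illustrated with a toy Python model (an assumption of my own: a dict of last-seen times stands in for the reservation table):

```python
def dram_requests(timed_accesses, C):
    """Count accesses that reach DRAM when any repeat of an address
    seen within the last C cycles is merged away (toy model)."""
    last_seen = {}
    sent = 0
    for t, addr in timed_accesses:
        if addr not in last_seen or t - last_seen[addr] > C:
            sent += 1          # not covered by the table: goes to DRAM
        last_seen[addr] = t    # every operation (re)enters the table
    return sent

C = 4
close  = [(t, 0) for t in range(0, 50, C)]      # repeats every C cycles
spaced = [(t, 0) for t in range(0, 50, C + 1)]  # repeats every C + 1 cycles
```

At spacing C the repeats are all merged (a single DRAM request ever); at spacing C + 1 every repeat reaches DRAM, which is why C + 1 is the attacker's best cadence.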

Bound on Overflow Probability

• Want to bound the probability that a request buffer overflows in n cycles:

Pr[overflow] ≤ Σ_{0 ≤ s < t ≤ n} Pr[D_{s,t}], where D_{s,t} := {X_{s,t} − μ(t − s) ≥ K}

• X_{s,t} is the number of updates to a bank during cycles [s, t], μ is the service rate, and K is the length of a request queue.

• For the total overflow probability bound, multiply by B.

Chernoff Inequality

• Pr[D_{s,t}] ≤ Pr[X_{s,t} ≥ K]

• For any θ > 0: Pr[X ≥ K] = Pr[e^{θX} ≥ e^{θK}] ≤ E[e^{θX}] / e^{θK}

• Since this is true for all θ > 0:

Pr[D_{s,t}] ≤ min_{θ > 0} E[e^{θX}] / e^{θK}

• We want to find the update sequence that maximizes E[e^{θX}]

Worst Case Request Patterns

• q1 + q2 + 1 requests for distinct counters a_1, ..., a_{q1+q2+1}

• q1 of the counters are requested 2T times each

• q2 of the counters are requested 2T − 1 times each

• 1 counter is requested r times, so that the total number of requests is 2T·q1 + (2T − 1)·q2 + r

Outline

• Problem and Background

• Proposed Design

• Theoretical Analysis

→Evaluation


Evaluation

• Overflow probability for 16 million addresses, µ=1/10, and B=32.

[Figure: overflow probability bound (log scale, 1 down to 10^-14) vs. queue length K (80 to 180), for C = 6000, 7000, 8000, and 9000]

• SRAM 156 KB, CAM 24 KB

Evaluation

• Overflow probability for 16 million addresses, µ=1/10, and C=8000.

[Figure: overflow probability bound (log scale, 1 down to 10^-30) vs. request buffer size K (80 to 180), for B = 32, 34, 36, and 38]

Conclusion

• Proposed a robust memory architecture that provides the throughput of SRAM with the density of DRAM.

• Unlike conventional caching, which has unpredictable hit/miss performance, our design guarantees w.h.p. a pipelined memory abstraction that can accept a new memory operation every cycle with a fixed pipeline delay.

• Used convex ordering and large deviation theory to rigorously prove robustness under adversarial accesses.

Thank You