Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5...

44
Tiered-Latency DRAM: A Low Latency and A Low Cost DRAM Architecture Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, Onur Mutlu

Transcript of Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5...

Page 1: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

1

Tiered-Latency DRAM:A Low Latency and A Low Cost

DRAM Architecture

Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, Onur Mutlu

Page 2: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

2

Executive Summary

• Problem: DRAM latency is a critical performance bottleneck

• Our Goal: Reduce DRAM latency with low area cost

• Observation: Long bitlines in DRAM are the dominant source of DRAM latency

• Key Idea: Divide long bitlines into two shorter segments

–Fast and slow segments

• Tiered-latency DRAM: Enables latency heterogeneity in DRAM

–Can leverage this in many ways to improve performance and reduce power consumption

• Results: When the fast segment is used as a cache to the slow segment Significant performance improvement (>12%) and power reduction (>23%) at low area cost (3%)

Page 3: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

3

Outline

• Motivation & Key Idea

• Tiered-Latency DRAM

• Leveraging Tiered-Latency DRAM

• Evaluation Results

Page 4: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

4

Historical DRAM Trend

0

20

40

60

80

100

0.0

0.5

1.0

1.5

2.0

2.5

2000 2003 2006 2008 2011

Late

ncy

(n

s)

Cap

acit

y (G

b)

Year

Capacity Latency (tRC)

16X

-20%

DRAM latency continues to be a critical bottleneck

Page 5: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

5

DRAM Latency = Subarray Latency + I/O Latency

What Causes the Long Latency?DRAM Chip

channel

cell array

I/O

DRAM Chip

channel

I/O

subarray

DRAM Latency = Subarray Latency + I/O Latency

DominantSu

bar

ray

I/O

Page 6: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

6

Why is the Subarray So Slow?

Subarray

Bit

line:

51

2 c

ells

extremely large sense amplifier(≈100X the cell size)

cap

acit

or access

transistor

wordline

bit

line

Cell

Ro

w d

eco

der

Sense amplifier

Long Bitline: Amortize sense amplifier → Small areaLong Bitline: Large bitline cap. → High latency

cell

Page 7: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

7

Trade-Off: Area (Die Size) vs. Latency

Faster

Smaller

Short BitlineLong Bitline

Trade-Off: Area vs. Latency

Page 8: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

8

Trade-Off: Area (Die Size) vs. Latency

0

1

2

3

4

0 10 20 30 40 50 60 70

No

rmal

ize

d D

RA

M A

rea

Latency (ns)

64

32

128

256 512 cells/bitline

Commodity DRAM

Long Bitline

Ch

eap

er

Faster

Fancy DRAMShort Bitline

Page 9: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

9

Short Bitline

Low Latency

Approximating the Best of Both Worlds

Long Bitline

Small Area

Long Bitline

Low Latency

Short BitlineOur Proposal

Small Area

Short Bitline Fast

Need Isolation

Add Isolation Transistors

High Latency

Large Area

Page 10: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

10

Approximating the Best of Both Worlds

Low Latency

Our Proposal

Small Area Long BitlineSmall Area

Long Bitline

High Latency

Short Bitline

Low Latency

Short Bitline

Large Area

Tiered-Latency DRAM

Low Latency

Small area using long

bitline

Page 11: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

11

Outline

• Motivation & Key Idea

• Tiered-Latency DRAM

• Leveraging Tiered-Latency DRAM

• Evaluation Results

Page 12: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

12

Tiered-Latency DRAM

Near Segment

Far Segment

Isolation Transistor

• Divide a bitline into two segments with an isolation transistor

Sense Amplifier

Page 13: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

13

Far SegmentFar Segment

Near Segment Access

Near Segment

Isolation Transistor

• Turn off the isolation transistor

Isolation Transistor (off)

Sense Amplifier

Reduced bitline capacitance

Low latency & low power

Reduced bitline length

Page 14: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

14

Near SegmentNear Segment

Far Segment Access

• Turn on the isolation transistor

Far Segment

Isolation TransistorIsolation Transistor (on)

Sense Amplifier

Large bitline capacitance

Additional resistance of isolation transistor

Long bitline length

High latency & high power

Page 15: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

15

Latency, Power, and Area Evaluation• Commodity DRAM: 512 cells/bitline

• TL-DRAM: 512 cells/bitline– Near segment: 32 cells

– Far segment: 480 cells

• Latency Evaluation– SPICE simulation using circuit-level DRAM model

• Power and Area Evaluation– DRAM area/power simulator from Rambus

– DDR3 energy calculator from Micron

Page 16: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

16

0%

50%

100%

150%

0%

50%

100%

150%

Commodity DRAM vs. TL-DRAM La

ten

cy

Po

we

r

–56%

+23%

–51%

+49%

• DRAM Latency (tRC) • DRAM Power

• DRAM Area Overhead~3%: mainly due to the isolation transistors

TL-DRAMCommodity

DRAM

Near Far Commodity DRAM

Near Far

TL-DRAM

(52.5ns)

Page 17: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

17

Latency vs. Near Segment Length

0

20

40

60

80

1 2 4 8 16 32 64 128 256 512

Near Segment Far Segment

Late

ncy

(n

s)

Longer near segment length leads to higher near segment latency

Page 18: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

18

Latency vs. Near Segment Length

0

20

40

60

80

1 2 4 8 16 32 64 128 256 512

Near Segment Far Segment

Late

ncy

(n

s)

Far segment latency is higher than commodity DRAM latency

Far Segment Length = 512 – Near Segment Length

Page 19: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

19

Trade-Off: Area (Die-Area) vs. Latency

0

1

2

3

4

0 10 20 30 40 50 60 70

No

rmal

ize

d D

RA

M A

rea

Latency (ns)

64

32

128256 512 cells/bitline

Ch

eap

er

Faster

Near Segment Far Segment

Page 20: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

20

Outline

• Motivation & Key Idea

• Tiered-Latency DRAM

• Leveraging Tiered-Latency DRAM

• Evaluation Results

Page 21: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

21

Leveraging Tiered-Latency DRAM

• TL-DRAM is a substrate that can be leveraged by the hardware and/or software

• Many potential uses1. Use near segment as hardware-managed inclusive

cache to far segment

2. Use near segment as hardware-managed exclusivecache to far segment

3. Profile-based page mapping by operating system

4. Simply replace DRAM with TL-DRAM

Page 22: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

22

subarray

Near Segment as Hardware-Managed Cache

TL-DRAM

I/O

cache

mainmemory

• Challenge 1: How to efficiently migrate a row between segments?

• Challenge 2: How to efficiently manage the cache?

far segment

near segmentsense amplifier

channel

Page 23: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

23

Inter-Segment Migration

Near Segment

Far Segment

Isolation Transistor

Sense Amplifier

Source

Destination

• Goal: Migrate source row into destination row

• Naïve way: Memory controller reads the source row byte by byte and writes to destination row byte by byte

→ High latency

Page 24: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

24

Inter-Segment Migration• Our way:

– Source and destination cells share bitlines

– Transfer data from source to destination across shared bitlines concurrently

Near Segment

Far Segment

Isolation Transistor

Sense Amplifier

Source

Destination

Page 25: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

25

Inter-Segment Migration

Near Segment

Far Segment

Isolation Transistor

Sense Amplifier

• Our way: – Source and destination cells share bitlines

– Transfer data from source to destination acrossshared bitlines concurrently

Step 2: Activate destination row to connect cell and bitline

Step 1: Activate source row

Additional ~4ns over row access latency

Migration is overlapped with source row access

Page 26: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

26

subarray

Near Segment as Hardware-Managed Cache

TL-DRAM

I/O

cache

mainmemory

• Challenge 1: How to efficiently migrate a row between segments?

• Challenge 2: How to efficiently manage the cache?

far segment

near segmentsense amplifier

channel

Page 27: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

27

Three Caching Mechanisms

Time

Caching

Time

Baseline

Row 1

Row 1 Row 2

Row 2Wait-inducing row Wait until finishing Req1

Cached rowReduced wait

Is there another benefit of caching?

Req. for Row 1

Req. for Row 2

Req. for Row 1

Req. for Row 2

1. SC (Simple Caching)

– Classic LRU cache

– Benefit: Reduced reuse latency

2. WMC (Wait-Minimized Caching)

– Identify and cache only wait-inducing rows

– Benefit: Reduced wait

3. BBC (Benefit-Based Caching)

– BBC ≈ SC + WMC

– Benefit: Reduced reuse latency & reduced wait

Page 28: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

28

Outline

• Motivation & Key Idea

• Tiered-Latency DRAM

• Leveraging Tiered-Latency DRAM

• Evaluation Results

Page 29: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

29

Evaluation Methodology• System simulator

– CPU: Instruction-trace-based x86 simulator

– Memory: Cycle-accurate DDR3 DRAM simulator

• Workloads– 32 Benchmarks from TPC, STREAM, SPEC CPU2006

• Metrics– Single-core: Instructions-Per-Cycle

– Multi-core: Weighted speedup

Page 30: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

30

Configurations• System configuration

– CPU: 5.3GHz

– LLC: 512kB private per core

– Memory: DDR3-1066• 1-2 channel, 1 rank/channel

• 8 banks, 32 subarrays/bank, 512 cells/bitline

• Row-interleaved mapping & closed-row policy

• TL-DRAM configuration– Total bitline length: 512 cells/bitline

– Near segment length: 1-256 cells

Page 31: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

31

Single-Core: Performance & Power

0%

3%

6%

9%

12%

15%SC WMC BBC

0%

20%

40%

60%

80%

100%SC WMC BBC

IPC

Imp

rove

me

nt

No

rmal

ize

d P

ow

er12.7% –23%

Using near segment as a cache improves performance and reduces power consumption

Page 32: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

32

Single-Core: Varying Near Segment Length

0%

3%

6%

9%

12%

15%

1 2 4 8 16 32 64 128 256

SC WMC BBC

IPC

Imp

rove

me

nt

Near Segment Length (cells)

By adjusting the near segment length, we can trade off cache capacity for cache latency

Larger cache capacity

Higher caching latency

Maximum IPCImprovement

Page 33: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

33

Dual-Core Evaluation

• We categorize single-core benchmarks into two categories1. Sens: benchmarks whose performance is sensitive

to near segment capacity

2. Insens: benchmarks whose performance is insensitive to near segment capacity

• Dual-core workload categorization1. Sens/Sens

2. Sens/Insens

3. Insens/Insens

Page 34: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

34

Dual-Core: Sens/Sens

0%

5%

10%

15%

20%

16 32 64 128

SC WMC BBC

Pe

rfo

rman

ce I

mp

rov.

BBC/WMC show more perf. improvement

Near segment length (cells)

Larger near segment capacity leads to higher performance improvement in sensitive workloads

Page 35: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

35

Dual-Core: Sens/Insens & Insens/Insens

0%

5%

10%

15%

20%

16 32 64 128

SC WMC BBC

Near segment length

Using near segment as a cache provides high performance improvement regardless of near segment capacity

Pe

rfo

rman

ce I

mp

rov.

Page 36: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

36

Other Mechanisms & Results in Paper

• More mechanisms for leveraging TL-DRAM– Hardware-managed exclusive caching mechanism

– Profile-based page mapping to near segment

– TL-DRAM improves performance and reduces power consumption with other mechanisms

• More than two tiers– Latency evaluation for three-tier TL-DRAM

• Detailed circuit evaluationfor DRAM latency and power consumption– Examination of tRC and tRCD

• Implementation details and storage cost analysis in memory controller

Page 37: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

37

Conclusion

• Problem: DRAM latency is a critical performance bottleneck

• Our Goal: Reduce DRAM latency with low area cost

• Observation: Long bitlines in DRAM are the dominant source of DRAM latency

• Key Idea: Divide long bitlines into two shorter segments

–Fast and slow segments

• Tiered-latency DRAM: Enables latency heterogeneity in DRAM

–Can leverage this in many ways to improve performance and reduce power consumption

• Results: When the fast segment is used as a cache to the slow segment Significant performance improvement (>12%) and power reduction (>23%) at low area cost (3%)

Page 38: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

38

Thank You

Page 39: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

39

Tiered-Latency DRAM:A Low Latency and A Low Cost

DRAM Architecture

Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, Onur Mutlu

Page 40: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

40

Backup Slides

Page 41: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

41

Storage Cost in Memory Controller• Organization

– Bitline Length: 512 cells/bitline

– Near Segment Length: 32 cells

– Far Segment Length: 480 cells

– Inclusive Caching

• Simple caching and wait-minimized caching– Tag Storage: 9 KB

– Replace Information: 5 KB

• Benefit-based caching– Tag storage: 9 KB

– Replace Information: 8 KB (8 bit benefit field/near segment row)

Page 42: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

42

0%

5%

10%

15%

1 (1-ch) 2 (2-ch) 4 (4-ch)

Performance ImprovementPower Reduction

Pe

rf.I

mp

rove

me

nt

& P

ow

er

Re

du

ctio

n

Core-count (# of memory channels)

Hardware-managed Exclusive Cache• Near and Far segment: Main memory

• Caching: Swapping near and far segment row– Need one dummy row to swap

Performance improvement is lower than Inclusive caching due to high swapping latency

7.2% 8.9%9.9%9.4%

11.4%14.3%

Page 43: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

43

0%5%

10%15%20%25%30%

1 (1-ch) 2 (2-ch) 4 (4-ch)

Performance Improvement

Power Reduction

Pe

rf.I

mp

rove

me

nt

& P

ow

er

Re

du

ctio

n

Core-count (# of memory channels)

Profile-Based Page Mapping

• Operating system profiles applications and maps frequently accessed rows to the near segment

Allocating frequently accessed rows in the near segment provides performance improvement

8.9%11.6%

7.2%

19%

24.8%21.5%

Page 44: Tiered-Latency DRAMece447/s14/lib/exe/fetch.php?...4 Historical DRAM Trend 0 20 40 60 80 100 0.0 0.5 1.0 1.5 2.0 2.5 2000 2003 2006 2008 2011) (ns) Year Capacity Latency (tRC) 16X-20%

44

Three-Tier Analysis

• Three tiers– Add two isolation transistors

– Near/Mid/Far segment length: 32/224/256 Cells

More tiers enable finer-grained caching and partitioning mechanisms

0%

30%

60%

90%

120%

150%

180%

Late

ncy

–56%

Three-Tier TL-DRAMCommodity

DRAM

Near Mid Far

–23%

57%