1
Tiered-Latency DRAM: A Low Latency and A Low Cost DRAM Architecture
Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, Onur Mutlu
2
Executive Summary
• Problem: DRAM latency is a critical performance bottleneck
• Our Goal: Reduce DRAM latency with low area cost
• Observation: Long bitlines in DRAM are the dominant source of DRAM latency
• Key Idea: Divide long bitlines into two shorter segments
– Fast and slow segments
• Tiered-latency DRAM: Enables latency heterogeneity in DRAM
– Can leverage this in many ways to improve performance and reduce power consumption
• Results: When the fast segment is used as a cache to the slow segment → Significant performance improvement (>12%) and power reduction (>23%) at low area cost (3%)
3
Outline
• Motivation & Key Idea
• Tiered-Latency DRAM
• Leveraging Tiered-Latency DRAM
• Evaluation Results
4
Historical DRAM Trend
[Figure: DRAM capacity (Gb) and latency (tRC, ns) by year, 2000 to 2011. Capacity: 16X. Latency: -20%.]
DRAM latency continues to be a critical bottleneck
5
What Causes the Long Latency?
[Figure: DRAM chip attached to the channel; the cell array is made of subarrays and connects to the channel through the I/O]
DRAM Latency = Subarray Latency + I/O Latency
• Subarray latency is the dominant component
6
Why is the Subarray So Slow?
[Figure: subarray with a row decoder, cells, and sense amplifiers. Each cell is a capacitor plus an access transistor, connected to a wordline and a bitline. Bitline: 512 cells.]
• Extremely large sense amplifier (≈100X the cell size)
• Long bitline: Amortize sense amplifier → Small area
• Long bitline: Large bitline cap. → High latency
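The amortization argument can be made concrete with a back-of-the-envelope model. The sketch below assumes one sense amplifier per bitline, sized at 100 cells (the ≈100X figure above), and ignores every other contributor to die area, so it only illustrates the trend. `area_per_cell` and `relative_area` are hypothetical helper names, not a model taken from the talk.

```python
SENSE_AMP_CELLS = 100  # assumed sense amplifier area, in units of one cell (the ~100X figure)


def area_per_cell(cells_per_bitline: int) -> float:
    """Area per stored bit, in cell units, with one sense amplifier per bitline."""
    if cells_per_bitline <= 0:
        raise ValueError("cells_per_bitline must be positive")
    return 1.0 + SENSE_AMP_CELLS / cells_per_bitline


def relative_area(cells_per_bitline: int, baseline: int = 512) -> float:
    """Area relative to a baseline bitline length (512 cells, as in commodity DRAM)."""
    return area_per_cell(cells_per_bitline) / area_per_cell(baseline)


if __name__ == "__main__":
    for n in (32, 64, 128, 256, 512):
        print(f"{n:3d} cells/bitline -> {relative_area(n):.2f}x area")
```

Under these assumptions the area grows steeply as the bitline gets shorter, because the fixed sense amplifier cost is shared by fewer cells.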
7
Trade-Off: Area (Die Size) vs. Latency
• Long bitline: Smaller
• Short bitline: Faster
→ Trade-Off: Area vs. Latency
8
Trade-Off: Area (Die Size) vs. Latency
[Figure: Normalized DRAM area vs. latency (ns) for 32, 64, 128, 256, and 512 cells/bitline. Commodity DRAM: long bitline (cheaper). Fancy DRAM: short bitline (faster).]
9
Approximating the Best of Both Worlds
• Long bitline: Small area, high latency
• Short bitline: Low latency, large area
• Our proposal: Add isolation transistors (short bitline → fast, but it needs isolation)
10
Approximating the Best of Both Worlds
• Long bitline: Small area, high latency
• Short bitline: Low latency, large area
• Our proposal (Tiered-Latency DRAM): Low latency, and small area using long bitline
11
Outline
• Motivation & Key Idea
• Tiered-Latency DRAM
• Leveraging Tiered-Latency DRAM
• Evaluation Results
12
Tiered-Latency DRAM
• Divide a bitline into two segments with an isolation transistor
[Figure: bitline split by an isolation transistor into a near segment, next to the sense amplifier, and a far segment]
13
Near Segment Access
• Turn off the isolation transistor
• Reduced bitline length → Reduced bitline capacitance → Low latency & low power
[Figure: isolation transistor (off) disconnects the far segment; only the near segment is connected to the sense amplifier]
14
Far Segment Access
• Turn on the isolation transistor
• Long bitline length → Large bitline capacitance
• Additional resistance of isolation transistor
→ High latency & high power
[Figure: isolation transistor (on) connects the far segment to the sense amplifier through the near segment]
15
Latency, Power, and Area Evaluation
• Commodity DRAM: 512 cells/bitline
• TL-DRAM: 512 cells/bitline
– Near segment: 32 cells
– Far segment: 480 cells
• Latency Evaluation
– SPICE simulation using circuit-level DRAM model
• Power and Area Evaluation
– DRAM area/power simulator from Rambus
– DDR3 energy calculator from Micron
16
Commodity DRAM vs. TL-DRAM
• DRAM Latency (tRC), relative to commodity DRAM (52.5ns): near segment –56%, far segment +23%
• DRAM Power, relative to commodity DRAM: near segment –51%, far segment +49%
• DRAM Area Overhead: ~3%, mainly due to the isolation transistors
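As a worked example, the relative numbers on this slide can be turned into absolute latencies. The sketch below applies the rounded percentages to the 52.5ns commodity tRC, so the results are approximate. The `average_trc_ns` helper and its access mix are hypothetical, and the calculation ignores migration cost and queuing.

```python
COMMODITY_TRC_NS = 52.5  # commodity DRAM tRC from the slide
NEAR_DELTA = -0.56       # near segment: 56% lower tRC
FAR_DELTA = 0.23         # far segment: 23% higher tRC


def segment_trc_ns(delta: float, baseline_ns: float = COMMODITY_TRC_NS) -> float:
    """Absolute tRC for a segment, given its change relative to the baseline."""
    return baseline_ns * (1.0 + delta)


def average_trc_ns(near_fraction: float) -> float:
    """Average tRC if `near_fraction` of accesses go to the near segment (assumed mix)."""
    if not 0.0 <= near_fraction <= 1.0:
        raise ValueError("near_fraction must be between 0 and 1")
    near = segment_trc_ns(NEAR_DELTA)
    far = segment_trc_ns(FAR_DELTA)
    return near_fraction * near + (1.0 - near_fraction) * far


if __name__ == "__main__":
    print(f"near: {segment_trc_ns(NEAR_DELTA):.1f} ns")
    print(f"far:  {segment_trc_ns(FAR_DELTA):.1f} ns")
    print(f"50% near accesses: {average_trc_ns(0.5):.1f} ns on average")
```

With these inputs, the average falls below the commodity latency once roughly 30% of accesses are served from the near segment.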
17
Latency vs. Near Segment Length
[Figure: Near segment and far segment latency (ns) vs. near segment length (1 to 512 cells)]
Longer near segment length leads to higher near segment latency
18
Latency vs. Near Segment Length
[Figure: Near segment and far segment latency (ns) vs. near segment length (1 to 512 cells)]
Far segment latency is higher than commodity DRAM latency
Far Segment Length = 512 – Near Segment Length
19
Trade-Off: Area (Die-Area) vs. Latency
[Figure: Normalized DRAM area vs. latency (ns) for 32, 64, 128, 256, and 512 cells/bitline, with the TL-DRAM near segment and far segment added. Lower is cheaper; left is faster.]
20
Outline
• Motivation & Key Idea
• Tiered-Latency DRAM
• Leveraging Tiered-Latency DRAM
• Evaluation Results
21
Leveraging Tiered-Latency DRAM
• TL-DRAM is a substrate that can be leveraged by the hardware and/or software
• Many potential uses
1. Use near segment as hardware-managed inclusive cache to far segment
2. Use near segment as hardware-managed exclusive cache to far segment
3. Profile-based page mapping by operating system
4. Simply replace DRAM with TL-DRAM
22
Near Segment as Hardware-Managed Cache
[Figure: TL-DRAM subarray connected to the channel through the I/O. The near segment, next to the sense amplifier, acts as the cache; the far segment acts as main memory.]
• Challenge 1: How to efficiently migrate a row between segments?
• Challenge 2: How to efficiently manage the cache?
23
Inter-Segment Migration
[Figure: near segment and far segment of one subarray, separated by the isolation transistor, with the source and destination rows marked]
• Goal: Migrate source row into destination row
• Naïve way: Memory controller reads the source row byte by byte and writes to destination row byte by byte
→ High latency
24
Inter-Segment Migration
• Our way:
– Source and destination cells share bitlines
– Transfer data from source to destination across shared bitlines concurrently
[Figure: near segment and far segment of one subarray, separated by the isolation transistor, with the source and destination rows marked]
25
Inter-Segment Migration
• Our way:
– Source and destination cells share bitlines
– Transfer data from source to destination across shared bitlines concurrently
• Step 1: Activate source row
• Step 2: Activate destination row to connect cell and bitline
• Migration is overlapped with source row access
• Additional ~4ns over row access latency
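The two steps can be illustrated with a purely functional toy model. It assumes the isolation transistor is on, so that both segments share bitlines, and it models no timing or analog behavior. The `Subarray` class and its methods are hypothetical names used only for this sketch.

```python
from typing import List, Optional


class Subarray:
    """Toy model: rows of bits that share bitlines and one row of sense amplifiers."""

    def __init__(self, rows: List[List[int]]) -> None:
        self.rows = [list(r) for r in rows]
        self.sense_amps: Optional[List[int]] = None  # None means the bitlines are precharged

    def activate(self, row: int) -> None:
        if self.sense_amps is None:
            # First activation: the cells drive the bitlines and the sense amplifiers latch the data.
            self.sense_amps = list(self.rows[row])
        else:
            # Bitlines are already driven: the newly connected cells take on the bitline values.
            self.rows[row] = list(self.sense_amps)

    def precharge(self) -> None:
        self.sense_amps = None

    def migrate(self, src: int, dst: int) -> None:
        self.activate(src)  # Step 1: activate source row
        self.activate(dst)  # Step 2: activate destination row while the bitlines hold the data
        self.precharge()


if __name__ == "__main__":
    sa = Subarray([[1, 0, 1, 1], [0, 0, 0, 0]])
    sa.migrate(src=0, dst=1)
    print(sa.rows[1])
```

All bits of the row move at once because every bitline carries one bit in parallel, which is why no byte-by-byte transfer through the memory controller is needed.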
26
Near Segment as Hardware-Managed Cache
[Figure: TL-DRAM subarray connected to the channel through the I/O. The near segment, next to the sense amplifier, acts as the cache; the far segment acts as main memory.]
• Challenge 1: How to efficiently migrate a row between segments?
• Challenge 2: How to efficiently manage the cache?
27
Three Caching Mechanisms
[Figure: request timelines for Row 1 and Row 2. Baseline: Row 1 is a wait-inducing row, and the request for Row 2 waits until Req. 1 finishes. Caching: Row 1 is a cached row, which reduces the wait.]
Is there another benefit of caching?
1. SC (Simple Caching)
– Classic LRU cache
– Benefit: Reduced reuse latency
2. WMC (Wait-Minimized Caching)
– Identify and cache only wait-inducing rows
– Benefit: Reduced wait
3. BBC (Benefit-Based Caching)
– BBC ≈ SC + WMC
– Benefit: Reduced reuse latency & reduced wait
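As an illustration of the first mechanism, SC can be sketched as an LRU tag store kept per subarray. This is a minimal sketch under stated assumptions: the class name, the 32-row default, and the slot bookkeeping are hypothetical, and write-back of modified rows is ignored.

```python
from collections import OrderedDict


class NearSegmentLRU:
    """LRU tag store for one subarray: maps cached far-segment rows to near-segment slots."""

    def __init__(self, near_rows: int = 32) -> None:
        self.near_rows = near_rows
        self.tags: "OrderedDict[int, int]" = OrderedDict()  # far row -> near-segment slot

    def access(self, far_row: int) -> bool:
        """Return True on a near-segment hit; on a miss, cache the row and return False."""
        if far_row in self.tags:
            self.tags.move_to_end(far_row)  # mark as most recently used
            return True
        if len(self.tags) < self.near_rows:
            slot = len(self.tags)  # a free slot is still available
        else:
            _, slot = self.tags.popitem(last=False)  # evict the least recently used row
        self.tags[far_row] = slot
        return False


if __name__ == "__main__":
    cache = NearSegmentLRU(near_rows=2)
    print([cache.access(r) for r in (10, 11, 10, 12, 11)])
```

WMC and BBC would keep the same tag store but change which rows are inserted and which row is chosen as the victim.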
28
Outline
• Motivation & Key Idea
• Tiered-Latency DRAM
• Leveraging Tiered-Latency DRAM
• Evaluation Results
29
Evaluation Methodology
• System simulator
– CPU: Instruction-trace-based x86 simulator
– Memory: Cycle-accurate DDR3 DRAM simulator
• Workloads
– 32 Benchmarks from TPC, STREAM, SPEC CPU2006
• Metrics
– Single-core: Instructions-Per-Cycle
– Multi-core: Weighted speedup
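The slide does not define the multi-core metric. The sketch below uses the common definition of weighted speedup, which sums each application's IPC when running together with the others, divided by its IPC when running alone. That this is the exact formula used in the evaluation is an assumption, and the function name is hypothetical.

```python
from typing import Sequence


def weighted_speedup(ipc_shared: Sequence[float], ipc_alone: Sequence[float]) -> float:
    """Sum over applications of IPC when sharing the system divided by IPC when alone."""
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("need one shared and one alone IPC value per application")
    return sum(shared / alone for shared, alone in zip(ipc_shared, ipc_alone))


if __name__ == "__main__":
    # Two applications, each running at half of its standalone IPC.
    print(weighted_speedup([1.0, 0.5], [2.0, 1.0]))
```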
30
Configurations
• System configuration
– CPU: 5.3GHz
– LLC: 512 KB private per core
– Memory: DDR3-1066, 1-2 channels, 1 rank/channel
– 8 banks, 32 subarrays/bank, 512 cells/bitline
– Row-interleaved mapping & closed-row policy
• TL-DRAM configuration
– Total bitline length: 512 cells/bitline
– Near segment length: 1-256 cells
Single-Core: Performance & Power
[Figure: IPC improvement and normalized power for SC, WMC, and BBC. Highlighted values: 12.7% IPC improvement, –23% power.]
Using near segment as a cache improves performance and reduces power consumption
32
Single-Core: Varying Near Segment Length
[Figure: IPC improvement of SC, WMC, and BBC vs. near segment length (1 to 256 cells). A longer near segment gives larger cache capacity but higher caching latency; the maximum IPC improvement is marked.]
By adjusting the near segment length, we can trade off cache capacity for cache latency
33
Dual-Core Evaluation
• We categorize single-core benchmarks into two categories
1. Sens: benchmarks whose performance is sensitive to near segment capacity
2. Insens: benchmarks whose performance is insensitive to near segment capacity
• Dual-core workload categorization
1. Sens/Sens
2. Sens/Insens
3. Insens/Insens
34
Dual-Core: Sens/Sens
[Figure: Performance improvement of SC, WMC, and BBC vs. near segment length (16, 32, 64, 128 cells)]
BBC/WMC show more performance improvement
Larger near segment capacity leads to higher performance improvement in sensitive workloads
35
Dual-Core: Sens/Insens & Insens/Insens
[Figure: Performance improvement of SC, WMC, and BBC vs. near segment length (16, 32, 64, 128 cells)]
Using near segment as a cache provides high performance improvement regardless of near segment capacity
36
Other Mechanisms & Results in Paper
• More mechanisms for leveraging TL-DRAM
– Hardware-managed exclusive caching mechanism
– Profile-based page mapping to near segment
– TL-DRAM improves performance and reduces power consumption with other mechanisms
• More than two tiers
– Latency evaluation for three-tier TL-DRAM
• Detailed circuit evaluation for DRAM latency and power consumption
– Examination of tRC and tRCD
• Implementation details and storage cost analysis in memory controller
37
Conclusion
• Problem: DRAM latency is a critical performance bottleneck
• Our Goal: Reduce DRAM latency with low area cost
• Observation: Long bitlines in DRAM are the dominant source of DRAM latency
• Key Idea: Divide long bitlines into two shorter segments
– Fast and slow segments
• Tiered-latency DRAM: Enables latency heterogeneity in DRAM
– Can leverage this in many ways to improve performance and reduce power consumption
• Results: When the fast segment is used as a cache to the slow segment → Significant performance improvement (>12%) and power reduction (>23%) at low area cost (3%)
38
Thank You
40
Backup Slides
41
Storage Cost in Memory Controller
• Organization
– Bitline Length: 512 cells/bitline
– Near Segment Length: 32 cells
– Far Segment Length: 480 cells
– Inclusive Caching
• Simple caching and wait-minimized caching
– Tag storage: 9 KB
– Replacement information: 5 KB
• Benefit-based caching
– Tag storage: 9 KB
– Replacement information: 8 KB (8-bit benefit field per near segment row)
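One way to arrive at these sizes is sketched below. It rests on assumptions the slide does not state: a single rank with 8 banks and 32 subarrays per bank (the evaluated configuration), one tag per near-segment row that is wide enough to name any of the 480 far-segment rows, and a 5-bit LRU field per near-segment row. The constant and function names are hypothetical.

```python
import math

BANKS = 8                # assumed: one rank of 8 banks
SUBARRAYS_PER_BANK = 32  # assumed: 32 subarrays per bank
NEAR_ROWS = 32           # near segment length in rows
FAR_ROWS = 480           # far segment length in rows


def total_near_rows() -> int:
    """Number of near-segment rows the memory controller has to track."""
    return BANKS * SUBARRAYS_PER_BANK * NEAR_ROWS


def storage_kb(bits_per_near_row: int) -> float:
    """Storage in KB for a per-near-row field of the given width."""
    return total_near_rows() * bits_per_near_row / 8 / 1024


TAG_BITS = math.ceil(math.log2(FAR_ROWS))   # bits needed to identify a far-segment row
LRU_BITS = math.ceil(math.log2(NEAR_ROWS))  # bits needed to order the near-segment rows
BENEFIT_BITS = 8                            # 8-bit benefit field from the slide

if __name__ == "__main__":
    print(f"tag storage:     {storage_kb(TAG_BITS):.0f} KB")
    print(f"LRU information: {storage_kb(LRU_BITS):.0f} KB")
    print(f"benefit fields:  {storage_kb(BENEFIT_BITS):.0f} KB")
```

Under these assumptions the three results match the 9 KB, 5 KB, and 8 KB figures listed above.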
42
Hardware-managed Exclusive Cache
• Near and Far segment: Main memory
• Caching: Swapping near and far segment row
– Need one dummy row to swap
[Figure: Performance improvement and power reduction vs. core count (# of memory channels): 1 (1-ch), 2 (2-ch), 4 (4-ch). Bar labels as extracted: 7.2%, 8.9%, 9.9%, 9.4%, 11.4%, 14.3%.]
Performance improvement is lower than Inclusive caching due to high swapping latency
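The role of the dummy row can be shown with the generic three-copy swap below. The slide does not give the exact sequence, so the ordering of the copies and the `swap_rows` helper are assumptions made for illustration.

```python
from typing import Dict, List


def swap_rows(rows: Dict[str, List[int]], near: str, far: str, dummy: str = "dummy") -> int:
    """Swap a near-segment row with a far-segment row via a spare row.

    Returns the number of row-to-row copies performed.
    """
    rows[dummy] = list(rows[near])  # 1. park the near-segment row in the dummy row
    rows[near] = list(rows[far])    # 2. bring the far-segment row into the near segment
    rows[far] = list(rows[dummy])   # 3. move the parked row out to the far segment
    return 3


if __name__ == "__main__":
    rows = {"near0": [1, 1, 0], "far7": [0, 0, 1], "dummy": [0, 0, 0]}
    copies = swap_rows(rows, "near0", "far7")
    print(rows["near0"], rows["far7"], copies)
```

Each swap costs three row copies in this sketch, against a single copy to fill an inclusive cache, which fits the slide's point about high swapping latency.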
43
Profile-Based Page Mapping
• Operating system profiles applications and maps frequently accessed rows to the near segment
[Figure: Performance improvement and power reduction vs. core count (# of memory channels): 1 (1-ch), 2 (2-ch), 4 (4-ch). Bar labels as extracted: 8.9%, 11.6%, 7.2%, 19%, 24.8%, 21.5%.]
Allocating frequently accessed rows in the near segment provides performance improvement
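The mapping policy can be sketched as picking the most frequently accessed rows from a profile. The profile format (a flat list of accessed row identifiers) and the `pick_near_segment_rows` helper are hypothetical; the slide does not describe how the operating system collects or applies the profile.

```python
from collections import Counter
from typing import Iterable, Set


def pick_near_segment_rows(accesses: Iterable[int], near_capacity: int) -> Set[int]:
    """Return the `near_capacity` most frequently accessed rows in the profile."""
    counts = Counter(accesses)
    return {row for row, _ in counts.most_common(near_capacity)}


if __name__ == "__main__":
    profile = [1, 1, 1, 2, 2, 3]  # row identifiers seen during profiling
    print(pick_near_segment_rows(profile, near_capacity=2))
```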
44
Three-Tier Analysis
• Three tiers
– Add two isolation transistors
– Near/Mid/Far segment length: 32/224/256 Cells
[Figure: Latency of three-tier TL-DRAM relative to commodity DRAM: near –56%, mid –23%, far +57%]
More tiers enable finer-grained caching and partitioning mechanisms