Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores
Aniruddha N. Udipi, Naveen Muralimanohar*, Niladrish Chatterjee, Rajeev Balasubramonian, Al Davis, Norm Jouppi*
University of Utah and *HP Labs
Why a complete DRAM redesign?
[Figure: cost-per-bit over time, marking the JEDEC SDRAM standard (June 1994); labels: High Density, Low Cost-per-bit, Energy efficient. Courtesy: http://www.iiasa.ac.at]
Time for DRAM’s own “right-hand turn”: rethink design for modern constraints
Memory Trends
• Energy
  – Large-scale systems attribute 25-40% of total power to the memory subsystem
  – Capital acquisition costs = operating costs over 3 years
  – Energy is a first-order design constraint
• Access patterns
  – Increasing socket, core, and thread counts
  – Final memory request stream extremely random
  – Cannot design for locality
  – What is the exact overfetch degree?
• DRAM reliability
  – Critical apps require chipkill-level reliability
  – Building fault tolerance out of unreliable components is expensive
  – Schroeder et al., SIGMETRICS 2009
Related Work
• Overfetch
  – Ahn et al. (SC ’09), Ware et al. (ICCD ’06), Sudan et al. (ASPLOS ’10)
• DRAM low-power modes
  – Hur et al. (HPCA ’08), Fan et al. (ISLPED ’01), Pandey et al. (HPCA ’06)
• DRAM redesign
  – Loh (ISCA ’08), Beamer et al. (ISCA ’10)
• Chipkill mechanisms
  – Yoon and Erez (ASPLOS ’10)
Executive Summary
• Rethink DRAM design for modern constraints
  – Low locality, reduced energy consumption, optimized total cost of ownership (TCO)
• Selective Bitline Activation (SBA)
  – Minimal design changes
  – Considerable dynamic energy reductions for small latency and area penalties
• Single Subarray Access (SSA)
  – Significant changes to the memory interface
  – Large dynamic and static energy savings
• Chipkill-level reliability
  – Reduced energy and storage overheads for reliability
Outline
• DRAM systems overview
• Selective Bitline Activation (SBA)
• Single Subarray Access (SSA)
• Chipkill-level reliability
• Conclusion
Basic Organization
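[Figure: DRAM organization. The on-chip memory controller connects over a memory bus or channel to DIMMs; each DIMM holds ranks of DRAM chips (devices); each chip contains banks built from arrays. Labels: 1/8th of the row buffer, one word of data output.]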
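To make the hierarchy concrete, the sketch below splits a physical address into the fields a memory controller would use to route a request down this hierarchy. The field order and widths in `FIELDS` are hypothetical choices for illustration, not the mapping used in this work or in DRAMSim.

```python
from typing import Dict, List, Tuple

# Hypothetical address layout, least-significant field first.
# The widths are illustrative only; real controllers support several mappings.
FIELDS: List[Tuple[str, int]] = [
    ("offset", 6),   # byte within a 64-byte cache line
    ("channel", 1),  # 2 channels
    ("column", 7),   # 128 cache-line columns per row
    ("bank", 3),     # 8 banks
    ("rank", 1),     # 2 ranks per channel
    ("row", 14),     # 16K rows per bank
]

def decode(addr: int) -> Dict[str, int]:
    """Split a physical address into the fields used to route a DRAM request."""
    fields = {}
    for name, width in FIELDS:
        fields[name] = addr & ((1 << width) - 1)  # take the low `width` bits
        addr >>= width                            # move on to the next field
    return fields

print(decode(0x12345678))
```

Putting the channel bit just above the line offset spreads consecutive cache lines across channels; other orders trade parallelism against row-buffer locality.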
Basic DRAM Operation
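[Figure: basic DRAM operation. RAS activates a row into the row buffer of every DRAM chip in the rank; CAS then selects the cache line, which is assembled from all the chips. One bank shown in each chip.]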
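The RAS/CAS sequence on this slide can be modeled per bank. The sketch counts the commands a controller would issue for a stream of (bank, row) requests under an open-page policy (leave the row open and hope the next access hits it) versus a close-page policy (precharge after every access). The command names and the one-open-row-per-bank model are simplifying assumptions; timing is ignored.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

def count_commands(requests: Iterable[Tuple[int, int]], open_page: bool) -> Counter:
    """Count ACT (RAS), RD (CAS) and PRE (precharge) commands for (bank, row) requests."""
    open_rows: Dict[int, int] = {}   # bank -> row currently held in its row buffer
    cmds: Counter = Counter()
    for bank, row in requests:
        if open_page:
            current = open_rows.get(bank)
            if current != row:                 # row-buffer miss
                if current is not None:
                    cmds["PRE"] += 1           # close the old row first
                cmds["ACT"] += 1               # RAS: activate the requested row
                open_rows[bank] = row
            cmds["RD"] += 1                    # CAS: read the cache line
        else:
            cmds["ACT"] += 1                   # close page: always activate,
            cmds["RD"] += 1                    # read,
            cmds["PRE"] += 1                   # and precharge immediately
    return cmds

local = [(0, 5)] * 4                             # four accesses to one row
scattered = [(0, 1), (0, 2), (0, 3), (0, 4)]     # no row reuse at all
for name, stream in (("local", local), ("scattered", scattered)):
    print(name, "open:", dict(count_commands(stream, True)),
          "close:", dict(count_commands(stream, False)))
```

With the scattered stream, open page issues as many activations as close page, so keeping rows open buys little when the request stream has no locality.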
Selective Bitline Activation
• Activate only those bitlines corresponding to the requested cache line, reducing dynamic energy
  – Some area overhead, depending on access granularity: we pick a granularity of 16 cache lines, for 12.5% area overhead
• Requires no changes to the interface and minimal control changes
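A minimal sketch of the selection step, assuming a row of 64 cache lines (the baseline activation footprint quoted on the energy-contributors slide) divided into segments of 16 cache lines, the granularity chosen above. The function names are hypothetical. Because the segment depends on the column address, activation presumably has to wait until that address is known, which fits the posted-RAS scheme that the SSA slides describe as shared with SBA.

```python
LINES_PER_ROW = 64     # baseline activation footprint, in cache lines
SBA_GRANULARITY = 16   # cache lines whose bitlines are activated together

def sba_segment(column: int) -> int:
    """Index of the bitline segment that holds the requested cache line."""
    if not 0 <= column < LINES_PER_ROW:
        raise ValueError("column out of range")
    return column // SBA_GRANULARITY

def activated_lines(column: int) -> range:
    """Cache-line columns whose bitlines are activated for this request."""
    start = sba_segment(column) * SBA_GRANULARITY
    return range(start, start + SBA_GRANULARITY)

print(sba_segment(37), list(activated_lines(37)))
```

Only one of the four segments is activated; the other three stay idle for this access.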
Key Idea
• Incandescent light bulb ($0.30, 60W)
  – Low purchase cost
  – High operating cost
  – Commodity
• Energy-efficient light bulb ($3.00, 13W)
  – Higher purchase cost
  – Much lower operating cost
  – Value-addition
It’s worth a small increase in capital costs to gain large reductions in operating costs
And the capital-cost increase needed here is not 10X, just 15-20%!
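The light-bulb argument is plain total-cost arithmetic. The sketch compares the two bulbs from the slide ($0.30 at 60 W, $3.00 at 13 W) and finds the break-even point; the electricity price of $0.10 per kWh is an assumption for illustration only.

```python
def total_cost(price: float, watts: float, hours: float, usd_per_kwh: float) -> float:
    """Purchase price plus energy cost after `hours` of operation."""
    return price + watts / 1000.0 * hours * usd_per_kwh

def break_even_hours(cheap, efficient, usd_per_kwh: float) -> float:
    """Hours after which the pricier, more efficient option is cheaper overall."""
    (p_cheap, w_cheap), (p_eff, w_eff) = cheap, efficient
    return (p_eff - p_cheap) / ((w_cheap - w_eff) / 1000.0 * usd_per_kwh)

INCANDESCENT = (0.30, 60.0)   # ($, watts), from the slide
EFFICIENT = (3.00, 13.0)      # ($, watts), from the slide
RATE = 0.10                   # $/kWh: assumed, not from the talk

hours = break_even_hours(INCANDESCENT, EFFICIENT, RATE)
print(f"break-even after {hours:.0f} hours")
```

At the assumed rate the efficient bulb pays for itself after roughly 575 hours of use; a higher electricity price shortens that.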
Wishlist of features
• Eliminate overfetch
  – Disregard locality
• Increase opportunities for power-down
• Increase parallelism
• Enable efficient reliability mechanisms
SSA Architecture
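[Figure: SSA architecture. The memory controller drives a shared ADDR/CMD bus; the data bus is split into narrow 8-bit buses, one per DRAM chip on the DIMM. Inside one DRAM chip, banks are built from subarrays; the bitlines of a subarray feed a 64-byte row buffer, connected to the I/O through a global interconnect.]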
SSA Basics
• Entire DRAM chip divided into small subarrays
• Width of each subarray is exactly one cache line
• Fetch entire cache line from a single subarray in a single DRAM chip – SSA
• Groups of subarrays combined into “banks” to keep peripheral circuit overheads low
• Close page policy and “posted-RAS” similar to SBA
• Data bus to processor essentially split into 8 narrow buses
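Because SSA serves a whole cache line from one subarray of one chip, the controller must map each line address to a (chip, bank, subarray, row) tuple. A minimal sketch, assuming 8 data chips, 8 banks per chip and 64 subarrays per bank, and assuming consecutive lines are interleaved across chips first so that independent requests can use the narrow buses in parallel; all of these parameters and the interleaving order are hypothetical.

```python
from typing import NamedTuple

# Hypothetical SSA geometry: illustrative values, not taken from the paper.
CHIPS = 8        # data chips per rank, each with its own 8-bit data bus
BANKS = 8        # banks per chip
SUBARRAYS = 64   # subarrays per bank; one subarray row holds one cache line
LINE_BYTES = 64

class SSALocation(NamedTuple):
    chip: int
    bank: int
    subarray: int
    row: int

def ssa_map(addr: int) -> SSALocation:
    """Map a byte address to the single subarray that holds its whole cache line."""
    line = addr // LINE_BYTES
    chip, line = line % CHIPS, line // CHIPS              # interleave across chips first,
    bank, line = line % BANKS, line // BANKS              # then across banks,
    subarray, row = line % SUBARRAYS, line // SUBARRAYS   # then across subarrays
    return SSALocation(chip, bank, subarray, row)

for addr in (0, 64, 128, 64 * CHIPS):
    print(hex(addr), ssa_map(addr))
```

Consecutive lines land on different chips, so a run of misses keeps several narrow buses busy while each access activates only one subarray.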
SSA Operation
[Figure: SSA operation. The address selects a single subarray in a single DRAM chip, which returns the entire cache line; subarrays in the other chips stay in sleep mode (or serve other parallel accesses).]
SSA Impact
• Energy reduction
  – Dynamic: fewer bitlines activated
  – Static: smaller activation footprint, so more and longer spells of inactivity and better power-down
• Latency impact
  – Limited pins per cache line: serialization latency
  – Higher bank-level parallelism: shorter queuing delays
• Area increase
  – More peripheral circuitry and I/O at finer granularities: area overhead (< 5%)
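The serialization penalty follows from bus width: a 64-byte line crosses a 64-bit rank data bus in 8 transfers, but a single chip's 8-bit bus in 64. The sketch below assumes double-data-rate signaling (two transfers per clock) on the 400 MHz DRAM clock listed in the methodology; the DDR assumption is ours, and queuing delay is not modeled.

```python
def transfer_ns(line_bytes: int, bus_bits: int, clock_mhz: float, per_cycle: int = 2) -> float:
    """Time to move one cache line over a data bus (default: two transfers per clock)."""
    transfers = line_bytes * 8 // bus_bits
    cycles = transfers / per_cycle
    return cycles * 1000.0 / clock_mhz

baseline = transfer_ns(64, 64, 400)   # the whole rank drives a 64-bit bus
ssa = transfer_ns(64, 8, 400)         # one chip drives its own 8-bit bus
print(f"baseline {baseline:.0f} ns, SSA {ssa:.0f} ns")
```

The 8X longer transfer is what the eight independent buses and higher bank-level parallelism must offset, which is the serialization/queuing balance reported on the latency slide.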
Methodology
• Simics-based simulator – ‘ooo-micro-arch’ and ‘trans-staller’
• FCFS/FR-FCFS scheduling policies
• Address mapping and DRAM models from DRAMSim
• DRAM data from Micron datasheets
• Area/energy numbers from heavily modified CACTI 6.5
• PARSEC/NAS/STREAM benchmarks
• 8 single-threaded OOO cores, 32 KB L1, 2 MB L2
• 2 GHz processor, 400 MHz DRAM
Dynamic Energy Reduction
• Moving to a close-page policy: 73% energy increase on average
• Compared to open page: 3X reduction with SBA, 6.4X with SSA
Contributors to energy consumption
[Figure: breakdown of energy contributors. Activation granularity: 64 cache lines in the baseline, 16 cache lines in SBA, 1 cache line in SSA.]
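The activation footprints of 64, 16 and 1 cache lines translate directly into bytes of row activity per 64-byte request. A quick calculation, assuming 64-byte cache lines; it counts only activated bits, while the measured energy ratios (3X and 6.4X) also include components other than bitline activation, so they are smaller than these overfetch ratios.

```python
LINE_BYTES = 64
ACTIVATED_LINES = {"baseline": 64, "SBA": 16, "SSA": 1}  # from the slide

def activation_bytes(lines: int, line_bytes: int = LINE_BYTES) -> int:
    """Bytes latched by bitline activation to serve one cache-line request."""
    return lines * line_bytes

for design, lines in ACTIVATED_LINES.items():
    print(f"{design}: {activation_bytes(lines)} B activated per {LINE_BYTES} B line "
          f"({lines}x overfetch)")
```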
Static Energy – Power down modes
• Current DRAM chips already support several low-power modes
• Consider the low-overhead power-down mode: 5.5X lower energy, 3-cycle wakeup time
• For a constant 5% latency increase
  – 17% low-power operation in the baseline
  – 80% low-power operation in SSA
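One rough way to read these numbers: if power-down draws 1/5.5 of the normal background power, background energy relative to never powering down is (1 - f) + f/5.5, where f is the fraction of time spent powered down. This simplified model is ours: it ignores wakeup energy, treats 5.5X as a power ratio, and assumes equal background power for both designs.

```python
def relative_background_energy(low_power_fraction: float, reduction: float = 5.5) -> float:
    """Background energy relative to a chip that never powers down (simplified model)."""
    f = low_power_fraction
    return (1.0 - f) + f / reduction

base = relative_background_energy(0.17)   # 17% low-power time in the baseline
ssa = relative_background_energy(0.80)    # 80% low-power time in SSA
print(f"baseline {base:.2f}, SSA {ssa:.2f}, ratio {base / ssa:.1f}x")
```

Under these assumptions SSA's background energy comes out roughly 2.5X lower; the real figure depends on the workload and on the wakeup costs the model omits.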
Latency Characteristics
• Impact of open/close page policy: 17% latency decrease (10/12 benchmarks) or 28% increase (2/12)
• Posted-RAS adds about 10%
• Serialization/queuing delay balance in SSA: 30% decrease (6/12) or 40% increase (6/12)
Contributors to Latency
DRAM Reliability
• Many server applications require chipkill-level reliability: tolerating the failure of an entire DRAM chip
• One example of existing systems
  – A 64-bit word requires 8-bit ECC
  – Each of these 72 bits must be read out of a different chip, else a chip failure will lead to a multi-bit error in the 72-bit field: unrecoverable!
  – Reading 72 chips: significant overfetch!
• Chipkill is even more of a concern for SSA, since the entire cache line comes from a single chip
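Why 72 chips? Assuming the 8 ECC bits form a standard (72,64) SECDED code, which corrects one bad bit per word, a dead chip is survivable only if it holds at most one bit of each word. The sketch compares spreading the word over 72 chips with spreading it over 18 or 9 chips; the chip counts other than 72 are illustrative.

```python
WORD_BITS = 72        # 64 data bits + 8 ECC bits
SECDED_CORRECTS = 1   # a (72,64) SECDED code corrects one bit per word

def bits_per_chip(word_bits: int, chips: int) -> int:
    """Bits of one ECC word held by each chip when the word is spread evenly."""
    if word_bits % chips:
        raise ValueError("word must divide evenly across chips")
    return word_bits // chips

for chips in (72, 18, 9):
    lost = bits_per_chip(WORD_BITS, chips)   # bits corrupted if one chip dies
    verdict = "correctable" if lost <= SECDED_CORRECTS else "unrecoverable"
    print(f"{chips} chips: {lost} bit(s) per chip -> {verdict}")
```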
Proposed Solution
Approach similar to RAID-5
[Figure: RAID-5-like layout across the nine DRAM devices of a DIMM. Each cache line (L) is stored with its local checksum (C); global parity lines (P) rotate across devices, e.g., L0-L7 with P0 in the first row, then in the second row P1 moves one device to the left and L8 wraps to the last device. Legend: L = cache line, C = local checksum, P = global parity.]
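The two tiers can be sketched in a few lines: a per-line checksum detects which line is bad, and the stripe's XOR parity rebuilds it from the other devices, the same way RAID-5 rebuilds a failed disk block. The 8-bit additive checksum below is a stand-in chosen for brevity (the talk does not specify the checksum function), and the stripe holds 8 data lines plus one parity line.

```python
from functools import reduce
from typing import List

LINE_BYTES = 64

def checksum8(line: bytes) -> int:
    """Stand-in tier-1 checksum: sum of bytes mod 256 (the real function may differ)."""
    return sum(line) & 0xFF

def xor_lines(lines: List[bytes]) -> bytes:
    """Bytewise XOR of equally sized cache lines (the tier-2 global parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*lines))

def read_line(stripe: List[bytes], sums: List[int], parity: bytes, i: int) -> bytes:
    """Read line i; if its checksum fails, rebuild it from the rest of the stripe."""
    line = stripe[i]
    if checksum8(line) == sums[i]:
        return line                          # common case: a single-chip read
    others = [data for j, data in enumerate(stripe) if j != i]
    return xor_lines(others + [parity])      # error case: read the remaining chips

# Deterministic example: 8 data lines plus their parity line.
stripe = [bytes((31 * i + k) % 256 for k in range(LINE_BYTES)) for i in range(8)]
sums = [checksum8(data) for data in stripe]
parity = xor_lines(stripe)
original = stripe[3]
stripe[3] = bytes(LINE_BYTES)                # simulate a failed chip returning zeros
print(read_line(stripe, sums, parity, 3) == original)
```

A simple sum can miss errors that preserve it; a stronger tier-1 code would slot into `checksum8` without changing the tier-2 logic.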
Chipkill design
• Two-tier error protection
• Tier-1 protection: self-contained error detection
  – 8-bit checksum per cache line: 1.625% storage overhead
  – Every cache line read is now slightly longer
• Tier-2 protection: global error correction
  – RAID-like striped parity across 8+1 chips
  – 12.5% storage overhead
• Error-free access (common case)
  – 1-chip reads
  – 2-chip writes: leads to some bank contention, 12% IPC degradation
• Erroneous access
  – 9-chip operation
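Why a write touches two chips: with XOR parity, the new parity follows from the old parity, the old data and the new data alone (the RAID-5 "small write" rule), so only the data chip and the parity chip are written. A minimal sketch; the toy 8-byte lines and the function names are illustrative, and how the controller obtains the old values is not covered here.

```python
from functools import reduce
from typing import List

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equally sized lines."""
    return bytes(x ^ y for x, y in zip(a, b))

def full_parity(lines: List[bytes]) -> bytes:
    """Parity recomputed from every data line in the stripe."""
    return reduce(xor_bytes, lines)

def updated_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """Small-write rule: P' = P xor D_old xor D_new; other lines are not needed."""
    return xor_bytes(xor_bytes(old_parity, old_data), new_data)

lines = [bytes([i] * 8) for i in range(8)]           # toy 8-byte "cache lines"
parity = full_parity(lines)
new_line = bytes([0xAA] * 8)
parity = updated_parity(parity, lines[5], new_line)  # write 2: the parity chip
lines[5] = new_line                                  # write 1: the data chip
print(parity == full_parity(lines))
```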
Key Contributions
• Redesign of DRAM microarchitecture
• Substantial chip access energy savings (up to 6X)
• Overall, performance is a wash
• Minor area impact (12% with SBA, 4.5% with SSA)
• Two-tier chipkill-level reliability with minimal energy and storage overheads
Now is the time for new architectures
• Take into account modern constraints
• Energy far more critical today than before
• Cost-per-bit perhaps less important: optimize TCO
  – Operating costs over 3 years = capital acquisition costs
• Memory reliability is important for many server applications.
Memory system’s “right-hand turn” is long overdue