Buffer-On-Board Memory System
Name: Aurangozeb
ISCA 2012
Outline
• Introduction
• Modern Memory System
• Buffer-On-Board (BOB) Memory System
• BOB Simulation Suite
• BOB Simulation Result
  • Limit-Case Simulation
  • Full System Simulation
• Conclusion
Introduction (1/2)
• The memory system must be modified to cope with higher speeds.
• Dual Inline Memory Module (DIMM): <100 MHz speed.
• Signal integrity issues (e.g., crosstalk, reflection) arise at high operating speeds.
  • Reducing the number of DIMMs allows a higher clock speed, but limits the total capacity.
• One simple solution:
  • Increase the capacity of a single DIMM.
  • Drawback: it is difficult to decrease the DRAM capacitor size, and cost does not scale linearly.
Introduction (2/2)
• FB-DIMM memory solution:
  • Advanced Memory Buffer (AMB) with DDRx DRAM, to interpret the packetized protocol and issue DRAM-specific commands.
  • Supports fast and slow speeds of operation.
  • Drawbacks:
    • High-speed I/O of the AMB: heat and power issues.
    • Not cost effective.
• Solution from IBM / Intel / AMD:
  • A single logic chip, rather than one logic chip per FB-DIMM.
  • Controls the DRAM and communicates with the CPU over a relatively faster and narrower bus.
  • New architecture using low-cost DIMMs.
Modern Memory System
• Considerations:
  • Ranks of memory per channel
  • DRAM type
  • Number of channels per processor
Buffer-On-Board (BOB) Memory System (1/2)
• Multiple BOB channels
• Each channel consists of LR-, R-, or U-DIMMs
• A single, simple controller for each channel
• A faster and narrower bus (link bus) between the simple controller and the CPU
Buffer-On-Board (BOB) Memory System (2/2)
• Operation:
  • Request packet over the link bus: address + request type + data (if write).
  • Translate the request into DRAM-specific commands (ACTIVATE, READ, WRITE, etc.) and issue them to the DRAM ranks.
  • A command queue: dynamic scheduling.
  • Read return queue: sorting after the data is received.
  • Response packet contains: data + address of the initial request.
• BOB controller:
  • Address mapping
  • Returning data to the CPU/cache
  • Packetizing requests
  • Interpreting response packets: from and to the simple controller
  • Encapsulation: to support the narrower link bus
    • Uses multiple clock cycles to transmit the total data.
  • A crossbar switch: any port to any link bus.
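The encapsulation step can be illustrated with a minimal Python sketch. The field widths (address bits, request-type bits, a 64-byte data payload) and the lane counts are illustrative assumptions, not values from the slides; the point is only that a packet wider than the link bus is sent over several link clock cycles.

```python
import math
from dataclasses import dataclass

# All field widths below are illustrative assumptions, not values from the slides.
ADDRESS_BITS = 40      # hypothetical physical address width
TYPE_BITS = 8          # hypothetical request-type field
DATA_BITS = 64 * 8     # one 64-byte cache line of read or write data


@dataclass
class RequestPacket:
    address: int
    is_write: bool

    def size_bits(self) -> int:
        # Request packet: address + request type + data (only if write).
        return ADDRESS_BITS + TYPE_BITS + (DATA_BITS if self.is_write else 0)


@dataclass
class ResponsePacket:
    address: int  # address of the initial request

    def size_bits(self) -> int:
        # Response packet: data + address of the initial request.
        return DATA_BITS + ADDRESS_BITS


def link_cycles(packet_bits: int, lanes: int, bits_per_lane_per_cycle: int = 2) -> int:
    """Link-bus clock cycles needed to move one packet over `lanes` lanes.

    bits_per_lane_per_cycle=2 assumes double-data-rate signalling.
    """
    return math.ceil(packet_bits / (lanes * bits_per_lane_per_cycle))


read_req = RequestPacket(address=0x1000, is_write=False)
write_req = RequestPacket(address=0x1000, is_write=True)
response = ResponsePacket(address=0x1000)

print(link_cycles(read_req.size_bits(), lanes=8))    # 48 bits at 16 bits/cycle
print(link_cycles(write_req.size_bits(), lanes=8))   # 560 bits at 16 bits/cycle
print(link_cycles(response.size_bits(), lanes=12))   # 552 bits at 24 bits/cycle
```

Under these assumed widths a write request occupies the request link far longer than a read request, which is why the lane split between the two directions matters.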
BOB Simulation Suite
• Two separate simulators
  • A BOB simulator developed by the authors, and MARSSx86
    • MARSSx86: a multi-core x86 simulator developed at SUNY-Binghamton
• Cycle-based simulator written in C++
  • Encapsulates: the main BOB controller, each BOB channel, and the associated link bus and simple controller.
• Two modes
  • Stand-alone: parameterized requests, random addresses, or a trace file are issued to the memory system
  • Full system simulation: receives requests from MARSSx86
• Memory
  • A DDR3-1066 device (MT41J512M4-187E)
  • A DDR3-1333 device (MT41J1G4-15E)
  • A DDR3-1600 device (MT41J256M4-125E)
  • ref. [16]
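For reference, the peak data bandwidth of a channel built from each device can be estimated from its data rate. The sketch below assumes a standard 64-bit (8-byte) data bus per DRAM channel and uses the nominal clock periods implied by the -187E, -15E, and -125E speed grades (1.875 ns, 1.5 ns, and 1.25 ns). These are theoretical peaks, not simulation results.

```python
# Nominal clock period (ns) per speed grade; DDR transfers data twice per clock.
SPEED_GRADES = {
    "DDR3-1066 (MT41J512M4-187E)": 1.875,
    "DDR3-1333 (MT41J1G4-15E)": 1.5,
    "DDR3-1600 (MT41J256M4-125E)": 1.25,
}
BUS_BYTES = 8  # assumed 64-bit data bus per DRAM channel


def peak_bandwidth_gb_s(tck_ns: float, bus_bytes: int = BUS_BYTES) -> float:
    """Peak channel bandwidth in GB/s (1 GB = 1e9 bytes)."""
    transfers_per_ns = 2.0 / tck_ns        # two transfers per clock
    return transfers_per_ns * bus_bytes    # bytes per ns is the same as GB/s


for name, tck in SPEED_GRADES.items():
    print(f"{name}: {peak_bandwidth_gb_s(tck):.2f} GB/s")
```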
BOB Simulation Result
• Two experiments:
  • A limit-case simulation: a random address stream is issued into a BOB memory system.
  • A full system simulation: an operating system is booted on an x86 processor and applications are executed.
• Benchmarks
  • NAS parallel benchmarks
  • PARSEC benchmark suite [9]
  • STREAM
• Emphasized multi-threaded applications to demonstrate the types of workloads this memory architecture is likely to encounter.
• Design tradeoffs: Costs such as total pin count, power dissipation, and physical space (or total DIMM count).
Limit-Case Simulation
• Simple Controller & DRAM Efficiency
  • Optimal rank depth for each DRAM channel is between 2 and 4.
  • If the return queue is full, no further reads or writes are issued.
  • A read return queue must have at least enough capacity for four response packets.
Limit-Case Simulation
• Link Bus Configuration (1/2)
  • Optimize the width and speed of the buses so that the DRAM does not stall.
  • A read-to-write request ratio of approximately 2-to-1.
  • Equations 1 & 2: the bandwidth required by each link bus to prevent it from negatively impacting the efficiency of each channel.
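Equations 1 and 2 are not reproduced in this transcript. As a rough illustration of the idea only, the sketch below assumes that the response link must carry all read data and the request link all write data, each with a per-packet header, and sizes both links so that the DRAM channel can run at its peak bandwidth. The header size, the 64-byte transfer, and the function name are assumptions, not the authors' equations.

```python
def required_link_bandwidth(dram_peak_gb_s: float,
                            read_fraction: float,
                            line_bytes: int = 64,
                            header_bytes: int = 8):
    """Return (request_gb_s, response_gb_s) needed to avoid stalling the DRAM.

    Assumed model (not the slide's Equations 1 & 2):
      - every request moves `line_bytes` of data to or from the DRAM,
      - every packet carries `header_bytes` of address/type overhead,
      - reads send a header on the request link and header + data back,
      - writes send header + data on the request link and nothing back.
    """
    requests_per_ns = dram_peak_gb_s / line_bytes   # GB/s equals bytes/ns
    reads = requests_per_ns * read_fraction
    writes = requests_per_ns * (1.0 - read_fraction)
    request_bw = reads * header_bytes + writes * (header_bytes + line_bytes)
    response_bw = reads * (header_bytes + line_bytes)
    return request_bw, response_bw


# A 2-to-1 read-to-write mix on a DDR3-1333 channel (about 10.67 GB/s peak).
req, resp = required_link_bandwidth(10.667, read_fraction=2 / 3)
print(f"request link:  {req:.2f} GB/s")
print(f"response link: {resp:.2f} GB/s")
```

Under this assumed model, a read-heavy mix needs more response bandwidth than request bandwidth.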
Limit-Case Simulation
• Link Bus Configuration (2/2)
  • Weighting the response link bus more heavily than the request link bus may be ideal for some applications.
  • Side effect: serializing the communication on unidirectional buses.
Limit-Case Simulation
• Multi-Channel Optimization
  • Multiple logically independent channels of DRAM share the same link bus and simple controller.
  • Reduces costs such as pin-out, logic fabrication, and physical space.
  • Reduces the number of simple controllers.
Limit-Case Simulation
• Cost Constrained Simulations
  • 8 DRAM channels, each with 4 ranks (32 DIMMs making 256 GB total)
  • The CPU has up to 128 pins which can be used for data lanes
  • These lanes are operated at 3.2 GHz (6.4 Gb/s)
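The arithmetic behind these constraints can be checked with a short sketch. It assumes one single-ended data lane per pin and no protocol or encoding overhead, neither of which is stated on the slide, so the bandwidth figure is an upper bound.

```python
CHANNELS = 8
RANKS_PER_CHANNEL = 4
TOTAL_CAPACITY_GB = 256
CPU_DATA_PINS = 128
LANE_RATE_GBIT_S = 6.4   # 3.2 GHz clock, two bits per clock per lane

# The slide counts 32 DIMMs, i.e. one DIMM per rank.
dimms = CHANNELS * RANKS_PER_CHANNEL
capacity_per_dimm_gb = TOTAL_CAPACITY_GB / dimms

# Assumption: every pin is one single-ended data lane with no encoding overhead.
aggregate_gbit_s = CPU_DATA_PINS * LANE_RATE_GBIT_S
aggregate_gbyte_s = aggregate_gbit_s / 8

print(dimms)                 # 32
print(capacity_per_dimm_gb)  # 8.0
print(aggregate_gbyte_s)     # 102.4
```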
Full System Simulations
• Performance & Power Trade-offs
  • STREAM and mcol generate the greatest average
  • This is due to the request mix generated during the region of interest:
    • STREAM: 46% reads and 54% writes
    • mcol: 99% reads
Full System Simulations
• Address & Channel Mapping
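The slides do not list the mappings that were evaluated. As a generic illustration of why the choice matters, the sketch below decodes a physical address into channel, rank, bank, row, and column fields under two hypothetical bit orderings and counts how a sequential stream of cache lines spreads across channels. The field widths other than 8 channels and 4 ranks are assumptions, as are the names `decode`, `CHANNEL_LOW`, and `CHANNEL_HIGH`.

```python
from collections import Counter

LINE_BITS = 6      # 64-byte cache line offset (assumed)
CHANNEL_BITS = 3   # 8 channels
RANK_BITS = 2      # 4 ranks per channel
BANK_BITS = 3      # 8 banks per DDR3 device
COLUMN_BITS = 7    # hypothetical
ROW_BITS = 15      # hypothetical


def decode(address: int, order: list) -> dict:
    """Split an address into fields; `order` lists them from least significant up."""
    widths = {"channel": CHANNEL_BITS, "rank": RANK_BITS, "bank": BANK_BITS,
              "column": COLUMN_BITS, "row": ROW_BITS}
    bits = address >> LINE_BITS          # drop the offset within a cache line
    fields = {}
    for name in order:
        fields[name] = bits & ((1 << widths[name]) - 1)
        bits >>= widths[name]
    return fields


# Two hypothetical mappings: channel bits lowest vs. channel bits highest.
CHANNEL_LOW = ["channel", "column", "bank", "rank", "row"]
CHANNEL_HIGH = ["column", "bank", "rank", "row", "channel"]

# A sequential stream of 64 cache lines.
stream = [i * 64 for i in range(64)]
for label, order in (("channel-low", CHANNEL_LOW), ("channel-high", CHANNEL_HIGH)):
    usage = Counter(decode(a, order)["channel"] for a in stream)
    print(label, dict(usage))
```

With the channel bits lowest, consecutive cache lines rotate across all eight channels; with the channel bits highest, the whole stream lands on one channel.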
Conclusion
• A new memory architecture: increases both speed and capacity.
• Intermediate logic between the CPU and DIMMs.
• Verified by implementing two configurations:
  • Limit-case simulation
  • Full system simulation
• Queue depths, proper bus configurations, and address mappings are considered to achieve peak efficiency.
• Cost-constrained simulations are also performed.
• The buffer-on-board architecture: an ideal near-term solution.