04/21/23 1
Improving Memory Bank-Level Parallelism in the Presence of Prefetching
Chang Joo Lee
Veynu Narasiman
Onur Mutlu*
Yale N. Patt
Electrical and Computer Engineering, The University of Texas at Austin
* Electrical and Computer Engineering, Carnegie Mellon University
Main Memory System
• Crucial to high-performance computing
• Made of DRAM chips
• Multiple banks → each bank can be accessed independently
Memory Bank-Level Parallelism (BLP)

[Figure: the DRAM controller's request buffer holds two requests, Req B0 (older) to bank 0 and Req B1 to bank 1. Because the banks are independent, the service times of the two requests overlap, and their data returns back-to-back on the data bus → DRAM throughput increased.]
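The benefit of overlapping bank accesses can be shown with a toy timing model (a sketch; the 100 ns service time and the function below are illustrative assumptions that ignore data-bus contention and row-buffer effects):

```python
from collections import Counter

def total_service_time(banks, service_ns=100):
    """Toy model: requests to the same bank serialize, while requests
    to different banks are serviced fully in parallel."""
    per_bank = Counter(banks)                      # requests per bank
    return max(per_bank.values(), default=0) * service_ns

# Two requests to the same bank serialize: 200 ns total.
assert total_service_time([0, 0]) == 200
# Two requests to different banks overlap: 100 ns total.
assert total_service_time([0, 1]) == 100
```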
Memory Latency-Tolerance Mechanisms
• Out-of-order execution, prefetching, runahead execution, etc.
• Increase the number of outstanding memory requests on the chip
– Memory-Level Parallelism (MLP) [Glew '98]
• Hope: many requests will be serviced in parallel in the memory system
• Higher performance can be achieved when BLP is exposed to the DRAM controller
Problems
• On-chip buffers, e.g., Miss Status Holding Registers (MSHRs), are limited in size
– Limits the BLP exposed to the DRAM controller
– E.g., requests to the same bank fill up the MSHRs
• In CMPs, memory requests from different cores are mixed together in the DRAM request buffers
– Destroys the BLP of each application running on the CMP
Request issue policies are critical to the BLP exploited by the DRAM controller
Goals and Proposal
1. Maximize the BLP exposed from each core to the DRAM controller → increase DRAM throughput for useful requests
2. Preserve the BLP of each application in CMPs → increase system performance

BLP-Aware Prefetch Issue (BAPI): decides the order in which prefetches are sent from the prefetcher to the MSHRs
BLP-Preserving Multi-core Request Issue (BPMRI): decides the order in which memory requests are sent from each core to the DRAM request buffers
DRAM BLP-Aware Request Issue Policies
• BLP-Aware Prefetch Issue (BAPI)
• BLP-Preserving Multi-core Request Issue (BPMRI)
What Can Limit DRAM BLP?
• Miss Status Holding Registers (MSHRs) are NOT large enough to handle many memory requests [Tuck, MICRO '06]
– MSHRs keep track of all outstanding misses for a core → total number of demand/prefetch requests ≤ total number of MSHR entries
– Complex, latency-critical, and power-hungry → not scalable
The request issue policy to the MSHRs affects the level of BLP exploited by the DRAM controller
What Can Limit DRAM BLP?

[Figure: a core's prefetch request buffer feeds the MSHRs (full), which feed per-bank DRAM request buffers. With FIFO issue (as in Intel Core), the MSHRs fill in arrival order with Dem B0, Pref B0, and two Pref B1 requests: at point α, bank 0 holds 2 requests and bank 1 holds 0, so the Pref B1 requests are serviced only after bank 0 finishes. With BLP-aware issue, at point β, 1 request is sent to each bank, so the two banks' service times overlap and the total DRAM service time shrinks (saved time).]
α: Increasing the number of requests ≠ high DRAM BLP
β: A simple issue policy improves DRAM BLP
BLP-Aware Prefetch Issue (BAPI)
• Sends prefetches to the MSHRs based on the current BLP exposed in the memory system
– Sends the prefetch mapped to the least busy DRAM bank
• Adaptively limits the issue of prefetches based on prefetch accuracy estimation
– Low prefetch accuracy → fewer prefetches issued to the MSHRs
– High prefetch accuracy → maximize BLP
Implementation of BAPI
• FIFO prefetch request buffer per DRAM bank
– Stores the prefetches mapped to the corresponding DRAM bank
• MSHR occupancy counter per DRAM bank
– Keeps track of the number of outstanding requests to the corresponding DRAM bank
• Prefetch accuracy register
– Stores the estimated prefetch accuracy, updated periodically
BAPI Policy
Every prefetch issue cycle:
1. Mark the oldest prefetch to each bank valid only if that bank's MSHR occupancy counter ≤ prefetch send threshold
2. Among the valid prefetches, select the one to the bank with the minimum MSHR occupancy counter value
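The two steps above can be sketched in Python (a hypothetical model of the policy, not the authors' hardware; the class and method names are assumptions):

```python
from collections import deque

class BAPI:
    """Sketch of BLP-Aware Prefetch Issue with per-bank FIFO prefetch
    buffers and per-bank MSHR occupancy counters."""

    def __init__(self, num_banks, send_threshold):
        self.buffers = [deque() for _ in range(num_banks)]  # FIFO per bank
        self.occupancy = [0] * num_banks                    # outstanding reqs per bank
        self.send_threshold = send_threshold                # set from prefetch accuracy

    def enqueue(self, bank, prefetch):
        self.buffers[bank].append(prefetch)

    def issue(self):
        """One prefetch issue cycle."""
        # Step 1: the oldest prefetch to a bank is valid only if that
        # bank's occupancy counter <= the prefetch send threshold.
        valid = [b for b, q in enumerate(self.buffers)
                 if q and self.occupancy[b] <= self.send_threshold]
        if not valid:
            return None
        # Step 2: among valid prefetches, pick the least-occupied bank.
        bank = min(valid, key=lambda b: self.occupancy[b])
        self.occupancy[bank] += 1       # request is now outstanding in the MSHRs
        return self.buffers[bank].popleft()

# Two prefetches queued for bank 0 and one for bank 1: issue order
# alternates between banks instead of draining bank 0 first.
bapi = BAPI(num_banks=2, send_threshold=7)
bapi.enqueue(0, "P0a"); bapi.enqueue(0, "P0b"); bapi.enqueue(1, "P1a")
assert [bapi.issue(), bapi.issue(), bapi.issue()] == ["P0a", "P1a", "P0b"]
```

Note how the occupancy counters alone are enough to spread prefetches across banks; a FIFO issuer would have sent both bank-0 prefetches first.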
Adaptivity of BAPI
• Prefetch send threshold
– Reserves MSHR entries for prefetches to different banks
– Adjusted based on prefetch accuracy
• Low prefetch accuracy → low prefetch send threshold
• High prefetch accuracy → high prefetch send threshold
DRAM BLP-Aware Request Issue Policies
• BLP-Aware Prefetch Issue (BAPI)
• BLP-Preserving Multi-core Request Issue (BPMRI)
BLP Destruction in CMP Systems
• DRAM request buffers are shared by multiple cores
– To exploit a core's BLP, that BLP must be exposed to the DRAM request buffers
– A core's BLP potential can be destroyed by interference from other cores' requests
The request issue policy from each core to the DRAM request buffers affects the BLP of each application
Why is DRAM BLP Destroyed?

[Figure: cores A and B each have two requests, one per bank (Req A0/A1 and Req B0/B1, older first). A round-robin request issuer interleaves them into the DRAM request buffers, which serializes each core's requests: each bank services one core's request and then the other's, so both cores stall twice waiting for their own requests to complete. A BLP-preserving issuer instead sends each core's requests consecutively, so A0 and A1 overlap on the two banks, then B0 and B1 overlap, saving cycles for Core A while only slightly increasing cycles for Core B.]
The issue policy should preserve DRAM BLP
BLP-Preserving Multi-core Request Issue (BPMRI)
• Consecutively sends requests from one core to the DRAM request buffers
• Limits the maximum number of consecutive requests sent from one core
– Prevents starvation of memory non-intensive applications
• Prioritizes memory non-intensive applications
– The impact of delaying requests from a memory non-intensive application > the impact of delaying requests from a memory-intensive application
Implementation of BPMRI
• Last-level (L2) cache miss counter per core
– Stores the number of L2 cache misses from the core
• Rank register per core
– Fewer L2 cache misses → higher rank
– More L2 cache misses → lower rank
BPMRI Policy
Every request issue cycle:
  if consecutive requests from the selected core ≥ request send threshold:
    selected core ← highest-ranked core
  issue the oldest request from the selected core
Simulation Methodology
• x86 cycle-accurate simulator
• Baseline processor configuration
– Per core:
• 4-wide issue, out-of-order, 128-entry ROB
• Stream prefetcher (prefetch degree: 4, prefetch distance: 64)
• 32-entry MSHRs
• 512KB, 8-way L2 cache
– Shared:
• On-chip, demand-first FR-FCFS memory controller(s)
• 1, 2, 4 DRAM channels for the 1, 4, and 8-core systems
• 64, 128, 512-entry DRAM request buffers for the 1, 4, and 8-core systems
• DDR3-1600 DRAM, 15-15-15 ns, 8KB row buffer
Simulation Methodology
• Workloads
– 14 most memory-intensive SPEC CPU 2000/2006 benchmarks for the single-core system
– 30 and 15 pseudo-randomly chosen multiprogrammed SPEC 2000/2006 workloads for the 4 and 8-core CMPs
• BAPI's prefetch send threshold:

  Prefetch accuracy (%)   0–40   40–85   85–100
  Threshold                 1      7       27

• BPMRI's request send threshold: 10
• Prefetch accuracy estimation and rank decisions are made every 100K cycles
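The threshold lookup above amounts to a simple bucket function (a sketch using the slide's bucket boundaries; the function name is an assumption):

```python
def prefetch_send_threshold(accuracy_pct):
    """Map estimated prefetch accuracy (%) to BAPI's prefetch send
    threshold, using the buckets from the table above."""
    if accuracy_pct < 40:
        return 1    # inaccurate prefetcher: throttle hard
    if accuracy_pct < 85:
        return 7    # moderately accurate: allow some prefetches per bank
    return 27       # accurate prefetcher: maximize BLP

assert prefetch_send_threshold(20) == 1
assert prefetch_send_threshold(60) == 7
assert prefetch_send_threshold(95) == 27
```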
Performance of BLP-Aware Issue Policies

[Chart: performance normalized to the prefetching baseline, comparing No pref, Pref, and the BLP-aware policies on the 1-core, 4-core, and 8-core systems. The BLP-aware policies improve performance over the prefetching baseline by 8.5% on the 1-core system (BAPI alone), 13.8% on the 4-core system, and 13.6% on the 8-core system.]
Hardware Storage Cost for 4-core CMP

           Cost (bits)
  BAPI       94,368
  BPMRI          72
  Total      94,440

• Total storage: 94,440 bits (≈11.5KB)
– 0.6% of the L2 cache data storage
• The logic is not on the critical path
– The issue decision can be made slower than the processor cycle
Conclusion
• Uncontrolled memory request issue policies limit the level of BLP exploited by the DRAM controller
• BLP-Aware Prefetch Issue
– Increases the BLP of useful requests from each core exposed to the DRAM controller
• BLP-Preserving Multi-core Request Issue
– Ensures that requests from the same core can be serviced in parallel by the DRAM controller
• Simple, with low storage cost
• Significantly improves DRAM throughput and performance for both single and multi-core systems
• Applicable to other memory technologies
Questions?