
Improving Memory Bank-Level Parallelism in the Presence of Prefetching

Chang Joo Lee

Veynu Narasiman

Onur Mutlu*

Yale N. Patt

Electrical and Computer Engineering, The University of Texas at Austin

* Electrical and Computer Engineering, Carnegie Mellon University


Main Memory System

• Crucial to high performance computing

• Made of DRAM chips

• Multiple banks → each bank can be accessed independently


Memory Bank-Level Parallelism (BLP)

[Figure: The DRAM request buffer holds two requests, Req B0 to bank 0 and Req B1 to bank 1 (Req B0 is older). The DRAM controller services them in different banks, so their service times overlap on the timeline and their data return back to back on the data bus → DRAM throughput increased.]
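The timing diagram reduces to a simple rule: requests to different banks are serviced in parallel, while requests to the same bank are serviced one after another. A minimal toy model (not from the slides; the 100-cycle bank latency is an arbitrary placeholder) illustrates the throughput effect:

```python
# Toy timing model: requests to the same bank serialize, requests to
# different banks overlap, so total time is set by the busiest bank.
BANK_LATENCY = 100  # hypothetical DRAM bank access time, in cycles


def total_service_time(request_banks):
    """request_banks lists the destination bank of each outstanding request."""
    per_bank_count = {}
    for bank in request_banks:
        per_bank_count[bank] = per_bank_count.get(bank, 0) + 1
    return max(per_bank_count.values()) * BANK_LATENCY


print(total_service_time([0, 0]))  # both requests to bank 0: 200 cycles (serialized)
print(total_service_time([0, 1]))  # Req B0 -> bank 0, Req B1 -> bank 1: 100 cycles (overlapped)
```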


Memory Latency-Tolerance Mechanisms

• Out-of-order execution, prefetching, runahead, etc.

• Increase outstanding memory requests on the chip
  – Memory-Level Parallelism (MLP) [Glew'98]

• Hope: many requests will be serviced in parallel in the memory system

• Higher performance can be achieved when BLP is exposed to the DRAM controller


Problems

• On-chip buffers, e.g., Miss Status Holding Registers (MSHRs), are limited in size
  – Limits the BLP exposed to the DRAM controller
  – E.g., requests to the same bank fill up the MSHRs

• In CMPs, memory requests from different cores are mixed together in the DRAM request buffers
  – Destroys the BLP of each application running on the CMP

Request issue policies are critical to the BLP exploited by the DRAM controller


Goals and Proposal

1. Maximize the BLP exposed from each core to the DRAM controller → increase DRAM throughput for useful requests

2. Preserve the BLP of each application in CMPs → increase system performance

BLP-Aware Prefetch Issue (BAPI): decides the order in which prefetches are sent from the prefetcher to the MSHRs

BLP-Preserving Multi-core Request Issue (BPMRI): decides the order in which memory requests are sent from each core to the DRAM request buffers


DRAM BLP-Aware Request Issue Policies

• BLP-Aware Prefetch Issue (BAPI)

• BLP-Preserving Multi-core Request Issue (BPMRI)


What Can Limit DRAM BLP?

• Miss Status Holding Registers (MSHRs) are NOT large enough to handle many memory requests [Tuck, MICRO’06]

– MSHRs keep track of all outstanding misses for a core → total number of demand/prefetch requests ≤ total number of MSHR entries

– Complex, latency-critical, and power-hungry → not scalable

Request issue policy to MSHRs affects the level of BLP exploited by DRAM controller


What Can Limit DRAM BLP?

[Figure: The core's MSHRs already hold a demand request to bank 0 (Dem B0, case α), and the prefetch request buffer holds Pref B0, Pref B1, Pref B1 (case β); the MSHRs become full after one more request. With FIFO prefetch issue (as in Intel Core), Pref B0 is sent first, so bank 0 receives 2 requests and bank 1 receives 0, and DRAM service is serialized. With BLP-aware issue, Pref B1 is sent first, so each bank receives 1 request, the service times overlap, and time is saved. α: increasing the number of requests ≠ high DRAM BLP. β: a simple issue policy improves DRAM BLP.]


BLP-Aware Prefetch Issue (BAPI)

• Sends prefetches to the MSHRs based on the current BLP exposed in the memory system
  – Sends a prefetch mapped to the least busy DRAM bank

• Adaptively limits the issue of prefetches based on prefetch accuracy estimation
  – Low prefetch accuracy → fewer prefetches issued to the MSHRs
  – High prefetch accuracy → maximize BLP


Implementation of BAPI

• FIFO prefetch request buffer per DRAM bank
  – Stores prefetches mapped to the corresponding DRAM bank

• MSHR occupancy counter per DRAM bank
  – Keeps track of the number of outstanding requests to the corresponding DRAM bank

• Prefetch accuracy register
  – Stores the estimated prefetch accuracy, updated periodically


BAPI Policy

Every prefetch issue cycle:

1. Make the oldest prefetch to each bank valid only if that bank's MSHR occupancy counter ≤ prefetch send threshold

2. Among valid prefetches, select the request to the bank with the minimum MSHR occupancy counter value (a sketch follows below)
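Combining this policy with the structures from the previous slide, a minimal Python sketch of the issue decision could look like the following. The class and method names, and the idea of modeling the per-bank FIFOs and counters as Python objects, are my own; only the two-step selection rule comes from the slide.

```python
from collections import deque


class BAPI:
    """Sketch of BLP-Aware Prefetch Issue: per-bank FIFO prefetch buffers,
    per-bank MSHR occupancy counters, and a prefetch send threshold."""

    def __init__(self, num_banks, prefetch_send_threshold):
        self.prefetch_buffers = [deque() for _ in range(num_banks)]  # FIFO per DRAM bank
        self.mshr_occupancy = [0] * num_banks   # outstanding requests per bank
        self.threshold = prefetch_send_threshold  # adapted by prefetch accuracy (next slide)

    def enqueue_prefetch(self, bank, prefetch):
        self.prefetch_buffers[bank].append(prefetch)

    def select_prefetch(self):
        """Called every prefetch issue cycle: pick one prefetch to send to the MSHRs."""
        # 1. The oldest prefetch of a bank is valid only if that bank's
        #    MSHR occupancy counter <= prefetch send threshold.
        valid_banks = [b for b, q in enumerate(self.prefetch_buffers)
                       if q and self.mshr_occupancy[b] <= self.threshold]
        if not valid_banks:
            return None
        # 2. Among valid prefetches, pick the one to the least busy bank.
        bank = min(valid_banks, key=lambda b: self.mshr_occupancy[b])
        self.mshr_occupancy[bank] += 1  # now occupies an MSHR entry
        # (decrementing the counter when the request completes is omitted here)
        return self.prefetch_buffers[bank].popleft()
```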


Adaptivity of BAPI

• Prefetch send threshold
  – Reserves MSHR entries for prefetches to different banks
  – Adjusted based on prefetch accuracy (see the sketch below)
    • Low prefetch accuracy → low prefetch send threshold
    • High prefetch accuracy → high prefetch send threshold
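A sketch of the threshold adaptation, assuming the accuracy ranges and threshold values given later in the methodology (slide 21: 0~40% → 1, 40~85% → 7, 85~100% → 27); the function name and the way it is invoked are assumptions.

```python
def prefetch_send_threshold(prefetch_accuracy_pct):
    """Map the estimated prefetch accuracy (updated every interval) to a threshold."""
    if prefetch_accuracy_pct < 40:
        return 1    # inaccurate prefetcher: keep almost all MSHR entries for demands
    elif prefetch_accuracy_pct < 85:
        return 7    # moderately accurate: allow a few prefetches per bank
    else:
        return 27   # accurate prefetcher: prefetches may fill most of the 32 MSHR entries


# Example: after an interval with 90% estimated accuracy
bapi_threshold = prefetch_send_threshold(90)  # -> 27
```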


DRAM BLP-Aware Request Issue Policies

• BLP-Aware Prefetch Issue (BAPI)

• BLP-Preserving Multi-core Request Issue (BPMRI)


BLP Destruction in CMP Systems

• DRAM request buffers are shared by multiple cores
  – To exploit the BLP of a core, that BLP must be exposed to the DRAM request buffers
  – The BLP potential of a core can be destroyed by interference from other cores' requests

Request issue policy from each core to DRAM request buffers affects BLP of each application


Why is DRAM BLP Destroyed?

[Figure: Cores A and B each have two requests, one per bank (Req A0/Req A1 and Req B0/Req B1). With round-robin request issue, the request issuer interleaves the two cores' requests into the shared DRAM request buffers, which serializes each core's requests in the DRAM banks and lengthens both cores' stalls. With BLP-preserving issue, each core's requests enter the DRAM request buffers together and are serviced in parallel in banks 0 and 1, saving cycles for Core A at the cost of increased cycles for Core B.]

Issue policy should preserve DRAM BLP


BLP-Preserving Multi-Core Request Issue (BPMRI)

• Consecutively sends requests from one core to DRAM request buffers

• Limits the maximum number of consecutive requests sent from one core
  – Prevents starvation of memory non-intensive applications

• Prioritizes memory non-intensive applications
  – Impact of delaying requests from a memory non-intensive application > impact of delaying requests from a memory-intensive application


Implementation of BPMRI

• Last-level (L2) cache miss counter per core
  – Stores the number of L2 cache misses from the core

• Rank register per core
  – Fewer L2 cache misses → higher rank
  – More L2 cache misses → lower rank


BPMRI Policy

Every request issue cycle:
  if consecutive requests from the selected core ≥ request send threshold
    then selected core ← highest-ranked core
  issue the oldest request from the selected core (a sketch follows below)
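A minimal Python sketch of BPMRI, combining this policy with the per-core L2 miss counters and rank registers from the previous slide. The class structure, the names, and the handling of cores with empty queues are assumptions; the switch-on-threshold rule and oldest-first issue follow the slide.

```python
from collections import deque


class BPMRI:
    """Sketch of BLP-Preserving Multi-core Request Issue: per-core request
    queues feeding the shared DRAM request buffers."""

    def __init__(self, num_cores, request_send_threshold):
        self.request_queues = [deque() for _ in range(num_cores)]  # oldest request at the front
        self.l2_miss_count = [0] * num_cores  # fewer misses -> higher rank
        self.threshold = request_send_threshold
        self.selected_core = 0
        self.consecutive = 0

    def select_request(self):
        """Called every request issue cycle: pick the next request for the DRAM request buffers."""
        if self.consecutive >= self.threshold:
            # Switch to the highest-ranked core: the one with the fewest L2 misses
            # (most memory non-intensive) that has requests pending. In hardware the
            # ranks are recomputed periodically, not on every switch.
            candidates = [c for c, q in enumerate(self.request_queues) if q]
            if not candidates:
                return None
            self.selected_core = min(candidates, key=lambda c: self.l2_miss_count[c])
            self.consecutive = 0
        queue = self.request_queues[self.selected_core]
        if not queue:
            return None  # simplified: a real issuer would switch away from an idle core
        self.consecutive += 1
        return queue.popleft()  # issue the oldest request from the selected core
```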


Simulation Methodology

• x86 cycle-accurate simulator

• Baseline processor configuration
  – Per core
    • 4-wide issue, out-of-order, 128-entry ROB
    • Stream prefetcher (prefetch degree: 4, prefetch distance: 64)
    • 32-entry MSHRs
    • 512KB 8-way L2 cache
  – Shared
    • On-chip, demand-first FR-FCFS memory controller(s)
    • 1, 2, 4 DRAM channels for the 1, 4, 8-core systems
    • 64, 128, 512-entry DRAM request buffers for the 1, 4, 8-core systems
    • DDR3 1600 DRAM, 15-15-15ns, 8KB row buffer


Simulation Methodology

• Workloads
  – 14 most memory-intensive SPEC CPU 2000/2006 benchmarks for the single-core system
  – 30 and 15 pseudo-randomly chosen multiprogrammed SPEC 2000/2006 workloads for the 4-core and 8-core CMPs

• BAPI's prefetch send threshold (set by estimated prefetch accuracy):

    Prefetch accuracy (%):   0~40   40~85   85~100
    Threshold:                  1       7       27

• BPMRI's request send threshold: 10

• Prefetch accuracy estimation and rank decisions are made every 100K cycles


Performance of BLP-Aware Issue Policies

[Figure: Bar charts of performance normalized to the prefetching baseline (No pref, Pref, BLP-aware) for the 1-core, 4-core, and 8-core systems; on the 1-core system the BLP-aware bar is BAPI alone. The BLP-aware issue policies improve performance over the prefetching baseline by 8.5% (1-core), 13.8% (4-core), and 13.6% (8-core).]


Hardware Storage Cost for 4-core CMP

  Mechanism   Cost (bits)
  BAPI             94,368
  BPMRI                72
  Total            94,440

• Total storage: 94,440 bits (≈11.5 KB) – 0.6% of the L2 cache data storage

• Logic is not on the critical path
  – Issue decisions can be made slower than the processor cycle
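As a quick sanity check on the two numbers above (assuming the 4-core baseline with a 512 KB L2 cache per core from slide 20):

$$\frac{94{,}440\ \text{bits}}{8} = 11{,}805\ \text{bytes} \approx 11.5\ \text{KB}, \qquad \frac{11.5\ \text{KB}}{4 \times 512\ \text{KB}} \approx 0.6\%$$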


Conclusion

• Uncontrolled memory request issue policies limit the level of BLP exploited by DRAM controller

• BLP-Aware Prefetch Issue
  – Increases the BLP of useful requests from each core exposed to the DRAM controller

• BLP-Preserving Multi-core Request Issue
  – Ensures requests from the same core can be serviced in parallel by the DRAM controller

• Simple, low storage cost

• Significantly improve DRAM throughput and performance for both single-core and multi-core systems

• Applicable to other memory technologies


Questions?