Yasuo Ishii, Kouhei Hosokawa, Mary Inaba , Kei Hiraki

37
High Performance Memory Access Scheduling Using Compute-Phase Prediction and Writeback-Refresh Overlap Yasuo Ishii, Kouhei Hosokawa, Mary Inaba, Kei Hiraki

description

High Performance Memory Access Scheduling Using Compute-Phase Prediction and Writeback-Refresh Overlap. Yasuo Ishii, Kouhei Hosokawa, Mary Inaba , Kei Hiraki. Design Goal: High Performance Scheduler. Three Evaluation Metrics Execution Time (Performance) Energy-Delay Product - PowerPoint PPT Presentation

Transcript of Yasuo Ishii, Kouhei Hosokawa, Mary Inaba , Kei Hiraki

Page 1: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

High PerformanceMemory Access SchedulingUsing Compute-Phase Prediction and Writeback-Refresh Overlap

Yasuo Ishii, Kouhei Hosokawa, Mary Inaba, Kei Hiraki

Page 2: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

Design Goal: High Performance Scheduler

Three Evaluation Metrics Execution Time (Performance) Energy-Delay Product Performance-Fairness Product

We found several trade-offs among these metrics The best execution time (performance) configuration does not

show the best energy-delay product

Page 3: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

Contribution Proposals

Compute-Phase Prediction Thread-priority control technique for multi-core processor

Writeback-Refresh Overlap Mitigates refresh penalty on multi-rank memory system

Optimizations MLP-aware priority control Memory bus reservation Activate throttling

Page 4: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

Outline Proposals

Compute-Phase Prediction Thread-priority control technique for multi-core processor

Writeback-Refresh Overlap Mitigates refresh penalty on multi-rank memory system

Optimizations MLP-aware priority control Memory bus reservation Activate throttling

Page 5: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

Non-priority requests

Priority requests

Thread-priority Control Thread-priority control is beneficial for multi-core chips

Network Fair Queuing[Nesbit+ 2006], Atlas[Kim+ 2010], Thread Cluster Memory Scheduling[Kim+ 2010]

Typically, policies are updated periodically (Each epoch contains millions of cycles in TCM)

Compute-Intensive

Memory-Intensive

Memory(DRAM)

Memory-Intensive

Compute-Intensive

priority status is not yet changed

Core 0

Core 1

high priority

Page 6: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

Example: Memory Traffic of Blackscholes

One application contains both memory-intensive phases and compute-intensive phases

0102030405060708090

100

Miss

per

Kilo

Inst

ructi

ons (

MPK

I)

Page 7: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

Phase-prediction result of TCM

We think this inaccurate classification is caused by the conventional periodically updating prediction strategy

0102030405060708090

100

Miss

per

Kilo

Inst

ructi

ons (

MPK

I)

Compute-phase Memory-phase

Page 8: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

Contribution 1: Compute-Phase Prediction “Distance-based phase prediction” to realize fine-grain

thread priority control scheme

Core Memory(DRAM)

Distance = # of committed instructions between 2 memory requests

Core DRAM

Distance of req. exceed Θinterval

Compute-phase

Core DRAM

Non-distant of req. continue Θdistant times

Memory-phase

Page 9: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

Phase-prediction of Compute-Phase Prediction

Prediction result nearly matches the optimal classification Improves fairness and system throughput

0

10

20

30

40

50

60

70

80

90

100

Miss

per

Kilo

Inst

ructi

ons (

MPK

I)

Page 10: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

Outline Proposals

Compute-Phase Prediction Thread-priority control technique for multi-core processor

Writeback-Refresh Overlap Mitigates refresh penalty on multi-rank memory system

Optimizations MLP-aware priority control Memory bus reservation Activate throttling

Page 11: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

DRAM refreshing penalty

DRAM refreshing increases the stall time of read requests Stall of read requests increases the execution time Shifting refresh timing cannot reduce the stall time

This increases the threat of stall time for read requests

tREFI tRFCRank-0

Rank-1

Mem. BusStall of read requestsIncreases the threat of stall

Page 12: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

Contribution 2: Writeback-Refresh Overlap

Typically, modern controllers divide read phases and write phases to reduce bus turnaround penalties

Overlaps refresh command with the write phase Avoid to increasing the stall time of read requests

R

Rank-0

Rank-1

Mem. Bus WRead requests stall

R W R W R W R W R W R W R

Page 13: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

Outline Proposals

Compute-Phase Prediction Thread-priority control technique for multi-core processor

Writeback-Refresh Overlap Mitigates refresh penalty on multi-rank memory system

Optimizations MLP-aware priority control Memory bus reservation Activate throttling

Page 14: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

Optimization 1: MLP-Aware Priority Control

Prioritizes Low-MLP requests to reduce the stall time. This priority is higher than the priority control of compute-

phase predictions Minimalist [Kaseridis+ 2011] also uses MLP-aware scheduling

load(1)load(0)Core 0

Memory(DDR3)

Core 1

load(1)

Request Queue

gives extra priority

Stall

Stallload(1)load (0)load(1)load(1)load(1)

Page 15: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

Optimization 2: Memory Bus Reservation Reserves HW resources to reduce the latency of critical

read requests Data bus for read and write (considering tRTR/tWTR penalty)

This method improves the system throughput and fairness

Command-Rank-0 ACT RD

tRAS

Command-Rank-1

BLRD

RD

RD RD

Additional penalty

Memory bus

Page 16: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

Optimization 3: Activate Throttling Controls precharge / activation based on tFAW tracking

Too early precharge command does not contribute to the latency reduction of following activate command

Command-Rank-0

Memory clock

tFAW tRP

PREACT

1

ACT

2

ACT

3

ACT

4

Activate throttling increases the chance of row-hit access

ACT

Row-conflict

Page 17: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

Optimization: Other Techniques

Aggressive precharge Reduces row-conflict penalties

Force refreshing When tREFI timer has expired, the force refresh is issued

Adds extra priority to the timeout requests Promotes old read request to the higher priority Eliminates the starvation

Page 18: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

Implementation: Optimized Memory Controller

The optimized controller does not require large HW cost We mainly extend thread-priority control and controller state

through our new scheduling technique

Read Queue

Write Queue

Refresh Queue

MU

X

ControllerState

Refresh Timer

Processor Core

DDR3Devices

Thread PriorityControl

EnhancedController

State

Adds priority bit for each request

Extends controller state (2-bit)

Page 19: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

Implementation: Hardware Cost

Per-channel resource (341.25B) Compute-Phase Prediction (258B) Writeback-Refresh Overlap (2-bit) Other features (83B)

Per-request resource (3-bit) Priority bit, Row-hit bit, Timeout flag bit

Overall Hardware Cost: 2649B

Page 20: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

Evaluation Results

Performance Improvement

Page 21: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

Evaluation Results

Exec time : 11.2% PFP : 20.9% EDP : 20.2%Performance Improvement

Max Slowdown : 10.8%

Page 22: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

Evaluation Results

Performance Improvement

Max : 12.9% Max : 26.2%Max : 14.9%

Exec time : 11.2% PFP : 20.9% EDP : 20.2%Max Slowdown : 10.8%

Page 23: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

Evaluation Results1c

han

4cha

n

1cha

n

4cha

n

1cha

n

4cha

n

1cha

n

4cha

n

1cha

n

4cha

n

1cha

n

4cha

n

4cha

n

4cha

n

1cha

n

4cha

n

1cha

n

4cha

n /

MT-canneal

MT-canneal

bl-bl-fr-fr

bl-bl-fr-fr

c1-c1 c1-c2 c1-c1-c2-c2

c1-c1-c2-c2

c2 c2 fa-fa-fe-fe

fa-fa-fe-fe

fl-fl-sw-sw-c2-c2-fe-fe

fl-fl-sw-sw-c2-c2-fe-

fe-bl-bl-fr-fr-c1-c1-st-st

fl-sw-c2-c2

fl-sw-c2-c2

st-st-st-st

st-st-st-st

Overall

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1Total Execution Time

FCFS Close Proposed

Page 24: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

Evaluation Results1c

han

4cha

n

1cha

n

4cha

n

1cha

n

4cha

n

1cha

n

4cha

n

1cha

n

4cha

n

1cha

n

4cha

n

4cha

n

4cha

n

1cha

n

4cha

n

1cha

n

4cha

n /

MT-canneal

MT-canneal

bl-bl-fr-fr

bl-bl-fr-fr

c1-c1 c1-c2 c1-c1-c2-c2

c1-c1-c2-c2

c2 c2 fa-fa-fe-fe

fa-fa-fe-fe

fl-fl-sw-sw-c2-c2-fe-fe

fl-fl-sw-sw-c2-c2-fe-

fe-bl-bl-fr-fr-c1-c1-st-st

fl-sw-c2-c2

fl-sw-c2-c2

st-st-st-st

st-st-st-st

Overall

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1Total Execution Time

FCFS Close Proposed

Page 25: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

Evaluation Results1c

han

4cha

n

1cha

n

4cha

n

1cha

n

4cha

n

1cha

n

4cha

n

1cha

n

4cha

n

1cha

n

4cha

n

4cha

n

4cha

n

1cha

n

4cha

n

1cha

n

4cha

n /

MT-canneal

MT-canneal

bl-bl-fr-fr

bl-bl-fr-fr

c1-c1 c1-c2 c1-c1-c2-c2

c1-c1-c2-c2

c2 c2 fa-fa-fe-fe

fa-fa-fe-fe

fl-fl-sw-sw-c2-c2-fe-fe

fl-fl-sw-sw-c2-c2-fe-

fe-bl-bl-fr-fr-c1-c1-st-st

fl-sw-c2-c2

fl-sw-c2-c2

st-st-st-st

st-st-st-st

Overall

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1Total Execution Time

FCFS Close Proposed

Page 26: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

Evaluation Results

00.10.20.30.40.50.60.70.80.9

11c

han

4cha

n

1cha

n

4cha

n

1cha

n

4cha

n

1cha

n

4cha

n

1cha

n

4cha

n

1cha

n

4cha

n

4cha

n

4cha

n

1cha

n

4cha

n

1cha

n

4cha

n /

MT-canneal

MT-canneal

bl-bl-fr-fr

bl-bl-fr-fr

c1-c1 c1-c2 c1-c1-c2-c2

c1-c1-c2-c2

c2 c2 fa-fa-fe-fe

fa-fa-fe-fe

fl-fl-sw-sw-c2-c2-fe-fe

fl-fl-sw-sw-c2-c2-fe-fe-bl-

bl-fr-fr-c1-c1-st-st

fl-sw-c2-c2

fl-sw-c2-c2

st-st-st-st

st-st-st-st

Overall

0

0.2

0.4

0.6

0.8

1

FCFS Close Proposed

Max

Slowdow

nEDP

Page 27: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

11.2% Performance improvement from FCFS consists of Close Page Policy: 4.2% Baseline Optimization: 4.9% Proposal Optimization: 1.9%

Baseline optimization accomplishes a 9.1% improvement

Optimization Breakdown

Performance...0%

2%

4%

6%

8%

10%

ProposedOptimization

BaselineOptimization

Close Page

FCFS(base)

4.2%

4.9%

1.9%

・ Timeout Detection・Write Queue Spill Prevention・ Auto-Precharge・Max Activate-Number Restriction

ProposalsCompute-phase predictionWriteback-refresh overlap

OptimizationsMLP-aware priority control Memory bus reservationActivate throttling

Page 28: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

Optimization Breakdown 11.2% Performance improvement

from FCFS consists of Close Page Policy: 4.2% Baseline Optimization: 4.9% Proposal Optimization: 1.9%

Baseline optimization accomplishes a 9.1% improvement

Performance...0%

2%

4%

6%

8%

10%

4.2%

4.9%

1.9%ProposedOptimization

BaselineOptimization

Close Page

FCFS(base)

Page 29: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

Performance/EDP summary

19.0619.5620.0620.5621.0621.56

2941

2991

3041

3091

3141

Close Page Policy

Optimization baseline

EDP

Exec time

(3065, 20.58)

(3054, 20.25)

(3173, 21.7)

(2987, 20.08)

(2975, 19.79)

(3012, 19.71)

(2990, 19.17)

(2981, 19.11)

Y.Moon

K.Fang

K. kuroyanagi C. Li

L. Chen

T. Ikeda Ours

Page 30: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

Performance/EDP summary

19.0619.5620.0620.5621.0621.56

2941

2991

3041

3091

3141

Close Page Policy

Final score

EDP

Exec time

(3065, 20.58)

(3054, 20.25)

(3173, 21.7)

(2987, 20.08)

(2975, 19.79)

(3012, 19.71)

(2990, 19.17)

(2941, 19.06)

Optimization baseline

Y.Moon

K.Fang

K. kuroyanagi C. Li

L. Chen

T. Ikeda Ours

(2981, 19.11)

Page 31: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

Performance/EDP summary

19.0619.5620.0620.5621.0621.56

2941

2991

3041

3091

3141

Close Page Policy

(3065, 20.58)

(3054, 20.25)

(3173, 21.7)

(2987, 20.08)

(2975, 19.79)

(2990, 19.17)

(2941, 19.06)Final score

EDP

Exec time

(3012, 19.71)

Optimization baseline

Y.Moon

K.Fang

K. kuroyanagi C. Li

L. Chen

T. Ikeda Ours

(2981, 19.11)

Page 32: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

Optimization History

18.7518.9519.1519.3519.5519.75

2940

2950

2960

2970

2980

2990

3000

3010 EDP

Exec time

Optimization baseline

Final score

(2975, 19.79)

(3012, 19.71)

(2990, 19.17)

(2981, 19.11)

(2941, 19.06)Y.Moon

K.Fang

K. kuroyanagi

Ours

Page 33: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

18.7518.9519.1519.3519.5519.75

2940

2950

2960

2970

2980

2990

3000

3010 EDP

Exec time

Optimization baseline

Opt 1: MLP-aware priority controlOpt 2: Mem bus resevation Opt 3: ACT throttling

Final score

Optimization HistoryY.Moon

K.Fang

K. kuroyanagi

Ours

(2975, 19.79)

(3012, 19.71)

(2990, 19.17)

(2981, 19.11)

(2941, 19.06)

(2953, 18.75)

Page 34: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

18.7518.9519.1519.3519.5519.75

2940

2950

2960

2970

2980

2990

3000

3010 EDP

Exec time

Optimization baseline

Compute-phase predictionWriteback-refresh overlap

Final score

Optimization HistoryY.Moon

K.Fang

K. kuroyanagi

Ours (2953, 18.75)

(2975, 19.79)

(3012, 19.71)

(2990, 19.17)

(2981, 19.11)

(2941, 19.06)

Opt 1: MLP-aware priority controlOpt 2: Mem bus resevation Opt 3: ACT throttling

Page 35: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

18.7518.9519.1519.3519.5519.75

2940

2950

2960

2970

2980

2990

3000

3010 EDP

Exec time

Optimization baseline

Compute-phase predictionWriteback-refresh overlap

Final score

Optimization HistoryY.Moon

K.Fang

K. kuroyanagi

Ours (2953, 18.75)

(2975, 19.79)

(3012, 19.71)

(2990, 19.17)

(2981, 19.11)

(2941, 19.06)

Opt 1: MLP-aware priority controlOpt 2: Mem bus resevation Opt 3: ACT throttling

Page 36: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

Conclusion

High Performance Memory Access Scheduling Proposals

Novel thread-priority control method: Compute-phase prediction Cost-effective refreshing method: Writeback-refresh overlap

Optimization strategies MLP-aware priority control, Memory bus reservation, Activate

Throttling, Aggressive precharge, force refresh, timeout handling

The optimized scheduler reduces exec time by 11.2% Several trade-offs between performance and EDP Aggregating the various optimization strategies is

most important for the DRAM system efficiency

Page 37: Yasuo  Ishii,  Kouhei  Hosokawa, Mary  Inaba , Kei  Hiraki

Q&A