1 DIEF: An Accurate Interference Feedback Mechanism for Chip Multiprocessor Memory Systems Magnus...

1

DIEF: An Accurate Interference Feedback Mechanism for Chip Multiprocessor Memory Systems

Magnus Jahre†, Marius Grannaes† ‡ and Lasse Natvig†

† Norwegian University of Science and Technology

‡ Energy Micro

2

Chip Multiprocessor Resources

• Hardware-controlled, shared resources
  – Interconnect bandwidth
  – Shared cache capacity
  – Memory bus bandwidth
  – Memory capacity is allocated by the operating system

Interference can occur in all shared units

[Figure: CPU 1 to CPU 4, each with a private D-Cache and I-Cache (Private Memory System), connected through the Interconnect to the Shared Cache, Memory Controller, Memory Bus, and Main Memory (Shared Memory System).]

Current CMP implementations do not take interference into account

3

Why Control Resource Allocation?

Provide predictable performance

Support OS scheduler assumptions

Cloud: Fulfill Service Level Agreement

4

Resource Allocation Tasks

Measurement

Allocation (Policy)

Enforcement (Mechanism)

Focus of this work

5

Resource Allocation Baselines

Baseline = Interference-free configuration

Quantify performance impact from interference

Private Mode and Shared Mode

6

Multiprogrammed Baseline

• All processes in a workload run concurrently

• Static and equal partitioning of all shared resources

[Figure: Multiprogrammed Baseline. The Shared Cache and the Memory Bus are each divided 50% to Program A and 50% to Program B.]

7

Single Program Baseline

• The process is run alone in one core

• All other cores are idle

• Exclusive access to all shared resources

[Figure: Single Program Baseline. Program A and Program B are each run alone, and each gets 100% of the Shared Cache and the Memory Bus.]

8

Baseline Weaknesses

• Multiprogrammed Baseline
  – Only accounts for interference in partitioned resources
  – Static and equal division of DRAM bandwidth does not give equal latency
  – Complex relationship between resource allocation and performance

• Single Program Baseline
  – Does not exist in shared mode

Dynamic Interference Estimation Framework (DIEF)

9

Outline

• Introduction

• Dynamic Interference Estimation Framework
  – Shared Cache
  – Memory Bus
  – On-chip interconnect

• Results

• Summary

10

Interference Estimation

Full-System Interference Estimation: Aggregate interference from different units

Common unit of measure: Average Latency (Clock Cycles)

DIEF: General, component-based framework
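The aggregation idea above can be sketched in a few lines. This is a minimal illustration, not the DIEF hardware: the function name, the unit names, and the assumption that the private mode latency estimate is the measured shared mode latency minus the summed per-unit interference are ours, chosen to match the "common unit of measure" stated on the slide.

```python
def private_latency_estimate(shared_latency, interference_by_unit):
    """Estimate the private mode latency of a core, in clock cycles.

    shared_latency: average memory latency measured in shared mode.
    interference_by_unit: average interference estimated by each shared
        unit (hypothetical unit names), all expressed in clock cycles.
    """
    # Because every unit reports in the same unit of measure, the
    # per-unit estimates can simply be added together.
    total_interference = sum(interference_by_unit.values())
    return shared_latency - total_interference


estimate = private_latency_estimate(
    250.0,
    {"interconnect": 10.0, "shared_cache": 40.0, "memory_bus": 60.0},
)
```

Using a single latency-based unit is what lets estimators for new shared components be added without changing the aggregation step.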

11

Interference Definition

[Figure: bar diagram relating the Shared Mode Latency, the Private Mode Latency, the Interference, the Private Mode Latency Estimate, the Private Mode Latency Measurement, and the Estimate Error.]
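The quantities named on this slide can be written down as simple differences. The sketch below is our reading of the diagram, not a definition taken from the slides: it assumes interference is the shared mode latency minus the private mode latency, and that the estimate error is the private mode latency estimate minus the private mode latency measurement. The function names are hypothetical.

```python
def interference(shared_mode_latency, private_mode_latency):
    """Extra latency (clock cycles) caused by sharing the memory system."""
    return shared_mode_latency - private_mode_latency


def estimate_error(private_estimate, private_measurement):
    """Absolute error of the private mode latency estimate (clock cycles)."""
    return private_estimate - private_measurement


def relative_estimate_error(private_estimate, private_measurement):
    """Estimate error as a fraction of the measured private mode latency."""
    return estimate_error(private_estimate, private_measurement) / private_measurement
```

With these assumed definitions, a positive error means the private mode latency was overestimated, so the interference was underestimated.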

12

Shared Cache Interference

[Figure: Cache Accesses from CPU 0 and CPU 1 to the Shared Cache, with Auxiliary Tag Directories; the example uses blocks A, B, M, and N.]

13


[Figure (continued): the same Cache Accesses, Shared Cache, and tag directories after an access to a new block C.]

Eviction may not be interference

14


[Figure (continued): the same Cache Accesses, Shared Cache, and tag directories after further accesses; one lookup is labeled Hit and the other Miss.]

Interference cost = miss penalty
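The shared cache slides can be illustrated with a small model. This is a sketch under stated assumptions, not the DIEF implementation: it models a single LRU cache set, gives each CPU an auxiliary tag directory with the same number of ways as the shared set, and counts an access as an interference miss when it misses in the shared cache but hits in that CPU's tag directory. The class and function names are hypothetical.

```python
from collections import OrderedDict


class LruSet:
    """One cache set with LRU replacement, tracking tags only."""

    def __init__(self, ways):
        self.ways = ways
        self.tags = OrderedDict()  # oldest entry first

    def access(self, tag):
        """Return True on a hit; insert the tag (evicting LRU) on a miss."""
        hit = tag in self.tags
        if hit:
            self.tags.move_to_end(tag)
        else:
            if len(self.tags) >= self.ways:
                self.tags.popitem(last=False)
            self.tags[tag] = True
        return hit


def count_interference_misses(accesses, ways, num_cpus):
    """accesses: iterable of (cpu, block) pairs in program order."""
    shared = LruSet(ways)
    atds = [LruSet(ways) for _ in range(num_cpus)]
    misses = [0] * num_cpus
    for cpu, block in accesses:
        # Assumption: CPUs do not share data, so shared tags are per CPU.
        shared_hit = shared.access((cpu, block))
        atd_hit = atds[cpu].access(block)
        if atd_hit and not shared_hit:
            misses[cpu] += 1  # would have hit when running alone
    return misses


# CPU 1 evicts CPU 0's block A, so the re-access is an interference miss.
with_sharing = count_interference_misses(
    [(0, "A"), (0, "B"), (1, "M"), (1, "N"), (0, "A")], ways=2, num_cpus=2)
# CPU 0 evicts its own block A: a miss, but not interference.
self_eviction = count_interference_misses(
    [(0, "A"), (0, "B"), (0, "C"), (0, "A")], ways=2, num_cpus=2)
```

Multiplying each CPU's interference miss count by the miss penalty gives its interference latency, in line with "Interference cost = miss penalty".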

15

Bus Interference Requirements

• Out-of-order memory bus scheduling
• Cache misses and cache hits that occur only in shared mode
• Shared cache writebacks

Computing private latency based on shared mode queue contents is difficult

Emulate private scheduling in the shared mode

16

[Figure: Memory Latency Estimation Buffer. The Shared Bus Queue feeds a buffer that records requests A, B, C, D in Arrival Order and emulates their Execution Order using a Head Pointer, a Latency Lookup Table, and Open Page Emulation Registers for Bank 0 and Bank 1 (values 15 and 32).]

Bank/Page Mapping: A → (0,15), B → (0,19), C → (0,15), D → (1,32)

Estimated Queue Latency: 120 + 40 + 40 = 200
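The example on this slide can be reproduced with a short emulation. The sketch below makes several assumptions that the slide does not state: the lookup table holds 40 cycles for a page hit and 120 cycles for a page conflict, the emulated private mode scheduler serves the oldest page hit first and otherwise the oldest request, and the estimated queue latency of D is the sum of the service latencies of the requests ahead of it. All names are hypothetical.

```python
from dataclasses import dataclass

PAGE_HIT_LATENCY = 40        # assumed lookup table entry (clock cycles)
PAGE_CONFLICT_LATENCY = 120  # assumed lookup table entry (clock cycles)


@dataclass(frozen=True)
class Request:
    name: str
    bank: int
    page: int


def estimate_queue_latency(pending, open_pages):
    """Emulate private mode scheduling of the requests in `pending`.

    pending: requests in arrival order.
    open_pages: emulated open page per bank (bank -> page).
    Returns (total latency in clock cycles, emulated execution order).
    """
    open_pages = dict(open_pages)  # do not modify the caller's registers
    queue = list(pending)
    total = 0
    order = []
    while queue:
        # Prefer the oldest request that hits an open page.
        hit = next((r for r in queue if open_pages.get(r.bank) == r.page), None)
        request = hit if hit is not None else queue[0]
        queue.remove(request)
        if hit is not None:
            total += PAGE_HIT_LATENCY
        else:
            total += PAGE_CONFLICT_LATENCY
            open_pages[request.bank] = request.page  # new page is now open
        order.append(request.name)
    return total, order


a, b, c = Request("A", 0, 15), Request("B", 0, 19), Request("C", 0, 15)
latency, order = estimate_queue_latency([a, b, c], {0: 15, 1: 32})
```

Under these assumptions A and C hit the open page in bank 0 and B causes a page conflict, which gives the 200 cycles shown on the slide for the requests queued ahead of D.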

17

Interconnect Interference

[Figure: requests A to F from CPU 0 and CPU 1 queued for L2 Bank 0 and L2 Bank 1, with per-CPU Interference Counters.]

CPU 1 delays CPU 0
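One way to picture the interference counters is a per-cycle accounting rule. The sketch below is an assumption about how such counters could work, not a description of the DIEF hardware: in every cycle, each CPU that has a request waiting while a different CPU's request is granted adds one cycle to its own counter. The function name and the log format are hypothetical.

```python
def count_interconnect_interference(cycle_log, num_cpus):
    """cycle_log: one (granted_cpu, waiting_cpus) entry per clock cycle."""
    counters = [0] * num_cpus
    for granted_cpu, waiting_cpus in cycle_log:
        for cpu in waiting_cpus:
            # Waiting behind your own request also happens in private
            # mode, so only delays caused by another CPU are counted.
            if cpu != granted_cpu:
                counters[cpu] += 1
    return counters


# CPU 1 holds the channel for two cycles while CPU 0 waits.
counters = count_interconnect_interference(
    [(1, [0]), (1, [0]), (0, [])], num_cpus=2)
```

Dividing a counter by the number of requests the CPU issued would express the result as an average latency in clock cycles.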

18

Outline

• Introduction

• Dynamic Interference Estimation Framework

• Results

• Summary

19

Relative Estimation Errors

[Figure: Average Relative Error (y-axis, -4 % to 8 %) for the Crossbar and Ring interconnects with 4, 8, and 16 cores; each group has bars labeled 1C, 2C, and 4C.]

20

RMS Error Breakdown

[Figure: Sum of Average Per-Benchmark Per-Unit RMS Error in clock cycles (y-axis, 0 to 100) for the Crossbar and Ring interconnects with 4, 8, and 16 cores; each group has bars labeled 1C, 2C, and 4C. Legend: Bus Queue, Bus Service, Interconnect Request Queue.]

Remaining units contribute less than 2 clock cycles

21

Auxiliary Tag Directory Accuracy

[Figure: Relative Miss Estimate Error (y-axis, -2 % to 2 %) for the Crossbar and Ring interconnects with 4, 8, and 16 cores; each group has bars labeled 1C, 2C, and 4C.]

22

Outline

• Introduction

• Dynamic Interference Estimation Framework

• Results

• Summary

23

Summary

• Memory system interference causes unpredictable performance

• DIEF provides
  – Accurate private mode latency estimates
  – Accurate shared mode latency measurements

• Future opportunities
  – Guiding dynamic optimizations
  – Guiding OS scheduling decisions
  – Debugging and optimization

24

Thank you!

Visit our website: http://research.idi.ntnu.no/multicore/

Questions?

25

Experiment Methodology

• M5 simulator
  – Extended with crossbar and ring on-chip interconnect models
  – DDR2 memory bus model

• Randomly generated workloads of SPEC2000 benchmarks
  – 40 4-core workloads
  – 20 8-core workloads
  – 10 16-core workloads
