1
DIEF: An Accurate Interference Feedback Mechanism for Chip Multiprocessor Memory Systems
Magnus Jahre†, Marius Grannaes†‡ and Lasse Natvig†
† Norwegian University of Science and Technology
‡ Energy Micro
2
Chip Multiprocessor Resources
• Hardware-controlled, shared resources
– Interconnect bandwidth
– Shared cache capacity
– Memory bus bandwidth
– Memory capacity is allocated by the operating system
Interference can occur in all shared units
[Figure: CPU 1 to CPU 4, each with a private I-Cache and D-Cache (Private Memory System), connected through the Interconnect to the Shared Cache, Memory Controller, Memory Bus, and Main Memory (Shared Memory System)]
Current CMP implementations do not take interference into account
3
Why Control Resource Allocation?
Provide predictable performance
Support OS scheduler assumptions
Cloud: Fulfill Service Level Agreement
4
Resource Allocation Tasks
Measurement
Allocation (Policy)
Enforcement (Mechanism)
Focus of this work
5
Resource Allocation Baselines
Baseline = Interference-free configuration
Quantify performance impact from interference
Private Mode and Shared Mode
6
Multi-Programmed Baseline
• All processes in a workload run concurrently
• Static and equal partitioning of all shared resources
[Figure: Multiprogrammed Baseline. The Shared Cache and the Memory Bus are each divided 50% to Program A and 50% to Program B]
7
Single Program Baseline
• The process is run alone in one core
• All other cores are idle
• Exclusive access to all shared resources
[Figure: Single Program Baseline. Program A gets 100% of the Shared Cache and the Memory Bus; in a separate run, Program B gets 100% of both]
8
Baseline Weaknesses
• Multiprogrammed Baseline
– Only accounts for interference in partitioned resources
– Static and equal division of DRAM bandwidth does not give equal latency
– Complex relationship between resource allocation and performance
• Single Program Baseline
– Does not exist in shared mode
Dynamic Interference Estimation Framework (DIEF)
9
Outline
• Introduction
• Dynamic Interference Estimation Framework
– Shared Cache
– Memory Bus
– On-chip interconnect
• Results
• Summary
10
Interference Estimation
Full-System Interference Estimation: Aggregate interference from different units
Common unit of measure: Average Latency (Clock Cycles)
DIEF: General, component-based framework
11
Interference Definition
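[Figure: latency bars labeled Shared Mode Latency, Private Mode Latency, Private Mode Latency Measurement, and Private Mode Latency Estimate]
Interference: the gap between the Shared Mode Latency and the Private Mode Latency
Estimate Error: the gap between the Private Mode Latency Estimate and the Private Mode Latency Measurement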
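The definitions on this slide, together with the idea of aggregating per-unit interference in clock cycles, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the unit names, and the sample numbers are hypothetical, and normalizing the error by the measured private mode latency is an assumption.

```python
def private_latency_estimate(shared_latency, unit_interference):
    """Estimate private mode latency (clock cycles).

    shared_latency: average latency measured in shared mode.
    unit_interference: dict mapping a shared unit name to the
        average interference (clock cycles) attributed to it.
    """
    # Interference from all units is expressed in the same unit
    # (clock cycles), so it can simply be summed and subtracted.
    return shared_latency - sum(unit_interference.values())


def relative_error(estimate, measurement):
    """Signed estimate error relative to the measured private latency (assumed normalization)."""
    return (estimate - measurement) / measurement


# Hypothetical numbers, for illustration only.
interference = {"interconnect": 10, "shared_cache": 40, "memory_bus": 50}
estimate = private_latency_estimate(300, interference)
print(estimate)                       # estimated private mode latency
print(relative_error(estimate, 190))  # positive value means overestimation
```

A positive relative error here means the framework overestimated the private mode latency; a negative one means it underestimated it.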
12
Shared Cache Interference
[Figure: the Shared Cache next to per-CPU Auxiliary Tag Directories (CPU 0, CPU 1), with a sequence of Cache Accesses to blocks A, B, M, and N]
13
Shared Cache Interference
[Figure: the same Shared Cache and Auxiliary Tag Directories (CPU 0, CPU 1) after an access to block C evicts a block]
Eviction may not be interference
14
Shared Cache Interference
[Figure: the same Shared Cache and Auxiliary Tag Directories (CPU 0, CPU 1) after a further access to block B; one lookup is labeled Hit and the other Miss]
Interference cost = miss penalty
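The three Shared Cache Interference slides can be illustrated with a small sketch. It assumes the common formulation of an auxiliary tag directory: a per-CPU, tag-only copy of the cache that tracks what the cache would hold if that CPU ran alone. An access that misses in the shared cache but hits in the CPU's own directory is then counted as an interference miss and charged the miss penalty. The single fully associative LRU set, the class and function names, and the penalty value are all assumptions made for illustration.

```python
from collections import OrderedDict


class LRUSet:
    """One fully associative set with LRU replacement (tags only)."""

    def __init__(self, ways):
        self.ways = ways
        self.tags = OrderedDict()

    def access(self, tag):
        """Return True on a hit; insert the tag (evicting the LRU entry) on a miss."""
        hit = tag in self.tags
        if hit:
            self.tags.move_to_end(tag)
        else:
            if len(self.tags) >= self.ways:
                self.tags.popitem(last=False)  # evict least recently used
            self.tags[tag] = True
        return hit


def interference_cycles(trace, ways, miss_penalty):
    """trace: list of (cpu, block) accesses in shared mode order."""
    shared = LRUSet(ways)
    atds = {}      # one auxiliary tag directory per CPU
    cycles = {}    # interference charged to each CPU
    for cpu, block in trace:
        atd = atds.setdefault(cpu, LRUSet(ways))
        cycles.setdefault(cpu, 0)
        shared_hit = shared.access((cpu, block))  # programs share no data
        private_hit = atd.access(block)
        if private_hit and not shared_hit:
            # Would have hit when running alone: an interference miss.
            cycles[cpu] += miss_penalty
    return cycles


# CPU 1's blocks M and N push B out of the shared cache.
print(interference_cycles(
    [(0, "A"), (0, "B"), (1, "M"), (1, "N"), (0, "B")], ways=2, miss_penalty=120))
# A was also evicted in CPU 0's own directory, so its miss is not interference.
print(interference_cycles(
    [(0, "A"), (0, "B"), (1, "M"), (0, "C"), (0, "A")], ways=2, miss_penalty=120))
```

The second trace shows why an eviction may not be interference: the block would have been replaced even if the CPU had the cache to itself.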
15
Bus Interference Requirements
• Out-of-order memory bus scheduling
• Shared-mode-only cache misses and cache hits
• Shared cache writebacks
Computing private latency based on shared mode queue contents is difficult
Emulate private scheduling in the shared mode
16
[Figure: Memory Latency Estimation Buffer. A Shared Bus Queue with a Head Pointer tracks requests in Arrival Order and Execution Order; a Latency Lookup Table and Open Page Emulation Registers for Bank 0, Bank 1, ... (pages 15 and 32 shown) emulate private mode scheduling]
Bank/Page Mapping: A → (0,15), B → (0,19), C → (0,15), D → (1,32)
Estimated Queue Latency = 120 + 40 + 40 = 200
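The idea of emulating private mode scheduling with open page emulation registers and a latency lookup table can be sketched as below. This is a simplified illustration, not the hardware described on the slide: the two latency values and their meaning (page hit versus page conflict), the function name, and the request order used in the example are assumptions.

```python
# Assumed latency lookup table (clock cycles); the values are illustrative.
PAGE_HIT = 40
PAGE_CONFLICT = 120


def emulate_private_latencies(requests, open_pages):
    """Walk requests in emulated private mode execution order.

    requests: list of (name, bank, page) tuples.
    open_pages: dict mapping bank -> page currently open in that bank.
    Returns a list of (name, queue_latency, service_latency).
    """
    open_pages = dict(open_pages)  # emulation registers; do not mutate caller's copy
    results = []
    queue_latency = 0
    for name, bank, page in requests:
        # A request to the open page of its bank is a page hit.
        service = PAGE_HIT if open_pages.get(bank) == page else PAGE_CONFLICT
        open_pages[bank] = page  # the accessed page is now open
        results.append((name, queue_latency, service))
        # Later requests wait for every request served before them.
        queue_latency += service
    return results


# Bank/page mapping taken from the slide; the execution order is assumed.
order = [("A", 0, 15), ("C", 0, 15), ("D", 1, 32), ("B", 0, 19)]
for row in emulate_private_latencies(order, {0: 15, 1: 32}):
    print(row)
```

Each request's estimated queue latency is the sum of the emulated service latencies of the requests ahead of it, which is the same kind of sum the slide shows for the estimated queue latency.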
17
Interconnect Interference
[Figure: CPU 0 and CPU 1 send requests (A to F) through the interconnect to L2 Bank 0 and L2 Bank 1; per-CPU Interference Counters record the delay]
CPU 1 delays CPU 0
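One way to picture per-CPU interference counters is the sketch below. It assumes a single shared channel that serves requests first come, first served with a fixed transfer time, and it charges a CPU only for the cycles it waits while another CPU's request occupies the channel. The arbitration policy, the transfer time, and the function name are assumptions; the slide does not specify them.

```python
def interconnect_interference(requests, transfer_cycles=4):
    """requests: list of (cpu, arrival_cycle) tuples.

    Returns a dict mapping cpu -> cycles spent waiting behind
    requests issued by other CPUs.
    """
    counters = {}
    schedule = []   # (start, end, cpu) for every granted transfer
    free_at = 0     # first cycle at which the channel is idle
    for cpu, arrival in sorted(requests, key=lambda r: r[1]):
        counters.setdefault(cpu, 0)
        start = max(arrival, free_at)
        # Count the waiting cycles overlapped by other CPUs' transfers.
        for s, e, owner in schedule:
            if owner != cpu:
                overlap = min(e, start) - max(s, arrival)
                if overlap > 0:
                    counters[cpu] += overlap
        schedule.append((start, start + transfer_cycles, cpu))
        free_at = start + transfer_cycles
    return counters


# CPU 1 arrives first and holds the channel; CPU 0 arrives one cycle later.
print(interconnect_interference([(1, 0), (0, 1)]))
# Waiting behind one's own request is not counted as interference.
print(interconnect_interference([(0, 0), (0, 1)]))
```

Delay caused by a CPU's own earlier requests would also occur in private mode, which is why only cycles lost to other CPUs are accumulated.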
18
Outline
• Introduction
• Dynamic Interference Estimation Framework
– Shared Cache
– Memory Bus
– On-chip interconnect
• Results
• Summary
19
Relative Estimation Errors
[Figure: Average Relative Error (y-axis, -4 % to 8 %) for the 1C, 2C, and 4C configurations on 4, 8, and 16 cores, shown separately for the Crossbar and Ring interconnects]
20
RMS Error Breakdown
[Figure: Sum of Average Per-Benchmark Per-Unit RMS Error in clock cycles (y-axis, 0 to 100) for the 1C, 2C, and 4C configurations on 4, 8, and 16 cores, shown separately for the Crossbar and Ring interconnects. Legend: Bus Queue, Bus Service, Interconnect Request Queue]
Remaining units contribute less than 2 clock cycles
21
Auxiliary Tag Directory Accuracy
[Figure: Relative Miss Estimate Error (y-axis, -2 % to 2 %) for the 1C, 2C, and 4C configurations on 4, 8, and 16 cores, shown separately for the Crossbar and Ring interconnects]
22
Outline
• Introduction
• Dynamic Interference Estimation Framework
– Shared Cache
– Memory Bus
– On-chip interconnect
• Results
• Summary
23
Summary
• Memory system interference causes unpredictable performance
• DIEF provides
– Accurate private mode latency estimates
– Accurate shared mode latency measurements
• Future opportunities
– Guiding dynamic optimizations
– Guiding OS scheduling decisions
– Debugging and optimization
24
Thank you!
Visit our website: http://research.idi.ntnu.no/multicore/
Questions?
25
Experiment Methodology
• M5 simulator
– Extended with crossbar and ring on-chip interconnect models
– DDR2 memory bus model
• Randomly generated workloads of SPEC2000 benchmarks
– 40 4-core workloads
– 20 8-core workloads
– 10 16-core workloads