Boosting Single-thread Performance in Multi-Core Systems (transcript of isca09.cs.columbia.edu/pres/40.pdf)
Boosting Single-thread Performance in Multi-Core Systems through Fine-Grain
Multi-Threading
Carlos Madriles, Pedro López, Josep M. Codina, Enric Gibert,
Fernando Latorre, Alejandro Martínez, Raúl Martínez and Antonio González
ISCA’09
June 24th
Intel Labs Barcelona
Why Is Single Thread Performance Important?
• Exploiting TLP is more power-efficient than exploiting ILP
– The trend is to increase the number of cores integrated into the same die
– Power and thermal constraints advocate for simpler cores
• However, single thread performance matters
– Sequential and lightly threaded apps
– Sequential regions limit scalability of parallelism (Amdahl’s law)
[Graph: Amdahl's-law speedup for 1 to 64 cores, for serial fractions s = 0, s = 0.1, and s = 0.5, with zero and non-zero parallelization overhead]
Speed-up (c) = 1 / (s + ((1 - s) / c)), where s is the serial fraction and c the number of cores
Schemes to boost single-thread performance on CMPs are crucial
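Plugging a few values into the speed-up formula shows why the serial fraction dominates (a quick numeric illustration, not from the slides):

```python
def amdahl_speedup(s: float, c: int) -> float:
    """Upper bound on speedup with serial fraction s on c cores (zero overhead)."""
    return 1.0 / (s + (1.0 - s) / c)

# Even a small serial fraction caps the achievable speedup:
print(amdahl_speedup(0.0, 64))  # 64.0: perfect scaling only when s = 0
print(amdahl_speedup(0.1, 64))  # ~8.77: 10% serial code wastes most of 64 cores
print(amdahl_speedup(0.5, 64))  # ~1.97: half-serial code barely doubles
```

This is exactly the motivation on the slide: as core counts grow, the sequential part of the program, not the parallel part, limits overall performance.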
Main Research Efforts
• Main research efforts exploit both ILP and TLP to boost single-thread performance on CMPs
– Adaptive and clustered microarchitectures; combining or fusing cores
– Parallelism constrained by the instruction window
– Speculative multithreading (SpMT) techniques
– Traditional schemes constrain the exploitable parallelism
New techniques to exploit parallelism are needed
Our Contribution: Anaphase
• Novel speculative multithreading technique to boost single-thread applications on CMPs
• Effectively exploits ILP, TLP, and MLP
– Thread decomposition performed at instruction granularity
– Adapts to the available parallelism in the application
– Minimizes inter-thread dependences
– Improves workload balance
– Increases the amount of exploitable MLP
• Hw / Sw hybrid scheme
– Low hardware complexity and powerful thread decomposition
• Cost-effective hardware support on top of a conventional CMP
– A novel uncore module supports Anaphase thread execution
– Very few changes to the cores
Outline
• The Anaphase Scheme
– Threading Model
– Decomposition Algorithm
• The Anaphase Hardware Support
– Architecture Overview
– Architectural State Management
• Evaluation
• Conclusions
Anaphase Scheme Overview
[Figure: Sw side: the compiler profiles the serial program, selects hot regions, and performs fine-grain thread decomposition to produce a parallelized program. Hw side: a CMP tile (cores C0 to C3 with L2 and L3 caches) augmented with Anaphase Hw acts as the Anaphase parallelizer]
It behaves like a fused core
Anaphase Scheme Overview
[Figure: same compiler flow and CMP tile as the previous slide]
Main distinguishing features:
1. Threading model
2. Fine-grain thread decomposition
3. Hardware support
It behaves like a fused core
Threading Model: Key Features
• Threads may be composed of non-consecutive instructions
[Figure: original region CFG with basic blocks A {A1, A2, A3 (Br)}, B {B1, B2, B3, B4}, D {D1, D2, D3 (Br)}, C {C1, C2 (Br)}, decomposed into Thread 1 = {A2, B2, B3, D1, D2, C1, C2 (Br)} and Thread 2 = {A1, A3 (Br), B1, B4, D3 (Br)}]
Threading Model: Key Features
• Uses a heuristic to decide how to handle each inter-thread dependence:
– Ignore it (Hw recovers in case of error)
– Communicate it explicitly
– Pre-compute it (p-slice)
[Figure: same region decomposition as the previous slide]
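The slide names three options but not the heuristic itself; as a loose sketch (the cost numbers and decision rule here are hypothetical, not the paper's), a decomposer could estimate a cost for each option per dependence and pick the cheapest:

```python
def handle_dependence(misspec_rate: float, comm_latency: int,
                      slice_size: int, squash_cost: int = 100) -> str:
    """Pick how to handle one inter-thread dependence by estimated cost.
    All costs are in cycles and purely illustrative."""
    costs = {
        "ignore": misspec_rate * squash_cost,  # pay a squash when wrong
        "communicate": comm_latency,           # explicit core-to-core send
        "pre-compute": slice_size,             # replicate the producer slice
    }
    return min(costs, key=costs.get)

# A rarely violated dependence is cheapest to ignore;
# a hot one with a short producer slice is best pre-computed.
print(handle_dependence(misspec_rate=0.001, comm_latency=20, slice_size=5))  # ignore
print(handle_dependence(misspec_rate=0.5, comm_latency=20, slice_size=5))    # pre-compute
```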
Threading Model: Key Features
• Pre-computation replicates those instructions needed to generate the consumed value
[Figure: same decomposition, with A1 replicated as A1' into the thread that consumes its value]
Threading Model: Key Features
• Threads include all required control instructions
– Some branches may need to be replicated
– Keeps the Hw simple (no changes in the front-end)
[Figure: Thread 1 = {A2, B2, B3, D1, D2, C1, C2 (Br), A1', A3 (Br), D3 (Br)}, Thread 2 = {A1, A3 (Br), B1, B4, D3 (Br)}; branches A3 and D3 appear in both threads]
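Pre-computation slices like A1' above can be found with a backward walk over the data-dependence graph. A minimal sketch (the graph representation and the toy region contents are hypothetical):

```python
def p_slice(ddg: dict, value: str) -> set:
    """Return the set of instructions needed to produce `value`, following
    producer edges in a data-dependence graph (dict: instr -> list of producers)."""
    slice_, work = set(), [value]
    while work:
        instr = work.pop()
        if instr not in slice_:
            slice_.add(instr)
            work.extend(ddg.get(instr, []))  # walk backwards to producers
    return slice_

# Toy DDG mirroring the slide's region: branch A3 consumes A1's result.
ddg = {"A3": ["A1"], "D3": ["D1"], "C2": ["C1"]}
print(sorted(p_slice(ddg, "A3")))  # ['A1', 'A3']: A1 must be replicated as A1'
```

Every instruction in the returned slice is replicated into the consuming thread, so the value is recomputed locally instead of communicated.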
Decomposition Algorithm [1]
[Figure: compiler flow (profile, select hot regions, fine-grain thread decomposition), highlighting the decomposition step]
• Multi-level graph partitioning
– Coarsening
– First partition of the region DDG
– Based on: critical path, independence, workload balance, and delinquent loads
– Refinement
– Evolution of the traditional K-L (Kernighan-Lin) algorithm
– Refines the first partition based on a model that estimates execution time
[1] C. Madriles, P. López, J.M. Codina, E. Gibert, F. Latorre, A. Martínez, R. Martínez, A. González, “Anaphase: A Fine-Grain Thread Decomposition Scheme for Speculative Multithreading”, to be published in PACT’09.
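The slide only names the phases; as a toy sketch of the refinement idea, here is a greedy K-L-style pass driven by a cost model. The cost model below is a hypothetical stand-in for the paper's execution-time estimator, covering just two of the criteria named on the slide (workload balance and inter-thread dependences):

```python
def est_time(t0: set, t1: set, work: dict, deps: list) -> int:
    """Hypothetical cost model: the heavier thread's work plus a fixed
    penalty per dependence crossing the thread boundary."""
    comm = sum(1 for a, b in deps if (a in t0) != (b in t0))
    return max(sum(work[i] for i in t0), sum(work[i] for i in t1)) + 10 * comm

def refine(t0: set, t1: set, work: dict, deps: list):
    """Greedy K-L-style refinement: keep applying single-instruction moves
    between threads while they lower the estimated execution time."""
    t0, t1 = set(t0), set(t1)
    improved = True
    while improved:
        improved = False
        for n in list(t0 | t1):
            src, dst = (t0, t1) if n in t0 else (t1, t0)
            if len(src) == 1:          # never empty a thread
                continue
            before = est_time(t0, t1, work, deps)
            src.remove(n); dst.add(n)  # tentatively move n
            if est_time(t0, t1, work, deps) < before:
                improved = True        # keep the move
            else:
                dst.remove(n); src.add(n)  # revert it
    return t0, t1

work = {"a": 4, "b": 4, "c": 1, "d": 1}
t0, t1 = refine({"a", "b"}, {"c", "d"}, work, deps=[])
print(est_time(t0, t1, work, []))            # 5: load gets balanced
t0, t1 = refine({"a", "b"}, {"c", "d"}, work, deps=[("a", "b")])
print(est_time(t0, t1, work, [("a", "b")]))  # 8: the dependence pins a and b together
```

The real algorithm operates on a coarsened DDG and a far richer model (critical path, delinquent loads), but the move-evaluate-keep loop is the K-L essence.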
Outline
• The Anaphase Scheme
– Threading Model
– Decomposition Algorithm
• The Anaphase Hardware Support
– Architecture Overview
– Architectural State Management
• Evaluation
• Conclusions
Architecture Overview
[Figure: two tiles connected to memory; each tile contains two cores (each with its IC and DC), a shared L2, and an ICMC]
About 7% increase in area on an Intel® Core™ 2 Duo
Anaphase Execution
[Figure: tile in single-core mode: one core executes the sequential program, instructions 0 through 7]
Anaphase Execution
[Figure: tile in cooperative mode: Core 0 runs Thread 1 (instructions 0, 2, 4, 6) and Core 1 runs Thread 2 (instructions 1, 3, 5, 7)]
Local retire in thread order
Anaphase Execution
[Figure: tile in cooperative mode; instructions retired locally by each core reach the ICMC]
Global commit in program order
Anaphase Execution
[Figure: cooperative execution continues; later instructions (4 through 7) retire locally and commit globally]
Anaphase Execution
[Figure: tile in cooperative mode; during global commit the ICMC detects an inter-core memory violation]
Inter-core memory violation!
Anaphase Execution
[Figure: after the violation, the tile falls back to single-core mode and re-executes the region sequentially, instructions 0 through 7]
Architectural State Management
• Memory
– Speculative state kept private in the L1 and L2 caches
– Merge performed in program order in the L2 cache
– Inter-core memory violations detected at chunk granularity
– Supports frequent memory checkpoints and very fast squash and commit operations
• Registers
– Speculative state kept in the core PRF
– Register architectural state merged in the ICMC in program order
– Supports frequent register checkpoints
– Reduces pressure on the PRF
– Allows a core to make forward progress
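The slides don't detail the violation check; one common way to detect conflicts at chunk granularity (a sketch of the general idea, not the paper's exact mechanism) is to intersect the read and write address sets of chunks executed on different cores:

```python
def chunks_conflict(chunk_a: dict, chunk_b: dict) -> bool:
    """Two chunks from different cores conflict if one wrote an address the
    other read or wrote (cross-core RAW/WAW/WAR). Chunks are dicts with
    'reads' and 'writes' address sets; addresses could be cache-line IDs."""
    return bool(chunk_a["writes"] & (chunk_b["reads"] | chunk_b["writes"]) or
                chunk_b["writes"] & chunk_a["reads"])

c0 = {"reads": {0x40, 0x80}, "writes": {0xC0}}
c1 = {"reads": {0xC0}, "writes": {0x100}}  # reads a line c0 wrote
print(chunks_conflict(c0, c1))  # True: would trigger a squash and the fall-back to single-core mode
```

Tracking whole chunks rather than individual accesses keeps the hardware cheap, at the price of occasional false conflicts when two chunks touch the same line without a true dependence.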
Outline
• The Anaphase Scheme
– Threading Model
– Decomposition Algorithm
• The Anaphase Hardware Support
– Architecture Overview
– Architectural State Management
• Evaluation
• Conclusions
Experimental Framework
• Compiler
– Anaphase partitioning implemented on top of the Intel® Compiler (icc)
• Simulator
– Cycle-accurate x86 CMP (1 tile with 2 OoO cores modeled)
– Two different core configurations evaluated
– Intel® Core™ 2-like (Medium core)
– Half-sized main structures (Tiny core)
– Core Fusion [1] proposal implemented for comparison
• Benchmarks: Spec2006
– Profiling: PinPoint 20M traces using the train input set
– Evaluation: 100M traces using the ref input set
[1] E. Ipek, M. Kirman, N. Kirman, and J.F. Martinez, "Core Fusion: Accommodating Software Diversity in Chip Multiprocessors", in Proc. of the Int. Symp. on Computer Architecture, 2007.
Performance Evaluation
[Figure: speedup (0.9 to 2.0 scale) of CoreFusion and Anaphase for Spec2006. SpecFP: bwaves, gamess, GemsFDTD, lbm, leslie3d, milc, namd, povray, soplex, sphinx3, tonto, wrf, plus the SpecFP average; SpecINT: astar, bzip2, gcc, gobmk, h264ref, hmmer, libquantum, mcf, omnetpp, perlbench, sjeng, zeusmp, plus the SpecINT and overall averages. Two Anaphase bars are clipped at 2.67 and 2.27; callout: MLP exploited by Anaphase]
Results shown for the Tiny core configuration
AVG speedup: Anaphase = 41%, CoreFusion = 28%
Dynamic Instruction Breakdown
[Figure: per-benchmark breakdown of dynamic instructions (%) into Serial, Parallel, Replicated, Communications, and Squashed for the Spec2006 benchmarks, plus the average]
• High coverage: less than 10% serial code
• Low overhead: 22% replicated instructions and explicit communications
• Few explicit communications: only 3%
• High accuracy: less than 1% squashed instructions
Conclusions
• Anaphase is a Sw / Hw technique that leverages multiple cores to boost single-thread code
• It exploits ILP, TLP, and MLP with high accuracy and low overheads
• Low hardware overhead (about 7% extra area)
• Important benefits: 41% average speedup for Spec2006
Thank you!