University of Michigan, Electrical Engineering and Computer Science
Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-Thread Applications
Hongtao Zhong, Steven A. Lieberman, and Scott A. Mahlke
Advanced Computer Architecture Laboratory
University of Michigan
Multicore Architectures
• Multicore becomes a trend
– Intel Core Duo, 2005
– Intel Core Quad, 2006
– Sun T1, 8 cores, 2005
– 16 – 32 cores, near future
• Need for simpler cores
– Power density
– Cooling costs
• Multiple cores on a chip
– High throughput
– Good for multithreaded apps
[Figure: block diagrams of multicore chips, including the 8 core Sun T1 processor, with cores connected to L2 cache banks through an interconnect]
How About Single Thread Applications?
[Figure: single thread performance, Core Duo vs. Pentium M (same cache, same platform). Source: Mendelson et al., Intel Technology Journal, Vol. 10, Issue 02, 2006]
Objective of this Work
• Automatically accelerate single thread applications on multicore systems
– Exploit irregular parallelism across cores (hybrid parallelism)
  • Instruction level parallelism (ILP)
  • Fine-grain thread level parallelism (TLP)
  • Loop level parallelism (LLP)
– Adaptive architecture
  • Configure resources to exploit available parallelism
  • Dynamic adaptability
Approach
• Voltron: Hardware/software approach
– Architecture mechanisms
  • Dual mode execution (coupled, decoupled)
  • Flexible inter-core communication
  • Fast thread spawning
  • Efficient memory ordering
  • High rate-of-return speculation
– Compiler techniques
  • Compiler controlled distributed branch
  • Fine-grain thread extraction
  • Speculative loop parallelization with recovery
Parallelism Type 1: ILP
[Figure: dataflow graph of a code region built from loads (L), stores (S), arithmetic, shift, compare, and logic operations, ending in a branch (br)]
Parallelism Type 1: ILP
• Emulate VLIW
– Low latency communication
– Compiler controlled distributed branch
– Lockstep execution
[Figure: the same dataflow graph partitioned across Core 0, Core 1, Core 2, and Core 3, with a branch (br) on every core]
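The cost of lockstep execution can be illustrated with a small cycle count. This is only a sketch under stated assumptions: the latency table is invented for illustration, and the model assumes that in coupled mode every core waits for the slowest operation in each issue slot, which is how lockstep execution is usually understood. The function names are hypothetical.

```python
def coupled_cycles(latencies):
    """Lockstep total: each issue slot lasts as long as its slowest core."""
    return sum(max(slot) for slot in latencies)


def independent_cycles(latencies):
    """Per-core totals if no core ever waited for another (a lower bound)."""
    return [sum(core) for core in zip(*latencies)]


# Hypothetical latencies in cycles: rows are issue slots, columns are cores 0-3.
latencies = [
    [1, 1, 1, 1],
    [1, 10, 1, 1],   # assumed: core 1 misses in its L1 cache
    [1, 1, 10, 1],   # assumed: core 2 misses in its L1 cache
]

print(coupled_cycles(latencies))      # 21
print(independent_cycles(latencies))  # [3, 12, 12, 3]
```

With uniform latencies the two numbers coincide, so lockstep costs nothing; the gap opens only when individual cores stall at different times.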
Voltron Architecture for ILP
[Figure: Core 0 through Core 3 connected by a stall bus and a br bus, backed by a banked L2 cache. Each core contains instruction fetch/decode, an L1 instruction cache, an L1 data cache, register files (GPR, FPR, PR, BTR), and function units (FU, Mem FU, Comm FU), with links to the north and west]
Experimental Setup
• Trimaran Toolset
• Simulator
– Multiple cores, multiple instruction streams
– Inter-core communication
– MOESI coherence protocol
• Configuration
– 1 ALU, 1 memory unit, 1 communication unit per core
– 1 cycle inter-core move latency per hop
– 4KB L1 I-cache, 4KB L1 D-cache per core
– 128KB shared L2 cache
– Single core baseline
• 25 benchmarks from SpecInt, SpecFP, and MediaBench
ILP Speedup
[Figure: bar chart of ILP speedup (y-axis 1 to 2.2) for 2 core and 4 core configurations, per benchmark. SpecInt: 132.ijpeg, 164.gzip, 175.vpr, 197.parser, 255.vortex, 256.bzip2. SpecFP: 052.alvinn, 056.ear, 171.swim, 172.mgrid, 177.mesa, 179.art, 183.equake. MediaBench: cjpeg, djpeg, epic, g721decode, g721encode, gsmdecode, gsmencode, mpeg2dec, mpeg2enc, rawcaudio, rawdaudio, unepic. Last group: average]
Achieves more than 80% of the performance of a wide VLIW with the same resources.
Parallelism Type 2: Fine-grain TLP
[Figure: code region with blocks A (and A'), B, C, D, and E containing loads (ld) and stores (st), split into fine-grain threads across Core 0 and Core 1]
• Fine-grain threads
– Few instructions
– Scalar communication
– Shared stack frame
• Decoupled execution
– Different control flow
– Asynchronous communication
• Fast thread spawning
• Efficient memory ordering
• Compiler algorithm
– Memory dependences
– Load balance
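Decoupled execution with asynchronous scalar communication can be mimicked in software with two threads and a queue. The sketch below is an analogy, not the hardware mechanism: Python threads stand in for cores, `queue.Queue` stands in for a receive queue, and the `None` end-of-stream marker and the function name `run_decoupled` are assumptions made for the example.

```python
import queue
import threading


def run_decoupled(values):
    """Two 'cores' with different control flow, linked by one scalar queue."""
    chan = queue.Queue()   # stands in for core 1's receive queue
    results = []

    def core0():
        for v in values:
            if v % 2 == 0:        # control flow private to core 0
                chan.put(v * v)   # SEND one scalar
        chan.put(None)            # assumed end-of-stream marker

    def core1():
        total = 0
        while True:
            x = chan.get()        # RECV blocks until a value arrives
            if x is None:
                break
            total += x
        results.append(total)

    threads = [threading.Thread(target=core0), threading.Thread(target=core1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results[0]


print(run_decoupled(range(6)))  # 0*0 + 2*2 + 4*4 = 20
```

Neither thread knows how far the other has progressed; the only synchronization is the blocking receive, which is what lets the two sides follow different branches.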
Voltron for Fine-grain TLP
[Figure: the four core organization with banked L2 cache, with the Comm FU of one core expanded: direct mode bypass paths to the west and north, routing logic, a send queue, and a receive queue, connected to the register file]
Dual Mode Network
• Coupled mode
– Direct bypass [Multiflow]
– Coupled execution
– Latency: num_hops cycles (1 cycle minimum)
• Decoupled mode
– Message queues [RAW]
– SEND / RECV
– Decoupled execution
– Latency: 2 + num_hops cycles (3 cycle minimum)
– Fast fine-grain thread spawning
– Enforce operation ordering
[Figure: Comm FU of each core with direct mode bypass paths to the west and north, routing logic, send queue, and receive queue, connected to the register file]
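The two latency formulas above can be worked through for a small example. This sketch assumes the four cores sit in a 2x2 grid and that num_hops is the Manhattan distance between them; the placement table `CORES` and the function names are illustrative assumptions, while the `num_hops` and `2 + num_hops` formulas come from the slide.

```python
# Assumed 2x2 placement of the four cores as (row, column) positions.
CORES = {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1)}


def num_hops(src, dst):
    """Manhattan distance between two cores in the assumed grid."""
    (r0, c0), (r1, c1) = CORES[src], CORES[dst]
    return abs(r0 - r1) + abs(c0 - c1)


def coupled_latency(src, dst):
    """Direct bypass: one cycle per hop."""
    return num_hops(src, dst)


def decoupled_latency(src, dst):
    """Message queues: two extra cycles on top of the hop count."""
    return 2 + num_hops(src, dst)


for dst in (1, 3):
    print(dst, coupled_latency(0, dst), decoupled_latency(0, dst))
# 1 1 3
# 3 2 4
```

Neighboring cores give the minimum latencies quoted on the slide (1 and 3 cycles); the diagonal pair costs one more hop in either mode.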
Fine-grain TLP Speedup
[Figure: bar chart of fine-grain TLP speedup (y-axis 1 to 1.8) for 2 core and 4 core configurations, per benchmark. SpecInt: 132.ijpeg, 164.gzip, 175.vpr, 197.parser, 255.vortex, 256.bzip2. SpecFP: 052.alvinn, 056.ear, 171.swim, 172.mgrid, 177.mesa, 179.art, 183.equake. MediaBench: cjpeg, djpeg, epic, g721decode, g721encode, gsmdecode, gsmencode, mpeg2dec, mpeg2enc, rawcaudio, rawdaudio, unepic. Last group: average]
Works better for memory intensive applications (nine benchmarks are marked with * in the chart).
Parallelism Type 3 : LLP
• DOALL loops
– No cross-iteration dependences
– Iterations can execute in parallel
– Absence of memory dependences is hard to prove
• Statistical DOALL
– Profile memory dependences
– Speculatively parallelize
– Detect violation and rollback
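[Figure: execution timeline. Core 0 and core 1 each run init, their share of the iterations (iter 0-3 and iter 4-7), and finalize. On an unexpected dependence the cores reset and, after a restart, iter 0-7 are executed again]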
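The profiling step can be sketched as a check over recorded memory accesses. This is a minimal illustration, assuming a profile trace is a list with one (reads, writes) pair of address sets per iteration and that a loop qualifies only when no profiled run shows a cross-iteration dependence; the trace format and all function names are hypothetical, not taken from the Voltron compiler.

```python
def has_cross_iteration_dependence(trace):
    """trace: one (reads, writes) pair of address sets per loop iteration."""
    seen_reads, seen_writes = set(), set()
    for reads, writes in trace:
        # Read-after-write, write-after-read, or write-after-write
        # against any earlier iteration counts as a dependence.
        if reads & seen_writes or writes & (seen_reads | seen_writes):
            return True
        seen_reads |= reads
        seen_writes |= writes
    return False


def is_statistical_doall(profile_traces):
    """Assumed rule: no profiled run may show a cross-iteration dependence."""
    return not any(has_cross_iteration_dependence(t) for t in profile_traces)


def split_iterations(n_iters, n_cores):
    """Contiguous chunks, e.g. 8 iterations on 2 cores give 0-3 and 4-7."""
    if n_iters == 0:
        return []
    chunk = -(-n_iters // n_cores)  # ceiling division
    return [range(lo, min(lo + chunk, n_iters)) for lo in range(0, n_iters, chunk)]


# a[i] = b[i] + 1: every iteration touches its own elements.
independent = [({("b", i)}, {("a", i)}) for i in range(8)]
# a[i] = a[i-1] + 1: every iteration reads what the previous one wrote.
chained = [({("a", i - 1)}, {("a", i)}) for i in range(1, 8)]

print(is_statistical_doall([independent]))           # True
print(is_statistical_doall([independent, chained]))  # False
print(split_iterations(8, 2))                        # [range(0, 4), range(4, 8)]
```

A clean profile is evidence, not proof, which is why the parallel version still has to run speculatively with violation detection.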
Voltron for LLP
[Figure: the four core organization with banked L2 cache, with the L1 D-cache extended with transactional memory support. Each cache line holds a T bit, tag, state, and data]
• Detect memory dependence violation
• Roll back memory state
• Compiler rolls back register state
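Violation detection and memory rollback can be modeled in software with per-core write buffers and read sets. This is a sketch of the general idea only: the class `SpeculativeMemory`, its methods, and the in-order commit rule are assumptions chosen for illustration, not the transactional cache design on the slide, and register rollback (handled by the compiler) is not modeled.

```python
class SpeculativeMemory:
    """Buffers speculative stores per core and tracks speculative loads."""

    def __init__(self, memory):
        self.memory = memory     # committed state: dict of address -> value
        self.buffers = {}        # core -> {address: speculative value}
        self.read_sets = {}      # core -> addresses loaded speculatively

    def begin(self, core):
        self.buffers[core] = {}
        self.read_sets[core] = set()

    def load(self, core, addr):
        self.read_sets[core].add(addr)
        buf = self.buffers[core]
        return buf[addr] if addr in buf else self.memory[addr]

    def store(self, core, addr, value):
        self.buffers[core][addr] = value

    def commit_in_order(self, cores):
        """Commit all chunks, or discard everything on a violation."""
        written = set()
        ok = True
        for core in cores:
            # A later chunk read an address that an earlier chunk wrote,
            # so it may have used a stale value.
            if self.read_sets[core] & written:
                ok = False
                break
            written |= set(self.buffers[core])
        if ok:
            for core in cores:
                self.memory.update(self.buffers[core])
        self.buffers.clear()     # rollback is just dropping the buffers
        self.read_sets.clear()
        return ok


def run_loop(memory, body, chunks):
    """Run each chunk on its own 'core'; re-run serially after a violation."""
    spec = SpeculativeMemory(memory)
    for core, iters in enumerate(chunks):
        spec.begin(core)
        for i in iters:
            body(i,
                 lambda a, c=core: spec.load(c, a),
                 lambda a, v, c=core: spec.store(c, a, v))
    if spec.commit_in_order(range(len(chunks))):
        return "committed"
    for iters in chunks:         # restart: execute all iterations in order
        for i in iters:
            body(i, memory.__getitem__, memory.__setitem__)
    return "restarted"


def chained_body(i, load, store):
    if i > 0:
        store(("a", i), load(("a", i - 1)) + 1)   # a[i] = a[i-1] + 1


mem = {("a", i): 0 for i in range(8)}
print(run_loop(mem, chained_body, [range(0, 4), range(4, 8)]))  # restarted
print(mem[("a", 7)])                                            # 7
```

In the example the second chunk reads `a[3]` before the first chunk's store to it is visible, so the commit check fails, the buffered stores are dropped, and the serial re-execution produces the correct result.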
LLP Speedup
[Figure: bar chart of LLP speedup (y-axis 1 to 2) for 2 core and 4 core configurations, per benchmark. SpecInt: 132.ijpeg, 164.gzip, 175.vpr, 197.parser, 255.vortex, 256.bzip2. SpecFP: 052.alvinn, 056.ear, 171.swim, 172.mgrid, 177.mesa, 179.art, 183.equake. MediaBench: cjpeg, djpeg, epic, g721decode, g721encode, gsmdecode, gsmencode, mpeg2dec, mpeg2enc, rawcaudio, rawdaudio, unepic. Last group: average. Three bars exceed the axis and are labeled 3.12, 3.03, and 3.03]
Accelerate non-provable DOALL and small loops
Speedup for Hybrid Execution
[Figure: bar chart of hybrid execution speedup (y-axis 1 to 3.5) for 2 core and 4 core configurations, per benchmark. SpecInt: 132.ijpeg, 164.gzip, 175.vpr, 197.parser, 255.vortex, 256.bzip2. SpecFP: 052.alvinn, 056.ear, 171.swim, 172.mgrid, 177.mesa, 179.art, 183.equake. MediaBench: cjpeg, djpeg, epic, g721decode, g721encode, gsmdecode, gsmencode, mpeg2dec, mpeg2enc, rawcaudio, rawdaudio, unepic. Last group: average]
• 2 core average – ILP: 1.23, TLP: 1.16, LLP: 1.17, Hybrid: 1.46
• 4 core average – ILP: 1.33, TLP: 1.23, LLP: 1.37, Hybrid: 1.83
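The averages above can be put in perspective with two derived numbers: how much hybrid execution gains over the best single technique, and the speedup per core. The input values are the ones on the slide; the derived metrics and the helper name `summarize` are additions for illustration.

```python
# Average speedups as reported on the slide, keyed by core count.
averages = {
    2: {"ILP": 1.23, "TLP": 1.16, "LLP": 1.17, "Hybrid": 1.46},
    4: {"ILP": 1.33, "TLP": 1.23, "LLP": 1.37, "Hybrid": 1.83},
}


def summarize(cores, row):
    """Derive hybrid gain over the best single technique and per-core speedup."""
    best_single = max(v for k, v in row.items() if k != "Hybrid")
    return {
        "best_single": best_single,
        "gain_over_best_single": row["Hybrid"] / best_single,
        "speedup_per_core": row["Hybrid"] / cores,
    }


for cores, row in averages.items():
    s = summarize(cores, row)
    print(cores, s["best_single"],
          f"{s['gain_over_best_single']:.3f}", f"{s['speedup_per_core']:.3f}")
```

Hybrid execution comes out roughly 19% ahead of the best single technique on 2 cores (1.46 / 1.23) and roughly 34% ahead on 4 cores (1.83 / 1.37).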
Time Breakdown
[Figure: stacked bar chart of the share of execution time (0% to 100%) spent in coupled and decoupled mode, per benchmark. SpecInt: 132.ijpeg, 164.gzip, 175.vpr, 197.parser, 255.vortex, 256.bzip2. SpecFP: 052.alvinn, 056.ear, 171.swim, 172.mgrid, 177.mesa, 179.art, 183.equake. MediaBench: cjpeg, djpeg, epic, g721decode, g721encode, gsmdecode, gsmencode, mpeg2dec, mpeg2enc, rawcaudio, rawdaudio, unepic. Last group: average]
Both coupled and decoupled modes are necessary.
Conclusions and Future Work
• Voltron
– Adaptive multicore system
– Accelerate single thread applications
– Exploit ILP, fine-grain TLP and statistical LLP
  • Coupled and decoupled execution
  • Dual-mode operand network
  • Compiler managed loop speculation
– Hybrid parallelism combines the benefits
• Future work
– Fine-grain thread identification
– Virtualization of resources