David Stephenson Timothy Kong Dee Lee Fred Chow Guang R. Gao
Transcript of David Stephenson Timothy Kong Dee Lee Fred Chow Guang R. Gao
•Juergen Ributzka•David Stephenson•Timothy Kong•Dee Lee•Fred Chow•Guang R. Gao
Overview
• SiCortex Multiprocessor
• Software Pipelining Framework
• Results
• Future Work
3/23/2009 Copyright © 2009 - Juergen Ributzka. All rights reserved. 2
SiCortex Multiprocessor
• RISC
• 6 cores (MIPS 5KF)
• 500 – 700 MHz
• In-Order Execution
• Limited-Dual Issue
• Low Power (12 – 16 Watt)
3/23/2009 Copyright © 2009 - Juergen Ributzka. All rights reserved. 3
Software Pipelining Framework
3/23/2009 Copyright © 2009 - Juergen Ributzka. All rights reserved. 4
DDG
MII
MS
MVE
RA
CG
Front-End
Middle-End
Back-End
FORTRAN C/C++ Java
Loop NestOptimizer
GlobalOptimizer
CodeGenerator
Data Dependence Graph (DDG)
• Cyclic Data Dependence Graph
• No register anti- and output-dependencies
• Only single BBs
3/23/2009 Copyright © 2009 - Juergen Ributzka. All rights reserved. 5
DDG
MII
MS
MVE
RA
CG
S1
S2
S1
S2
Minimum Initiation Interval
• Limited by:
– Resource Requirements
– Recurrence Requirements
3/23/2009 Copyright © 2009 - Juergen Ributzka. All rights reserved. 6
DDG
MII
MS
MVE
RA
CG
Modulo Scheduling (MS)
• Huff Modulo Scheduling
– Lifetime sensitive
– Uses backtracking
• Hyper Node Reduction Scheduling
– Lifetime sensitive
– No backtracking
3/23/2009 Copyright © 2009 - Juergen Ributzka. All rights reserved. 7
DDG
MII
MS
MVE
RA
CG
Modulo Variable Expansion (MVE)
• Separate TN set for every iteration
• # of unrollings depends on longest lifetime
3/23/2009 Copyright © 2009 - Juergen Ributzka. All rights reserved. 8
DDG
MII
MS
MVE
RA
CG
TN10 op1 TNXX…TN20 op2 TN10 TNYY
TN11 op1 TNXX…TN21 op2 TN11 TNYY
TN10 op1 TNXX…TN20 op2 TN10 TNYY
Iteration
C
y
c
l
e
Register Allocation (RA)
• Only kernels are fully register allocated
• No regions
– SWP kernels are not a black box for GRA and LRA anymore
• Currently no register spilling support
3/23/2009 Copyright © 2009 - Juergen Ributzka. All rights reserved. 9
DDG
MII
MS
MVE
RA
CG
Code Generation (CG)
• Prologues and epilogues are partially register allocated
• Need several epilogues due to different register sets
3/23/2009 Copyright © 2009 - Juergen Ributzka. All rights reserved. 10
DDG
MII
MS
MVE
RA
CG
P
P0
K0
K1 E0
E1
E2
E
Benchmark Results
3/23/2009 Copyright © 2009 - Juergen Ributzka. All rights reserved. 11
0.9
0.95
1
1.05
1.1
1.15
1.2
BT CG EP FT IS LU MG SP UA
spe
ed
up
NAS Parallel Benchmark
ABI n32
ABI 64
Benchmark Results
3/23/2009 Copyright © 2009 - Juergen Ributzka. All rights reserved. 12
0.95
0.96
0.97
0.98
0.99
1
1.01
1.02
1.03
1.04
1.05
41
0.b
wav
es
43
3.m
ilc
43
4.z
eusm
p
43
5.g
rom
acs
43
6.c
actu
sAD
M
43
7.le
slie
3d
44
4.n
amd
44
7.d
ealII
45
0.s
op
lex
45
3.p
ovr
ay
45
4.c
alcu
lix
45
9.G
emsF
DTD
46
5.t
on
to
47
0.lb
m
48
1.w
rf
48
2.s
ph
inx3
40
0.p
erlb
ench
40
1.b
zip
2
40
3.g
cc
42
9.m
cf
44
5.g
ob
mk
45
6.h
mm
er
45
8.s
jen
g
46
2.li
bq
uan
tum
46
4.h
26
ref
47
1.o
mn
etp
p
47
3.a
star
spe
ed
up
SPEC2006
ABI n32
ABI 64
Conclusion and Future Work
• Spilling support needed to enable more loops for SWP
• Screen out loops with small trip count during runtime to reduce SWP overhead
• Hit-under-miss support to overcome current hardware limitations
3/23/2009 Copyright © 2009 - Juergen Ributzka. All rights reserved. 13
Questions ?
3/23/2009 Copyright © 2009 - Juergen Ributzka. All rights reserved. 14