University of Michigan, Electrical Engineering and Computer Science
FLASH: Foresighted Latency-Aware Scheduling Heuristic for Processors with Customized Datapaths
Manjunath Kudlur, Kevin Fan, Michael Chu, Rajiv Ravindran, Nathan Clark, Scott Mahlke
Advanced Computer Architecture Laboratory
University of Michigan
Introduction
• Bypass network: an important component of the datapath
• Allows for data forwarding to reduce pipeline stalls
• Full bypass: any FU can bypass from any other FU and from any pipeline stage
• Cost of full bypass increases quadratically with the number of FUs
$$\#\,\text{paths} = (\#\,\text{FUs})^2 \times (\#\,\text{bypassable stages}) \times (\#\,\text{input ports per FU}) \times (\#\,\text{output ports per FU})$$
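A minimal sketch of the cost claim, assuming the path count is the product of the four factors listed on the slide (number of FUs squared, bypassable stages, input ports per FU, output ports per FU). The function name and the sample parameter values are illustrative and do not come from the slides.

```python
def full_bypass_paths(num_fus, bypassable_stages, in_ports_per_fu, out_ports_per_fu):
    """Number of bypass paths in a full bypass network (assumed product form)."""
    return num_fus ** 2 * bypassable_stages * in_ports_per_fu * out_ports_per_fu


# Hypothetical machine: 2 bypassable stages, 2 input ports and 1 output port per FU.
paths_by_width = {n: full_bypass_paths(n, 2, 2, 1) for n in (2, 4, 8)}
```

Under these assumed values, each doubling of the FU count quadruples the number of paths, which is the quadratic growth the slide refers to.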
Case for Bypass Customization
• Only a few bypasses are heavily utilized
• The heavily utilized bypasses vary widely from application to application
[Figure: normalized cumulative number of bypasses (y-axis, 0.65 to 1) versus percent utilization (x-axis, 0 to 16) for individual bypass paths, ordered from more to less utilized.]
Customize the bypass network in an application-specific processor by removing under-utilized paths
Implications of Bypass Customization
• Latency depends on
– Which FU the operation is scheduled on
– Which FU the operation's consumer is scheduled on
• Latency of an operation is no longer constant
– Varies per consumer
[Figure: DFG edge A → B mapped onto a datapath showing the execute stage pipeline latch, the memory stage pipeline latch, and the register file. Successive builds show A-to-B latencies of 1 cycle, 2 cycles, and 3 cycles.]
Bypass Customization introduces non-uniform operation latencies
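As a rough illustration of non-uniform latency, the sketch below looks up the producer-to-consumer latency per FU pair and falls back to the register file when no bypass path exists. The bypass table, the FU names, and the 3-cycle fallback are assumptions chosen for the example, not the machine from the slides.

```python
REGISTER_FILE_LATENCY = 3  # assumed cost when no bypass path exists


def operation_latency(producer_fu, consumer_fu, bypass_latency):
    """Latency seen by a consumer, which depends on both FUs involved."""
    return bypass_latency.get((producer_fu, consumer_fu), REGISTER_FILE_LATENCY)


# Hypothetical customized bypass network: only two paths are kept.
bypass_latency = {("A", "A"): 1, ("A", "B"): 2}

latencies = {
    consumer: operation_latency("A", consumer, bypass_latency)
    for consumer in ("A", "B", "C")
}
```

The same producer on FU A is seen with a different latency by each consumer FU, so a single per-operation latency no longer describes the machine.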
Effects on List Scheduler (LS)
• Used widely in many compilation systems
• Assign each operation to a free FU at the earliest time (Greedy!)
• When more than one free FU is available, pick one arbitrarily

WHILE (Readylist is non-empty) DO
    op ← Next unscheduled operation in priority order ;
    stime ← Earliest time when op can be scheduled ;
    WHILE (no free resource available to execute op at stime) DO
        stime ← stime + 1 ;
    END
    res ← Free resource capable of executing op ;
    schedule (op, res, stime) ;
END
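The loop above can be sketched in Python as follows. This is only an illustration under simplifying assumptions: every FU can execute every operation, all latencies are uniform, operations arrive already sorted by priority (and topologically), and the DFG edges into ops 5 and 6 are made up, since the slides do not spell them out.

```python
def list_schedule(ops, preds, fus, latency=1):
    """Greedy list scheduling; returns {op: (fu, start_time)}."""
    schedule = {}
    busy = set()  # (fu, time) slots that are already occupied
    for op in ops:
        # Earliest time at which all operands of op are available.
        stime = max((schedule[p][1] + latency for p in preds.get(op, ())), default=0)
        # Advance until at least one FU is free at stime.
        while all((fu, stime) in busy for fu in fus):
            stime += 1
        # Arbitrary choice: take the first free FU.
        res = next(fu for fu in fus if (fu, stime) not in busy)
        busy.add((res, stime))
        schedule[op] = (res, stime)
    return schedule


# Hypothetical DFG: op 1 feeds ops 2, 3, 4; the edges into 5 and 6 are assumed.
preds = {2: [1], 3: [1], 4: [1], 5: [2, 3], 6: [3, 4]}
schedule = list_schedule([1, 2, 3, 4, 5, 6], preds, ["A", "B", "C"])
schedule_length = max(t for _, t in schedule.values()) + 1
```

Because the FU is chosen without looking at where consumers will run, this scheduler is only safe when every FU pair has the same forwarding latency.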
LS on Full Bypass Machine
Operations have 1-cycle latency. Machine with full bypass network and FUs A, B, C.
[Figure: DFG with op 1 in the first row, ops 2, 3, 4 in the second row, and ops 5, 6 in the third row.]

Cycle   Operations issued (on FUs A, B, C)
0       1
1       2 3 4
2       5 6

Schedule length = 3 cycles
Choice of FU does not affect schedule length in a machine with full bypass.
LS on Partial Bypass Machine
Operations have 1-cycle latency. Assume 3 cycles are required to transmit a value via the register file if a bypass path does not exist.
[Figure: the same DFG (op 1; ops 2, 3, 4; ops 5, 6) scheduled on FUs A, B, C with a partial bypass network. Successive builds show the schedules that result from different FU choices.]

Schedule 1
Cycle   Operations issued
0       1
1       2
2       (none)
3       3 4
4       6 5
Schedule length = 5 cycles

Schedule 2
Cycle   Operations issued
0       1
1       2 3 4
2       6 5
Schedule length = 3 cycles

Schedule 3
Cycle   Operations issued
0       1
1       2 3
2       (none)
3       4
4       5 6
Schedule length = 5 cycles

Choice of FU drastically affects schedule length in a machine with partial bypass. Arbitrary choice is no good!
Greediness of LS
Operations have 1-cycle latency. Assume 3 cycles are required to transmit a value via the register file if a bypass path does not exist.
[Figure: partial DFG with op 1 above ops 2, 3, 4, scheduled on FUs A, B, C over cycles i to i+4.]
Consider scheduling op 1 at its earliest time.
• Greedy, first FU choice: op 1 at cycle i+1, op 2 at i+2, ops 3 and 4 at i+4
– Greedily scheduling op 1 at cycle i+1 delays ops 3 and 4
• Greedy, second FU choice: op 1 at cycle i+1, ops 2 and 3 at i+2, op 4 at i+4
– Greedily scheduling op 1 at cycle i+1 delays op 4
• Delayed: op 1 at cycle i+2 (delayed 1 cycle), ops 2, 3, and 4 at i+3
Delaying ops could improve the schedule. Being greedy is no good!
FLASH: Goals
• Keep the list scheduling framework; it is fast and widely used
• Effectively deal with non-uniform latencies
– Intelligently select from among multiple co-equal choices
– Avoid greedy choices by delaying schedule slots
Observation I
Consider FU choices for operation A:
• An FU with a low-latency path to a consumer FU is good
– Thus, the consumer operation won't be delayed
• The same observation extends to the consumer's consumer, and so on
[Figure: dependence chain A, B, C shown in successive builds. An FU choice for A that causes a 3-cycle delay to B is marked "No Good!"; a choice with no delay to B is marked "Good!"; a choice with no delay at one step but a 3-cycle delay at the other is marked "Good ???"; a choice with no delay at either step is marked "Better!".]
An FU which does not delay the consumers is a good choice
Observation II
Consider FU choices for operation A:
• All consumers are not equal
• It's better to delay a non-critical consumer
• Criticality ∝ 1 / SLACK
[Figure: DFG with operations A, B, C, D, in which one dependence chain of A has slack 1 and the other has slack 0. A choice with no delay on one chain and a 3-cycle delay on the other is marked "Good ???"; a choice that places the 3-cycle delay off the critical chain and leaves the rest undelayed is marked "Better!".]
An FU which does not delay a critical chain of consumers is a good choice
The FLASH Technique
• Compute the merit (FLASH_RANK) of each FU choice for an operation
• FLASH_RANK: weighted estimate of the schedule lengths of the dependence chains of an operation
• Schedule the operation on the FU with the best FLASH_RANK
• Avoid greediness by delaying the schedule slot, if necessary
$$\text{FLASH\_RANK}(op, FU) = \max_{c} \left( \frac{1}{\text{Slack}(c) + 1} \times \text{Estimated schedule length of } c \right)$$
where c is a dependence chain of op
FLASH_RANK Example
[Figure: DFG with operations A, B, C, D; one dependence chain of A has slack 1 and the other has slack 0.]
$$\text{FLASH\_RANK}(op, FU) = \max_{c} \left( \frac{1}{\text{Slack}(c) + 1} \times \text{Estimated schedule length of } c \right)$$
where c is a dependence chain of op

FLASH_RANK(A, Green FU):
– Chain with slack 1: estimated schedule length 1 (cycle 1), giving 1 / (1 + 1) × 1 = 0.5
– Chain with slack 0: estimated schedule length 4 (cycle 4), giving 1 / (0 + 1) × 4 = 4
– FLASH_RANK(A, Green FU) = MAX(0.5, 4) = 4

FLASH_RANK(A, Yellow FU):
– Chain with slack 1: estimated schedule length 1 (cycle 1), giving 1 / (1 + 1) × 1 = 0.5
– Chain with slack 0: estimated schedule length 2 (cycle 2), giving 1 / (0 + 1) × 2 = 2
– FLASH_RANK(A, Yellow FU) = MAX(0.5, 2) = 2

Choose Yellow FU for op A (rank 2 versus rank 4 for Green)
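The arithmetic of this example can be reproduced with a short sketch. It assumes each dependence chain is summarized by a (slack, estimated schedule length) pair and that the FU with the lowest rank is preferred, as in the example; the function and variable names are illustrative.

```python
def flash_rank(chains):
    """Max over chains of estimated schedule length weighted by 1 / (slack + 1)."""
    return max(length / (slack + 1) for slack, length in chains)


# (slack, estimated schedule length) per dependence chain of op A, per FU choice.
chains_by_fu = {
    "Green": [(1, 1), (0, 4)],
    "Yellow": [(1, 1), (0, 2)],
}

ranks = {fu: flash_rank(chains) for fu, chains in chains_by_fu.items()}
best_fu = min(ranks, key=ranks.get)  # lower rank means a shorter weighted estimate
```

Dividing by slack + 1 shrinks the contribution of non-critical chains, so a delay on the slack-1 chain costs less in the rank than the same delay on the slack-0 chain.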
Some Practical Considerations
• Impractical to estimate the schedule length of an entire dependence chain (a few tens of operations)
– Truncate dependence chains to manageable depths, say 2 or 3 (look-ahead depth)
• Impractical to calculate the schedule lengths of all dependence chains together
– Many dependence chains originate from an operation
– Consider dependence chains independently
– Ignore resource constraints between dependence chains
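One way to picture the truncation is a bounded walk over an operation's consumers, where each resulting chain is then evaluated on its own. This is a sketch only: the consumer map, its edges, and the helper name are hypothetical, and the slides do not specify how chains are enumerated.

```python
def truncated_chains(op, consumers, depth):
    """Dependence chains starting at op's consumers, at most `depth` ops long."""
    if depth == 0 or not consumers.get(op):
        return [[]]
    chains = []
    for consumer in consumers[op]:
        for tail in truncated_chains(consumer, consumers, depth - 1):
            chains.append([consumer] + tail)
    return chains


# Hypothetical DFG edges: A feeds B and C, and C feeds D.
consumers = {"A": ["B", "C"], "C": ["D"]}

chains_depth_1 = truncated_chains("A", consumers, 1)
chains_depth_2 = truncated_chains("A", consumers, 2)
```

A larger look-ahead depth exposes more of each chain to the estimate at the price of more chains to evaluate per FU choice.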
Experiments
• Implemented in the TRIMARAN compiler framework
• Evaluated on MediaBench and SPECint2000
• Machine is a 9-wide VLIW (4I, 2F, 2M, 1B)
• Application-specific bypass network [Fan ’03]
– 30% of the cost of a full bypass network
Comparisons
• Baseline is the performance achieved by the traditional list scheduler
• Global Resource Preference (GRP) algorithm [Fan ’03]
– Global pre-scheduling phase assigns FU preferences to operations based on Bottom-Up Greedy (BUG) schedule estimates
– List scheduler uses these preferences as hints while scheduling
FLASH vs. GRP
[Figure: bar chart of speedup (%) per benchmark, y-axis 0 to 50, with series GRP, LA1, LA2, and LA3. Benchmarks: 164.gzip, 175.vpr, 181.mcf, 256.bzip2, 300.twolf, cjpeg, djpeg, pegwitenc, pegwitdec, g721decode, g721encode, epic, unepic, mpeg2dec, mpeg2enc, pgpencode, gsmdecode, gsmencode, rawcaudio, rawdaudio, and Average.]
Bypass Utilization
[Figure: bar chart of utilization (#bypasses/cycle) per benchmark, y-axis 0 to 3.5, with series Original Utilization and FLASH Utilization. Benchmarks: 164.gzip, 175.vpr, 181.mcf, 256.bzip2, 300.twolf, cjpeg, djpeg, pegwitenc, pegwitdec, g721decode, g721encode, epic, unepic, mpeg2dec, mpeg2enc, pgpdecode, pgpencode, gsmdecode, gsmencode, rawcaudio, rawdaudio, and Harmonic Mean.]
Conclusion
• Developed an effective scheduling heuristic for machines with customized bypass interconnect
– Intelligent FU choice
– Avoid greediness
• Average performance improvement of 25% over baseline
– Bypass paths are utilized better
• Could be applied to other cases of non-uniform latencies