Post on 30-Dec-2015
Un/DoPack: Re-Clustering of Large System-on-Chip Designs with
Interconnect Variation for Low-Cost FPGAs
Marvin Tom*, Xilinx Inc. (marvin.tom@xilinx.com), San Jose, CA, USA
David Leong, University of British Columbia (davel@ece.ubc.ca), Vancouver, BC, Canada
Guy Lemieux, University of British Columbia (lemieux@ece.ubc.ca), Vancouver, BC, Canada
*Work performed at University of British Columbia
Overview
• Introduction, Goals and Motivation
  – Reduce channel width, lower cost, make circuits “routable”
• Benchmark Circuits
  – Varying amount of interconnect variation
• Un/DoPack CAD Tool
  – Iterative channel width reduction by whitespace insertion
• Results
• Conclusion
Mesh-Based FPGA Architecture
• 3 x 3 array: 9 logic blocks, 4 wires per channel, 3*4 = 12 total horizontal tracks
• 4 x 4 array: 16 logic blocks, 4 wires per channel, 4*4 = 16 total horizontal tracks
• Larger FPGAs have more “aggregate” interconnect
[Figure: two mesh FPGAs drawn as grids of logic blocks (L), one with 9 blocks and one with 16]
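The track counts on this slide follow from one multiplication, sketched below. The function name is made up for illustration, and the counting convention (one channel per row of logic blocks) is simply the one the slide uses.

```python
def total_horizontal_tracks(rows: int, wires_per_channel: int) -> int:
    """Aggregate horizontal tracks, counted as on the slide: rows * wires per channel."""
    return rows * wires_per_channel


small = total_horizontal_tracks(rows=3, wires_per_channel=4)  # 9 logic blocks
large = total_horizontal_tracks(rows=4, wires_per_channel=4)  # 16 logic blocks
print(small, large)
```

The channel width per channel is unchanged; only the aggregate grows with the array.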
Motivation: Area of FPGA Devices
[Figure: MCNC circuits mapped onto an FPGA. Routed Channel Width (10 to 90) versus CLB Count (0 to 300) for alu4, apex2, apex4, bigkey, des, diffeq, dsip, elliptic, ex1010, ex5p, frisc, misex3, pdc, s298, s38417, s38584, seq, spla, tseng]
• CLB Count: Number of Layout Tiles
• Routed Channel Width: SIZE of Layout Tile
• Total Layout AREA = SIZE * Number
Motivation: Channel Width Demand
[Figure: MCNC circuits mapped onto an FPGA. Routed Channel Width (10 to 90) versus CLB Count (0 to 300), same MCNC circuits as on the area slide]
• Logic Range (CLB count): User buys bigger device.
• Interconnect Range (channel width): User has no choice!
• Devices built for worst-case channel width (fixed width)
• Interconnect dominates area (>70%)
Goal: Reduce Channel Width
[Figure: MCNC circuits mapped onto an FPGA. Routed Channel Width (10 to 90) versus CLB Count (0 to 300), same MCNC circuits as on the area slide]
• Altera Cyclone
  – Channel width constraint of 80 routing tracks
• Constrained FPGA
  – Channel width constraint of 60 routing tracks
  – Smaller area, lower cost for low-channel-width circuits
• But { apex4, elliptic, frisc, ex1010, spla, pdc } are unroutable...
• Can we make them routable in a Constrained FPGA?
Possible Solution
• Trade-off logic utilization for channel width
  – User can always buy more logic... (not more wires)
[Figure: Routed Channel Width (10 to 90) versus CLB Count (0 to 700) for the MCNC circuits, now including clma; pdc, ex1010, frisc, spla, apex4 and elliptic are labeled separately]
[Figure: FPGA 1 and FPGA 2 drawn as grids of logic blocks (L)]
• Trade-off: CLB count for channel width
• What about area??
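One way to reason about the area question is the relation from the motivation slides, total layout area = tile size * number of tiles. The sketch below is only an illustration under stated assumptions: the tile is modeled as a fixed logic part plus an interconnect part that grows linearly with channel width, calibrated so that interconnect is 70% of the tile at a hypothetical baseline width of 80 tracks. The linear model, the function names and the CLB counts are not from the talk.

```python
def tile_area(channel_width: float, base_width: float = 80.0,
              interconnect_share: float = 0.70) -> float:
    """Tile area normalized to 1.0 at base_width (hypothetical linear model)."""
    logic = 1.0 - interconnect_share
    return logic + interconnect_share * channel_width / base_width


def layout_area(n_tiles: int, channel_width: float) -> float:
    """Total layout area = tile size * number of tiles."""
    return n_tiles * tile_area(channel_width)


packed = layout_area(n_tiles=100, channel_width=80)        # fully packed, wide channels
depopulated = layout_area(n_tiles=120, channel_width=60)   # 20% more CLBs, narrow channels
print(packed, depopulated)
```

Under this toy model a 20% increase in CLB count is roughly paid for by the drop from 80 to 60 tracks; larger inflation costs area.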
Features and Costs of Two FPGA Families
• Sample Benchmark Circuit
  – 10,000 LEs
  – 150 Routing Tracks
  – No Multipliers
  – 100 K Memory
Altera Device    LEs      Memory      Mult.   Routing   Cost
Cyclone 1C12     12,060     239,616    0       80       $56
Stratix 1S10     10,570     920,448   48      232       $190
Cyclone 1C20     20,060     294,912    0       80       $100
Stratix 1S20     18,460   1,669,249   80      232       $350
• Sample Benchmark Circuit
  – 20,000 LEs
  – 75 Routing Tracks
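The comparison on this slide can be reproduced with a small lookup over the table. This is a minimal sketch: the device numbers are copied from the table, while the `Device` class, the `cheapest_device` helper, the reading of "100 K Memory" as 100,000 bits, and the assumption that the second sample circuit keeps the same memory and multiplier needs are illustrative choices, not statements from the talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Device:
    name: str
    les: int
    memory_bits: int
    multipliers: int
    routing_tracks: int
    cost_usd: int


# Values as listed in the table on this slide.
DEVICES = [
    Device("Cyclone 1C12", 12060, 239616, 0, 80, 56),
    Device("Stratix 1S10", 10570, 920448, 48, 232, 190),
    Device("Cyclone 1C20", 20060, 294912, 0, 80, 100),
    Device("Stratix 1S20", 18460, 1669249, 80, 232, 350),
]


def cheapest_device(les: int, tracks: int, memory_bits: int = 0,
                    multipliers: int = 0) -> Optional[Device]:
    """Return the lowest-cost device that satisfies every requirement, or None."""
    fits = [d for d in DEVICES
            if d.les >= les and d.routing_tracks >= tracks
            and d.memory_bits >= memory_bits and d.multipliers >= multipliers]
    return min(fits, key=lambda d: d.cost_usd) if fits else None


before = cheapest_device(les=10_000, tracks=150, memory_bits=100_000)
after = cheapest_device(les=20_000, tracks=75, memory_bits=100_000)
print(before.name, before.cost_usd)
print(after.name, after.cost_usd)
```

With these inputs the 150-track circuit only fits a Stratix part, while the 75-track version fits a cheaper Cyclone part despite needing twice the LEs.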
GNL Circuit Benchmark Suite
• Create benchmark circuits with variation
  – SoC <==> Randomly integrate/stitch together “IP Blocks”
  – IP Blocks have varied interconnect needs
• Generate Netlist (GNL)
  – Stroobandt @ Ghent University
  – Synthetic benchmark generator
• GNL circuits generated hierarchically
  – Root: # I/Os, # IP blocks
  – Second Level: 20 IP blocks, # LEs, Rent parameter
Rent Linear Interpolation
• 7 benchmark circuits
• Average Rent = 0.62, Stdev Rent = 0 to 0.12
• 240/120 primary inputs/outputs
[Figure: Rent Parameter (0.35 to 0.80) of each IP Block (bigkey, s38584.1, elliptic, diffeq, s298, alu4, misex3, pdc, ex5p, ex1010), one series per benchmark circuit: Stdev000, Stdev002, Stdev004, Stdev006, Stdev008 / meta clone, Stdev010, Stdev012]
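The idea of a family of circuits with the same average Rent parameter but different spread can be sketched as a linear rescaling around the mean. This is an assumed reading of "linear interpolation", not the authors' procedure: the base Rent values below are made up, and population standard deviation is an arbitrary choice.

```python
from statistics import mean, pstdev


def interpolate_rent(base, target_stdev):
    """Rescale Rent values around their mean so the population stdev equals target_stdev."""
    mu = mean(base)
    sigma = pstdev(base)
    if sigma == 0:
        raise ValueError("base values need some spread to rescale")
    scale = target_stdev / sigma
    return [mu + scale * (p - mu) for p in base]


# Hypothetical per-IP-block Rent parameters with mean 0.62.
base = [0.44, 0.50, 0.56, 0.62, 0.62, 0.68, 0.74, 0.80]

# One set of IP-block Rent parameters per benchmark circuit, stdev 0.00 to 0.12.
suite = {s: interpolate_rent(base, s) for s in (0.00, 0.02, 0.04, 0.06, 0.08, 0.10, 0.12)}
print(len(suite))
```

A target of 0 collapses every block to the mean, which corresponds to a circuit with uniform interconnect demand.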
Un/DoPack Flow
• Iterative non-uniform cluster depopulation tool
• Step 1: Traditional SIS/VPR
• Step 2: UnPack
  – Congestion Calculator
• Step 3: DoPack
  – Incremental Re-Cluster
• Step 4,5: Fast Place/Route
[Flowchart]
• Inputs: Circuit Description, Architecture Description, Channel Width Constraint, Array Size Constraint
• Synthesize and Technology Map (SIS/Flowmap) -> Cluster (iRAC Replica) -> Placement (VPR) -> Routing (VPR)
• Channel Width Constraint Met? Yes: Success! No: Congestion Calculator (UnPack)
• Array Size Limits Reached? Yes: Failure. No: Incremental Cluster (DoPack) -> Fast Placement (Incremental or VPR) -> Fast Routing (VPR)
• Channel Width Constraint Met? Yes: Success! No: back to Congestion Calculator (UnPack)
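The control flow of the flowchart can be sketched as a loop. This is a minimal sketch, not the tool itself: every callback (`full_flow`, `unpack`, `limit_reached`, `dopack`, `fast_pr`) is a hypothetical stand-in for the corresponding box, and the toy design in `run_toy` uses made-up numbers in which each depopulation pass adds one unit of array size and removes 5 tracks of demand.

```python
def un_do_pack(design, cw_limit, full_flow, unpack, limit_reached, dopack, fast_pr):
    """Return True if the channel width constraint is met, False on failure."""
    width = full_flow(design)          # Step 1: synthesize, cluster, place, route
    while width > cw_limit:            # Channel Width Constraint Met?
        region = unpack(design)        # Step 2: congestion calculator
        if limit_reached(design):      # Array Size Limits Reached?
            return False
        dopack(design, region)         # Step 3: incremental re-cluster
        width = fast_pr(design)        # Steps 4, 5: fast place and route
    return True


def run_toy(max_size, cw_limit=60):
    """Drive the loop with a toy design (all numbers are illustrative)."""
    design = {"size": 10, "width": 80}

    def grow(d, region):
        d["size"] += 1
        d["width"] -= 5

    ok = un_do_pack(
        design, cw_limit,
        full_flow=lambda d: d["width"],
        unpack=lambda d: None,
        limit_reached=lambda d: d["size"] >= max_size,
        dopack=grow,
        fast_pr=lambda d: d["width"],
    )
    return ok, design


print(run_toy(max_size=20))
print(run_toy(max_size=12))
```

The first call ends in success once the toy width reaches the limit; the second hits the array size limit first and reports failure.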
Un/DoPack Flow: SIS/VPR
• Step 1: Traditional SIS/VPR
  – Inputs: Circuit Description, Architecture Description, Channel Width Constraint, Array Size Constraint
  – Synthesize and Technology Map (SIS/Flowmap), Cluster (iRAC Replica), Placement (VPR), Routing (VPR)
  – Channel Width Constraint Met? Yes: Success! No: go to Step 2
Un/DoPack Flow: UnPack
• Step 2: UnPack
  – Congestion Calculator
  – Array Size Limits Reached? Yes: Failure. No: go to Step 3
Un/DoPack Flow: UnPack
• Step 2: UnPack
  – Generate Congestion Map
  – CLB Label = largest channel width (CW) occupancy in the 4 adjacent channels
[Figure: two 3-D congestion maps of CLB Label (0 to 120) versus CLB X-Location and CLB Y-Location, the first on a 0 to 50 grid and the second on a 0 to 60 grid]
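The labeling rule, each CLB taking the largest occupancy among its 4 adjacent channels, can be sketched as follows. The data layout is an assumption made for this example: `horiz[j][i]` is the occupancy of the horizontal channel segment below CLB row `j` at column `i` (so there is one more channel row than CLB rows), and `vert[i][j]` is the vertical channel segment left of CLB column `i` at row `j`.

```python
def clb_labels(horiz, vert):
    """Label each CLB with the largest occupancy of its four adjacent channel segments."""
    ny = len(horiz) - 1   # number of CLB rows
    nx = len(vert) - 1    # number of CLB columns
    labels = [[0] * nx for _ in range(ny)]
    for j in range(ny):
        for i in range(nx):
            labels[j][i] = max(
                horiz[j][i],      # channel below
                horiz[j + 1][i],  # channel above
                vert[i][j],       # channel to the left
                vert[i + 1][j],   # channel to the right
            )
    return labels


# Tiny 2 x 2 array with made-up occupancies.
horiz = [[1, 2], [3, 4], [5, 6]]
vert = [[7, 1], [2, 2], [0, 9]]
labels = clb_labels(horiz, vert)
print(labels)
```

The resulting grid is the congestion map that the later steps search for a depopulation center.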
Un/DoPack Flow: UnPack
• Step 2: UnPack
  – Depopulation center = CLB with the largest label
[Figure: M X M Array]
Un/DoPack Flow: UnPack
• Step 2: UnPack
  – Option 1, Coarse Grain:
    • Depop Radius = M/4
    • Depop Amount: 1 new row/col in array
[Figure: M X M Array]
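Picking the depopulation center and the region around it can be sketched like this. It assumes a square M x M label grid, a Euclidean distance test and first-found tie-breaking, none of which the slides specify; the function names and the sample grid are made up.

```python
def depop_center(labels):
    """Return (x, y) of the CLB with the largest label (first one found on ties)."""
    cells = [(x, y) for y in range(len(labels)) for x in range(len(labels[0]))]
    return max(cells, key=lambda c: labels[c[1]][c[0]])


def depop_region(labels, divisor=4):
    """Return all CLBs within radius M/divisor of the depopulation center."""
    m = len(labels)
    radius = m / divisor
    cx, cy = depop_center(labels)
    return [(x, y) for y in range(m) for x in range(m)
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2]


# 8 x 8 array with a single congested CLB at x=2, y=5.
labels = [[0] * 8 for _ in range(8)]
labels[5][2] = 90
center = depop_center(labels)
region = depop_region(labels)   # radius = 8 / 4 = 2
print(center, len(region))
```

Passing a larger `divisor` gives the smaller radii listed for the fine-grain option.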
Un/DoPack Flow: UnPack
• Step 2: UnPack
  – Option 2, Fine Grain:
    • Depop Radius = M/4, M/5, M/6, M/8
    • Depop Amount: 1 new row/col in region
[Figure: M X M Array]
Un/DoPack Flow: DoPack
• Step 3: DoPack
  – Incremental Re-Cluster
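The effect of depopulating a region, the same logic elements spread over more clusters so that each cluster is less full, can be illustrated with a naive redistribution. This is a stand-in for illustration only and not the iRAC-based re-clustering the flow uses: it ignores connectivity and simply cuts the region's LEs into near-equal contiguous chunks. The function name and LE names are hypothetical.

```python
def depopulate(clusters, extra_clbs):
    """Spread the LEs of the given clusters over len(clusters) + extra_clbs clusters."""
    les = [le for cluster in clusters for le in cluster]
    n_new = len(clusters) + extra_clbs
    base, rem = divmod(len(les), n_new)
    out, start = [], 0
    for k in range(n_new):
        size = base + (1 if k < rem else 0)   # first `rem` clusters get one extra LE
        out.append(les[start:start + size])
        start += size
    return out


region = [["a", "b", "c", "d"], ["e", "f", "g", "h"]]   # two full 4-LE clusters
relaxed = depopulate(region, extra_clbs=1)
print(relaxed)
```

A real re-clusterer would choose which LEs move based on the nets they share, since that is what lowers channel demand.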
Un/DoPack Flow: Fast P&R
• Step 4,5: Fast Place/Route
  – Channel Width Constraint Met? Yes: Success! No: repeat from Step 2
• Fast Placement
  – UBC Incremental Placer (under development)
  – VPR -fast
• Fast Router
  – Use illegal PathFinder solution from first iterations
    • Unsuccessful so far
  – Use full routed solution
    • Slow but reliable
Un/DoPack: Baseline Flow
• UnPack: Coarse grained congestion calculator
• DoPack: iRAC replica
• Fast Place: UBC Incremental Placer
• Fast Route: None
• FPGA Architecture:
  – LUT size (k) = 6
  – Cluster size (N) = 16
  – Inputs per cluster (I) = 51
  – Wires of length (L) = 4
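The value I = 51 is consistent with the commonly used rule of thumb I = k(N+1)/2 for cluster inputs. The slide does not say how I was chosen, so treating it as derived from that rule is an assumption; the sketch below only checks the arithmetic.

```python
def cluster_inputs(k: int, n: int) -> int:
    """Rule-of-thumb number of cluster inputs: I = k * (N + 1) / 2."""
    return k * (n + 1) // 2


architecture = {"k": 6, "N": 16, "I": cluster_inputs(6, 16), "L": 4}
print(architecture)
```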
Area of GNL Benchmarks
[Figure: Normalized Area (0.90 to 2.00) versus % of Maximum Channel Width (0.5 to 1.05), one curve per benchmark: stdev0, stdev002, stdev004, stdev006, stdev008 / meta clone, stdev010, stdev012]
Interconnect Variation: Impact on FPGA Architecture Design
[Figure: Minimum Routed Channel Width (70 to 140) with series Baseline, 10% Area Increase, 20% Area Increase, 25% Area Increase]
• High Variation Circuits Require Wide Channel Width
Critical Path of GNL Benchmarks
[Figure: Normalized Critical Path (0.95 to 1.25) versus % of Max Channel Width (0.5 to 1.05)]
Un/DoPack Congestion Map
[Figure: two 3-D congestion maps of CLB Label (0 to 120) versus CLB X-Location and CLB Y-Location, labeled Before and After Un/DoPack; the first is on a 0 to 50 grid and the second on a 0 to 60 grid]
Multi-Region Un-Pack
• Depopulate multiple regions at once
  – Depopulate each region separately
  – Smaller radius = M/10
• Handle overlapping regions
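Selecting several depopulation centers in one pass might look like the sketch below. The slide does not say how overlapping regions are handled; this example takes one possible policy, skipping any candidate whose radius-M/10 region would overlap one already chosen. The label threshold (here the channel width constraint) and all names are assumptions for illustration.

```python
def multi_region_centers(labels, threshold, divisor=10):
    """Greedily pick congested centers whose radius-M/divisor regions do not overlap."""
    m = len(labels)
    radius = m / divisor
    cells = sorted(((labels[y][x], x, y) for y in range(m) for x in range(m)),
                   key=lambda t: -t[0])          # most congested first
    centers = []
    for label, x, y in cells:
        if label <= threshold:                   # remaining CLBs are not congested
            break
        # Two regions overlap when their centers are within 2 * radius of each other.
        if all((x - cx) ** 2 + (y - cy) ** 2 > (2 * radius) ** 2 for cx, cy in centers):
            centers.append((x, y))
    return centers


# 20 x 20 array with three congested CLBs; two of them are neighbors.
labels = [[0] * 20 for _ in range(20)]
labels[3][3] = 100
labels[3][4] = 95     # adjacent to the first hot spot
labels[15][10] = 90
centers = multi_region_centers(labels, threshold=60)
print(centers)
```

An alternative policy would keep every center and depopulate the union of the regions once.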
Normalized Area
[Figure: Normalized Area (0.80 to 3.00) versus Channel Width Constraint (% of max MRCW, 0.5 to 1) for stdev000, stdev008 / clone, stdev010]
Normalized Critical Path
[Figure: Normalized Critical Path Delay (0.95 to 1.25) versus Channel Width Constraint (% of max MRCW, 0.5 to 1) for stdev000, stdev008 / clone, stdev010]
Run-Time Comparisons
[Figure: Log Run Time (in hours) versus Channel Width Constraint (% of max MRCW, 0.5 to 1) for stdev000, stdev008, stdev010, MR stdev000, MR stdev008 / clone, MR stdev010]
Conclusion
• Un/DoPack: FPGA CAD flow
  – Find “local” congestion -> depopulate -> reduced interconnect demand
• FPGA benchmark circuit “suite”
  – Stdev: Used to vary interconnect demand
• Discoveries...
  – “Non-uniform” depopulation limits area inflation
  – “Interconnect variation” important for area inflation and FPGA architecture design
  – “Routing closure” achieved by re-clustering and incremental place & route
• UNROUTABLE circuits made ROUTABLE -> buy an FPGA with MORE LOGIC!!!
End of Talk