Fast Compilation for Reconfigurable Hardware
Mihai Budiu and Seth Copen Goldstein
Carnegie Mellon University, Computer Science Department
Joint work with Srihari Cadambi, Herman Schmit, Matt Moe, Robert Taylor, Ronald Laufer
FPGA, Feb 23 1999 (c) 1998 by Mihai Budiu 2
Goal
To program reconfigurable devices using the standard software development processes:
– Compile C or Java
– Do it quickly
[Figure: compilation flow. Labels: Java, Partitioner, DIL (Data-flow Intermediate Language), Configuration, Reconfigurable HW, CPU. One part of the flow is marked "This talk".]
Compiler Performance on 1D DCT (8 inputs, 8 bits each)
                       DIL            Classical tools
Total compile time     2.4 s          ~75 min (Synopsys + Design Manager)
Place and route        1 s            14 min 22 s (Design Manager)
Target clock speed     75 MHz         33 MHz
Circuit size           7816 bit-ops   899 CLBs
Application speed-up   20             ~20
Target                 PipeRench      Xilinx 4085XL
Compilation: ~700x faster
The Place and Route Problem
[Figure: a graph of operators (symbols as extracted: +, ., <<, >>, &, ~, [1,2]) with their interconnections, shown next to a set of processing elements joined by an interconnection network.]
Our Target:
• Medium grain processing elements (4 bits)
• Pipelined architecture
• Virtualized hardware
• Local interconnection network
• Wide pipelined bus
The Place and Route Problem
[Figure: the previous operator and processing-element diagram repeated, with the added label "Stripe".]
Why Place and Route Is Hard
• Hard constraints:
  – Stripe width
  – Pipelined bus width
• Word-based circuit
  – interconnection network switches words
  – fixed PE size
• Scarce input ports for the interconnection network
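As a rough illustration of the stripe-width constraint with fixed-size PEs, the sketch below counts how many PEs a group of word-wide operators would occupy and checks the total against one stripe. The 4-bit PE size is taken from the "Our Target" slide; the function names and the 16-PE stripe are assumptions made for this example only, not published PipeRench parameters.

```python
import math

PE_BITS = 4  # PE width, from the "Our Target" slide


def pes_needed(op_width_bits, pe_bits=PE_BITS):
    """Number of fixed-size PEs an operator of the given bit width occupies."""
    return math.ceil(op_width_bits / pe_bits)


def fits_in_stripe(op_widths, pes_per_stripe, pe_bits=PE_BITS):
    """True if all operators can share one stripe (stripe size is hypothetical)."""
    return sum(pes_needed(w, pe_bits) for w in op_widths) <= pes_per_stripe


# Assumed stripe of 16 PEs, chosen only to make the example concrete.
print(fits_in_stripe([32, 8, 8], pes_per_stripe=16))   # needs 8 + 2 + 2 = 12 PEs
print(fits_in_stripe([32, 32, 8], pes_per_stripe=16))  # needs 8 + 8 + 2 = 18 PEs
```

Because the PE size is fixed, an operator whose width is not a multiple of 4 bits still consumes a whole extra PE, which is one reason the stripe width acts as a hard limit rather than a soft cost.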
How We Simplify Place and Route
• Computation-oriented programs (restricted language, with unidirectional data flow)
• Hardware resources virtualized
• Relatively rich interconnection network
• High granularity placement (i.e. one 32-bit adder instead of 100 gates)
• There is a wide pipelined bus available
• Timing is very predictable
The Key Idea
• Global analysis and transformations guarantee placeability using lazy noops (conservatively)
• Deterministic, greedy place & route (no backtracking)
• All passes linear time in the size of the circuit
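The three points above can be made concrete with a toy placer. This is a minimal sketch under a deliberately simplified model, not the DIL compiler's actual algorithm: each stripe is a row of `width` columns, a node can read an operand only if the operand's column is within `reach` columns of its own, values travel down to later stripes in the same column at no cost, and the number of stripes is unbounded (standing in for virtualized hardware). All names and the model itself are assumptions.

```python
def greedy_place(nodes, width, reach):
    """Place nodes (given in topological order) without backtracking.

    nodes: list of (name, [input names]).
    Returns (placement, noops): placement maps name -> (stripe, column),
    noops lists (value name, stripe, column) for each promoted lazy noop.
    """
    assert width >= 1 and reach >= 1
    placement, used, noops = {}, {}, []

    def is_free(stripe, col):
        return col not in used.setdefault(stripe, set())

    for name, inputs in nodes:
        cols = [placement[i][1] for i in inputs]  # where each operand is now
        stripe = 1 + max((placement[i][0] for i in inputs), default=-1)
        while True:
            # Columns from which every operand is reachable.
            lo = max(0, max(cols, default=0) - reach)
            hi = min(width - 1, min(cols, default=width - 1) + reach)
            if lo <= hi:
                col = next((c for c in range(lo, hi + 1)
                            if is_free(stripe, c)), None)
                if col is not None:
                    used[stripe].add(col)
                    placement[name] = (stripe, col)
                    break
            else:
                # Unroutable: promote the lazy noop on the rightmost operand,
                # moving that value `reach` columns toward the others.
                k = cols.index(max(cols))
                target = cols[k] - reach
                if is_free(stripe, target):
                    used[stripe].add(target)
                    noops.append((inputs[k], stripe, target))
                    cols[k] = target
            stripe += 1  # only ever move forward: no backtracking
    return placement, noops


graph = [("a", []), ("b", []), ("c", []), ("d", []), ("x", ["a", "d"])]
placement, noops = greedy_place(graph, width=8, reach=1)
print(placement["x"], noops)
```

In this example the operands of `x` sit three columns apart, so no single column reaches both; one lazy noop on `d` is instantiated, and `x` is placed one stripe later. Operands that are already close enough never cost a noop, which is the sense in which the noops are lazy.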
Guaranteeing Placement
[Figure: the operator graph before and after noop insertion. Labels as extracted: "Complex permutation" (once), "Simple permutation" (three times), and two inserted noops.]
The inserted noops are sufficient but not necessary
Placement of a Non-lazy Noop
[Figure: placement example with the operators &~ and + and several noops.]
Lazy Noops Are Not Placed
[Figure: placement example with the operators &~ and + and two noops.]
Place and Route Overview
• Analysis:
  – Noops have been inserted to guarantee that the graph is routable.
• Place & Route:
  – will determine which lazy noops are instantiated
Next: actual Place and Route
Step 1: Analyze Routability
[Figure: part of the graph (&, ~, &~ and two noops) is marked "Already placed"; the + node is shown at seven candidate positions.]
Q: can we place the + given the placement of its ancestors?
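One way to picture the question above is as an interval intersection. The sketch below assumes a hypothetical interconnect in which a node can read an ancestor only if the two are at most `reach` columns apart; the function name, the parameters, and the seven-column stripe are illustrative assumptions, not the actual PipeRench routing rule.

```python
def routable_columns(ancestor_cols, occupied, width, reach):
    """Free columns from which every already-placed ancestor is reachable."""
    lo = max(0, max(ancestor_cols) - reach)          # limited by rightmost ancestor
    hi = min(width - 1, min(ancestor_cols) + reach)  # limited by leftmost ancestor
    return [c for c in range(lo, hi + 1) if c not in occupied]


# Seven candidate positions, ancestors in columns 1 and 4, reach of 2 columns.
print(routable_columns([1, 4], occupied=set(), width=7, reach=2))
# Ancestors too far apart: the list is empty and the node is unroutable.
print(routable_columns([0, 6], occupied=set(), width=7, reach=2))
```

An empty result is the signal that the node cannot be placed as is.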
Step 2: If a Node Is Unroutable
Solution: promote a lazy noop
[Figure: the graph with +, &~ and two noops, shown twice.]
Step 3: Choosing a Noop
Choose the closest noop that is routable.
[Figure: the graph with +, &~ and two noops, shown twice.]
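The selection rule can be sketched as a nearest-first scan. The slide does not say what "closest" is measured against; the code below assumes it means the fewest hops from the unroutable node, and it takes the routability of each lazy noop as a precomputed flag. The `LazyNoop` class and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LazyNoop:
    name: str
    distance: int               # hops from the unroutable node (assumed metric)
    routable: bool              # outcome of a routability check for this noop
    instantiated: bool = False  # lazy noops start out unplaced


def promote_closest(noops):
    """Instantiate the nearest routable lazy noop and return it, else None."""
    for noop in sorted(noops, key=lambda n: n.distance):
        if noop.routable and not noop.instantiated:
            noop.instantiated = True
            return noop
    return None


candidates = [
    LazyNoop("n3", distance=3, routable=True),
    LazyNoop("n1", distance=1, routable=False),
    LazyNoop("n2", distance=2, routable=True),
]
print(promote_closest(candidates).name)
```

Only one noop is promoted per call, so noops farther away stay lazy and consume no hardware unless a later node needs them.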
Other Details
• Operators are decomposed into pieces for:
  – timing constraints
  – size constraints
• When placing, optimize for:
  – register pressure when accessing the bus
  – constraints placed on future nodes
• Long critical paths are sliced with pipeline registers
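Compilation Times (Seconds on PII/400)
[Chart: compilation time per kernel, axis 0 to 9 seconds. Data labels in extraction order (the kernel for each value is not recoverable): 1.36, 2.27, 1.25, 0.13, 2.43, 0.84, 8.07, 0.07, 0.95, 0.47, 0.86]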
Compilation Speed (PII/400)
[Chart: two series, bitops and bitops/sec. Axes: Bit Operations/Kernel (0 to 20000) and Bit Operations Compiled/Sec (0 to 12000).]
Compilation Times Breakdown
[Chart: share of compilation time, 0% to 100%, by phase: evaluation, simplification, library, analysis, place, other. An annotation reads "Place and route".]
Placed Circuit Utilization
[Chart: two series, utilization and effective utilization; axis 0% to 100%.]
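Simulated Speed-up vs. UltraSparc @ 300 MHz
[Chart: logarithmic axis from 1.0 to 1000.0. The pairing below assumes the data labels were extracted in the same order as the kernel names.]
Kernel    Speed-up
ATR       328.8
Cordic    29.0
DCT       20.6
FIR       90.9
IDEA      61.8
Nqueens   26.0
Over      76.1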
Conclusions
• Fast compilation from HLL achievable (seconds, not tens of minutes)
• High-quality output achievable (60% density)
• Linear-time Place and Route feasible using the technique of lazy noops
Future Work
• Time-multiplexing the bus
• Porting to commercial FPGAs
• Front-end from C/Java to DIL
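Timing and Size Guarantees
[Figure: + operators with edges labeled 24 and 8, consistent with an addition on 24-bit values shown next to an equivalent group of additions on 8-bit pieces.]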
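Reading the 24 and 8 labels as bit widths, the decomposition can be illustrated by adding two 24-bit values using only 8-bit additions. How the deck's compiler actually links the pieces is not stated; the carry chain below is an assumption, and the function name and parameters are invented for the example.

```python
def add_in_pieces(a, b, total_bits=24, piece_bits=8):
    """Add two unsigned values using only piece_bits-wide additions."""
    mask = (1 << piece_bits) - 1
    result, carry = 0, 0
    for shift in range(0, total_bits, piece_bits):
        # One narrow adder: two piece-wide operands plus the incoming carry.
        s = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        result |= (s & mask) << shift
        carry = s >> piece_bits
    return result  # the carry out of the top piece is dropped (wrap-around)


a, b = 0x00FFFF, 0x000001
print(hex(add_in_pieces(a, b)), hex((a + b) & 0xFFFFFF))
```

Each narrow addition has a short, known delay and a known size, which is what makes per-piece timing and size bounds possible; the price is that the carry has to be passed from one piece to the next.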
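Optimize for Register Pressure
[Figure: ancestors & and ~, with the + node shown at seven candidate positions. Cost row as extracted: 1 2 1 -- -- 0. One position is marked "Best position". The graph with +, &~ and two noops is shown alongside.]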
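Picking the best position from a cost row like the one in the figure amounts to a minimum over the usable candidates. The sketch below assumes that "--" marks a position that cannot be used and that a lower cost means less register pressure; both readings, and the function name, are assumptions.

```python
def best_position(costs):
    """Index of the cheapest usable position; None marks an unusable one."""
    usable = [(cost, i) for i, cost in enumerate(costs) if cost is not None]
    return min(usable)[1] if usable else None  # ties go to the lowest index


# Cost row from the figure, with "--" written as None.
costs = [1, 2, 1, None, None, 0]
print(best_position(costs))
```

Because the choice is a single pass over the candidates, it fits the greedy, no-backtracking style of the rest of the placer.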
Kernels
Benchmark   Description
ATR         Automatic Target Recognition (image pattern scan).
Cordic      Honeywell timing benchmark for vector rotation.
CSD         Canonical signed digit multiplier with the constant 123.
DCT         One-dimensional 8-point discrete cosine transform.
Encoder     Huffman encoder for fixed frequencies.
FIR         Finite Impulse Response filter with 20 taps.
IDEA        PGP encryption algorithm.
Nqueens     8x8 queens solution tester.
Over        Porter-Duff “over” operator.
Square      Squaring a 16-bit number.
Varpoly     Evaluating a degree-3 polynomial with variable coefficients at a given point.