Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware
![Page 1: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/1.jpg)
Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware
Scott Mahlke (University of Michigan) and Rodric Rabbah (IBM)
DAC HLS Tutorial, San Francisco 2009
![Page 2: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/2.jpg)
Dynamic Application Specific Customization of HW
[Figure labels: “your app here”; “JIT compiler configures logic”]
Inspired by ASIC paradigm:
• High Performance
• Low Power
![Page 3: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/3.jpg)
Liquid Metal: “JIT the Hardware”
• Single language for programming HW & SW
• Run in a standard JVM, or synthesize in HW
• Fluidly move computation between HW & SW
• Do for HW (viz. FPGAs) what FORTRAN did for computing
• Address critical technology trends:
  • Power: address impractical growth of power and cooling demands
  • Architecture: enabling million-way parallelism vs. small-scale multicores
  • Versatility: in-the-field and on-the-fly customization to end-user applications
  • Applications: demand for pervasive streaming and mobile content (WWW, multimedia, gaming)
[Figure labels: ASIC-like; Reconfigurable]
![Page 4: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/4.jpg)
Lime: the Liquid Metal Language
Design Principles:
• Object-oriented, Java-like, Java-compatible
• Raise level of abstraction
• Parallel constructs that simplify code
• Target synthesis while retaining generality
![Page 5: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/5.jpg)
4 reasons this is not another *C to HDL approach
• Emphasis on programmer productivity
  • Leverage rich Java IDEs, libraries, and analysis
• Not an auto-parallelization approach
  • Lime is explicitly parallel and synthesizable
• Fast fail-safe mechanism
  • Lime may be refined into a parallel SW implementation
• Intrinsic opportunity for online optimizations
  • Static optimizations with dynamic refinement
![Page 6: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/6.jpg)
Lime Overview
HW (FPGA):
• Computation is well encapsulated
• Data-flow driven computation
• Multiple “clock domains”
• Bit-level control and reasoning
• Memory usage statically determined before layout
Lime:
• Tasks, value types
• Abstract OO programming down to the bit-level!
• Ordinal-indexed arrays, bounded loops
• Streaming primitives
• Template-like generics
• Rate “matching” operators
![Page 7: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/7.jpg)
Streams: Exposing Computational Structure
• Stream primitives are integral to the language
• Tasks in streams are strongly isolated
  • Only the endpoints may perform side-effects
• Provide macro-level functional programming abstraction…
• …while allowing traditional imperative programming inside
![Page 8: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/8.jpg)
A Brief Introduction to Stream Operations
A finite stream literal:
int stream s1 = { 1, 1, 2, 3, 5, 8 };

An infinite stream of 3’s:
int stream s2 = task 3;

Stream expressions:
int stream s3 = s2 * 17;
double stream s4 = Math.sin(s1);
double stream s5 = s3 + s4;

These operations create and connect tasks. Execution occurs later: lazy computation, functional.
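The lazy, connect-first behavior described above can be imitated outside Lime. The following is a minimal Python sketch, not Lime semantics: the helper names `literal`, `constant`, and `lift` are hypothetical, and the rule that the result ends when the finite stream runs out is a property of Python's `map`, assumed here for illustration only.

```python
import itertools
import math

def literal(values):
    """Finite stream literal, standing in for `{ 1, 1, 2, 3, 5, 8 }`."""
    return iter(values)

def constant(value):
    """Infinite stream of one value, standing in for `task 3`."""
    return itertools.repeat(value)

def lift(fn, *streams):
    """Apply fn element-wise. map() is lazy, so nothing is computed yet."""
    return map(fn, *streams)

# Building the graph: no element has been produced at this point.
s1 = literal([1, 1, 2, 3, 5, 8])
s2 = constant(3)
s3 = lift(lambda x: x * 17, s2)
s4 = lift(math.sin, s1)
s5 = lift(lambda a, b: a + b, s3, s4)

# Execution happens only when a consumer pulls from the end of the graph.
result = list(s5)
```

Each element of `result` is 51 plus the sine of the corresponding element of `s1`.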
![Page 9: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/9.jpg)
Simple Audio Processing
value int[] squareWave(int freq, int rate, int amplitude) {
    int wavelength = rate / freq;
    int[] samples = new int[wavelength];
    for (int s: 1::wavelength)
        samples[s] = (s <= wavelength/2) ? 0 : amplitude;
    return (value int[]) samples;
}

int stream sqwaves = task squareWave(1000, 44100, 80);
task AudioSink(44100).play(sqwaves);
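For readers without a Lime toolchain, the square-wave generator can be approximated in Python. This is a sketch under assumptions: indices are 0-based here, the immutable tuple stands in for Lime's `value int[]`, and repeating one period with `itertools.cycle` is only a guess at how the task feeds the stream.

```python
import itertools

def square_wave(freq, rate, amplitude):
    """One period of a square wave: low for the first half, high for the second."""
    wavelength = rate // freq
    return tuple(0 if s < wavelength // 2 else amplitude
                 for s in range(wavelength))

period = square_wave(1000, 44100, 80)

# Assumed stream behavior: the period repeats indefinitely; take 100 samples.
first_samples = list(itertools.islice(itertools.cycle(period), 100))
```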
![Page 10: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/10.jpg)
Liquid Metal Tool Chain
[Figure: tool-chain diagram. Labels, roughly in flow order:]
• Lime → Quicklime Front-End Compiler → Streaming IR; LM VM
• Optimus Back-End Compiler (with FPGA Model) → HDL → Xilinx VHDL Compiler → Xilinx bitfile → LM VM on Virtex5 FPGA
• Crucible Back-End Compiler → C → Cell SDK → Cell binary → LM VM on Cell BE
![Page 11: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/11.jpg)
Streaming Intermediate Representation (SIR)
[Figure: SIR constructs]
• Task
• Pipeline
• SplitJoin (splitter, joiner)
• Feedback Loop (joiner, splitter)
• Switch (switch, joiner)

• Task may be stateless or have state
• Task mapped to “module” with FIFO I/O
• Task graphs are hierarchical & structured
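A hierarchical, structured task graph of this kind can be modeled with a few nested types. The Python sketch below is illustrative only: the class and field names are hypothetical, and the Feedback Loop and Switch constructs are left out to keep it short.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Task:
    name: str
    stateful: bool = False   # a task may be stateless or carry state

@dataclass
class Pipeline:
    stages: List["Node"]     # children run one after another

@dataclass
class SplitJoin:
    branches: List["Node"]   # children run side by side between splitter and joiner

Node = Union[Task, Pipeline, SplitJoin]

def leaf_tasks(node):
    """Flatten the hierarchy into the list of task names, in graph order."""
    if isinstance(node, Task):
        return [node.name]
    children = node.stages if isinstance(node, Pipeline) else node.branches
    return [name for child in children for name in leaf_tasks(child)]

graph = Pipeline([
    Task("source"),
    SplitJoin([Task("f0"), Task("f1", stateful=True)]),
    Task("sink"),
])
names = leaf_tasks(graph)
```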
![Page 12: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/12.jpg)
SIR Compiler Optimizations
Address FPGA compilation challenges:
• Finite, non-virtualizable device
• Complex optimization space
  • Throughput, latency, power, area
• Very long synthesis times (minutes-hours)

• Task fusion and fission: load balancing, scalability
• Stream buffer allocation: locality enhancing, manage cache footprint or SRAM and control logic complexity
• Data access fusion: reduce critical path length, improve communication-to-computation balance
![Page 13: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/13.jpg)
Preliminary Liquid Metal Results on Energy Consumption: FPGA vs PPC 405
[Figure: bar chart. X-axis: FFT, Parallel A..., Bubble Sort, Merge Sort, Discrete ..., DES, Matrix Mult..., Matrix Bloc..., Average. Y-axis: Fraction of PowerPC Energy, 0 to 0.8. Value labels on chart: ~1.4, ~1.4, ~1.4, 2.25.]
• Liquid Metal on Virtex 4 FPGA, 1.6W
• C reference implementation on PPC 405, 0.5W
![Page 14: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/14.jpg)
Preliminary Liquid Metal Results on Parallelism: FPGA vs PPC 405
• Liquid Metal on Virtex 4 FPGA, 1.6W
• C reference implementation on PPC 405, 0.5W
![Page 15: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/15.jpg)
Handel-C Comparison
• Compared DES and DCT with hand-optimized Handel-C implementation
• Performance
  • 5% faster before optimizations
  • 12x faster after optimizations
• Area
  • 66% larger before optimizations
  • 90% larger after optimizations
![Page 16: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/16.jpg)
Overview
Compilation Flow
Scheduling
Optimizations
Results
![Page 17: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/17.jpg)
Top Level Compilation
[Figure: top-level compilation. A Filter is shown as a Controller plus modules M0, M1, …, Mn (Init and Work), with inputs i0, i1, …, ix, outputs O0 … Om, and an array a[ ] indexed by i. An example stream graph, Source → Round-Robin Splitter(8,8,8,8) → four Filters → Round-Robin Joiner(1,1,1,1) → Sink, is shown twice with labels A through J, once with a Controller and a Work module attached to each node.]
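The example graph uses a Round-Robin Splitter(8,8,8,8) and a Round-Robin Joiner(1,1,1,1). A small Python sketch of weighted round-robin distribution follows; it assumes the usual meaning of the weights (that many items per branch per turn) and stops at the first incomplete chunk, which may differ from how the actual hardware handles end of stream.

```python
import itertools

def round_robin_split(items, weights):
    """Deal items to branches in turn, weights[k] items to branch k per turn."""
    outputs = [[] for _ in weights]
    source = iter(items)
    while True:
        for out, weight in zip(outputs, weights):
            chunk = list(itertools.islice(source, weight))
            out.extend(chunk)
            if len(chunk) < weight:      # input exhausted
                return outputs

def round_robin_join(inputs, weights):
    """Collect weights[k] items from branch k in turn until a branch runs dry."""
    sources = [iter(branch) for branch in inputs]
    joined = []
    while True:
        for source, weight in zip(sources, weights):
            chunk = list(itertools.islice(source, weight))
            if len(chunk) < weight:      # incomplete chunk is dropped
                return joined
            joined.extend(chunk)

branches = round_robin_split(range(64), (8, 8, 8, 8))
joined = round_robin_join(branches, (1, 1, 1, 1))
```

Because the splitter and joiner weights differ, the joined stream is a reordering of the input rather than a copy of it.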
![Page 18: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/18.jpg)
Filter Compilation
bb1: sum = 0
     i = 0
bb2: temp = pop( )
bb3: sum = sum + temp
     i = i + 1
     Branch bb2 if i < 8
bb4: push(sum)

[Figure: each basic block (bb1 to bb4) becomes a hardware block with control in, control outs, memory/queue ports, ack, live data ins, and live data outs. The blocks are connected through registers and muxes carrying live-out data, with control tokens and acks passed between them, plus a FIFO read and a FIFO write.]
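One way to read the filter above is as a small state machine in which a control token moves from basic block to basic block. The Python sketch below simulates a single firing of the filter in software; the variable `bb` plays the role of the control token, and the deques stand in for the input and output FIFOs. It is an illustration, not the hardware Optimus generates.

```python
from collections import deque

def run_filter(inp, out):
    """Execute one firing of the filter: pop 8 values, push their sum."""
    bb = 1                      # which basic block holds the control token
    total = i = temp = 0
    while True:
        if bb == 1:             # bb1: initialize
            total, i = 0, 0
            bb = 2
        elif bb == 2:           # bb2: FIFO read
            temp = inp.popleft()
            bb = 3
        elif bb == 3:           # bb3: accumulate, count, branch
            total += temp
            i += 1
            bb = 2 if i < 8 else 4
        else:                   # bb4: FIFO write, firing complete
            out.append(total)
            return

inp = deque(range(1, 9))
out = deque()
run_filter(inp, out)
```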
![Page 19: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/19.jpg)
Operation Compilation
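[Figure: a generic functional unit (FU) with inputs i0 … im, outputs o0 … on, and a predicate. Example: two ADD units and a CMP unit with a register, operating on i, 1, temp, sum, and 8, with Control in, Control out 3, and Control out 4, for the basic block:]
sum = sum + temp
i = i + 1
Branch bb2 if i < 8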
![Page 20: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/20.jpg)
Static Stream Scheduling
[Figure: Filter 1 (Push 2) feeding Filter 2 (Pop 3)]
Each queue has to be deep enough to hold values generated from a single execution of the connected filter
Double buffering is needed
Buffer access is non-blocking
A controller module is needed to orchestrate the schedule
Controller uses finite state machine to execute the steady state schedule
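For the Push 2 / Pop 3 pair in the figure, a steady-state schedule can be derived by balancing tokens produced against tokens consumed. The Python sketch below shows that calculation; the function names are hypothetical, and the "all producer firings, then all consumer firings" ordering is just one valid schedule a controller could step through.

```python
from math import gcd

def steady_state(push_rate, pop_rate):
    """Return (producer firings, consumer firings, tokens moved) per iteration."""
    tokens = push_rate * pop_rate // gcd(push_rate, pop_rate)  # least common multiple
    return tokens // push_rate, tokens // pop_rate, tokens

def static_schedule(push_rate, pop_rate):
    """One possible firing order for a controller's finite state machine."""
    producer, consumer, _ = steady_state(push_rate, pop_rate)
    return ["Filter 1"] * producer + ["Filter 2"] * consumer

firings = steady_state(2, 3)
schedule = static_schedule(2, 3)
```

Here Filter 1 fires three times and Filter 2 twice, moving six tokens per steady-state iteration, so the channel returns to its starting occupancy after each pass.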
![Page 21: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/21.jpg)
Greedy Stream Scheduling
Filter 1
Filter 2
Filters fire eagerly. Blocking channel access.
Allows for potentially smaller channels
Controller is not needed
Results produced with lower latency.
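Eager firing with blocking channel access can be explored with a toy simulation. The Python sketch below reuses the push-2 / pop-3 rates from the static scheduling slide and a hypothetical channel capacity parameter; within each step the producer is tried before the consumer, which is a simplification of truly concurrent hardware.

```python
def simulate_greedy(capacity, steps, push_rate=2, pop_rate=3):
    """Return (producer firings, consumer firings, peak occupancy)."""
    occupancy = peak = produced = consumed = 0
    for _ in range(steps):
        # Producer fires only if the whole push fits; otherwise it blocks.
        if occupancy + push_rate <= capacity:
            occupancy += push_rate
            produced += 1
            peak = max(peak, occupancy)
        # Consumer fires only if enough tokens are queued; otherwise it blocks.
        if occupancy >= pop_rate:
            occupancy -= pop_rate
            consumed += 1
    return produced, consumed, peak

ok = simulate_greedy(capacity=4, steps=6)
stuck = simulate_greedy(capacity=3, steps=6)
```

In this model a channel of 4 entries keeps both filters running, while 3 entries deadlocks after the first push, which illustrates why channel sizing still matters without a controller.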
![Page 22: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/22.jpg)
Latency Comparison
[Figure: bar chart. X-axis: FFT, Parallel Adder, Bubble Sort, Merge Sort, Discrete Cos..., DES, Matrix Multiply, Matrix Block M..., Average. Y-axis: Latency of Static Relative to Greedy, 0 to 18.]
![Page 23: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/23.jpg)
Area Comparison
[Figure: bar chart. X-axis: FFT, Parallel Adder, Bubble Sort, Merge Sort, Discrete Cos..., DES, Matrix Multiply, Matrix Block M..., Average. Y-axis: % of FPGA Area, 0 to 100. Series: circuits with static scheduler; circuits with greedy scheduler.]
![Page 24: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/24.jpg)
Optimizations
• Streaming optimizations (macro functional)
  • Channel allocations, channel access fusion, critical path balancing, filter fission and fusion, etc.
  • Doing these optimizations needs global information about the stream graph
  • Typically performed manually using existing tools
• Classic optimizations (micro functional)
  • Flip-flop elimination, common subexpression elimination, constant folding, loop unrolling, etc.
  • Typically included in existing compilers and tools
![Page 25: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/25.jpg)
Channel Allocation
• Larger channels:
  • More SRAM
  • More control logic
  • Fewer stalls
• Interlocking makes sure that each filter gets the right data or blocks.
• What is the right channel size?
![Page 26: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/26.jpg)
Channel Allocation Algorithm
Set the size of the channels to infinity.
Warm-up the queues.
Record the steady state instruction schedules for each pair.
Unroll the schedules to have the same number of pushes and pops.
Find the maximum number of overlapping lifetimes.
![Page 27: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/27.jpg)
Channel Allocation Example
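[Figure: Source → Filter 1 → Filter 2 → Sink, with the recorded schedules of one producer/consumer pair listed step by step]
Producer: ----, ----, push, ----, push, ----, push, push, push, ----, ----, push
Consumer: ----, ----, pop, ----, ----, ----, pop, ----, pop, pop, pop, pop
Max overlap = 3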
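The last step of the allocation algorithm, finding the maximum number of overlapping lifetimes, can be sketched in Python using the producer and consumer columns from this example. Two assumptions are made here that the slides do not state: the two columns are aligned step for step, and a value occupies a channel slot from its push step through its pop step inclusive. Values are matched in FIFO order.

```python
def max_overlap(producer, consumer):
    """Largest number of values alive in the channel at any one step."""
    pushes = [t for t, op in enumerate(producer) if op == "push"]
    pops = [t for t, op in enumerate(consumer) if op == "pop"]
    assert len(pushes) == len(pops), "schedules must be unrolled to match"
    lifetimes = list(zip(pushes, pops))          # k-th push pairs with k-th pop
    return max(
        sum(1 for start, end in lifetimes if start <= t <= end)
        for t in range(len(producer))
    )

producer = ["-", "-", "push", "-", "push", "-",
            "push", "push", "push", "-", "-", "push"]
consumer = ["-", "-", "pop", "-", "-", "-",
            "pop", "-", "pop", "pop", "pop", "pop"]
channel_size = max_overlap(producer, consumer)
```

Under these assumptions the result agrees with the "Max overlap = 3" shown on the slide.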
![Page 28: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/28.jpg)
Channel Allocation
[Figure: bar chart. X-axis: FFT, Parallel A..., Bubble Sort, Merge Sort, Discrete ..., DES, Matrix Mul..., Matrix Blo..., Average. Y-axis: Relative Channel Size After Optimization, 0 to 0.8.]
![Page 29: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/29.jpg)
Channel Access Fusion
Each channel access (push or pop) takes one cycle.
Communication to computation ratio
Longer critical path latency
Limit task-level parallelism
![Page 30: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/30.jpg)
Channel Access Fusion Algorithm
• Clustering channel access operations
  • Loop unrolling
  • Code motion
  • Balancing the groups
• Similar to vectorization
  • Wide channels
[Figure: channels annotated with grouped reads (r) and writes (w). Labels: Write Mult. = 1, Read Mult. = 8; Write Mult. = 8, Read Mult. = 8; Write Mult. = 4, Read Mult. = 1.]
![Page 31: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/31.jpg)
Access Fusion Example
int sum = 0;
for (int i = 0; i < 32; i++)
    sum += pop();
push(sum);

After fusion:
int sum = 0;
int t1, t2, t3, t4;
for (int i = 0; i < 8; i++) {
    (t1, t2, t3, t4) = pop4();
    sum += t1 + t2 + t3 + t4;
}
push(sum);

Some caveats
int sum = 0;
for (int i = 0; i < 32; i++)
    sum += pop();
pop();
pop();
push(sum);

int sum = 0;
for (int i = 0; i < 8; i++) {
    sum += pop();
    sum += pop();
    sum += pop();
    sum += pop();
}
pop();
pop();
push(sum);
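The equivalence of the fused and unfused loops can be checked with a quick Python model. In this sketch a deque stands in for the channel, `pop4` is modeled as one wide access that returns four values, and each function also reports how many channel accesses it made; these modeling choices are assumptions for illustration.

```python
from collections import deque

def work_unfused(channel):
    """32 single pops: one channel access per value."""
    total = accesses = 0
    for _ in range(32):
        total += channel.popleft()
        accesses += 1
    return total, accesses

def pop4(channel):
    """One wide access that delivers four consecutive values."""
    return tuple(channel.popleft() for _ in range(4))

def work_fused(channel):
    """8 wide pops: same sum, a quarter of the channel accesses."""
    total = accesses = 0
    for _ in range(8):
        t1, t2, t3, t4 = pop4(channel)
        accesses += 1
        total += t1 + t2 + t3 + t4
    return total, accesses

data = list(range(32))
unfused = work_unfused(deque(data))
fused = work_fused(deque(data))
```

Both versions compute the same sum, but the fused one touches the channel 8 times instead of 32.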
![Page 32: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/32.jpg)
Access Fusion
[Figure: bar chart. X-axis: FFT, Parallel A..., Bubble Sort, Merge Sort, Discrete ..., DES, Matrix Mult..., Matrix Bloc..., Average. Y-axis: Speedup (x100%), 0 to 8.]
![Page 33: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/33.jpg)
Critical Path Balancing
• Critical path is set by the longest combinational path in the filters
• Optimus uses its internal FPGA model to estimate how this impacts throughput and latency
• Balancing algorithm:
  • Optimus takes target clock as input
  • Start with least number of basic blocks
  • Form USE/DEF chains for the filter
  • Use the internal FPGA model to measure critical path latency
  • Break the paths whose latency exceeds the target
![Page 34: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/34.jpg)
Critical Path Balancing Example
[Figure: dataflow graphs of Mul, Add, Sub, and Shift operations, annotated with the numbers 1 to 4.]
Operation   Delay
Add/Sub     4
Shift       2
Multiply    10
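The "break the paths whose latency exceeds the target" step can be illustrated on a single dependence chain using the delays from the table (Add/Sub 4, Shift 2, Multiply 10). The Python sketch below is a deliberately simplified greedy pass over one linear chain with a hypothetical target of 14 delay units; the real algorithm works on USE/DEF chains and an FPGA timing model.

```python
DELAY = {"add": 4, "sub": 4, "shift": 2, "mul": 10}   # from the delay table

def balance_chain(ops, target):
    """Split a chain of dependent ops into stages whose delay fits the target."""
    stages, current, used = [], [], 0
    for op in ops:
        delay = DELAY[op]
        if delay > target:
            raise ValueError(f"{op} alone exceeds the target clock")
        if current and used + delay > target:
            stages.append(current)        # break the path here (new register)
            current, used = [], 0
        current.append(op)
        used += delay
    if current:
        stages.append(current)
    return stages

stages = balance_chain(["mul", "add", "mul", "sub", "add", "shift"], target=14)
```

With a target of 14, the 36-unit chain is cut into three stages of 14, 14, and 6 units.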
![Page 35: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/35.jpg)
Liquid Metal
• Interdisciplinary effort addressing the entire stack
• One language for programming HW (FPGAs) and SW
• Liquid Metal VM: JIT the hardware!
[Figure: Liquid Metal VM on top of GPU, multicore CPU, ???, and FPGA. Caption: program all with Lime.]
![Page 36: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/36.jpg)
Streaming IR
Expose structure: computation and communication
Uniform framework for pipeline and data parallelism
Canonical representation for stream-aware optimizations
![Page 37: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/37.jpg)
Streaming Optimizations
Macro-functional:
• Fold streaming IR graphs into FPGA…
• Fusion, fission, replication
• …subject to latency, area, and throughput constraints
Micro-functional:
• Micro-pipelining
• Channel allocation
• Access fusion
• Flip-flop elimination
![Page 38: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816387550346895dd47228/html5/thumbnails/38.jpg)
Ongoing Effort
• Application development
  • Streaming for enterprise and consumer
  • Real-time applications
• Compiler and JIT
  • Pre-provisioning profitable HW implementations
  • Runtime opportunities to “JIT” the HW
• Advanced dynamic reconfiguration support in VM
  • Predictive, hides latency
• New platforms
  • Tightly coupled, higher bandwidth, lower latency communication
  • Heterogeneous MPSoC systems: FPGA + processors