From Software to Circuits: High-Level Synthesis for FPGA-Based Processor/Accelerator Systems
High-Level Synthesis: Creating Custom Circuits from High-Level Code
description
Transcript of High-Level Synthesis: Creating Custom Circuits from High-Level Code
High-Level Synthesis: Creating Custom Circuits from High-Level Code
Greg Stitt
ECE Department
University of Florida
Existing FPGA Tool Flow Register-transfer (RT) synthesis
Specify RT structure (muxes, registers, etc) + Allows precise specification - But, time consuming, difficult, error prone
HDL
Netlist
Bitfile
Processor FPGA
RT Synthesis
Physical Design
Technology Mapping
Placement
Routing
Future FPGA Tool Flow?
HDL
Netlist
Bitfile
Processor FPGA
RT Synthesis
Physical Design
Technology Mapping
Placement
Routing
High-level Synthesis
C/C++, Java, etc.
High-level Synthesis Wouldn’t it be nice to write high-level code?
Ratio of C to VHDL developers (10000:1 ?) + Easier to specify + Separates function from architecture
+ More portable - Hardware potentially slower
Similar to assembly code era Programmers could always beat compiler But, no longer the case
Hopefully, high-level synthesis will catch up to manual effort
High-level Synthesis More challenging than compilation
Compilation maps behavior into assembly instructions
Architecture is known to compiler High-level synthesis creates a custom architecture
to execute behavior Huge hardware exploration space Best solution may include microprocessors Should handle any high-level code
Not all code appropriate for hardware
High-level Synthesis
First, consider how to manually convert high-level code into circuit
Steps 1) Build FSM for controller 2) Build datapath based on FSM
acc = 0;for (i=0; i < 128; i++) acc += a[i];
Manual Example
Build a FSM (controller) Decompose code into states
acc = 0;for (i=0; i < 128; i++) acc += a[i];
if (i < 128)
acc=0, i = 0
load a[i]
acc += a[i]
i++
Done
Manual Example
Build a datapath Allocate resources for each state
acci
if (i < 128)
acc=0, i = 0
load a[i]
acc += a[i]
i++
Done
<
addra[i]
++ +
1 128 1
acc = 0;for (i=0; i < 128; i++) acc += a[i];
Manual Example
Build a datapath Determine register inputs
acci
if (i < 128)
acc=0, i = 0
load a[i]
acc += a[i]
i++
Done
<
addra[i]
++ +
1 128
2x1
0
2x1
0
1
2x1
&a
In from memory
acc = 0;for (i=0; i < 128; i++) acc += a[i];
Manual Example
Build a datapath Add outputs
acci
if (i < 128)
acc=0, i = 0
load a[i]
acc += a[i]
i++
Done
<
addra[i]
++ +
1 128
2x1
0
2x1
0
1
2x1
&a
In from memory
Memory addressacc
acc = 0;for (i=0; i < 128; i++) acc += a[i];
Manual Example
Build a datapath Add control signals
acci
if (i < 128)
acc=0, i = 0
load a[i]
acc += a[i]
i++
Done
<
addra[i]
++ +
1 128
2x1
0
2x1
0
1
2x1
&a
In from memory
Memory addressacc
acc = 0;for (i=0; i < 128; i++) acc += a[i];
Manual Example
Combine controller+datapath
acci
<
addra[i]
++ +
1 128
2x1
0
2x1
0
1
2x1
&a
In from memory
Memory addressaccDone Memory Read
Controller
acc = 0;for (i=0; i < 128; i++) acc += a[i];
Manual Example
Alternatives Use one adder (plus muxes)
acci
<
addra[i]
+
128
2x1
0
2x1
0
1
2x1
&a
In from memory
Memory addressacc
MUX MUX
Manual Example
Comparison with high-level synthesis Determining when to perform each
operation => Scheduling
Allocating resource for each operation => Resource allocation
Mapping operations onto resources => Binding
Another Example
Your turnx=0;for (i=0; i < 100; i++) { if (a[i] > 0) x ++; else x --;
a[i] = x;}//output x
Steps1) Build FSM (do not perform if conversion)2) Build datapath based on FSM
High-Level Synthesis
High-level Code
Custom Circuit
High-Level Synthesis
Could be C, C++, Java, Perl, Python, SystemC, ImpulseC, etc.
Usually a RT VHDL description, but could as low level as a bit file
High-Level Synthesis
High-Level Synthesis
acc = 0;for (i=0; i < 128; i++) acc += a[i];
acci
<
addra[i]
++ +1 128
2x1
0
2x1
0
1
2x1
&a
In from memory
Memory addressaccDone Memory Read
Controller
Main Steps
Syntactic Analysis
Optimization
Scheduling/Resource Allocation
Binding/Resource Sharing
High-level Code
Intermediate Representation
Controller + Datapath
Converts code to intermediate representation - allows all following steps to use language independent format.
Determines when each operation will execute, and resources used
Maps operations onto physical resources
Front-end
Back-end
Syntactic Analysis Definition: Analysis of code to verify syntactic
correctness Converts code into intermediate representation
2 steps 1) Lexical analysis (Lexing) 2) Parsing
Syntactic Analysis
High-level Code
Intermediate Representation
Lexical Analysis
Parsing
Lexical Analysis Lexical analysis (lexing) breaks code into a
series of defined tokens Token: defined language constructs
x = 0;if (y < z) x = 1;
Lexical Analysis
ID(x), ASSIGN, INT(0), SEMICOLON, IF, LPAREN, ID(y), LT, ID(z), RPAREN, ID(x), ASSIGN, INT(1), SEMICOLON
Lexing Tools
Define tokens using regular expressions - outputs C code that lexes input Common tool is “lex”
/* braces and parentheses */"[" { YYPRINT; return LBRACE; }"]" { YYPRINT; return RBRACE; }"," { YYPRINT; return COMMA; }";" { YYPRINT; return SEMICOLON; }"!" { YYPRINT; return EXCLAMATION; }"{" { YYPRINT; return LBRACKET; }"}" { YYPRINT; return RBRACKET; }"-" { YYPRINT; return MINUS; }
/* integers[0-9]+ { yylval.intVal = atoi( yytext ); return INT;}
Parsing
Analysis of token sequence to determine correct grammatical structure Languages defined by context-free
grammar
Program = ExpExp = Stmt SEMICOLON |
IF LPAREN Cond RPAREN Exp |
Exp Exp
Cond = ID Comp ID
Stmt = ID ASSIGN INTx = 0;if (y < z) x = 1;
x = 0; x = 0; y = 1;
if (a < b) x = 10;
if (var1 != var2) x = 10;
x = 0;if (y < z) x = 1; y = 5; t = 1;
GrammarCorrect Programs
Comp = LT | NE
Parsing
Program = ExpExp = S SEMICOLON |
IF LPAREN Cond RPAREN Exp |
Exp Exp
Cond = ID Comp ID
S = ID ASSIGN INT
Grammar
Comp = LT | NE
x = y;
x = 3 + 5;
x = 5;;
if (x+5 > y) x = 2;
Incorrect Programs
x = 5
Parsing Tools Define grammar in special language
Automatically creates parser based on grammar Popular tool is “yacc” - yet-another-compiler-
compiler
program: functions { $$ = $1; } ;
functions: function { $$ = $1; } | functions function { $$ = $1; } ; function: HEXNUMBER LABEL COLON code { $$ = $2; } ;
Intermediate Representation
Parser converts tokens to intermediate representation Usually, an abstract syntax tree
x = 0;if (y < z) x = 1;d = 6;
Assign
if
cond assign assign
x 0
x 1 d 6y < z
Intermediate Representation Why use intermediate representation?
Easier to analyze/optimize than source code Theoretically can be used for all languages
Makes synthesis back end language independent
Syntactic Analysis
C Code
Intermediate Representation
Syntactic Analysis
Java
Syntactic Analysis
Perl
Back End
Scheduling, resource allocation, binding, independent of source language - sometimes optimizations too
Intermediate Representation
Different Types Abstract Syntax Tree Control/Data Flow Graph (CDFG) Sequencing Graph
Etc.
We will focus on CDFG Combines control flow graph (CFG) and
data flow graph (DFG)
Control flow graphs CFG
Represents control flow dependencies of basic blocks
Basic block is section of code that always executes from beginning to end
I.e. no jumps into or out of block
acc = 0;for (i=0; i < 128; i++) acc += a[i];
if (i < 128)
acc=0, i = 0
acc += a[i]i ++
Done
Control flow graphs
Your turn
Create a CFG for this code
i = 0;while (j < 10) { if (x < 5) y = 2; else if (z < 10) y = 6;}
Data Flow Graphs
DFG Represents data dependencies between
operations
x = a+b;y = c*d;z = x - y;
+ *
-
a b c d
x y z
Control/Data Flow Graph
Combines CFG and DFG Maintains DFG for each node of CFG
acc = 0;for (i=0; i < 128; i++) acc += a[i];
if (i < 128)
acc=0; i=0;
acc += a[i]i ++
Done
acc
0
i
0
+
acc a[i]
acc
+
i 1
i
High-Level Synthesis: Optimization
Synthesis Optimizations After creating CDFG, high-level synthesis
optimizes graph Goals
Reduce area Improve latency Increase parallelism Reduce power/energy
2 types Data flow optimizations Control flow optimizations
Data Flow Optimizations
Tree-height reduction Generally made possible from commutativity, associativity,
and distributivity
+
+
+
+ +
+
a b c da b c d
+
+
*
a b c d
+ *
+
a b c d
Data Flow Optimizations
Operator Strength Reduction Replacing an expensive (“strong”) operation with a faster
one Common example: replacing multiply/divide with shift
b[i] = a[i] * 8; b[i] = a[i] << 3;
a = b * 5; c = b << 2;a = b + c;
1 multiplication 0 multiplications
a = b * 13;c = b << 2;d = b << 3;a = c + d + b;
Data Flow Optimizations
Constant propagation Statically evaluate expressions with
constants
x = 0;y = x * 15;z = y + 10;
x = 0;y = 0;z = 10;
Data Flow Optimizations
Function Specialization Create specialized code for common inputs
Treat common inputs as constants If inputs not known statically, must include if statement for each
call to specialized function
int f (int x) { y = x * 15; return y + 10;}
for (I=0; I < 1000; I++) f(0); …}
int f_opt () { return 10;}
for (I=0; I < 1000; I++) f_opt(0); …}
Treat frequent input as a constant
int f (int x) { y = x * 15; return y + 10;}
Data Flow Optimizations
Common sub-expression elimination If expression appears more than once,
repetitions can be replaced
a = x + y; . . . . . . . . . . . . b = c * 25 + x + y;
a = x + y; . . . . . . . . . . . . b = c * 25 + a;
x + y already determined
Data Flow Optimizations
Dead code elimination Remove code that is never executed
May seem like stupid code, but often comes from constant propagation or function specialization
int f (int x) { if (x > 0 ) a = b * 15; else a = b / 4; return a;}
int f_opt () { a = b * 15; return a;}
Specialized version for x > 0 does not need else branch - “dead code”
Data Flow Optimizations
Code motion (hoisting/sinking) Avoid repeated computation
for (I=0; I < 100; I++) { z = x + y; b[i] = a[i] + z ;}
z = x + y;for (I=0; I < 100; I++) { b[i] = a[i] + z ;}
Control Flow Optimizations
Loop Unrolling Replicate body of loop
May increase parallelism
for (i=0; i < 128; i++) a[i] = b[i] + c[i];
for (i=0; i < 128; i+=2) { a[i] = b[i] + c[i]; a[i+1] = b[i+1] + c[i+1]}
Control Flow Optimizations
Function Inlining Replace function call with body of function
Common for both SW and HW SW - Eliminates function call instructions HW - Eliminates unnecessary control states
for (i=0; i < 128; i++) a[i] = f( b[i], c[i] );. . . .int f (int a, int b) { return a + b * 15;}
for (i=0; i < 128; i++) a[i] = b[i] + c[i] * 15;
Control Flow Optimizations
Conditional Expansion Replace if with logic expression
Execute if/else bodies in parallel
y = abif (a) x = b+delse x =bd
y = abx = a(b+d) + a’bd
y = abx = y + d(a+b)
[DeMicheli]
Can be further optimized to:
Example
Optimize this
x = 0;y = a + b;if (x < 15) z = a + b - c;else z = x + 12;output = z * 12;
High-Level Synthesis:Scheduling/Resource Allocation
Scheduling Scheduling assigns a start time to each
operation in DFG Start times must not violate dependencies in DFG Start times must meet performance constraints
Alternatively, resource constraints
Performed on the DFG of each CFG node => Can’t execute multiple CFG nodes in parallel
Examples
+
+
+
+ +
+
a b c da b c d
+ +
+
a b c d
Cycle1
Cycle2
Cycle3
Cycle3
Cycle1 Cycle2
Cycle1
Cycle2
Scheduling Problems Several types of scheduling problems
Usually some combination of performance and resource constraints
Problems: Unconstrained
Not very useful, every schedule is valid Minimum latency Latency constrained Mininum-latency, resource constrained
i.e. find the schedule with the shortest latency, that uses less than a specified # of resources
NP-Complete Mininum-resource, latency constrained
i.e. find the schedule that meets the latency constraint (which may be anything), and uses the minimum # of resources
NP-Complete
Minimum Latency Scheduling ASAP (as soon as possible) algorithm
Find a candidate node Candidate is a node whose predecessors have been scheduled and
completed (or has no predecessors) Schedule node one cycle later than max cycle of predecessor Repeat until all nodes scheduled
+ +
*
a b c d
*
- <
e f g h
Cycle1
Cycle2
Cycle3
+Cycle4
Minimum possible latency - 4 cycles
Minimum Latency Scheduling ALAP (as late as possible) algorithm
Run ASAP, get minimum latency L Find a candidate
Candidate is node whose successors are scheduled (or has none) Schedule node one cycle before min cycle of predecessor
Nodes with no successors scheduled to cycle L Repeat until all nodes scheduled
+ +
*
a b c d
*
- <
e f g h
Cycle1
Cycle2
Cycle3
+Cycle4
Cycle4
Cycle3
L = 4 cycles
Minimum Latency Scheduling ALAP (as late as possible) algorithm
Run ASAP, get minimum latency L Find a candidate
Candidate is node whose successors are scheduled (or has none) Schedule node one cycle before min cycle of predecessor
Nodes with no successors scheduled to cycle L Repeat until all nodes scheduled
+ +
*
a b c d
* -
<
e f g h
Cycle1
Cycle2
Cycle3
+Cycle4
L = 4 cycles
Minimum Latency Scheduling ALAP
Has to run ASAP first, seems pointless But, many heuristics need the mobility/slack of
each operation ASAP gives the earliest possible time for an operation ALAP gives the latest possible time for an operation
Slack = difference between earliest and latest possible schedule
Slack = 0 implies operation has to be done in the current scheduled cycle
The larger the slack, the more options a heuristic has to schedule the operation
Latency-Constrained Scheduling
Instead of finding the minimum latency, find latency less than L Solutions:
Use ASAP, verify that minimum latency less than L
Use ALAP starting with cycle L instead of minimum latency (don’t need ASAP)
Scheduling with Resource Constraints
Schedule must use less than specified number of resources
+ +
*
a b c d
+
-
e f g
Cycle1
Cycle3
Cycle4
+Cycle5
*
Cycle2
Constraints: 1 ALU (+/-), 1 Multiplier
Scheduling with Resource Constraints
Schedule must use less than specified number of resources
+ +
*
a b c d
+
-
e f g
Cycle1
Cycle2
Cycle3
+Cycle4
*
Constraints: 2 ALU (+/-), 1 Multiplier
Mininum-Latency, Resource-Constrained Scheduling
Definition: Given resource constraints, find schedule that has the minimum latency Example:
+ +
+
a b c d
-
e f g
Cycle1
Cycle3
Cycle6+
*
Cycle2
Constraints: 1 ALU (+/-), 1 Multiplier
Cycle4
Cycle5
Mininum-Latency, Resource-Constrained Scheduling
Definition: Given resource constraints, find schedule that has the minimum latency Example:
+ +
+
a b c d
-
e f g
Cycle1
Cycle4
Cycle5+
*
Cycle2
Constraints: 1 ALU (+/-), 1 Multiplier
Cycle3
Different schedules may use same resources, but have different latencies
Mininum-Latency, Resource-Constrained Scheduling
Hu’s Algorithm Assumes one type of resource
Basic Idea Input: graph, # of resources r 1) Label each node by max distance from output
i.e. Use path length as priority 2) Determine C, the set of scheduling candidates
Candidate if either no predecessors, or predecessors scheduled
3) From C, schedule up to r nodes to current cycle, using label as priority
4) Increment current cycle, repeat from 2) until all nodes scheduled
Mininum-Latency, Resource-Constrained Scheduling
Hu’s Algorithm Example
+ +
*
a b c d
+
-
e f g
+
*
j k
-
r = 3
Mininum-Latency, Resource-Constrained Scheduling
Hu’s Algorithm Step 1 - Label each node by max distance
from output i.e. use path length as priority
a b c d e f g
4 4 3
23
2
1
j k
1
r = 3
Mininum-Latency, Resource-Constrained Scheduling
Hu’s Algorithm Step 2 - Determine C, the set of scheduling
candidates
a b c d e f g
4 4 3
23
2
1
j k
1C =
r = 3Cycle 1
Mininum-Latency, Resource-Constrained Scheduling
Hu’s Algorithm Step 3 - From C, schedule up to r nodes to
current cycle, using label as priority
a b c d e f g
4 4 3
23
2
1
j k
1
r = 3
Not scheduled due to lower priority
Cycle1
Cycle 1
Mininum-Latency, Resource-Constrained Scheduling
Hu’s Algorithm Step 2
a b c d e f g
4 4 3
23
2
1
j k
1
r = 3
C =
Cycle1
Cycle 2
Mininum-Latency, Resource-Constrained Scheduling
Hu’s Algorithm Step 3
a b c d e f g
4 4 3
23
2
1
j k
1
r = 3
Cycle1
Cycle2
Cycle 2
Mininum-Latency, Resource-Constrained Scheduling
Hu’s Algorithm Skipping to finish
a b c d e f g
4 4 3
23
2
1
j k
1
r = 3
Cycle1
Cycle2
Cycle3
Cycle4
Mininum-Latency, Resource-Constrained Scheduling
Hu’s is simplified problem Common Extensions:
Multiple resource types Multi-cycle operation
a b c d
+ -
/
*Cycle1
Cycle2
Mininum-Latency, Resource-Constrained Scheduling
List Scheduling - (minimum latency, resource-constrained version) Extension for multiple resource types Basic Idea - Hu’s algorithm for each resource type
Input: graph, set of constraints R for each resource type 1) Label nodes based on max distance to output 2) For each resource type t
3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors )
4) Schedule up to Rt operations from C based on priority, to current cycle
Rt is the constraint on resource type t 3) Increment cycle, repeat from 2) until all nodes
scheduled
Mininum-Latency, Resource-Constrained Scheduling
List scheduling - minimum latency Step 1 - Label nodes based on max distance to
output (not shown, so you can see operations) *nodes given IDs for illustration purposes
a b c d e f g
* + +
**
+
-
j k
-1 2 3 4
5 6
78
2 ALUs (+/-), 2 Multipliers
Mininum-Latency, Resource-Constrained Scheduling
List scheduling - minimum latency For each resource type t
3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors)
4) Schedule up to Rt operations from C based on priority, to current cycle
Rt is the constraint on resource type t
a b c d e f g
* + +
**
+
-
j k
-1 2 3 4
5 6
78
Mult ALU Cycle
1 2,3,4 1
2 ALUs (+/-), 2 Multipliers Candidates
Mininum-Latency, Resource-Constrained Scheduling
List scheduling - minimum latency For each resource type t
3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors)
4) Schedule up to Rt operations from C based on priority, to current cycle
Rt is the constraint on resource type t
a b c d e f g
* + +
**
+
-
j k
-1 2 3 4
5 6
78
Mult ALU Cycle
1 2,3,4 1
2 ALUs (+/-), 2 Multipliers Candidates
Cycle1Candidate, but not scheduled due to low priority
Mininum-Latency, Resource-Constrained Scheduling
List scheduling - minimum latency For each resource type t
3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors)
4) Schedule up to Rt operations from C based on priority, to current cycle
Rt is the constraint on resource type t
a b c d e f g
* + +
**
+
-
j k
-1 2 3 4
5 6
78
Mult ALU Cycle
1 2,3,4 1
5,6 4 2
2 ALUs (+/-), 2 Multipliers Candidates
Cycle1
Mininum-Latency, Resource-Constrained Scheduling
List scheduling - minimum latency For each resource type t
3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors)
4) Schedule up to Rt operations from C based on priority, to current cycle
Rt is the constraint on resource type t
a b c d e f g
* + +
**
+
-
j k
-1 2 3 4
5 6
78
Mult ALU Cycle
1 2,3,4 1
5,6 4 2
2 ALUs (+/-), 2 Multipliers Candidates
Cycle1
Cycle2
Mininum-Latency, Resource-Constrained Scheduling
List scheduling - minimum latency For each resource type t
3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors)
4) Schedule up to Rt operations from C based on priority, to current cycle
Rt is the constraint on resource type t
a b c d e f g
* + +
**
+
-
j k
-1 2 3 4
5 6
78
Mult ALU Cycle
1 2,3,4 1
5,6 4 2
7 3
2 ALUs (+/-), 2 Multipliers Candidates
Cycle1
Cycle2
Mininum-Latency, Resource-Constrained Scheduling
List scheduling - (minimum latency) Final schedule Note - ASAP would require more resources
ALAP wouldn’t but in general, it would
a b c d e f g
* + +
**
+
-
j k
-
1 2 3
45 6
78
Cycle1
Cycle2
Cycle3
Cycle4
2 ALUs (+/-), 2 Multipliers
Mininum-Latency, Resource-Constrained Scheduling
Extension for multicycle operations Same idea (differences shown in red) Input: graph, set of constraints R for each resource type 1) Label nodes based on max cycle latency to output 2) For each resource type t
3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled and completed predecessors)
4) Schedule up to (Rt - nt) operations from C based on priority, one cycle after predecessor
Rt is the constraint on resource type t nt is the number of resource t in use from previous cycles
Repeat from 2) until all nodes scheduled
Mininum-Latency, Resource-Constrained Scheduling
Example:
a b c d e f g
*+ +
*
*
+
-
j k
*
1 2 3
4
5
6
78
Cycle1
Cycle2
Cycle5
Cycle6
2 ALUs (+/-), 2 Multipliers
Cycle4
Cycle3
Cycle7
List Scheduling (Min Latency) Your turn (2 ALUs, 1 Mult)
Steps (will be on test) 1) Label nodes with priority 2) Update candidate list for each cycle 3) Redraw graph to show schedule
+ * * -
+
+
1 2 3
7
4
9-
+ +
*
5 6
8
-10
11
List Scheduling (Min Latency)
Your turn (2 ALUs, 1 Mult, Mults take 2 cycles)
a b c d e f g
+ + * *
*
+
1 2 3
5
4
7
Minimum-Resource, Latency-Constrained
Note that if no resource constraints given, schedule determines number of required resources Max # of each resource type used in a single cycle
+ +
*
a b c d
+
-
e f g
Cycle1
Cycle2
Cycle3
+Cycle4
*
3 ALUs
2 Mults
Minimum-Resource, Latency-Constrained
Minimum-Resource Latency-Constrained Scheduling: For all schedules that have latency less than the
constraint, find the one that uses the fewest resources
+ +
*
a b c d
+
-
e f g
Cycle1
Cycle2
Cycle3+Cycle4
*+ +
*
a b c d
+
-
e f g
Cycle1
Cycle2
Cycle3
+Cycle4
*
3 ALUs, 2 Mult 2 ALUs, 1 Mult
Latency Constraint <= 4 Latency Constraint <= 4
Minimum-Resource, Latency-Constrained
List scheduling (Minimum resource version) Basic Idea
1) Compute latest start times for each op using ALAP with specified latency constraint
Latest start times must include multicycle operations 2) For each resource type
3) Determine candidate nodes 4) Compute slack for each candidate
Slack = current cycle - latest possible cycle 5) Schedule ops with 0 slack
Update required number of resources (assume 1 of each to start with)
6) Schedule ops that require no extra resources 7) Repeat from 2) until all nodes scheduled
Minimum-Resource, Latency-Constrained
1) Find ALAP schedulea b c d e f g
* + +
**
-
j k
-1 2 3 4
5 6
7
Node LPC
1 1
2 1
3 1
4 3
5 2
6 2
7 3
Last Possible Cycle
a b c d e f g
* + +
**
-
j k
-
1 2 3
45 6
7
Cycle1
Cycle2
Cycle3
Defines last possible cycle for each operation
Latency Constraint = 3 cycles
Minimum-Resource, Latency-Constrained
2) For each resource type 3) Determine candidate nodes C 4) Compute slack for each candidate
Slack = current cycle - latest possible cycle
Node LPC
1 1
2 1
3 1
4 3
5 2
6 2
7 3
SlackCandidates = {1,2,3,4} Cycle
0
0
0
2
Cycle 1
a b c d e f g
* + +
**
-
j k
-1 2 3 4
5 6
7
Initial Resources = 1 Mult, 1 ALU
Minimum-Resource, Latency-Constrained
5)Schedule ops with 0 slack Update required number of resources
6) Schedule ops that require no extra resources
Node LPC
1 1
2 1
3 1
4 3
5 2
6 2
7 3
SlackCandidates = {1,2,3,4}
Cycle
0 1
0 1
0 1
2 X
Resources = 1 Mult, 2 ALU
4 requires 1 more ALU - not scheduled
Cycle 1
a b c d e f g
* + +
**
-
j k
-1 2 3 4
5 6
7
Minimum-Resource, Latency-Constrained
2)For each resource type 3) Determine candidate nodes C 4) Compute slack for each candidate
Slack = current cycle - latest possible cycle
Node LPC
1 1
2 1
3 1
4 3
5 2
6 2
7 3
SlackCandidates = {4,5,6}
Cycle
1
1
1
1
0
0
Resources = 1 Mult, 2 ALU
Cycle 2
a b c d e f g
* + +
**
-
j k
-1 2 3 4
5 6
7
Minimum-Resource, Latency-Constrained
5)Schedule ops with 0 slack Update required number of resources
6) Schedule ops that require no extra resources
Node LPC
1 1
2 1
3 1
4 3
5 2
6 2
7 3
SlackCandidates = {4,5,6}
Cycle
1
1
1
1 2
0 2
0 2
Resources = 2 Mult, 2 ALU
Cycle 2
Already 1 ALU - 4 can be scheduled
a b c d e f g
* + +
**
-
j k
-1 2 3 4
5 6
7
Minimum-Resource, Latency-Constrained
2)For each resource type 3) Determine candidate nodes C 4) Compute slack for each candidate
Slack = current cycle - latest possible cycle
Node LPC
1 1
2 1
3 1
4 3
5 2
6 2
7 3
SlackCandidates = {7}
Cycle
1
1
1
2
2
2
0
Resources = 2 Mult, 2 ALU
Cycle 3
a b c d e f g
* + +
**
-
j k
-1 2 3 4
5 6
7
Minimum-Resource, Latency-Constrained
Final Schedule
Node LPC
1 1
2 1
3 1
4 3
5 2
6 2
7 3
Slack Cycle
1
1
1
2
2
2
3
a b c d e f g
* + +
**
-
j k
-1 2 3
45 6
7
Cycle1
Cycle2
Cycle3
Required Resources = 2 Mult, 2 ALU
Other extensions Chaining
Multiple operations in a single cycle
Pipelining Input: DFG, data delivery rate For fully pipelined circuit, must have one resource
per operation (remember systolic arrays)
a b c d
+ +
+
-
e f
/Multiple adds may be faster than 1 divide - perform adds in one cycle
Summary Scheduling assigns each operation in a DFG a
start time Done for each DFG in the CDFG
Different Types Minimum Latency
ASAP, ALAP Latency-constrained
ASAP, ALAP Minimum-latency, resource-constrained
Hu’s Algorithm List Scheduling
Minimum-resource, latency-constrained List Scheduling
High-level Synthesis: Binding/Resource Sharing
Binding
During scheduling, we determined: When ops will execute How many resources are needed
We still need to decide which ops execute on which resources => Binding If multiple ops use the same resource
=>Resource Sharing
Binding
Basic Idea - Map operations onto resources such that operations in same cycle don’t use same resource
* + +
**
+
-
-
1 2 3
45 6
78
Cycle1
Cycle2
Cycle3
Cycle4
2 ALUs (+/-), 2 Multipliers
Mult1 ALU1 ALU2 Mult2
Binding
Many possibilities Bad binding may increase resources, require huge
steering logic, reduce clock, etc.
* + +
**
+
-
-
1 2 3
45 6
78
Cycle1
Cycle2
Cycle3
Cycle4
2 ALUs (+/-), 2 Multipliers
Mult1 ALU1 ALU2Mult2
Binding
Can’t do this 1 resource can’t perform multiple ops
simultaneously!
* + +
**
+
-
-
1 2 3
45 6
78
Cycle1
Cycle2
Cycle3
Cycle4
2 ALUs (+/-), 2 Multipliers
Binding How to automate?
More graph theory Compatibility Graph
Each node is an operation Edges represent compatible operations
Compatible - if two ops can share a resource I.e. Ops that use same type of resource (ALU, etc.)
and are scheduled to different cycles
Compatibility Graph
* + +
**
+
-
-
1 2 3
45 6
78
Cycle1
Cycle2
Cycle3
Cycle4
2 8
7 4
3
1 6
5
ALUs Mults
5 and 6 not compatible (same cycle)2 and 3 not
compatible (same cycle)
Compatibility Graph
* + +
**
+
-
-
1 2 3
45 6
78
Cycle1
Cycle2
Cycle3
Cycle4
2 8
7 4
3
1 6
5
ALUs Mults
Note - Fully connected subgraphs can share a resource (all involved nodes are compatible)
Compatibility Graph
* + +
**
+
-
-
1 2 3
45 6
78
Cycle1
Cycle2
Cycle3
Cycle4
2 8
7 4
3
1 6
5
ALUs Mults
Note - Fully connected subgraphs can share a resource (all involved nodes are compatible)
Compatibility Graph
* + +
**
+
-
-
1 2 3
45 6
78
Cycle1
Cycle2
Cycle3
Cycle4
2 8
7 4
3
1 6
5
ALUs Mults
Note - Fully connected subgraphs can share a resource (all involved nodes are compatible)
Compatibility Graph Binding: Find minimum number of fully
connected subgraphs that cover entire graph Well-known problem: Clique partitioning (NP-
complete)
Cliques = { {2,8,7,4},{3},{1,5},{6} } ALU1 executes 2,8,7,4 ALU2 executes 3 MULT1 executes 1,5 MULT2 executes 6
2 8
7 4
3
1 6
5
Compatibility Graph
* + +
**
+
-
-
1 2 3
45 6
78
Cycle1
Cycle2
Cycle3
Cycle4
2 8
7 4
3
1 6
5
ALUs Mults
Final Binding:
Compatibility Graph
* + +
**
+
-
-
1 2 3
45 6
78
Cycle1
Cycle2
Cycle3
Cycle4
2 8
7 4
3
1 6
5
ALUs Mults
Alternative Final Binding:
Translation to Datapath
* + +
**
+
-
-
1 2 3
45 6
78
Cycle1
Cycle2
Cycle3
Cycle4
Mult(1,5) Mult(6)ALU(2,7,8,4) ALU(3)
Mux Mux
Reg
Mux Mux
a b c h
Reg RegReg
a b c d e f g h i
d e i g e f
1) Add resources and registers
2) Add mux for each input
3) Add input to left mux for each left input in DFG
4) Do same for right mux
5) If only 1 input, remove mux
Left Edge Algorithm Alternative to clique partitioning
Take scheduled DFG, rotate it 90 degrees
2 ALUs (+/-), 2 Multipliers
a b f g
*+ +
*
*
+
-
j k
*
1 2 3
4
5
6
78
Cycle1
Cycle2
Cycle5
Cycle6
Cycle4
Cycle3
Cycle7
c d e
Left Edge Algorithm
2 ALUs (+/-), 2 Multipliers*
++
*
*
+
-
*
12
3
4
5
6
7
8
Cyc
le1
Cyc
le2
Cyc
le5
Cyc
le6
Cyc
le4
Cyc
le3
Cyc
le7
1) Initialize right_edge to 0
2) Find a node N whose left edge is >= right_edge
3) Bind N to a particular resource
4) Update right_edge to the right edge of N
5) Repeat from 2) for nodes using the same resource type until right_edge passes all nodes
6) Repeat from 1) until all nodes bound
right_edge
Left Edge Algorithm
2 ALUs (+/-), 2 Multipliers*
++
*
*
+
-
*
12
3
4
5
6
7
8
Cyc
le1
Cyc
le2
Cyc
le5
Cyc
le6
Cyc
le4
Cyc
le3
Cyc
le7
1) Initialize right_edge to 0
2) Find a node N whose left edge is >= right_edge
3) Bind N to a particular resource
4) Update right_edge to the right edge of N
5) Repeat from 2) for nodes using the same resource type until right_edge passes all nodes
6) Repeat from 1) until all nodes bound
right_edge
Left Edge Algorithm
2 ALUs (+/-), 2 Multipliers*
++
*
*
+
-
*
12
3
4
5
6
7
8
Cyc
le1
Cyc
le2
Cyc
le5
Cyc
le6
Cyc
le4
Cyc
le3
Cyc
le7
1) Initialize right_edge to 0
2) Find a node N whose left edge is >= right_edge
3) Bind N to a particular resource
4) Update right_edge to the right edge of N
5) Repeat from 2) for nodes using the same resource type until right_edge passes all nodes
6) Repeat from 1) until all nodes bound
right_edge
Left Edge Algorithm
2 ALUs (+/-), 2 Multipliers*
++
*
*
+
-
*
12
3
4
5
6
7
8
Cyc
le1
Cyc
le2
Cyc
le5
Cyc
le6
Cyc
le4
Cyc
le3
Cyc
le7
1) Initialize right_edge to 0
2) Find a node N whose left edge is >= right_edge
3) Bind N to a particular resource
4) Update right_edge to the right edge of N
5) Repeat from 2) for nodes using the same resource type until right_edge passes all nodes
6) Repeat from 1) until all nodes bound
right_edge
Left Edge Algorithm
2 ALUs (+/-), 2 Multipliers*
++
*
*
+
-
*
12
3
4
5
6
7
8
Cyc
le1
Cyc
le2
Cyc
le5
Cyc
le6
Cyc
le4
Cyc
le3
Cyc
le7
1) Initialize right_edge to 0
2) Find a node N whose left edge is >= right_edge
3) Bind N to a particular resource
4) Update right_edge to the right edge of N
5) Repeat from 2) for nodes using the same resource type until right_edge passes all nodes
6) Repeat from 1) until all nodes bound
right_edge
Left Edge Algorithm
2 ALUs (+/-), 2 Multipliers*
++
*
*
+
-
*
12
3
4
5
6
7
8
Cyc
le1
Cyc
le2
Cyc
le5
Cyc
le6
Cyc
le4
Cyc
le3
Cyc
le7
1) Initialize right_edge to 0
2) Find a node N whose left edge is >= right_edge
3) Bind N to a particular resource
4) Update right_edge to the right edge of N
5) Repeat from 2) for nodes using the same resource type until right_edge passes all nodes
6) Repeat from 1) until all nodes bound
right_edge
Left Edge Algorithm
2 ALUs (+/-), 2 Multipliers*
++
*
*
+
-
*
12
3
4
5
6
7
8
Cyc
le1
Cyc
le2
Cyc
le5
Cyc
le6
Cyc
le4
Cyc
le3
Cyc
le7
1) Initialize right_edge to 0
2) Find a node N whose left edge is >= right_edge
3) Bind N to a particular resource
4) Update right_edge to the right edge of N
5) Repeat from 2) for nodes using the same resource type until right_edge passes all nodes
6) Repeat from 1) until all nodes bound
right_edge
Left Edge Algorithm
2 ALUs (+/-), 2 Multipliers*
++
*
*
+
-
*
12
3
4
5
6
7
8
Cyc
le1
Cyc
le2
Cyc
le5
Cyc
le6
Cyc
le4
Cyc
le3
Cyc
le7
1) Initialize right_edge to 0
2) Find a node N whose left edge is >= right_edge
3) Bind N to a particular resource
4) Update right_edge to the right edge of N
5) Repeat from 2) for nodes using the same resource type until right_edge passes all nodes
6) Repeat from 1) until all nodes bound
right_edge
Left Edge Algorithm
2 ALUs (+/-), 2 Multipliers*
++
*
*
+
-
*
12
3
4
5
6
7
8
Cyc
le1
Cyc
le2
Cyc
le5
Cyc
le6
Cyc
le4
Cyc
le3
Cyc
le7
1) Initialize right_edge to 0
2) Find a node N whose left edge is >= right_edge
3) Bind N to a particular resource
4) Update right_edge to the right edge of N
5) Repeat from 2) for nodes using the same resource type until right_edge passes all nodes
6) Repeat from 1) until all nodes bound
right_edge
Extensions Algorithms presented so far find a valid
binding But, do not consider amount of steering logic
required Different bindings can require significantly different
# of muxes One solution
Extend compatibility graph Use weighted edges/nodes - cost function representing
steering logic Perform clique partitioning, finding the set of cliques that
minimize weight
Binding Summary
Binding maps operations onto physical resources Determines sharing among resources
Binding may greatly affect steering logic Trivial for fully-pipelined circuits
1 resource per operation Straightforward translation from bound
DFG to datapath
High-level Synthesis:Summary
Main Steps Front-end (lexing/parsing) converts code into intermediate
representation We looked at CDFG
Scheduling assigns a start time for each operation in DFG CFG node start times defined by control dependencies Resource allocation determined by schedule
Binding maps scheduled operations onto physical resources Determines how resources are shared
Big picture: Scheduled/Bound DFG can be translated into a datapath CFG can be translated to a controller => High-level synthesis can create a custom circuit for any CDFG!
Limitations Pipeline parallelism
Discussed techniques only exploit arithmetic and bit level parallelism
Not much potential for speedup How to create pipelined circuit?
Difficult for general code Must try to transform into known “templates”
i.e. if conversion, loop fusion, etc. Not always possible/practical
Existing tools Academic
ROCCC - Riverside Optimizing Compiler for Configurable Computing SPARK
Commercial Catapult C (Mentor Graphics) Celoxica DK Design Suite
Limitations Task-level parallelism
Parallelism in CDFG limited to individual control states
Can’t have multiple states executing concurrently Potential solution: use model other than CDFG
Kahn Process Networks Nodes represents parallel processes/tasks Edges represent communication between processes
Discussed techniques can create a controller+datapath for each process
Must also consider communication buffers Challenge:
Most high-level code does not have explicit parallelism Difficult/impossible to extract task-level parallelism from code
Limitations Coding practices limit circuit performance
Very often, languages contain constructs not appropriate for circuit implementation
Recursion, pointers, virtual functions, etc.
Potential solution: use specialized languages Remove problematic constructs, add task-level
parallelism Challenges:
Difficult to learn new languages Many designers resist changes to tool flow
Limitations Expert designers can achieve better circuits
High-level synthesis has to work with specification in code
Can be difficult to automatically create efficient pipeline May require dozens of optimizations applied in a
particular order Expert designer can transform algorithm
Synthesis can transform code, but can’t change algorithm
Potential Solution: ??? New language? New methodology? New tools?