High-Level Synthesis: Creating Custom Circuits from High-Level Code

Greg Stitt

ECE Department

University of Florida

Existing FPGA Tool Flow

Register-transfer (RT) synthesis
  Specify RT structure (muxes, registers, etc.)
  + Allows precise specification
  - But time consuming, difficult, error prone

[Figure: tool flow. HDL -> RT Synthesis -> Netlist -> Physical Design (Technology Mapping, Placement, Routing) -> Bitfile -> Processor / FPGA]

Future FPGA Tool Flow?

[Figure: same flow with a new first stage. C/C++, Java, etc. -> High-level Synthesis -> HDL -> RT Synthesis -> Netlist -> Physical Design (Technology Mapping, Placement, Routing) -> Bitfile -> Processor / FPGA]

High-level Synthesis

Wouldn’t it be nice to write high-level code?
  Ratio of C to VHDL developers (10000:1 ?)
  + Easier to specify
  + Separates function from architecture
  + More portable
  - Hardware potentially slower

Similar to the assembly code era
  Programmers could always beat the compiler
  But that is no longer the case

Hopefully, high-level synthesis will catch up to manual effort

High-level Synthesis

More challenging than compilation
  Compilation maps behavior into assembly instructions
    Architecture is known to the compiler
  High-level synthesis creates a custom architecture to execute behavior
    Huge hardware exploration space
    Best solution may include microprocessors
  Should handle any high-level code
    Not all code is appropriate for hardware

High-level Synthesis

First, consider how to manually convert high-level code into a circuit

  acc = 0;
  for (i=0; i < 128; i++)
    acc += a[i];

Steps
  1) Build FSM for controller
  2) Build datapath based on FSM

Manual Example

Build an FSM (controller)
  Decompose code into states

  acc = 0;
  for (i=0; i < 128; i++)
    acc += a[i];

[Figure: FSM. acc=0, i=0 -> if (i < 128) -> load a[i] -> acc += a[i] -> i++ -> back to if (i < 128); when the test fails -> Done]
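The FSM above can be sketched in software as a state variable and a switch. This is only an illustration of the state decomposition, not a circuit description; the state names and the `data` register are assumptions introduced here.

```c
/* Controller states, named after the boxes of the FSM (names are hypothetical). */
enum state { S_INIT, S_TEST, S_LOAD, S_ADD, S_INC, S_DONE };

/* Runs the accumulate loop one state per step until S_DONE is reached. */
int accumulate_fsm(const int *a, int n)
{
    enum state s = S_INIT;
    int acc = 0, i = 0;
    int data = 0;                 /* models the value read from memory */

    while (s != S_DONE) {
        switch (s) {
        case S_INIT: acc = 0; i = 0;                 s = S_TEST; break;
        case S_TEST: s = (i < n) ? S_LOAD : S_DONE;              break;
        case S_LOAD: data = a[i];                    s = S_ADD;  break;
        case S_ADD:  acc += data;                    s = S_INC;  break;
        case S_INC:  i++;                            s = S_TEST; break;
        case S_DONE:                                             break;
        }
    }
    return acc;
}
```

Calling `accumulate_fsm(a, 128)` gives the same result as the original loop. Each `case` corresponds to one controller state, which is what the datapath is built around on the following slides.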

Manual Example

Build a datapath
  Allocate resources for each state

  acc = 0;
  for (i=0; i < 128; i++)
    acc += a[i];

[Figure: the FSM beside the allocated resources: registers acc, i, addr, and a[i]; a comparator (<) against the constant 128; three adders (+), two of them with a constant 1 input]

Manual Example

Build a datapath
  Determine register inputs

[Figure: the same datapath with 2x1 muxes added in front of the registers; the constant mux inputs are 0, 0, and &a, and a[i] is loaded from memory ("In from memory")]

Manual Example

Build a datapath
  Add outputs

[Figure: the same datapath with two outputs added: Memory address and acc]

Manual Example

Build a datapath
  Add control signals

[Figure: the same datapath (outputs Memory address and acc) with control signals added to the muxes and registers]

Manual Example

Combine controller+datapath

[Figure: the Controller block connected to the datapath; the circuit outputs are Memory address, acc, Done, and Memory Read]

Manual Example

Alternatives
  Use one adder (plus muxes)

[Figure: the datapath rebuilt around a single adder (+) whose two inputs are selected by muxes, instead of three separate adders]

Manual Example

Comparison with high-level synthesis
  Determining when to perform each operation => Scheduling
  Allocating resources for each operation => Resource allocation
  Mapping operations onto resources => Binding

Another Example

Your turn

  x = 0;
  for (i=0; i < 100; i++) {
    if (a[i] > 0)
      x++;
    else
      x--;
    a[i] = x;
  }
  //output x

Steps
  1) Build FSM (do not perform if conversion)
  2) Build datapath based on FSM

High-Level Synthesis

High-level Code -> High-Level Synthesis -> Custom Circuit

High-level code: could be C, C++, Java, Perl, Python, SystemC, ImpulseC, etc.
Custom circuit: usually an RT VHDL description, but could be as low level as a bit file

[Figure: the accumulate loop (acc = 0; for (i=0; i < 128; i++) acc += a[i];) on one side and the controller + datapath from the manual example on the other]

Main Steps

High-level Code
  -> Syntactic Analysis (front-end)
       Converts code to an intermediate representation, which allows all following steps to use a language-independent format
Intermediate Representation
  -> Optimization
  -> Scheduling/Resource Allocation (back-end)
       Determines when each operation will execute, and the resources used
  -> Binding/Resource Sharing (back-end)
       Maps operations onto physical resources
Controller + Datapath

Syntactic Analysis

Definition: Analysis of code to verify syntactic correctness
  Converts code into intermediate representation

2 steps
  1) Lexical analysis (Lexing)
  2) Parsing

[Figure: Syntactic Analysis takes High-level Code through Lexical Analysis and then Parsing]

Lexical Analysis

Lexical analysis (lexing) breaks code into a series of defined tokens
  Token: defined language constructs

  x = 0;
  if (y < z)
    x = 1;

Lexical analysis produces:

  ID(x), ASSIGN, INT(0), SEMICOLON, IF, LPAREN, ID(y), LT, ID(z), RPAREN, ID(x), ASSIGN, INT(1), SEMICOLON
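A hand-written lexer for just the tokens in this example might look like the sketch below. The function name `lex`, the output format (token names joined by ", "), and the buffer sizes are assumptions made for illustration; a real front end would more likely be generated by a tool.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Writes the token names for src into out, separated by ", ".
   Returns 0 on success, -1 on an unknown character or a full buffer.
   Only the tokens used in the slide example are recognized. */
int lex(const char *src, char *out, size_t cap)
{
    size_t len = 0;
    if (cap == 0) return -1;
    out[0] = '\0';
    while (*src) {
        char tok[64];
        if (isspace((unsigned char)*src)) { src++; continue; }
        if (isdigit((unsigned char)*src)) {
            int v = 0;
            while (isdigit((unsigned char)*src)) v = v * 10 + (*src++ - '0');
            snprintf(tok, sizeof tok, "INT(%d)", v);
        } else if (isalpha((unsigned char)*src)) {
            char id[32];
            size_t n = 0;
            while (isalnum((unsigned char)*src)) {
                if (n < sizeof id - 1) id[n++] = *src;   /* long names are truncated */
                src++;
            }
            id[n] = '\0';
            if (strcmp(id, "if") == 0) snprintf(tok, sizeof tok, "IF");
            else snprintf(tok, sizeof tok, "ID(%s)", id);
        } else {
            const char *name;
            switch (*src) {
            case '=': name = "ASSIGN";    break;
            case ';': name = "SEMICOLON"; break;
            case '(': name = "LPAREN";    break;
            case ')': name = "RPAREN";    break;
            case '<': name = "LT";        break;
            default:  return -1;          /* character outside this tiny language */
            }
            snprintf(tok, sizeof tok, "%s", name);
            src++;
        }
        int w = snprintf(out + len, cap - len, "%s%s", len ? ", " : "", tok);
        if (w < 0 || (size_t)w >= cap - len) return -1;
        len += (size_t)w;
    }
    return 0;
}
```

Running `lex` on `x = 0;if (y < z) x = 1;` with a 256-byte buffer yields the token sequence listed on the slide.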

Lexing Tools

Define tokens using regular expressions; the tool outputs C code that lexes the input
  Common tool is “lex”

/* braces and parentheses */
"[" { YYPRINT; return LBRACE; }
"]" { YYPRINT; return RBRACE; }
"," { YYPRINT; return COMMA; }
";" { YYPRINT; return SEMICOLON; }
"!" { YYPRINT; return EXCLAMATION; }
"{" { YYPRINT; return LBRACKET; }
"}" { YYPRINT; return RBRACKET; }
"-" { YYPRINT; return MINUS; }

/* integers */
[0-9]+ { yylval.intVal = atoi( yytext ); return INT; }

Parsing

Analysis of token sequence to determine correct grammatical structure
  Languages defined by context-free grammar

Grammar

  Program = Exp
  Exp     = Stmt SEMICOLON |
            IF LPAREN Cond RPAREN Exp |
            Exp Exp
  Cond    = ID Comp ID
  Stmt    = ID ASSIGN INT
  Comp    = LT | NE

Correct Programs

  x = 0;
  if (y < z) x = 1;

  x = 0; x = 0; y = 1;

  if (a < b) x = 10;

  if (var1 != var2) x = 10;

  x = 0;
  if (y < z) x = 1; y = 5; t = 1;

Parsing

Grammar

  Program = Exp
  Exp     = S SEMICOLON |
            IF LPAREN Cond RPAREN Exp |
            Exp Exp
  Cond    = ID Comp ID
  S       = ID ASSIGN INT
  Comp    = LT | NE

Incorrect Programs

  x = y;

  x = 3 + 5;

  x = 5;;

  if (x+5 > y) x = 2;

  x = 5

Parsing Tools

Define grammar in special language
  Automatically creates parser based on grammar
Popular tool is “yacc” (yet-another-compiler-compiler)

program: functions { $$ = $1; } ;

functions: function { $$ = $1; }
         | functions function { $$ = $1; }
         ;

function: HEXNUMBER LABEL COLON code { $$ = $2; } ;


Parser converts tokens to intermediate representation
  Usually, an abstract syntax tree

  x = 0;
  if (y < z)
    x = 1;
  d = 6;

[Figure: abstract syntax tree with three top-level children: assign (x, 0); if (cond: y < z; body: assign (x, 1)); assign (d, 6)]

Intermediate Representation

Why use intermediate representation?
  Easier to analyze/optimize than source code
  Theoretically can be used for all languages
    Makes synthesis back end language independent

[Figure: C Code, Java, and Perl each pass through their own Syntactic Analysis stage; all three feed a single Back End]

Back End: scheduling, resource allocation, binding, independent of source language (sometimes optimizations too)


Different Types
  Abstract Syntax Tree
  Control/Data Flow Graph (CDFG)
  Sequencing Graph
  Etc.

We will focus on CDFG
  Combines control flow graph (CFG) and data flow graph (DFG)

Control flow graphs

CFG
  Represents control flow dependencies of basic blocks
  Basic block is a section of code that always executes from beginning to end
    I.e., no jumps into or out of the block

  acc = 0;
  for (i=0; i < 128; i++)
    acc += a[i];

[Figure: CFG. acc=0, i=0 -> if (i < 128) -> acc += a[i]; i++ -> back to if (i < 128); when the test fails -> Done]

Control flow graphs

Your turn
  Create a CFG for this code

  i = 0;
  while (j < 10) {
    if (x < 5)
      y = 2;
    else if (z < 10)
      y = 6;
  }

Data Flow Graphs

DFG
  Represents data dependencies between operations

  x = a+b;
  y = c*d;
  z = x - y;

[Figure: DFG. a and b feed +, producing x; c and d feed *, producing y; x and y feed -, producing z]

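Control/Data Flow Graph

Combines CFG and DFG
  Maintains DFG for each node of CFG

  acc = 0;
  for (i=0; i < 128; i++)
    acc += a[i];

[Figure: the CFG (acc=0; i=0; -> if (i < 128) -> acc += a[i]; i++ -> Done) with a DFG attached to each node: 0 -> acc and 0 -> i for the first block; acc + a[i] -> acc and i + 1 -> i for the loop body]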

High-Level Synthesis: Optimization

Synthesis Optimizations

After creating CDFG, high-level synthesis optimizes graph

Goals
  Reduce area
  Improve latency
  Increase parallelism
  Reduce power/energy

2 types
  Data flow optimizations
  Control flow optimizations

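Data Flow Optimizations

Tree-height reduction
  Generally made possible from commutativity, associativity, and distributivity

[Figure: two before/after DFG pairs over inputs a, b, c, d. In the first, a chain of three additions is rebalanced into two additions feeding a third. In the second, a tree containing two + and one * is restructured so that a + and the * sit side by side and feed a final +]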


Operator Strength Reduction
  Replacing an expensive (“strong”) operation with a faster one
  Common example: replacing multiply/divide with shift

  Before (1 multiplication)      After (0 multiplications)

  b[i] = a[i] * 8;               b[i] = a[i] << 3;

  a = b * 5;                     c = b << 2;
                                 a = b + c;

  a = b * 13;                    c = b << 2;
                                 d = b << 3;
                                 a = c + d + b;
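The shift-and-add rewrites above can be checked directly. The sketch below uses hypothetical helper names (`mul5`, `mul13`) and unsigned operands, an assumption made here so that the left shifts are well defined in C.

```c
/* b * 5 rewritten as 4b + b. */
unsigned mul5(unsigned b)
{
    unsigned c = b << 2;          /* 4b */
    return b + c;
}

/* b * 13 rewritten as 4b + 8b + b. */
unsigned mul13(unsigned b)
{
    unsigned c = b << 2;          /* 4b */
    unsigned d = b << 3;          /* 8b */
    return c + d + b;
}
```

In hardware a shift by a constant is just wiring, so each rewrite trades a multiplier for one or two adders.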


Constant propagation
  Statically evaluate expressions with constants

  Before            After

  x = 0;            x = 0;
  y = x * 15;       y = 0;
  z = y + 10;       z = 10;


Function Specialization
  Create specialized code for common inputs
    Treat common inputs as constants
    If inputs are not known statically, must include an if statement for each call to the specialized function

  Before:

  int f (int x) {
    int y = x * 15;
    return y + 10;
  }

  for (i=0; i < 1000; i++) {
    f(0);
    …
  }

  After (treat frequent input as a constant):

  int f_opt () {
    return 10;
  }

  for (i=0; i < 1000; i++) {
    f_opt();
    …
  }


Common sub-expression elimination
  If an expression appears more than once, repetitions can be replaced
  (valid as long as x, y, and a are not modified in between)

  Before:

  a = x + y;
  . . .
  b = c * 25 + x + y;

  After (x + y already determined):

  a = x + y;
  . . .
  b = c * 25 + a;


Dead code elimination
  Remove code that is never executed
    May seem like stupid code, but often comes from constant propagation or function specialization

  int f (int x) {
    if (x > 0)
      a = b * 15;
    else
      a = b / 4;
    return a;
  }

  int f_opt () {
    a = b * 15;
    return a;
  }

  Specialized version for x > 0 does not need the else branch (“dead code”)


Code motion (hoisting/sinking)
  Avoid repeated computation

  Before:

  for (i=0; i < 100; i++) {
    z = x + y;
    b[i] = a[i] + z;
  }

  After:

  z = x + y;
  for (i=0; i < 100; i++) {
    b[i] = a[i] + z;
  }

Control Flow Optimizations

Loop Unrolling
  Replicate body of loop
    May increase parallelism

  Before:

  for (i=0; i < 128; i++)
    a[i] = b[i] + c[i];

  After:

  for (i=0; i < 128; i+=2) {
    a[i] = b[i] + c[i];
    a[i+1] = b[i+1] + c[i+1];
  }
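As a quick check that unrolling by two preserves behavior, the two loop forms can be wrapped in functions and compared. The function names and the `N` macro are assumptions for this sketch; it relies on the trip count (128) being even.

```c
#define N 128   /* trip count from the slide; must be even for the unrolled form */

/* Original loop: one addition per iteration. */
void add_rolled(int *a, const int *b, const int *c)
{
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];
}

/* Unrolled by two: the two additions in the body are independent,
   so a synthesized circuit could perform them in the same cycle. */
void add_unrolled2(int *a, const int *b, const int *c)
{
    for (int i = 0; i < N; i += 2) {
        a[i]     = b[i]     + c[i];
        a[i + 1] = b[i + 1] + c[i + 1];
    }
}
```

Whether the extra parallelism is realized depends on the memory: both additions need their operands in the same cycle, so the arrays must supply two elements at a time.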


Function Inlining
  Replace function call with body of function
    Common for both SW and HW
    SW - Eliminates function call instructions
    HW - Eliminates unnecessary control states

  Before:

  for (i=0; i < 128; i++)
    a[i] = f( b[i], c[i] );
  . . . .
  int f (int a, int b) {
    return a + b * 15;
  }

  After:

  for (i=0; i < 128; i++)
    a[i] = b[i] + c[i] * 15;


Conditional Expansion
  Replace if with logic expression
    Execute if/else bodies in parallel

  y = ab
  if (a) x = b+d
  else   x = bd

  becomes

  y = ab
  x = a(b+d) + a’bd

  Can be further optimized to:

  y = ab
  x = y + d(a+b)

  [DeMicheli]
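The three forms of x on this slide (the if/else, the expanded expression a(b+d) + a’bd, and the optimized y + d(a+b)) can be compared exhaustively, since there are only eight input combinations. The function names below are hypothetical, and each input is assumed to be 0 or 1.

```c
/* Original control flow: x chosen by an if/else on a. */
int x_branch(int a, int b, int d)
{
    if (a) return b | d;
    else   return b & d;
}

/* After conditional expansion: x = a(b+d) + a'bd. */
int x_expanded(int a, int b, int d)
{
    return (a & (b | d)) | (!a & b & d);
}

/* Further optimized: x = y + d(a+b), reusing y = ab. */
int x_optimized(int a, int b, int d)
{
    int y = a & b;
    return y | (d & (a | b));
}
```

The optimized form follows because ab + ad + a’bd = ab + d(a + a’b) = ab + d(a + b).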

Example

Optimize this

  x = 0;
  y = a + b;
  if (x < 15)
    z = a + b - c;
  else
    z = x + 12;
  output = z * 12;

High-Level Synthesis: Scheduling/Resource Allocation

Scheduling

Scheduling assigns a start time to each operation in DFG
  Start times must not violate dependencies in DFG
  Start times must meet performance constraints
    Alternatively, resource constraints

Performed on the DFG of each CFG node
  => Can’t execute multiple CFG nodes in parallel

Examples

[Figure: example schedules of three-addition DFGs over inputs a, b, c, d, with each addition assigned to a cycle; the schedules shown use between two and three cycles (Cycle1 to Cycle3)]

Scheduling Problems

Several types of scheduling problems
  Usually some combination of performance and resource constraints

Problems:
  Unconstrained
    Not very useful, every schedule is valid
  Minimum latency
  Latency constrained
  Minimum-latency, resource constrained
    i.e. find the schedule with the shortest latency that uses no more than a specified # of resources
    NP-Complete
  Minimum-resource, latency constrained
    i.e. find the schedule that meets the latency constraint (which may be anything), and uses the minimum # of resources
    NP-Complete

Minimum Latency Scheduling

ASAP (as soon as possible) algorithm
  Find a candidate node
    Candidate is a node whose predecessors have been scheduled and completed (or has no predecessors)
  Schedule node one cycle later than max cycle of its predecessors
  Repeat until all nodes scheduled

[Figure: ASAP schedule of a DFG over inputs a to h with operations +, +, *, *, -, <, + spread over Cycle1 to Cycle4]

Minimum possible latency - 4 cycles
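The ASAP steps can be sketched as follows. The edge-list representation (`struct dfg`) is an assumption made for this illustration, and the sketch assumes single-cycle operations and an acyclic graph.

```c
/* A DFG as an edge list: edge k goes from node src[k] to node dst[k]. */
struct dfg { int n, m; const int *src, *dst; };

/* ASAP: a node with no predecessors goes to cycle 1; any other node goes
   one cycle after the latest of its predecessors.
   Fills cycle[] and returns the latency (the largest cycle used). */
int asap(const struct dfg *g, int *cycle)
{
    int latency = 0, done = 0;
    for (int v = 0; v < g->n; v++) cycle[v] = 0;      /* 0 means unscheduled */

    while (done < g->n) {
        for (int v = 0; v < g->n; v++) {
            if (cycle[v]) continue;
            int ready = 1, start = 1;
            for (int e = 0; e < g->m; e++) {
                if (g->dst[e] != v) continue;
                int p = g->src[e];
                if (!cycle[p]) { ready = 0; break; }  /* predecessor not scheduled yet */
                if (cycle[p] + 1 > start) start = cycle[p] + 1;
            }
            if (ready) {                              /* v is a candidate */
                cycle[v] = start;
                done++;
                if (start > latency) latency = start;
            }
        }
    }
    return latency;
}
```

For a small hypothetical graph with edges 0->2, 1->2, 2->4, 3->4, nodes 0, 1, and 3 land in cycle 1, node 2 in cycle 2, and node 4 in cycle 3.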

Minimum Latency Scheduling

ALAP (as late as possible) algorithm
  Run ASAP, get minimum latency L
  Find a candidate
    Candidate is a node whose successors are scheduled (or has none)
  Schedule node one cycle before min cycle of its successors
    Nodes with no successors scheduled to cycle L
  Repeat until all nodes scheduled

[Figure: ALAP schedule of the same DFG (inputs a to h; operations +, +, *, *, -, <, +), filled in from Cycle4 back to Cycle1]

L = 4 cycles

Minimum Latency Scheduling

ALAP
  Has to run ASAP first, seems pointless
  But, many heuristics need the mobility/slack of each operation
    ASAP gives the earliest possible time for an operation
    ALAP gives the latest possible time for an operation
  Slack = difference between earliest and latest possible schedule
    Slack = 0 implies operation has to be done in the current scheduled cycle
    The larger the slack, the more options a heuristic has to schedule the operation
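A sketch of ALAP and the slack computation, under the same assumptions as before (single-cycle operations, acyclic graph, a hypothetical edge-list `struct dfg`). The ASAP cycles are taken as an input array here to keep the example short.

```c
#include <limits.h>

struct dfg { int n, m; const int *src, *dst; };   /* edge k: src[k] -> dst[k] */

#define UNSCHEDULED INT_MIN

/* ALAP for latency L: nodes with no successors go to cycle L; any other node
   goes one cycle before the earliest of its successors. */
void alap(const struct dfg *g, int L, int *cycle)
{
    int done = 0;
    for (int v = 0; v < g->n; v++) cycle[v] = UNSCHEDULED;

    while (done < g->n) {
        for (int v = g->n - 1; v >= 0; v--) {
            if (cycle[v] != UNSCHEDULED) continue;
            int ready = 1, start = L;
            for (int e = 0; e < g->m; e++) {
                if (g->src[e] != v) continue;
                int s = g->dst[e];
                if (cycle[s] == UNSCHEDULED) { ready = 0; break; }
                if (cycle[s] - 1 < start) start = cycle[s] - 1;
            }
            if (ready) { cycle[v] = start; done++; }
        }
    }
}

/* Slack (mobility) = latest possible cycle - earliest possible cycle. */
void slack(int n, const int *asap_cycle, const int *alap_cycle, int *out)
{
    for (int v = 0; v < n; v++)
        out[v] = alap_cycle[v] - asap_cycle[v];
}
```

On the hypothetical graph 0->2, 1->2, 2->4, 3->4 with L = 3, only node 3 has nonzero slack: ASAP places it in cycle 1 and ALAP in cycle 2.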

Latency-Constrained Scheduling

Instead of finding the minimum latency, find a schedule with latency no greater than L

Solutions:
  Use ASAP, verify that the minimum latency is no greater than L
  Use ALAP starting with cycle L instead of minimum latency (don’t need ASAP)

Scheduling with Resource Constraints

Schedule must use no more than the specified number of resources

Constraints: 1 ALU (+/-), 1 Multiplier

[Figure: two schedules of the same DFG over inputs a to g (operations +, +, *, +, -, +, *). The schedule shown with the constraints takes five cycles (Cycle1 to Cycle5); the other takes four cycles (Cycle1 to Cycle4)]


Minimum-Latency, Resource-Constrained Scheduling

Definition: Given resource constraints, find the schedule that has the minimum latency

Example:

[Figure: two schedules of the same DFG over inputs a to g, one finishing in Cycle6 and the other in Cycle5]

Different schedules may use the same resources, but have different latencies


Hu’s Algorithm
  Assumes one type of resource

Basic Idea
  Input: graph, # of resources r
  1) Label each node by max distance from output
       i.e. use path length as priority
  2) Determine C, the set of scheduling candidates
       Candidate if either no predecessors, or predecessors scheduled
  3) From C, schedule up to r nodes to current cycle, using label as priority
  4) Increment current cycle, repeat from 2) until all nodes scheduled
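The four steps of Hu’s algorithm can be sketched as below. The edge-list `struct dfg`, the `MAXN` bound, and the tie-breaking rule (lowest node index first) are assumptions of this sketch; it also assumes single-cycle operations, an acyclic graph, and r >= 1.

```c
#define MAXN 32                                   /* assumed upper bound on nodes */

struct dfg { int n, m; const int *src, *dst; };   /* edge k: src[k] -> dst[k] */

/* Step 1: label = length of the longest path from v to an output (outputs get 1). */
static int label_of(const struct dfg *g, int v)
{
    int best = 0;
    for (int e = 0; e < g->m; e++)
        if (g->src[e] == v) {
            int d = label_of(g, g->dst[e]);
            if (d > best) best = d;
        }
    return best + 1;
}

/* Steps 2-4: in each cycle, schedule up to r candidates, highest label first.
   Fills cycle[] and returns the latency. */
int hu_schedule(const struct dfg *g, int r, int *cycle)
{
    int label[MAXN], done = 0, cur = 0;
    for (int v = 0; v < g->n; v++) { label[v] = label_of(g, v); cycle[v] = 0; }

    while (done < g->n) {
        cur++;                                    /* step 4: next cycle */
        for (int slot = 0; slot < r; slot++) {
            int pick = -1;
            for (int v = 0; v < g->n; v++) {      /* step 2: find candidates */
                if (cycle[v]) continue;
                int ready = 1;
                for (int e = 0; e < g->m; e++)
                    if (g->dst[e] == v &&
                        (cycle[g->src[e]] == 0 || cycle[g->src[e]] == cur))
                        ready = 0;                /* predecessor not done before cur */
                if (ready && (pick < 0 || label[v] > label[pick])) pick = v;
            }
            if (pick < 0) break;                  /* no candidate left this cycle */
            cycle[pick] = cur;                    /* step 3 */
            done++;
        }
    }
    return cur;
}
```

On a hypothetical six-node graph (edges 0->3, 1->3, 2->4, 3->5, 4->5) with r = 2, three nodes carry label 3 but only two fit in cycle 1, so the schedule needs four cycles.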


Hu’s Algorithm Example (r = 3)

[Figure: DFG over inputs a to g, j, k with eight operations: +, +, *, +, -, +, *, -]

Step 1 - Label each node by max distance from output
  i.e. use path length as priority
  Node labels as drawn: 4, 4, 3, 2, 3, 2, 1, 1

Step 2 - Determine C, the set of scheduling candidates (Cycle 1)

Step 3 - From C, schedule up to r nodes to current cycle, using label as priority (Cycle 1)
  One candidate is not scheduled due to lower priority

Steps 2 and 3 are repeated for Cycle 2

Skipping to finish: the final schedule uses Cycle1 to Cycle4


Hu’s algorithm solves a simplified problem

Common Extensions:
  Multiple resource types
  Multi-cycle operation

[Figure: DFG over inputs a, b, c, d with operations +, -, /, * drawn across Cycle1 and Cycle2]


List Scheduling (minimum latency, resource-constrained version)

Extension for multiple resource types
Basic Idea - Hu’s algorithm for each resource type
  Input: graph, set of constraints R for each resource type
  1) Label nodes based on max distance to output
  2) For each resource type t
     3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors)
     4) Schedule up to Rt operations from C based on priority, to current cycle
          Rt is the constraint on resource type t
  5) Increment cycle, repeat from 2) until all nodes scheduled
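A sketch of minimum-latency list scheduling, extending the Hu’s-style loop with a per-node resource type. The `struct dfg` layout, the `MAXN` bound, and the numbering of resource types (0, 1, ...) are assumptions of this sketch, as are single-cycle operations and an acyclic graph.

```c
#define MAXN 32                                          /* assumed upper bound on nodes */

/* type[v] is the resource type of node v; edge k: src[k] -> dst[k]. */
struct dfg { int n, m; const int *src, *dst, *type; };

/* Priority label: longest path from v to an output (outputs get 1). */
static int label_of(const struct dfg *g, int v)
{
    int best = 0;
    for (int e = 0; e < g->m; e++)
        if (g->src[e] == v) {
            int d = label_of(g, g->dst[e]);
            if (d > best) best = d;
        }
    return best + 1;
}

/* limit[t] is Rt, the number of units of type t. Returns the latency. */
int list_schedule(const struct dfg *g, int ntypes, const int *limit, int *cycle)
{
    int label[MAXN], done = 0, cur = 0;
    for (int v = 0; v < g->n; v++) { label[v] = label_of(g, v); cycle[v] = 0; }

    while (done < g->n) {
        cur++;
        for (int t = 0; t < ntypes; t++)                 /* for each resource type */
            for (int slot = 0; slot < limit[t]; slot++) {
                int pick = -1;
                for (int v = 0; v < g->n; v++) {         /* candidates of type t */
                    if (cycle[v] || g->type[v] != t) continue;
                    int ready = 1;
                    for (int e = 0; e < g->m; e++)
                        if (g->dst[e] == v &&
                            (cycle[g->src[e]] == 0 || cycle[g->src[e]] == cur))
                            ready = 0;
                    if (ready && (pick < 0 || label[v] > label[pick])) pick = v;
                }
                if (pick < 0) break;
                cycle[pick] = cur;                       /* schedule by priority */
                done++;
            }
    }
    return cur;
}
```

With one ALU (type 0) and one multiplier (type 1) on a hypothetical six-node graph containing four ALU operations, the latency is bounded below by the four ALU operations, and the sketch reaches that bound.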


List scheduling - minimum latency

Example: 2 ALUs (+/-), 2 Multipliers

Step 1 - Label nodes based on max distance to output (not shown, so you can see operations)
  Nodes are given IDs for illustration purposes

[Figure: DFG over inputs a to g, j, k with eight operations numbered 1 to 8: 1 = *, 2 = +, 3 = +, 4 = -, 5 = *, 6 = *, 7 = +, 8 = -]

For each resource type t
  3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors)
  4) Schedule up to Rt operations from C based on priority, to current cycle
       Rt is the constraint on resource type t

Candidates

  Cycle   Mult   ALU
  1       1      2,3,4
  2       5,6    4
  3              7

  In cycle 1, node 4 is a candidate, but it is not scheduled due to low priority

Final schedule

  Cycle1: 1, 2, 3
  Cycle2: 4, 5, 6
  Cycle3: 7
  Cycle4: 8

Note - ASAP would require more resources
  ALAP wouldn’t here, but in general it would



List Scheduling: extension for multicycle operations

Same idea (the differences from the single-cycle version are in steps 1, 3, and 4)
  Input: graph, set of constraints R for each resource type
  1) Label nodes based on max cycle latency to output
  2) For each resource type t
     3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled and completed predecessors)
     4) Schedule up to (Rt - nt) operations from C based on priority, one cycle after predecessor
          Rt is the constraint on resource type t
          nt is the number of resources of type t in use from previous cycles
  Repeat from 2) until all nodes scheduled


Example:

[Figure: multicycle list-scheduling example. DFG over inputs a to g, j, k with eight operations numbered 1 to 8 (four *, two +, one -), scheduled across Cycle1 to Cycle7]

List Scheduling (Min Latency)

Your turn (2 ALUs, 1 Mult)

Steps (will be on test)
  1) Label nodes with priority
  2) Update candidate list for each cycle
  3) Redraw graph to show schedule

[Figure: exercise DFG with eleven operations numbered 1 to 11, a mix of +, -, and *]

List Scheduling (Min Latency)

Your turn (2 ALUs, 1 Mult, Mults take 2 cycles)

[Figure: exercise DFG over inputs a to g with numbered operations: +, +, *, *, *, +]

Minimum-Resource, Latency-Constrained

Note that if no resource constraints are given, the schedule determines the number of required resources
  Max # of each resource type used in a single cycle

[Figure: four-cycle schedule (Cycle1 to Cycle4) of a DFG over inputs a to g; it requires 3 ALUs and 2 Mults]


Minimum-Resource Latency-Constrained Scheduling:
  For all schedules that have latency no greater than the constraint, find the one that uses the fewest resources

[Figure: two four-cycle schedules of the same DFG over inputs a to g, both meeting Latency Constraint <= 4. The first requires 3 ALUs, 2 Mult; the second requires 2 ALUs, 1 Mult]


List scheduling (Minimum resource version)

Basic Idea
  1) Compute latest start times for each op using ALAP with specified latency constraint
       Latest start times must include multicycle operations
  2) For each resource type
     3) Determine candidate nodes
     4) Compute slack for each candidate
          Slack = latest possible cycle - current cycle
     5) Schedule ops with 0 slack
          Update required number of resources (assume 1 of each to start with)
     6) Schedule ops that require no extra resources
  7) Repeat from 2) until all nodes scheduled
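The minimum-resource version can be sketched as two passes per cycle: first the zero-slack candidates (which may raise the resource count), then any other candidate that fits in a unit already counted. The data layout, the `MAXT` bound, and the function names are assumptions of this sketch; the latest possible cycles (`lpc`) are taken as input from a prior ALAP run, and operations are single-cycle.

```c
#define MAXT 4                                           /* assumed max number of resource types */

struct dfg { int n, m; const int *src, *dst, *type; };   /* edge k: src[k] -> dst[k] */

/* A node is a candidate once all predecessors were scheduled in earlier cycles. */
static int ready(const struct dfg *g, int v, const int *cycle, int cur)
{
    for (int e = 0; e < g->m; e++)
        if (g->dst[e] == v && (cycle[g->src[e]] == 0 || cycle[g->src[e]] >= cur))
            return 0;
    return 1;
}

/* lpc[v]: latest possible cycle of v. need[t]: units of type t required (output).
   Fills cycle[] and returns the latency. */
int min_resource_schedule(const struct dfg *g, int ntypes, const int *lpc,
                          int *cycle, int *need)
{
    int done = 0, cur = 0;
    for (int t = 0; t < ntypes; t++) need[t] = 1;        /* assume 1 of each to start */
    for (int v = 0; v < g->n; v++) cycle[v] = 0;

    while (done < g->n) {
        int used[MAXT] = {0};
        cur++;
        /* pass 0: ops with no slack left; pass 1: ops that need no extra resource */
        for (int pass = 0; pass < 2; pass++)
            for (int v = 0; v < g->n; v++) {
                if (cycle[v] || !ready(g, v, cycle, cur)) continue;
                int t = g->type[v];
                int slack = lpc[v] - cur;
                if (pass == 0 && slack > 0) continue;
                if (pass == 1 && used[t] >= need[t]) continue;
                cycle[v] = cur;
                done++;
                if (++used[t] > need[t]) need[t] = used[t];
            }
    }
    return cur;
}
```

As a usage example, a hypothetical seven-node graph with edges 0->4, 1->4, 2->5, 4->6, 5->6, multipliers at nodes 0, 4, 5, and latest possible cycles {1,1,1,3,2,2,3} is scheduled in three cycles with two units of each type.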


List scheduling (minimum resource version): example

Latency Constraint = 3 cycles
Initial Resources = 1 Mult, 1 ALU

[Figure: DFG over inputs a to g, j, k with seven operations numbered 1 to 7: 1 = *, 2 = +, 3 = +, 4 = -, 5 = *, 6 = *, 7 = -]

1) Find ALAP schedule
   Defines the last possible cycle (LPC) for each operation

   Node   LPC
   1      1
   2      1
   3      1
   4      3
   5      2
   6      2
   7      3

Cycle 1
  2) For each resource type
     3) Determine candidate nodes C
          Candidates = {1,2,3,4}
     4) Compute slack for each candidate
          Slack = latest possible cycle - current cycle
          Slack: node 1 = 0, node 2 = 0, node 3 = 0, node 4 = 2
     5) Schedule ops with 0 slack, update required number of resources
          Nodes 1, 2, 3 scheduled to cycle 1
          Resources = 1 Mult, 2 ALU
     6) Schedule ops that require no extra resources
          4 requires 1 more ALU - not scheduled

Cycle 2
  Candidates = {4,5,6}
  Slack: node 4 = 1, node 5 = 0, node 6 = 0
  Nodes 5 and 6 (0 slack) scheduled to cycle 2
  Already 1 ALU - 4 can be scheduled

Cycle 3
  Candidates = {7}
  Slack: node 7 = 0
  Node 7 scheduled to cycle 3

Final Schedule

   Node   LPC   Cycle
   1      1     1
   2      1     1
   3      1     1
   4      3     2
   5      2     2
   6      2     2
   7      3     3

Required Resources = 2 Mult, 2 ALU

Other extensions

Chaining
  Multiple operations in a single cycle
  Multiple adds may be faster than 1 divide - perform the adds in one cycle

Pipelining
  Input: DFG, data delivery rate
  For fully pipelined circuit, must have one resource per operation (remember systolic arrays)

[Figure: DFG over inputs a to f with operations +, +, +, -, /]

Summary

Scheduling assigns each operation in a DFG a start time
  Done for each DFG in the CDFG

Different Types
  Minimum Latency
    ASAP, ALAP
  Latency-constrained
    ASAP, ALAP
  Minimum-latency, resource-constrained
    Hu’s Algorithm
    List Scheduling
  Minimum-resource, latency-constrained
    List Scheduling

High-level Synthesis: Binding/Resource Sharing

Binding

During scheduling, we determined:
  When ops will execute
  How many resources are needed

We still need to decide which ops execute on which resources
  => Binding
  If multiple ops use the same resource
  => Resource Sharing

Binding

Basic Idea - Map operations onto resources such that operations in same cycle don’t use same resource

[Figure: the four-cycle schedule of operations 1 to 8 (Cycle1 to Cycle4), with each operation mapped onto one of Mult1, ALU1, ALU2, Mult2]

Many possibilities
  Bad binding may increase resources, require huge steering logic, reduce clock, etc.

[Figure: the same schedule with a different mapping onto Mult1, ALU1, ALU2, Mult2]

Can’t do this
  1 resource can’t perform multiple ops simultaneously!

[Figure: an invalid binding in which operations from the same cycle share a resource]


Binding

How to automate?
  More graph theory

Compatibility Graph
  Each node is an operation
  Edges represent compatible operations
    Compatible - if two ops can share a resource
      I.e. ops that use the same type of resource (ALU, etc.) and are scheduled to different cycles

Compatibility Graph

[Figure: the scheduled DFG (operations 1 to 8, Cycle1 to Cycle4) next to its compatibility graph, drawn as two groups: ALUs {2, 3, 4, 7, 8} and Mults {1, 5, 6}]

2 and 3 not compatible (same cycle)
5 and 6 not compatible (same cycle)

Note - Fully connected subgraphs can share a resource (all involved nodes are compatible)

Binding: Find minimum number of fully connected subgraphs that cover entire graph
  Well-known problem: Clique partitioning (NP-complete)

Cliques = { {2,8,7,4}, {3}, {1,5}, {6} }
  ALU1 executes 2,8,7,4
  ALU2 executes 3
  MULT1 executes 1,5
  MULT2 executes 6

[Figures: the final binding drawn on the schedule, followed by an alternative final binding]

Translation to Datapath

[Figure: the bound schedule (Mult(1,5), Mult(6), ALU(2,7,8,4), ALU(3)) translated into a datapath, with muxes selecting the inputs of shared resources and a register on each resource output]

1) Add resources and registers
2) Add mux for each input
3) Add input to left mux for each left input in DFG
4) Do same for right mux
5) If only 1 input, remove mux

Left Edge Algorithm

Alternative to clique partitioning
  Take scheduled DFG, rotate it 90 degrees

[Figure: the multicycle schedule (operations 1 to 8, Cycle1 to Cycle7; 2 ALUs (+/-), 2 Multipliers) rotated 90 degrees so that cycles run left to right, with a right_edge marker sweeping across]

1) Initialize right_edge to 0
2) Find a node N whose left edge is >= right_edge
3) Bind N to a particular resource
4) Update right_edge to the right edge of N
5) Repeat from 2) for nodes using the same resource type until right_edge passes all nodes
6) Repeat from 1) until all nodes bound
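The six left-edge steps can be sketched for the operations of a single resource type. Representing each operation as a half-open interval of cycles [left, right) is an assumption of this sketch, as are the struct and function names; cycles are assumed to be numbered from 1, and the function would be called once per resource type.

```c
#include <stdlib.h>

/* An operation occupies cycles [left, right); unit is the resource it is bound to. */
struct op { int left, right, unit; };

static int by_left(const void *a, const void *b)
{
    return ((const struct op *)a)->left - ((const struct op *)b)->left;
}

/* Binds all ops of ONE resource type and returns the number of units used. */
int left_edge(struct op *ops, int n)
{
    int bound = 0, units = 0;
    qsort(ops, (size_t)n, sizeof *ops, by_left);      /* scan in order of left edge */
    for (int i = 0; i < n; i++) ops[i].unit = -1;

    while (bound < n) {                               /* step 6: next resource */
        int right_edge = 0;                           /* step 1 */
        for (int i = 0; i < n; i++) {                 /* steps 2 and 5 */
            if (ops[i].unit < 0 && ops[i].left >= right_edge) {
                ops[i].unit = units;                  /* step 3 */
                right_edge = ops[i].right;            /* step 4 */
                bound++;
            }
        }
        units++;
    }
    return units;
}
```

For example, the intervals [1,3), [2,4), [3,5), [4,6), [5,6) need two units: the first, third, and fifth share one, and the second and fourth share the other.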

Extensions

Algorithms presented so far find a valid binding
  But, do not consider amount of steering logic required
  Different bindings can require significantly different # of muxes

One solution
  Extend compatibility graph
    Use weighted edges/nodes - cost function representing steering logic
    Perform clique partitioning, finding the set of cliques that minimize weight

Binding Summary

Binding maps operations onto physical resources
  Determines sharing among resources
  Binding may greatly affect steering logic

Trivial for fully-pipelined circuits
  1 resource per operation

Straightforward translation from bound DFG to datapath

High-level Synthesis: Summary

Main Steps
  Front-end (lexing/parsing) converts code into intermediate representation
    We looked at CDFG
  Scheduling assigns a start time for each operation in DFG
    CFG node start times defined by control dependencies
    Resource allocation determined by schedule
  Binding maps scheduled operations onto physical resources
    Determines how resources are shared

Big picture:
  Scheduled/Bound DFG can be translated into a datapath
  CFG can be translated to a controller
  => High-level synthesis can create a custom circuit for any CDFG!

Limitations

Pipeline parallelism
  Discussed techniques only exploit arithmetic and bit level parallelism
    Not much potential for speedup
  How to create pipelined circuit?
    Difficult for general code
    Must try to transform into known “templates”
      i.e. if conversion, loop fusion, etc.
      Not always possible/practical
  Existing tools
    Academic
      ROCCC - Riverside Optimizing Compiler for Configurable Computing
      SPARK
    Commercial
      Catapult C (Mentor Graphics)
      Celoxica DK Design Suite

Limitations

Task-level parallelism
  Parallelism in CDFG limited to individual control states
    Can’t have multiple states executing concurrently
  Potential solution: use model other than CDFG
    Kahn Process Networks
      Nodes represent parallel processes/tasks
      Edges represent communication between processes
    Discussed techniques can create a controller+datapath for each process
      Must also consider communication buffers
  Challenge:
    Most high-level code does not have explicit parallelism
    Difficult/impossible to extract task-level parallelism from code

Limitations

Coding practices limit circuit performance
  Very often, languages contain constructs not appropriate for circuit implementation
    Recursion, pointers, virtual functions, etc.
  Potential solution: use specialized languages
    Remove problematic constructs, add task-level parallelism
  Challenges:
    Difficult to learn new languages
    Many designers resist changes to tool flow

Limitations

Expert designers can achieve better circuits
  High-level synthesis has to work with specification in code
    Can be difficult to automatically create efficient pipeline
    May require dozens of optimizations applied in a particular order
  Expert designer can transform algorithm
    Synthesis can transform code, but can’t change algorithm
  Potential Solution: ???
    New language?
    New methodology?
    New tools?
