Architecture Synthesis - eim.uni-paderborn.de


Transcript of Architecture Synthesis - eim.uni-paderborn.de

Page 1: Architecture Synthesis - eim.uni-paderborn.de

Architecture Synthesis

Dr. Tobias Kenter
Prof. Dr. Christian Plessl

SS 2017

High-Performance IT Systems group
Paderborn University

Version 1.6.1 – 2017-06-25

Page 2: Architecture Synthesis - eim.uni-paderborn.de

Motivation

• translate program/algorithm into dedicated hardware
• units of translation
  – single basic block (combinational)
  – complete program (sequential)

2

[Figure: controller (finite state machine) coupled with a data path (registers and functional units FU)]

• the FSM controls the data path (FSM actions)
• the data path sends feedback to the FSM (FSM conditions)

int gcd(int a, int b) {
    while (a != b) {
        if (a > b) {
            a = a - b;
        } else {
            b = b - a;
        }
    }
    return a;
}

Page 3: Architecture Synthesis - eim.uni-paderborn.de

Overview

• translation of basic blocks
  – formal modeling
    § sequence graph, resource graph
    § cost function, execution times
    § synthesis problems: allocation, binding, scheduling
  – scheduling algorithms
    § ASAP, ALAP
    § extended ASAP, ALAP
    § list scheduling
  – optimal scheduling
    § introduction to integer linear programming (ILP)
    § scheduling with ILP
• translation of complete programs
  – finite state machine and data path
  – micro-coded controllers

3

Page 4: Architecture Synthesis - eim.uni-paderborn.de

Overview

• translation of basic blocks
  – formal modeling
    § sequence graph, resource graph
    § cost function, execution times
    § synthesis problems: allocation, binding, scheduling
  – scheduling algorithms
    § ASAP, ALAP
    § extended ASAP, ALAP
    § list scheduling
  – optimal scheduling
    § introduction to integer linear programming (ILP)
    § scheduling with ILP
• translation of complete programs
  – finite state machine and data path
  – micro-coded controllers

4

Page 5: Architecture Synthesis - eim.uni-paderborn.de

Sequencing Graph

5

[Figure: sequencing graph GS(VS, ES) with NOP source node v0 and NOP sink node v12; multiplications v1, v2, v3, v4, v6, v7; additions v5, v8; comparison v9 (<); subtractions v10, v11]

VS = {v0, v1, …, v12}

GS(VS, ES)

Excursus: source of example, relation to DFG

Page 6: Architecture Synthesis - eim.uni-paderborn.de

6

Resource Graph

• GR(VR, ER)
  – set of nodes VR = VS ∪ VT
    § VS are the nodes of the sequencing graph (without NOPs)
    § VT represent resource types (adder, multiplier, ALU, …)
  – set of edges (vS, vT) ∈ ER with vS ∈ VS, vT ∈ VT
    § an instance of resource type vT can be used to implement operation vS

Page 7: Architecture Synthesis - eim.uni-paderborn.de

Resource Graph – Example

7

[Figure: resource graph with VT = {vMUL, vALU}; edges map the multiplications v1, v2, v3, v4, v6, v7 to vMUL and the operations v5 (+), v8 (+), v9 (<), v10 (−), v11 (−) to vALU]

VT = {vMUL, vALU}

Page 8: Architecture Synthesis - eim.uni-paderborn.de

Cost, Execution Time

• cost function c: VT → Z
  – assigns a cost value to each resource type
  – example: c(vMUL) = 8, c(vALU) = 4
• execution times w: ER → Z+
  – assigns the execution time of operation vS ∈ VS on resource type vT ∈ VT to the edge (vS, vT) ∈ ER
  – example: w(v2, vMUL) = 1

Page 9: Architecture Synthesis - eim.uni-paderborn.de

Allocation, Binding

• allocation α: VT → Z+
  – assigns a number α(vT) of available instances to each resource type vT
• binding is given by the two functions β: VS → VT and γ: VS → Z+
  – β(vS) = vT means that operation vS is implemented by resource type vT (possible β's shown in the resource graph)
  – γ(vS) = r denotes that vS is implemented by the r-th instance of vT; r ≤ α(vT)

9

Page 10: Architecture Synthesis - eim.uni-paderborn.de

Allocation, Binding – Example (1)

10

α(vMUL) = 3, c(vMUL) = 8

β(v1) = vMUL   γ(v1) = 1   w(v1, vMUL) = 1
β(v2) = vMUL   γ(v2) = 2   w(v2, vMUL) = 1
β(v3) = vMUL   γ(v3) = 3   w(v3, vMUL) = 1
β(v4) = vMUL   γ(v4) = 1   w(v4, vMUL) = 1
β(v6) = vMUL   γ(v6) = 2   w(v6, vMUL) = 1
β(v7) = vMUL   γ(v7) = 3   w(v7, vMUL) = 1

Page 11: Architecture Synthesis - eim.uni-paderborn.de

Allocation, Binding – Example (2)

11

α(vALU) = 2, c(vALU) = 4

β(v5) = vALU    γ(v5) = 1    w(v5, vALU) = 1
β(v8) = vALU    γ(v8) = 1    w(v8, vALU) = 1
β(v9) = vALU    γ(v9) = 1    w(v9, vALU) = 1
β(v10) = vALU   γ(v10) = 2   w(v10, vALU) = 1
β(v11) = vALU   γ(v11) = 1   w(v11, vALU) = 1

Page 12: Architecture Synthesis - eim.uni-paderborn.de

Schedule

• schedule τ: VS → Z+
  – assigns a start time to each operation under the constraint
    τ(vj) − τ(vi) ≥ w(vi, β(vi))  ∀(vi, vj) ∈ ES
• latency L of a scheduled sequencing graph
  – difference in start times between end node and start node
    L = τ(vn) − τ(v0)
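As a sanity check, the precedence constraint and the latency definition can be evaluated mechanically. The following Python sketch uses a made-up four-node graph; all names and numbers are illustrative, not the slides' example:

```python
# Check a schedule against the precedence constraint
#   tau(vj) - tau(vi) >= w(vi, beta(vi))  for all edges (vi, vj)
# and compute the latency L = tau(vn) - tau(v0).
# Hypothetical 4-node graph; v0 is a NOP source with execution time 0.

edges = [("v0", "v1"), ("v0", "v2"), ("v1", "v3"), ("v2", "v3")]
w = {"v0": 0, "v1": 1, "v2": 1, "v3": 1}    # w(vi, beta(vi)) per node
tau = {"v0": 1, "v1": 1, "v2": 1, "v3": 2}  # candidate start times

def is_valid(tau, edges, w):
    return all(tau[vj] - tau[vi] >= w[vi] for vi, vj in edges)

latency = tau["v3"] - tau["v0"]
print(is_valid(tau, edges, w), latency)  # -> True 1
```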

12

Page 13: Architecture Synthesis - eim.uni-paderborn.de

Schedule - Example

13

[Figure: example schedule of the sequencing graph over time steps 1–5]

L = τ(v12) − τ(v0) = 5 − 1 = 4

Page 14: Architecture Synthesis - eim.uni-paderborn.de

Synthesis

• allocation, binding, scheduling
• finding (α, β, γ, τ) that optimize latency and cost under resource and timing constraints
  – algorithms for architecture synthesis discussed in, e.g.
    § J. Teich, C. Haubelt, Digitale Hardware/Software-Systeme, Springer 2007
    § G. De Micheli, Synthesis and Optimization of Digital Circuits, McGraw-Hill 1994
  – synthesis problem variants
    § multicycle operations, operator chaining
    § several possible resource types for an operation
    § iterative schedules, pipelining
• in the following
  – scheduling without resource constraints
    § ASAP, ALAP
  – scheduling under resource constraints
    § extended ASAP, ALAP
    § list scheduling

14

Page 15: Architecture Synthesis - eim.uni-paderborn.de

Overview

• translation of basic blocks
  – formal modeling
    § sequence graph, resource graph
    § cost function, execution times
    § synthesis problems: allocation, binding, scheduling
  – scheduling algorithms
    § ASAP, ALAP
    § extended ASAP, ALAP
    § list scheduling
  – optimal scheduling
    § introduction to integer linear programming (ILP)
    § scheduling with ILP
• translation of complete programs
  – finite state machine and data path
  – micro-coded controllers

15

Page 16: Architecture Synthesis - eim.uni-paderborn.de

Scheduling without Resource Constraints

• ASAP (as soon as possible)
  – determines the earliest possible start times for the operations
  – minimal latency
• ALAP (as late as possible)
  – determines the latest possible start times for the operations under a given latency bound
• slack (mobility) of operations
  – difference of start times: (ALAP with ASAP latency bound) − ASAP
  – if slack = 0 → operation is on the critical path
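The slack computation can be written out directly. In this Python sketch the asap/alap start times are made-up illustrative values, not the slides' example:

```python
# Slack (mobility) = ALAP start time - ASAP start time, where the ALAP
# schedule uses the ASAP latency as its latency bound.
# The asap/alap values below are hypothetical.

asap = {"v1": 1, "v2": 1, "v3": 2, "v4": 3}
alap = {"v1": 1, "v2": 2, "v3": 2, "v4": 3}

slack = {v: alap[v] - asap[v] for v in asap}
critical = [v for v, s in slack.items() if s == 0]  # slack 0 -> critical path
print(slack, critical)
```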

16

Page 17: Architecture Synthesis - eim.uni-paderborn.de

ASAP Scheduling

• ASAP: as-soon-as-possible scheduling
• algorithm

ASAP( GS(V,E) ) {
    schedule v0 by setting τ(v0) = 1
    repeat {
        select a vertex vj whose predecessors are all scheduled
        schedule vj by setting τ(vj) = max_{i:(vi,vj)∈E} { τ(vi) + w(vi, β(vi)) }
    } until (vn is scheduled)
    return τ
}
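The pseudocode above can be sketched in Python; the predecessor-list graph encoding and the tiny example graph are assumptions made for illustration:

```python
# Minimal ASAP sketch. preds maps each node to its predecessor list;
# w[v] is the execution time of v on its bound resource type.

def asap(preds, w, v0, vn):
    tau = {v0: 1}                      # schedule v0 by setting tau(v0) = 1
    pending = set(preds) - {v0}
    while vn not in tau:
        # select a vertex whose predecessors are all scheduled
        vj = next(v for v in sorted(pending) if all(p in tau for p in preds[v]))
        # tau(vj) = max over edges (vi, vj) of tau(vi) + w(vi)
        tau[vj] = max(tau[p] + w[p] for p in preds[vj])
        pending.remove(vj)
    return tau

# tiny example: v0 -> {v1, v2} -> v3, where v0 and v3 are NOPs with w = 0
preds = {"v0": [], "v1": ["v0"], "v2": ["v0"], "v3": ["v1", "v2"]}
w = {"v0": 0, "v1": 1, "v2": 1, "v3": 0}
print(asap(preds, w, "v0", "v3"))  # -> {'v0': 1, 'v1': 1, 'v2': 1, 'v3': 2}
```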

Page 18: Architecture Synthesis - eim.uni-paderborn.de

ASAP Example

18

[Figure: ASAP schedule of the example graph over time steps 1–5]

minimal latency: L = 4

Page 19: Architecture Synthesis - eim.uni-paderborn.de

ALAP Scheduling

• ALAP: as-late-as-possible scheduling
• requires a latency bound L
  – otherwise nodes could be arbitrarily delayed
  – typically the schedule length of the ASAP schedule is used as latency bound
• algorithm

ALAP( GS(V,E), L ) {
    schedule vn by setting τ(vn) = L + 1
    repeat {
        select a vertex vi whose successors are all scheduled
        schedule vi by setting τ(vi) = min_{j:(vi,vj)∈E} { τ(vj) } − w(vi, β(vi))
    } until (v0 is scheduled)
    return τ
}
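ALAP mirrors ASAP but walks the graph backwards from the sink under a latency bound. A Python sketch (successor-list encoding and example graph are assumptions):

```python
# Minimal ALAP sketch. succs maps each node to its successor list;
# w[v] is the execution time of v on its bound resource type.

def alap(succs, w, v0, vn, L):
    tau = {vn: L + 1}                  # schedule vn by setting tau(vn) = L + 1
    pending = set(succs) - {vn}
    while v0 not in tau:
        # select a vertex whose successors are all scheduled
        vi = next(v for v in sorted(pending) if all(s in tau for s in succs[v]))
        # tau(vi) = min over edges (vi, vj) of tau(vj), minus w(vi)
        tau[vi] = min(tau[s] for s in succs[vi]) - w[vi]
        pending.remove(vi)
    return tau

# same tiny example as for ASAP; v0 and v3 are NOPs with w = 0
succs = {"v0": ["v1", "v2"], "v1": ["v3"], "v2": ["v3"], "v3": []}
w = {"v0": 0, "v1": 1, "v2": 1, "v3": 0}
print(alap(succs, w, "v0", "v3", L=1))
```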

Page 20: Architecture Synthesis - eim.uni-paderborn.de

ALAP Example

20

[Figure: ALAP schedule of the example graph over time steps 1–5]

latency bound: L = 4

Page 21: Architecture Synthesis - eim.uni-paderborn.de

Slack (mobility)

21

[Figure: slack (mobility) of each operation over time steps 1–4; operations on the critical path have slack 0, the remaining operations have slack 1 or 2]

Page 22: Architecture Synthesis - eim.uni-paderborn.de

Scheduling under Resource Constraints (1)

• Extended ASAP, ALAP
  – first ASAP or ALAP
  – then move operations down (ASAP) or up (ALAP) until resource constraints are satisfied

22

Page 23: Architecture Synthesis - eim.uni-paderborn.de

Extended ALAP

23

[Figure: extended ALAP schedule of the example graph over time steps 1–5]

resource constraints: α(vMUL) = 2, α(vALU) = 2
latency bound: L = 4

Page 24: Architecture Synthesis - eim.uni-paderborn.de

Scheduling under Resource Constraints (2)

• list scheduling
  – operations are prioritized according to some criterion, e.g. slack or number of successors

time = 1
repeat
    for each resource (vT, α(vT))
        determine all ready operations vS with β(vS) = vT and
        schedule the one with the highest priority
    time++
until (vn is scheduled)

24

Page 25: Architecture Synthesis - eim.uni-paderborn.de

List Scheduling (1)

• example
  – criterion: number of successor nodes
  – resource constraints: α(vMUL) = 1, α(vALU) = 1

[Figure: example graph annotated with the number of successors of each node and the resulting execution time (1–7) of each operation]

Page 26: Architecture Synthesis - eim.uni-paderborn.de

List Scheduling (2)

26

[Figure: list schedule of the example graph over time steps 1–7]

L = τ(v12) − τ(v0) = 8 − 1 = 7

✎ Exercise 5.1: Architecture Synthesis

Page 27: Architecture Synthesis - eim.uni-paderborn.de

Overview

• translation of basic blocks
  – formal modeling
    § sequence graph, resource graph
    § cost function, execution times
    § synthesis problems: allocation, binding, scheduling
  – scheduling algorithms
    § ASAP, ALAP
    § extended ASAP, ALAP
    § list scheduling
  – optimal scheduling
    § introduction to integer linear programming (ILP)
    § scheduling with ILP
• translation of complete programs
  – finite state machine and data path
  – micro-coded controllers

27

Page 28: Architecture Synthesis - eim.uni-paderborn.de

ILP Basics (1)

• Linear Program (LP)
  – is a special kind of optimization problem
  – real variables: x ∈ R^n
  – linear constraints modeled by (in)equations: A·x ≤ b, x ≥ 0, with A ∈ R^(m×n), b ∈ R^m
  – linear cost function: max c^T·x, with c ∈ R^n
• Integer Linear Program (ILP)
  – LP where all variables must be integer: x ∈ Z^n
• 0-1 LP (binary LP, ZOLP)
  – LP where all variables must be binary: x ∈ {0,1}^n
• Mixed Integer Linear Program (MILP)
  – LP where some variables must be integer

Page 29: Architecture Synthesis - eim.uni-paderborn.de

ILP Basics (2)

• example
  – 3 binary variables: x1, x2, x3 ∈ {0,1}
  – constraint that at least two variables must be set to 1:
    x1 + x2 + x3 ≥ 2
  – minimize cost function:
    min 5x1 + 6x2 + 4x3
• solution

  x1 x2 x3 | cost
   0  1  1 |  10
   1  0  1 |   9
   1  1  0 |  11
   1  1  1 |  15
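Because this example has only three binary variables, the optimum can be verified by brute-force enumeration (a sketch for checking the table above, not how ILP solvers work in general):

```python
# Brute-force check of the small 0-1 LP: minimize 5*x1 + 6*x2 + 4*x3
# subject to x1 + x2 + x3 >= 2, with binary variables.
from itertools import product

best = min(
    (5 * x1 + 6 * x2 + 4 * x3, (x1, x2, x3))
    for x1, x2, x3 in product((0, 1), repeat=3)
    if x1 + x2 + x3 >= 2            # feasibility constraint
)
print(best)  # -> (9, (1, 0, 1))
```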

Page 30: Architecture Synthesis - eim.uni-paderborn.de

ILP Basics (3)

• solving ILPs
  – ILPs are NP-complete, exponential runtime in worst case
  – solved by branch-and-bound algorithms to optimality (runtime depends on the efficiency of the model)
  – popular solvers
    § CPLEX (commercial) http://www.ibm.com/software/commerce/optimization/cplex-optimizer/
    § Gurobi (commercial, free academic license) http://www.gurobi.com/
    § lp_solve (open source) http://lpsolve.sourceforge.net/5.5/
• challenges
  – translate the problem into an efficient ILP model
  – use only linear constraints and cost function
    § non-linearities of the problem can (possibly) be modeled by several linear constraints, but this increases the complexity of the ILP

30

Page 31: Architecture Synthesis - eim.uni-paderborn.de

Example: Optimal DFG Scheduling

31

[Figure: sequencing graph GS(VS, ES) with VS = {v0, v1, …, v12}; v0 and v12 are NOP nodes]

all operations take one time step

resource constraints:
• 2 MUL
• 2 ALU (+, −, <)

→ find schedule with minimal latency

alternative problem: minimize resources under latency constraint

Page 32: Architecture Synthesis - eim.uni-paderborn.de

ILP for Optimal DFG Scheduling (1)

• binary variables
  – x_{i,l} = 1 iff operation vi starts in time step l, for i = 1, …, n and l = 1, …, L
  → we need to define a latency bound L
• uniqueness constraints
  – each operation starts in exactly one time step:
    Σ_l x_{i,l} = 1, ∀i = 1, …, n
• starting time
  – the starting time τ(vi) of operation vi can be expressed as
    τ(vi) = Σ_l l · x_{i,l}

Page 33: Architecture Synthesis - eim.uni-paderborn.de

ILP for Optimal DFG Scheduling (2)

• simple resource constraints (execution time for operations = 1)
  – in each time step, use no more resources than available:
    Σ_{i: β(vi) = vMUL} x_{i,l} ≤ α(vMUL), ∀l = 1, …, L
    Σ_{i: β(vi) = vALU} x_{i,l} ≤ α(vALU), ∀l = 1, …, L
• sequencing constraints
  – an operation cannot start earlier than the finishing time of its predecessors:
    Σ_l l · x_{j,l} ≥ w(vi, β(vi)) + Σ_l l · x_{i,l}, ∀ edges (vi, vj)
• restriction w(vi, β(vi)) = 1
  – required for these simple resource constraints
  – used in the sequencing constraints of the upcoming example

Page 34: Architecture Synthesis - eim.uni-paderborn.de

ILP for Optimal DFG Scheduling (3)

• cost function
  – minimize the start time of the last operation:
    min τ(vn) = min Σ_l l · x_{n,l}
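Models like the one on the following slides can be generated mechanically. A sketch that emits the uniqueness constraints and the objective in lp_solve's LP format; the emitter itself is an assumption, the numbers follow the example (n = 12 operations, latency bound L = 7):

```python
# Emit parts of the scheduling ILP in lp_solve's LP format.
# Variable xi_l == 1 iff operation vi starts in time step l.
n, L = 12, 7

def uniqueness(i):
    # each operation starts in exactly one time step
    return " + ".join(f"x{i}_{l}" for l in range(1, L + 1)) + " = 1;"

def objective(last):
    # minimize the start time of the last operation
    return "min: " + " + ".join(f"{l} x{last}_{l}" for l in range(1, L + 1)) + ";"

model = [uniqueness(i) for i in range(1, n + 1)] + [objective(n)]
print("\n".join(model))
```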

Page 35: Architecture Synthesis - eim.uni-paderborn.de

Example: Scheduling of DFG (1)

35

/*********************/
/* integer variables */
/*********************/
int x1_1; int x1_2; int x1_3; int x1_4; int x1_5; int x1_6; int x1_7;
int x2_1; int x2_2; int x2_3; int x2_4; int x2_5; int x2_6; int x2_7;
int x3_1; int x3_2; int x3_3; int x3_4; int x3_5; int x3_6; int x3_7;
int x4_1; int x4_2; int x4_3; int x4_4; int x4_5; int x4_6; int x4_7;
int x5_1; int x5_2; int x5_3; int x5_4; int x5_5; int x5_6; int x5_7;
int x6_1; int x6_2; int x6_3; int x6_4; int x6_5; int x6_6; int x6_7;
int x7_1; int x7_2; int x7_3; int x7_4; int x7_5; int x7_6; int x7_7;
int x8_1; int x8_2; int x8_3; int x8_4; int x8_5; int x8_6; int x8_7;
int x9_1; int x9_2; int x9_3; int x9_4; int x9_5; int x9_6; int x9_7;
int x10_1; int x10_2; int x10_3; int x10_4; int x10_5; int x10_6; int x10_7;
int x11_1; int x11_2; int x11_3; int x11_4; int x11_5; int x11_6; int x11_7;
int x12_1; int x12_2; int x12_3; int x12_4; int x12_5; int x12_6; int x12_7;

/**********************/
/* variables in {0,1} */
/**********************/
x1_1 <= 1; x1_2 <= 1; x1_3 <= 1; x1_4 <= 1; x1_5 <= 1; x1_6 <= 1; x1_7 <= 1;
x2_1 <= 1; x2_2 <= 1; x2_3 <= 1; x2_4 <= 1; x2_5 <= 1; x2_6 <= 1; x2_7 <= 1;
x3_1 <= 1; x3_2 <= 1; x3_3 <= 1; x3_4 <= 1; x3_5 <= 1; x3_6 <= 1; x3_7 <= 1;
x4_1 <= 1; x4_2 <= 1; x4_3 <= 1; x4_4 <= 1; x4_5 <= 1; x4_6 <= 1; x4_7 <= 1;
x5_1 <= 1; x5_2 <= 1; x5_3 <= 1; x5_4 <= 1; x5_5 <= 1; x5_6 <= 1; x5_7 <= 1;
x6_1 <= 1; x6_2 <= 1; x6_3 <= 1; x6_4 <= 1; x6_5 <= 1; x6_6 <= 1; x6_7 <= 1;
x7_1 <= 1; x7_2 <= 1; x7_3 <= 1; x7_4 <= 1; x7_5 <= 1; x7_6 <= 1; x7_7 <= 1;
x8_1 <= 1; x8_2 <= 1; x8_3 <= 1; x8_4 <= 1; x8_5 <= 1; x8_6 <= 1; x8_7 <= 1;
x9_1 <= 1; x9_2 <= 1; x9_3 <= 1; x9_4 <= 1; x9_5 <= 1; x9_6 <= 1; x9_7 <= 1;
x10_1 <= 1; x10_2 <= 1; x10_3 <= 1; x10_4 <= 1; x10_5 <= 1; x10_6 <= 1; x10_7 <= 1;
x11_1 <= 1; x11_2 <= 1; x11_3 <= 1; x11_4 <= 1; x11_5 <= 1; x11_6 <= 1; x11_7 <= 1;
x12_1 <= 1; x12_2 <= 1; x12_3 <= 1; x12_4 <= 1; x12_5 <= 1; x12_6 <= 1; x12_7 <= 1;

latency bound = 7

problem description in LP format

Page 36: Architecture Synthesis - eim.uni-paderborn.de

Example: Scheduling of DFG (2)

36

/**************************/
/* uniqueness constraints */
/**************************/
x1_1 + x1_2 + x1_3 + x1_4 + x1_5 + x1_6 + x1_7 = 1;
x2_1 + x2_2 + x2_3 + x2_4 + x2_5 + x2_6 + x2_7 = 1;
x3_1 + x3_2 + x3_3 + x3_4 + x3_5 + x3_6 + x3_7 = 1;
x4_1 + x4_2 + x4_3 + x4_4 + x4_5 + x4_6 + x4_7 = 1;
x5_1 + x5_2 + x5_3 + x5_4 + x5_5 + x5_6 + x5_7 = 1;
x6_1 + x6_2 + x6_3 + x6_4 + x6_5 + x6_6 + x6_7 = 1;
x7_1 + x7_2 + x7_3 + x7_4 + x7_5 + x7_6 + x7_7 = 1;
x8_1 + x8_2 + x8_3 + x8_4 + x8_5 + x8_6 + x8_7 = 1;
x9_1 + x9_2 + x9_3 + x9_4 + x9_5 + x9_6 + x9_7 = 1;
x10_1 + x10_2 + x10_3 + x10_4 + x10_5 + x10_6 + x10_7 = 1;
x11_1 + x11_2 + x11_3 + x11_4 + x11_5 + x11_6 + x11_7 = 1;
x12_1 + x12_2 + x12_3 + x12_4 + x12_5 + x12_6 + x12_7 = 1;

/************************/
/* resource constraints */
/************************/
x1_1 + x2_1 + x3_1 + x4_1 + x6_1 + x7_1 <= 2;
x5_1 + x8_1 + x9_1 + x10_1 + x11_1 <= 2;
x1_2 + x2_2 + x3_2 + x4_2 + x6_2 + x7_2 <= 2;
x5_2 + x8_2 + x9_2 + x10_2 + x11_2 <= 2;
x1_3 + x2_3 + x3_3 + x4_3 + x6_3 + x7_3 <= 2;
x5_3 + x8_3 + x9_3 + x10_3 + x11_3 <= 2;
x1_4 + x2_4 + x3_4 + x4_4 + x6_4 + x7_4 <= 2;
x5_4 + x8_4 + x9_4 + x10_4 + x11_4 <= 2;
x1_5 + x2_5 + x3_5 + x4_5 + x6_5 + x7_5 <= 2;
x5_5 + x8_5 + x9_5 + x10_5 + x11_5 <= 2;
x1_6 + x2_6 + x3_6 + x4_6 + x6_6 + x7_6 <= 2;
x5_6 + x8_6 + x9_6 + x10_6 + x11_6 <= 2;
x1_7 + x2_7 + x3_7 + x4_7 + x6_7 + x7_7 <= 2;
x5_7 + x8_7 + x9_7 + x10_7 + x11_7 <= 2;

/**********************/
/* objective function */
/**********************/
min: 1 x12_1 + 2 x12_2 + 3 x12_3 + 4 x12_4 + 5 x12_5 + 6 x12_6 + 7 x12_7;

Page 37: Architecture Synthesis - eim.uni-paderborn.de

/**************************/
/* sequencing constraints */
/**************************/
/* 1->6 */
1 x6_1 + 2 x6_2 + 3 x6_3 + 4 x6_4 + 5 x6_5 + 6 x6_6 + 7 x6_7 >= 1 + 1 x1_1 + 2 x1_2 + 3 x1_3 + 4 x1_4 + 5 x1_5 + 6 x1_6 + 7 x1_7;
/* 2->6 */
1 x6_1 + 2 x6_2 + 3 x6_3 + 4 x6_4 + 5 x6_5 + 6 x6_6 + 7 x6_7 >= 1 + 1 x2_1 + 2 x2_2 + 3 x2_3 + 4 x2_4 + 5 x2_5 + 6 x2_6 + 7 x2_7;
/* 6->10 */
1 x10_1 + 2 x10_2 + 3 x10_3 + 4 x10_4 + 5 x10_5 + 6 x10_6 + 7 x10_7 >= 1 + 1 x6_1 + 2 x6_2 + 3 x6_3 + 4 x6_4 + 5 x6_5 + 6 x6_6 + 7 x6_7;
/* 10->11 */
1 x11_1 + 2 x11_2 + 3 x11_3 + 4 x11_4 + 5 x11_5 + 6 x11_6 + 7 x11_7 >= 1 + 1 x10_1 + 2 x10_2 + 3 x10_3 + 4 x10_4 + 5 x10_5 + 6 x10_6 + 7 x10_7;
/* 11->12 */
1 x12_1 + 2 x12_2 + 3 x12_3 + 4 x12_4 + 5 x12_5 + 6 x12_6 + 7 x12_7 >= 1 + 1 x11_1 + 2 x11_2 + 3 x11_3 + 4 x11_4 + 5 x11_5 + 6 x11_6 + 7 x11_7;
/* 3->7 */
1 x7_1 + 2 x7_2 + 3 x7_3 + 4 x7_4 + 5 x7_5 + 6 x7_6 + 7 x7_7 >= 1 + 1 x3_1 + 2 x3_2 + 3 x3_3 + 4 x3_4 + 5 x3_5 + 6 x3_6 + 7 x3_7;
/* 7->11 */
1 x11_1 + 2 x11_2 + 3 x11_3 + 4 x11_4 + 5 x11_5 + 6 x11_6 + 7 x11_7 >= 1 + 1 x7_1 + 2 x7_2 + 3 x7_3 + 4 x7_4 + 5 x7_5 + 6 x7_6 + 7 x7_7;
/* 4->8 */
1 x8_1 + 2 x8_2 + 3 x8_3 + 4 x8_4 + 5 x8_5 + 6 x8_6 + 7 x8_7 >= 1 + 1 x4_1 + 2 x4_2 + 3 x4_3 + 4 x4_4 + 5 x4_5 + 6 x4_6 + 7 x4_7;
/* 8->12 */
1 x12_1 + 2 x12_2 + 3 x12_3 + 4 x12_4 + 5 x12_5 + 6 x12_6 + 7 x12_7 >= 1 + 1 x8_1 + 2 x8_2 + 3 x8_3 + 4 x8_4 + 5 x8_5 + 6 x8_6 + 7 x8_7;
/* 5->9 */
1 x9_1 + 2 x9_2 + 3 x9_3 + 4 x9_4 + 5 x9_5 + 6 x9_6 + 7 x9_7 >= 1 + 1 x5_1 + 2 x5_2 + 3 x5_3 + 4 x5_4 + 5 x5_5 + 6 x5_6 + 7 x5_7;
/* 9->12 */
1 x12_1 + 2 x12_2 + 3 x12_3 + 4 x12_4 + 5 x12_5 + 6 x12_6 + 7 x12_7 >= 1 + 1 x9_1 + 2 x9_2 + 3 x9_3 + 4 x9_4 + 5 x9_5 + 6 x9_6 + 7 x9_7;

Example: Scheduling of DFG (3)

37

Page 38: Architecture Synthesis - eim.uni-paderborn.de

Example: DFG - Optimal Schedule

38

[Figure: optimal schedule of the example graph over time steps 1–5]

latency = 4

Page 39: Architecture Synthesis - eim.uni-paderborn.de

Generalized and Optimized Formulation (1)

• compute the earliest and latest scheduling time for each node vi using ASAP and ALAP scheduling (without resource constraints)
  – latency bound for ALAP determined by list scheduling
  – l_i = τ^ASAP(vi), h_i = τ^ALAP(vi)
• binary variables
  x_{i,t} ∈ {0,1}  ∀vi ∈ V, ∀t: l_i ≤ t ≤ h_i
• uniqueness constraints
  Σ_{t=l_i}^{h_i} x_{i,t} = 1  ∀vi ∈ V
• scheduling time
  Σ_{t=l_i}^{h_i} t · x_{i,t} = τ(vi)  ∀vi ∈ V

✎ Question: How to compute an ASAP/ALAP schedule with ILP?

Page 40: Architecture Synthesis - eim.uni-paderborn.de

Generalized and Optimized Formulation (2)

• sequencing, using d_i := w(vi, β(vi)):
  τ(vj) − τ(vi) ≥ d_i  ∀(vi, vj) ∈ E
• resource constraints (d_i = 1 not required anymore):
  Σ_{i:(vi,rk)∈E_R} Σ_{p=max{0, t−h_i}}^{min{d_i−1, t−l_i}} x_{i,t−p} ≤ α(rk)
  ∀rk ∈ VT, ∀t: min_i{l_i} ≤ t ≤ max_i{h_i}
• cost function:
  min Σ_l l · x_{n,l}

Page 41: Architecture Synthesis - eim.uni-paderborn.de

Generalized and Optimized Formulation (3)

• explanation of the generalized resource constraints (slightly simplified; the more complex summation bounds are used to avoid summing undefined variables):
  Σ_{p=0}^{d_i−1} x_{i,t−p} = 1 for all t with τ(vi) ≤ t ≤ τ(vi) + d_i − 1, and 0 otherwise
• e.g. assume node v1 is executed on a resource with d_1 = 3; then the resource usage caused by v1 in time steps t = 1, 2, 3, … is:
  t = 1: x_{1,1}                      (resource busy if v1 has been started at t = 1)
  t = 2: x_{1,2} + x_{1,1}            (busy if v1 has just been started at t = 2, or still occupied if started at t = 1)
  t = 3: x_{1,3} + x_{1,2} + x_{1,1}  (busy if v1 has just been started at t = 3, or still occupied if started at t = 1 or t = 2)
  t = 4: x_{1,4} + x_{1,3} + x_{1,2}  (busy if v1 has just been started at t = 4, or still occupied if started at t = 2 or t = 3)

✎ Exercise 5.1: Architecture Synthesis
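The sliding-window term can be checked with a few lines of Python; the values below are made up to mirror the d_i = 3 example above:

```python
# Resource usage of a multi-cycle operation: with execution time d, an
# operation occupies its resource in step t iff it started in one of the
# steps t-d+1 .. t, i.e. sum_{p=0}^{d-1} x[t-p] == 1.

di = 3
L = 6
x1 = {t: 0 for t in range(1, L + 1)}
x1[2] = 1  # v1 starts in time step 2

def busy(x, t, d):
    # sum over p = 0 .. d-1 of x[t-p]; out-of-range steps count as 0
    return sum(x.get(t - p, 0) for p in range(d))

print([busy(x1, t, di) for t in range(1, L + 1)])  # -> [0, 1, 1, 1, 0, 0]
```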

Page 42: Architecture Synthesis - eim.uni-paderborn.de

Iterative Scheduling

• most applications execute a dataflow graph not just once but in a loop
  – generation and optimization of periodic schedules
  – execution in pipelined iterations for optimizing performance
• parameters
  – period / iteration interval / initiation interval (P): time between the starts of operation vi in two successive iterations
  – latency (L): time between the start of the first and the last operation within one iteration
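With these parameters, iteration n of an operation simply starts P time steps after iteration n−1. A minimal sketch (the values loosely follow the L = 4 examples below; the helper itself is an assumption):

```python
# In a periodic schedule with period P, iteration n (n = 0, 1, ...) of
# operation vi starts at tau(vi) + n * P.

def start(tau_i, n, P):
    return tau_i + n * P

tau = {"v1": 1, "v2": 2, "v3": 3, "v4": 4}  # L = 4
P = 1                                        # a new iteration every step
print([start(tau["v1"], n, P) for n in range(3)])  # -> [1, 2, 3]
```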

42

[Figure: two periodic schedules of operations v1 … v4 over time steps 1–6 and iterations 1–4; left: L = 4, P = 1; right: L = 4, P = 2]

Page 43: Architecture Synthesis - eim.uni-paderborn.de

Iterative Scheduling

• iterative schedule examples (visualized with Gantt charts)

43

[Excerpt from Teich & Haubelt 2007, Section 4.5 (periodic scheduling problems), translated:]

Fig. 4.17. Scheduling and binding (Gantt chart: resources r1 and r2 over time steps 0–9; v1, v2 on r1; v3, v4 on r2; P = L = 9)

In this schedule the iteration interval is P = 9, i.e., the computation of a new data set repeats every nine time steps. Such a schedule corresponds to sequential processing of successive iterations. In addition to the schedule, Fig. 4.17 shows a fully static binding with one resource of type r1 and one resource of type r2. Such a representation is also called a Gantt chart. One can see that the resource of type r2 is idle during time steps t = 0 to t = 4 of every iteration. With suitable storage of intermediate results, however, it is evidently possible to begin the computation of a new data set, by scheduling v1 on the resource of type r1, at the same time as operation v4 starts on the resource of type r2. In other words, in one iteration the nodes v1(n), v2(n), v3(n) and v4(n−1) are computed. A schedule that exploits this kind of concurrent processing of iterations (pipelining) is shown in Table 4.2. The iteration interval is now P = 7. Figure 4.18 shows the corresponding resource utilization. The schedule is non-overlapping, since no execution interval crosses the iteration-interval boundaries t = 0 and t = 7.

Table 4.2. A schedule with functional pipelining
P = 7;  vi: v1 v2 v3 v4;  β(vi): r1 r1 r2 r2;  ti: 0 4 4 7/0

An example of an overlapping schedule is given below:

Example 4.5.6. Given the problem graph of Fig. 4.14 and the associated resource graph of Fig. 4.16, the iteration interval can be lowered to P = 6 time steps if v3 begins at time step t3 = 4 and ends at time t = 1 of the following execution cycle. Correspondingly, the start of operation v4 moves to t4 = 1. The schedule is shown in Table 4.3, the corresponding resource occupancy in Fig. 4.19. This is an overlapping schedule, since the execution interval of v3 overlaps the iteration-interval boundary P = 6.

Fig. 4.18. Scheduling with functional pipelining (Gantt chart: v1(n), v2(n) on r1; v4(n−1), v3(n) on r2; P = 7)

Table 4.3. An overlapping schedule with functional pipelining
P = 6;  vi: v1 v2 v3 v4;  β(vi): r1 r1 r2 r2;  ti: 0 4 4 7/1

Fig. 4.19. Overlapping scheduling with functional pipelining (Gantt chart over time steps 0–9; P = 6)

– schedule with functional pipelining, i.e. operations from different iterations are processed in one period
– overlapping schedule with functional pipelining, i.e. operations from an iteration may overlap the border of the period P = 6
– sequential processing of successive iterations


Teich & Haubelt 2007

Page 44: Architecture Synthesis - eim.uni-paderborn.de

Overview

• translation of basic blocks
  – formal modeling
    § sequence graph, resource graph
    § cost function, execution times
    § synthesis problems: allocation, binding, scheduling
  – scheduling algorithms
    § ASAP, ALAP
    § extended ASAP, ALAP
    § list scheduling
  – optimal scheduling
    § introduction to integer linear programming (ILP)
    § scheduling with ILP
• translation of complete programs
  – finite state machine and data path
  – micro-coded controllers

44

Page 45: Architecture Synthesis - eim.uni-paderborn.de

Motivation

• translate program/algorithm into dedicated hardware
• units of translation
  – single basic block (combinational)
  – complete program (sequential)

45

[Figure: controller (finite state machine) coupled with a data path (registers and functional units FU)]

• the FSM controls the data path (FSM actions)
• the data path sends feedback to the FSM (FSM conditions)

int gcd(int a, int b) {
    while (a != b) {
        if (a > b) {
            a = a - b;
        } else {
            b = b - a;
        }
    }
    return a;
}

Page 46: Architecture Synthesis - eim.uni-paderborn.de

Translating SW to HW – High-level Synthesis

• basic idea: “execute” the control data flow graph
• control flow graph
  – describes possible execution sequences of basic blocks
  – implement CFG as finite state machine (FSM)
• data flow graph
  – describes arithmetic operations on basic block level
  – implement data flow graphs as data paths
  – use registers for storing and transferring data between basic blocks

46

[Figure: controller (finite state machine) coupled with a data path (registers and functional units FU)]

Page 47: Architecture Synthesis - eim.uni-paderborn.de

Example Program: Greatest Common Divider (GCD)

47

int gcd(int a, int b) {
    while (a != b) {
        if (a > b) {
            a = a - b;
        } else {
            b = b - a;
        }
    }
    return a;
}

Page 48: Architecture Synthesis - eim.uni-paderborn.de

Control Data Flow Graph for GCD

48

generated with Clang/LLVM

BB0:
    br label %BB1

BB1:
    %b1 = phi i32 [ %b2, %BB5 ], [ %b, %BB0 ]
    %a1 = phi i32 [ %a2, %BB5 ], [ %a, %BB0 ]
    br label %BB2

BB2:
    %a2 = phi i32 [ %a1, %BB1 ], [ %a3, %BB4 ]
    %whilecond = icmp eq i32 %b1, %a2
    br i1 %whilecond, label %BB6, label %BB3

BB3:
    %ifcond = icmp sgt i32 %a2, %b1
    br i1 %ifcond, label %BB4, label %BB5

BB4:
    %a3 = sub i32 %a2, %b1
    br label %BB2

BB5:
    %b2 = sub i32 %b1, %a2
    br label %BB1

BB6:
    ret i32 %a2

false

Page 49: Architecture Synthesis - eim.uni-paderborn.de

[Figure: control data flow graph for GCD, as on page 48, annotated:]

Control Data Flow Graph for GCD

49

– data path
– predicates (flags) for conditional jumps (input to controller FSM)
– conditional state transitions (evaluated by controller FSM)
– phi nodes define registers + input multiplexers (other values are temporary)

Page 50: Architecture Synthesis - eim.uni-paderborn.de

Data Path for GCD

50

[Figure: GCD data path: registers a1, a2, b1 with input multiplexers; subtractors and comparators (>, ==) computing ifcond and whilecond from a2 and b1; external inputs a, b]

input signals from controller: sela1, ena1, sela2, ena2, selb1, enb1
output signals to controller: ifcond, whilecond

data path, predicates (flags), phi nodes (registers + input multiplexer)

Page 51: Architecture Synthesis - eim.uni-paderborn.de

Data Path for GCD (Block Diagram)

51

[Figure: GCD data path block diagram; inputs a, b; output result (a2); control inputs sela1, ena1, sela2, ena2, selb1, enb1; status outputs ifcond, whilecond]

Page 52: Architecture Synthesis - eim.uni-paderborn.de

Control Data Flow Graph for GCD

52

[Figure: control data flow graph for GCD, repeated from page 48]

Page 53: Architecture Synthesis - eim.uni-paderborn.de

Control FSM for GCD

53

[Figure: controller FSM with states 0–6 and transitions:
0 → 1 on − / a01;  1 → 2 on − / a12;  2 → 6 on whilecond / a26;  2 → 3 on !whilecond / a23;  3 → 4 on ifcond / a34;  3 → 5 on !ifcond / a35;  4 → 2 on − / a42;  5 → 1 on − / a51]

edge labels: c / a, where c = condition for transition, a = action executed during transition

Page 54: Architecture Synthesis - eim.uni-paderborn.de

Transition Table and Actions

54

current state | condition  | next state | action | sela1 | ena1 | sela2 | ena2 | selb1 | enb1
0             | −          | 1          | a01    | a     | 1    | −     | 0    | b     | 1
1             | −          | 2          | a12    | −     | 0    | a1    | 1    | −     | 0
2             | whilecond  | 6          | a26    | −     | 0    | −     | 0    | −     | 0
2             | !whilecond | 3          | a23    | −     | 0    | −     | 0    | −     | 0
3             | ifcond     | 4          | a34    | −     | 0    | −     | 0    | −     | 0
4             | −          | 2          | a42    | −     | 0    | a3    | 1    | −     | 0
3             | !ifcond    | 5          | a35    | −     | 0    | −     | 0    | −     | 0
5             | −          | 1          | a51    | a2    | 1    | −     | 0    | b2    | 1

(sela1 … enb1 are input signals to the data path)

transition table and data path can easily be translated to a hardware description language (e.g. VHDL or Verilog)
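Before translating to VHDL or Verilog, the transition table can be sanity-checked in software. This Python sketch simulates the controller FSM together with the register updates of the data path; the if/elif encoding is an illustration, not generated code:

```python
# Simulate the GCD controller FSM and data path: registers a1, a2, b1;
# predicates whilecond = (b1 == a2) and ifcond = (a2 > b1).

def gcd_fsm(a, b):
    state, a1, a2, b1 = 0, None, None, None
    while True:
        if state == 0:            # a01: a1 <- a, b1 <- b
            a1, b1, state = a, b, 1
        elif state == 1:          # a12: a2 <- a1
            a2, state = a1, 2
        elif state == 2:          # branch on whilecond = (b1 == a2)
            state = 6 if b1 == a2 else 3
        elif state == 3:          # branch on ifcond = (a2 > b1)
            state = 4 if a2 > b1 else 5
        elif state == 4:          # a42: a2 <- a3 = a2 - b1
            a2, state = a2 - b1, 2
        elif state == 5:          # a51: a1 <- a2, b1 <- b2 = b1 - a2
            a1, b1, state = a2, b1 - a2, 1
        else:                     # state 6: done, result in a2
            return a2

print(gcd_fsm(12, 18))  # -> 6
```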

Page 55: Architecture Synthesis - eim.uni-paderborn.de

Control FSM for GCD (Block Diagram)

55

[Figure: controller FSM block diagram: next-state and output logic with state register; inputs ifcond, whilecond; outputs sela1, ena1, sela2, ena2, selb1, enb1]

Page 56: Architecture Synthesis - eim.uni-paderborn.de

Limitations and Potential Improvements

• limitations of this basic high-level synthesis approach
  – restricted subset of C only (no function calls, memory access, …)
  – limited amount of parallelism (CDFG is executed sequentially)
  – no sharing of operators in data path
  – single-cycle execution model (each basic block is executed in exactly one cycle)
  – FSMs can become very complex
• features of more advanced high-level synthesis methods
  – sharing of operators in data path
  – program transformations to increase parallelism
  – loop pipelining and retiming
  – …
• many tools available
  – commercial: e.g. Impulse C, Mentor Catapult C, AutoESL (now in Xilinx Vivado / SDAccel)
  – free/open source: e.g. ROCCC, SPARK, LegUp, C-to-Verilog

56

Page 57: Architecture Synthesis - eim.uni-paderborn.de

Case Study: GCD with Xilinx SDAccel

• express gcd function as OpenCL kernel

• build FPGA design:
  xocc -t sw_emu --report estimate --platform xilinx:adm-pcie-7v3:1ddr:3.0 -s src-cl/gcd.cl -o bin-cl/gcd.sw_emu.xclbin


Page 58: Architecture Synthesis - eim.uni-paderborn.de

GCD with Xilinx SDAccel: FSM

• FSM with 136 states
• one-hot coding of states
• need to interface to communication bus + memory

[Screenshot of generated Verilog: current state / next state registers, state constants such as 'd1, 'd2]

Page 59: Architecture Synthesis - eim.uni-paderborn.de

GCD with Xilinx SDAccel: Data Path

• two data registers (no duplicated a1 and a2)

• predicates whilecond and ifcond

Page 60: Architecture Synthesis - eim.uni-paderborn.de

GCD with Xilinx SDAccel: Arithmetic in Data Path

• one subtraction or two?
• both sides of the condition are evaluated in parallel
• another pair of multiplexers enables or disables the result
• ifcond is not used by the FSM, but internally in the data path

✎ Example: GCD data path with SDAccel

Page 61: Architecture Synthesis - eim.uni-paderborn.de

Microprogrammed Architectures

• two ends of the spectrum for implementing applications
  – CPU: generic fully programmable architecture, application can be easily varied after fabrication
  – ASIC/FPGA: highly specialized, application is fixed at design time

• middle ground: microprogrammed architectures
  – also tailored to a particular application or class of applications
  – but the controller is programmable (instead of a fixed FSM)

Page 62: Architecture Synthesis - eim.uni-paderborn.de

FSMD vs. Microprogrammed Architecture

(Schaumont 2010, Chapter 5, Microprogrammed Architectures)

In this chapter, we will investigate the microprogramming technique and learn how it can be used to provide flexibility to hardware circuits while maintaining full customizability.

5.2 Microprogrammed Control

Figure 5.3 shows a microprogrammed machine next to an FSMD. The fundamental idea in microprogramming is to replace the next-state logic of the finite state machine with a programmable memory, called the control store. The control store holds microinstructions, and is addressed using a register called CSAR (Control Store Address Register). The CSAR is equivalent to a program counter in microprocessors. The next value of CSAR is determined by the next-address logic, using the current value of CSAR, the current microinstruction and the value of status flags evaluated by the datapath. The default next value of the CSAR corresponds to the previous CSAR value incremented by one. This way, the control store will return a stream of microinstructions from sequential memory locations. However, the next-address logic can also implement conditional jumps or immediate jumps.

Thus, the next-address logic, the CSAR, and the control store implement the equivalent of an instruction-fetch cycle in a microprocessor. In the design of Fig. 5.3, each microinstruction takes a single clock cycle to execute. Within a single clock cycle, the following activities occur:

• The CSAR provides an address to the control store, which retrieves a microinstruction. The microinstruction is split into two parts: a command field and a jump field. The command field serves as a command for the datapath. The jump field goes to the next-address logic.
• The datapath executes the command encoded in the microinstruction, and returns status information to the next-address logic.

[Fig. 5.3: an FSMD (next-state logic, state register, datapath with status feedback) next to a micro-programmed machine (control store, CSAR, next-address logic; the microinstruction splits into a jump field and a datapath command field). Caption: In contrast to FSM-based control, microprogramming uses a flexible control scheme.]

Schaumont 2010

left: finite state machine with data path (FSMD), fixed control

right: microprogrammed architecture


programmable control; CSAR (control store address register) corresponds to the instruction counter in a CPU

Page 63: Architecture Synthesis - eim.uni-paderborn.de

Microinstruction Encoding

• microinstruction specifies
  – commands for data path
  – jump field

• jump field
  – how to compute the address of the next microinstruction based on feedback from the data path
  – optional address constant

• address
  – absolute address of microinstruction
  – here: address field width is 12 bits, hence 4096 microinstructions can be addressed

[Fig. 5.4 (Schaumont 2010): Sample format for a 32-bit microinstruction word.
 Command field: datapath command; jump field: next (4 bits) + address (12 bits);
 flags cf + zf feed from the datapath into the next-address logic, which drives
 the next CSAR value. The next field selects:

   0000   CSAR = CSAR + 1                   (default)
   0001   CSAR = address                    (jump)
   0010   CSAR = cf ? address : CSAR + 1    (jump if carry)
   1010   CSAR = cf ? CSAR + 1 : address    (jump if no carry)
   0100   CSAR = zf ? address : CSAR + 1    (jump if zero)
   1100   CSAR = zf ? CSAR + 1 : address    (jump if not zero)]

Let us first consider the part for the next-address logic. The address field holds an absolute target address, pointing to a location in the control store. In this case, the address is 12 bits, which means that this microinstruction format would be suited for a control store with 4,096 locations. The next field encodes the operation that will lead to the next value of CSAR. The default operation is, as discussed earlier, to increment CSAR. For such instructions, the address field remains unused. When next has values different from the default one, various jump instructions can be encoded. An absolute jump will transfer the value of the address field into CSAR. A conditional jump will use the value of a flag to conditionally update the CSAR or else increment it. Obviously, the format as shown is quite bulky and may consume a large amount of storage. For example, the address field is only used for jump instructions. In an average microprocessor program, about 1 instruction out of 5 is a jump. Therefore, if the microprogram contains only a few jump instructions, then the storage for the address field is wasted. To avoid this, we will need to optimize the microinstruction format. For example, when microinstructions are not jumps, the bits used for the address field could be given a different purpose.

Schaumont 2010

Page 64: Architecture Synthesis - eim.uni-paderborn.de

Command Field – Horizontal vs. Vertical Microcode

• tradeoff between code density and decoding effort

• horizontal microcode
  – microinstruction directly contains bits to control data path
  – no decoding required
  – low code density

• vertical microcode
  – microinstructions contain encoded form of data path control signals
  – decoding required
  – higher code density

Schaumont 2010

5.3.2 Command Field

The design of the datapath command format reveals another interesting trade-off: we can either opt for a very wide microinstruction word, or else we can prefer a narrow microinstruction word. A wide microinstruction word allows each control bit of the datapath to be stored separately. A narrow microinstruction word, on the other hand, will require the creation of "symbolic instructions", which are encoded groups of control bits for the datapath. The FSMD model relies on such symbolic instructions. Each of the above approaches has a specific name. Horizontal microinstructions use no encoding at all. They represent each control bit in the datapath with a separate bit in the microinstruction format. Vertical microinstructions on the other hand encode the control bits for the datapath as much as possible. A few bits of the microinstruction can define the value of many more control bits in the datapath.

Figure 5.5 demonstrates an example of vertical and horizontal microinstructions in the datapath. We wish to create a microprogrammed machine with three instructions on a single register a. The three instructions do one of the following: double the value in a, decrement the value in a, or initialize the value in a. The datapath shown on the bottom of Fig. 5.5 contains two multiplexers and a programmable adder/subtractor. It can be easily verified that each of the instructions enumerated above can be implemented as a combination of control bit values for each multiplexer and for the adder/subtractor. The controller on top shows two possible encodings for the three instructions: a horizontal encoding and a vertical encoding.

[Fig. 5.5 (Schaumont 2010): Example of vertical versus horizontal microprogramming.
 Micro-programmed controller (with decoder for the vertical case) on top; datapath
 below with register a, input IN, two multiplexers (sel1, sel2) and a programmable
 adder/subtractor (alu). Instruction encodings as read from the figure:

   instruction   horizontal microcode (sel1 sel2 alu)   vertical microcode
   a = 2 * a     000                                    00
   a = a - 1     110                                    10
   a = IN        001                                    01]

Page 65: Architecture Synthesis - eim.uni-paderborn.de

Example: A Microcoded Datapath

Schaumont 2010

5.4 The Microprogrammed Datapath

The datapath of a microprogrammed machine consists of three elements: computation units, storage, and communication busses. Each of these may contribute a few control bits to the microinstruction word. For example, multi-function computation units have selection bits that determine their specific function, storage units have address bits and read/write command bits, and communication busses have source/destination control bits. The datapath may also generate condition flags for the microprogrammed controller. Typically, these will be generated by the computation units.

5.4.1 Datapath Architecture

Figure 5.7 illustrates a microprogrammed controller with a datapath attached. The datapath includes an ALU with shifter unit, a register file with 8 entries, an accumulator register, and an input port.

The microinstruction word is shown on top of Fig. 5.7 and contains 6 fields. Two fields, Nxt and Address, are used by the microprogrammed controller. The others are used by the datapath. The type of encoding is mixed horizontal/vertical: the overall machine uses a horizontal encoding, i.e. each module of the machine is controlled independently. The submodules within the machine, on the other hand, use a vertical encoding. For example, the ALU field contains 4 bits. In this case, the ALU component in the datapath will execute up to 16 different commands.

The machine completes a single instruction per clock cycle. The ALU combines an operand from the accumulator register with an operand from the register file or the input port. The result of the operation is returned to the register file or the accumulator.

[Fig. 5.7 (Schaumont 2010): A microprogrammed datapath. The microinstruction fields
 (unused bit, SBUS, ALU, Shifter, Dest, Nxt, Address) control a register file, an ACC
 register, an ALU with shifter, and an input port; flags feed back into the next-address
 logic of the control store / CSAR.]

• same microinstruction format as in previous example
• 4 data path units to be controlled
• SBUS: select operand
• ALU: choose ALU operation
• Shifter: optional bit shift of ALU result
• Dest: select location for storing the result

Page 66: Architecture Synthesis - eim.uni-paderborn.de

Example: Microinstruction Encoding

Schaumont 2010

Table 5.1 Microinstruction encoding of the example machine

Field    Width  Encoding
SBUS     4      Selects the operand that will drive the S-Bus
                  0000 R0      0101 R5
                  0001 R1      0110 R6
                  0010 R2      0111 R7
                  0011 R3      1000 Input
                  0100 R4      1001 Address/Constant
ALU      4      Selects the operation performed by the ALU
                  0000 ACC           0110 ACC | S-Bus
                  0001 S-Bus         0111 not S-Bus
                  0010 ACC + S-Bus   1000 ACC + 1
                  0011 ACC - S-Bus   1001 S-Bus - 1
                  0100 S-Bus - ACC   1010 0
                  0101 ACC & S-Bus   1011 1
Shifter  3      Selects the function of the programmable shifter
                  000 logical SHL(ALU)   100 arith SHL(ALU)
                  001 logical SHR(ALU)   101 arith SHR(ALU)
                  010 rotate left ALU    111 ALU
                  011 rotate right ALU
Dest     4      Selects the target that will store S-Bus
                  0000 R0      0101 R5
                  0001 R1      0110 R6
                  0010 R2      0111 R7
                  0011 R3      1000 ACC
                  0100 R4      1111 unconnected
Nxt      4      Selects the next value for CSAR
                  0000 CSAR + 1                   1010 cf ? CSAR + 1 : Address
                  0001 Address                    0100 zf ? Address : CSAR + 1
                  0010 cf ? Address : CSAR + 1    1100 zf ? CSAR + 1 : Address

As an example, let us develop a microprogram that reads two numbers from the input port and evaluates their greatest common divisor (GCD) using Euclid's algorithm. The first step is to develop a microprogram in terms of register transfers. A possible approach is shown in Listing 5.1. Lines 2 and 3 in this program read in two values from the input port and store these values in registers R0 and ACC. At the end of the program, the resulting GCD will be available in either ACC or R0, and the program will continue until both values are equal. The stop test is implemented in line 4, using a subtraction of two registers and a conditional jump based on the zero flag. Assuming both registers contain different values, the program will continue to subtract the smaller register from the larger one. This requires finding which of R0 and ACC is bigger, and it is implemented with a conditional jump in line 5. The bigger-than test is implemented using a subtraction, a left shift, and a test on the resulting carry flag. If the carry flag is set, then the most significant bit of the subtraction would be one, indicating a negative result in two's complement

available microinstructions and their encoding

Page 67: Architecture Synthesis - eim.uni-paderborn.de

Example: How to Encode a Microinstruction

Schaumont 2010

how to encode "ACC ← R2"?

RT-level instruction:             ACC <- R2

Micro-instruction field encoding: SBUS = 0010 (R2), ALU = 0001 (S-Bus),
                                  Shifter = 111 (ALU), Dest = 1000 (ACC),
                                  Nxt = 0000 (default), Address = 000000000000

Micro-instruction formation:      {0, 0010, 0001, 111, 1000, 0000, 000000000000}
                                = {0001, 0000, 1111, 1000, 0000, 0000, 0000, 0000}

Micro-instruction encoding:       0x10F80000

Fig. 5.8 Forming microinstructions from register-transfer instructions

Listing 5.1 Micro-program to evaluate a GCD
1  ;       Command Field          || Jump Field
2          IN -> R0
3          IN -> ACC
4  Lcheck: (R0 - ACC)             || JUMP_IF_Z Ldone
5          (R0 - ACC) << 1        || JUMP_IF_C Lsmall
6          R0 - ACC -> R0         || JUMP Lcheck
7  Lsmall: ACC - R0 -> ACC        || JUMP Lcheck
8  Ldone:                         || JUMP Ldone

logic. This conditional jump-if-carry will be taken if R0 is smaller than ACC. The combination of lines 5, 6, and 7 shows how an if-then-else statement can be created using multiple conditional and unconditional jump instructions. When the program is complete, in line 8, it iterates in an infinite loop. Depending on how this microprogram is integrated into a bigger program, this infinite loop would have to be replaced with an appropriate control transfer to the next task.

5.5 Implementing a Microprogrammed Machine

In this section, we discuss a sample implementation of a microprogrammed machine in the GEZEL language. This can be used as a template for other implementations.

5.5.1 Microinstruction Word Definition

A convenient starting point in the design of a microprogrammed machine is the definition of the microinstruction. This includes the allocation of microinstruction control bits and the definition of the meaning of relevant bit-patterns.

Page 68: Architecture Synthesis - eim.uni-paderborn.de

Example: Implementing GCD on this Architecture

Schaumont 2010

 1: int gcd(int a, int b) {
 2:   while(a != b) {
 3:     if(a > b) {
 4:       a = a - b;
 5:     } else {
 6:       b = b - a;
 7:     }
 8:   }
 9:   return a;
10: }

;   Command Field              || Jump Field
; --------------------------------------------------------------------------
1:         IN -> R0                                ; read a, store in R0
2:         IN -> ACC                               ; read b, store in ACC
3: Lcheck: R0 - ACC            || JUMP_IF_Z Ldone  ; check while condition
4:         (R0 - ACC) << 1     || JUMP_IF_C Lsmall ; check whether R0 < ACC, ..
5:         R0 - ACC -> R0      || JUMP Lcheck      ; .. if not: R0 - ACC -> R0
6: Lsmall: ACC - R0 -> ACC     || JUMP Lcheck      ; .. if so: ACC - R0 -> ACC
7: Ldone:                      || JUMP Ldone       ; infinite loop, end of prog

Page 69: Architecture Synthesis - eim.uni-paderborn.de

Changes

• 2017-06-25 (v1.6.1)
  – add case study of GCD synthesis with Xilinx SDAccel

• 2017-06-11 (v1.6.0)
  – updated for SS2017

• 2016-07-10 (v1.5.2)
  – fix more equations that have been corrupted by PowerPoint update

• 2016-06-06 (v1.5.1)
  – change University name on title slide

• 2016-05-27 (v1.5.0)
  – updated for SS2016, fix broken fonts for PowerPoint 15

• 2015-05-28 (v1.4.0)
  – updated for SS2015

• 2014-05-22 (v1.3.1)
  – explain on slide 41 that di=1 no longer required


Page 70: Architecture Synthesis - eim.uni-paderborn.de

Changes

• 2014-05-06 (v1.3.0)
  – updated for SS2014

• 2013-06-18 (v1.2.4)
  – fixed label of basic blocks B4 and B5 also on p.52

• 2013-06-06 (v1.2.3)
  – cosmetic changes
  – fixed index in equation on p.41
  – fixed label of basic blocks B4 and B5 on p.48 + p.49

• 2013-05-23 (v1.2.2)
  – clarified assumption of unit delay on p.33

• 2013-05-16 (v1.2.1)
  – fix typo on slide 19, terminate algorithm when v0 is scheduled (not vn)

• 2013-05-13 (v1.2.0)
  – updated for SS2013, merged all architecture synthesis materials into a single presentation
