Architecture Synthesis - eim.uni-paderborn.de
Architecture Synthesis
Dr. Tobias Kenter, Prof. Dr. Christian Plessl
SS 2017
High-Performance IT Systems Group, Paderborn University
Version 1.6.1 – 2017-06-25
Motivation
• translate program/algorithm into dedicated hardware
• units of translation
– single basic block (combinational)
– complete program (sequential)
[Figure: controller (finite state machine) and data path (registers, functional units FU)]
• FSM controls the data path (FSM actions)
• data path sends feedback to FSM (FSM conditions)
int gcd(int a, int b) {
    while (a != b) {
        if (a > b) {
            a = a - b;
        } else {
            b = b - a;
        }
    }
    return a;
}
Overview
• translation of basic blocks
– formal modeling
§ sequence graph, resource graph
§ cost function, execution times
§ synthesis problems: allocation, binding, scheduling
– scheduling algorithms
§ ASAP, ALAP
§ extended ASAP, ALAP
§ list scheduling
– optimal scheduling
§ introduction to integer linear programming (ILP)
§ scheduling with ILP
• translation of complete programs
– finite state machine and data path
– micro-coded controllers
Sequencing Graph

[Figure: sequencing graph GS(VS, ES) with VS = {v0, v1, …, v12}; v0 and v12 are NOP source/sink nodes; the operations are multiplications (v1, v2, v3, v4, v6, v7), additions (v5, v8), a comparison (v9), and subtractions (v10, v11)]
Excursus: Source of the example, relation to DFG
Resource Graph
• GR(VR, ER)
– set of nodes VR = VS ⋃ VT
§ VS are the nodes of the sequencing graph (without NOPs)
§ VT represent resource types (adder, multiplier, ALU, …)
– set of edges (vS, vT) ∈ ER with vS ∈ VS, vT ∈ VT
§ an instance of resource type vT can be used to implement operation vS
Resource Graph – Example

[Figure: resource graph with VT = {vMUL, vALU}; the multiplications v1, v2, v3, v4, v6, v7 can be implemented on vMUL; the operations v5 (+), v8 (+), v9 (<), v10 (-), v11 (-) on vALU]
Cost, Execution Time
• cost function c: VT → Z
– assigns a cost value to each resource type
– example: c(vMUL) = 8, c(vALU) = 4
• execution times w: ER → Z+
– assigns the execution time of operation vS ∈ VS on resource type vT ∈ VT to the edge (vS, vT) ∈ ER
– example: w(v2, vMUL) = 1
Allocation, Binding
• allocation α: VT → Z+
– assigns a number α(vT) of available instances to each resource type vT
• binding is given by the two functions β: VS → VT and 𝛾: VS → Z+
– β(vS) = vT means that operation vS is implemented by resource type vT (possible β's are shown in the resource graph)
– 𝛾(vS) = r denotes that vS is implemented by the r-th instance of vT; r ≤ α(vT)
Allocation, Binding – Example (1)
α(vMUL) = 3, c(vMUL) = 8
β(v1) = vMUL, 𝛾(v1) = 1, w(v1, vMUL) = 1
β(v2) = vMUL, 𝛾(v2) = 2, w(v2, vMUL) = 1
β(v3) = vMUL, 𝛾(v3) = 3, w(v3, vMUL) = 1
β(v4) = vMUL, 𝛾(v4) = 1, w(v4, vMUL) = 1
β(v6) = vMUL, 𝛾(v6) = 2, w(v6, vMUL) = 1
β(v7) = vMUL, 𝛾(v7) = 3, w(v7, vMUL) = 1
Allocation, Binding – Example (2)
α(vALU) = 2, c(vALU) = 4
β(v5) = vALU, 𝛾(v5) = 1, w(v5, vALU) = 1
β(v8) = vALU, 𝛾(v8) = 1, w(v8, vALU) = 1
β(v9) = vALU, 𝛾(v9) = 1, w(v9, vALU) = 1
β(v10) = vALU, 𝛾(v10) = 2, w(v10, vALU) = 1
β(v11) = vALU, 𝛾(v11) = 1, w(v11, vALU) = 1
Schedule
• schedule 𝜏: VS → Z+
– assigns a start time to each operation under the constraint
𝜏(vj) - 𝜏(vi) ≥ w(vi, β(vi)) ∀(vi, vj) ∈ ES
• latency L of a scheduled sequencing graph
– difference in start times between end node and start node
L = 𝜏(vn) - 𝜏(v0)
Schedule - Example

[Figure: a schedule of the sequencing graph over time steps 1-5; L = 𝜏(v12) - 𝜏(v0) = 5 - 1 = 4]
Synthesis
• allocation, binding, scheduling
• finding (α, β, 𝛾, 𝜏) that optimize latency and cost under resource and timing constraints
– algorithms for architecture synthesis are discussed in, e.g.
§ J. Teich, C. Haubelt, Digitale Hardware/Software-Systeme, Springer 2007
§ G. De Micheli, Synthesis and Optimization of Digital Circuits, McGraw-Hill 1994
– synthesis problem variants
§ multicycle operations, operator chaining
§ several possible resource types for an operation
§ iterative schedules, pipelining
• in the following
– scheduling without resource constraints
§ ASAP, ALAP
– scheduling under resource constraints
§ extended ASAP, ALAP
§ list scheduling
Scheduling without Resource Constraints
• ASAP (as soon as possible)
– determines the earliest possible start times for the operations
– minimal latency
• ALAP (as late as possible)
– determines the latest possible start times for the operations under a given latency bound
• slack (mobility) of operations
– difference of start times: (ALAP with ASAP latency bound) - ASAP
– if slack = 0 → operation is on the critical path
ASAP Scheduling
• ASAP: as soon as possible scheduling
• algorithm

ASAP( GS(V, E) ) {
    schedule v0 by setting 𝜏(v0) = 1
    repeat {
        select a vertex vj whose predecessors are all scheduled
        schedule vj by setting 𝜏(vj) = max over i: (vi, vj) ∈ E of ( 𝜏(vi) + w(vi, β(vi)) )
    } until (vn is scheduled)
    return 𝜏
}
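The ASAP algorithm above can be sketched in a few lines of Python (the slides give only pseudocode, so Python is chosen here for illustration). The edge list is transcribed from the sequencing graph of slide 5 and the LP model later in this deck; giving the NOP nodes v0 and v12 weight 0 is an assumption consistent with the latency definition L = 𝜏(vn) - 𝜏(v0).

```python
# ASAP scheduling sketch for the example sequencing graph (all real
# operations take one time step; NOP nodes v0 and v12 take zero).
EDGES = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5),      # v0 -> source operations
         (1, 6), (2, 6), (6, 10), (10, 11), (11, 12),
         (3, 7), (7, 11), (4, 8), (8, 12), (5, 9), (9, 12)]
W = {i: 1 for i in range(1, 12)}
W[0] = W[12] = 0                                      # NOP nodes

def asap(edges, w, v0=0, vn=12):
    preds = {v: [u for (u, x) in edges if x == v] for v in w}
    tau = {v0: 1}                                     # schedule v0 at time 1
    while vn not in tau:
        for v in w:                                   # pick a vertex whose
            if v in tau or any(p not in tau for p in preds[v]):
                continue                              # predecessors are all scheduled
            tau[v] = max(tau[p] + w[p] for p in preds[v])
    return tau

tau = asap(EDGES, W)
print(tau, "latency:", tau[12] - tau[0])              # minimal latency: 4
```

Running the sketch reproduces the ASAP example of the next slide: v1-v5 start at time 1, v6-v9 at time 2, v10 at time 3, v11 at time 4.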
ASAP Example

[Figure: ASAP schedule; time 1: v1, v2, v3, v4, v5; time 2: v6, v7, v8, v9; time 3: v10; time 4: v11; minimal latency: L = 4]
ALAP Scheduling
• ALAP: as late as possible scheduling
• requires a latency bound L
– otherwise nodes could be arbitrarily delayed
– typically the schedule length of the ASAP schedule is used as latency bound
• algorithm

ALAP( GS(V, E), L ) {
    schedule vn by setting 𝜏(vn) = L + 1
    repeat {
        select a vertex vi whose successors are all scheduled
        schedule vi by setting 𝜏(vi) = min over j: (vi, vj) ∈ E of ( 𝜏(vj) - w(vi, β(vi)) )
    } until (v0 is scheduled)
    return 𝜏
}
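The ALAP algorithm mirrors ASAP with the edge directions reversed. A minimal Python sketch (same hypothetical edge list and unit weights as in the ASAP sketch, NOPs weighted 0), which also computes the slack (mobility) against the ASAP start times:

```python
# ALAP sketch with latency bound L = 4 (the ASAP latency of this graph);
# the slack tau_ALAP - tau_ASAP gives each node's mobility.
EDGES = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5),
         (1, 6), (2, 6), (6, 10), (10, 11), (11, 12),
         (3, 7), (7, 11), (4, 8), (8, 12), (5, 9), (9, 12)]
W = {i: 1 for i in range(1, 12)}
W[0] = W[12] = 0

def alap(edges, w, L, v0=0, vn=12):
    succs = {v: [x for (u, x) in edges if u == v] for v in w}
    tau = {vn: L + 1}                       # schedule vn at time L + 1
    while v0 not in tau:
        for v in w:
            if v in tau or any(s not in tau for s in succs[v]):
                continue                    # successors must all be scheduled
            tau[v] = min(tau[s] for s in succs[v]) - w[v]
    return tau

tau_alap = alap(EDGES, W, L=4)
# ASAP start times from the previous example
tau_asap = {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1,
            6: 2, 7: 2, 8: 2, 9: 2, 10: 3, 11: 4, 12: 5}
slack = {v: tau_alap[v] - tau_asap[v] for v in range(1, 12)}
print(slack)   # 0 for v1, v2, v6, v10, v11 (critical path)
```

The computed slack matches the mobility slide: 0 for v1, v2, v6, v10, v11, 1 for v3 and v7, and 2 for v4, v5, v8, v9.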
ALAP Example

[Figure: ALAP schedule with latency bound L = 4; time 1: v1, v2; time 2: v3, v6; time 3: v4, v5, v7, v10; time 4: v8, v9, v11]
Slack (mobility)

[Figure: slack values obtained from the ASAP and ALAP schedules; v1, v2, v6, v10, v11: slack 0 (critical path); v3, v7: slack 1; v4, v5, v8, v9: slack 2]
Scheduling under Resource Constraints (1)
• extended ASAP, ALAP
– first ASAP or ALAP
– then move operations down (ASAP) or up (ALAP) until resource constraints are satisfied
Extended ALAP

[Figure: extended ALAP schedule under resource constraints α(vMUL) = 2, α(vALU) = 2 and latency bound L = 4]
Scheduling under Resource Constraints (2)
• list scheduling
– operations are prioritized according to some criterion, e.g. slack

time = 1
repeat {
    for each resource type (vT, α(vT))
        determine all ready operations vS with β(vS) = vT and schedule the ones with the highest priority (at most α(vT) per time step)
    time++
} until (vn is scheduled)
List Scheduling (1)
• example
– criterion: number of successor nodes
– resource constraints: α(vMUL) = 1, α(vALU) = 1

[Figure: priorities derived from the number of (transitive) successor operations; v1: 3, v2: 3, v3: 2, v6: 2, v4: 1, v5: 1, v7: 1, v10: 1, v8: 0, v9: 0, v11: 0]
List Scheduling (2)

[Figure: resulting schedule over 7 time steps, at most one multiplication and one ALU operation per step; L = 𝜏(v12) - 𝜏(v0) = 8 - 1 = 7]

✎ Exercise 5.1: Architecture Synthesis
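The list-scheduling example can be reproduced with a short Python sketch (illustrative only; the edge list, operation-to-type mapping, and tie-breaking by node index are assumptions transcribed from the example slides):

```python
# List scheduling with priority = number of transitive successor operations,
# resource constraints alpha(MUL) = alpha(ALU) = 1, one-cycle operations.
EDGES = [(1, 6), (2, 6), (6, 10), (10, 11), (3, 7), (7, 11),
         (4, 8), (5, 9)]                     # edges between real operations
MUL, ALU = {1, 2, 3, 4, 6, 7}, {5, 8, 9, 10, 11}
ALPHA = {"MUL": 1, "ALU": 1}

def successors(v, edges):                    # transitive successor set
    direct = {j for (i, j) in edges if i == v}
    out = set(direct)
    for d in direct:
        out |= successors(d, edges)
    return out

PRIO = {v: len(successors(v, EDGES)) for v in MUL | ALU}

done, start, time = {}, {}, 1
while len(done) < 11:
    for rtype, ops in (("MUL", MUL), ("ALU", ALU)):
        ready = [v for v in ops if v not in start
                 and all(i in done for (i, j) in EDGES if j == v)]
        ready.sort(key=lambda v: (-PRIO[v], v))  # highest priority first
        for v in ready[:ALPHA[rtype]]:
            start[v] = time
    # one-cycle operations started in this step finish at its end
    done.update({v: t for v, t in start.items() if t <= time})
    time += 1

tau12 = max(start[v] for v in (8, 9, 11)) + 1    # NOP v12 after last real op
latency = tau12 - 1                              # tau(v0) = 1
print(start, "latency:", latency)                # latency: 7
```

The computed priorities match the slide (3, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0), and the schedule needs 7 time steps, matching L = 7.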
ILP Basics (1)
• Linear Program (LP)
– a special kind of optimization problem
– real variables: x ∈ R^n
– linear constraints modeled by (in)equations: A · x ≤ b, x ≥ 0 with A ∈ R^(m×n), b ∈ R^m
– linear cost function: max c^T · x with c ∈ R^n
• Integer Linear Program (ILP)
– LP where all variables must be integer: x ∈ Z^n
• 0-1 LP (binary LP, ZOLP)
– LP where all variables must be binary: x ∈ {0,1}^n
• Mixed Integer Linear Program (MILP)
– LP where some variables must be integer
ILP Basics (2)
• example
– 3 binary variables: x1, x2, x3 ∈ {0,1}
– constraint that at least two variables must be set to 1: x1 + x2 + x3 ≥ 2
– minimize cost function: min 5 x1 + 6 x2 + 4 x3
• solution

x1 x2 x3 | cost
 0  1  1 |  10
 1  0  1 |   9  ← optimum
 1  1  0 |  11
 1  1  1 |  15
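For a problem this small the solution table can be checked by brute-force enumeration; with three binary variables there are only 2^3 = 8 candidate assignments. A quick Python sketch:

```python
# Enumerate all 0/1 assignments, keep the feasible ones, and minimize the cost.
from itertools import product

best = min(
    (5*x1 + 6*x2 + 4*x3, (x1, x2, x3))
    for x1, x2, x3 in product((0, 1), repeat=3)
    if x1 + x2 + x3 >= 2        # constraint: at least two variables set to 1
)
print(best)                     # (9, (1, 0, 1))
```

Real ILP solvers avoid this exponential enumeration via branch-and-bound, as discussed on the next slide.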
ILP Basics (3)
• solving ILPs
– ILPs are NP-complete, exponential runtime in the worst case
– solved by branch-and-bound algorithms to optimality (runtime depends on the efficiency of the model)
– popular solvers
§ CPLEX (commercial) http://www.ibm.com/software/commerce/optimization/cplex-optimizer/
§ Gurobi (commercial, free academic license) http://www.gurobi.com/
§ lp_solve (open source) http://lpsolve.sourceforge.net/5.5/
• challenges
– translate the problem into an efficient ILP model
– use only linear constraints and cost function
§ non-linearities of the problem can (possibly) be modeled by several linear constraints, but this increases the complexity of the ILP
Example: Optimal DFG Scheduling

[Figure: sequencing graph GS(VS, ES) with VS = {v0, v1, …, v12}, as on the Sequencing Graph slide]

• all operations take one time step
• resource constraints:
– 2 MUL
– 2 ALU (+, -, <)
→ find a schedule with minimal latency
• alternative problem: minimize resources under a latency constraint
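Before formulating the ILP, the instance is small enough to sanity-check by exhaustive search. The sketch below (Python, for illustration; windows and edges are transcribed from the ASAP/ALAP examples) enumerates start times inside each operation's mobility window for latency bound L = 4 and tests precedence plus the 2-MUL/2-ALU resource limits. Since 4 is already the unconstrained ASAP latency, any feasible schedule found is optimal.

```python
# Brute-force feasibility check for the optimal-scheduling instance:
# 2 multipliers, 2 ALUs, all operations take one time step, L = 4.
from itertools import product

EDGES = [(1, 6), (2, 6), (6, 10), (10, 11), (3, 7), (7, 11), (4, 8), (5, 9)]
MUL, ALU = {1, 2, 3, 4, 6, 7}, {5, 8, 9, 10, 11}
# ASAP..ALAP mobility windows for latency bound L = 4
WIN = {1: [1], 2: [1], 3: [1, 2], 4: [1, 2, 3], 5: [1, 2, 3], 6: [2],
       7: [2, 3], 8: [2, 3, 4], 9: [2, 3, 4], 10: [3], 11: [4]}

def feasible(tau):
    if any(tau[j] < tau[i] + 1 for i, j in EDGES):
        return False                              # precedence violated
    for t in range(1, 5):
        if sum(tau[v] == t for v in MUL) > 2:     # at most 2 multipliers
            return False
        if sum(tau[v] == t for v in ALU) > 2:     # at most 2 ALUs
            return False
    return True

nodes = sorted(WIN)
schedules = []
for ts in product(*(WIN[v] for v in nodes)):
    tau = dict(zip(nodes, ts))
    if feasible(tau):
        schedules.append(tau)
print(len(schedules) > 0)      # a latency-4 schedule exists
```

The search space here is only a few hundred combinations; the ILP formulation that follows scales to instances where such enumeration is hopeless.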
ILP for Optimal DFG Scheduling (1)
• binary variables
– x_{i,l} = 1 iff operation vi starts in time step l, for i = 1, …, n and l = 1, …, L
→ we need to define a latency bound L
• uniqueness constraints
– each operation starts in exactly one time step
Σ_{l=1..L} x_{i,l} = 1, ∀i = 1, …, n
• starting time
– the starting time 𝜏(vi) of operation vi can be expressed as
𝜏(vi) = Σ_{l=1..L} l · x_{i,l}
ILP for Optimal DFG Scheduling (2)
• simple resource constraints (execution time for operations = 1)
– in each time step, use no more resources than available
Σ_{i: vi is a MUL operation} x_{i,l} ≤ α(vMUL), ∀l = 1, …, L
Σ_{i: vi is an ALU operation} x_{i,l} ≤ α(vALU), ∀l = 1, …, L
• sequencing constraints
– an operation cannot start earlier than the finishing time of its predecessors
Σ_{l} l · x_{j,l} ≥ w(vi, β(vi)) + Σ_{l} l · x_{i,l}, ∀ edges (vi, vj)
• restriction: w(vi, β(vi)) = 1
– required for these simple resource constraints
– used in the sequencing constraints of the upcoming example
ILP for Optimal DFG Scheduling (3)
• cost function
– minimize the start time of the last operation
min 𝜏(vn) = min Σ_{l} l · x_{n,l}
Example: Scheduling of DFG (1)
problem description in LP format, latency bound = 7
/* integer variables */
int x1_1; int x1_2; int x1_3; int x1_4; int x1_5; int x1_6; int x1_7;
int x2_1; int x2_2; int x2_3; int x2_4; int x2_5; int x2_6; int x2_7;
int x3_1; int x3_2; int x3_3; int x3_4; int x3_5; int x3_6; int x3_7;
int x4_1; int x4_2; int x4_3; int x4_4; int x4_5; int x4_6; int x4_7;
int x5_1; int x5_2; int x5_3; int x5_4; int x5_5; int x5_6; int x5_7;
int x6_1; int x6_2; int x6_3; int x6_4; int x6_5; int x6_6; int x6_7;
int x7_1; int x7_2; int x7_3; int x7_4; int x7_5; int x7_6; int x7_7;
int x8_1; int x8_2; int x8_3; int x8_4; int x8_5; int x8_6; int x8_7;
int x9_1; int x9_2; int x9_3; int x9_4; int x9_5; int x9_6; int x9_7;
int x10_1; int x10_2; int x10_3; int x10_4; int x10_5; int x10_6; int x10_7;
int x11_1; int x11_2; int x11_3; int x11_4; int x11_5; int x11_6; int x11_7;
int x12_1; int x12_2; int x12_3; int x12_4; int x12_5; int x12_6; int x12_7;
/* variables in {0,1} */
x1_1 <= 1; x1_2 <= 1; x1_3 <= 1; x1_4 <= 1; x1_5 <= 1; x1_6 <= 1; x1_7 <= 1;
x2_1 <= 1; x2_2 <= 1; x2_3 <= 1; x2_4 <= 1; x2_5 <= 1; x2_6 <= 1; x2_7 <= 1;
x3_1 <= 1; x3_2 <= 1; x3_3 <= 1; x3_4 <= 1; x3_5 <= 1; x3_6 <= 1; x3_7 <= 1;
x4_1 <= 1; x4_2 <= 1; x4_3 <= 1; x4_4 <= 1; x4_5 <= 1; x4_6 <= 1; x4_7 <= 1;
x5_1 <= 1; x5_2 <= 1; x5_3 <= 1; x5_4 <= 1; x5_5 <= 1; x5_6 <= 1; x5_7 <= 1;
x6_1 <= 1; x6_2 <= 1; x6_3 <= 1; x6_4 <= 1; x6_5 <= 1; x6_6 <= 1; x6_7 <= 1;
x7_1 <= 1; x7_2 <= 1; x7_3 <= 1; x7_4 <= 1; x7_5 <= 1; x7_6 <= 1; x7_7 <= 1;
x8_1 <= 1; x8_2 <= 1; x8_3 <= 1; x8_4 <= 1; x8_5 <= 1; x8_6 <= 1; x8_7 <= 1;
x9_1 <= 1; x9_2 <= 1; x9_3 <= 1; x9_4 <= 1; x9_5 <= 1; x9_6 <= 1; x9_7 <= 1;
x10_1 <= 1; x10_2 <= 1; x10_3 <= 1; x10_4 <= 1; x10_5 <= 1; x10_6 <= 1; x10_7 <= 1;
x11_1 <= 1; x11_2 <= 1; x11_3 <= 1; x11_4 <= 1; x11_5 <= 1; x11_6 <= 1; x11_7 <= 1;
x12_1 <= 1; x12_2 <= 1; x12_3 <= 1; x12_4 <= 1; x12_5 <= 1; x12_6 <= 1; x12_7 <= 1;
Example: Scheduling of DFG (2)
/* uniqueness constraints */
x1_1 + x1_2 + x1_3 + x1_4 + x1_5 + x1_6 + x1_7 = 1;
x2_1 + x2_2 + x2_3 + x2_4 + x2_5 + x2_6 + x2_7 = 1;
x3_1 + x3_2 + x3_3 + x3_4 + x3_5 + x3_6 + x3_7 = 1;
x4_1 + x4_2 + x4_3 + x4_4 + x4_5 + x4_6 + x4_7 = 1;
x5_1 + x5_2 + x5_3 + x5_4 + x5_5 + x5_6 + x5_7 = 1;
x6_1 + x6_2 + x6_3 + x6_4 + x6_5 + x6_6 + x6_7 = 1;
x7_1 + x7_2 + x7_3 + x7_4 + x7_5 + x7_6 + x7_7 = 1;
x8_1 + x8_2 + x8_3 + x8_4 + x8_5 + x8_6 + x8_7 = 1;
x9_1 + x9_2 + x9_3 + x9_4 + x9_5 + x9_6 + x9_7 = 1;
x10_1 + x10_2 + x10_3 + x10_4 + x10_5 + x10_6 + x10_7 = 1;
x11_1 + x11_2 + x11_3 + x11_4 + x11_5 + x11_6 + x11_7 = 1;
x12_1 + x12_2 + x12_3 + x12_4 + x12_5 + x12_6 + x12_7 = 1;
/* resource constraints */
x1_1 + x2_1 + x3_1 + x4_1 + x6_1 + x7_1 <= 2;
x5_1 + x8_1 + x9_1 + x10_1 + x11_1 <= 2;
x1_2 + x2_2 + x3_2 + x4_2 + x6_2 + x7_2 <= 2;
x5_2 + x8_2 + x9_2 + x10_2 + x11_2 <= 2;
x1_3 + x2_3 + x3_3 + x4_3 + x6_3 + x7_3 <= 2;
x5_3 + x8_3 + x9_3 + x10_3 + x11_3 <= 2;
x1_4 + x2_4 + x3_4 + x4_4 + x6_4 + x7_4 <= 2;
x5_4 + x8_4 + x9_4 + x10_4 + x11_4 <= 2;
x1_5 + x2_5 + x3_5 + x4_5 + x6_5 + x7_5 <= 2;
x5_5 + x8_5 + x9_5 + x10_5 + x11_5 <= 2;
x1_6 + x2_6 + x3_6 + x4_6 + x6_6 + x7_6 <= 2;
x5_6 + x8_6 + x9_6 + x10_6 + x11_6 <= 2;
x1_7 + x2_7 + x3_7 + x4_7 + x6_7 + x7_7 <= 2;
x5_7 + x8_7 + x9_7 + x10_7 + x11_7 <= 2;
/* objective function */
min: 1 x12_1 + 2 x12_2 + 3 x12_3 + 4 x12_4 + 5 x12_5 + 6 x12_6 + 7 x12_7;
/* sequencing constraints */
/* 1->6 */  1 x6_1 + 2 x6_2 + 3 x6_3 + 4 x6_4 + 5 x6_5 + 6 x6_6 + 7 x6_7 >= 1 + 1 x1_1 + 2 x1_2 + 3 x1_3 + 4 x1_4 + 5 x1_5 + 6 x1_6 + 7 x1_7;
/* 2->6 */  1 x6_1 + 2 x6_2 + 3 x6_3 + 4 x6_4 + 5 x6_5 + 6 x6_6 + 7 x6_7 >= 1 + 1 x2_1 + 2 x2_2 + 3 x2_3 + 4 x2_4 + 5 x2_5 + 6 x2_6 + 7 x2_7;
/* 6->10 */ 1 x10_1 + 2 x10_2 + 3 x10_3 + 4 x10_4 + 5 x10_5 + 6 x10_6 + 7 x10_7 >= 1 + 1 x6_1 + 2 x6_2 + 3 x6_3 + 4 x6_4 + 5 x6_5 + 6 x6_6 + 7 x6_7;
/* 10->11 */ 1 x11_1 + 2 x11_2 + 3 x11_3 + 4 x11_4 + 5 x11_5 + 6 x11_6 + 7 x11_7 >= 1 + 1 x10_1 + 2 x10_2 + 3 x10_3 + 4 x10_4 + 5 x10_5 + 6 x10_6 + 7 x10_7;
/* 11->12 */ 1 x12_1 + 2 x12_2 + 3 x12_3 + 4 x12_4 + 5 x12_5 + 6 x12_6 + 7 x12_7 >= 1 + 1 x11_1 + 2 x11_2 + 3 x11_3 + 4 x11_4 + 5 x11_5 + 6 x11_6 + 7 x11_7;
/* 3->7 */  1 x7_1 + 2 x7_2 + 3 x7_3 + 4 x7_4 + 5 x7_5 + 6 x7_6 + 7 x7_7 >= 1 + 1 x3_1 + 2 x3_2 + 3 x3_3 + 4 x3_4 + 5 x3_5 + 6 x3_6 + 7 x3_7;
/* 7->11 */ 1 x11_1 + 2 x11_2 + 3 x11_3 + 4 x11_4 + 5 x11_5 + 6 x11_6 + 7 x11_7 >= 1 + 1 x7_1 + 2 x7_2 + 3 x7_3 + 4 x7_4 + 5 x7_5 + 6 x7_6 + 7 x7_7;
/* 4->8 */  1 x8_1 + 2 x8_2 + 3 x8_3 + 4 x8_4 + 5 x8_5 + 6 x8_6 + 7 x8_7 >= 1 + 1 x4_1 + 2 x4_2 + 3 x4_3 + 4 x4_4 + 5 x4_5 + 6 x4_6 + 7 x4_7;
/* 8->12 */ 1 x12_1 + 2 x12_2 + 3 x12_3 + 4 x12_4 + 5 x12_5 + 6 x12_6 + 7 x12_7 >= 1 + 1 x8_1 + 2 x8_2 + 3 x8_3 + 4 x8_4 + 5 x8_5 + 6 x8_6 + 7 x8_7;
/* 5->9 */  1 x9_1 + 2 x9_2 + 3 x9_3 + 4 x9_4 + 5 x9_5 + 6 x9_6 + 7 x9_7 >= 1 + 1 x5_1 + 2 x5_2 + 3 x5_3 + 4 x5_4 + 5 x5_5 + 6 x5_6 + 7 x5_7;
/* 9->12 */ 1 x12_1 + 2 x12_2 + 3 x12_3 + 4 x12_4 + 5 x12_5 + 6 x12_6 + 7 x12_7 >= 1 + 1 x9_1 + 2 x9_2 + 3 x9_3 + 4 x9_4 + 5 x9_5 + 6 x9_6 + 7 x9_7;
Example: Scheduling of DFG (3)

[Figure: solver run; content not preserved in the transcript]
Example: DFG - Optimal Schedule

[Figure: optimal schedule found by the ILP; at most 2 multiplications and 2 ALU operations per time step; latency = 4]
Generalized and Optimized Formulation (1)
• compute the earliest and latest scheduling time for each node vi using ASAP and ALAP scheduling (without resource constraints)
– latency bound for ALAP determined by list scheduling
li = 𝜏_ASAP(vi), hi = 𝜏_ALAP(vi)
• binary variables
x_{i,t} ∈ {0,1}, ∀vi ∈ V, ∀t: li ≤ t ≤ hi
• uniqueness constraints
Σ_{t=li..hi} x_{i,t} = 1, ∀vi ∈ V
• scheduling time
Σ_{t=li..hi} t · x_{i,t} = 𝜏(vi), ∀vi ∈ V

✎ Question: How to compute an ASAP/ALAP schedule with ILP?
Generalized and Optimized Formulation (2)
• sequencing, using di := w(vi, β(vi))
𝜏(vj) - 𝜏(vi) ≥ di, ∀(vi, vj) ∈ E
• resource constraints (di = 1 not required anymore)
Σ_{i: (vi, rk) ∈ ER} Σ_{p = max{0, t-hi} .. min{di-1, t-li}} x_{i,t-p} ≤ α(rk)
∀rk ∈ VT, ∀t: min_i{li} ≤ t ≤ max_i{hi}
• cost function
min Σ_{l} l · x_{n,l}
Generalized and Optimized Formulation (3)
• explanation of the generalized resource constraints (slightly simplified: the more complex summation bounds are used to avoid summing undefined variables)
Σ_{p=0..di-1} x_{i,t-p} = 1 if 𝜏(vi) ≤ t ≤ 𝜏(vi) + di - 1, and 0 otherwise
• e.g. assume node v1 is executed on a resource with di = 3, then the resource usage caused by v1 in time steps t = 1, 2, 3, … is:
t = 1: x1_1 (resource busy if v1 has been started at t = 1)
t = 2: x1_2 + x1_1 (resource busy if v1 has just been started at t = 2, or still occupied if started at t = 1)
t = 3: x1_3 + x1_2 + x1_1 (resource busy if v1 has just been started at t = 3, or still occupied if started at t = 1 or t = 2)
t = 4: x1_4 + x1_3 + x1_2 (resource busy if v1 has just been started at t = 4, or still occupied if started at t = 2 or t = 3)
✎ Exercise 5.1: Architecture Synthesis
Iterative Scheduling
• most applications execute a dataflow graph not just once but in a loop
– generation and optimization of periodic schedules
– execution in pipelined iterations for optimizing performance
• parameters
– period / iteration interval / initiation interval (P): time between the start of execution of operation vi in two successive iterations
– latency (L): time between the start of the first and the last operation within one iteration

[Figure: two pipelined schedules of operations v1 … v4 over four iterations; left: L = 4, P = 1 (a new iteration starts every time step); right: L = 4, P = 2]
Iterative Scheduling
• iterative schedule examples (visualized with Gantt charts); book excerpt from Teich & Haubelt 2007, translated from German:

– sequential processing of successive iterations (Abb. 4.17: schedule and binding with one resource of type r1 for v1, v2 and one of type r2 for v3, v4): the iteration interval is P = L = 9, i.e. every nine time steps the computation of a new data set repeats; the resource of type r2 is idle from time step t = 0 to t = 4 of each iteration

– schedule with functional pipelining, i.e. operations from different iterations are processed in one period (Tabelle 4.2, Abb. 4.18): with suitable storage of intermediate results, the computation of a new data set can begin (v1 on r1) at the same time as the computation of v4 starts on r2; one iteration then computes v1(n), v2(n), v3(n) and v4(n-1), and the iteration interval drops to P = 7; the schedule is non-overlapping, since no execution interval crosses the iteration-interval borders t = 0 and t = 7
  Tabelle 4.2 (P = 7): β(v1) = β(v2) = r1, β(v3) = β(v4) = r2; start times t1 = 0, t2 = 4, t3 = 4, t4 = 7/0

– overlapping schedule with functional pipelining, i.e. operations from an iteration may overlap the border of the period (Beispiel 4.5.6, Tabelle 4.3, Abb. 4.19): for the problem graph of Abb. 4.14 with the resource graph of Abb. 4.16, the iteration interval can be lowered to P = 6 if v3 starts at t3 = 4 and ends at time step t = 1 of the following cycle; the start of v4 shifts accordingly to t4 = 1; the schedule is overlapping, since the execution interval of v3 overlaps the iteration-interval border P = 6
  Tabelle 4.3 (P = 6): β(v1) = β(v2) = r1, β(v3) = β(v4) = r2; start times t1 = 0, t2 = 4, t3 = 4, t4 = 7/1

Teich & Haubelt 2007
Translating SW to HW – High-level Synthesis
• basic idea: “execute” the control data flow graph
• control flow graph
– describes possible execution sequences of basic blocks
– implement the CFG as a finite state machine (FSM)
• data flow graph
– describes arithmetic operations on basic block level
– implement data flow graphs as data paths
– use registers for storing and transferring data between basic blocks
[Figure: controller (finite state machine) and data path (registers, functional units FU)]
Example Program: Greatest Common Divisor (GCD)

int gcd(int a, int b) {
    while (a != b) {
        if (a > b) {
            a = a - b;
        } else {
            b = b - a;
        }
    }
    return a;
}
Control Data Flow Graph for GCD
generated with Clang/LLVM

BB0:
  br label %BB1
BB1:
  %b1 = phi i32 [ %b2, %BB5 ], [ %b, %BB0 ]
  %a1 = phi i32 [ %a2, %BB5 ], [ %a, %BB0 ]
  br label %BB2
BB2:
  %a2 = phi i32 [ %a1, %BB1 ], [ %a3, %BB4 ]
  %whilecond = icmp eq i32 %b1, %a2
  br i1 %whilecond, label %BB6, label %BB3    ; true -> BB6, false -> BB3
BB3:
  %ifcond = icmp sgt i32 %a2, %b1
  br i1 %ifcond, label %BB4, label %BB5       ; true -> BB4, false -> BB5
BB4:
  %a3 = sub i32 %a2, %b1
  br label %BB2
BB5:
  %b2 = sub i32 %b1, %a2
  br label %BB1
BB6:
  ret i32 %a2
Control Data Flow Graph for GCD
• data path
• predicates (flags) for conditional jumps (input to the controller FSM)
• conditional state transitions (evaluated by the controller FSM)
• phi nodes define registers + input multiplexers (other values are temporary)
Data Path for GCD

[Figure: data path with registers a1, a2, b1 and inputs a, b; two subtracters compute a2 - b1 and b1 - a2, two comparators compute ifcond (a2 > b1) and whilecond (b1 == a2); ifcond and whilecond are output signals to the controller; sela1, ena1, sela2, ena2, selb1, enb1 are input signals from the controller; phi nodes become registers + input multiplexers]
Data Path for GCD (Block Diagram)

[Figure: data path block diagram; data inputs a, b; control inputs sela1, ena1, sela2, ena2, selb1, enb1; outputs ifcond, whilecond and the result (a2)]
Control Data Flow Graph for GCD

[Figure: the CDFG shown before, repeated]
Control FSM for GCD

[Figure: FSM with states 0-6 and transitions
0 → 1: – / a01
1 → 2: – / a12
2 → 6: whilecond / a26
2 → 3: !whilecond / a23
3 → 4: ifcond / a34
3 → 5: !ifcond / a35
4 → 2: – / a42
5 → 1: – / a51]

edge labels: c / a, where c = condition for the transition and a = action executed during the transition
Transition Table and Actions

current | condition  | next | action | sela1 | ena1 | sela2 | ena2 | selb1 | enb1
   0    | -          |  1   | a01    |   a   |  1   |   -   |  0   |   b   |  1
   1    | -          |  2   | a12    |   -   |  0   |  a1   |  1   |   -   |  0
   2    | whilecond  |  6   | a26    |   -   |  0   |   -   |  0   |   -   |  0
   2    | !whilecond |  3   | a23    |   -   |  0   |   -   |  0   |   -   |  0
   3    | ifcond     |  4   | a34    |   -   |  0   |   -   |  0   |   -   |  0
   3    | !ifcond    |  5   | a35    |   -   |  0   |   -   |  0   |   -   |  0
   4    | -          |  2   | a42    |   -   |  0   |  a3   |  1   |   -   |  0
   5    | -          |  1   | a51    |  a2   |  1   |   -   |  0   |  b2   |  1

sela1 … enb1 are input signals to the data path

the transition table and the data path can easily be translated to a hardware description language (e.g. VHDL or Verilog)
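The transition table can also be simulated directly in software; the Python sketch below (illustrative only, not the generated HDL) models the registers a1, b1, a2, the combinational subtractions a3 = a2 - b1 and b2 = b1 - a2, and the predicates, and steps through the FSM until state 6 is reached:

```python
# Cycle-by-cycle simulation of the GCD FSMD from the transition table.
def gcd_fsmd(a, b):
    state = 0
    a1 = b1 = a2 = 0
    while True:
        a3, b2 = a2 - b1, b1 - a2                     # data path (combinational)
        whilecond, ifcond = (b1 == a2), (a2 > b1)     # predicates
        if state == 0:                                # a01: load inputs
            a1, b1, state = a, b, 1
        elif state == 1:                              # a12: a2 <- a1
            a2, state = a1, 2
        elif state == 2:
            state = 6 if whilecond else 3
        elif state == 3:
            state = 4 if ifcond else 5
        elif state == 4:                              # a42: a2 <- a2 - b1
            a2, state = a3, 2
        elif state == 5:                              # a51: a1 <- a2, b1 <- b2
            a1, b1, state = a2, b2, 1
        else:                                         # state 6: done
            return a2

print(gcd_fsmd(12, 8), gcd_fsmd(35, 21))              # 4 7
```

Note that a3, b2 and the predicates are recomputed at the top of every cycle from the current register values, mimicking the combinational data path sampled by the clocked controller.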
Control FSM for GCD (Block Diagram)

[Figure: FSM block diagram; next-state and output logic with state register (state, next state); inputs ifcond, whilecond; outputs sela1, ena1, sela2, ena2, selb1, enb1]
Limitations and Potential Improvements
• limitations of this basic high-level synthesis approach
– restricted subset of C only (no function calls, memory access, …)
– limited amount of parallelism (CDFG is executed sequentially)
– no sharing of operators in the data path
– single-cycle execution model (each basic block is executed in exactly one cycle)
– FSMs can become very complex
• features of more advanced high-level synthesis methods
– sharing of operators in the data path
– program transformations to increase parallelism
– loop pipelining and retiming
– …
• many tools available
– commercial: e.g. Impulse C, Mentor Catapult C, AutoESL (now in Xilinx Vivado / SDAccel)
– free/open source: e.g. ROCCC, SPARK, LegUp, C-to-Verilog
Case Study: GCD with Xilinx SDAccel
• express the gcd function as an OpenCL kernel
• build the FPGA design
xocc -t sw_emu --report estimate --platform xilinx:adm-pcie-7v3:1ddr:3.0 -s src-cl/gcd.cl -o bin-cl/gcd.sw_emu.xclbin
GCD with Xilinx SDAccel: FSM
• FSM with 136 states
• one-hot coding of states
• needs to interface to the communication bus + memory

[Figure: excerpt of the generated FSM code showing current state and next state]
GCD with Xilinx SDAccel: Data Path
• two data registers (no duplicated a1 and a2)
• predicates whilecond and ifcond
GCD with Xilinx SDAccel: Arithmetic in Data Path
• one subtraction or two?
• both sides of the condition are evaluated in parallel
• another pair of multiplexers enables or disables the result
• ifcond is not used in the FSM, but internally in the data path

✎ Example: GCD data path with SDAccel
Microprogrammed Architectures
• two ends of the spectrum for implementing applications
– CPU: generic, fully programmable architecture; the application can easily be varied after fabrication
– ASIC/FPGA: highly specialized; the application is fixed at design time
• middle ground: microprogrammed architectures
– also tailored to a particular application or class of applications
– but the controller is programmable (instead of a fixed FSM)
FSMD vs. Microprogrammed Architecture
In this chapter, we will investigate the microprogramming technique and learn how it can be used to provide flexibility to hardware circuits while maintaining full customizability.

5.2 Microprogrammed Control

Figure 5.3 shows a microprogrammed machine next to an FSMD. The fundamental idea in microprogramming is to replace the next-state logic of the finite state machine with a programmable memory, called the control store. The control store holds microinstructions, and is addressed using a register called CSAR (Control Store Address Register). The CSAR is equivalent to a program counter in microprocessors. The next value of CSAR is determined by the next-address logic, using the current value of CSAR, the current microinstruction and the value of status flags evaluated by the datapath. The default next value of the CSAR corresponds to the previous CSAR value incremented by one. This way, the control store will return a stream of microinstructions from sequential memory locations. However, the next-address logic can also implement conditional jumps or immediate jumps.

Thus, the next-address logic, the CSAR, and the control store implement the equivalent of an instruction-fetch cycle in a microprocessor. In the design of Fig. 5.3, each microinstruction takes a single clock cycle to execute. Within a single clock cycle, the following activities occur:

• The CSAR provides an address to the control store, which retrieves a microinstruction. The microinstruction is split into two parts: a command field and a jump field. The command field serves as a command for the datapath. The jump field goes to the next-address logic.
• The datapath executes the command encoded in the microinstruction, and returns status information to the next-address logic.

[Fig. 5.3: an FSMD (next-state logic, state register, status signals, datapath) next to a microprogrammed machine (control store, CSAR, next-address logic, microinstruction with command field and jump field, datapath, status signals); in contrast to FSM-based control, microprogramming uses a flexible control scheme]
Schaumont 2010
• finite state machine with data path (FSMD): fixed control
• microprogrammed architecture: programmable control
136 5 Microprogrammed Architectures
In this chapter, we will investigate the microprogramming technique and learn howit can be used to provide flexibility to hardware circuits while maintaining fullcustomizability.
5.2 Microprogrammed Control
Figure 5.3 shows a microprogrammed machine next to an FSMD. The fundamentalidea in microprogramming is to replace the next-state logic of the finite state-machine with a programmable memory, called the control store. The control storeholds microinstructions, and is addressed using a register called CSAR (ControlStore Address Register). The CSAR is equivalent to a program counter in micro-processors. The next-value of CSAR is determined by the next-address logic, usingthe current value of CSAR, the current microinstruction and the value of status flagsevaluated by the datapath. The default next-value of the CSAR corresponds to theprevious CSAR value incremented by one. This way, the control store will return astream of microinstructions from sequential memory locations. However, the next-address logic can also implement conditional jumps or immediate jumps.
Thus, the next-address logic, the CSAR, and the control store implement the equivalent of an instruction-fetch cycle in a microprocessor. In the design of Fig. 5.3, each microinstruction takes a single clock cycle to execute. Within a single clock cycle, the following activities occur:
• The CSAR provides an address to the control store which retrieves a microinstruction. The microinstruction is split into two parts: a command field and a jump field. The command field serves as a command for the datapath. The jump field goes to the next-address logic.
• The datapath executes the command encoded in the microinstruction, and returns status information to the next-address logic.
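The next-address computation described above can be sketched in C as a small model. This is an illustrative sketch, not from the source; the enum names and the 16-bit CSAR width are assumptions.

```c
#include <stdint.h>

/* Minimal model of the next-address logic: the default is sequential fetch
 * (CSAR + 1); jump variants override it using the address field of the
 * microinstruction and the status flags returned by the datapath. */
typedef enum { NXT_DEFAULT, NXT_JUMP, NXT_JUMP_IF_C, NXT_JUMP_IF_Z } nxt_t;

uint16_t next_csar(uint16_t csar, nxt_t nxt, uint16_t address, int cf, int zf)
{
    switch (nxt) {
    case NXT_JUMP:      return address;                 /* immediate jump */
    case NXT_JUMP_IF_C: return cf ? address : csar + 1; /* conditional jump */
    case NXT_JUMP_IF_Z: return zf ? address : csar + 1;
    default:            return csar + 1;                /* sequential fetch */
    }
}
```

Clocking this function once per cycle, with the control store indexed by the returned value, reproduces the fetch behavior described above.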
[Fig. 5.3 In contrast to FSM-based control, microprogramming uses a flexible control scheme: the FSMD's next-state logic and state register are replaced by the next-address logic, the CSAR, and the control store; the control store emits a microinstruction whose jump field feeds the next-address logic and whose command field drives the datapath, which returns status flags]
programmable
CSAR (control store address register) corresponds to instruction counter in CPU
Microinstruction Encoding
• microinstruction specifies
– commands for data path
– jump field
• jump field
– how to compute address of next microinstruction based on feedback from data path
– optional address constant
• address
– absolute address of microinstruction
– here: address field width is 12 bits, hence 4096 microinstructions can be addressed
63
[Fig. 5.4 Sample format for a 32-bit microinstruction word: a command field (datapath command) and a jump field (next + address); the next-address logic combines the jump field with the datapath flags (cf, zf) to compute the next CSAR value]

next field encodings:
0000  CSAR = CSAR + 1                  Default
0001  CSAR = address                   Jump
0010  CSAR = cf ? address : CSAR + 1   Jump if carry
1010  CSAR = cf ? CSAR + 1 : address   Jump if no carry
0100  CSAR = zf ? address : CSAR + 1   Jump if zero
1100  CSAR = zf ? CSAR + 1 : address   Jump if not zero
reserved for the next-address logic. Let us first consider the part for the next-address logic. The address field holds an absolute target address, pointing to a location in the control store. In this case, the address is 12 bits, which means that this microinstruction format would be suited for a control store with 4,096 locations. The next field encodes the operation that will lead to the next value of CSAR. The default operation is, as discussed earlier, to increment CSAR. For such instructions, the address field remains unused. When next has values different from the default one, various jump instructions can be encoded. An absolute jump will transfer the value of the address field into CSAR. A conditional jump will use the value of a flag to conditionally update the CSAR or else increment it. Obviously, the format as shown is quite bulky and may consume a large amount of storage. For example, the address field is only used for jump instructions. In an average microprocessor program, about 1 instruction out of 5 is a jump. Therefore, if the microprogram contains only a few jump instructions, then the storage for the address field is wasted. To avoid this, we will need to optimize the microinstruction format. For example, when microinstructions are not jumps, then the bits used for the address field could be given a different purpose.
microinstruction
Schaumont 2010
Command Field – Horizontal vs. Vertical Microcode
• tradeoff between code density and decoding effort
• horizontal microcode
– microinstruction directly contains bits to control data path
– no decoding required
– low code density
• vertical microcode
– microinstructions contain encoded form of data path control signals
– decoding required
– higher code density
64
Schaumont 2010
5.3.2 Command Field
The design of the datapath command format reveals another interesting trade-off: we can either opt for a very wide microinstruction word, or else we can prefer a narrow microinstruction word. A wide microinstruction word allows each control bit of the datapath to be stored separately. A narrow microinstruction word, on the other hand, will require the creation of "symbolic instructions", which are encoded groups of control bits for the datapath. The FSMD model relies on such symbolic instructions. Each of the above approaches has a specific name. Horizontal microinstructions use no encoding at all. They represent each control bit in the datapath with a separate bit in the microinstruction format. Vertical microinstructions on the other hand encode the control bits for the datapath as much as possible. A few bits of the microinstruction can define the value of many more control bits in the datapath.
Figure 5.5 demonstrates an example of vertical and horizontal microinstructions in the datapath. We wish to create a microprogrammed machine with three instructions on a single register a. The three instructions do one of the following: double the value in a, decrement the value in a, or initialize the value in a. The datapath shown on the bottom of Fig. 5.5 contains two multiplexers and a programmable adder/subtractor. It can be easily verified that each of the instructions enumerated above can be implemented as a combination of control bit values for each multiplexer and for the adder/subtractor. The controller on top shows two possible encodings for the three instructions: a horizontal encoding and a vertical encoding.
[Fig. 5.5 Example of vertical versus horizontal microprogramming: the micro-programmed controller drives a datapath (register a, two multiplexers, an adder/subtractor, input IN) through the control signals sel1, sel2, alu; the vertical microcode passes through a decoder first]

Micro-instruction   Horizontal microcode   Vertical microcode
a = 2 * a           000                    00
a = a - 1           110                    10
a = IN              001                    01
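The decoder in front of the datapath can be sketched as a lookup from the 2-bit vertical code to the three horizontal control bits. The bit patterns are taken from Fig. 5.5; the mapping of the three horizontal bits onto sel1/sel2/alu is an assumption for illustration.

```c
/* Vertical-to-horizontal decoder for the three instructions of Fig. 5.5.
 * Input: 2-bit vertical microinstruction; output: 3 control bits that
 * would drive the datapath (assumed order: sel1, sel2, alu). */
unsigned decode_vertical(unsigned v)
{
    switch (v & 0x3u) {
    case 0x0u: return 0x0u; /* a = 2 * a : horizontal 000 */
    case 0x2u: return 0x6u; /* a = a - 1 : horizontal 110 */
    case 0x1u: return 0x1u; /* a = IN    : horizontal 001 */
    default:   return 0x0u; /* code 11 is unused in this example */
    }
}
```

This makes the trade-off concrete: the vertical store needs 2 bits per microinstruction instead of 3, at the cost of this decoding logic.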
Example: A Microcoded Datapath
65
Schaumont 2010
5.4 The Microprogrammed Datapath
The datapath of a microprogrammed machine consists of three elements: computation units, storage, and communication busses. Each of these may contribute a few control bits to the microinstruction word. For example, multi-function computation units have selection bits that determine their specific function, storage units have address bits and read/write command bits, and communication busses have source/destination control bits. The datapath may also generate condition flags for the microprogrammed controller. Typically, these will be generated by the computation units.
5.4.1 Datapath Architecture
Figure 5.7 illustrates a microprogrammed controller with a datapath attached. The datapath includes an ALU with shifter unit, a register file with 8 entries, an accumulator register, and an input port.
The microinstruction word is shown on top of Fig. 5.7 and contains 6 fields. Two fields, Nxt and Address, are used by the microprogrammed controller. The others are used by the datapath. The type of encoding is mixed horizontal/vertical: the overall machine uses a horizontal encoding, with each module of the machine controlled independently. The submodules within the machine, on the other hand, use a vertical encoding. For example, the ALU field contains 4 bits, so the ALU component in the datapath can execute up to 16 different commands.
The machine completes a single instruction per clock cycle. The ALU combines an operand from the accumulator register with an operand from the register file or the input port. The result of the operation is returned to the register file or the
[Fig. 5.7 A microprogrammed datapath: the control store, CSAR, and next-address logic drive a datapath built from a register file, an ALU with shifter, an accumulator register (ACC), and an input port; the microinstruction fields are SBUS, ALU, Shifter, Dest, Nxt, and Address (plus an unused bit), and the datapath returns flags to the next-address logic]
• same microinstruction format as in previous example
• 4 data path units to be controlled
– SBUS: select operand
– ALU: choose ALU operation
– Shifter: optional bit shift of ALU result
– Dest: select location for storing the result
Example: Microinstruction Encoding
66
Schaumont 2010
Table 5.1 Microinstruction encoding of the example machine

Field    Width  Encoding
SBUS     4      Selects the operand that will drive the S-Bus
                0000 R0      0101 R5
                0001 R1      0110 R6
                0010 R2      0111 R7
                0011 R3      1000 Input
                0100 R4      1001 Address/Constant
ALU      4      Selects the operation performed by the ALU
                0000 ACC             0110 ACC | S-Bus
                0001 S-Bus           0111 not S-Bus
                0010 ACC + S-Bus     1000 ACC + 1
                0011 ACC - S-Bus     1001 S-Bus - 1
                0100 S-Bus - ACC     1010 0
                0101 ACC & S-Bus     1011 1
Shifter  3      Selects the function of the programmable shifter
                000 logical SHL(ALU)   100 arith SHL(ALU)
                001 logical SHR(ALU)   101 arith SHR(ALU)
                010 rotate left ALU    111 ALU
                011 rotate right ALU
Dest     4      Selects the target that will store S-Bus
                0000 R0      0101 R5
                0001 R1      0110 R6
                0010 R2      0111 R7
                0011 R3      1000 ACC
                0100 R4      1111 unconnected
Nxt      4      Selects next-value for CSAR
                0000 CSAR + 1                   1010 cf ? CSAR + 1 : Address
                0001 Address                    0100 zf ? Address : CSAR + 1
                0010 cf ? Address : CSAR + 1    1100 zf ? CSAR + 1 : Address
As an example, let us develop a microprogram that reads two numbers from the input port and that evaluates their greatest common divisor (GCD) using Euclid's algorithm. The first step is to develop a microprogram in terms of register transfers. A possible approach is shown in Listing 5.1. Lines 2 and 3 in this program read in two values from the input port and store these values in registers R0 and ACC. At the end of the program, the resulting GCD will be available in either ACC or R0; the program will continue until both values are equal. The stop test is implemented in line 4, using a subtraction of two registers and a conditional jump based on the zero-flag. Assuming both registers contain different values, the program will continue to subtract the smaller register from the larger one. This requires finding which of R0 and ACC is bigger, and it is implemented with a conditional jump in line 5. The bigger-than test is implemented using a subtraction, a left-shift, and a test on the resulting carry-flag. If the carry-flag is set, then the most-significant bit of the subtraction would be one, indicating a negative result in two's complement
available microinstructions and their encoding
Example: How to Encode a Microinstruction
67
Schaumont 2010
how to encode "ACC ← R2"?
RT-level instruction: ACC <- R2

Micro-instruction field encoding:
SBUS = 0010, ALU = 0001, Shifter = 111, Dest = 1000, Nxt = 0000, Address = 000000000000

Micro-instruction formation (unused bit first):
{0, 0010, 0001, 111, 1000, 0000, 000000000000}

Regrouped into 4-bit nibbles:
{0001, 0000, 1111, 1000, 0000, 0000, 0000, 0000}

Micro-instruction encoding: 0x10F80000

Fig. 5.8 Forming microinstructions from register-transfer instructions
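The field packing of Fig. 5.8 can be written as a shift-and-or over the six fields. This is an illustrative sketch; the function name is made up, and the bit positions follow the field order of Fig. 5.8 (unused bit at bit 31, then SBUS, ALU, Shifter, Dest, Nxt, Address with the widths from Table 5.1).

```c
#include <stdint.h>

/* Pack the six fields of the example machine into a 32-bit
 * microinstruction word. Bit 31 is the unused bit and stays zero. */
uint32_t pack_uinstr(uint32_t sbus, uint32_t alu, uint32_t shifter,
                     uint32_t dest, uint32_t nxt, uint32_t address)
{
    return (sbus    & 0xFu)   << 27 |  /* SBUS:    4 bits */
           (alu     & 0xFu)   << 23 |  /* ALU:     4 bits */
           (shifter & 0x7u)   << 20 |  /* Shifter: 3 bits */
           (dest    & 0xFu)   << 16 |  /* Dest:    4 bits */
           (nxt     & 0xFu)   << 12 |  /* Nxt:     4 bits */
           (address & 0xFFFu);         /* Address: 12 bits */
}
```

For ACC <- R2 (SBUS = 0010, ALU = 0001, Shifter = 111, Dest = 1000, Nxt = 0000, Address = 0) this yields 0x10F80000, matching Fig. 5.8.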
Listing 5.1 Micro-program to evaluate a GCD
1 ; Command Field            || Jump Field
2         IN -> R0
3         IN -> ACC
4 Lcheck: (R0 - ACC)         || JUMP_IF_Z Ldone
5         (R0 - ACC) << 1    || JUMP_IF_C Lsmall
6         R0 - ACC -> R0     || JUMP Lcheck
7 Lsmall: ACC - R0 -> ACC    || JUMP Lcheck
8 Ldone:                     || JUMP Ldone
logic. This conditional jump-if-carry will be taken if R0 is smaller than ACC. The combination of lines 5, 6, and 7 shows how an if-then-else statement can be created using multiple conditional and unconditional jump instructions. When the program is complete, in line 8, it iterates in an infinite loop. Depending on how this microprogram is integrated into a bigger program, this infinite loop would have to be replaced with an appropriate control transfer to the next task.
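The bigger-than test from lines 5-7 (subtract, shift left, inspect the carry) can be modelled in C. This is a sketch under assumptions: an 8-bit word is chosen for illustration, and like the microprogram's trick, the result is only reliable while the difference fits in the word.

```c
#include <stdint.h>

/* Model of the carry-flag comparison: the bit shifted out by the
 * left-shift equals the sign bit of the two's-complement difference,
 * so carry set means R0 < ACC. */
int r0_smaller_than_acc(uint8_t r0, uint8_t acc)
{
    uint8_t diff = (uint8_t)(r0 - acc);      /* R0 - ACC, two's complement */
    unsigned shifted = (unsigned)diff << 1;  /* (R0 - ACC) << 1 */
    return (shifted >> 8) & 1u;              /* carry flag = old MSB */
}
```

This is exactly why line 5 of the listing shifts the difference: the machine exposes no signed-compare, but the carry out of the shift recovers the sign bit.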
5.5 Implementing a Microprogrammed Machine
In this section, we discuss a sample implementation of a microprogrammed machine in the GEZEL language. This can be used as a template for other implementations.
5.5.1 Microinstruction Word Definition
A convenient starting point in the design of a microprogrammed machine is the definition of the microinstruction. This includes the allocation of microinstruction control bits and the definition of the meaning of relevant bit-patterns.
Example: Implementing GCD on this Architecture
68
Schaumont 2010
1: int gcd(int a, int b) {
2:   while(a != b) {
3:     if(a > b) {
4:       a = a - b;
5:     } else {
6:       b = b - a;
7:     }
8:   }
9:   return a;
10:}
; Command Field                || Jump Field
; --------------------------------------------------------------------------
1:         IN -> R0                                 ; read a, store in R0
2:         IN -> ACC                                ; read b, store in ACC
3: Lcheck: R0 - ACC            || JUMP_IF_Z Ldone   ; check while condition
4:         (R0 - ACC) << 1     || JUMP_IF_C Lsmall  ; check whether R0 < ACC, ..
5:         R0 - ACC -> R0      || JUMP Lcheck       ; if not, R0 - ACC -> R0
6: Lsmall: ACC - R0 -> ACC     || JUMP Lcheck       ; else ACC - R0 -> ACC
7: Ldone:                      || JUMP Ldone        ; infinite loop, end of prog
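The control flow of the microprogram can be transcribed directly into C with gotos, which makes it easy to check that it computes the same result as the gcd() source above. A sketch: the function name is made up, the labels mirror Lcheck/Lsmall/Ldone, and the terminating infinite loop is replaced by a return.

```c
/* Straight C transcription of the GCD microprogram's register transfers. */
int gcd_microprogram(int r0, int acc)
{
Lcheck:
    if (r0 - acc == 0) goto Ldone;  /* R0 - ACC        || JUMP_IF_Z Ldone  */
    if (r0 < acc) goto Lsmall;      /* shift/carry test || JUMP_IF_C Lsmall */
    r0 = r0 - acc;                  /* R0 - ACC -> R0                      */
    goto Lcheck;                    /*                  || JUMP Lcheck     */
Lsmall:
    acc = acc - r0;                 /* ACC - R0 -> ACC                     */
    goto Lcheck;                    /*                  || JUMP Lcheck     */
Ldone:
    return r0;                      /* GCD in R0 (== ACC)                  */
}
```

The one-to-one mapping between goto targets and jump-field labels illustrates how the microcoded controller implements the while/if structure of the original C program.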
Changes
• 2017-06-25 (v1.6.1)
– add case study of GCD synthesis with Xilinx SDAccel
• 2017-06-11 (v1.6.0)
– updated for SS2017
• 2016-07-10 (v1.5.2)
– fix more equations that have been corrupted by PowerPoint update
• 2016-06-06 (v1.5.1)
– change University name on title slide
• 2016-05-27 (v1.5.0)
– updated for SS2016, fix broken fonts for PowerPoint 15
• 2015-05-28 (v1.4.0)
– updated for SS2015
• 2014-05-22 (v1.3.1)
– explain on slide 41 that di=1 no longer required
69
Changes
• 2014-05-06 (v1.3.0)
– updated for SS2014
• 2013-06-18 (v1.2.4)
– fixed label of basic blocks B4 and B5 also on p.52
• 2013-06-06 (v1.2.3)
– cosmetic changes
– fixed index in equation on p.41
– fixed label of basic blocks B4 and B5 on p.48 + p.49
• 2013-05-23 (v1.2.2)
– clarified assumption of unit delay on p.33
• 2013-05-16 (v1.2.1)
– fix typo on slide 19, terminate algorithm when v0 is scheduled (not vn)
• 2013-05-13 (v1.2.0)
– updated for SS2013, merged all architecture synthesis materials into a single presentation
70