Fast, Quasi-Optimal, and Pipelined Instruction-Set Extensions
Ajay K. Verma, Philip Brisk and Paolo Ienne
Processor Architecture Laboratory (LAP) & Centre for Advanced Digital Systems (CSDA)
Ecole Polytechnique Fédérale de Lausanne (EPFL)
2
Custom ISE Identification
[Figure: processor datapath with a register file, ALU, MUL and LD/ST units, a data memory, and an AFU computing out1 = F (in1, in2, in3, in4) and out2 = G (in1, in2, in3, in4)]
Limited number of I/O ports
3
Outline
Problem formulation
  ISE selection
  I/O serialisation
Related work
Non-optimality of earlier work
Integer Linear Programming (ILP) formulation
Results
Conclusions
4
Problem Formulation
Given
  a dataflow graph
  a set of forbidden nodes
Find a subgraph S which is
  convex
  free of forbidden nodes
and has the largest gain
  M (S) = Nexec * (SW (S) – HW (S))
[Figure: example dataflow graph with nodes a, b, c, d, e, f, g, h and x1, x2, x3]
5
Convex Subgraph
[Figure: example graph with nodes a, b, c and d, in which node b lies outside a non-convex AFU]
In order to execute the AFU we need the output of node b
Computation of node b requires the output of the AFU
A non-convex AFU cannot be scheduled without creating a deadlock
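The deadlock argument above amounts to a reachability test. The following is a minimal sketch, not the authors' implementation: a subgraph is convex when no node outside it is both reachable from the subgraph and able to reach it. The edge list is a hypothetical four-node graph chosen to mirror the slide; `is_convex` and `EDGES` are illustrative names.

```python
def is_convex(edges, subgraph):
    """Return True if no path leaves `subgraph` and later re-enters it."""
    succ, pred = {}, {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
        pred.setdefault(v, set()).add(u)

    def reachable(start, adj):
        # Nodes reachable from `start` by following `adj` one or more times.
        seen, stack = set(), list(start)
        while stack:
            for nxt in adj.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    below = reachable(subgraph, succ)  # everything fed by the subgraph
    above = reachable(subgraph, pred)  # everything feeding the subgraph
    # An outside node in both sets sits on a path that exits and re-enters.
    return not ((below & above) - set(subgraph))


# Hypothetical graph (an assumption): a feeds b and c, which both feed d.
EDGES = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]

with_gap = is_convex(EDGES, {"a", "c", "d"})          # b is on the path a -> b -> d
without_gap = is_convex(EDGES, {"a", "b", "c", "d"})  # b is now inside
```

In this sketch the candidate {a, c, d} is rejected because b both consumes an AFU result and produces an AFU operand, which is exactly the scheduling deadlock described on the slide.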
6
I/O Serialisation
2 inputs, 4 outputs
Available I/O ports: (1, 2)
[Figure: two drawings of the subgraph with nodes b, c, d, e and f]
7
ISE Merit Estimation
M (S) = Nexec * (SW (S) – HW (S))
[Figure: example dataflow graph with nodes a, b, c, d, e, f, g, h and x1, x2, x3, next to the selected subgraph with nodes b, c, d, e and f]
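As a small worked instance of the merit formula, the sketch below evaluates M (S) for one candidate subgraph. It assumes that SW (S) is the sum of per-node software latencies and that HW (S) is the number of cycles the AFU takes, which matches the definitions given later in the talk; all numeric values and the name `merit` are hypothetical.

```python
def merit(n_exec, sw_latency, hw_cycles, selected):
    """M(S) = Nexec * (SW(S) - HW(S)) for the node set `selected`."""
    sw = sum(sw_latency[node] for node in selected)  # SW(S): software cycles
    return n_exec * (sw - hw_cycles)                 # cycles saved overall


# Hypothetical numbers: five one-cycle operations, executed 100 times,
# collapsed into an AFU that needs 2 cycles.
SW_LATENCY = {"b": 1, "c": 1, "d": 1, "e": 1, "f": 1}
saved = merit(n_exec=100, sw_latency=SW_LATENCY, hw_cycles=2,
              selected={"b", "c", "d", "e", "f"})
```

With these made-up inputs each execution saves 5 – 2 = 3 cycles, so the merit is 300 cycles.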
8
Related Work
ISE identification under I/O constraints
  Search space pruning using I/O and convexity constraints [Atasu03, Clark03, Yu04, Pozzi06, Yu07, Chen07]
  ILP-based approach [Atasu05]
  Pseudo-polynomial time algorithm [Bonzini07]
ISE identification under relaxed I/O constraints
  Restricted search space exploration [Pozzi05]
  Generation of a semi-compact set of connected ISEs [Pothineni07]
I/O serialisation
  Exponential time algorithms [Pozzi05, Pothineni07]
Algorithms for specific processor models
  Single-issue RISC processor model [Verma07]
9
Earlier Work
ISE Selection: Atasu03, Yu07, Chen07, Bonzini07
  Optimal ISE selection under various I/O constraints
I/O Serialisation: Pozzi05, Pothineni07
  Exponential time I/O serialisation algorithm
10
Non-Optimality of Earlier Work
[Figure: example graph, shown twice, with node weights of .2, .3, .5 and .6 and "cycle saved" annotations 23.36, 15.02, 066 and 112]
11
Our Contributions
Optimal ILP formulation for a large class of processor models
  Earlier work considers a RISC processor model only
Single run
  In earlier work, ISE selection was repeated for various I/O constraints
ISE selection and I/O scheduling together
  Treating them separately is another source of non-optimality in earlier work
12
Integer Linear Programming
Objective function
Linear constraints
13
ILP Formulation
Linear constraints
  No forbidden nodes
  Convexity constraints
  I/O serialisation based constraints
  I/O access per cycle based constraints
Objective function
  Saving in cycles should be maximum
14
ISE Selection Constraints (1 of 2)
Variable: for each node ni, a Boolean variable xi
  xi is true iff node ni is in the selected ISE
Constraint: no forbidden node should be in the ISE
  If ni is a forbidden node, then xi = 0
Variable: for each node ni, two Boolean variables pi and si
  pi (si) is true iff at least one predecessor (successor) of ni is in the selected ISE
Constraint: the subgraph corresponding to the selected ISE must be convex
  If pi and si are both true, then xi must be true (i.e., pi + si – xi ≤ 1)
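To see why the single inequality pi + si – xi ≤ 1 encodes the implication "pi and si imply xi", it is enough to enumerate all eight 0/1 assignments. The sketch below does that; the function names are illustrative only.

```python
from itertools import product


def satisfies_inequality(p, s, x):
    """Linear form of the convexity constraint for one node."""
    return p + s - x <= 1


def rejected_assignments():
    """All 0/1 assignments (p, s, x) that the inequality rules out."""
    return [(p, s, x) for p, s, x in product((0, 1), repeat=3)
            if not satisfies_inequality(p, s, x)]


# The inequality and the logical implication agree on every assignment.
agrees = all(
    satisfies_inequality(p, s, x) == ((not (p and s)) or bool(x))
    for p, s, x in product((0, 1), repeat=3)
)
rejected = rejected_assignments()
```

The only excluded assignment is p = 1, s = 1, x = 0, that is, a node left outside the ISE although it lies between two selected nodes.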
15
ISE Selection Constraints (2 of 2)
Relationship between pi, si and xi
pi = 0, if ni has no children
pi = ∨ (xj ∨ pj) over the children nj of ni, otherwise
si = 0, if ni has no parents
si = ∨ (xj ∨ sj) over the parents nj of ni, otherwise
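The recursive definitions can be evaluated directly for a fixed selection x, which is a convenient way to sanity-check a candidate ISE outside the solver. This is a sketch under two assumptions: the slide's "children" of ni are taken to be the nodes whose results ni consumes (its predecessors), and si is propagated through the s values of the nodes that consume ni. The graph, the selection and the name `propagate` are hypothetical.

```python
def propagate(neighbours, x):
    """flag[i] = OR over neighbours j of i of (x[j] OR flag[j])."""
    cache = {}

    def flag(i):
        if i not in cache:
            cache[i] = any(x[j] or flag(j) for j in neighbours.get(i, ()))
        return cache[i]

    return {i: flag(i) for i in x}


# Hypothetical graph: a feeds b and c, which both feed d.
PREDS = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
SUCCS = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
x = {"a": 1, "b": 0, "c": 1, "d": 1}  # candidate ISE {a, c, d}

p = propagate(PREDS, x)  # p[i]: some ancestor of i is selected
s = propagate(SUCCS, x)  # s[i]: some descendant of i is selected

# Nodes that break the convexity constraint p + s - x <= 1.
violating = [i for i in sorted(x) if p[i] + s[i] - x[i] > 1]
```

Here node b has a selected ancestor (a) and a selected descendant (d) but is not selected itself, so it is the one node that violates the constraint.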
16
I/O Serialisation Based Constraints (1 of 3)
[Figure: example AFU with nodes n1, n2, n3, n4 and n5]
Variable: an integer variable intDelayi
  Denotes the cycle in which node ni is executed, e.g.,
  intDelay1 = 0
  intDelay4 = 1
  intDelay5 = 2
Variable: a real variable fractionalDelayi
  Denotes the smallest time after the start of cycle intDelayi at which the output of ni is available, e.g.,
  fractionalDelay3 = HW (n3)
  fractionalDelay4 = HW (n3) + HW (n4)
Variable: an integer variable ρij
  Denotes the number of pipeline stages crossed by the edge between nodes ni and nj, e.g.,
  ρ13 = 1
  ρ34 = 0
  ρ25 = 2
17
I/O Serialisation Based Constraints (2 of 3)
Constraint: the difference between the cycles of a predecessor node and its successor is the same as the number of latches on the edge connecting them, e.g.,
  intDelay4 = intDelay3 + ρ34
  intDelay5 = intDelay2 + ρ25
Constraint: the total number of stages R is the same as the last cycle in which an output node is computed, e.g.,
  R = intDelay5 + ρ57
  R = intDelay2 + ρ26
[Figure: the example AFU with nodes n1 to n5, extended with nodes n6 and n7]
Extra latches on output edges are created in order to realize an imaginary sink node
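The equality intDelayj = intDelayi + ρij can be checked by propagating cycles along the edges. The sketch below uses only the three ρ values quoted in the talk (ρ13 = 1, ρ34 = 0, ρ25 = 2) and assumes that nodes without a listed predecessor execute in cycle 0; `int_delays` is an illustrative helper, not part of the authors' formulation.

```python
def int_delays(rho, source_cycle=0):
    """Propagate intDelay_j = intDelay_i + rho_ij along every edge."""
    nodes = {n for edge in rho for n in edge}
    has_pred = {j for _, j in rho}
    # Assumption: nodes with no listed predecessor run in `source_cycle`.
    delay = {n: source_cycle for n in nodes - has_pred}
    pending = dict(rho)
    while pending:
        progressed = False
        for (i, j), latches in list(pending.items()):
            if i in delay:
                candidate = delay[i] + latches
                # A second incoming edge must agree with the first one.
                if delay.setdefault(j, candidate) != candidate:
                    raise ValueError(f"inconsistent latches on edge {i}->{j}")
                del pending[(i, j)]
                progressed = True
        if not progressed:
            raise ValueError("edges form a cycle")
    return delay


RHO = {(1, 3): 1, (3, 4): 0, (2, 5): 2}  # values quoted on the slides
delays = int_delays(RHO)
```

The propagated values reproduce the examples given for intDelay1, intDelay4 and intDelay5, and an edge whose latch count disagrees with another path into the same node is reported as an error.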
18
I/O Serialisation Based Constraints (3 of 3)
Constraint: the fractionalDelay of a node depends on the fractionalDelay of its predecessor nodes, e.g.,
  Case 1: the node is the first node in its cycle
    fractionalDelay3 = HW (n3)
  Case 2: the node is not the first node in its cycle
    fractionalDelay4 = fractionalDelay3 + HW (n4)
Constraint: the fractionalDelay of a node should never exceed the cycle time λ, e.g.,
  fractionalDelay3 ≤ λ
  fractionalDelay4 ≤ λ
[Figure: the example AFU with nodes n1 to n7]
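The two cases can be folded into one rule: a node starts from the latest fractionalDelay among the predecessors that reach it without a latch (ρ = 0), or from zero if every incoming edge is latched. Taking the maximum over several unlatched predecessors is an assumption made for this sketch, as are the hardware delays and the name `fractional_delays`.

```python
def fractional_delays(order, preds, rho, hw):
    """Compute fractionalDelay for nodes given in topological `order`."""
    frac = {}
    for n in order:
        # Only predecessors in the same cycle (no latch) delay this node.
        same_cycle = [frac[p] for p in preds.get(n, ()) if rho[(p, n)] == 0]
        frac[n] = max(same_cycle, default=0.0) + hw[n]
    return frac


# Chain n1 -> n3 -> n4 with the latch counts quoted on the slides.
ORDER = [1, 3, 4]
PREDS = {3: [1], 4: [3]}
RHO = {(1, 3): 1, (3, 4): 0}
HW = {1: 0.25, 3: 0.5, 4: 0.25}  # hypothetical delays, in cycle fractions

frac = fractional_delays(ORDER, PREDS, RHO, HW)
cycle_time = 0.625  # hypothetical value of the cycle time
too_slow = [n for n in ORDER if frac[n] > cycle_time]
```

Node n3 is the first node in its cycle because the edge from n1 is latched, so it gets HW (n3); node n4 adds its own delay on top of that and, with this deliberately short cycle time, is the node that breaks the cycle-time bound.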
19
I/O Access Per Cycle Based Constraints
Variable: Boolean variables cikIN and cikOUT
  cikIN is true iff ni is an input of the ISE and is accessed in the kth stage of execution (similarly for cikOUT)
Constraint: in each stage, no more than m inputs should be accessed and no more than n outputs should be written back, i.e., for each k
  ∑ cikIN ≤ m
  ∑ cikOUT ≤ n
cikIN and cikOUT can be computed using the intDelay and fractionalDelay of the nodes and the ρ values of the incoming and outgoing edges of the AFU
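Once each input and output has been assigned to a stage, the port constraints reduce to counting accesses per stage. The sketch below represents cikIN implicitly as a mapping from input node to the stage in which it is read (and likewise for outputs); the schedules and the name `ports_respected` are hypothetical.

```python
from collections import Counter


def ports_respected(read_stage, write_stage, m, n):
    """Check sum_i c_ik^IN <= m and sum_i c_ik^OUT <= n for every stage k."""
    reads = Counter(read_stage.values())    # inputs accessed per stage
    writes = Counter(write_stage.values())  # outputs written per stage
    return (all(count <= m for count in reads.values())
            and all(count <= n for count in writes.values()))


# Hypothetical schedules for an AFU with 2 inputs and 4 outputs.
serialised_reads = {"in1": 0, "in2": 1}                 # one read per stage
parallel_reads = {"in1": 0, "in2": 0}                   # both in stage 0
writes = {"out1": 1, "out2": 1, "out3": 2, "out4": 2}   # two writes per stage

ok = ports_respected(serialised_reads, writes, m=1, n=2)
clash = ports_respected(parallel_reads, writes, m=1, n=2)
```

With one read port and two write ports, spreading the two reads over two stages satisfies the constraints, while reading both inputs in the same stage does not.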
20
Objective Function
Saving in cycles should be maximized
  SW (S) – HW (S) should be maximum
  SW (S) = ∑ xi SW (ni)
  HW (S) = R
Any processor model where SW (S) and HW (S) can be computed using linear inequalities can be handled using ILP
21
Experimental Setup
[Figure: experimental flow. The input dataflow graph feeds three flows: ISE selection (Atasu03) followed by I/O serialisation (Pozzi05); ISE selection (Atasu03) with no serialisation; and the ILP method. Annotations: exp / subopt, exp / opt]
22
Results (1 of 3)
[Figure: results for the viterbi, adpcmdecoder and adpcmcoder benchmarks, comparing no pipelining, Pozzi’s algorithm and the ILP method]
23
Results (2 of 3)
Pozzi’s algorithm takes several hours on this benchmark and produces inferior results
Benchmark: aes
Biggest dataflow graph: 703
[Figure: two result panels, "After 3 minutes" and "After an hour"]
24
Results (3 of 3)
The best AFU with 22 inputs and 22 outputs
25
Conclusions
ISE Selection: Atasu03, Yu07, Chen07, Bonzini07
I/O Serialisation: Pozzi05, Pothineni07
The methodology can be generalized for a large class of processor models
Optimal, single run algorithm