Energy Efficient Hardware Synthesis of Polynomial Expressions
18th International Conference on VLSI Design
Anup Hosangadi
Ryan Kastner
ECE Department, UCSB
Farzan Fallah
Advanced CAD Research
Fujitsu Labs of America
Outline
– Introduction
– Related work
– Problem formulation
– Algorithms for optimizing polynomials
– Experimental results
– Conclusions
Introduction
Embedded system applications need to compute polynomial expressions
– Continuous functions can be approximated by Taylor series
– Adaptive (polynomial) filters
– Polynomial interpolation/extrapolation in computer graphics
– Encryption
Introduction
Commonly occurring computations are implemented in hardware
– More flexibility than a processor architecture
– NPAs (hardware accelerators) in the PICO project
– Custom instructions (Tensilica)
– Up to 100 times improvement over a processor implementation (Kastner et al., TODAES'02)
Develop techniques for reducing power consumption
Related Work (Behavioral transforms)
Power consumption depends on many factors
– Reducing the number of operations
   Hardware: (Nguyen and Chatterjee, TVLSI'00)
   Software: (I. Hong et al., TODAES'99)
– Voltage reduction after speedup transformations: retiming, pipelining, algebraic restructuring (Chandrakasan et al., TCAD'95)
Related Work
Scheduling and resource allocation
– Shutting down unused resources (Monteiro et al., DAC'96)
– Allocation of registers, functional units and interconnects (A. Raghunathan et al., ICCD'94)
Multiple Vdd scheduling
– Assigning a supply voltage to each operation in the CDFG (M. Chang and M. Pedram, TVLSI'97)
Related Work
Switching power is proportional to the number of operations
  $P_{sw} = C_{avg} \cdot f \cdot V_{dd}^{2}$
Multiplications are expensive in embedded systems
– On average 40 times more power than an addition at 5 V (V. Krishna et al., VLSI Design 1999)
Careful optimization of expressions is therefore necessary to save power
Reducing operations in polynomial expressions
No good tool for polynomials
– Designers rely on hand-optimized libraries
Conventional compiler techniques (CSE and value numbering) are not suited for polynomials
Horner form: most popular representation
– a_n x^n + a_{n-1} x^{n-1} + … + a_1 x + a_0 = (…((a_n x + a_{n-1})x + a_{n-2})x + … + a_1)x + a_0
– Not good for multivariate polynomials
– Only a single polynomial expression at a time
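As an illustration of the Horner form above, here is a minimal Python sketch (not from the slides; the function names `horner` and `naive` are made up for this example). It evaluates a univariate polynomial with n multiplications and compares the result against direct evaluation.

```python
def horner(coeffs, x):
    """Evaluate a_n*x^n + ... + a_1*x + a_0, with coeffs = [a_n, ..., a_0].

    Uses one multiplication and one addition per coefficient after the first.
    """
    result = coeffs[0]
    for a in coeffs[1:]:
        result = result * x + a
    return result


def naive(coeffs, x):
    """Reference evaluation: sum of a_i * x^i, computed term by term."""
    n = len(coeffs) - 1
    return sum(a * x ** (n - i) for i, a in enumerate(coeffs))


# 2x^3 - 3x^2 + 5 at x = 4
print(horner([2, -3, 0, 5], 4), naive([2, -3, 0, 5], 4))
```

The nesting works on one variable of one polynomial at a time, which is why it does not share work across several multivariate expressions.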
Comparison with Horner form
Quartic-spline polynomial (3-D graphics)
  P = zu^4 + 4avu^3 + 6bu^2v^2 + 4uv^3w + qv^4
Horner form (from Maple™)
  P = zu^4 + (4au^3 + (6bu^2 + (4uw + qv)v)v)v   (17 multiplications)
Proposed algebraic method:
  d1 = v^2 ; d2 = d1*v
  P = u^3(uz + a*d2) + d1(q*d1 + u(w*d2 + 6bu))   (11 multiplications)
Related Work (Polynomial Expressions)
Expression factorization (M.A. Breuer, JACM'69)
– Allows only one kind of operator at a time
Using symbolic algebra (M.A. Peymandoust, De Micheli)
– Mapping polynomial datapaths to libraries (DAC'01)
– Low power embedded software (DATE'02)
– Results depend heavily on the set of library elements, e.g. (a^2 – b^2) = (a+b)(a–b) iff (a+b) or (a–b) is a library element
– Manipulates only a single expression at a time
   F1 = A + B + C + D;
   F2 = A + P + D;
   => Extract (A + D)
Motivating Example
Consider the set of expressions
  P1 = x^3y + x^2y^2z
  P2 = 4x + 4yz – xyz
  P3 = 4xy – x^2y
  16 multiplications and 4 additions/subtractions
Using CSE
  [P1, P2, P3 rewritten over common subexpressions d1, d2, d3; the equations are not legible in this transcript]
  12 multiplications and 4 additions/subtractions
Motivating Example
Using Horner transform
  P1 = x^2(xy + y^2z)
  P2 = 4yz + (4 – yz)x
  P3 = (4y – yx)x
  12 multiplications and 4 additions/subtractions
Using our algebraic technique
  d1 = x + yz     P1 = x*d1*d3
  d2 = 4 – x      P2 = 4*d1 – d3*z
  d3 = xy         P3 = d2*d3
  7 multiplications and 3 additions/subtractions
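The rewritten forms can be checked numerically. The sketch below is a sanity check under stated assumptions, not part of the original slides: the three expressions and the d1, d2, d3 decomposition are taken from the later slides of this talk, the Horner-style nesting is one possible nesting in x, and the function names are made up for illustration.

```python
import random


def original(x, y, z):
    # P1, P2, P3 as sums of products: 16 multiplications, 4 additions/subtractions
    return (x**3 * y + x**2 * y**2 * z,
            4 * x + 4 * y * z - x * y * z,
            4 * x * y - x**2 * y)


def horner_form(x, y, z):
    # One Horner-style nesting in x (assumed form, equivalent by expansion)
    return (x**2 * (x * y + y**2 * z),
            4 * y * z + x * (4 - y * z),
            x * (4 * y - x * y))


def factored(x, y, z):
    # Algebraic technique: 7 multiplications, 3 additions/subtractions
    d1 = x + y * z          # 1 multiplication, 1 addition
    d2 = 4 - x              # 1 subtraction
    d3 = x * y              # 1 multiplication
    return (x * d1 * d3,    # 2 multiplications
            4 * d1 - d3 * z,  # 2 multiplications, 1 subtraction
            d2 * d3)        # 1 multiplication


x, y, z = (random.randint(-9, 9) for _ in range(3))
print(original(x, y, z) == horner_form(x, y, z) == factored(x, y, z))
```

Integer inputs are used so that the comparison is exact rather than subject to floating-point rounding.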
Introduction to algebraic technique for redundancy elimination
Algebraic techniques in multi-level logic synthesis (MLLS)
– Decomposition and factoring reduce the number of literals
– Distill and Condense use rectangle covering methods
Polynomial expressions (our technique)
– Factoring and single-term common subexpressions reduce the number of multiplications
– Multiple-term common subexpressions reduce the number of additions and possibly multiplications
Key differences (generalization to handle higher orders)
– Kernelling techniques
– Finding single cube intersections
Introduction to our technique (Outline)
Find a subset of all possible subexpressions (kernel generation)
Transformation of Polynomial Expressions – Problem formulation
Extract multiple term common subexpressions and factors
Extract single term common factors
Introduction to our technique
Terminology
– Literal: a variable or a constant, e.g. a, b, 2, 3.14
– Cube: product of literals, e.g. +3a^2b, –2a^3b^2c
– SOP: sum of cubes, e.g. +3a^2b – 2a^3b^2c
– Cube-free expression: no literal or cube can divide all the cubes of the expression
– Kernel: a cube-free sub-expression of an expression, e.g. 3 – 2abc
– Co-kernel: a cube that is used to divide an expression to get a kernel, e.g. a^2b
Introduction to our Technique
Matrix Representation of Polynomial Expressions
– F = x^3y – xy^2z is represented by

   +/-   x   y   z
    +    3   1   0
    -    1   2   1

– Each row represents a product term
– Each column represents a variable/constant
– Each element (i,j) represents the power of variable j in term i
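The matrix representation maps directly onto a small data structure. The following Python sketch is illustrative only (the names `LITERALS`, `count_ops` and `evaluate` are assumptions, and the constant 4 is given its own column, following the "variable/constant" rule). It stores one (sign, exponent row) pair per term and counts operations for the three example expressions P1 = x^3y + x^2y^2z, P2 = 4x + 4yz - xyz, P3 = 4xy - x^2y.

```python
LITERALS = ("x", "y", "z", "4")  # column order; the constant 4 is treated as a literal

# One row per product term: (sign, exponents aligned with LITERALS)
F = [(+1, (3, 1, 0, 0)), (-1, (1, 2, 1, 0))]                        # x^3y - xy^2z
P1 = [(+1, (3, 1, 0, 0)), (+1, (2, 2, 1, 0))]                       # x^3y + x^2y^2z
P2 = [(+1, (1, 0, 0, 1)), (+1, (0, 1, 1, 1)), (-1, (1, 1, 1, 0))]   # 4x + 4yz - xyz
P3 = [(+1, (1, 1, 0, 1)), (-1, (2, 1, 0, 0))]                       # 4xy - x^2y


def count_ops(poly):
    """Return (multiplications, additions/subtractions) of the unoptimized form."""
    mults = sum(max(sum(exps) - 1, 0) for _, exps in poly)  # k literals need k-1 products
    adds = len(poly) - 1
    return mults, adds


def evaluate(poly, values):
    """Evaluate the polynomial; values maps each literal name to a number."""
    total = 0
    for sign, exps in poly:
        term = sign
        for lit, e in zip(LITERALS, exps):
            term *= values[lit] ** e
        total += term
    return total


total_mults = sum(count_ops(p)[0] for p in (P1, P2, P3))
total_adds = sum(count_ops(p)[1] for p in (P1, P2, P3))
print(total_mults, total_adds)
```

Counting a term with k literal occurrences as k - 1 multiplications reproduces the 16 multiplications and 4 additions/subtractions quoted for the unoptimized example.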
Generation of Kernels (example)
P1 = x^3y + x^2y^2z, {L} = {x, y, z}
– Divide by x: Ft = P1/x = x^2y + xy^2z

   P1:            Ft = P1/x:
   x  y  z        x  y  z
   3  1  0        2  1  0
   2  2  1        1  2  1
Generation of Kernels (example)
Ft = P1/x = x^2y + xy^2z
C = biggest cube dividing all cubes of Ft = (1 1 0) = xy

   Ft:            C:             Ft / C:
   x  y  z        x  y  z        x  y  z
   2  1  0        1  1  0        1  0  0
   1  2  1                       0  1  1
Generation of Kernels (example)
Obtain kernel: F1 = Ft/C = (x^2y + xy^2z)/(xy) = (x + yz)
Obtain co-kernel: D1 = x*(xy) = x^2y
– No kernels within F1. Go back to P1
P1 = x^3y + x^2y^2z
– Divide now by the next variable, y: Ft = x^3 + x^2yz
– C = x^2
– But C contains x, and x comes before y in the literal order (x < y)
Stop here, to avoid repeating the same kernel Ft/C = (x + yz)
– No more kernels extracted
– Record kernel F1 = P1 with co-kernel '1'
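The walkthrough above can be sketched as a short recursive routine. This is a minimal illustration under assumptions, not the authors' implementation: terms are (sign, exponent row) pairs over the literal order x, y, z, 4 (the constant 4 is treated as a literal), the names `common_cube`, `divide` and `kernels` are made up, and the final "record the expression itself with co-kernel 1" step is left to the caller.

```python
LITERALS = ("x", "y", "z", "4")


def common_cube(terms):
    """Biggest cube dividing every term: column-wise minimum of the exponents."""
    return tuple(min(col) for col in zip(*(exps for _, exps in terms)))


def divide(terms, cube):
    """Divide every term by a cube (subtract exponents)."""
    return [(s, tuple(a - b for a, b in zip(exps, cube))) for s, exps in terms]


def kernels(terms, start=0, cokernel=None):
    """Return a list of (co-kernel, kernel) pairs, excluding the trivial pair."""
    n = len(LITERALS)
    if cokernel is None:
        cokernel = (0,) * n
    found = []
    for i in range(start, n):
        with_lit = [(s, e) for s, e in terms if e[i] > 0]
        if len(with_lit) < 2:
            continue                      # literal i is not shared by two terms
        unit = tuple(int(k == i) for k in range(n))
        ft = divide(with_lit, unit)       # Ft = F / literal_i
        c = common_cube(ft)               # C = biggest cube dividing Ft
        if any(c[k] > 0 for k in range(i)):
            continue                      # C holds an earlier literal: already found
        f1 = divide(ft, c)                # kernel
        d1 = tuple(a + b + u for a, b, u in zip(cokernel, c, unit))  # co-kernel
        found += kernels(f1, i, d1)       # look for kernels inside the kernel
        found.append((d1, f1))
    return found


P1 = [(+1, (3, 1, 0, 0)), (+1, (2, 2, 1, 0))]                      # x^3y + x^2y^2z
P2 = [(+1, (1, 0, 0, 1)), (+1, (0, 1, 1, 1)), (-1, (1, 1, 1, 0))]  # 4x + 4yz - xyz
for cok, ker in kernels(P1) + kernels(P2):
    print(cok, ker)
```

For P1 this yields the single pair co-kernel x^2y, kernel x + yz; dividing by y is pruned because the common cube x^2 contains the earlier literal x, which is the stopping rule shown above.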
Concept of kernels and co-kernels
Theorem: Two expressions f and g can have a multiple-term common subexpression iff there are two kernels Kf and Kg having a multiple-term intersection
Multiple-term common subexpressions are detected by intersecting the sets of kernels
Each co-kernel : kernel pair represents a possible factorization
– e.g. x^3y + x^2y^2z = [x^2y](x + yz)
The set of kernels is a subset of all possible subexpressions
All Kernels and Co-Kernels
  P1 = x^3y + x^2y^2z
  P2 = 4x + 4yz – xyz
  P3 = 4xy – x^2y
Kernels, written as [kernel](co-kernel):
  P1: [x + yz](x^2y), [x^3y + x^2y^2z](1)
  P2: [4 – x](yz), [4 – yz](x), [x + yz](4), [4x + 4yz – xyz](1)
  P3: [4 – x](xy), [4xy – x^2y](1)
Which kernels to choose?
Kernel Cube Matrix (KCM)
One row for each kernel generated
One column for each distinct kernel cube
Each non-zero element represents a term (term number in parentheses)

                  Kernel cubes
   Co-kernel      x       yz      4       -yz     -x
   4              1(3)    1(4)    0       0       0
   x^2y           1(1)    1(2)    0       0       0
   x              0       0       1(3)    1(5)    0
   xy             0       0       1(6)    0       1(7)
   yz             0       0       1(4)    0       1(5)
Finding Kernel Intersections(Distill Algorithm)
Each kernel intersection or factor appears as a rectangle
– Rectangle: set of rows and columns such that all elements are '1'
Value of a rectangle = weighted sum of the energy savings of the different operations
Goal: maximum-valued rectangular covering of the KCM
Greedy heuristic: covering by prime rectangles
Modeling value function of a rectangle
Formula for the weighted sum of energy savings on selection of a rectangle
– R = # of rows; C = # of columns
– M(Ri) = # of multiplications in row (co-kernel) i
– M(Ci) = # of multiplications in column (kernel cube) i
– m = ratio of the average energy consumption of a multiplication to that of an addition in the target library

$$\text{Value} = m \left\{ (C-1)\Big(R + \sum_{R} M(R_i)\Big) + (R-1)\sum_{C} M(C_i) \right\} + (R-1)(C-1)$$
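A small sketch of the value function, assuming it has the form m * { (C-1) * (R + sum of M(Ri)) + (R-1) * sum of M(Ci) } + (R-1)(C-1). That form is a reconstruction from the symbol definitions on this slide; it reproduces the two values quoted on the Distill slides (201 and 80 for m = 40). The function name `rectangle_value` is made up for this example.

```python
def rectangle_value(row_mults, col_mults, m):
    """Weighted energy savings of a KCM rectangle.

    row_mults: M(Ri) for each selected row (multiplications inside each co-kernel)
    col_mults: M(Ci) for each selected column (multiplications inside each kernel cube)
    m: energy of one multiplication relative to one addition
    """
    R, C = len(row_mults), len(col_mults)
    saved_mults = (C - 1) * (R + sum(row_mults)) + (R - 1) * sum(col_mults)
    saved_adds = (R - 1) * (C - 1)
    return m * saved_mults + saved_adds


# Rows {4, x^2y}, columns {x, yz}: extracts d1 = x + yz
first = rectangle_value(row_mults=[0, 2], col_mults=[0, 1], m=40)
# Row {xy}, columns {4, -x}: extracts d2 = 4 - x
second = rectangle_value(row_mults=[1], col_mults=[0, 0], m=40)
print(first, second)
```

Here a cube with k literal occurrences is counted as k - 1 multiplications, so M(4) = M(x) = 0, M(yz) = M(xy) = 1 and M(x^2y) = 2.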
Distill Algorithm
                  Kernel cubes
   Co-kernel      x       yz      4       -yz     -x
   4              1(3)    1(4)    0       0       0
   x^2y           1(1)    1(2)    0       0       0
   x              0       0       1(3)    1(5)    0
   xy             0       0       1(6)    0       1(7)
   yz             0       0       1(4)    0       1(5)

Select the rectangle with rows {4, x^2y} and columns {x, yz}
  d1 = (x + yz)
  4x + 4yz = 4*d1
  x^3y + x^2y^2z = x^2y*d1
Saves 5 multiplications and 1 addition
Value = 201 units (m = 40)
Distill Algorithm
                  Kernel cubes
   Co-kernel      x       yz      4       -yz     -x
   4              1(3)    1(4)    0       0       0
   x^2y           1(1)    1(2)    0       0       0
   x              0       0       1(3)    1(5)    0
   xy             0       0       1(6)    0       1(7)
   yz             0       0       1(4)    0       1(5)

Remove covered terms (1 to 4), then select the rectangle with row {xy} and columns {4, -x}
  d2 = 4 – x
  4xy – x^2y = xy*d2
Saves 2 multiplications
Value = 80
Distill Algorithm
The Distill algorithm exits when no more kernel intersections can be found
  P1 = x^2y*d1        d1 = x + yz
  P2 = 4*d1 – xyz     d2 = 4 – x
  P3 = xy*d2
Can be optimized further by finding single cube intersections
Finding single cube intersections (Condense algorithm)
Form the Cube Literal Matrix (CLM)
– One row for each cube
– One column for each literal
– E.g. 2 cubes F1 = a^4b^3c and F2 = a^2b^4c^2

        a   b   c
   F1   4   3   1
   F2   2   4   2
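In this representation, the largest cube shared by a set of rows is the column-wise minimum of their exponents. A minimal sketch for the two cubes above (the helper names `common_cube` and `quotient` are illustrative, not from the slides):

```python
def common_cube(rows):
    """Largest cube dividing all rows: minimum exponent in each column."""
    return tuple(min(col) for col in zip(*rows))


def quotient(row, cube):
    """What is left of a cube after dividing out the common cube."""
    return tuple(a - b for a, b in zip(row, cube))


clm = [(4, 3, 1),   # F1 = a^4 b^3 c
       (2, 4, 2)]   # F2 = a^2 b^4 c^2

shared = common_cube(clm)                      # exponents of a, b, c in the shared cube
leftovers = [quotient(r, shared) for r in clm]
print(shared, leftovers)
```

The shared cube here is a^2b^3c, so F1 = a^2 * (a^2b^3c) and F2 = bc * (a^2b^3c).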
Finding single cube intersections (Condense algorithm)
Each (single-term) common subexpression appears as a rectangle
– Rectangle: set of rows and columns where all elements are non-zero
Value of a rectangle is the number of multiplications saved by selecting it
– C = cube corresponding to the rectangle
– Value = (Rows – 1) × ((Σ C[i]) – 1), since the cube itself must still be computed once
A maximum-valued rectangular covering gives the minimum number of multiplications
Use greedy iterative covering by prime rectangles
Cube Literal Matrix (Condense Algorithm)
CLM for our example after the Distill algorithm

                    Literals
   Term   +/-   x   y   z   4   d1   d2
   1      +     2   1   0   0   1    0
   2      +     0   0   0   1   1    0
   3      -     1   1   1   0   0    0
   4      +     1   1   0   0   0    1
   5      +     1   0   0   0   0    0
   6      +     0   1   1   0   0    0
   7      +     0   0   0   1   0    0
   8      -     1   0   0   0   0    0

Save 2 multiplications by extracting xy
C = xy
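A sketch of this extraction step on the CLM of the example. Assumptions: the net saving is counted as (rows covered - 1) × (sum of the cube's exponents - 1), which matches the 2 multiplications reported for xy; signs are omitted; the helper names `covers`, `net_savings` and `extract` are made up for illustration.

```python
LITERALS = ("x", "y", "z", "4", "d1", "d2")

CLM = [
    (2, 1, 0, 0, 1, 0),  # term 1: x^2 * y * d1
    (0, 0, 0, 1, 1, 0),  # term 2: 4 * d1
    (1, 1, 1, 0, 0, 0),  # term 3: x * y * z
    (1, 1, 0, 0, 0, 1),  # term 4: x * y * d2
    (1, 0, 0, 0, 0, 0),  # term 5: x    (in d1 = x + yz)
    (0, 1, 1, 0, 0, 0),  # term 6: y*z  (in d1 = x + yz)
    (0, 0, 0, 1, 0, 0),  # term 7: 4    (in d2 = 4 - x)
    (1, 0, 0, 0, 0, 0),  # term 8: x    (in d2 = 4 - x)
]


def covers(row, cube):
    """True if the cube divides the term in this row."""
    return all(a >= b for a, b in zip(row, cube))


def net_savings(clm, cube):
    """Multiplications saved by computing the cube once and reusing it."""
    rows = sum(covers(r, cube) for r in clm)
    return (rows - 1) * (sum(cube) - 1)


def extract(clm, cube):
    """Divide the cube out of every row it covers; other rows are unchanged."""
    return [tuple(a - b for a, b in zip(r, cube)) if covers(r, cube) else r
            for r in clm]


xy = (1, 1, 0, 0, 0, 0)
print(net_savings(CLM, xy))
print(extract(CLM, xy)[:4])
```

The cube xy divides terms 1, 3 and 4, so it is computed once (as d3) and reused three times. A fuller version would also add a column for the new literal d3.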
Condense Algorithm
Extracting xy

                    Literals
   Term   +/-   x   y   z   4   d1   d2
   1      +     1   0   0   0   1    0
   2      +     0   0   0   1   1    0
   3      -     0   0   1   0   0    0
   4      +     0   0   0   0   0    1
   5      +     1   0   0   0   0    0
   6      +     0   1   1   0   0    0
   7      +     0   0   0   1   0    0
   8      -     1   0   0   0   0    0

No more favorable cube intersections found
Final Implementation
  d1 = x + yz     P1 = x*d1*d3
  d2 = 4 – x      P2 = 4*d1 – d3*z
  d3 = xy         P3 = d2*d3
– Total 7 multiplications, 3 additions/subtractions
– Savings of 5 multiplications, 1 addition/subtraction compared to CSE
Impossible to obtain such results using conventional techniques
Experimental setup
Polynomials used in computer graphics and signal processing
1.0 µ technology library, characterized for power consumption
Synthesized using Synopsys Design Compiler™
– Min hardware constraints (1 adder + 1 multiplier)
– Med hardware constraints (max 4 multipliers)
Experimental setup
Estimated power using Synopsys Power Compiler™ for random inputs, using an RTL simulator (VCS™)
Compared energy consumption with CSE and Horner form
Compared energy after voltage scaling
Results (Comparing operations)
M = multiplications, A = additions

          Original      CSE           Horner        Our technique
          M      A      M      A      M      A      M      A
   ex1    23     4      16     4      17     4      13     4
   ex2    34     5      22     5      23     5      16     5
   ex3    32     8      18     8      18     8      11     8
   ex4    43     17     24     17     19     17     17     17
   ex5    34     6      23     6      20     6      13     6
   Avg    33.2   8      20.6   8      19.4   8      14     8
Results (Min Hardware constraints)
C = compared with CSE, H = compared with Horner form

          Area            Energy          Energy-Delay    Energy (Scaled V)
          C       H       C       H       C       H       C       H
   ex1    7.5     0.1     13.6    25.6    20.4    39.4    24.6    49.5
   ex2    0.3     -4.2    21.6    29.3    39.0    48.8    52.2    64.6
   ex3    -7.5    -24.2   29.4    10.4    47.6    25.9    62.2    36.9
   ex4    5.6     2.5     37.0    28.7    57.1    46.1    74.3    59.8
   ex5    3.7     2.0     44.8    36.8    62.8    54.8    78.3    69.7
   Avg    1.9     -4.8    29.3    26.1    45.4    43.0    58.3    56.1
Results (Med Hardware constraints)
C = compared with CSE, H = compared with Horner form

          Area            Energy          Energy-Delay    Energy (Scaled V)
          C       H       C       H       C       H       C       H
   ex1    30.5    3.9     16.1    39.2    9.7     44.1    9.7     55.0
   ex2    14.8    1.0     9.7     29.6    20.3    58.7    22.7    75.4
   ex3    8.3     3.7     42.5    29.1    44.9    37.0    51.8    45.0
   ex4    8.9     9.0     28.2    29.5    39.5    40.6    47.4    48.3
   ex5    8.0     6.6     41.4    40.8    58.4    59.7    72.6    75.9
   Avg    14.1    4.9     27.6    33.6    34.6    48.0    40.8    60.0
Conclusions
Technique to reduce number of operations in polynomial expressions
Large savings in energy consumption observed over CSE and Horner methods
Need to consider scheduling and resource allocation to obtain further improvements