Challenges in Automatic Optimization of Arithmetic Circuits

Post on 06-Jan-2016

40 views 1 download

description

csda. csda. Challenges in Automatic Optimization of Arithmetic Circuits. Ajay K. Verma , Philip Brisk and Paolo Ienne Processor Architecture Laboratory (LAP) & Centre for Advanced Digital Systems (CSDA) Ecole Polytechnique Fédérale de Lausanne (EPFL). - PowerPoint PPT Presentation

Transcript of Challenges in Automatic Optimization of Arithmetic Circuits

Ajay K. Verma, Philip Brisk and Paolo Ienne

Processor Architecture Laboratory (LAP)& Centre for Advanced Digital Systems (CSDA)

Ecole Polytechnique Fédérale de Lausanne (EPFL)

csda

csda

Challenges in Automatic Challenges in Automatic OptimizationOptimization

of Arithmetic Circuitsof Arithmetic Circuits

2

Circuit PerformanceCircuit PerformanceDepends Heavily on the DescriptionDepends Heavily on the Description

“Software”Multiplier

Multiplier withOptimized

Compressor Tree

Multiplier withCompressor Tree

3

Pre-Synthesis Optimization of Pre-Synthesis Optimization of Arithmetic CircuitsArithmetic Circuits

Original Circuit Description

Original Circuit Description

Physical DesignPhysical Design

Logic SynthesisLogic Synthesis

Arithmetic optimizationsArithmetic optimizations

Known architectures

Automaticarchitecture exploration

4

Automation and Computer Automation and Computer ArithmeticArithmetic

Automation Heuristics to optimize general

classes of circuits Kernel and co-kernel

extraction [Brayton82] Decomposition based

approaches for general circuits [Bertacco97, Mishchenko01, Yang02]

Algorithmic approaches for a particular class of circuits Variable group size CLA adder

[Lee91] Irregular partial product

compressors [Stelling98]

5

Logic SynthesisLogic Synthesis

Synthesis tools have become extremely good in optimizing circuits expressed in Sum-Of-Product form

And when there are plenty of XOR gates?

QPR

bbbbbbbbQ

aaaaaaaaP

43322110

43322110

Before expansion : 0.37 ns (138.2 μm2) After expansion : 0.26 ns (146.9 μm2)

Before expansion : 0.22 ns (58.8 μm2) After expansion : 0.27 ns (221.2 μm2)

QPR

babababaQ

babababababaP

)( 03213012

032121300330

)( QPPQ

?

6

OutlineOutline

Verma & Ienne; ICCAD 2004Verma, Brisk, & Ienne; TCAD 2008

Optimizingat Word-Level

Verma & Ienne; DAC 2006

ExploringMicroscopicStructure

CreatingMacroscopic

Structure

Verma, Brisk, & Ienne; DAC 2007Best Paper Award nominee

Verma, Brisk, & Ienne; IWLS 2008

7

OutlineOutline

Optimizingat Word-Level

CreatingMacroscopic

Structure

ExploringMicroscopicStructure

Low Complexity HighHigh Granularity Low

8

OutlineOutline

Optimizingat Word-Level

CreatingMacroscopic

Structure

ExploringMicroscopicStructure

9

Clustering: Maximization of the Use of Clustering: Maximization of the Use of

Carry-Save RepresentationCarry-Save Representation

Two addition nodes areseparated by NOT

The two addition nodesare clustered

Goal: Swap the adders with other logic operations while

preserving the semantics to cluster additions

10

Examples of TransformationsExamples of Transformations

(A << k) A . 2k

Advancing shift left over add (distributivity of multiplication over addition)

Advancing shift right overaddition is more complex

Advancing SEL over add (existence of the identity element of addition)

(C ? A : D) + (C ? B : 0)C ? (A + B) : D

11

Some Transformations Have a Some Transformations Have a CostCost

Advancing PP over add (distributive property of multiplication over addition)

This transformation has a

significant cost in terms of area!

12

Generation of All Pareto-Optimal Generation of All Pareto-Optimal ImplementationsImplementations

Theorem: The transformations form apersistent and confluent reduction system

Pareto-optimal: better than any other in terms of

area or critical-path delay

13

Example: Example: adpcmdecodeadpcmdecode Kernel Kernel

Compressor tree

AND network

0.85 ns,5678 μm2

0.51 ns,4901 μm2

14

OutlineOutline

Optimizingat Word-Level

CreatingMacroscopic

Structure

ExploringMicroscopicStructure

Limited scope for optimizations

Bit-level

15

Implementation of Subcircuits Implementation of Subcircuits Corresponding to Contiguous Layers Can Corresponding to Contiguous Layers Can

Be ImprovedBe Improved

ADD

LZD

Arithmetic

Logic

Leading Zero AnticipatorA direct implementation of LZA

in carry-select fashion [Gerwig99]

16

Input CondensationInput Condensation

Leader expressions: • Sufficient to evaluate the whole of an expression• Once you evaluate them, you can discard the input bits

Compute all leader expressions in parallel

Recursively compute leader expressions

again

IN

SomeLargeCircuit

OUT

IN

L |L| < |IN|

Smaller circuit

OUT

sc

8-input parallel counter

Leader expressions

17

Progressive Decomposition: Progressive Decomposition: Algorithm OverviewAlgorithm Overview

Choose a subset of input bitsHow many bits?Many different combinations?

Find leader expressionsOptimize via Boolean ring propertiesFind identities

Discard dependent expressions

x y zz = f(x, y)

Rewrite circuit in terms of leader expressions Recursively process the remaining circuit

18

Example: 3-Input Adder (sExample: 3-Input Adder (s22 Output)Output)

X = [a1b1 + (a1 + b1)a0b0] [(a1 b1 a0b0)c1 + c0(a0 b0)(c1 + (a1 b1 a0b0))]

Ripple-Carry Adder

L(X, {a1, b1, c1}) = {a1 b1 c1, a1b1 b1c1 a1c1}

sum carry Carry-save adder

0

++

X

++

a0 b0a1 b1

+

c1 c0

0

0Ripple-Carry Adder

3:2 Compressor CSA

a0 b0 c0

CSA

a1 b1 c1

++

0

0

X

19

A Better Division Is Used for A Better Division Is Used for Leader Expression Computation Leader Expression Computation

X = ab (c d e) cd (a b e)

X = (ab + cd) (a b c d e)

Based on the identity: pq (p q) = 0

Theorem: An expression of the form (PQ RS) can be factored as (P R) T, if there exist U and V such that 1) PU = RV = 0 and 2) Q S = U V

The ideal membership problem can be used to determine the existence of such U and V

20

Progressive Decomposition: Progressive Decomposition: Qualitative AnalysisQualitative Analysis

Completely agnostic of the type of circuit to optimize

Automatically infers successful circuit designs from the literature… Carry-lookahead adder (beyond minimal sizes)

Structured LZD/LOD circuit

Optimized LZA circuit (no sum computation)

Carry-save addition

Parallel counter

…and discovers some unknown to us!

Multi-Input comparisons (min/max)

21

Multi-Input ComparatorMulti-Input Comparator(Min/max of k n-bit Integers)(Min/max of k n-bit Integers)

Binary tree of comparatorsNumber of comparators: k − 1

Critical path delay: O(log n log k)Hardware area: O(kn)

0.46 ns, 1755 μm2

Pairwise comparison of inputs

Number of comparators: k (k − 1)/2Critical path delay: O(log n + log k)Hardware area: O(k2n)

0.21 ns, 3479 μm2

log*() is the number of times the logarithm function must be iteratively applied before the result is ≤ 1 – e.g., log*(265536) = 5

With Our Structuring Algorithm:

Bitwidth reduction using dominators and LODsNumber of LODs: k log* n

Critical path delay: O(log n + log k log* n)Hardware area: O(kn)

0.22 ns, 1331 μm2

22

OutlineOutline

Optimizingat Word-Level

CreatingMacroscopic

Structure

ExploringMicroscopicStructure

Reed-Muller formcan be very inefficient

Efficient implementationof the leader expressions ?

ExhaustiveExploration

23

Problem StatementProblem Statement

Given a set of Boolean expressions, generate all their Pareto-optimal implementations

no “reuse”total “reuse”

selective “reuse”

24

EnumeratingEnumeratingCommon Sub-ExpressionsCommon Sub-Expressions

Root: Original Reed-Muller form

Eitherxy or

xyreplaced by a new variable

The nodes of the DAG correspond to all partial implementations of the two expressions with

some sharing between them

25

Pruning the Enumeration DAGPruning the Enumeration DAG

The size of DAG can be as large as O ((n + m) 2m), where n is the number of variables and m is the size of Boolean expressionsEnumerating the whole DAG is computationally

infeasible

Pruning CriteriaRecognizing node equivalence (width reduction)Merging some reductions into a single one

(height reduction)Delaying certain reductions (branch reduction)

26

There Is Scope for Further There Is Scope for Further Pruning…Pruning…

Area and delay for all 6-bit adders generated by our algorithm

Without any pruning, it would be impossible to handle

expressions with more than five variables

Number of possible implementations: >1060

Number of explored implementations: 2687

Number of actual Pareto-optimal

implementations: 4

27

……but the Enumeration Algorithm Finds but the Enumeration Algorithm Finds Interesting Non-Trivial Relations!Interesting Non-Trivial Relations!

0123 aaaa

00010203 babababa

0123 bbbb

10111213 babababa

20212223 babababa

30313233 babababa

0211200110 bababababa 1221 baba

0110 baba 1221 baba 021120 bababa

+

4x4-bit multiplier:better than our best manually-designed

cell-based multiplier?!

The method has been generalized for higher bitwidth multipliersIt reduced the delay of the best cell-based 8 x 8-bit multiplier by 10%

Verma & Ienne; ASP-DAC 2007Best Paper Award nominee

28

SummarySummary

Verma & Ienne; ICCAD 2004Verma, Brisk, & Ienne; TCAD 2008

Optimizingat Word-Level

Verma & Ienne; DAC 2006

ExploringMicroscopicStructure

CreatingMacroscopic

Structure

Verma, Brisk, & Ienne; DAC 2007Best Paper Award nominee

Verma, Brisk, & Ienne; IWLS 2008

29

Computer Arithmetic and Computer Arithmetic and AutomationAutomation

Computer Arithmetic has been for long the domain of extremely ingenuous manually developed architectures

Automation has mostly addressed the optimization of such architectures through the exploration of the predefined design spaces they delimit

Logic synthesis, from the “bottom”, has failed to explore beyond known territories due to fairly fundamental issues

It is perhaps high time to tryto change all this…

30

Thanks!