Challenges in Automatic Optimization of Arithmetic Circuits

Ajay K. Verma, Philip Brisk and Paolo Ienne

Processor Architecture Laboratory (LAP)& Centre for Advanced Digital Systems (CSDA)

Ecole Polytechnique Fédérale de Lausanne (EPFL)

Challenges in Automatic Challenges in Automatic OptimizationOptimization

of Arithmetic Circuitsof Arithmetic Circuits

Circuit PerformanceCircuit PerformanceDepends Heavily on the DescriptionDepends Heavily on the Description

“Software”Multiplier

Multiplier withOptimized

Compressor Tree

Multiplier withCompressor Tree

Pre-Synthesis Optimization of Pre-Synthesis Optimization of Arithmetic CircuitsArithmetic Circuits

Original Circuit Description

Physical DesignPhysical Design

Logic SynthesisLogic Synthesis

Arithmetic optimizationsArithmetic optimizations

Known architectures

Automaticarchitecture exploration

Automation and Computer Automation and Computer ArithmeticArithmetic

Automation Heuristics to optimize general

classes of circuits Kernel and co-kernel

extraction [Brayton82] Decomposition based

approaches for general circuits [Bertacco97, Mishchenko01, Yang02]

Algorithmic approaches for a particular class of circuits Variable group size CLA adder

[Lee91] Irregular partial product

compressors [Stelling98]

Logic SynthesisLogic Synthesis

Synthesis tools have become extremely good in optimizing circuits expressed in Sum-Of-Product form

And when there are plenty of XOR gates?

bbbbbbbbQ

aaaaaaaaP

43322110

Before expansion : 0.37 ns (138.2 μm2) After expansion : 0.26 ns (146.9 μm2)

Before expansion : 0.22 ns (58.8 μm2) After expansion : 0.27 ns (221.2 μm2)

babababaQ

babababababaP

)( 03213012

032121300330

)( QPPQ

OutlineOutline

Verma & Ienne; ICCAD 2004Verma, Brisk, & Ienne; TCAD 2008

Optimizingat Word-Level

Verma & Ienne; DAC 2006

ExploringMicroscopicStructure

CreatingMacroscopic

Structure

Verma, Brisk, & Ienne; DAC 2007Best Paper Award nominee

Verma, Brisk, & Ienne; IWLS 2008

OutlineOutline

CreatingMacroscopic

Structure

Low Complexity HighHigh Granularity Low

OutlineOutline

CreatingMacroscopic

Structure

Clustering: Maximization of the Use of Clustering: Maximization of the Use of

Carry-Save RepresentationCarry-Save Representation

Two addition nodes areseparated by NOT

The two addition nodesare clustered

Goal: Swap the adders with other logic operations while

preserving the semantics to cluster additions

Examples of TransformationsExamples of Transformations

(A << k) A . 2k

Advancing shift left over add (distributivity of multiplication over addition)

Advancing shift right overaddition is more complex

Advancing SEL over add (existence of the identity element of addition)

(C ? A : D) + (C ? B : 0)C ? (A + B) : D

Some Transformations Have a Some Transformations Have a CostCost

Advancing PP over add (distributive property of multiplication over addition)

This transformation has a

significant cost in terms of area!

Generation of All Pareto-Optimal Generation of All Pareto-Optimal ImplementationsImplementations

Theorem: The transformations form apersistent and confluent reduction system

Pareto-optimal: better than any other in terms of

area or critical-path delay

Example: Example: adpcmdecodeadpcmdecode Kernel Kernel

Compressor tree

AND network

0.85 ns,5678 μm2

0.51 ns,4901 μm2

OutlineOutline

CreatingMacroscopic

Structure

Limited scope for optimizations

Bit-level

Implementation of Subcircuits Implementation of Subcircuits Corresponding to Contiguous Layers Can Corresponding to Contiguous Layers Can

Be ImprovedBe Improved

Arithmetic

Leading Zero AnticipatorA direct implementation of LZA

in carry-select fashion [Gerwig99]

Input CondensationInput Condensation

Leader expressions: • Sufficient to evaluate the whole of an expression• Once you evaluate them, you can discard the input bits

Compute all leader expressions in parallel

Recursively compute leader expressions

SomeLargeCircuit

L |L| < |IN|

Smaller circuit

8-input parallel counter

Leader expressions

Progressive Decomposition: Progressive Decomposition: Algorithm OverviewAlgorithm Overview

Choose a subset of input bitsHow many bits?Many different combinations?

Find leader expressionsOptimize via Boolean ring propertiesFind identities

Discard dependent expressions

x y zz = f(x, y)

Rewrite circuit in terms of leader expressions Recursively process the remaining circuit

Example: 3-Input Adder (sExample: 3-Input Adder (s22 Output)Output)

X = [a1b1 + (a1 + b1)a0b0] [(a1 b1 a0b0)c1 + c0(a0 b0)(c1 + (a1 b1 a0b0))]

Ripple-Carry Adder

L(X, {a1, b1, c1}) = {a1 b1 c1, a1b1 b1c1 a1c1}

sum carry Carry-save adder

a0 b0a1 b1

0Ripple-Carry Adder

3:2 Compressor CSA

a0 b0 c0

a1 b1 c1

A Better Division Is Used for A Better Division Is Used for Leader Expression Computation Leader Expression Computation

X = ab (c d e) cd (a b e)

X = (ab + cd) (a b c d e)

Based on the identity: pq (p q) = 0

Theorem: An expression of the form (PQ RS) can be factored as (P R) T, if there exist U and V such that 1) PU = RV = 0 and 2) Q S = U V

The ideal membership problem can be used to determine the existence of such U and V

Progressive Decomposition: Progressive Decomposition: Qualitative AnalysisQualitative Analysis

Completely agnostic of the type of circuit to optimize

Automatically infers successful circuit designs from the literature… Carry-lookahead adder (beyond minimal sizes)

Structured LZD/LOD circuit

Optimized LZA circuit (no sum computation)

Carry-save addition

Parallel counter

…and discovers some unknown to us!

Multi-Input comparisons (min/max)

Multi-Input ComparatorMulti-Input Comparator(Min/max of k n-bit Integers)(Min/max of k n-bit Integers)

Binary tree of comparatorsNumber of comparators: k − 1

Critical path delay: O(log n log k)Hardware area: O(kn)

0.46 ns, 1755 μm2

Pairwise comparison of inputs

Number of comparators: k (k − 1)/2Critical path delay: O(log n + log k)Hardware area: O(k2n)

0.21 ns, 3479 μm2

log*() is the number of times the logarithm function must be iteratively applied before the result is ≤ 1 – e.g., log*(265536) = 5

With Our Structuring Algorithm:

Bitwidth reduction using dominators and LODsNumber of LODs: k log* n

Critical path delay: O(log n + log k log* n)Hardware area: O(kn)

0.22 ns, 1331 μm2

OutlineOutline

CreatingMacroscopic

Structure

Reed-Muller formcan be very inefficient

Efficient implementationof the leader expressions ?

ExhaustiveExploration

Problem StatementProblem Statement

Given a set of Boolean expressions, generate all their Pareto-optimal implementations

no “reuse”total “reuse”

selective “reuse”

EnumeratingEnumeratingCommon Sub-ExpressionsCommon Sub-Expressions

Root: Original Reed-Muller form

Eitherxy or

xyreplaced by a new variable

The nodes of the DAG correspond to all partial implementations of the two expressions with

some sharing between them

Pruning the Enumeration DAGPruning the Enumeration DAG

The size of DAG can be as large as O ((n + m) 2m), where n is the number of variables and m is the size of Boolean expressionsEnumerating the whole DAG is computationally

infeasible

Pruning CriteriaRecognizing node equivalence (width reduction)Merging some reductions into a single one

(height reduction)Delaying certain reductions (branch reduction)

There Is Scope for Further There Is Scope for Further Pruning…Pruning…

Area and delay for all 6-bit adders generated by our algorithm

Without any pruning, it would be impossible to handle

expressions with more than five variables

Number of possible implementations: >1060

Number of explored implementations: 2687

Number of actual Pareto-optimal

implementations: 4

……but the Enumeration Algorithm Finds but the Enumeration Algorithm Finds Interesting Non-Trivial Relations!Interesting Non-Trivial Relations!

0123 aaaa

00010203 babababa

0123 bbbb

10111213 babababa

20212223 babababa

30313233 babababa

0211200110 bababababa 1221 baba

0110 baba 1221 baba 021120 bababa

4x4-bit multiplier:better than our best manually-designed

cell-based multiplier?!

The method has been generalized for higher bitwidth multipliersIt reduced the delay of the best cell-based 8 x 8-bit multiplier by 10%

Verma & Ienne; ASP-DAC 2007Best Paper Award nominee

SummarySummary

Verma & Ienne; ICCAD 2004Verma, Brisk, & Ienne; TCAD 2008

Verma & Ienne; DAC 2006

CreatingMacroscopic

Structure

Verma, Brisk, & Ienne; DAC 2007Best Paper Award nominee

Verma, Brisk, & Ienne; IWLS 2008

Computer Arithmetic and Computer Arithmetic and AutomationAutomation

Computer Arithmetic has been for long the domain of extremely ingenuous manually developed architectures

Automation has mostly addressed the optimization of such architectures through the exploration of the predefined design spaces they delimit

Logic synthesis, from the “bottom”, has failed to explore beyond known territories due to fairly fundamental issues

It is perhaps high time to tryto change all this…

Thanks!

Challenges in Automatic Optimization of Arithmetic Circuits

Documents

Transcript of Challenges in Automatic Optimization of Arithmetic Circuits

Chapter 6 – Digital Arithmetic: Operations & Circuits.

EE141 © Digital Integrated Circuits 2nd Arithmetic Circuits 1 Digital Integrated Circuits A Design Perspective Arithmetic Circuits Jan M. Rabaey Anantha.

Arithmetic Circuits · Carnegie Mellon 3 Motivation: Arithmetic Circuits Core of every digital circuit Everything else is side-dish, arithmetic circuits are the heart of the digital

Chapter 6 Digital Arithmetic: Operations & Circuits - …dasan.sejong.ac.kr/~sbmoon/classes/2013-1/digital/Ch6.pdfChapter 6 – Digital Arithmetic: Operations & Circuits . ... arithmetic

Arithmetic Circuits in CMOS VLSIeng.staff.alexu.edu.eg/.../2014/Arithmetic_Circuits_in_CMOS_VLSI.pdf · ARITHMETIC CIRCUITS IN CMOS VLSI. Faculty of Engineering - Alexandria University

Digital Integrated Circuits© Prentice Hall 1995 Arithmetic Arithmetic Building Blocks.

arithmetic circuits-2people.ee.duke.edu/.../Lectures/ArithmeticCircuits_2.pdf · 2011. 12. 1. · Arithmetic Circuits-2 • Multipliers – Array multipliers • Shifters ... •

Arithmetic Circuits - Universiti Malaysia Pahangee.ump.edu.my/hazlina/teaching_ESD/teaching_ESD_new_chap4...Electronic System Design Arithmetic Circuits ... Fundamental arithmetic

Arithmetic Functions & Circuits

Chapter 6 – Digital Arithmetic: Operations & Circuits

arithmetic circuits-2 2004 - Duke Universityece.duke.edu/~jmorizio/ece261/classlectures/arith...Arithmetic Circuits-2 • Multipliers – Array multipliers • Shifters – Barrel

Chapter 12 Arithmetic Circuits in CMOS VLSI

Equivalence Checking of Arithmetic Circuits

Arithmetic Circuits - Homepage - System Security · PDF fileIf arithmetic circuits are optimized performance will improve ... Ex: 11001 ROR 2 = 01110

LOW POWER ARITHMETIC CIRCUITS USING … arithmetic circuits were designed by these ... In digital processing where inexact computing ... Then complex arithmetic operations like Fourier

06 Arithmetic Circuits

Cmos Arithmetic Circuits

Digital Integrated Circuitsj.guntzel/ine5442/slides/Arithmetic-circuits-1.pdf · EE141 2 © Digital Integrated Circuits 2nd Slide 2 Arithmetic Circuits A Generic Digital Processor

Contemporary Logic Design Arithmetic Circuits © R.H. Katz Lecture #24: Arithmetic Circuits -1 Arithmetic Circuits (Part II) Randy H. Katz University of.

Computer Organization and Design Arithmetic Circuits