AVD Report Dwip Bhavsar Himanshu Verma


  • 1 | P a g e

    A Report

    Design of 2-Phase Bundled Data Mousetrap 8x8 Adiabatic

    Multiplier-Accumulator Unit

    Submitted in partial fulfilment of the requirements of the course

    MEL G623: ADVANCED VLSI DESIGN

    Submitted To

    Dr. Anu Gupta

    Submitted By

    Dwip Bhavsar 2012H123118P

    Himanshu Verma 2012H123120P

    Birla Institute of Technology & Science, Pilani

    November 2013

  • 2 | P a g e

    Abstract

    This work presents the design of an asynchronous 8x8 Multiplier-Accumulator (MAC) Unit in UMC 180 nm technology. The design follows an asynchronous methodology because, in a synchronous design, the clock period, and hence the overall delay, is dictated by the slowest stage. The design is pipelined to increase throughput, and adiabatic modules have been built to reduce power consumption. A two-phase protocol is followed, in which data is latched on both the rising and the falling transitions of the handshake signal. The final unit consists of 8017 MOS devices and has a maximum pipeline operating frequency of 62.5 MHz. The power consumption is assessed to be 10.61 mW.

  • 3 | P a g e

    Contents

    Abstract 2
    1. Introduction 4
    1.1 Design And Operation Of A Pipelined 8x8 Adiabatic Multiplier Accumulator 5
    1.2 Design And Operation Of Multiplier 5
    1.3 Design And Operation Of Adder 6
    2. Pipelined 8 X 8 Adiabatic Multiplier Accumulator 7
    2.1 Adiabatic Circuits 7
    2.1.1 Adiabatic Full Adder 8
    2.1.2 Partial Product Generator 9
    2.2 Mousetrap Pipeline 11
    2.2.1 Xnor Gate 13
    2.2.2 Latch 13
    3. Multiplier 15
    4. Adder 17
    5. MAC 19
    6. Simulation Results 21
    6.1 Statistics 21
    6.2 Multiplier Simulation Output 22
    6.3 MAC Simulation Output 24
    7. Conclusion 26
    8. References 27

  • 4 | P a g e

    LIST OF FIGURES

    FIGURE 1 CARRY SAVE MULTIPLIER 6

    FIGURE 2 CARRY BYPASS ADDER 6

    FIGURE 3 (A) 2PASCL INVERTER CIRCUIT. (B) WAVEFORMS FROM THE SIMULATION 7

    FIGURE 4 ADIABATIC FULL ADDER 8

    FIGURE 5 8BIT RIPPLE ADDER 9

    FIGURE 6 ADIABATIC NAND-AND 10

    FIGURE 7 PARTIAL-PRODUCT GENERATION 10

    FIGURE 8 3-STAGE MOUSETRAP PIPELINE 11

    FIGURE 9 SINGLE STAGE OF MOUSETRAP 12

    FIGURE 10 MOUSETRAP PIPELINE WITH LOGIC BLOCK 12

    FIGURE 11 XNOR-XOR 13

    FIGURE 12 D LATCH 14

    FIGURE 13 16-BIT LATCH 14

    FIGURE 14 CARRY SAVE MULTIPLIER BLOCK 15

    FIGURE 15 SINGLE STAGE OF CARRY SAVE MULTIPLIER 16

    FIGURE 16 8X8 MULTIPLIER 16

    FIGURE 17 CARRY BYPASS 17

    FIGURE 18 CARRY BYPASS SINGLE STAGE 18

    FIGURE 19 CARRY BYPASS SCHEMATIC 18

    FIGURE 20 MAC 19

    FIGURE 21 MAC: MULTIPLIER 20

    FIGURE 22 MAC: ACCUMULATOR 20

    FIGURE 23 SIMULATION RESULTS: 1 22

    FIGURE 24 SIMULATION RESULTS: 2 23

    FIGURE 25 SIMULATION RESULTS: 3 23

    FIGURE 26 SIMULATION RESULTS: 4 24

    FIGURE 27 SIMULATION RESULTS: 5 24

    FIGURE 28 SIMULATION RESULTS: 6 25

    FIGURE 29 SIMULATION RESULTS: 7 25

  • 5 | P a g e

    1. Introduction

    1.1 Design And Operation of A Pipelined 8x8 Adiabatic Multiplier Accumulator

    In computing, the multiply-accumulate operation is a common step that computes the product of two numbers and adds that product to an accumulator. The hardware unit that performs the operation is known as a multiplier-accumulator (MAC) unit. The operation itself is also often called a MAC or a MAC operation. The operation can be expressed as follows:

    A ← A + (B × C)

    With the widespread use of mobile and wireless devices, and with clock and logic speeds rising to meet new performance requirements, energy efficiency has become a key design aspect in the field of integrated circuits (ICs). For digital circuits, which mostly use Complementary Metal-Oxide-Semiconductor (CMOS) technology, voltage scaling is one of the main strategies, because the dynamic power consumption is proportional to the square of the supply voltage. To maintain a high transistor drive current and thus achieve performance improvements, transistor thresholds must be scaled along with the supply voltage. However, scaling the threshold voltage Vt results in a substantial increase in sub-threshold leakage current.

    Power dissipation in conventional CMOS circuits primarily occurs during device switching. In contrast to conventional charging, the switching transition in adiabatic circuits is slowed down by the use of a time-varying voltage source instead of a fixed voltage supply. By spreading the charge transfer more evenly over the entire time available, the peak current is greatly reduced. In the 2PASCL circuits used in this work, both MOSFET diodes are used to recycle charge from the output node and to improve the discharging speed of internal signal nodes. By using two split-level sinusoidal waveforms, each with a peak-to-peak voltage of 0.9 V, the voltage difference between the current-carrying electrodes is minimized and, consequently, so is the power consumption. A 2PASCL-based NAND circuit can reduce power consumption by up to 62% at transition frequencies of 10 to 100 MHz.
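    The effect of slowing the charging transition can be illustrated with the standard first-order energy models for charging a capacitor through a resistance. The sketch below is a textbook approximation, not a result from this design, and the values of R_ON, C_LOAD, V_SWING and T_RAMP are hypothetical numbers chosen only for illustration.

```python
def conventional_energy(c_load, v_dd):
    """Energy lost when C is charged abruptly from a fixed supply: (1/2) * C * V^2."""
    return 0.5 * c_load * v_dd ** 2


def adiabatic_energy(r_on, c_load, v_swing, t_ramp):
    """First-order loss for a slow ramp (t_ramp >> R*C): (R*C / T) * C * V^2."""
    return (r_on * c_load / t_ramp) * c_load * v_swing ** 2


# Hypothetical illustration values (not taken from the report).
R_ON = 1e3        # on-resistance of the charging path, ohms
C_LOAD = 10e-15   # load capacitance, farads
V_SWING = 1.8     # voltage swing, volts
T_RAMP = 10e-9    # ramp time of the time-varying supply, seconds

e_conv = conventional_energy(C_LOAD, V_SWING)
e_adia = adiabatic_energy(R_ON, C_LOAD, V_SWING, T_RAMP)
print(f"conventional: {e_conv:.3e} J, adiabatic: {e_adia:.3e} J")
print(f"ratio adiabatic/conventional: {e_adia / e_conv:.4f}")
```

    In this model the adiabatic loss scales with R*C/T, so doubling the ramp time halves the dissipated energy, while the conventional loss is independent of how fast the node is charged.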

    1.2 Design And Operation Of Multiplier

    The design of the Multiplier-Accumulator (MAC) is highly modular and is composed of basic building blocks. The 8x8 multiplication is distributed over 8 pipeline stages, and a carry-save multiplier is used to avoid rippling carries within each row and so reduce multiplier latency. The structure has the advantage that its worst-case critical path is shorter and uniquely defined. The basic layout of the carry-save multiplier is shown in Figure 1. The multiplier is built using the blocks described below.

  • 6 | P a g e

    Figure 1 Carry Save Multiplier

    1.3 Design And Operation Of Adder

    The adder is the fundamental block in any arithmetic unit, and is often the speed-limiting circuit in a digital system. Hence, many parallel adder architectures have been proposed to increase speed while keeping area and power dissipation reasonable. One of the fastest and most efficient architectures in terms of area and power dissipation is the Carry Bypass Adder (CBA).

    An N-bit CBA is made up of N full adders, which are grouped into blocks whose size (i.e., the number of full adders per block) has to be chosen properly to minimize the computation time. The CBA architecture can be derived from that of a simple Ripple Carry Adder by observing that, when several contiguous full adders are in propagate mode (i.e., each of them has a carry output equal to its carry input), they can be bypassed to evaluate the carry output of the last one, since it is equal to the carry input of the first. Hence, in a CBA the full adders are divided into groups, and each group is bypassed by a multiplexer if all of its full adders are in propagate mode.

    Figure 2 Carry Bypass Adder

  • 7 | P a g e

    2. Pipelined 8 X 8 Adiabatic Multiplier Accumulator

    2.1 Adiabatic Circuits

    Figure 3 shows a circuit diagram and waveforms illustrating the operation of the 2PASCL inverter [10]. Both MOSFET diodes are used to recycle charge from the output node and to improve the discharging speed of internal signal nodes. Such a circuit design is particularly advantageous if the signal nodes are preceded by a long chain of switches. By using two split-level sinusoidal waveforms, each with a peak-to-peak voltage of 0.9 V, the voltage difference between the current-carrying electrodes can be minimized and, consequently, power consumption can be suppressed. The substrates of the pMOS and nMOS transistors are connected to and GND respectively. Since the criterion for maintaining thermal equilibrium, namely that the voltage between the current-carrying electrodes is zero when the transistors are in the ON state [4], is satisfied, the energy accumulated in the load capacitance CL is not dissipated. Moreover, sinusoidal waveforms can be generated with a higher energy efficiency than trapezoidal waveforms.

    Figure 3 (a) 2PASCL inverter circuit. (b) Waveforms from the simulation

    In 2PASCL operation, fewer dynamic switching events occur, because circuit nodes do not necessarily charge and discharge in every clock cycle. This reduces the node switching activity significantly, and the lower the switching activity, the lower the energy dissipation. A further benefit is that the logic behaves like static logic.

  • 8 | P a g e

    2.1.1 Adiabatic Full Adder

    The full adder (FA) is designed using the two-phase clocked adiabatic static CMOS logic (2PASCL) circuit technique. Adiabatic logic is discussed in the introduction. The circuit diagram of the adiabatic full adder is shown in Figure 4. The full adder consists of two sub-circuits, sum and carry. So, in each full adder two source transistors are required to output the two signals, sum and carry.
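    The logic function of the sum and carry sub-circuits, and the way eight such cells chain into the ripple adder of Figure 5, can be sketched behaviourally as follows. This is only a bit-level model of the Boolean function; it says nothing about the adiabatic transistor-level implementation, and the function names are hypothetical.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out) for single-bit inputs."""
    s = a ^ b ^ cin                      # sum sub-circuit
    cout = (a & b) | (cin & (a ^ b))     # carry sub-circuit
    return s, cout


def ripple_add(x, y, cin=0, width=8):
    """Chain `width` full adders; the carry ripples from bit 0 upwards."""
    carry = cin
    result = 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry


total, carry_out = ripple_add(0b10111001, 0b01001010)
print(f"sum = {total:08b}, carry out = {carry_out}")
```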

    Figure 4 Adiabatic Full Adder

  • 9 | P a g e

    Figure 5 8bit Ripple Adder

    2.1.2 Partial Product Generator

    This block generates the 8-bit partial products (PP). An adiabatic AND gate is used to save power. The circuit diagram of the adiabatic NAND-AND gate is shown in Figure 6. Initially, all the partial products are generated from the inputs. However, different groups of partial products are required in different pipeline stages. So, the partial products required in a later stage must be pipelined up to that stage to support parallelism.
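    Partial-product generation amounts to AND-ing every multiplicand bit with every multiplier bit. A minimal behavioural sketch, assuming row i corresponds to multiplier bit i (the function names are hypothetical and the pipelining of rows is not modelled):

```python
def partial_products(a, b, width=8):
    """Return `width` rows; rows[i][j] = a_j AND b_i (bit 0 is the LSB)."""
    rows = []
    for i in range(width):
        b_i = (b >> i) & 1
        rows.append([((a >> j) & 1) & b_i for j in range(width)])
    return rows


def weighted_sum(rows):
    """Add all partial-product bits with their binary weights 2^(i+j)."""
    return sum(bit << (i + j)
               for i, row in enumerate(rows)
               for j, bit in enumerate(row))


rows = partial_products(0b10111001, 0b01001010)
print(weighted_sum(rows) == 0b10111001 * 0b01001010)
```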

  • 10 | P a g e

    Figure 6 Adiabatic Nand-And

    Figure 7 Partial-product generation

  • 11 | P a g e

    2.2 Mousetrap Pipeline

    An asynchronous pipeline style called MOUSETRAP is used for high-speed operation. The pipeline uses standard transparent latches and static logic in its datapath, and small latch controllers consisting of only a single gate per pipeline stage. This simple structure is combined with an efficient and highly concurrent event-driven protocol between adjacent stages.

    Figure 8 3-stage Mousetrap Pipeline

    Three pipeline stages are shown. Each stage consists of a data latch and a latch controller. Adjacent stages communicate with each other using requests (reqs) and acknowledgments (acks). The data latch is a standard level-sensitive D-type transparent latch. The latch is normally transparent (i.e., enabled), allowing new data to pass through quickly.

    A commonly used asynchronous scheme, called bundled data, is used to encode the datapath: a control signal, reqN, indicates the arrival of new data at stage N's inputs. In particular, a simple one-sided timing requirement must be satisfied for correct operation: reqN must arrive after the data inputs to stage N have stabilized. (When logic processing is added to the pipeline, the request signal in each stage is typically delayed by an amount that matches the latency of the associated function block, i.e., by a matched delay.) Once new data has passed through stage N's latch, doneN is produced, which is sent to its latch controller as well as to stages N-1 and N+1.

    The latch controller enables and disables the data latch. It consists of only a single XNOR gate with two inputs: the done signal from the current stage, stage N, and the ack from stage N+1.
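    The handshake described above can be sketched as a small behavioural simulation. It is a simplified model under stated assumptions: stages are evaluated one after another (one possible interleaving of an asynchronous system), gate and matched delays are ignored, and the names run_mousetrap and sink_every are hypothetical. Each stage is enabled when its own done equals the ack from the next stage (the XNOR), and it captures data when it is enabled and the incoming request differs from its done.

```python
def xnor(a, b):
    """Latch controller: 1 when both phase signals are equal."""
    return int(a == b)


def run_mousetrap(items, n_stages=3, sink_every=1):
    """Push `items` through an n-stage two-phase pipeline; return what the sink receives."""
    done = [0] * n_stages      # done_N of each stage (also req to N+1 and ack to N-1)
    data = [None] * n_stages   # contents of each data latch
    req_in, ack_out = 0, 0     # left (source) and right (sink) environment signals
    data_in = None
    pending = list(items)
    received = []
    t = 0
    while len(received) < len(items):
        # Source: issue a new request only after stage 0 acknowledged the last one.
        if pending and req_in == done[0]:
            data_in = pending.pop(0)
            req_in ^= 1        # two-phase: every toggle is an event
        for n in range(n_stages):
            req = req_in if n == 0 else done[n - 1]
            d = data_in if n == 0 else data[n - 1]
            ack = ack_out if n == n_stages - 1 else done[n + 1]
            enable = xnor(done[n], ack)        # transparent while enable == 1
            if enable and req != done[n]:      # new data arrives at a transparent latch
                data[n] = d
                done[n] = req                  # latch closes until stage N+1 acknowledges
        # Sink: optionally slow, consumes only on some steps.
        if t % sink_every == 0 and done[-1] != ack_out:
            received.append(data[-1])
            ack_out = done[-1]
        t += 1
    return received


print(run_mousetrap([10, 20, 30]))
print(run_mousetrap(list(range(6)), n_stages=4, sink_every=3))
```

    With a slow sink the upstream latches stay opaque until the downstream stage has taken their data, so items are neither lost nor reordered.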

  • 12 | P a g e

    Figure 9 Single stage of Mousetrap

    An alternate view of the basic pipeline is shown in Fig. 9. The latch inside a stage is shown

    separated into two parts: 1) a single bit latch that receives the incoming request reqN and

    produces doneN and the outgoing request reqN+1 and 2) the remainder of the latch which

    captures the data bits. In this representation, the bit latch and the XNOR together form the entire

    control circuit that generates and receives the handshake signals from the neighboring pipeline

    stages on the left and the right, and also produces the latch enable signal EN , which is internal to

    the stage, for controlling the latching action on the datapath.

    Figure 10 Mousetrap Pipeline with logic block

  • 13 | P a g e

    2.2.1 Xnor Gate

    Figure 11 Xnor-Xor

    A  B  Xnor  Xor
    0  0  1     0
    0  1  0     1
    1  0  0     1
    1  1  1     0

    2.2.2 Latch

    We have implemented a D-latch using transmission gates. When the enable signal coming from the XNOR gate is high, the latch is transparent and the data is passed through. When that signal is low, the latch is opaque and the data is held.

  • 14 | P a g e

    Figure 12 D latch

    Figure 13 16-bit latch

  • 15 | P a g e

    3. Multiplier

    A naive organization for adding the partial products together in an array multiplier is to accumulate the partial products in each row and pass the accumulated partial products (using carry-save adders) to the next row. This makes the total delay through the carry-save adders proportional to N, the number of bits in the data words (there are N rows, or N/2 rows if Booth recoding is used). This delay is too long if we use a fast adder for the last row, which has a delay of about log N.

    The easiest way to reduce the CSA delay is to use a tree structure instead of the linearly connected naive design. In general, we want an efficient way to add N (or N/2) partial products.
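    The row-by-row carry-save accumulation can be modelled at word level as below. This is a sketch of the linearly connected organization, with one 3:2 compression per partial-product row and a single carry-propagating addition at the end; it does not model the tree variant or the pipelining, and the function names are hypothetical.

```python
def carry_save_add(x, y, z):
    """3:2 compressor on whole words: x + y + z == sum_word + carry_word."""
    sum_word = x ^ y ^ z
    carry_word = ((x & y) | (x & z) | (y & z)) << 1   # carries move one bit left
    return sum_word, carry_word


def carry_save_multiply(a, b, width=8):
    """Accumulate shifted partial products in carry-save form, then add once."""
    sum_word, carry_word = 0, 0
    for i in range(width):
        partial = (a << i) if (b >> i) & 1 else 0      # row i of partial products
        sum_word, carry_word = carry_save_add(sum_word, carry_word, partial)
    # The only carry-propagating addition (the final adder row).
    return (sum_word + carry_word) & ((1 << (2 * width)) - 1)


print(f"{carry_save_multiply(0b10111001, 0b01001010):016b}")
```

    No carry ripples inside the loop, which is why the per-row delay is a single full-adder delay regardless of word width.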

    Figure 14 Carry Save Multiplier Block

  • 16 | P a g e

    Figure 15 Single Stage of Carry Save Multiplier

    Figure 16 8x8 Multiplier

  • 17 | P a g e

    4. Adder

    The adder is the fundamental block in any arithmetic unit, and is often the speed-limiting circuit in a digital system. Hence, many parallel adder architectures have been proposed to increase speed while keeping area and power dissipation reasonable. One of the fastest and most efficient architectures in terms of area and power dissipation is the Carry Bypass Adder (CBA).

    An N-bit CBA is made up of N full adders, which are grouped into blocks whose size (i.e., the number of full adders per block) has to be chosen properly to minimize the computation time. The CBA architecture can be derived from that of a simple Ripple Carry Adder by observing that, when several contiguous full adders are in propagate mode (i.e., each of them has a carry output equal to its carry input), they can be bypassed to evaluate the carry output of the last one, since it is equal to the carry input of the first. Hence, in a CBA the full adders are divided into groups, and each group is bypassed by a multiplexer if all of its full adders are in propagate mode.
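    A behavioural sketch of the bypass mechanism is given below. The block size of 4 is an assumption made for illustration (the block size used in this design is not stated), and the function name is hypothetical. Functionally the result equals that of a ripple adder; the multiplexer only shortens the path the carry has to travel when a whole block is in propagate mode.

```python
def carry_bypass_add(x, y, cin=0, width=8, block=4):
    """Carry bypass adder model: returns (sum, carry_out)."""
    result = 0
    carry = cin
    for lo in range(0, width, block):
        block_cin = carry
        all_propagate = True
        for i in range(lo, min(lo + block, width)):
            a, b = (x >> i) & 1, (y >> i) & 1
            p = a ^ b                                # propagate signal of bit i
            all_propagate = all_propagate and p == 1
            result |= (p ^ carry) << i               # sum bit
            carry = (a & b) | (p & carry)            # ripple carry inside the block
        # Bypass multiplexer: if every bit propagates, forward the block's
        # carry-in directly instead of waiting for the rippled carry.
        if all_propagate:
            carry = block_cin
    return result, carry


print(carry_bypass_add(0b00001111, 0b11110000, cin=1))
```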

    Figure 17 Carry bypass

  • 18 | P a g e

    Figure 18 Carry bypass Single stage

    Figure 19 Carry bypass Schematic

  • 19 | P a g e

    5. MAC

    The MAC schematic has been built from the asynchronous MOUSETRAP pipelined multiplier and the carry bypass adder. A latch is also used to hold the previous result, so that the next multiplier output can be added to it.

    Figure 20 MAC

  • 20 | P a g e

    Figure 21 MAC: Multiplier

    Figure 22 MAC: Accumulator

  • 21 | P a g e

    6. Simulation Results

    Initial MAC output: 2b00000000 00000000

    Dataset  Multiplier input 1  Multiplier input 2  Multiplier output    MAC output
    1        2b10111001          2b01001010          2b00110101 01111010  2b00110101 01111010
    2        2b10111001          2b11001010          2b10010001 11111010  2b11000111 01110100
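    The tabulated outputs can be cross-checked against the MAC definition A ← A + (B × C) with a few lines of arithmetic. The sketch assumes a 16-bit accumulator that wraps on overflow, which is inferred from the 16-bit outputs shown and is not stated in the report; the function name mac is hypothetical.

```python
def mac(acc, b, c, acc_width=16):
    """One multiply-accumulate step with an assumed wrap-around accumulator."""
    return (acc + b * c) & ((1 << acc_width) - 1)


# (input 1, input 2, multiplier output, MAC output) copied from the table,
# with the space between the two output bytes removed.
datasets = [
    ("10111001", "01001010", "0011010101111010", "0011010101111010"),
    ("10111001", "11001010", "1001000111111010", "1100011101110100"),
]

acc = 0  # initial MAC output
for in1, in2, product_bits, mac_bits in datasets:
    b, c = int(in1, 2), int(in2, 2)
    assert b * c == int(product_bits, 2), "multiplier output mismatch"
    acc = mac(acc, b, c)
    assert acc == int(mac_bits, 2), "MAC output mismatch"
    print(f"{b} x {c} = {b * c}, accumulator = {acc}")
```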

    6.1 Statistics

    No. of MOS devices (bsim3v3) : 8017

    No. of equations : 22290

    Simulation time : 14 min 31.2 sec

    Power : 10.61 mW

    Latency : 47.8 ns

    Throughput (time per result) : 16 ns

    Max frequency of pipeline operation : 62.5 MHz
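    The maximum frequency follows directly from the 16 ns cycle time. The short calculation below also derives an energy per result, which is not a figure from the report: it assumes that the 10.61 mW value is the average power while the pipeline runs at full throughput.

```python
CYCLE_TIME_S = 16e-9   # time per result, from the statistics above
POWER_W = 10.61e-3     # reported power

max_freq_hz = 1.0 / CYCLE_TIME_S
# Derived quantity under the stated assumption, not a reported result.
energy_per_result_j = POWER_W * CYCLE_TIME_S

print(f"max frequency     : {max_freq_hz / 1e6:.1f} MHz")
print(f"energy per result : {energy_per_result_j * 1e12:.2f} pJ")
```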

  • 22 | P a g e

    6.2 Multiplier Simulation Output:

    Figure 23 Simulation Results: 1

  • 23 | P a g e

    Figure 24 Simulation Results: 2

    Figure 25 Simulation Results: 3

  • 24 | P a g e

    6.3 MAC Simulation Output

    Figure 26 Simulation Results: 4

    Figure 27 Simulation Results: 5

  • 25 | P a g e

    Figure 28 Simulation Results: 6

    Figure 29 Simulation Results: 7

  • 26 | P a g e

    7. Conclusion

    An adiabatic two-phase bundled-data 8x8 Multiplier-Accumulator Unit has been realized in UMC 180 nm technology. The multiplication is performed by an asynchronous MOUSETRAP pipeline and is split into 7 pipeline stages. The pipeline is an event-driven design style, with XNOR elements managing the events on the bundled-data interface. Event-driven interface protocols permit old components to be replaced by new ones with improved throughput, latency, or cost characteristics. Because the handshake used here automatically accommodates delays in delivering or using data, such replacements can be made with assurance that the system will still operate properly. The pipelined design with an asynchronous control transfer scheme has made increased throughput possible, with good utilization of the underlying processing hardware. The adiabatic modules have resulted in low power consumption, with the final unit assessed at 10.61 mW. Carry-save multipliers have been used to reduce multiplier latency, and a carry-bypass accumulator has been used as the final stage. The maximum frequency of pipeline operation is 62.5 MHz, as reported in Section 6.1, and the design uses 8017 MOS devices.

  • 27 | P a g e

    8. References

    [1] D. Suvakovic and C. A. T. Salama, Energy efficient adiabatic multiplier-accumulator design, Journal of VLSI Signal Processing, Vol. 33, pp. 83-103, 2003.

    [2] M. Singh and S. M. Nowick, MOUSETRAP: High-Speed Transition-Signaling Asynchronous Pipelines, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 15, No. 6, Jun. 2007.

    [3] N. Anuar, Y. Takahashi and T. Sekine, 4-bit ripple carry adder using two-phase clocked adiabatic static CMOS logic, Proc. IEEE TENCON 2009.

    [4] M. Alioto and G. Palumbo, Performance evaluation of adiabatic gates, IEEE Transactions on Circuits and Systems, Vol. 47, Issue 9, Sep. 2000, pp. 1297-1308.

    [5] N. Anuar, Y. Takahashi and T. Sekine, Adiabatic logic versus CMOS for low power applications, Proc. ITC-CSCC 2009, pp. 302-305, Jul. 2009.

    [6] V. I. Staroselskii, Adiabatic logic circuits: A review, Russian Microelectronics, Vol. 31, Issue 1, 2002, pp. 37-58.

    [7] N. Anuar, Y. Takahashi and T. Sekine, Fundamental logics based on two phase clocked adiabatic static CMOS logic, Proc. IEEE ICECS 2009, pp. 503-506.

    [8] J. M. Rabaey et al., Digital Integrated Circuits: A Design Perspective, 2nd edition, PHI.

    [9] J. Sparsø and S. Furber, Principles of Asynchronous Circuit Design: A Systems Perspective, Kluwer Academic Publishers.