AVD Report Dwip Bhavsar Himanshu Verma

A Report

Design of 2-Phase Bundled Data Mousetrap 8x8 Adiabatic

Multiplier-Accumulator Unit

Submitted in partial fulfilment of the requirements of the course

MEL G623: ADVANCED VLSI DESIGN

Submitted To

Dr. Anu Gupta

Submitted By

Dwip Bhavsar 2012H123118P

Himanshu Verma 2012H123120P

Birla Institute of Technology & Science, Pilani

November 2013

Abstract

This work presents the design of an asynchronous 8x8 Multiplier-Accumulator unit in UMC 180 nm technology. An asynchronous methodology is adopted because, in a synchronous design, the overall delay is set by the slowest stage. The design is pipelined to increase throughput, and adiabatic modules are used to reduce power consumption. A two-phase protocol is followed, in which data is latched on both edges of the control (request) signal. The final unit consists of 8017 MOS devices and has a maximum pipeline operating frequency of 62.5 MHz. The power consumption is assessed to be 10.61 mW.

Contents

Abstract

1. Introduction

1.1 Design And Operation Of A Pipelined 8x8 Adiabatic Multiplier Accumulator

1.2 Design And Operation Of Multiplier

1.3 Design And Operation Of Adder

2. Pipelined 8 X 8 Adiabatic Multiplier Accumulator

2.1 Adiabatic Circuits

2.1.1 Adiabatic Full Adder

2.1.2 Partial Product Generator

2.2 Mousetrap Pipeline

2.2.1 Xnor Gate

2.2.2 Latch

3. Multiplier

4. Adder

5. MAC

6. Simulation Results

6.1 Statistics

6.2 Multiplier Simulation Output

6.3 MAC Simulation Output

7. Conclusion

8. References

LIST OF FIGURES

FIGURE 1 CARRY SAVE MULTIPLIER

FIGURE 2 CARRY BYPASS ADDER

FIGURE 3 (A) 2PASCL INVERTER CIRCUIT. (B) WAVEFORMS FROM THE SIMULATION

FIGURE 4 ADIABATIC FULL ADDER

FIGURE 5 8-BIT RIPPLE ADDER

FIGURE 6 ADIABATIC NAND-AND

FIGURE 7 PARTIAL-PRODUCT GENERATION

FIGURE 8 3-STAGE MOUSETRAP PIPELINE

FIGURE 9 SINGLE STAGE OF MOUSETRAP

FIGURE 10 MOUSETRAP PIPELINE WITH LOGIC BLOCK

FIGURE 11 XNOR-XOR

FIGURE 12 D LATCH

FIGURE 13 16-BIT LATCH

FIGURE 14 CARRY SAVE MULTIPLIER BLOCK

FIGURE 15 SINGLE STAGE OF CARRY SAVE MULTIPLIER

FIGURE 16 8X8 MULTIPLIER

FIGURE 17 CARRY BYPASS

FIGURE 18 CARRY BYPASS SINGLE STAGE

FIGURE 19 CARRY BYPASS SCHEMATIC

FIGURE 20 MAC

FIGURE 21 MAC: MULTIPLIER

FIGURE 22 MAC: ACCUMULATOR

FIGURE 23 SIMULATION RESULTS: 1







1. Introduction

1.1 Design And Operation of A Pipelined 8x8 Adiabatic Multiplier Accumulator

In computing, the multiply-accumulate operation is a common step that computes the product of two numbers and adds that product to an accumulator. The hardware unit that performs the operation is known as a multiplier-accumulator (MAC) unit. The operation itself is also often called a MAC or a MAC operation. The operation can be expressed as follows:

$A \leftarrow A + (B \times C)$
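As a minimal behavioural sketch of this operation in Python: the function name `mac` and the 16-bit wrap-around of the accumulator are assumptions made for illustration, since the report does not state how accumulator overflow is handled.

```python
def mac(acc: int, b: int, c: int, width: int = 16) -> int:
    """Return acc + b*c, wrapped to `width` bits (wrap-around is an assumption)."""
    return (acc + b * c) & ((1 << width) - 1)


acc = 0
for b, c in [(3, 4), (5, 6)]:
    acc = mac(acc, b, c)
print(acc)  # 3*4 + 5*6 = 42
```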

With the widespread use of mobile and wireless devices, and with clock and logic speeds rising to meet new performance requirements, energy efficiency has become a key design aspect in the field of integrated circuits (ICs). For digital circuits, which mostly use Complementary Metal-Oxide-Semiconductor (CMOS) technology, voltage scaling is one of the main strategies, since power consumption is proportional to the square of the supply voltage. To maintain high transistor drive current and thus achieve performance improvements, transistor thresholds must be scaled along with the supply voltage. However, scaling the threshold voltage Vt results in a substantial increase in sub-threshold leakage current.
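The quadratic dependence on supply voltage mentioned above is the standard first-order expression for dynamic CMOS power. The symbols used here (activity factor $\alpha$, load capacitance $C_L$, supply voltage $V_{DD}$, switching frequency $f$) are the usual textbook ones and are not defined elsewhere in this report:

```latex
P_{\text{dyn}} = \alpha \, C_L \, V_{DD}^{2} \, f
```

Under this expression, halving $V_{DD}$ at a fixed frequency reduces dynamic power to one quarter.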

Power dissipation in conventional CMOS circuits occurs primarily during device switching. In adiabatic circuits, by contrast, the switching transition is slowed by using a time-varying voltage source instead of a fixed voltage supply. Spreading the charge transfer more evenly over the entire time available greatly reduces the peak current. In the 2PASCL style used here, two MOSFET diodes recycle charge from the output node and improve the discharging speed of internal signal nodes. Using two split-level sinusoidal waveforms with a peak-to-peak voltage of 0.9 V minimizes the voltage difference between the current-carrying electrodes, and consequently the power consumption. A 2PASCL-based NAND circuit can save up to 62% in power at transition frequencies of 10 to 100 MHz.
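The benefit of slowing the charging transition can be illustrated with general first-order estimates: an abrupt charge from a fixed supply dissipates (1/2)CV^2, whereas a linear supply ramp of duration T through an on-resistance R dissipates roughly (RC/T)CV^2 when T is much larger than RC. The sketch below applies these textbook formulas with assumed, purely illustrative component values; it does not model the 2PASCL circuit or its sinusoidal supplies.

```python
def conventional_energy(c_load: float, v: float) -> float:
    """Energy dissipated when a capacitor is charged abruptly from a fixed supply."""
    return 0.5 * c_load * v ** 2


def adiabatic_energy(r_on: float, c_load: float, v: float, t_ramp: float) -> float:
    """First-order dissipation for a linear supply ramp of duration t_ramp >> R*C."""
    return (r_on * c_load / t_ramp) * c_load * v ** 2


# Illustrative values only (assumptions, not taken from the report).
R_ON, C_LOAD, V_SWING, T_RAMP = 1e3, 10e-15, 1.8, 10e-9

e_conv = conventional_energy(C_LOAD, V_SWING)
e_adia = adiabatic_energy(R_ON, C_LOAD, V_SWING, T_RAMP)
print(f"conventional: {e_conv:.3e} J, adiabatic ramp: {e_adia:.3e} J")
```

With these assumed values the ratio of the two energies is 2RC/T, so a longer ramp gives a proportionally smaller dissipation.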

1.2 Design And Operation Of Multiplier

The design of the Multiplier-Accumulator (MAC) is highly modular and is assembled from a small set of basic building blocks. The 8x8 multiplication is distributed over 8 pipeline stages, and a carry-save multiplier is used for fast carry propagation, which reduces multiplier latency. This structure has the advantage that its worst-case critical path is shorter and uniquely defined. The basic layout of the carry-save multiplier is shown in Figure 1. The multiplier is built from the blocks described below.

Figure 1 Carry Save Multiplier

1.3 Design And Operation Of Adder

The adder is the fundamental block in any arithmetic unit and is often the speed-limiting circuit in a digital system. Hence, many parallel adder architectures have been proposed to increase speed with reasonable area and power dissipation. One of the fastest and most efficient architectures in terms of area and power dissipation is the Carry Bypass Adder (CBA).

An N-bit CBA is made up of N full adders grouped into blocks, whose size (i.e., the number of full adders per block) has to be chosen properly to minimize the time needed for a computation. The CBA architecture can be derived from that of a simple Ripple Carry Adder by observing that, when several contiguous full adders are all in propagate mode (i.e., each of them has a carry output equal to its carry input), they can be bypassed to evaluate the carry output of the last one, since it is equal to the carry input of the first. Hence, in a CBA the full adders are divided into groups, each of which is bypassed by a multiplexer when all of its full adders are in propagate mode.

Figure 2 Carry Bypass Adder

2. Pipelined 8 X 8 Adiabatic Multiplier Accumulator

2.1 Adiabatic Circuits

Figure 3 shows the circuit diagram and simulated waveforms of the 2PASCL inverter [10]. The two MOSFET diodes recycle charge from the output node and improve the discharging speed of internal signal nodes. Such a circuit design is particularly advantageous when the signal nodes are preceded by a long chain of switches. Using two split-level sinusoidal waveforms with peak-to-peak voltages of 0.9 V minimizes the voltage difference between the current-carrying electrodes, and consequently suppresses power consumption. The substrates of the pMOS and nMOS transistors are connected to and GND respectively. Because the criterion for maintaining thermal equilibrium is satisfied, namely that the voltage between the current-carrying electrodes is zero when the transistors are in the ON state [4], the energy accumulated in CL is not dissipated. Moreover, sinusoidal waveforms can be generated with higher energy efficiency than trapezoidal waveforms.

Figure 3 (a) 2PASCL inverter circuit. (b) Waveforms from the simulation

In 2PASCL operation, fewer dynamic switching events occur, because circuit nodes do not necessarily charge and discharge in every clock cycle. This reduces node switching activity significantly, and the lower the switching activity, the lower the energy dissipation. A further benefit is that the logic behaves like static logic.

2.1.1 Adiabatic Full Adder

The full adder (FA) is designed using the two-phase clocked adiabatic static CMOS logic (2PASCL) circuit technique. Adiabatic logic is discussed in the introduction. The circuit diagram of the adiabatic full adder is shown in Figure 4. The full adder consists of two sub-circuits, one for the sum and one for the carry, so each full adder requires two source transistors to produce its two output signals.
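At the logic level, the sum and carry sub-circuits compute the usual full-adder functions. The sketch below is a purely behavioural Python model (it says nothing about the adiabatic circuit itself), and it chains full adders in the manner of the 8-bit ripple adder of Figure 5; the function names are illustrative.

```python
def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """Return (sum, carry) for three input bits."""
    s = a ^ b ^ cin
    carry = (a & b) | (b & cin) | (a & cin)
    return s, carry


def ripple_add(x: int, y: int, width: int = 8, cin: int = 0) -> tuple[int, int]:
    """Add two `width`-bit numbers bit by bit; return (sum, carry_out)."""
    result, carry = 0, cin
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry


print(ripple_add(0b10111001, 0b01001010))  # (3, 1): 185 + 74 = 259 = 256 + 3
```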

Figure 4 Adiabatic Full Adder

Figure 5 8bit Ripple Adder

2.1.2 Partial Product Generator

This block generates the 8-bit partial products (PP). Adiabatic AND gates are used to save power; the circuit diagram of the adiabatic NAND-AND gate is shown in Figure 6. Initially, all the partial products are generated from the inputs. However, different groups of partial products are required in different pipeline stages, so the partial products needed by a given stage must be carried through the pipeline up to that stage to support parallelism.
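Partial-product generation can be sketched behaviourally as a grid of AND operations, one row per multiplier bit. The helper names below are illustrative, and the final summation is included only to show that the rows, once weighted, add up to the product.

```python
def partial_products(a: int, b: int, width: int = 8) -> list[list[int]]:
    """Row i holds the bits a[j] AND b[i] for j = 0..width-1 (LSB first)."""
    return [[((a >> j) & 1) & ((b >> i) & 1) for j in range(width)]
            for i in range(width)]


def sum_partial_products(rows: list[list[int]]) -> int:
    """Weight bit j of row i by 2**(i + j) and add everything up."""
    return sum(bit << (i + j)
               for i, row in enumerate(rows)
               for j, bit in enumerate(row))


rows = partial_products(0b10111001, 0b01001010)
print(sum_partial_products(rows))  # 13690 == 185 * 74
```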

Figure 6 Adiabatic Nand-And

Figure 7 Partial-product generation

2.2 Mousetrap Pipeline

MOUSETRAP is an asynchronous pipeline style intended for high-speed applications. The pipeline uses standard transparent latches and static logic in its datapath, and small latch controllers consisting of only a single gate per pipeline stage. This simple structure is combined with an efficient and highly concurrent event-driven protocol between adjacent stages.

Figure 8 3-stage Mousetrap Pipeline

Three pipeline stages are shown. Each stage consists of a data latch and a latch controller. Adjacent stages communicate with each other using requests (reqs) and acknowledgments (acks). The data latch is a standard level-sensitive D-type transparent latch. The latch is normally transparent (i.e., enabled), allowing new data to pass through quickly.

A commonly used asynchronous scheme, called bundled data, is used to encode the datapath: a control signal, reqN, indicates the arrival of new data at stage N's inputs. In particular, a simple one-sided timing requirement must be satisfied for correct operation: reqN must arrive after the data inputs to stage N have stabilized. (When logic processing is added to the pipeline, the request signal in each stage is typically delayed by an amount that matches the latency of the associated function block, i.e., by a matched delay.) Once new data has passed through stage N's latch, doneN is produced, which is sent to the stage's latch controller as well as to stages N-1 and N+1.

The latch controller enables and disables the data latch. It consists of only a single XNOR gate with two inputs: the done signal from the current stage, stage N, and the ack from stage N+1.
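The handshake described above can be sketched as a small zero-delay behavioural model in Python. The class and function names, the idealized source and sink, and the right-to-left update order are all assumptions made for illustration; matched delays and real timing are not modelled.

```python
def xnor(a: int, b: int) -> int:
    return int(a == b)


class Stage:
    """One pipeline stage: a request bit latch plus a data latch."""

    def __init__(self) -> None:
        self.done = 0      # also the request to stage N+1 and the ack to stage N-1
        self.data = None

    def step(self, req_in: int, data_in, ack_in: int) -> bool:
        enable = xnor(self.done, ack_in)       # latch controller (single XNOR)
        if enable and req_in != self.done:     # transparent latch sees a new request
            self.done, self.data = req_in, data_in
            return True                        # done != ack now, so the latch is opaque
        return False


def run_pipeline(items, n_stages: int = 3):
    stages = [Stage() for _ in range(n_stages)]
    pending, received = list(items), []
    src_req, src_data, sink_ack = 0, None, 0
    while len(received) < len(items):
        # Source issues a new item once its previous request has been acknowledged.
        if pending and src_req == stages[0].done:
            src_data, src_req = pending.pop(0), src_req ^ 1
        # Update right to left so each stage reads its left neighbour's data
        # before that neighbour overwrites it.
        for i in reversed(range(n_stages)):
            req = src_req if i == 0 else stages[i - 1].done
            data = src_data if i == 0 else stages[i - 1].data
            ack = sink_ack if i == n_stages - 1 else stages[i + 1].done
            stages[i].step(req, data, ack)
        # Sink consumes a token and acknowledges immediately.
        if stages[-1].done != sink_ack:
            received.append(stages[-1].data)
            sink_ack = stages[-1].done
    return received


print(run_pipeline(["a", "b", "c", "d"]))  # ['a', 'b', 'c', 'd']
```

Because requests and acknowledgments are signalled by transitions rather than levels, each toggle of `done` represents one new data item, which is the two-phase behaviour the pipeline relies on.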

Figure 9 Single stage of Mousetrap

An alternate view of the basic pipeline is shown in Fig. 9. The latch inside a stage is shown

separated into two parts: 1) a single bit latch that receives the incoming request reqN and

produces doneN and the outgoing request reqN+1 and 2) the remainder of the latch which

captures the data bits. In this representation, the bit latch and the XNOR together form the entire

control circuit that generates and receives the handshake signals from the neighboring pipeline

stages on the left and the right, and also produces the latch enable signal EN, which is internal to

the stage, for controlling the latching action on the datapath.

Figure 10 Mousetrap Pipeline with logic block

2.2.1 Xnor Gate

Figure 11 Xnor-Xor

A  B  Xnor  Xor
0  0  1     0
0  1  0     1
1  0  0     1
1  1  1     0

2.2.2 Latch

We have implemented a D-latch using transmission gates. When the enable signal coming from the XNOR gate is high, the latch is transparent and the data passes through. When that signal is low, the latch becomes opaque and the data is held.

Figure 12 D latch

Figure 13 16-bit latch

3. Multiplier

A naive organization for adding the partial products together in an array multiplier is to accumulate the partial products row by row, passing the accumulated result (using carry-save adders) to the next row. This makes the total delay through the carry-save array proportional to N, the number of bits in the data words (there are N rows, or N/2 rows if Booth recoding is used). This delay is disproportionately long when a fast adder, which has delay ~ log N, is used for the last row.

The easiest way to reduce the CSA delay is to use a tree structure instead of the linearly connected naive design. In general, we want an efficient way to add N (or N/2) partial products.
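The linear, row-by-row carry-save organization described above can be sketched behaviourally as follows. This is an illustrative Python model with assumed function names; it mirrors the linear arrangement (one partial-product row absorbed per step, then one final carry-propagating addition), not a tree.

```python
def carry_save_add(x: int, y: int, z: int) -> tuple[int, int]:
    """3:2 compression: x + y + z == sum_bits + carry_bits."""
    sum_bits = x ^ y ^ z
    carry_bits = ((x & y) | (y & z) | (x & z)) << 1
    return sum_bits, carry_bits


def carry_save_multiply(a: int, b: int, width: int = 8) -> int:
    """Accumulate one partial-product row per step in carry-save form."""
    sum_bits, carry_bits = 0, 0
    for i in range(width):
        row = (a << i) if (b >> i) & 1 else 0   # partial product for bit i of b
        sum_bits, carry_bits = carry_save_add(sum_bits, carry_bits, row)
    return sum_bits + carry_bits                # final carry-propagating addition


print(carry_save_multiply(0b10111001, 0b01001010))  # 13690
```

No carry ripples across a row inside the loop; carries are only resolved in the last addition, which is why the choice of final adder matters for overall latency.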

Figure 14 Carry Save Multiplier Block

Figure 15 Single Stage of Carry Save Multiplier

Figure 16 8x8 Multiplier

4. Adder

The adder is the fundamental block in any arithmetic unit and is often the speed-limiting circuit in a digital system. Hence, many parallel adder architectures have been proposed to increase speed with reasonable area and power dissipation. One of the fastest and most efficient architectures in terms of area and power dissipation is the Carry Bypass Adder (CBA).

An N-bit CBA is made up of N full adders grouped into blocks, whose size (i.e., the number of full adders per block) has to be chosen properly to minimize the time needed for a computation. The CBA architecture can be derived from that of a simple Ripple Carry Adder by observing that, when several contiguous full adders are all in propagate mode (i.e., each of them has a carry output equal to its carry input), they can be bypassed to evaluate the carry output of the last one, since it is equal to the carry input of the first. Hence, in a CBA the full adders are divided into groups, each of which is bypassed by a multiplexer when all of its full adders are in propagate mode.
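A behavioural sketch of the bypass idea is given below. The block size of 4 and the function name are assumptions for illustration; the report does not state the block size used in the design.

```python
def carry_bypass_add(x: int, y: int, width: int = 8, block: int = 4, cin: int = 0):
    """Return (sum, carry_out) using blocks of `block` full adders with bypass."""
    result, carry = 0, cin
    for start in range(0, width, block):
        block_cin = carry
        all_propagate = True
        for i in range(start, min(start + block, width)):
            a, b = (x >> i) & 1, (y >> i) & 1
            all_propagate &= bool(a ^ b)             # propagate when a != b
            result |= (a ^ b ^ carry) << i
            carry = (a & b) | (carry & (a ^ b))      # ripple inside the block
        if all_propagate:
            carry = block_cin                        # multiplexer selects the bypass path
    return result, carry


print(carry_bypass_add(0b00001111, 0b11110000, cin=1))  # (0, 1)
```

In this model the bypassed carry always equals the rippled one, so the result is identical to a ripple adder; the benefit of the multiplexer in hardware is purely in timing, which a functional model does not capture.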

Figure 17 Carry bypass

Figure 18 Carry bypass Single stage

Figure 19 Carry bypass Schematic

5. MAC

The MAC schematic combines the asynchronous MOUSETRAP pipelined multiplier with the carry-bypass adder. A latch holds the previous result so that the next multiplier output can be added to it.

Figure 20 MAC

Figure 21 MAC: Multiplier

Figure 22 MAC: Accumulator

6. Simulation Results

Initial MAC Output: 2b00000000 00000000

Dataset  Multiplier input 1  Multiplier input 2  Multiplier output     MAC output
1        2b10111001          2b01001010          2b00110101 01111010   2b00110101 01111010
2        2b10111001          2b11001010          2b10010001 11111010   2b11000111 01110100
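The tabulated values can be cross-checked arithmetically. The short Python check below parses the binary strings from the table and confirms that each multiplier output is the product of its inputs and that each MAC output is the running sum; the 16-bit accumulator mask is an assumption based on the width of the listed outputs.

```python
def bits(text: str) -> int:
    """Parse a binary string that may contain spaces between bytes."""
    return int(text.replace(" ", ""), 2)


datasets = [
    # (multiplier input 1, multiplier input 2, multiplier output, MAC output)
    ("10111001", "01001010", "00110101 01111010", "00110101 01111010"),
    ("10111001", "11001010", "10010001 11111010", "11000111 01110100"),
]

acc = 0  # initial MAC output is all zeros
for in1, in2, product, mac_out in datasets:
    assert bits(in1) * bits(in2) == bits(product)
    acc = (acc + bits(product)) & 0xFFFF   # 16-bit accumulator (assumed width)
    assert acc == bits(mac_out)
print(acc)  # 51060
```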

6.1 Statistics

No. of MOS devices (bsim3v3) : 8017

No. of equations : 22290

Simulation time : 14 min 31.2 s

Power : 10.61 mW

Latency : 47.8 ns

Throughput : one result every 16 ns

Max. frequency of pipeline operation : 62.5 MHz
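Taking the 16 ns figure as the interval between successive results, the maximum pipeline frequency follows directly:

```latex
f_{\max} = \frac{1}{T_{\text{cycle}}} = \frac{1}{16\,\text{ns}} = 62.5\,\text{MHz}
```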

6.2 Multiplier Simulation Output

Figure 23 Simulation Results: 1

6.3 MAC Simulation Output



7. Conclusion

An adiabatic two-phase bundled-data 8x8 Multiplier-Accumulator unit has been realized in UMC 180 nm technology. The multiplication is performed by an asynchronous MOUSETRAP pipeline and is split into 7 pipeline stages. The pipeline follows an event-driven design style, with XNOR elements managing the events on the bundled-data interface. Event-driven interface protocols permit old components to be replaced by new ones with improved throughput, latency, or cost characteristics. Because the handshake used here automatically accommodates delays in delivering or consuming data, such replacements can be made with assurance that the system will still operate properly. The pipelined design, with its asynchronous control transfer scheme, has made increased throughput possible with good utilization of the underlying processing hardware. The adiabatic modules have resulted in low power consumption, with the final unit assessed at 10.61 mW. Carry-save multipliers have been used to reduce multiplier latency, and a carry-bypass

accumulator has been used as the final stage. The maximum frequency of pipeline operation is 62.5 MHz, and the design comprises 8017 MOS devices.

8. References

[1] D. Suvakovic and C. André T. Salama, Energy efficient adiabatic multiplier-accumulator design, Journal of VLSI Signal Processing, Vol. 33, pp. 83-103, 2003.

[2] M. Singh and S. M. Nowick, MOUSETRAP: High-Speed Transition-Signaling Asynchronous Pipelines, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 15, No. 6, June 2007.

[3] N. Anuar, Y. Takahashi and T. Sekine, 4-bit ripple carry adder using two-phase clocked adiabatic static CMOS logic, Proc. IEEE TENCON 2009.

[4] M. Alioto and G. Palumbo, Performance evaluation of adiabatic gates, IEEE Transactions on Circuits and Systems, Vol. 47, Issue 9, Sep. 2000, pp. 1297-1308.

[5] N. Anuar, Y. Takahashi and T. Sekine, Adiabatic logic versus CMOS for low power applications, Proc. ITC-CSCC 2009, pp. 302-305, Jul. 2009.

[6] V. I. Staroselskii, Adiabatic logic circuits: A review, Russian Microelectronics, Vol. 31, Issue 1, 2002, pp. 37-58.

[7] N. Anuar, Y. Takahashi and T. Sekine, Fundamental logics based on two phase clocked adiabatic static CMOS logic, Proc. IEEE ICECS 2009, pp. 503-506.

[8] J. M. Rabaey et al., Digital Integrated Circuits: A Design Perspective, 2nd edition, PHI.

[9] J. Sparsø and S. Furber, Principles of Asynchronous Circuit Design: A Systems Perspective, Kluwer Academic Publishers.
