2009 IEEE International Advance Computing Conference (IACC 2009), Patiala, India, 6-7 March 2009

Power Aware Design and Implementation of 8-bit Asynchronous Arithmetic and Logic Unit

Hardeep Singh, DIT (Department of Information Technology) Sponsored Project Engineer, ECE Department, Thapar University, Patiala, Punjab, India

E-mail: hardy_gem@yahoo.co.in, hardeep.gem@gmail.com

ABSTRACT

The last fifteen years have witnessed a resurgence of interest in asynchronous digital design techniques, as they promise to liberate VLSI systems from clock skew problems, offer the potential for low power and high performance, and encourage a modular design philosophy which makes incremental technological migration much easier. One of the main reasons for using asynchronous design is that it offers the opportunity to exploit the data-dependent latency of many operations in order to achieve low power, high performance, or low area. This paper describes a novel power aware 8-bit asynchronous Arithmetic and Logic Unit (ALU). The designed ALU is targeted for low power. The 8-bit asynchronous ALU has been designed entirely using the tool named Balsa, which is an advanced asynchronous hardware description language and synthesis tool developed by the University of Manchester, UK.

Keywords

Asynchronous logic, Balsa, AMULET (Asynchronous Microprocessor Utilizing Low Energy Techniques), low power, ALU, Power Validation, XPower, Xilinx and benchmark

1 INTRODUCTION

A digital system is typically designed as a collection of subsystems, each performing a different computation and communicating with its peers to exchange information. Before a communication transaction takes place, the subsystems involved need to synchronise, namely to wait for a common control state to be reached, which guarantees the validity of the data exchanged.

The predominant synchronization technique in hardware design today is the utilisation of a global clock whose transitions define the points in time when communication transactions can take place. This synchronous approach, however, has reached a critical point, with clock distribution becoming an increasingly costly and complicated issue.

Thus, the last decade has witnessed an explosion of interest in asynchronous design techniques, which do not rely on global clocks but achieve synchronization by means of localized synchronization protocols between the communicating subsystems. These protocols are typically in the form of local request and acknowledge signals, which provide information regarding the validity of data signals. An example of such a protocol is the four-phase bundled data handshake synchronisation protocol illustrated in figure 1.

Figure 1. The Four-Phase Bundled Data Interface Protocol

Other potential advantages of asynchronous logic are low power consumption, high performance and support for a modular design philosophy, which makes incremental technological migration a much easier task. Several asynchronous design techniques have been developed [1] and are progressively finding their place in mainstream VLSI design, not least in the development of GALS (Globally Asynchronous Locally Synchronous) systems.

A number of asynchronous processors have been developed, including NSR and Fred at the University of Utah, STRiP at Stanford University, Sun's Counterflow Pipeline Processor, FAM and TITAC at Tokyo University and Institute of Technology respectively, Hades at the University of Hertfordshire, Sharp's Data-Driven Media Processor, Caltech's processors and Lutonium, and the series of asynchronous implementations of the ARM RISC processor (AMULET1, AMULET2e, AMULET3 and SPA) [1][2] developed by the AMULET group at the University of Manchester.
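The four-phase bundled data handshake referred to in the introduction can be illustrated with a small behavioural model. The sketch below is an illustration only and is not taken from the paper or from Balsa: the class name `BundledDataChannel` and its fields are hypothetical, and it assumes the usual return-to-zero ordering (request up, acknowledge up, request down, acknowledge down) with the data held valid while the request is high.

```python
from dataclasses import dataclass, field


@dataclass
class BundledDataChannel:
    """Hypothetical model of one four-phase (return-to-zero) bundled-data channel."""
    req: int = 0
    ack: int = 0
    data: int = 0
    trace: list = field(default_factory=list)

    def _log(self, phase):
        # Record the phase name with the request/acknowledge levels after the event.
        self.trace.append((phase, self.req, self.ack))

    def transfer(self, value):
        """Move one value from sender to receiver using four signalling events."""
        # Phase 1: sender places valid data on the bundle, then raises request.
        self.data = value
        self.req = 1
        self._log("request")
        # Phase 2: receiver latches the data, then raises acknowledge.
        received = self.data
        self.ack = 1
        self._log("acknowledge")
        # Phase 3: sender sees the acknowledge and withdraws the request.
        self.req = 0
        self._log("request done")
        # Phase 4: receiver lowers acknowledge; the channel is idle again.
        self.ack = 0
        self._log("acknowledge done")
        return received


channel = BundledDataChannel()
result = channel.transfer(0x2A)
```

Both control wires return to zero after every transfer, which is why one data item costs four signalling events in this protocol.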
978-1-4244-2928-8/09/$25.00 © 2009 IEEE



Figure 2. Balsa Design Flow (flow diagram; recoverable labels include balsa-md, balsa-mgr, balsa-c, Breeze, breeze-cost, balsa-netlist, LARD test harness, behavioural simulation, Compass, EDIF and Verilog netlists, Cadence DB, Silicon Ensemble, Xilinx bitstream, layout, capacitance and timing extraction, and simulation results)

2 Modelling and Simulation of asynchronous hardware

The need to deal with the ever increasing size and complexity of computer system designs, together with quality assurance, reliability and relentless time-to-market pressures, has assigned a crucial role to modelling and simulation in the computer and telecommunication industries. Modelling and simulation are essential tools for measuring the performance as well as validating the timing behaviour and functional correctness of alternative architectural designs. Several simulation modelling languages and tools for synchronous logic design have been developed [3][4][5][6] and have underpinned the development of ever more complex synchronous VLSI circuits.

In the case of asynchronous systems, the role of simulation is even more crucial, as their concurrent, non-deterministic behaviour makes any attempt to reason about their correctness and performance a very complicated task. This complexity renders modelling and simulation essential tools in the endeavor to gain an insight and understanding of the behaviour of asynchronous systems.

However, although synchronous languages and tools can and indeed have been used for asynchronous hardware too, hardware description languages are not suitable for describing concurrent non-deterministic asynchronous behaviour. Thus, the recent interest in asynchronous design has fuelled an intensive research activity aiming to develop techniques appropriate for modelling and designing asynchronous systems.

3 Balsa

Balsa (figure 2) [1][8][9][10][11] is both an asynchronous hardware synthesis framework and the language for describing such systems. It has been demonstrated by synthesising the DMA controller of Amulet3i as well as SPA, an AMULET core for smartcard applications. Balsa uses CSP-based constructs to express Register Transfer Level design descriptions in terms of channel communications and fine grain concurrent and sequential process decomposition. Descriptions of designs (.balsa file) are then translated (balsa-c) into implementations in a syntax directed fashion


with language constructs being mapped into networks of parameterized instances of "handshake components" (.breeze file), each of which has a concrete gate level implementation [12]. A number of tools are available to process the breeze handshake files. balsa-netlist automatically generates CAD native netlist files, which can then be fed into the commercial CAD tools that further synthesize the netlist to the fabricable layout. Three commercial CAD systems are currently supported: Compass Design Automation tools from Avant!, Xilinx Alliance FPGA design tools and Cadence Design Framework II. Balsa uses one-to-one mapping between the language constructs in the specification and the intermediate handshake circuits that are produced. The transparency of compilation makes it relatively easy for an experienced user to trace the incremental changes made at the language level down to the circuit implementation level.

3.1 Balsa Program Compilation Structure: The Balsa program compilation structure consists of the following important files describing the behavior and synthesis of asynchronous hardware using the syntax directed compilation. Figure 3 shows the structure.

[Figure 3 diagram: Balsa branches into Balsa Code, Handshake Graph, Handshake Circuit Cost, Real Time Text Simulation, Real Time Channel Simulation, Handshake Circuit and Event Thread Structure]

Figure 3: BALSA Program Compilation Structure

3.1.1 Balsa Code: The high-level description of the design is entered using BALSA. The BALSA program structure closely resembles the C language.

3.1.2 Handshake Graph: This handshake graph is composed of handshake components, which are elegantly described in the Balsa library.

3.1.3 Handshake Circuit Cost: An asynchronous circuit is composed of handshake components, which are defined in an elegant manner in the Balsa library. The circuit cost gives an estimate of the area taken up by the handshake components making up any asynchronous module. The cost of an asynchronous circuit is dependent particularly on the implementation technology being targeted. Three implementation technologies have been shown in Figure 2.

3.1.4 Real Time Text Simulation: The simulation is done at handshake level, and Real Time Text Simulation shows the simulation results in textual form containing the time and value pair.

3.1.5 Real Time Channel Simulation: Real Time Channel Simulation makes use of a coloring scheme to show the simulation results in channel form. The Red and Green signals indicate the Request and Acknowledge phases, indicating the completion of the handshake and thus the transfer of data.

3.1.6 Handshake Circuit: The handshake circuit represents how the Balsa description gets converted into the interconnection of handshake components making up the module/design. The inputs are requested and written on to the input channels from the test input / harness file. The important feature of this handshake circuit is that it actually shows the circuit handshaking progress through animation. The purpose of this animation is dual: first, it lets us see how the Balsa description given by the designer is converted into a handshake process and second, as we have made use of the 4-phase communication protocol or Return to Zero Signaling convention, this handshake circuit makes use of a coloring scheme to distinguish between the four communication phases. Red color indicates the Request phase, Green color indicates the Acknowledge phase, Blue color indicates the Request Done phase and Gray color indicates the Acknowledge Done phase. As the animation progresses, the behaviour of the handshake circuit is indicated by the change in color.

3.1.7 Event Thread Structure: The event thread structure precisely shows what happens during simulation at the handshake level. In the time line window (see figure 14), the position of the cursor indicates the state of the handshake circuit. The state of the handshake includes the data requested from the environment, the data values present at different internal channels, and the Request, Acknowledge, Request Done and Acknowledge Done phases. Request phase is shown by Red color, Acknowledge phase is shown by Green color, Acknowledge Done phase is indicated by Blue color and Request Done phase is indicated by Grey color. More importantly, this event thread structure can highlight the condition of deadlock. In an asynchronous system the condition of deadlock results when one handshake component waits for the data input from the previous component. Deadlocks are usually the result of an inefficient Balsa description.

4 Arithmetic and Logic Unit

An Arithmetic and Logic Unit (ALU) is a combinational circuit that performs logic and arithmetic micro-operations on a pair of n-bit operands, e.g. A[7:0] and B[7:0]. The operations performed by the ALU are controlled by the set of function select inputs. The 8-bit ALU designed in this paper is shown in Figure 4 and it implements the functions shown by the table in Figure 5.
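The function-select behaviour of the ALU can be sketched as a small behavioural model. This is an illustration only, not the paper's Balsa or VHDL description: the function name `alu` is hypothetical, the select encodings follow the function table in Figure 5, and the treatment of the carry output (bit 8 of the raw result, which is 1 when a subtraction wraps below zero) is an assumption, since the paper does not state it.

```python
MASK = 0xFF  # 8-bit data path


def alu(a, b, sel, c_in=0):
    """Return (out, c_out) for a 4-bit select code, following the Figure 5 table.

    c_out is taken as bit 8 of the raw result. This is a modelling assumption;
    for the subtract rows it is 1 when the result wrapped below zero.
    """
    a &= MASK
    b &= MASK
    c_in &= 1
    raw_results = {
        0b0000: a & b,         # AND
        0b0001: a | b,         # OR
        0b0010: a ^ b,         # XOR
        0b0011: ~a & MASK,     # NOT A
        0b0100: ~b & MASK,     # NOT B
        0b0101: a + b,         # add A and B
        0b0110: a - b,         # subtract B from A
        0b0111: a + 1,         # increment A by 1
        0b1000: b + 1,         # increment B by 1
        0b1001: a + b + 1,     # increment the sum of A and B by 1
        0b1010: a - b + 1,     # increment the difference of A and B by 1
        0b1011: a,             # transfer A
        0b1100: b,             # transfer B
        0b1101: a + c_in,      # add carry in to A
        0b1110: b + c_in,      # add carry in to B
        0b1111: a + b + c_in,  # add carry in to the sum of A and B
    }
    if sel not in raw_results:
        raise ValueError(f"unsupported select code: {sel}")
    raw = raw_results[sel]
    return raw & MASK, (raw >> 8) & 1


sum_out, sum_carry = alu(3, 5, 0b0101)     # plain addition
wrap_out, wrap_carry = alu(253, 8, 0b0101)  # addition that overflows 8 bits
```

With a = 3, b = 5 and the add code the model returns 8 with no carry; with a = 253 and b = 8 the 8-bit result wraps and the carry output is set, which is the overflow situation discussed for the simulations.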


Figure 4: ALU Block Diagram (block labelled "Arithmetic and Logic Unit" with carry in, carry out and select S[3:0] signals)

S3 S2 S1 S0   Function        Operation
0  0  0  0    A·B             AND
0  0  0  1    A + B           OR
0  0  1  0    A ⊕ B           XOR
0  0  1  1    A'              NOT A
0  1  0  0    B'              NOT B
0  1  0  1    A + B           Add A and B
0  1  1  0    A - B           Subtract B from A
0  1  1  1    A + 1           Increment A by 1
1  0  0  0    B + 1           Increment B by 1
1  0  0  1    (A + B) + 1     Increment Sum of A and B by 1
1  0  1  0    (A - B) + 1     Increment the Difference of A and B by 1
1  0  1  1    A               Transfer A
1  1  0  0    B               Transfer B
1  1  0  1    A + Cin         Add Carry in to A
1  1  1  0    B + Cin         Add Carry in to B
1  1  1  1    (A + B) + Cin   Add Carry in to Sum of A and B

Figure 5: Functions of ALU

5. Design Planning and Strategy:

In this paper, two models have been developed. The first model has been developed using a Hardware Description Language such as VHDL. This is a synchronous version of the ALU based on the above architecture. The design of the second model is based on asynchronous design methodology. For the design of the asynchronous ALU, an asynchronous hardware synthesis tool called Balsa is used. Both the designs have been modeled using the same algorithm. As the design is to be implemented in reconfigurable logic, the Xilinx tool is used for it. Since a low power constraint is being targeted, power validation is performed for both designs. For power validation, again an industry standard tool called XPower is used.

6. Power Validation Principle:

The power validation principle is based upon a very well known principle in VLSI: the (dynamic) power consumed by any CMOS circuit is governed by the following equation

$P = C \cdot V^{2} \cdot E \cdot F$

where
P = Power in mW
C = Capacitance in Farads
V = Voltage in volts
E = Switching activity (average number of transitions per clock cycle)
F = Frequency in Hz

The capacitance is defined by the user's design (and a design implemented in a specific device, or the device characteristics of the routing resources and the elements in the device). The voltage V is fixed for a specific device. F.E is the total number of transitions for a specific element, or the activity rate of each signal in the design, and is the most valuable element of the above equation. Each element (LUT, FF etc.) that can switch in a reconfigurable device like an FPGA has a capacitance associated with it. Clock signals and primary input signals are assigned a specific frequency by the user. Synchronous elements are assigned an activity (or toggle) rate relative to their associated clock. The user supplied activity rate is combined with the device specific capacitance, static power and the data to produce a power estimate of the design. The accuracy of the switching activity data is crucial in obtaining an accurate estimation of power consumption.
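The dynamic power relation P = C·V²·E·F given in this section can be evaluated directly. The sketch below is only an illustration of the arithmetic, not of how XPower works internally: the function name and the example values (10 pF, 1.2 V, 50 MHz) are hypothetical and are not measurements from the paper. With C in farads, V in volts and F in hertz the product is in watts, so the sketch multiplies by 1000 to report milliwatts.

```python
def dynamic_power_mw(capacitance_f, voltage_v, activity, frequency_hz):
    """Dynamic power P = C * V^2 * E * F, converted from watts to milliwatts."""
    watts = capacitance_f * voltage_v ** 2 * activity * frequency_hz
    return watts * 1e3


# Hypothetical element: 10 pF switched capacitance, 1.2 V supply, 50 MHz clock.
busy = dynamic_power_mw(10e-12, 1.2, 0.25, 50e6)    # 0.25 transitions per cycle
quiet = dynamic_power_mw(10e-12, 1.2, 0.125, 50e6)  # half the switching activity
```

Because the relation is linear in E, halving the switching activity halves the estimated dynamic power, which is why the accuracy of the activity data matters so much for the estimate.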


7 Discussion about asynchronous arithmetic pipeline compilation structure

7.1 Asynchronous ALU Handshake Graph: Figure 6 shows the Asynchronous ALU Handshake Graph. This handshake graph is composed of handshake components, which are elegantly described in the Balsa library. As indicated in the code, there are four inputs, a, b, sel and c_in, and two outputs, namely out and c_out. The input values are requested from the environment and written on to the input channels a, b, seln and c_in; then, after the selected operation has been computed, the result is written on to the output channels out and c_out.

7.2 Asynchronous ALU Handshake Circuit Cost: Figure 7 shows the Asynchronous ALU Handshake Circuit Cost. As explained earlier, an asynchronous circuit is composed of handshake components, which are elegantly described in the Balsa library. The circuit cost gives an estimate of the area taken up by the handshake components making up the Asynchronous ALU, and this cost is dependent particularly on the implementation technology being targeted (see Figure 2). Figure 7 shows the cost of all the 65 handshake components making up the Asynchronous ALU. The handshake components used are named $BrzFork, $BrzFetch, $BrzLoop, $BrzCallMux, $BrzSink, $BrzFalseVariable, $BrzAdapt, etc. For a description of the component nomenclature and terminology used, the reader can refer to [9][10][11].

[Figure 6: Asynchronous ALU Handshake Graph]

[Figure 7: Asynchronous ALU Handshake Circuit Cost]


7.3 Asynchronous ALU Real Time Text Simulation: Figure 8 shows the real time text simulation results. At simulation time t = 813 ns, the input channels a, b, seln and c_in are written with values 3, 5, 0 and 0 respectively. As per the Balsa code description, the operation selected is 2, i.e. Seln = 2. Referring to the table in Figure 5, the ALU will perform the bitwise AND operation on the inputs a and b. The result of this operation is written to the output. We can see that at t = 10720 ns, the output channels out and c_out are reading 5 and 0. Similarly, it is evident that at t = 36213 ns, the input channels c_in, seln, b and a are written with data values 0, 5, 8 and 253; in this case the selected ALU operation is Seln = 5. The ALU will perform addition on the inputs a and b. The addition of inputs a and b results in an overflow, and this overflow is indicated by c_out at t = 46120 ns, i.e. channel c_out is reading the overflow bit. Some other cases of overflow could be verified at t = 69720 ns and t = 8150 ns.

7.4 Asynchronous ALU Real Time Channel Simulation: Figure 9 shows the Asynchronous ALU Real Time Channel Simulation. The Red and Green signals indicate the Request and Acknowledge phases, reflecting the completion of the handshake and thus the transfer of data. On completion of every handshake a new value appears on the input channel. The behavior indicated in the Asynchronous ALU Real Time Text Simulation is exactly depicted.

Figure 8: Asynchronous ALU Real Time Text Simulation

Figure 9: Asynchronous Real Time Channel Simulation


7.5 Asynchronous ALU Handshake Circuit: Figure 10 shows the handshake circuit of the Asynchronous ALU. The input data is requested and is written on to the input channels a, b, sel and c_in from the environment or test harness file. The Breeze-Sim controller controls this animation. This animation shows how the handshake components communicate with each other using the 4-phase or Return to Zero Signaling convention. This handshake circuit makes use of a coloring scheme to distinguish between the four communication phases. Red color indicates the Request phase, Green color indicates the Acknowledge phase, Blue color indicates the Request Done phase and Gray color indicates the Acknowledge Done phase. As the animation progresses, the behavior of the handshake circuit is indicated by the change in color: with each change in color a handshake is performed, and on completion of every handshake the old data value is removed from the channel and the corresponding phases also change, as indicated by the change of color during the progress of the animation.

7.6 Asynchronous ALU Structure: Figure 11 shows the event thread structure of the Asynchronous ALU. This event thread structure precisely shows what happens during simulation at the handshake level. In the time line behavior window above, the position of the cursor indicates the state of the handshake circuit. The state of the handshake includes the data requested from the environment, the data values present at different internal channels, and the Request, Acknowledge, Request Done and Acknowledge Done phases. Request phase is shown by Red color, Acknowledge phase is shown by Green color, Acknowledge Done phase is indicated by Blue color and Request Done phase is indicated by Grey color. More importantly, this event thread structure can highlight the condition of deadlock.

In asynchronous systems, the condition of deadlock results when one handshake component waits for the data input from the previous component. Deadlocks are usually the result of an inefficient description of asynchronous hardware and can be avoided by carefully writing the BALSA description.

Figure 10: Asynchronous ALU Handshake Circuit

Figure 11: Asynchronous ALU Event Thread Structure Simulation


Figure 12: Test Bench for Synchronous ALU

8. Synchronous Model Development: As per the discussion above, the idea is to develop a synchronous version of the [...] was used to design the asynchronous counterpart. This ALU is designed using the Xilinx ISE 7.1i tool. Figure 12 shows the test bench used for validating the design functionality. It can be seen that at t = 70 ns, the data on the inputs, namely C_i [...] 2 indicates the bitwise XOR operation on the inputs a and [...] simulation. Another case could be taken: at t = [...] ns the operation selected is 5 and, as per the table in Figure 5, the ALU will perform addition on both the inputs a and b, and this is evident from the result, which is 146. Similarly, we see that at t = [...] increment input a by 1 and this is confirmed with the result, [...]

[...] indicated by Figure 2, the Verilog netlist produced by Balsa is given to the Xilinx tool. The synthesized circuit is shown in Figure 13a and Figure 13b. As can be seen in Figure 13a, the structure of the synthesized circuit is very much different from the conventional synchronous design. The synthesized asynchronous ALU has an additional number of inputs and outputs in contrast to the synchronous design.

Figure 13a: Asynchronous ALU Synthesized Circuit

Figure 13b: Asynchronous ALU Synthesized Circuit


In addition to the data inputs, such as a_0d(7:0), b_0d(7:0) and sel_0d(3:0), and the data outputs, out_0d(7:0) and carry_out_0d, there are other inputs and outputs, and these are the various request and acknowledge signals associated with all the data input and output signals. For example, for input a_0d(7:0), there is a request signal named a_0r and an acknowledge signal named a_0a. Similarly, all the data signals have request and acknowledge signals associated with them. The circuit is activated by the signal called activator_0r. An event on this signal makes the other inputs, such as a_0r, b_0r and sel_0r, send a request to the environment for transfer of data. On completion of the handshake the data is transferred to the circuit and the asynchronous computation starts. After completion the result is written to the output port, and again the data is transferred to the output when handshaking is complete.

10. Gate level Design Verification of Asynchronous ALU

Refer to Figure 14. It can be seen that at t = 140 ns, the data values at the various inputs are as follows: carry_in_0d = 0, a_0d = 3, b_0d = 5 and seln_0d = 5. With seln_0d = 5, as per the table in Figure 5, the ALU will perform addition on both the inputs, and it is evident that at t = 280 ns the result, which is 8, i.e. out_0d = 8, is indicated in the simulation. At t = 280 ns, seln_0d = 7 and, as per the table in Figure 5, the ALU will perform addition of carry_in_0d with input a_0d; the result is indicated at t = 420 ns, when out_0d = 4. At t = 700 ns, seln_0d = 8 and this indicates to the ALU to increment the input a_0d by 1, and it is clearly evident at t = 700 ns when the output out_0d = 15. Similarly, at t = 840 ns, seln_0d = 13 and this operation indicates to the ALU to increment the sum of inputs a_0d and b_0d by 1, and this is confirmed by the result at t = 840 ns when the output out_0d = 27. All the cases have been confirmed by this test bench.

Figure 14: Test Bench for Asynchronous ALU

11. Design Implementation: After all design timing and power validation, the asynchronous model of the ALU has been successfully implemented on a Spartan 3E FPGA board from Xilinx. The utilization summary shows more usage of hardware resources. This was expected, because the asynchronous logic makes use of various handshake components and these components take up extra space and so require additional resources.

Figure 15 shows the simulation results after the design is placed and routed. This simulation confirms that when the design is programmed into a reconfigurable logic device such as the Spartan 3E FPGA, it will perform as expected.

Figure 15: Final Simulation after Placing and Routing the Design

12 Benchmark Process: A comparative analysis is done in this phase. The comparison is done with an equivalent synchronous design developed on the same guidelines which were used in designing the Asynchronous Power Aware ALU. The comparison is done for the power dissipated by both the designs when doing computation. Both the designs are ready for the Power Benchmark.


12.1 Power Validation Benchmark: Figure 16 and Figure 17 are the power reports generated by the XPower tool from Xilinx Inc. The XPower tool calculates the (dynamic) power based on the switching activity occurring across nets in the synthesized netlist. In a synchronous system, as performance increases the clock frequency increases, which causes the signal activity to increase and, as a result, far increased power dissipation. In contrast, the asynchronous design does not rely on a global clock, so this advantage is present here, and it is evident from the XPower report of the asynchronous design based ALU.

Figure 16: XPower report for Asynchronous ALU

Figure 17: XPower report for Synchronous ALU

13. Conclusion: In this paper a very well known advantage of asynchronous design methodology, i.e. low power consumption, has been successfully demonstrated. The power consumed by both designs was validated using an industry standard tool. As per expectations, the asynchronous design consumes less power in comparison to the synchronous design.

I will always be thankful to Dr. Steve Furber, Dr. Edwards and Dr. Lillian Jannin, all from University of Manchester, England, for the knowledge I have acquired today.


References

[1] Sparsø, J., Furber, S., Principles of Asynchronous Circuit Design - A Systems Perspective, Kluwer Academic Publishers, Hardcover ISBN 0-7923-7613-7, 2001.

[2] Woods, J.V., Day, P., C. E., AMULET1: An Asynchronous ARM Microprocessor, IEEE Transactions on Computers 46 (4) (1997), pp. 385-398.

[3] VERSIFY Release 2.0, http://www.ac.upc.es/ vlsi/versify/

[4] Ykman-Couvreur, C., Lin, B., De Man, H., ASSASSIN: a synthesis system for asynchronous control circuits, Tech. report, IMEC Laboratory, Sep. 1994.

[5] Yun, K.Y., Dill, D.L., Nowick, S.M., Synthesis of 3D Asynchronous State Machines, Proceedings of ICCD'92: VLSI in Computers and Processors, Cambridge, MA, US, pp. 346-350, October 1992.

[6] Fuhrer, R.M., Jha, N.K., C. E., MINIMALIST: An Environment for the Synthesis, Verification and Testability of Burst-Mode Asynchronous Machines, Tech. Report CUCS-020-99, Department of Computer Science, Columbia University, NY, US, July 1999.

[7] Zhang, Q., Theodoropoulos, G., Towards an Asynchronous MIPS R3000 Processor, Proceedings of ACSAC'2003, Japan, Sep. 2003.

[8] The Balsa Asynchronous Synthesis System, http://www.cs.man.ac.uk/apt/projects/Balsa/index.html

[9] Edwards D.A. and Bardsley A. 2001, "Balsa - An Asynchronous Hardware Synthesis System". In Principles of Asynchronous Circuit Design. pp. 153-218. ISBN 0-7923-7613-7.

[10] Bardsley A. 1998, "Balsa: An Asynchronous Circuit Synthesis System". M.Phil. Thesis. The University of Manchester.

[11] Bardsley A. 2000, "Implementing Balsa handshake Circuits". Ph.D. Thesis. The University of Manchester. Bardsley A. and Edwards D.A. 2000, "Synthesising an asynchronous DMA controller with Balsa". In Journal of Systems Architecture, 46. pp. 1309-1319.

[12] Van Berkel, C.H., Kessels, J., C. E., The VLSI Programming Language Tangram and its Translation into Handshake Circuits, Proceedings of EDAC, 1991, pp. 384-389.

[13] Behrooz Parhami. Computer Arithmetic, Algorithms and Hardware Designs. Oxford University Press, 2000.

[14] David Kearney and Neil W. Bergmann. Bundled data asynchronous multipliers with data dependant computation times. In Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pages 186-197. IEEE Computer Society Press, April 1997.

[15] Patterson, D.A., Hennessy, J.L., Computer Organization & Design, second edition, Morgan Kaufmann, 1997.
