Tomorrow's Digital Hardware will be Asynchronous and Verified · 2012-06-22


Tomorrow's Digital Hardware will be Asynchronous and Verified*

Alain J. Martin

Department of Computer Science, California Institute of Technology, Pasadena CA 91125, USA

Abstract. Encouraged by the results of almost a decade of research and experimentation, we claim that tomorrow's design methods for digital VLSI will be based on a concurrent programming approach to high-level synthesis, asynchronous techniques, and correctness-preserving program transformations.

1 Introduction

It has become a cliché to say that VLSI has revolutionized electronic design. But the most profound transformations in design methods for digital hardware are still to come.

With chip density quadrupling every two years for the last two decades, the quantitative changes brought about by VLSI could not be ignored. Yet, the qualitative changes in the nature of the artifacts embodied in a digital chip have so far not been fully recognized.

A VLSI chip is a highly concurrent computation for the design of which the traditional methods of automata and switching theory are inadequate [�]. Not only is a chip one of the most complex systems technology can produce, but it is also one of the most fragile. Because of the nature of digital computation, a minute design or fabrication error can render the chip inoperable. Unlike software, integrated circuits are not repairable. The development costs are so high that a delay of a few weeks in the completion of an industrial project may account for the difference between profit and loss. To make matters worse, the same rate of technological change that increases the complexity of the products reduces the development time to the point that a manufacturer hardly has time to bring a new product to market before it becomes obsolete.

In view of the size of the problems, it is clear that criteria including correctness by construction, ease of composition and modification, and robustness to changing or unpredictable physical parameters are going to determine future design methods. The main thesis in this paper is that it is possible to achieve these goals, without sacrificing efficiency, with a method that combines three aspects: a concurrent programming approach to high-level synthesis, asynchronous techniques for digital VLSI, and correctness-preserving program transformations.

Lest the reader immediately dismiss this claim as yet another beautiful theory waiting to be killed by an ugly little fact, let me mention without delay that the views expressed here are strongly supported by almost a decade of research, experiments, and fabricated designs. The results of these experiments, conducted by my research group at Caltech, have been positive beyond our most optimistic expectations.

*This is a revised version of an invited paper published in the proceedings of the IFIP Congress ����, Information Processing ��, Volume I, Elsevier Science Publishers B.V.

The method indeed produces correct and efficient circuits. It has been applied to a series of difficult problems, such as distributed mutual exclusion, arbitration, routing automata, stacks and queues, multipliers, and a suite of components for a complete computer system comprising a pipelined microprocessor, static RAM, and memory management unit. Apart from the memory management unit, which was simulated but not fabricated, all chips have been fabricated in CMOS (some of them also in GaAs). All CMOS chips have been found functional on "first silicon."

The discussion can best be partitioned into several research themes. The first one is the application of concurrent programming techniques to the design of VLSI circuits. The second issue is that of asynchronous techniques for digital circuits. The third issue is that of correctness by construction and program transformations. We will then discuss the expected influence of asynchronous VLSI and high-level synthesis on the architecture of future computing systems. Finally, we will discuss the influence of this approach on our understanding of concurrency and digital computation.

1.1 VLSI Design as Concurrent Computing

A VLSI system is a highly concurrent computation and therefore any approach to VLSI design should be a concurrent computing approach. Also, communication in VLSI is becoming increasingly expensive, compared to switching, as the size of the wires determines both the switching costs and the area of a chip. A concurrent computation model for VLSI should reflect those cost ratios, and a model in which communication is explicit is more appropriate to control the cost of communication. Hence, we opted for a notation based on the notion of concurrent processes communicating by explicit message-passing and assignments to variables.

The program notation that best matches those requirements is C.A.R. Hoare's CSP [�]. I will review how CSP was modified to fit our purposes. More generally, I will try to assess the differences and similarities between programming in VLSI and traditional programming for stored-program computers, and show how these differences are reflected in the notation and programming style. A related question is: To what extent is it possible to capture a VLSI designer's choices toward a solution for a VLSI computation at the level of a CSP program?

1.2 Concurrent Computing and Asynchronous VLSI

A high-level synthesis approach to VLSI design requires finding an interface that provides a good separation of the physical and algorithmic concerns. In synchronous techniques, clocks are used to implement sequencing, and thus knowledge of the duration of each computation step has to be used. Since this knowledge is derived from the physical parameters of the circuit, those techniques are detrimental to the use of high-level methods. Furthermore, with the increasing size of circuits, it becomes more and more difficult, and costly in area, delay, and power consumption, to distribute a clock signal across a chip. Finally, the restrictions attached to wire lengths in order to maintain certain timing properties add extra complication to the already difficult layout problem.

For these reasons, asynchronous techniques, and among them, delay-insensitive techniques, are particularly attractive for high-level VLSI synthesis. A circuit is delay-insensitive when its correct operation is independent of any assumption on delays in operators and wires except that the delays are finite.¹ Obviously, delay-insensitive circuits don't use a clock and are therefore asynchronous. Sequencing is enforced entirely by communication mechanisms.

Hence, the second aspect of the method is the design of asynchronous VLSI circuits using techniques from concurrent computation. Although it has been known for a long time that asynchronous techniques were potentially superior to the standard synchronous ones, they have been largely ignored so far because they were too difficult to master. In particular, no good method was known to avoid the synchronization errors resulting in the malfunctions called hazards. As a consequence, the circuits produced, when correct at all, were both too large and too slow.

The problem of hazards was easy to solve once it was addressed as a concurrent programming one. Knowledge of concurrent computing was essential to raise asynchronous design from an interesting curiosity to the design technique of the future. Conversely, asynchronous VLSI design poses fundamental questions related to the nature of concurrent computing, and challenges conventional approaches to computer architecture.

1.3 Correctness by Program Transformations

The issue of correctness is central to all research in programming methodology. After two decades of intense activity and impressive advances, the impact of the research on the software industry is still disappointing. Perhaps the main reason is that both software technology and, more importantly, software users are just too malleable and forgiving. It is technically possible (although very costly and dangerous) and socially acceptable to debug large pieces of software by intensive testing and modifications. But, as a VLSI designer puts it, "if FORTRAN compilation cost ������ and took �� weeks, significantly more effort would go into the pre-compilation verification of FORTRAN programs" [�]. Because of the drastically different cost structures between hardware and software, I believe that designing systems that are correct by construction will happen in hardware earlier than in software. Tomorrow's digital hardware will be asynchronous and verified.

The approach to designing correct VLSI circuits I advocate is that of program transformations. A circuit is first constructed as a concurrent program. This program is proved to meet the specifications. Often, the program is simple enough to be considered the specification itself. The production of the final circuit is then a matter of applying a series of semantics-preserving transformations. Each transformation replaces a program with a semantically equivalent one, until a version is obtained that can be directly implemented in VLSI. We never need to leave the algorithmic domain for other forms of description, like finite-state machines or state-transition graphs, hence eliminating an important source of errors introduced during the informal translation from one representation to another.

¹We have proved that the class of entirely delay-insensitive circuits is very limited, and that some compromise to delay-insensitivity has to be made in most circuits. See [��] for a discussion of these issues.

Entirely automatic compilation of programs into circuits ("silicon compilation") is possible with our method. We have demonstrated the possibility by writing a compiler [�]. But the resulting circuits are unnecessarily inefficient compared to what can be achieved by what we call "designer-assisted compilation." In such an approach, the designer uses a set of programs that perform the transformations automatically, avoiding the clerical errors that humans excel at. But the designer can choose which transformations to apply by using global knowledge (invariants of the system) that is not available to an automatic compiler. This approach gives excellent results: the circuits obtained are simpler and more efficient than those produced by a "seat-of-the-pants" approach, hence refuting the often accepted "fatality" that formal methods necessarily lead to inefficient designs.

2 An Asynchronous Microprocessor

As an example, I will briefly describe the (quasi) delay-insensitive microprocessor my students and I designed in the fall of ���� [�]. It is the first asynchronous microprocessor ever designed. The chips were found fully functional on "first silicon."

The processor was first specified as a sequential program, which was then transformed into a concurrent program so as to pipeline instruction execution. The circuits were derived from the concurrent program by semantics-preserving program transformation. It took five persons, some of us working part-time, less than five months to complete the design from scratch.

The processor has a ��-bit, RISC-like instruction set. It has sixteen registers, four buses, an ALU, and two adders. Instruction and data memories are separate. The chip size is about ������ transistors. Two versions have been fabricated: one in ��m MOSIS SCMOS, and one in ����m MOSIS SCMOS. With the exception of isochronic forks [�] and the interfaces with the memories, the chips are entirely delay-insensitive.

The ��m version runs at �� MIPS. The ����m version runs at �� MIPS. (These performance figures are based on measurements from sequences of ALU instructions without carry. They do not take advantage of the overlap between ALU and memory instructions.)

We have tested the chips under a wide range of VDD voltage values. At room temperature, the ��m version is functional in a voltage range from �V down to ����V. And it reaches �� MIPS at �V. We have also tested the chips cooled in liquid nitrogen. The ��m version reaches �� MIPS at �V and �� MIPS at ��V. The ����m version reaches �� MIPS at �V. For the ��m version, the power consumption is ���mW at �V and ���mW at �V. For the ����m version, it is ���mW at �V and ���mW at �V.

3 Power Consumption

The relevant measure of performance for microprocessors and other general-purpose devices that I propose is the speed-to-power ratio, for instance expressed in MIPS per Watt. Usual RISC microprocessors deliver less than �� MIPS/W at �V. The fastest RISC microprocessor announced at the moment of writing, the 64-bit DEC Alpha, is supposed to run at ��� MIPS and consume ��W, which amounts to �� MIPS/W at ���V. For our microprocessor, the performances range from �� MIPS/W at �V for the ����m version to ��� MIPS/W at V for the �m version. Even with a � to ratio in word lengths, the discrepancy is worth some attention. The power advantage of asynchronous circuits, which is suggested by this experiment, is still a matter of controversy. In the absence of identical designs implemented as both asynchronous and clocked circuits, it is difficult to compare the power performances of both implementation techniques.

An argument advanced against asynchronous design with respect to power consumption is the use of special data encoding techniques, like for instance dual-rail, which require that a larger number of data wires switch for each data transmission than with usual clocked datapaths. In standard four-phase dual-rail encoding, which is the data transmission scheme used in the microprocessor, for each transmission of an n-bit word, n wires change voltages twice.
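The wire count quoted above can be checked with a small sketch. The Python below is illustrative only: the rail ordering (wire 2i for "bit i is 0", wire 2i+1 for "bit i is 1") and the function names are assumptions made for this example, not details of the microprocessor, and acknowledge wires are not counted.

```python
def dual_rail_encode(word, n):
    """Encode an n-bit word on 2n wires, one pair of rails per bit.

    Assumed layout: wire 2*i is raised when bit i is 0,
    wire 2*i + 1 is raised when bit i is 1.
    """
    wires = [0] * (2 * n)
    for i in range(n):
        bit = (word >> i) & 1
        wires[2 * i + bit] = 1
    return wires


def four_phase_transitions(word, n):
    """Count wire changes for one transmission: neutral -> valid -> neutral."""
    neutral = [0] * (2 * n)
    valid = dual_rail_encode(word, n)
    rising = sum(a != b for a, b in zip(neutral, valid))
    falling = sum(a != b for a, b in zip(valid, neutral))
    return rising, falling
```

Whatever the data value, exactly one rail per bit rises and then falls, so n of the 2n data wires change voltage twice per transmission.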

However, we believe that the following factors contribute to a lower power consumption in asynchronous circuits. First, the absence of clock circuitry removes the main power sink. It has been said that half of the power consumed in the DEC Alpha is dissipated by the clock. Secondly, because of the reactive nature of asynchronous circuits, no power is drawn by a part of a circuit when that part is not used. Thirdly, because of the requirement that signals should change monotonically, no voltage oscillations are allowed, hence eliminating spurious switching of gates.

4 Programming in VLSI

Programming in VLSI requires overcoming prejudices from both hardware and software designers. VLSI designers experience the greatest difficulties with the idea that their circuits can be conceived entirely as programs. Computer scientists designing circuits as programs have the tendency to carry over to VLSI programs the cost assumptions that are valid for stored-program implementations but not for a direct hardware implementation.

The main difference between software programming and VLSI programming is that in VLSI, concurrency is free and sequencing is costly. Concurrency is implemented by mere juxtaposition of circuits. Sequencing requires synchronization. We therefore avoid sequencing as much as possible, and implement it as a restricted form of concurrency.

The program notation we use, called CHP, for Communicating Hardware Processes, is not a hardware description language. It is inspired by C.A.R. Hoare's CSP [�] and E.W. Dijkstra's guarded commands [�]. Compared to CSP, CHP contains both restrictions and extensions. The restrictions are required by the limitations of hardware to boolean logic, and by the impossibility to create resources during execution of the computation. "Dynamic" objects, like certain data types and general recursion, are excluded.

The only basic data type is the boolean. An integer is a collection of booleans. An "integer of length n" is a predefined record type consisting of n boolean components. Any operation on a data type other than boolean is a shorthand notation or function call for the sequence of operations on boolean variables that will implement it.
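To make the "shorthand" reading concrete, here is a minimal Python sketch in which an integer of length n is a list of n booleans and addition is spelled out as boolean operations. The names `to_bits`, `from_bits` and `add`, and the choice of a ripple-carry scheme, are assumptions for illustration; the paper does not prescribe this particular expansion.

```python
def to_bits(value, n):
    """Integer of length n as a list of n booleans, least significant first."""
    return [bool((value >> i) & 1) for i in range(n)]


def from_bits(bits):
    """Inverse of to_bits, used here only to display results."""
    return sum(1 << i for i, b in enumerate(bits) if b)


def add(a, b):
    """Ripple-carry addition written with boolean operations only.

    Returns (sum_bits, carry_out); both operands must have the same length.
    """
    carry = False
    out = []
    for x, y in zip(a, b):
        propagate = x != y              # exclusive or of the operand bits
        out.append(propagate != carry)  # sum bit
        carry = (x and y) or (carry and propagate)
    return out, carry
```

For example, `from_bits(add(to_bits(5, 4), to_bits(6, 4))[0])` evaluates to 11.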

All additions to CSP are motivated by efficiency. Because of the extensive use of concurrency and communication, we have refined the communication mechanism with the probe [��] (which allows complete symmetry between input and output), multiple channels (buses), and other direct manipulations of ports, like the assignment of an input to an output, R!L?.

Also motivated by efficiency are the availability of a restricted form of shared variables, and of both deterministic and non-deterministic choices. A variable can be written by one process and read by several other processes. As we design at several levels of refinement, shared variables are introduced during the course of the transformations, for instance during process decomposition. We have also found cases when shared variables were useful at the communicating processes level.

It is very difficult, if at all possible, to determine at "compile time" which choices require arbitration. Since arbitration is expensive, we introduce two sets of control structures, a deterministic set and a non-deterministic set, and let the programmer explicitly indicate where arbitration is needed.

Justifying the adequacy of this notation for circuit construction would require at least showing a series of convincing examples. Due to space limitations, we refer the reader to the literature, in particular the papers describing the processor. See ���� ���� ���� ����� ����������

5 Program Transformations

CHP is ideally suited for designing the control and synchronization parts of a computation. But usually data manipulation and arithmetic are represented as integer operations in CHP, and therefore an important refinement to the solution consists of replacing the integer operations with their implementations as boolean operations. Rather than going in one step from program to circuit, the designer applies a series of transformations to the original CHP program. At each step, some part of the algorithm is refined and some algebraic transformations can be applied, leading to important optimizations.

The general justification of this approach is that the task of designing a correct VLSI system is much more manageable if one starts with a simple, abstract solution the correctness of which is easy to establish. The solution is then refined by repeated applications of a set of transformations the correctness of which has been established once and for all.

We are using three types of transformations. A CHP to CHP transformation can be applied first to increase concurrency. This transformation is part of the high-level design more than the "compilation." An example is the derivation of the pipelined version of the processor from a sequential version.

A second CHP to CHP transformation, called process decomposition, is used to simplify the structure of the processes. It is a syntax-directed transformation that is applied repeatedly until the structure of each process is either a sequence of communication actions of the form *[C1; C2; ...], or a "reactive process" of the form:

*[[G1 → S1 [] ... [] Gn → Sn]]

The third type of transformations to be applied are the real "compilation" transformations: implementation of communication and arithmetic, and of sequencing.


6 The Object Code: Production Rules

The notation for the object code provides the weakest possible form of control structure and the smallest number of program constructs. In fact, it contains exactly one construct, the production rule (PR), and one control structure, the production rule set.

The production-rule notation is the canonical representation of a digital circuit. It can be decomposed into several equivalent networks of digital operators, depending on the set of building blocks used, or even depending on the technology (e.g., CMOS or GaAs) used, but the production-rule set represents the circuit independently of the chosen physical implementation.

A production rule (PR) is a construct of the form G ↦ S, where S is a simple assignment, i.e., an assignment of the constant true or false to a boolean variable, and G is a boolean expression called the guard of the PR. For example, the NAND-gate with inputs x and y and output z has the production rules:

x ∧ y ↦ z↓

¬x ∨ ¬y ↦ z↑
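The two NAND rules above can be exercised with a short Python sketch. It reads z↑ and z↓ as "assign true to z" and "assign false to z"; the rule encoding as (guard, variable, value) triples and the helper names `settle` and `guards_exclusive` are assumptions made for this illustration.

```python
from itertools import product

# Each production rule is a triple: (guard over the state, variable, value).
NAND_RULES = [
    (lambda s: s["x"] and s["y"], "z", False),             # x AND y      -> z down
    (lambda s: (not s["x"]) or (not s["y"]), "z", True),   # NOT x OR NOT y -> z up
]


def settle(state, rules):
    """Fire rules whose guard holds until no firing changes the state."""
    state = dict(state)
    changed = True
    while changed:
        changed = False
        for guard, var, value in rules:
            if guard(state) and state[var] != value:
                state[var] = value
                changed = True
    return state


def guards_exclusive(rules):
    """Check that the two guards are never true for the same inputs."""
    for x, y in product([False, True], repeat=2):
        s = {"x": x, "y": y, "z": False}
        if all(guard(s) for guard, _, _ in rules):
            return False
    return True
```

With the inputs held fixed, `settle` drives z to the NAND of x and y from either initial value of z, and `guards_exclusive(NAND_RULES)` confirms that the two rules never compete to set z in opposite directions.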

The semantics of a PR are defined only if the PR is stable. A PR G ↦ S is said to be stable in a given computation if, at any point of the computation, G either is false or remains invariantly true until the completion of S. Stability is not guaranteed by the implementation. It has to be enforced by the synthesis procedure.

An execution of the stable PR G ↦ S is an unbounded sequence of firings. A firing of G ↦ S with G true amounts to the execution of S. A firing of G ↦ S with G false amounts to a skip.

A PR set is the concurrent composition of all PRs of the set. The only composition operation on two PR sets is the set union. The implementation of two concurrent processes is the set union of the two PR sets implementing the processes and of the PR sets implementing the channels between the processes, if any. PRs are complementary when they are of the type G1 ↦ x↑ and G2 ↦ x↓. We require that complementary PRs be non-interfering.

Two complementary PRs G1 ↦ x↑ and G2 ↦ x↓ are non-interfering when ¬G1 ∨ ¬G2 holds invariantly. It can be proven that, under the stability of each PR and non-interference among complementary PRs, the concurrent execution of the PRs of a set is equivalent to the following sequential execution:

*[select a PR with a true guard; fire the PR]

where the selection is weakly fair (each PR is selected infinitely often).

Hence, any valid execution of a production-rule set in which non-interference and stability are fulfilled is equivalent to a non-deterministic sequential execution of the production-rule set. This equivalence facilitates the analysis of production-rule sets. It also establishes the connection between our definition of concurrency as set union and the more traditional definition based on interleaving of atomic actions. Observe that our semantic model does not require the notion of atomic actions.


6.1 A Simple Example of Program Transformation

A common form of a process is the so-called "one-place buffer": *[L?a; R!f(a)]. The process receives a parameter a on input port L, and sends the result of the function evaluation f(a) on the output port R. All pipeline stages in the microprocessor are variations on this basic theme. The first transformation is the process decomposition that separates the control part of the process (the part that implements the sequencing) from the datapath (the part that manipulates data). The control part is simply the skeleton process *[L; R].

The next transformation is called "handshaking expansion." It replaces the bare communication actions with an implementation called "handshaking," which is a synchronization protocol using two boolean variables for each port: li and lo for L, and ri and ro for R. The handshaking expansion gives:

*[[¬li]; lo↑; [li]; lo↓; [ri]; ro↑; [¬ri]; ro↓]

In order to perform the next transformation, the "production-rule expansion," we need to introduce a state variable x:

*[[¬li]; lo↑; x↑; [x ∧ li]; lo↓; [ri]; ro↑; x↓; [¬x ∧ ¬ri]; ro↓]

The production-rule expansion is:

¬x ∧ ¬li ∧ ¬ro ↦ lo↑

lo ↦ x↑

x ∧ li ↦ lo↓

x ∧ ¬lo ∧ ri ↦ ro↑

ro ↦ x↓

¬x ∧ ¬ri ↦ ro↓

This production-rule set can be directly implemented in hardware. Observe that none of the binary operators, each represented by a pair of production rules setting and resetting the same variable, is a standard gate. The method is particularly efficient precisely because one is not required to map an implementation onto a particular set of standard gates. It is also a true synthesis method: the circuits are derived by pure symbolic manipulation without any preconception of the result.
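The six-rule set for the buffer control can be run under the sequential semantics of Section 6 (repeatedly select a rule with a true guard and fire it). The Python sketch below is an illustration under stated assumptions: the rule set is the one recovered from the damaged listing (guards ¬x ∧ ¬li ∧ ¬ro, lo, x ∧ li, x ∧ ¬lo ∧ ri, ro, ¬x ∧ ¬ri), all variables start false, random choice stands in for weakly fair selection, and the four environment rules that drive li and ri are hypothetical, added only so the circuit has something to talk to.

```python
import random

# Circuit rules: (guard, variable, value assigned when the rule fires).
BUFFER_RULES = [
    (lambda s: not s["x"] and not s["li"] and not s["ro"], "lo", True),
    (lambda s: s["lo"], "x", True),
    (lambda s: s["x"] and s["li"], "lo", False),
    (lambda s: s["x"] and not s["lo"] and s["ri"], "ro", True),
    (lambda s: s["ro"], "x", False),
    (lambda s: not s["x"] and not s["ri"], "ro", False),
]

# Hypothetical environment (not in the paper): li follows lo, ri is the
# complement of ro, which completes the two four-phase handshakes.
ENV_RULES = [
    (lambda s: s["lo"], "li", True),
    (lambda s: not s["lo"], "li", False),
    (lambda s: not s["ro"], "ri", True),
    (lambda s: s["ro"], "ri", False),
]


def simulate(steps, seed=0):
    """Fire randomly chosen enabled rules; return the circuit's transitions."""
    rng = random.Random(seed)
    state = dict.fromkeys(["li", "lo", "ri", "ro", "x"], False)
    rules = BUFFER_RULES + ENV_RULES
    trace = []
    for _ in range(steps):
        # Keep only firings that would actually change the state.
        enabled = [(v, val) for g, v, val in rules if g(state) and state[v] != val]
        if not enabled:
            break
        var, val = rng.choice(enabled)
        state[var] = val
        if var in ("lo", "ro", "x"):
            trace.append(var + ("+" if val else "-"))
    return trace
```

Whatever interleaving the random choice produces, the circuit's own transitions repeat the order lo+, x+, lo-, ro+, x-, ro-, which is the order prescribed by the handshaking expansion with the state variable x.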

7 Asynchronous vs. Synchronous Designs

The tradeoffs between asynchronous and synchronous designs can be described in terms of the advantages of ignorance versus the advantages of knowledge.

An asynchronous implementation (or more precisely, a delay-insensitive one) ignores all information about timing. Conversely, a synchronous implementation exploits all available knowledge about timing. The advantage of ignorance is that the implementation has to be correct independently of timing, therefore gaining qualities of robustness to variations of physical parameters.


The price paid for ignorance is that the completions of all actions have to be detected (computed) locally. The so-called "completion detection" mechanism is the most costly asynchronous technique. For instance, detecting that a value has been written in a one-bit register requires transistors in CMOS. Hence, an entirely delay-insensitive static RAM in CMOS will have an overhead of transistors per bit compared to an implementation in which delays are (assumed to be) known.

The reward for knowledge is efficiency. If the duration of all actions is known precisely, sequencing of actions can be implemented efficiently with a global clock, since a single clock signal is enough to signify the end of a computation step and the start of the next one. The price for knowledge is paid in several ways. First, knowledge of timing relies on knowledge of the physical parameters of the design, and therefore creates an obligation to comply with the assumed values that limits the robustness to variations of these parameters.

An even higher price paid for knowledge is... doubt. Designers are aware that their knowledge of both the physical properties of the devices and the runtime behavior of the circuits is imperfect. Consequently, they have to lengthen the clock period to take into account an error margin in the evaluation of the duration of a computation step. This error margin is becoming prohibitive as technology advances.

Several sources of errors have to be accounted for. Miniaturization ("scaling down") of the devices introduces more and more variations in the geometry of the devices and their processing, which cause variations in their electrical parameters. But more importantly, the duration of a computation step may vary significantly with the values of the data. For instance, the addition of two integer numbers using a ripple-carry adder varies in time with the length of the carry-chain. The clock period of a synchronous implementation has to be adjusted for the worst case, and thus a synchronous ripple-carry adder takes a time proportional to the number of bits of the operands. On the other hand, an asynchronous ripple-carry adder takes a time on the average proportional to the logarithm of the number of bits [��].
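The gap between worst-case and average carry-chain length can be observed numerically. The Python sketch below is an illustration, not the analysis cited in the text: it assumes uniformly random operands, and it measures a chain as the position where a carry is generated plus the consecutive positions through which it propagates. The function names are hypothetical.

```python
import random


def longest_carry_chain(a, b, n):
    """Longest carry chain when adding two n-bit operands a and b."""
    longest = chain = 0
    for i in range(n):
        x = (a >> i) & 1
        y = (b >> i) & 1
        if x & y:
            chain = 1          # a carry is generated here: a new chain starts
        elif (x ^ y) and chain:
            chain += 1         # the incoming carry propagates one position further
        else:
            chain = 0          # the carry is absorbed, or there was none
        longest = max(longest, chain)
    return longest


def average_longest_chain(n, trials, seed=0):
    """Mean of the longest chain over random operand pairs (fixed seed)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += longest_carry_chain(rng.getrandbits(n), rng.getrandbits(n), n)
    return total / trials
```

A clocked adder must budget for the case `longest_carry_chain(2**n - 1, 1, n)`, which is n, whereas `average_longest_chain(64, 2000)` stays far below 64, in line with the logarithmic average behavior stated above.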

Global knowledge requires global information in the form of a clock signal. Distributing a clock signal across a chip with the requirement that the signal arrive at the different locations on the chip "at the same time" is becoming more and more problematic as the size and the speed of the chips increase. The skew of the clock signal across the chip needs to be absorbed by an added delay in the clock period. But more simply, it will soon become very difficult, if not impossible, to distribute a clock signal across a chip and meet the timing requirements of modern technologies like Gallium Arsenide or superconducting devices.

Let us summarize the trade-offs. In an asynchronous implementation, a time penalty t_a is paid for generating the completion signals ("completion detection") and for the handshaking mechanism in the control. In a synchronous implementation, a time penalty t_s is paid for the inaccuracy of clock signal distribution ("clock skew") and for the variations due to fabrication defects (geometry defect, ion implantation, etc.). We claim that at the moment, t_a and t_s are about equal for CMOS designs.

With these two drawbacks cancelling each other, we are left with the following alternatives. For complex computations with data dependencies, asynchronous design has the advantage of exploiting the best-case delay, whereas synchronous solutions have to adjust to the worst case. For small and "data-regular" designs, synchronous solutions have the advantage of both speed and size.

But in the long run, it is at the level of a complete system design that an asynchronous and concurrent computing approach will win.

8 Asynchronous Systems

The influence of the clock mechanism on the architecture of computing systems is more pervasive than designers realize.

The choice of an instruction set is severely restricted by the requirement that the execution time of an instruction be equal to a multiple of the clock period. One may wonder how relevant the "CISC" vs. "RISC" debate would be if the duration of an instruction execution were entirely flexible. It is quite possible to envision a choice of instruction set in which the most frequently used instructions are very short but yet some instructions are included whose execution time may be long. Such an instruction set would most likely mix the best attributes of both "RISC" and "CISC" types of instructions.

We have learned from the designs of the different versions of the microprocessor that the most significant optimizations are done at the highest level, i.e., at the level of the concurrent computation description of the circuit.

By removing the tight lockstep constraint imposed by the clock upon the concurrent activity inside a chip or an ensemble of chips, the method allows the designer to exploit concurrency to an extent difficult to achieve with synchronous techniques. A whole array of communication and synchronization techniques from the field of concurrent computation and parallel algorithms is available, which opens up completely new architectural possibilities.

It has been said that all large systems that work have evolved from small systems that worked. By providing an interface between components of a system that is independent of the physical properties of the different components, asynchronous techniques make it possible to refine and "evolve" a family of designs without starting each new version from scratch. A small example from the design of the microprocessor may illustrate this point. A week or so before we sent the second version to fabrication, we decided to replace the ALU process with a two-process version. The purpose was to pipeline the execution of an ALU instruction and the storing of the result. We were able to do the replacement and include the necessary changes in the other process involved without modifying the rest of the design.

Another, we hope, even more spectacular, example is the design of a GaAs version of the microprocessor that we completed a few months ago. We wanted to show that the almost perfect interface between logical design and physical implementation that the method provides makes it possible to port a design from one technology (CMOS) to another, very different, one (GaAs) with practically no design changes. In about half a year, we were able to design an entirely new logic family (the standard implementation of a set of operators in GaAs) and map the set of production rules defining the processor into a network of GaAs transistors. Our first choice of logic family was extremely conservative (we were concerned about noise immunity), and as a result the first version was functional but ran at only ��MIPS and consumed �W. We are now redesigning the chips with another choice of logic family. We expect this version to run at ���MIPS and consume �W.

More than anything else, the modularity of this approach will revolutionize hardware design by drastically reducing the design time and simplifying the interfaces.

The advantage of a simple interface can be exploited also at the level of system architecture. Future advances in computer technology will no longer be due entirely to the speed increase of the semiconductors, but also, and mainly, to the ability to assemble very large collections of chips. Parallel supercomputers all exploit this principle. Unless the interface mechanism between the chips is flexible enough to accommodate a diversity of designs due to different fabrications or to different solutions, the reliability of such architectures is very questionable. My experience with the design of the AMETEK multicomputer [15] provides an illustration of this principle. I was able to convince the designers that (at least) the mesh routing network should be asynchronous. Consequently, it was possible, later in the development of the system, to mix two different families of routing chips (different designs and speeds but the same interface) inside the same routing network without any noticeable effect.

� Conclusion

The results already achieved indicate that the type of approach to VLSI design we have described is very promising. Based on these results, we will venture a vision of the future that we hope is only slightly optimistic.

Uniform notations and methods will be used across the hardware-software boundary. Concurrent computation paradigms will replace traditional ones (switching and automata theory) in the design of VLSI systems. Logic verification issues will become critical; verification tools and methods will be widely used. Correctness by construction and high-level synthesis will become routine. Asynchronous circuits will be the preferred implementation both for methodological reasons and for efficiency reasons (power consumption, robustness, speed). For large systems, designs using synthesis tools will outperform hand designs both in reliability and performance. However, these new methods will require a new generation of designers.

Acknowledgements

The research described in this paper would not have been possible without the contributions of my present and recent students: Dražen Borković, Steve Burns (now at the University of Washington), Marcel van der Goot, Pieter Hazewindus, Tony Lee, Christian Nielsen, and José Tierno. The insights, help, and encouragement of my Caltech colleague Charles L. Seitz are deeply appreciated. The research was sponsored by the Defense Advanced Research Projects Agency, DARPA Order number ����� and monitored by the Office of Naval Research under contract number N���������K������


References

[1] Steven M. Burns and Alain J. Martin. Syntax-directed Translation of Concurrent Programs into Self-timed Circuits. Proc. Fifth MIT Conference on Advanced Research in VLSI, ed. J. Allen and F. Leighton, MIT Press, ���� �� �

[2] Edsger W. Dijkstra. A Discipline of Programming. Prentice-Hall, Englewood Cliffs NJ, 1976.

[3] C.A.R. Hoare. Communicating Sequential Processes. Comm. ACM 21(8), pp 666-677, 1978.

[4] David L. Johannsen. Silicon Compilation. Decennial Caltech Conference on VLSI, ed. C.L. Seitz, MIT Press, ������ �� ��

[5] Alain J. Martin. The Probe: An Addition to Communication Primitives. Information Processing Letters ��� pp ������� �� �

[6] A.J. Martin, S.M. Burns, T.K. Lee, D. Borkovic, P.J. Hazewindus. The Design of an Asynchronous Microprocessor. Decennial Caltech Conference on VLSI, ed. C.L. Seitz, MIT Press, ������� �� ��

[7] Alain J. Martin. Compiling Communicating Processes into Delay-insensitive VLSI circuits. Distributed Computing, ����� �� ��

[8] Alain J. Martin. Programming in VLSI: From Communicating Processes to Delay-Insensitive Circuits. UT Year of Programming Institute on Concurrent Programming, ed. C.A.R. Hoare, Addison-Wesley, Reading MA, �� ��

[9] Alain J. Martin. Synthesis of Asynchronous VLSI Circuits. Formal Methods for VLSI Design, ed. J. Staunstrup, North-Holland, �����

[10] Alain J. Martin. The Limitations to Delay-Insensitivity in Asynchronous Circuits. Sixth MIT Conference on Advanced Research in VLSI, ed. W.J. Dally, MIT Press, �����

[11] Alain J. Martin and Pieter J. Hazewindus. Testing Delay-Insensitive Circuits. Proc. ���� University of Santa Cruz Conference on Advanced Research in VLSI, ed. Carlo H. Séquin, MIT Press, �� ����� �����

[12] Alain J. Martin. Asynchronous Datapaths and the Design of an Asynchronous Adder. Formal Methods in System Design, ���� Kluwer, �������� �����

[13] Carver Mead and Lynn Conway. Introduction to VLSI Systems. Addison-Wesley, Reading MA, 1980.

[14] Christian D. Nielsen and Alain J. Martin. A Delay-Insensitive Multiply-Accumulate Unit. Caltech Technical Report CS-TR������� Computer Science Department, California Institute of Technology, �����


[15] Seitz, C.L., Athas, W.C., Flaig, C.M., Martin, A.J., Seizovic, J., Steele, C.S., and Su, W.-K. The Architecture and Programming of the Ametek Series � Multicomputer. Proceedings of the Third Conference on Hypercube Concurrent Computers and Applications, ACM Press, New York, �����