
A CRITIQUE ON PARALLEL COMPUTER ARCHITECTURES INFORMATICA 2/88

UDK 681.3.001.519.6

L. M. Patnaik, R. Govindarajan

M. Špegel and J. Šilc

Indian Institute of Science, Bangalore, and Jožef Stefan Institute

The need for parallel processing is felt in many disciplines. In this paper we discuss the essential issues involved in these systems. Various architectural proposals reported in the literature are classified based on the execution order and the parallelism exploited. The design and salient features of some architectures which have gained importance are highlighted. Architectural features of non-von Neumann systems such as data-driven, demand-driven, and neural computers, which open the horizon for research into new models of computation, are presented in this paper. Principles and requirements of programming languages and operating systems for parallel architectures are also reviewed.

Contents

1. Introduction

2. Issues in Parallel Processing Systems
2.1. Models of Computation
     Control Flow, Data-Driven Model, Demand-Driven Model

2.2. Concurrency
     Temporal, Spatial and Asynchronous Parallelism, Granularity of Parallelism

2.3. Scheduling
2.4. Synchronization

3. Von Neumann Parallel Processing Systems
3.1. Pipeline and Vector Processing
     Pipeline Architecture, Vector Processing

3.2. SIMD Machines
3.3. MIMD Architecture
     Interconnection Networks, Tightly Coupled Multiprocessors, Loosely Coupled Multiprocessors

3.4. Reconfigurable Architecture
     Connection Machine

3.5. VLSI Systems
     Systolic Arrays, Wavefront Array Processors

3.6. Critique on von Neumann Systems

4. Non-von Neumann Architectures
4.1. Data-Driven Model
4.2. Demand-Driven Systems
4.3. Neural Computers

5. Software Issues Related to Multiprocessing Systems
5.1. Parallel Programming Languages
     Conventional Parallel Languages, Non-von Neumann Languages

5.2. Multiprocessor Operating Systems

6. Conclusions

7. References

1. Introduction

There has been an ever-increasing need for more and more computing power. In a real-time environment, the demands are much higher. While computers built around a single processor cannot stretch their processing speed beyond a few million floating point operations per second (FLOPS), an operating speed of a few giga FLOPS is required in many applications. Parallel processing techniques offer a promising approach for these applications.

Several applications such as computer graphics, computer aided design (CAD) of electronic and mechanical systems, weather modeling and robotics have an enormous appetite for computing power. To quote an example, ray tracing techniques [32] used to display 3-dimensional objects in computer graphics have to process 5x10^6 rays, each ray intersecting 10 to 20 surfaces, and each intersection computation requiring 20 to 30 floating point operations on average. In order to provide a fast, interactive and flicker-free display, each frame has to be computed 30 times per second. Thus, the total computation power required is about 60 giga FLOPS. Comparing this with the execution speed of present day supercomputers -- whose expected/measured performance is about 100 million FLOPS [53] -- we observe that the demand is higher by more than two orders of magnitude.
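As a rough check of the figure quoted above, the arithmetic can be reproduced with the mid-range values of the estimates given (a back-of-the-envelope sketch, not part of the original text):

```python
# Reproduces the ray-tracing estimate using mid-range values from the text.
rays_per_frame = 5e6            # "5x10^6 rays"
surfaces_per_ray = 15           # "10 to 20 surfaces" per ray
flops_per_intersection = 25     # "20 to 30 floating point operations"
frames_per_second = 30          # flicker-free display rate

flops_needed = (rays_per_frame * surfaces_per_ray *
                flops_per_intersection * frames_per_second)
print(f"{flops_needed:.2e} FLOPS")   # about 5.6e10, i.e. roughly 60 giga FLOPS
```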

The need for parallel processing is conspicuous from the above illustration. The recent advances witnessed in VLSI technology and the consequent decline in hardware cost further encourage the construction of massively parallel processing systems.

The paper is organized in the following manner. In Section 2, we point out the major issues associated with multiprocessing systems, with a brief discussion on each of these items.


We classify parallel processing architectures broadly into two classes. The first of them, the von Neumann type, includes all multiprocessing systems that adopt the principles of the von Neumann computation model. Pipeline and vector processing, SIMD machines, MIMD machines and VLSI systems, which fall under this category, are surveyed in Section 3. Supercomputers employ one or more of these techniques for achieving high performance. More emphasis is given to current and recent developments in these areas. Under non-von Neumann architectures we include data-driven, demand-driven and neural computers. Their methodology of computation is fundamentally different from that of the systems discussed earlier. An overview of these architectural configurations is presented in Section 4.

The programming language and operating system support required for working with these complex parallel processing systems is reviewed in detail in the subsequent section. Parallel programming languages are classified into two categories: the conventional von Neumann type (which adopt the imperative style of programming) and the non-von Neumann languages. Case studies of some of the popular languages and their salient features are also reported.

We remark that an exhaustive coverage of all contributions and research work reported in this fascinating area of parallel processing systems is beyond the scope of this paper, and we do not attempt it. Rather, we concentrate on the basic principles behind the design of the various architectures. We pay more attention to some specific architectures and models of computation which have gained importance.

2. Issues in Parallel Processing Systems

In this section we discuss the various issues associated with multiprocessing systems. First, it is interesting to look at the models of computation and the way one has evolved from the other, leading to the myriad architectural configurations and machines.

2.1. Models of Computation

Depending on the instruction execution order, computer systems can be classified as control-driven, data-driven or demand-driven machines.

Control Flow Conventional von Neumann uniprocessing and multiprocessing systems have an explicit execution order specified by the programmer. This hinders the extent of parallelism that can be exploited. However, these control-flow models perform well for structured data items such as arrays [34]. Also, the three decades of programming experience we have had with these models still make them a strong candidate for future generation computers.

Data-Driven Model In this model, the execution order is specified by data dependency alone [26]. Instructions are executed as soon as all their input arguments become available. The data flow model is suitable for expression evaluation [34]. Since only data dependency determines the execution order, and the granularity of parallelism exploited is as low as a single instruction, this model of computation exploits maximum parallelism. However, because of its eagerness in evaluating functions, it executes an instruction whenever its operands are available, irrespective of whether or not it is required for the computation of the final result.

Demand-Driven Model In demand-driven (also called reduction) execution, pioneered by Berkling [15] and Backus [9], the demand for a result triggers the execution, which in turn triggers the evaluation of its arguments, and so on. This demand propagation continues until constants are encountered; then a value is returned to the demanding node and execution proceeds in the opposite direction. Since a demand-driven computer performs only those computations required to obtain the results, it will in general perform fewer computations than a data-driven computer. As the computation model is 'lazy' (1) in evaluating expressions, instructions accept partially filled data structures (2). This makes the demand-driven model able to support infinite data structures. Such a facility is not available in the other models of computation. However, this model suffers overheads in terms of execution time due to the additional propagation of the demand (for arguments). The control mechanism and the management of program memory add further to the inefficiency [91].
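The contrast can be made concrete with a small sketch in Python (not from the paper): a generator stands in for the lazily evaluated structure SEQ(N) = CONS(N, SEQ(N+1)) of footnote 2, whose elements are produced only when a consumer demands them, whereas an eager, data-driven evaluation of the same definition would never terminate.

```python
def seq(n):
    # Conceptually infinite sequence n, n+1, n+2, ...  Each element is
    # computed only when the consumer demands the next value.
    while True:
        yield n
        n += 1

s = seq(0)
first_five = [next(s) for _ in range(5)]   # only five elements are ever computed
print(first_five)                          # [0, 1, 2, 3, 4]
```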

2.2. Concurrency

The granularity and nature of the parallelism exploited significantly influence the performance of parallel processing systems.

Temporal, Spatial and Asynchronous Parallelism By performing overlapped computation one can exploit temporal parallelism. Pipeline computers [53,77] execute instructions in an overlapped manner, as in a manufacturing assembly line, to achieve parallelism. Examples of pipeline processing are the execution of the different phases of an instruction, namely instruction fetch, decode, operand fetch and execution, and the execution of the various steps involved in floating point arithmetic operations at successive stages of a pipeline. These machines are ideally suited for processing vector instructions [53] such as vector add, vector multiply and dot product.

Spatial parallelism is the parallelism inherent in performing an operation over the elements of structured data, such as arrays. Spatial parallelism is easy to detect and exploit [34]. The synchronization and scheduling overheads involved in exploiting spatial parallelism are much lower than those experienced in the other two models.

Multiple instruction multiple data (MIMD) [53] machines achieve asynchronous parallelism through a set of interacting processors with shared resources. Several independent processes running asynchronously on different processors work to accomplish a common goal by exchanging messages or by sharing common variables.

While spatial parallelism is easy to detect and exploit, the range of applications in which it is inherent is limited. MIMD machines which exploit asynchronous parallelism, on the other hand, offer flexibility in programming. Hence they can be used for a wide range of problems.

Granularity of Parallelism For high performance, we want to exploit as much parallelism as possible.

(1) The nature of the model in executing an instruction only on demand is termed 'lazy'.

(2) The computation of a recursively defined data structure, for example SEQ(N) = CONS(N, SEQ(N+1)), never terminates, and hence the structure is infinite. Such a data structure can be partially filled by evaluating only those terms which are required for further computation.


This means that the granularity of the tasks should be low. We observe that as the level of granularity goes down, the parallelism that can be exploited increases. However, the lower the level of parallelism, the greater the synchronization overheads. A tradeoff between the parallelism exploited and the synchronization overheads is required to achieve high performance.

Another major issue involved is parallelism detection. Parallelism can either be explicitly specified by the user, or be detected from a program written in a sequential language using an intelligent compiler. For expressing parallelism explicitly, languages such as CSP [50], Occam [55] and Concurrent Pascal [45] can be used. The user needs knowledge of the architecture of the machine on which the program is executed, of the application program and of its runtime characteristics. The second approach, which uses program restructuring techniques [71] to transform a sequential program into a parallel form, does not require such a knowledgeable user. However, these techniques are inefficient and cannot detect parallelism to the fullest extent. Active research to develop efficient parallel programs using these paradigms is underway.

2.3. Scheduling

In a multiprocessor system, scheduling algorithms assign each task to one or more processors with the goal of achieving high performance. Scheduling can be static or dynamic (refer to [6] for a comparison). In static scheduling, the tasks are allocated to processors either during algorithm design by the user or at compile time by an intelligent compiler. In both these approaches, the scheduling costs are paid only once, even if the program is run many times with different data. Moreover, there is no runtime overhead. The disadvantage of static scheduling is the possible inefficiency in guessing the runtime profile of each task.

Dynamic scheduling at runtime offers better utilization of processors, but at the cost of additional scheduling time. The dynamic scheduling algorithm can be distributed or centralized.
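A minimal sketch of dynamic, centralized scheduling, with Python threads standing in for processors that pull ready tasks from a shared work queue at run time (the names are illustrative, not from the paper):

```python
import queue
import threading

tasks = queue.Queue()
for i in range(20):
    tasks.put(i)                     # twenty independent tasks

def processor(pid, results):
    while True:
        try:
            t = tasks.get_nowait()   # task assigned dynamically at run time
        except queue.Empty:
            return                   # no work left for this processor
        results.append((pid, t * t)) # stand-in for the task's actual work

results = []
workers = [threading.Thread(target=processor, args=(p, results)) for p in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(len(results))                  # 20: every task executed exactly once
```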

2.4. Synchronization

Synchronization methods are required to coordinate the parallel execution of tasks in a multiprocessor system. Architectures with shared storage achieve synchronization by semaphores [27] and monitors [49]. In a message passing multiprocessing system, synchronization of processes is achieved using remote procedure calls [46].

In order to update the shared variables in a multiprocessor in a consistent manner (avoiding read-read and read-write races [53]), the updating of the variables is done in a region called the critical region. Only one process is allowed to enter the critical region at a time. Preventing access by other processes while the critical region is being accessed by a process is known as mutual exclusion. Synchronization primitives are used to achieve mutual exclusion.

Of the many synchronization primitives proposed (refer to [35] for a survey), a few worth mentioning are

(i) the 'test & set' primitive of the IBM 360/370 machines [17];
(ii) the 'lock' instruction used in C.mmp [101];
(iii) the NYU Ultracomputer's 'fetch & add' [42].
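The following sketch illustrates mutual exclusion built on a test-and-set style primitive, with Python threads standing in for processors; the inner lock only simulates the atomicity that the hardware instruction provides (an illustration, not the IBM or C.mmp implementation):

```python
import threading

class TestAndSetLock:
    def __init__(self):
        self._flag = False
        self._atomic = threading.Lock()   # simulates the atomicity of 'test & set'

    def _test_and_set(self):
        with self._atomic:                # read old value and set flag in one step
            old, self._flag = self._flag, True
            return old

    def acquire(self):
        while self._test_and_set():       # spin until the old value was False
            pass

    def release(self):
        self._flag = False

counter = 0
lock = TestAndSetLock()

def worker():
    global counter
    for _ in range(10000):
        lock.acquire()                    # enter the critical region
        counter += 1                      # shared-variable update
        lock.release()                    # leave the critical region

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                            # 40000: no update was lost
```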

The subsequent sections highlight the architectural features of parallel processing systems. The whole spectrum of architectures proposed in the literature can broadly be classified into conventional von Neumann parallel processing systems and non-von Neumann systems.

3. Von Neumann Parallel Processing Systems

Von Neumann parallel processing systems can be divided into three major architectural configurations: pipeline, SIMD, and MIMD architectures. These three architectural models exploit the three kinds of parallelism, namely temporal, spatial and asynchronous parallelism, respectively. Each of the above mentioned architectures is discussed in detail, and the design and salient features of certain parallel processing machines are highlighted in this section. Reconfigurable architectures and VLSI systems, though not entirely distinct from the above discussed models, attract the attention of researchers due to their many interesting features. In particular, VLSI systems offer new scope in computing for designing dedicated architectures for a variety of applications. Constituent elements of VLSI systems are systolic [63] and wavefront [64] architectures.

3.1. Pipeline and Vector Processing

Pipeline Architecture Pipelining [77] offers an economical way to realize temporal parallelism. The concept of pipeline processing in a computer is similar to manufacturing on assembly lines in an industrial plant. To achieve pipelining, one must subdivide the input task into a sequence of subtasks, each of which can be executed by a specialized hardware stage that operates concurrently with the other stages in the pipeline. Successive tasks are pumped into the pipe and get executed in an overlapped fashion at the subtask level. Pipeline processing leads to a tremendous improvement in system throughput. A k-stage linear pipeline could be at most k times faster. However, due to memory conflicts, data dependency, branches and interrupts, this ideal speedup cannot be achieved.

One way of classifying a pipeline is based on the function performed by the pipeline. There are two classes of pipelines under this classification, namely the instruction pipeline and the arithmetic pipeline. In the instruction pipeline, the various phases of instruction execution, such as instruction fetch, instruction decode, operand fetch, and instruction execution, are identified and are executed in the successive stages of a linear pipeline. Thus, after an initial delay, the pipeline completes an instruction every clock cycle.

Arithmetic pipelines subdivide arithmetic operations such as floating-point add or floating-point multiply into subtasks and execute them on specialized arithmetic and logic units. In an instruction pipeline, the instruction execution unit can itself be an arithmetic pipeline to further improve the performance. The IBM 360/91 machine employs both these pipelines. Much research has already been done on the scheduling of pipelines; buffering and delaying techniques are used to improve the execution speed. In the Cray 1 [82], there are 12 functional pipeline units to perform both scalar and floating point arithmetic operations. The Cyber 205 supports four vector pipelines in addition to a scalar arithmetic unit.
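The 'at most k times faster' remark can be quantified with the standard expression for the speedup of a k-stage linear pipeline over n independent tasks, S = nk / (k + n - 1); the formula is textbook material rather than something stated in this paper:

```python
def pipeline_speedup(k, n):
    # k pipeline stages, n tasks: the first result appears after k beats,
    # every further result after one beat, so n tasks take k + n - 1 beats
    # against n * k beats on an equivalent non-pipelined unit.
    return (n * k) / (k + n - 1)

for n in (1, 10, 100, 1000):
    print(n, round(pipeline_speedup(4, n), 2))
# 1 -> 1.0, 10 -> 3.08, 100 -> 3.88, 1000 -> 3.99: the speedup approaches k = 4
```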

Vector Processing In this section, we explain the basic concepts of vector processing [53]. Vector processors operate on vector data and execute vector instructions. Vector pipelines, unlike scalar pipelines, are assured of a continuous stream of data. The overhead involved in initializing a vector pipeline is compensated by the speedup gained, as the number of tasks executed is large. Loop termination conditions are handled by specialized hardware in various stages of the vector pipelines. These features make vector processing more efficient than scalar pipelines.

Vectorizing compilers [6] transform programs written in conventional imperative languages into vector instructions suitable for execution on vector processors. The fact that present day supercomputers have vectorizing compilers and vector processors as their major components demonstrates that pipeline processing is an easy and efficient way of realizing high speed computation. However, not all programs can be vectorized. Pipeline processors perform poorly in such cases.
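As an illustration only (NumPy array arithmetic is used here as a stand-in for vector hardware, not as an example of a vectorizing compiler), the difference between a scalar loop and a single vector operation over the same operands looks as follows:

```python
import numpy as np

a = np.arange(100_000, dtype=np.float64)
b = np.arange(100_000, dtype=np.float64)

# scalar view: one add per loop iteration
c_scalar = [x + y for x, y in zip(a, b)]

# vector view: one "vector add" over the whole operand stream
c_vector = a + b

print(np.allclose(c_scalar, c_vector))   # True: same result, different execution model
```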

3.2. SIMD Machines

A synchronous array of parallel processors executing in a lock-step fashion, consisting of multiple processing elements under the supervision of one control unit, is called an array processor. The system and user programs are executed under the control of the control unit. The array processor handles a single instruction, multiple data (SIMD) stream. SIMD machines exploit spatial parallelism. The processing cells are interconnected by a data-routing network. The interconnection pattern to be established for a specific computation is under program control. Vector instructions are broadcast to the processing cells for distributed execution over the different processors. The cells are passive devices without instruction decoding capabilities.

Array processors became well publicized with the development of Illiac IV [12], Clip 4 [29], and the Massively Parallel Processor (MPP) [14]. Future research in SIMD machines is directed towards designing and implementing multiple-SIMD (MSIMD) machines [53], consisting of more than one control unit. Each control unit in this class of machines shares a pool of dynamically allocatable processors.

Array processors are characterized by their ability to support parallelism at a low level. They are ideally suited for specific applications in the areas of image and signal processing [81]. For example, a cellular array [81] is a two-dimensional array of processors, each of which communicates directly with its neighbors. The structure of the array is very appropriate in terms of layout on a chip. Work in this direction has led us to the design of SIMD architectures for CAD applications [78].

However, as mentioned in Section 2.2, the range of applications over which SIMD machines can be put to efficient use is restricted, and hence they are not candidate architectures for general purpose parallel processing machines.

3.3. MIMD Architecture

MIMD machines can be broadly characterized by two attributes: first, a multiprocessor system is a single computer that includes multiple processors; second, a multicomputer has several autonomous computers which are geographically distributed and connected through a communication network. There is an important distinction between the two systems. In the first case, the multiple processors work concurrently in order to achieve a single goal. Interaction between two processors is essentially in terms of intermediate results and synchronization messages. Multicomputers, in contrast, communicate among themselves basically to share expensive resources. Each autonomous computer works on an independent task. In this section we will concentrate more on multiprocessors. Discussion of distributed computing systems using computer networks is beyond the scope of this paper.

Multiprocessing systems can be classified into two groups, based on how the processing elements interact among themselves [53]. When several processors communicate among themselves through a shared global memory, we classify them as tightly coupled systems (refer to Fig. 1). Hence the rate at which the processors can communicate is of the order of the bandwidth of the memory. On the other hand, we have loosely coupled systems, where processors do not share a common memory but communicate using message passing primitives.

Fig. 1 Model of a Tightly Coupled System: (a) without private caches, (b) with private caches. Processors with unmapped local memory (ULM) and a memory map (MM) are connected to the shared memory modules through a processor-memory interconnection network (PMIN), to the input-output channels and disks through an I/O-processor interconnection network (IOPIN), and to one another through an interrupt signal interconnection network (ISIN); in (b) each processor also has a private cache (PC).


Fig. 2 Static Interconnection Topologies: (a) linear array, (b) ring, (c) star, (d) tree, (e) near-neighbor mesh, (f) systolic array, (g) completely connected, (h) chordal ring, (i) cube

The interconnection network plays an important role in multiprocessor systems and significantly influences the performance of the system. A quick overview of interconnection networks is presented in this section, followed by a discussion of the two configurations of multiprocessing systems.

Interconnection Networks The increasing popularity of proposed multiprocessing systems with 10^4 to 10^5 processing elements makes the concept, design and implementation of the interconnection network a crucial factor in the design of such systems. A typical interconnection network consists of a set of switching elements. Networks can be classified based on the following four design factors: operation mode, control strategy, switching method and network topology [30]. A network can send and receive messages in a synchronous or asynchronous mode. Classification on the control strategy is based on whether the switching elements are controlled in a centralized or distributed manner.

Circuit switching and packet switching are the two switching methods adopted in a network. In circuit switching, a physical path is actually established between the source and destination nodes. In contrast, in packet switching, data is put into a packet which is routed through the network without establishing a physical path. Fig. 2 and Fig. 3 depict some of the static and dynamic topologies of computer networks.

The bus interconnection scheme connects the various processing cells through a common shared bus. The bandwidth of the bus is very low, and contention is a serious consequence when a large number of cells are connected to the bus. Various schemes [11], such as daisy chain, parallel priority and time-sliced schemes, have been proposed to resolve bus contention. Fig. 4 shows the organization of a shared bus multiprocessing system with the daisy chain scheme for priority resolution. Despite its shortcomings, the shared bus still attracts system designers because of its low cost and complexity and the easy upgradability of the system.

The crossbar switch [53], on the other hand, is very expensive, but provides high memory bandwidth. The cost of the network is of the order of n^2, where n is the number of processing cells in the system.

The multistage interconnection network (MIN) strikes a balance between cost and performance. It is a dynamic network using a number of switching elements, unlike a static network where dedicated links are used. A MIN of size n x n establishes a maximum of n links at any instant, at a cost of order n log n. This attractive feature makes the MIN a popular choice in many multiprocessing systems such as the New York University Ultracomputer (NYU) [42].

Fig. 3 Dynamic Interconnection Topologies: (a) 8 x 8 baseline network, (b) 8 x 8 Benes network, (c) Clos network built from n x m, r x r and m x n switching stages
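The cost figures quoted above (n^2 for the crossbar against n log n for a MIN built from 2 x 2 switching elements) can be compared directly; the small sketch below assumes n is a power of two:

```python
import math

def crossbar_cost(n):
    # one crosspoint for every (input, output) pair
    return n * n

def min_cost(n):
    # log2(n) stages of n/2 two-by-two switching elements each
    return (n // 2) * int(math.log2(n))

for n in (16, 256, 4096):
    print(f"n={n}: crossbar {crossbar_cost(n)}, MIN {min_cost(n)} switching elements")
```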


Fig. 4 Daisy Chain Implementation of Shared Bus: a bus control unit and devices 1..m on a common bus, with bus request (BRQ), bus grant (BGT) and bus busy (SACK) lines

Tightly Coupled Multiprocessors Tightly coupled systems are ideally suited if high speed or real-time processing is desired. In a tightly coupled system a set of processing elements is connected to a set of memory modules through an interconnection network.

If two or more processors attempt to access the same memory module, a conflict occurs. Hence the memory modules are either low-order interleaved or high-order interleaved to reduce the number of conflicts. In a variant of tightly coupled systems, each processor is allowed to have a private memory, called the cache of that processor. This greatly reduces the traffic through the network, as local data can be stored in and accessed from the cache.

Processors communicate intermediate results through the shared memory modules. The set of processors may be homogeneous or heterogeneous. It is homogeneous if the processors are functionally identical. A processor may, however, differ from the others in its capability to access I/O systems, and in its performance and reliability.

Various multiprocessing systems have been proposed using a shared memory architecture. Examples of these are the C.mmp system [101], the Heterogeneous Element Processor (HEP) [25], the New York University Ultracomputer (NYU) [42], Honeywell 60/66, Univac 1100/80 and IBM 3084 AP. A dedicated multiprocessing architecture with a time shared bus has been designed and experimented with by us for computer graphics applications [36,37,38].

Tightly coupled systems can tolerate a higher degree of interaction without much deterioration in performance. However, one of the limiting factors of tightly coupled systems is the performance degradation due to memory conflicts when the number of processors in the system is increased.

Loosely Coupled Multiprocessors Loosely coupled multiprocessor systems do not generally encounter the degree of memory conflicts experienced by tightly coupled systems. In such systems, each processor has a set of input/output devices and a large local memory from which it accesses most of its instructions and data. Processes which execute on different computer modules communicate by exchanging messages through a message transfer system. The degree of coupling in such a system is very loose (hence the name). The determining factor in the degree of coupling is the communication topology of the associated message transfer system. Loosely coupled systems are efficient when the interactions among the tasks are minimal.

Fig. 5 illustrates the model of a loosely coupled system. The channel and arbiter switch in each of the computer modules may have a high speed communication memory which is used for buffering block transfers of messages. The communication memory is accessible by all processors through the communication interface. The message transfer system for a non-hierarchical system could be a simple time shared bus. The performance of such a configuration is limited by the message arrival rate on the bus, the message length, and the bus capacity (in bits per second).

Fig. 5 Model of a Loosely Coupled System: (a) a computer module (processor, local memory LM, input/output), (b) loose coupling of computer modules 0 to N-1 through a message transfer system (MTS)

Loosely coupled systems in which the processors are connected by dedicated links are also popular. In such a network of processors, locality (nearness between a pair of processors) can be exploited by scheduling tasks which interact heavily among themselves onto a group of processors which are close to each other. Failure of a node or link will not catastrophically affect the system, as alternate paths can be established and the system can perform with graceful degradation.

The Cm* architecture [54] developed at Carnegie Mellon University is one example of a loosely coupled system. It consists of a set of clusters connected by intercluster buses (Fig. 6). Each cluster has a set of processing elements connected over a map bus. The system is a good example of a hierarchical structure.


Fig. 6 Cm* Architecture: clusters of processing elements connected by intercluster buses

A loosely coupled system designed for artificial intelligence and image processing applications is ZMOB [80]. In this architecture, a set of 256 Zilog microprocessors is connected by a pipeline system.

Another loosely coupled architecture which has attracted the attention of many researchers, due to its high performance and relatively low cost, is the hypercube architecture [47,83]. Because of the increased attention it has received, we devote the following paragraphs to this architecture.

The concept of a hypercube computer can be traced back to work done in the early 1960s by Squire and Palais [93] at the University of Michigan. They carried out a detailed paper design of a 4096-node (12-dimensional) hypercube.

The hypercube topology is an n-dimensional generalization of the simple cube. A hypercube of dimension n has 2^n cells. Each cell is connected to the n neighboring cells which are at a Hamming distance of one. The connection between adjacent nodes is a point-to-point link. Fig. 7 shows the topology of the hypercube architecture for dimensions three and four. The cube manager provides a high level interface for system users.

Fig. 7 Hypercubes: (a) dimension = 3, (b) dimension = 4

It serves as a local host for the cube, supporting the programming environment, compilation, program loading, input/output and error handling.

The hypercube architecture has many attractive properties. The hypercube topology yields a regular array in which the nodes are close to one another: no more than n steps apart. At the same time, the number of connections from each node to its neighbors is quite low (also equal to n). It thus strikes a balance between a two-dimensional array, in which the internode connection cost is low but the nodes are O(n^(1/2)) steps apart on average, and a completely connected array, in which the internode connection cost is high but the nodes are only one step apart.

The hypercube architecture is homogeneous in the sense that all the nodes are identical. Further, the hypercube architecture is hierarchical and eminently partitionable. For example, a hypercube of dimension n+1 can be partitioned into two hypercubes of dimension n. This means that it is quite easy to allocate subcubes to subtasks, especially for problems which adopt a divide-and-conquer strategy. Lastly, the hypercube architecture can itself embed other regular topologies such as tree, mesh, pyramid or hexagonal structures.
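The neighbourhood and partitioning properties described above follow directly from the binary addressing of the nodes, as the small sketch below illustrates (illustrative code, not from the paper):

```python
def neighbours(node, n):
    # in an n-dimensional hypercube the neighbours of a node are the nodes
    # whose addresses differ from it in exactly one bit (Hamming distance one)
    return [node ^ (1 << bit) for bit in range(n)]

n = 4                                        # dimension; 2**n = 16 cells
print(bin(5), [bin(v) for v in neighbours(5, n)])

# fixing the top address bit splits an (n+1)-cube into two n-cubes
lower = [v for v in range(2 ** (n + 1)) if not v & (1 << n)]
upper = [v for v in range(2 ** (n + 1)) if v & (1 << n)]
print(len(lower), len(upper))                # 16 16: two subcubes of dimension n
```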

3.4. Reconfigurable Architecture

Another interesting aspect of multiprocessing systems in which active research is in progress is reconfigurability [85]. Often, the full potential of multiprocessing systems is not realised in many applications. The reason for this is a mismatch between the application and the architecture. In order to alleviate this problem, we build dedicated systems where the architecture matches the application. Another approach, actively pursued by researchers, is to build multiprocessing systems with programmable switches; using these switches it is possible to reconfigure the system depending on the application. Such a reconfigurable system is by nature flexible and hence can be used for a variety of applications. The Configurable Highly Parallel computer (CHiP) [85] is a reconfigurable system designed to suit the topologies of various applications. In this section we highlight the salient features of another reconfigurable architecture, the Connection Machine [48]. Its organization is similar to that of an SIMD machine, but it functions as a reconfigurable architecture executing multiple instructions on multiple data streams, and hence it is hard to classify it as either SIMD or MIMD.

Connection Machine The desire to build a machine that will be able to perform the functions of a human mind, the thinking machine, is the motivating force behind the design of the Connection Machine.


Specifically, retrieving commonsense knowledge from a semantic network was the application in the designer's mind.

The Connection Machine computes through the interaction of many, say a million, simple identical processing/memory cells. The two requirements for the Connection Machine are:

(i) each processing element must be as small as possible, so that we can afford to have as many of them as we need;
(ii) the processing elements should be connected by software.

The Connection Machine architecture follows directly from these two requirements. It provides a large number of tiny processor/memory cells connected by a programmable network. Each cell is sufficiently small that it is incapable of performing meaningful computations on its own. Instead, multiple cells are connected together into data-dependent patterns, called 'active data structures', that both represent and process the data. The activities of these active data structures are directed from outside the Connection Machine by a conventional host computer. This host computer stores data structures on the Connection Machine in much the same way that a conventional machine stores them in a memory. Unlike a conventional memory, though, the Connection Machine has no processor/memory bottleneck: the memory cells themselves do the processing. More precisely, the computation takes place through the coordinated interaction of the cells in the active data structure. Because thousands or even millions of processing cells work on the problem simultaneously, the computation proceeds much more rapidly than would be possible on a conventional machine.

A Connection Machine is connected to a conventional computer much like a conventional memory. Its internal state can be read and written a word at a time from the conventional memory. It differs from a conventional memory in three respects. First, associated with each cell of storage is a processing cell that can perform local computations based on the information stored in that cell. Second, there exists a general intercommunication network that can connect all the cells in an arbitrary pattern.

Fig. 8 Connection Machine: host and memory bus, microcontroller, and 65536 processor/memory cells of 4096 bits each (32 Mbytes of memory in all), with a 500 Mbit/s input/output channel

Third, there is a high-bandwidth input/output channel that can transfer data between the Connection Machine and peripheral devices at a much higher rate than would be possible through the host.

A connection is formed between two processing/memory cells by storing a pointer in the memory. These connections can be set up by the host, loaded through the input/output channel, or determined dynamically by the Connection Machine itself. In the prototype system, there are 65,536 (2^16) processor/memory cells, each with 4096 bits of memory. The block diagram of the Connection Machine with host, processor/memory cells, communication network, and input/output is shown in Fig. 8.

The control of the individual processor/memory cells is orchestrated by the host computer. For example, the host may ask each cell that is in a certain state to add two of its (the cell's) memory locations locally and pass the resulting sum to a connected cell through the communication network. Thus a single command from the host can result in tens of thousands of additions and a permutation of data that depends on the pattern of connections. Each processor/memory cell is so small that it is essentially incapable of computing or even storing the results of any significant computation on its own. Instead, the computation takes place in the orchestrated interaction of thousands of cells through the communication network.

The ability to configure the topology of the machine to match the topology of the problem turns out to be one of the most important features of the Connection Machine. The Connection Machine can be used as a content addressable or associative memory, but it is also able to perform non-local computations through its communication network.

3.5. VLSI Systems

While the previous subsection discussed the features and requirements of reconfigurable architectures, this section is devoted to the architecture of dedicated systems. Nowadays, mature VLSI/WSI (Very Large Scale Integration / Wafer Scale Integration) technology permits the manufacture of circuits whose layouts have a minimum feature size of 1 to 3 microns [69]. The effective yields of VLSI/WSI fabrication processes make possible the implementation of circuits with up to half a million transistors at reasonable cost -- even for relatively small production quantities. This opens the horizon for building systems with thousands of processors in a cost-effective and compact manner.

The key attributes of VLSI computing structures [33,63] are

(i) simplicity and regularity,
(ii) concurrency and communication, and
(iii) computation-intensiveness.

A VLSI structure should be such that its basic building block is simple and regular, and is used repetitively with simple interfaces. This helps us to cope with high complexity. The algorithms designed for these structures should support a high degree of concurrency, and employ only simple, regular communication and control to allow efficient implementation. VLSI processors are suitable for implementing compute-bound algorithms. Under VLSI processors, we discuss two classes of architectures, namely systolic and wavefront processors.

Systolic Arrays Systolic arrays are admired for their elegance and potential for high performance.


Fig. 9 Systolic Arrays: (a) linear systolic array, (b) mesh connected systolic array, (c) hexagonal systolic array, (d) triangular systolic array, (e) tree structured systolic array

Systolic arrays belong to the generation of VLSI/WSI architectures for which regularity and modularity are important to achieve area-efficient layouts. The concept of systolic architecture is a general methodology -- rather than an ad hoc approach -- for mapping high-level computations into hardware structures. In a systolic system, data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory, much as blood circulates to and from the heart. Systolic arrays derive their computational efficiency from multiprocessing and pipelining. The data items pumped into the systolic processors are reused many times as they move through the pipelines in the array. This results in balancing the processing and input/output bandwidths, a requirement for any parallel processing system to alleviate the von Neumann bottleneck.

There are many topologies for interconnecting the processing elements of a systolic array. Some of the most commonly used topologies, namely linear, mesh, triangular, hexagonal and tree structures, are shown in Fig. 9.

Essentially, the whole operation of the systolic system is synchronized with a global clock, and it may be visualized as a sequence of computation and data transfer cycles. Incidentally, the clock is the only global signal allowed in a systolic architecture apart from the power and ground lines [93]. During the data transfer cycle, all the processing elements pump data into the existing data channels, to be accepted by the neighboring processing elements connected to those channels. After this cycle is over, all the processing cells enter the computation cycle, where each of the cells computes concurrently till the end of the computation cycle. This sequence goes on rhythmically and perpetually, in strict synchronism with the clock beats. Systolic architectures thus capture the concepts of parallel processing, pipelining and regular interconnection structure in a unified framework [63].

Having all the desirable properties of an efficient special purpose system, the systolic architecture is an interesting area of research for a variety of applications, namely digital and image processing [62], linear algebra [28,69], database systems [63], computer graphics [67,96,97], computer-aided design, and solid modeling [60]. Systolic algorithms are characterized by repeated computations of a few types of relatively simple operations that are common to many input data items. Often, the algorithms can be described with nested loops or by recurrence equations that describe computations performed on indexed data.
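A classical example of such a recurrence is matrix-vector multiplication on a linear systolic array, where one cell accumulates each output element while the input vector is pumped past the cells one beat at a time. The following simulation is an illustrative sketch of that behaviour, not a description of any particular machine:

```python
def systolic_matvec(A, x):
    # Simulate y = A x on a linear array of n cells (A is n x n).
    # Cell i accumulates y[i]; x values enter one per beat and shift one
    # cell to the right per beat, so cell i sees x[j] at beat i + j.
    n = len(A)
    y = [0] * n                       # one accumulator per processing cell
    pipe = [None] * n                 # value currently in front of each cell
    stream = list(x) + [None] * n     # inputs, then empty beats to drain the array
    for beat, new_x in enumerate(stream):
        pipe = [new_x] + pipe[:-1]    # data transfer cycle: shift right
        for i, v in enumerate(pipe):  # computation cycle: all cells in parallel
            if v is not None:
                j = beat - i          # index of the x element now at cell i
                y[i] += A[i][j] * v
    return y

A = [[1, 2], [3, 4]]
print(systolic_matvec(A, [5, 6]))     # [17, 39], i.e. A x
```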

Systolic architectures were originally conceived as devices capable of performing a specialized task. Essentially, these architectures consist of a large number of identical processors, each having only a single arithmetic or logic operation built into its hardware. This greatly limits the applicability of such a system to many areas. The latest trend of research in this direction is towards the development of systolic cells which are versatile enough to implement compute-intensive functions. The design of programmable systolic cells [31] has been suggested as an effective way towards achieving high performance systolic systems. Each systolic cell now possesses a rich instruction set along with some amount of local storage, which were completely absent in the original versions of the systolic architecture. By suitably programming these systolic cells, a variety of operations can be performed. The programmable nature of the systolic cells offers a high degree of flexibility in operation and high performance.

Wavefront Array Processors The data movements in a systolic array are controlled by global timing-reference 'beats'. The burden of synchronizing the operations of the entire systolic computing network becomes heavy for very large arrays.


A simple solution is to take advantage of the data flow computing principle [91] (which will be discussed in detail in Section 4.1), which leads the designer to wavefront array [64] processing.

The wavefront array combines the systolic pipelining principle with the data flow computing concept. In fact, the wavefront array can be viewed as a static data flow array that supports the direct hardware implementation of regular data flow graphs. Exploitation of the data flow principle makes the extraction of parallelism and the programming of wavefront arrays relatively simpler.

Synchronizing with the global clock and, consequently, the large surge of current (due to the simultaneous energizing or changing of states of the components) are two major problems in systolic arrays. These can be alleviated in wavefront arrays because of their asynchronous nature. When the processing times of the individual cells are not uniform, a synchronous array may have to accommodate the slowest cell by using a slower clock. In contrast, wavefront arrays, because of their data-driven nature, do not have to hold back faster cells in order to accommodate the slower ones. Wavefront arrays also yield higher speed when the computing times are data-dependent. Lastly, programming of wavefront arrays is easier than that of systolic arrays, because wavefront arrays require only the assignment of computations to processing elements, whereas systolic arrays require both assignment and scheduling of computations.

3.6. Critique on von Neumann Systems

The most serious problem with multiprocessors which use the von Neumann model, as discussed in [10], is the presence of globally updatable shared memory. Special mechanisms are required to ensure correctness of results while updating a memory cell. The explicit execution order to be specified by the programmer is another bottleneck of von Neumann systems. This has led to research in non-von Neumann architectures.

4. Non-von Neumann Architectures

The principal stimuli for developing non-von Neumann machines have come from the pioneering work on data flow machines by Jack Dennis [26], and on reduction languages and machines by John Backus [9] and Klaus Berkling [15]. In the data-driven model, the availability of operands triggers the execution of the operation to be performed on them, whereas in the demand-driven model, the requirement for a result triggers the operation that will generate the result. Of late, the realization of the suitability of biological nervous systems for many applications and the use of artificial neural nets have triggered the design of a new system, the neural computer. In this section, we present the details of these three non-von Neumann approaches.

4.1. Data-Driven Model

In a data flow computer, an instruction is executed as soon as all its operands are available. Since data availability alone dictates the execution order of instructions, there is no need for a program counter. Also, data are passed as values between instructions; this eliminates shared memory and makes the synchronization mechanism simpler. Thus the data flow model alleviates the shortcomings of the von Neumann model. Moreover, parallelism is exploited at the instruction level. Hence, a very high speed of computing is possible.

The machine language for a data flow computer is the data flow graph [24]. The data flow graph consists of nodes, which represent operators, and arcs, which carry data between the nodes. Tokens carry data values along the arcs to the nodes. When all the required operands are available, a node is 'fired'. As a result, the input tokens are removed and output tokens are produced.
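A toy interpreter makes the firing rule concrete; the sketch below (illustrative Python, not the Manchester design) evaluates the expression (y+1) * (y-z) as a data flow graph in which a node fires as soon as tokens are present on all of its input arcs:

```python
class Node:
    def __init__(self, op, n_inputs, dests):
        self.op = op                    # operation applied when the node fires
        self.n_inputs = n_inputs
        self.operands = {}              # input arc index -> token value
        self.dests = dests              # list of (destination node, arc index)

def send(node, arc, value, ready):
    node.operands[arc] = value
    if len(node.operands) == node.n_inputs:
        ready.append(node)              # all operands available: node is enabled

def run(initial_tokens):
    ready, results = [], []
    for node, arc, value in initial_tokens:
        send(node, arc, value, ready)
    while ready:                        # fire enabled nodes until no tokens remain
        node = ready.pop()
        out = node.op(*(node.operands[i] for i in range(node.n_inputs)))
        node.operands.clear()           # input tokens are consumed
        if not node.dests:
            results.append(out)
        for dest, arc in node.dests:
            send(dest, arc, out, ready) # output token travels along the arc
    return results

# graph for (y + 1) * (y - z), with y = 4 and z = 2
mul = Node(lambda a, b: a * b, 2, [])
add = Node(lambda a, b: a + b, 2, [(mul, 0)])
sub = Node(lambda a, b: a - b, 2, [(mul, 1)])
print(run([(add, 0, 4), (add, 1, 1), (sub, 0, 4), (sub, 1, 2)]))   # [10]
```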

Data flow architectures are classified as static and dynamic architectures. In the static model, an additional constraint -- no token should be present on any of the output arcs of a node -- is required to enable the execution of an instruction. The implication of this is that the static data flow model cannot support the execution of reentrant routines. This severely restricts the extent of parallelism and asynchrony that can be exploited in the static model. In the dynamic model, additional tags, called environment tags (colors), are associated with the tokens to distinguish the various instantiations of a reentrant routine. Thus, the dynamic model supports fine grain parallelism with full asynchrony. In this paper we describe the architecture of a data flow computer using the Manchester data flow computer [98] as an illustrative example.

In Fig. 10, the switch unit serves as an interface between the host computer and the data flow machine. It routes intermediate results to the token queue and final results to the host computer. The token queue unit is a FIFO buffer which smoothes the flow of tokens in the ring. Tokens in the token queue are checked for their operand type. All tokens directed to single-input nodes are routed directly to the node store. Other tokens are sent to the matching unit.

Tokens that arrive in the matching unit search for their matching partner. If the search fails, the token is stored in the matching store, where it awaits its partner. On the other hand, if the matching partner is found, it (the matching partner) is removed from the matching store and the incoming token is merged with it to form a group token. The group token, which contains the operand values, the address of the node to which the token is destined, and the environment tag, is sent to the node store.

The node store unit stores the program graph; it holds the opcode of the operator and the destination and operand type of the result tokens. Tokens entering the node store acquire this information to form executable packets. The executable packets are sent to the distributor unit, which distributes them to one of the free processing elements. The processing element performs the operation specified by the packet to produce result tokens. The result tokens are collected by the arbitrator and sent to the switch unit. Fig. 10 depicts the block schematic of the Manchester data flow computer.

Research in the area of data flow computation is expanding rapidly in the United States, Japan and Europe. A number of data flow projects are underway in many universities.


Fig. 10 Block Schematic of the Manchester Data Flow Computer: host and switch, token queue, matching unit, node store, distribution network, processing units and arbitration network arranged in a ring around which input/output tokens, group packets and executable packets circulate

Some of the projects worth mentioning here are the M.I.T. data flow computer [26] developed by Dennis, the Irvine data flow machine [7,39] by Arvind, the Manchester data flow system [98], its extended version (the EXtended MANchester architecture) proposed by the authors [74], the Texas Instruments distributed data processor [22], the Utah data driven machine [23], the Toulouse LAU system [20,76], the Newcastle data-control flow computer [89], the efficient static data flow architecture for specialized computation proposed by the authors [87], and the high speed data flow system developed by Nippon Telegraph and Telephone [3].

Much work needs to be carried out in the design of languages for data flow machines and in the implementation of compilers for converting programs into data flow graphs. (Refer to [88] for initial work on these issues.) Efficient methods to overcome the inherent overheads associated with exploiting fine-grain parallelism have to be developed. Although many data flow machines have been proposed in the literature, little effort has been made to prototype them. Demonstrating the feasibility of the data flow model of computation is thus a positive step towards the commercialization of such systems.

4.2. Demand-Driven Systems

In contrast to control flow and data flow programs, which are built from fixed-size instructions, demand-driven (reduction) [91] programs are built from nested expressions. The need for a result triggers the execution of a particular instruction.

An important point to note is that, in a reduction machine, a program is mathematically equivalent to its value. Demanding the result of the definition x = (y+1) * (y-z) means that the embedded reference to x is to be rewritten in a simpler form. This requires that only one definition of x may occur in a program, and that all references to it give the same value, a property known as referential transparency [91].

There are two forms of reduction, called string reduction and graph reduction. The basis of string reduction is that each instruction that accesses a particular definition takes and manipulates a separate copy of the function definition. In graph reduction, by contrast, each instruction that accesses a particular definition manipulates references to that definition. That is, graph reduction is based on the sharing of arguments using pointers.

Fig. 11 The Newcastle Reduction Machine: a backing store, a memory unit holding the definitions, a reduction table, an action unit and a CP store, together with processing units (PU) and their DEQ buffers


In string reduction, each access to a definition results in the evaluation of that definition; already reduced definitions or data values are accessed when the demanded definition has previously been evaluated. While a graph reduction machine takes advantage of the shared definition (in terms of the number of definitions evaluated), it is more complex than string reduction.
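The difference in the number of evaluations can be shown with a small sketch (illustrative Python, not a machine organization): the body of square(a) = a * a references its argument twice, so string reduction reduces the argument expression once per reference while graph reduction shares a single reduced node:

```python
evaluations = 0

def plus_one(y):
    # stands in for an arbitrary reducible sub-expression (y + 1)
    global evaluations
    evaluations += 1
    return y + 1

def square_string_reduction(y):
    # each reference takes its own copy of the definition and reduces it
    return plus_one(y) * plus_one(y)        # argument reduced twice

def square_graph_reduction(y):
    # both references point at one shared node, reduced once and reused
    shared = plus_one(y)
    return shared * shared                  # argument reduced once

print(square_string_reduction(4), square_graph_reduction(4))   # 25 25
print("reductions of (y+1):", evaluations)                     # 3 = 2 + 1
```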

There are two basic problems in supporting the reduction approach on a machine organization [90]: first, dynamically managing the memory for the program structure being transformed and, second, keeping control information about the state of the transformation. The basic organization of a reduction machine, the Newcastle reduction machine [90], is presented below (refer Fig. 11).

The reduction machine organization discussed here supports reduction by expression evaluation. To find work, each processing element traverses the subexpression in its memory looking for a reducible expression. When a processing element locates a reference to be replaced by the corresponding definition, it sends a request to the communication unit via its communication element. The communication unit in such a computer is frequently organized as a tree-structured network, on the assumption that the majority of communications will exhibit locality of reference. Concurrency in such reduction computers is related to the number of reducible subexpressions at any instant and also to the number of processing elements that traverse these expressions.

Apart from the pioneering work of Klaus Berkling [15], the stage of development of reduction computers somewhat lags behind that of data flow computers [91]. However, researchers have demonstrated the principle and feasibility of reduction machine organization by designing several prototype systems, such as the GMD reduction machine [58], the Newcastle reduction machine [90], Mago's cellular tree machine [66], the Applicative Multiprocessing System (popularly known as AMPS) [57], and the Cambridge University SKIM reduction machine [92].

4.3. Neural Computers

In the fields of image processing and speech recognition, the ability to adapt and continue learning is essential. Traditional techniques used in these applications are not adaptive. It has been realized that the biological nervous system is more suitable for applications involving pattern recognition and learning [2]. Artificial neural nets have been studied during the last few years in the hope of achieving human-like performance in the fields of image processing and speech recognition. The neural net models [65] attempt to achieve good performance via dense interconnection of simple computational elements. The interest in this type of non-von Neumann computing technique in recent years is due to the development of new net topologies and algorithms, new analog VLSI implementation techniques, the growing fascination with understanding the functioning of the human brain, and the realization that human-like performance is required for applications involving enormous amounts of processing [51,65]. Several mathematical models have been proposed to exhibit some of the essential qualities of the human mind: the ability to recognize patterns and relationships, to store and use knowledge, to reason and plan, to learn from experience, and to understand what is observed.

Neural net models are specified by the net topology, node characteristics, and training or learning rules. The computational elements or nodes used in neural net models have nonlinear characteristics, typically analog, and are specified by the type of nonlinearity. Most net algorithms also adapt in time to improve performance based on current results. Any artificial neural model must necessarily be a speculation: definitive experimental evidence about the structure and function of the neurological circuitry in the brain is extremely difficult to obtain, since it is hard to measure neural activity without interfering with the flow of information in the neural circuit. Further, the neurons are intricately interconnected, and the flow of information is complicated by the presence of multiple feedback loops.
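A minimal sketch of such a computational element (illustrative weights and inputs, not a model from the cited literature): each node forms a weighted sum of its inputs and passes it through a nonlinearity, and a densely interconnected layer is simply many such nodes sharing the same inputs.

    import math

    def node(inputs, weights, bias):
        activation = sum(w * x for w, x in zip(weights, inputs)) + bias
        return 1.0 / (1.0 + math.exp(-activation))      # sigmoid nonlinearity

    def layer(inputs, weight_matrix, biases):
        # Dense interconnection: every node sees every input.
        return [node(inputs, w, b) for w, b in zip(weight_matrix, biases)]

    print(layer([0.5, -1.0], [[0.8, 0.2], [-0.3, 0.9]], [0.1, 0.0]))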

Nevertheless, enough is known about some parts of the brain to fuel the desire for constructing mathematical models of the neural circuit. In general, the models propose to generate sensory-activated, goal-directed behavior and control a multilevel hierarchy of computing modules. At each level of the hierarchy, input commands are decomposed into strings of output subcommands that form input commands to the next lower level. Feedback from the external environment or from internal sources drives the decomposition process and steers the selection of subcommands to achieve the goal successfully (refer Fig. 12).

[Figure 12: Neural net modules -- a multilevel hierarchy of computing modules from the highest to the lowest level; analog input commands are decomposed into intermediate and output commands under feedback.]

The benefits of neural nets include high computation rates, provided by massive parallelism, and a greater degree of fault tolerance, since there are many processing nodes, each with primarily local connections. Designing artificial neural nets to solve problems and studying real biological nets may also change the way we think about problems and lead us to new insights and algorithmic improvements.

5. Software Issues Related to Multiprocessing Systems

Having discussed the various multiprocessing systems, it is worth probing further into two aspects of parallel processing: the languages used to program these systems and the operating system support needed to handle these complex systems.

5.1. Parallel Programming Languages

One of the motivations behind the development of concurrent languages has been the structuring of software -- in particular, operating systems -- by means of high-level language constructs. The need for liberating the production of real-time applications from assembly language has been another driving force. In the following discussion, we classify the concurrent high-level languages into two groups: 'traditional' languages of the von Neumann type (based on the imperative style of programming) and unconventional languages, such as data flow, functional and logic-based languages. A quick review of some of these languages is presented below.

Conventional Parallel Languages. Based on the imperative style of programming, these languages are just extensions of their sequential counterparts. A concurrent language should allow programmers to define a set of sequential activities to be executed in parallel, to initiate their evolution, and to specify their interaction (refer to [4] for an excellent survey on the concepts and notations for parallel programming). One important point regards the 'granularity of parallelism', i.e. the kinds of granules that can be processed in parallel. Some languages specify concurrency at the statement level, and certain others at the task level. Constructs for specifying inter-activity interaction are probably the most critical linguistic aspects of concurrency. Language constructs ensuring mutual exclusion are called synchronization primitives. Some of the best-known and landmark solutions that have been adopted to solve these problems are semaphores [27], mailboxes, monitors [49] and remote procedure calls [46]; a brief sketch of semaphore-based mutual exclusion is given below. Research in this direction is towards designing new mechanisms for interprocessor communication, such as ordered ports [13]. In the following discussion we restrict ourselves to a few parallel programming languages and their salient features.
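As a small illustration of the first of these primitives, the sketch below uses a binary semaphore to ensure mutual exclusion on a shared counter; Python's threading module is used only as a convenient vehicle, not as a claim about any of the languages discussed here.

    import threading

    counter = 0
    mutex = threading.Semaphore(1)          # binary semaphore guarding the counter

    def worker(iterations):
        global counter
        for _ in range(iterations):
            mutex.acquire()                  # P: enter the critical section
            counter += 1                     # update shared state exclusively
            mutex.release()                  # V: leave the critical section

    threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(counter)      # 40000; without the semaphore, updates could be lost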

Communicating Sequential Processes (CSP) [50] is a language designed especially for distributed architectures. In CSP, activities communicate via input/output commands, and communication requires both participating processes to issue their commands. CSP also achieves process synchronization using the input/output commands. Another interesting feature of this language is its ability to express non-determinism using guarded commands. An implementation of a subset of CSP [73] has been successfully attempted by the authors' research group. Their work on the design of an architecture to execute CSP is reported in [79].
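The flavour of CSP's input/output commands can be sketched as follows (a rough approximation in Python; CSP's rendezvous is synchronous and unbuffered, which is imitated here with an acknowledgement queue, and all names are illustrative):

    import threading, queue

    class Channel:
        def __init__(self):
            self._data = queue.Queue(maxsize=1)
            self._ack = queue.Queue(maxsize=1)
        def send(self, value):              # output command, roughly  ch ! value
            self._data.put(value)
            self._ack.get()                 # block until the receiver has taken it
        def receive(self):                  # input command, roughly   ch ? x
            value = self._data.get()
            self._ack.put(None)
            return value

    ch = Channel()

    def producer():
        for i in range(3):
            ch.send(i)                      # blocks until the matching receive

    threading.Thread(target=producer).start()
    for _ in range(3):
        print("received", ch.receive())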

Distributed Processes (DP) [46], developed by Hansen, is proposed for real-time applications controlled by microcomputer networks with distributed storage. In DP, a task consists of a fixed number of subtasks that are started simultaneously and exist forever. A process can call common procedures defined within other processes; these procedures are executed when the other processes are waiting for some conditions to become true, and this is the only means of communication among the processes. Processes are synchronized by means of non-deterministic guarded commands.

The Occam language [55], which is based on CSP, has also been designed to support concurrent applications by using concurrent processes running on a distributed architecture. These processes communicate through channels. The Transputer [56], also developed by Inmos Corporation, supports the direct execution of this language.

Languages which adopt the monitor-based solution for synchronization are oriented towards architectures with shared memory. Examples of these languages are Concurrent Pascal [45], Ada [5], and Modula [99]. These languages, in general, support strong type checking and separate compilation, and express concurrent actions using explicit constructs.

Concurrent Pascal [45] extends the sequential programming language Pascal with the concurrent programming tools of processes and monitors. The main contribution of this language is extending the concept of the monitor with an explicit hierarchy of access rights to shared data structures that can be stated in the program text and checked by a compiler.
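The monitor idea itself can be approximated in a few lines (a sketch only, using Python locks and condition variables rather than Concurrent Pascal syntax): the shared data and the operations on it are packaged together, callers are admitted one at a time, and a process that cannot proceed waits inside the monitor.

    import threading

    class BoundedBufferMonitor:
        def __init__(self, capacity):
            self._items, self._capacity = [], capacity
            self._lock = threading.Lock()            # admits one caller at a time
            self._not_full = threading.Condition(self._lock)
            self._not_empty = threading.Condition(self._lock)

        def put(self, item):                         # monitor procedure
            with self._lock:
                while len(self._items) >= self._capacity:
                    self._not_full.wait()            # wait inside the monitor
                self._items.append(item)
                self._not_empty.notify()

        def get(self):                               # monitor procedure
            with self._lock:
                while not self._items:
                    self._not_empty.wait()
                item = self._items.pop(0)
                self._not_full.notify()
                return item

    buf = BoundedBufferMonitor(2)
    threading.Thread(target=lambda: [buf.put(i) for i in range(5)]).start()
    print([buf.get() for _ in range(5)])             # [0, 1, 2, 3, 4]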

The programming language Modula [99] is primarily intended for programming dedicated computer systems. This language borrows many ideas from Pascal, but in addition to the conventional block structure it introduces a so-called module structure. A module is a set of procedures, data types and variables, where the programmer has precise control over the names that are imported from and exported to the environment [99]. Modula includes general multiprocessing facilities such as processes, interface modules and signals.

In Ada [5], we only have active components, the tasks. Information may be exchanged among tasks via entries. An entry is very similar to a procedure: the call to an entry is like a procedure call, and parameters should be passed if the called entry requires them. A rendezvous is said to occur when the caller is in the call state and the called task is in the accept state. After executing the entry subprogram, both tasks resume their parallel execution. Ada provides specific and elaborate protocols for task termination, and is designed to support reliable and efficient real-time programming.

Non-von Neumann Languages. Conventional languages imitate the von Neumann computer. The dependency of these languages on the basic organization of the von Neumann machine is essentially a limitation on expressing and exploiting parallelism [10]. These imperative languages perform a task by changing the state of a system rather than modifying the data directly. In parallel processing applications, it makes more sense to use a language with a nonsequential semantic base. Various paradigms have been adopted, and new programming languages based on these approaches have evolved. In this paper we restrict ourselves to two such paradigms, namely the applicative and the non-procedural styles of programming, and the resulting parallel versions of the languages that adopt these approaches.

Applicative languages (also referred to as functional languages) avoid side-effects, such as those caused by an assignment statement. The lack of side-effects accounts, at least partially, for the well-known Church-Rosser property, which essentially states that no matter what order of computation is chosen in executing a program, the program is guaranteed to give the same result (assuming termination). This marvellous determinacy property is invaluable in parallel systems. Another key point is that in functional languages the parallelism is implicit and supported by their underlying semantics.
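A small sketch of why this determinacy matters for parallel execution (illustrative functions, written in Python): since the two operands below are computed by pure functions, they can be evaluated in either order or concurrently, and the value of the whole expression cannot change.

    from concurrent.futures import ThreadPoolExecutor

    def f(n):                     # pure: depends only on its argument
        return sum(i * i for i in range(n))

    def g(n):                     # pure: updates no shared state
        return sum(3 * i for i in range(n))

    sequential = f(1000) + g(1000)

    with ThreadPoolExecutor() as pool:       # evaluate both operands in parallel
        a, b = pool.submit(f, 1000), pool.submit(g, 1000)
        parallel = a.result() + b.result()

    print(sequential == parallel)            # True: evaluation order is irrelevant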

A system of languages known as Functional Programming languages (FP) [10] and Lisp [68] are two major outcomes of the applicative style of programming. Languages for data flow architectures, which avoid side-effects and encourage single assignment, are also included in the set of applicative languages. Dennis' Value-oriented Algorithmic Language (VAL) [1], Arvind's Irvine Dataflow language (Id) [8], and Keller's Flow Graph Language (FGL) [57] are candidate examples in this category.

Considerable work has been done by us in the area of applicative programming languages. A high-level language for data flow computers, called Data Flow Language (DFL) [72], has been proposed by us, and a compiler to convert programs written in this language into data flow graphs has been implemented. The concepts borrowed from CSP and DP, when embedded into data flow systems, result in two new languages for distributed processing, namely Communicating Data Flow Channels (CDFC) and Distributed Data Flow (DDF), respectively [75]. Communication and non-determinism features have been added to FP by us [40,41] to strengthen its power as a programming language. We have also proposed that FP can be used as a language for program specification [41].

Although parallelism in a program is expressed by functional languages in a natural way, its automatic detection and mapping onto processors do not result in optimal performance. It is desirable to provide the user with the ability to explicitly express parallelism and mapping while retaining the functional style of programming. Languages which allow programmers to annotate the parallelism and the mapping scheme for the target architecture lead to optimal performance on a particular machine. Two languages developed with this motivation are ParAlfl (the para-functional language) [52] and Multilisp [44]. Efforts have also been made to exploit the advantages offered by functional languages to the maximum extent by developing new machines based on non-von Neumann architectures (refer to [95] for a recent survey).

Applications such as the design of knowledge base systems and natural language processing revealed the inadequacies of conventional programming languages to offer elegant solutions. The use of predicate logic, which is a high-level, human-oriented language for describing problems and problem-solving methods for computers, promised great scope for these applications. Logic programming languages combine simplicity with a number of powerful features. They separate the logic and control [59], the two major components of an algorithm, and thus enable the programmer to write more correct, more easily improved and more readily adapted programs. The powerful features of these languages also include the declarative style, the unification mechanism for parameter passing, and the execution strategy offered by a non-deterministic computation rule. The powerful execution mechanism provided by these languages is due to the non-procedural paradigm. An outcome of the research carried out in this area with these motivations is the design of the language Prolog [19].

Logic programming languages offer three kinds of parallelism, namely 'AND', 'OR' and 'argument' parallelism [21,43]. The inability of the von Neumann architecture to efficiently execute logic programming languages (in essence, to support the non-procedural paradigm) has led to the design of many parallel logic programming machines [16,43,70,94,100]. Further research on these languages has led to the design of three parallel logic programming languages: PARLOG (PARallel LOGic programming) [18], P-Prolog [103] and Concurrent Prolog [84].
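The first two kinds of parallelism can be sketched as follows (hypothetical facts and rules, evaluated with Python threads purely for illustration): the independent subgoals of one clause body are solved concurrently (AND-parallelism), and alternative clauses for the same goal are tried concurrently (OR-parallelism).

    from concurrent.futures import ThreadPoolExecutor

    healthy = {"ann", "bob"}
    wealthy = {"bob", "joe"}
    famous  = {"ann"}

    def happy(x):
        # AND-parallelism: happy(X) :- healthy(X), wealthy(X).
        # The two independent subgoals are checked concurrently.
        with ThreadPoolExecutor() as pool:
            g1 = pool.submit(lambda: x in healthy)
            g2 = pool.submit(lambda: x in wealthy)
            return g1.result() and g2.result()

    def interesting(x):
        # OR-parallelism: interesting(X) :- famous(X).  interesting(X) :- happy(X).
        # Both alternative clauses are explored concurrently.
        with ThreadPoolExecutor() as pool:
            alternatives = [pool.submit(lambda: x in famous), pool.submit(happy, x)]
            return any(a.result() for a in alternatives)

    print(happy("bob"), interesting("ann"), interesting("joe"))   # True True False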

With the increasing size and complexity of parallel processing systems, it becomes essential to design efficient operating systems, without which handling of such systems would be impossible. The general principles and requirements of multiprocessor operating systems are discussed in the following section.

5.2. Multiprocessor Operating Systems

The basic goals of an operating system are to provide a programmer interface to the machine, manage resources, provide mechanisms to implement policies, and facilitate matching applications to the machine. There is conceptually little difference between the operating system requirements of a multiprocessor and those of a large computer system with multiprogramming. The operating system for a multiprocessor, however, should be able to support multiple asynchronous tasks which execute concurrently, and hence is more complex.

The functional capabilities of a multiprocessor operating system include resource allocation and management schemes, memory and data set protection, prevention of system deadlock, abnormal process termination or exception handling, and processor load balancing. Also, the operating system should be capable of providing system reconfiguration schemes to support graceful degradation of performance in the event of a failure. In the following discussion, we briefly introduce the three basic configurations, namely master-slave, separate supervisor and floating-supervisor systems [53].

In a 'master-slave' configuration, one processor, called the master, maintains the status of all processors in the system and apportions the work to all the slave processors. Service calls from the slave processors are sent to the master for executive service. Only one processor (the master) uses the supervisor and its associated procedures. The merit of this configuration is its simplicity. However, a parallel processing system with the master-slave configuration is susceptible to catastrophic failures, and a low utilization of the slave processors may result if the master cannot dispatch processes fast enough. The Cyber 170 and the DEC System 10 use this mode of operation. The master-slave configuration is most effective for special applications where the work load is well-defined.
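A rough sketch of the master-slave discipline (processor counts and workload are illustrative): only the master apportions work, and the slaves simply execute what they are handed -- which also makes the master a single point of failure, as noted above.

    import queue, threading

    NUM_SLAVES = 4
    work_queue, results = queue.Queue(), queue.Queue()

    def master(jobs):
        for job in jobs:
            work_queue.put(job)             # only the master apportions work
        for _ in range(NUM_SLAVES):
            work_queue.put(None)            # sentinel telling each slave to stop

    def slave():
        while True:
            job = work_queue.get()          # slaves never schedule work themselves
            if job is None:
                break
            results.put(job * job)          # perform the assigned computation

    slaves = [threading.Thread(target=slave) for _ in range(NUM_SLAVES)]
    for s in slaves: s.start()
    master(range(10))
    for s in slaves: s.join()
    print(sorted(results.get() for _ in range(10)))   # squares of 0..9, computed by the slaves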

In a 'separate supervisor' system, each processor contains a copy of a basic kernel and services its own needs. However, since there is some interaction among the processors, it is necessary for some of the supervisory code to be reentrant, unlike in the master-slave mode. The separate supervisor mode is more reliable than the master-slave mode, but the replication of the kernel in all the processors causes an under-utilization of memory.

The 'floating supervisor' scheme treats all the processors, as well as the other resources, symmetrically, as an anonymous pool of resources. In this mode, the supervisor routine floats from one processor to another, although several of the processors may be executing service routines simultaneously. Conflicts in service requests are resolved by priorities, and table accesses must be carefully controlled to maintain system integrity. The floating supervisor mode of operation has the advantages of providing graceful degradation and better availability of reduced-capacity systems. Furthermore, it is flexible, provides true redundancy, and makes the most efficient use of available resources. Examples of operating systems that execute in this mode are MVS and VM on the IBM 3081 and Hydra [102] on the C.mmp.

6. Conclusions

In this survey, we have identified the various issues involved in parallel processing systems. Approaches followed to solve the associated problems have been discussed and their relative merits put forth. The principles and requirements of language and operating system support for complex multiprocessing systems have been described in detail. For the wide spectrum of architectures proposed in the literature, the design principles and salient features have been brought out in a comparative manner.

While the envisaged potential offers a promising scope for parallel processing systems in many applications, only a few systems have been commercialized. The reason for this is the lack of good software support for these systems. The design of intelligent compilers which can identify parallel subtasks in a program (written in a sequential language), schedule the subtasks onto the processing elements, and manage communication among the scheduled tasks is a step toward this end. Although there are many existing proposals along this line, none of them seems to achieve all three goals in an integrated manner, relieving the user of the burden completely.

Another question that remains unanswered is whether or not to continue with the von Neumann approach for building complex parallel processing machines. While familiarity and past experience with the control flow model make it a prominent candidate, its inherent inefficiencies, such as the explicit specification of control and the globally updatable memory, limit its capabilities. Although data-driven and demand-driven computers exploit the maximum parallelism in a program, their complex structure and inadequate software support force the designer to have second thoughts about these approaches.

With the advent of VLSI technology and RISC design, dedicated architectures are becoming more and more popular. However, the inapplicability of these systems to a variety of applications is a serious concern. At the other end of the spectrum, we have general purpose parallel processing systems, which give degraded performance due to the mismatch between architecture and algorithm, and the reconfigurable machines. Considerable effort is being devoted to the design of efficient algorithms (for general purpose computing systems) which will bridge the gap between the application program and the architecture.

Finally, the research on neural computers and molecular machines is in its infancy. Modeling the neural circuits and understanding the functioning of the human brain have to be considerably refined before one can make use of them for building high speed computing systems.

The vastness of this fascinating area, in which active research is under way, and the innumerable problems that remain to be solved are themselves standing evidence for the promising future of parallel processing. With the ever-growing greed for very high speed computing, and with the inability of switching devices to cope with the need, parallel processing techniques seem to be the only alternative.

7. References

1. W.B. Ackerman and J.B. Dennis, "VAL - A Value Oriented Algorithmic Language, Preliminary Reference Manual", Tech. Rep. TR-218, Lab. for Computer Science, M.I.T., Cambridge, MA, June 1979.

2. J.S. Albus, Brains, Behavior and Robotics, Byte Books, McGraw-Hill, New York, NY, 1981.

3. M. Amamiya, R. Hasegawa and H. Mikami, "List Processing and Data Flow Machines", Lecture Note Series No.436, Research Institute for Mathematical Sciences, Kyoto University, September 1980.

4. G.R. Andrews and F.B. Schneider, "Concepts and Notations for Concurrent Programming", Computing Surveys, Vol.15, No.1, pp.3-43, March 1983.

5. ANSI/MIL-STD 1815A, Reference Manual for the Ada Programming Language, 1983.

6. C.N. Arnold, "Performance Evaluation of Three Automatic Vectorizer Packages", Proc. Int'l Conf. Parallel Processing, pp.235-243, 1982.

7. Arvind and K.P. Gostelow, "A Computer Capable of Exchanging Processors for Time", Proc. IFIP Congress, pp.849-854, 1977.

8. Arvind, K.P. Gostelow and W. Plouffe, "An Asynchronous Programming Language and Computing Machine", Tech. Rep. TR-114a, Dept. of Information and Computer Science, University of California, Irvine, CA, December 1978.

9. J. Backus, "Reduction Languages and Variable Free Programming", Rep. RJ 1010, IBM T.J. Watson Research Center, Yorktown Heights, NY, April 1972.

10. J. Backus, "Can Programming be Liberated from the von Neumann Style? A Functional Style and its Algebra of Programs", Communications of the ACM, Vol.21, No.8, pp.613-641, August 1978.


11. W.L. Bain Jr. and S.R. Ahuja, "Performance Analysis of High-Speed Digital Buses for Multiprocessing Systems", Proc. 8th Ann. Symp. Computer Architecture, pp.107-131, May 1981.

12. G.H. Barnes, R.M. Brown, M. Kato, D.J. Kuck, D.L. Slotnick and R.A. Stokes, "The ILLIAC IV Computer", IEEE Trans. on Computers, Vol.C-17, No.8, pp.746-757, August 1968.

13. J. Basu, L.M. Patnaik and A.K. Goswami, "Ordered Ports - A Language Construct for High Level Distributed Programming", The Computer Journal, Vol.30, 1987.

14. K.E. Batcher, "Design of a Massively Parallel Processor", IEEE Trans. on Computers, Vol.C-29, No.9, pp.836-840, September 1980.

15. K. Berkling, "Reduction Languages for Reduction Machines", Proc. 2nd Int'l Symp. Computer Architecture, January 1975.

16. L. Bic, "A Data-Driven Model for Parallel Implementation of Logic Programs", Dept. of Information and Computer Science, University of California, Irvine, CA, January 1984.

17. R.P. Case and A. Padegs, "Architecture of the IBM System 370", Communications of the ACM, Vol.21, No.1, pp.73-96, January 1978.

18. K. Clark and S. Gregory, "PARLOG: PARallel Programming in LOGic", ACM Trans. on Programming Languages and Systems, Vol.8, No.1, pp.1-49, January 1986.

19. W.F. Clocksin and C.S. Mellish, Programming in Prolog, Springer-Verlag, New York, NY, 1984.

20. D. Comte, A. Durrieu, O. Gelly, A. Plas and J.C. Syre, "TEAU 9/7 SYSTEME LAU - Summary in English", CERT Tech. Rep. #1/3059, Centre d'Etudes et de Recherches de Toulouse, October 1976.

21. J.S. Conery and D.F. Kibler, "AND Parallelism and Nondeterminism in Logic Programs", New Generation Computing, Vol.3, No.1, pp.43-70, 1985.

22. M. Cornish, "The TI Data Flow Architectures: The Power of Concurrency for Avionics", Proc. 3rd Conf. Digital Avionics Systems, pp.19-25, November 1979.

23. A.L. Davis, "The Architecture and System Method of DDM1: A Recursive Structured Data Driven Machine", Proc. 5th Ann. Symp. Computer Architecture, pp.210-215, April 1978.

24. A.L. Davis and R.M. Keller, "Data Flow Graphs", Computer, Vol.15, No.2, pp.26-41, February 1982.

25. Denelcor, Inc., Heterogeneous Element Processor: Principles of Operation, April 1981.

26. J.B. Dennis and D.P. Misunas, "A Preliminary Architecture for a Basic Data Flow Processor", Proc. 2nd Int'l Symp. Computer Architecture, pp.126-132, January 1975.

27. E.W. Dijkstra, A Discipline of Programming, Prentice-Hall, Englewood Cliffs, NJ, 1976.

28. B.L. Drake, F.T. Luk, J.M. Speiser and J.J. Symanski, "SLAPP: A Systolic Linear Algebra Parallel Processor", Computer, Vol.20, No.7, pp.45-49, July 1987.

29. M.J. Duff, "CLIP-4", in Special Computer Architecture for Pattern Recognition, K.S. Fu and T. Ichikawa (Eds.), CRC Press, 1982.

30. T.Y. Feng, "A Survey of Interconnection Networks", Computer, Vol.14, No.12, pp.12-27, December 1981.

31. A.L. Fisher, H.T. Kung, L.M. Monier, H. Walker and Y. Dohi, "Architecture of the PSC: A Programmable Systolic Chip", Proc. 10th Ann. Symp. Computer Architecture, 1983.

32. J.D. Foley and A. van Dam, Fundamentals of Interactive Computer Graphics, Addison-Wesley, Reading, MA, 1982.

33. J.A.B. Fortes and B.W. Wah, "Systolic Arrays -- From Concept to Implementation", Guest Editors' Introduction, Computer, Vol.20, No.7, pp.12-17, July 1987.

34. D.D. Gajski and J. Peir, "Essential Issues in Multiprocessor Systems", Computer, Vol.18, No.6, pp.9-27, June 1985.

35. C. Ghezzi, "Concurrency in Programming Languages: A Survey", Parallel Computing, Vol.2, pp.229-241, 1985.

36. D. Ghosal and L.M. Patnaik, "Performance Analysis of Hidden Surface Removal Algorithms on MIMD Computers", Proc. IEEE Int'l Conf. on Computers, Systems and Signal Processing, pp.1666-1675, December 1984.

37. D. Ghosal and L.M. Patnaik, "Parallel Polygon Scan Conversion Algorithms: Performance Evaluation on a Shared Bus Architecture", Computers and Graphics, Vol.10, No.1, pp.7-25, 1986.

38. D. Ghosal and L.M. Patnaik, "SHAMP: An Experimental Multiprocessor System for Performance Evaluation of Parallel Algorithms", Microprocessing and Microprogramming, Vol.19, No.3, pp.179-192, 1987.

39. K.P. Gostelow and R.E. Thomas, "Performance of a Data Flow Computer", Tech. Rep. 127a, Dept. of Information and Computer Science, University of California, Irvine, CA, October 1979.

40. A.K. Goswami and L.M. Patnaik, "An Algebraic Approach Towards Formal Development of Functional Programs", accepted for publication, New Generation Computing.

41. A.K. Goswami, "Non-Determinism and Communication in Functional Programming Systems: A Study in Formal Program Development", Ph.D. Dissertation, Dept. of Computer Science and Automation, Indian Institute of Science, Bangalore, India, October 1985.

42. A. Gottlieb, R. Grishman, C.P. Kruskal, K.P. McAuliffe, L. Rudolph and M. Snir, "The NYU Ultracomputer -- Designing an MIMD Shared Memory Parallel Computer", IEEE Trans. on Computers, Vol.C-32, No.2, pp.175-189, February 1983.

43. Z. Halim, "A Data-Driven Machine for OR-Parallel Evaluation of Logic Programs", New Generation Computing, Vol.4, No.1, pp.5-33, 1986.

44. R.H. Halstead Jr., "Multilisp: A Language for Concurrent Symbolic Computation", ACM Trans. on Programming Languages and Systems, October 1985.

45. P.B. Hansen, "The Programming Language Concurrent Pascal", IEEE Trans. on Software Engineering, Vol.SE-1, No.2, pp.199-209, June 1975.

46. P.B. Hansen, "Distributed Processes: A Concurrent Programming Concept", Communications of the ACM, Vol.21, No.11, pp.934-941, November 1978.

47. J.P. Hayes, R. Jain, W.R. Martin, T.N. Mudge, L.R. Scott, K.G. Shin and Q.F. Stout, "Hypercube Computer Research at the University of Michigan", Proc. Second Conf. Hypercube Multiprocessors, September/October 1986.

48. W.D. Hillis, The Connection Machine, M.I.T. Press, Cambridge, MA, 1986.

49. C.A.R. Hoare, "Monitors: An Operating System Structuring Concept", Communications of the ACM, Vol.17, No.10, pp.549-557, October 1974.

50. C.A.R. Hoare, "Communicating Sequential Processes", Communications of the ACM, Vol.21, No.8, pp.666-677, August 1978.

51. J.J. Hopfield and D.W. Tank, "Computing with Neural Circuits", Science, pp.625-633, August 1986.

52. P. Hudak, "Para-Functional Programming", Computer, Vol.19, No.8, pp.60-70, August 1986.

53. K. Hwang and F.A. Briggs, Computer Architecture and Parallel Processing, McGraw-Hill, New York, NY, 1984.

54. A.K. Jones and E.F. Gehringer, "Cm* Multiprocessor Project: A Research Review", Tech. Rep. CMU-CS-80-131, Carnegie-Mellon University, July 1980.

55. Inmos Ltd., Occam Reference Manual, 1986.

56. Inmos Ltd., Transputer Reference Manual, Ref. No. 72TRN 04802, 1986.

57. R.M. Keller, S. Patil and G. Lindstrom, "A Loosely Coupled Applicative Multiprocessing System", Proc. Nat. Computer Conf., AFIPS, pp.861-870, 1979.

58. W.E. Kluge and H. Schlutter, "An Architecture for the Direct Execution of Reduction Languages", Proc. Int'l Workshop on High-Level Language Computer Architecture, pp.174-180, May 1980.

59. R.A. Kowalski, "Algorithm = Logic + Control", Communications of the ACM, Vol.22, No.7, pp.424-435, July 1979.

60. D. Krishnan and L.M. Patnaik, "Systolic Architecture for Boolean Operations on Polygons and Polyhedra", Eurographics, The European Association for Computer Graphics, Vol.6, No.3, July 1987.

61. C.P. Kruskal and A. Weiss, "Allocating Independent Subtasks on Parallel Processors", Proc. Int'l Conf. Parallel Processing, pp.236-240, August 1984.

62. H.T. Kung and S.W. Song, "A Systolic 2D Convolution Chip", Tech. Rep. CMU-CS-81-110, Dept. of Computer Science, Carnegie-Mellon University, March 1981.

63. H.T. Kung, "Why Systolic Architectures?", Computer, Vol.15, No.1, pp.37-46, January 1982.

64. S.Y. Kung, S.C. Lo, S.N. Jean and J.N. Hwang, "Wavefront Array Processors -- Concept to Implementation", Computer, Vol.20, No.7, pp.18-33, July 1987.

65. R.P. Lippmann, "An Introduction to Computing with Neural Nets", IEEE ASSP Magazine, pp.4-20, April 1987.

66. G.A. Mago, "A Network of Microprocessors to Execute Reduction Languages", Int'l J. Computer and Information Sciences, Part 1: Vol.8, No.5, pp.349-385, Part 2: Vol.8, No.6, pp.435-571, 1979.

67. P.C. Mathias and L.M. Patnaik, "A Systolic Evaluation of Linear, Quadratic and Cubic Expressions", accepted for publication, Journal of Parallel and Distributed Computing.

68. J. McCarthy, et al., LISP 1.5 Programmer's Reference Manual, M.I.T. Press, Cambridge, MA, 1965.

69. C.A. Mead and L.A. Conway, Introduction to VLSI Systems, Addison-Wesley, Reading, MA, 1980.

70. R. Onai, M. Aso, H. Shimizu, K. Masuda and A. Matsumoto, "Architecture of a Reduction-Based Parallel Inference Machine: PIM-R", New Generation Computing, Vol.3, No.2, pp.197-228, 1985.

71. D.A. Padua, D.J. Kuck and D.L. Lawrie, "High Speed Multiprocessors and Compilation Techniques", IEEE Trans. on Computers, Vol.C-29, No.9, pp.763-776, September 1980.

72. L.M. Patnaik, P. Bhattacharya and R. Ganesh, "DFL: A Data Flow Language", Computer Languages, Vol.9, No.2, pp.97-106, 1984.

73. L.M. Patnaik and B.R. Badrinath, "Implementation of CSP-S for Description of Distributed Algorithms", Computer Languages, Vol.9, No.3/4, pp.193-202, 1984.

74. L.M. Patnaik, R. Govindarajan and N.S. Ramadoss, "Design and Performance Evaluation of EXMAN: An EXtended MANchester Data Flow Computer", IEEE Trans. on Computers, Vol.C-35, No.3, pp.229-244, March 1986.

75. L.M. Patnaik and J. Basu, "Two Tools for Interprocess Communication in Distributed Data Flow Systems", The Computer Journal, Vol.29, No.6, pp.506-521, December 1986.

76. A. Plas, et al., "LAU System Architecture: A Parallel Data Driven Processor Based on Single Assignment", Proc. 1976 Int'l Conf. Parallel Processing, pp.293-302, August 1976.

77. C.V. Ramamoorthy and H.F. Li, "Pipeline Architectures", Computing Surveys, Vol.9, No.1, pp.61-102, March 1977.

78. C.P. Ravikumar and L.M. Patnaik, "Parallel Placement Based on Simulated Annealing", Proc. IEEE Int'l Conf. Computer Design: VLSI in Computers and Processors, October 1987.

79. C.P. Ravikumar and L.M. Patnaik, "An Architecture for CSP and its Simulation", Proc. Int'l Conf. Parallel Processing, 1987.

80. C. Rieger, "ZMOB: Doing it in Parallel", Proc. Workshop on Computer Architecture for PAIDM, pp.119-125, 1981.

81. A. Rosenfeld, "Parallel Image Processing Using Cellular Arrays", Computer, Vol.16, No.1, pp.14-20, January 1983.

82. R.M. Russell, "The Cray-1 Computer System", Communications of the ACM, Vol.21, No.1, pp.63-72, January 1978.

83. C.L. Seitz, "The Cosmic Cube", Communications of the ACM, Vol.28, No.1, pp.22-33, January 1985.

84. E.Y. Shapiro, "Concurrent Prolog: A Progress Report", Computer, Vol.19, No.8, pp.44-58, August 1986.

85. L. Snyder, "Introduction to the Configurable Highly Parallel Computer", Computer, Vol.15, No.1, pp.47-64, January 1982.

86. J.S. Squire and S.M. Palais, "Programming and Design Considerations for a Highly Parallel Computer", Proc. AFIPS Conf., Vol.23, pp.395-400, 1963.

87. J. Silc and B. Robič, "Efficient Static Dataflow Architecture for Specialized Computations", Proc. 12th IMACS World Congress on Scientific Computing, Paris, France, July 1988.

88. K.R. Traub, "A Compiler for the M.I.T. Tagged-Token Data Flow Architecture", M.S. Thesis, M.I.T., Cambridge, MA, 1986.

89. P.C. Treleaven, "Principle Components of a Data Flow Computer", Proc. 1978 Euromicro Symp., pp.366-374, October 1978.

90. P.C. Treleaven and G.F. Mole, "A Multiprocessor Reduction Machine for User-Defined Reduction Languages", Proc. 7th Int'l Symp. Computer Architecture, pp.121-130, 1980.

91. P.C. Treleaven, D.R. Brownbridge and R.P. Hopkins, "Data-Driven and Demand-Driven Computer Architecture", Computing Surveys, Vol.14, No.1, pp.93-143, March 1982.

92. D.A. Turner, "A New Implementation Technique for Applicative Languages", Software Practice and Experience, Vol.9, No.5, pp.31-49, September 1979.

93. J.D. Ullman, Computational Aspects of VLSI, Computer Science Press, 1984.

94. S. Umeyama and K. Tamura, "Parallel Execution of Logic Programs", Proc. 10th Ann. Symp. Computer Architecture, pp.349-355, 1983.

95. S.R. Vegdahl, "A Survey of Proposed Architectures for the Execution of Functional Languages", IEEE Trans. on Computers, Vol.C-33, No.12, pp.1050-1071, December 1984.

96. A.G. Venkataramana and L.M. Patnaik, "Design and Performance Evaluation of a Systolic Architecture for Hidden Surface Removal", accepted for publication, Computers and Graphics.

97. A.G. Venkataramana and L.M. Patnaik, "Systolic Architecture for B-Spline Surfaces", accepted for publication, International Journal of Parallel Programming.

98. I. Watson and J. Gurd, "A Prototype Data Flow Computer with Token Labelling", Proc. Nat. Computer Conf., New York, AFIPS Press, pp.623-628, 1979.

99. N. Wirth, "Modula: A Programming Language for Modular Multiprogramming", Software Practice and Experience, Vol.7, No.1, pp.3-35, January 1977.

100. M.J. Wise, "A Parallel PROLOG: The Construction of a Data-Driven Model", Proc. ACM Conf. LISP and Functional Programming, 1982.

101. W.A. Wulf and C.G. Bell, "C.mmp -- A Multi-miniprocessor", Proc. AFIPS Fall Joint Computer Conf., Vol.41, pp.765-777, 1972.

102. W.A. Wulf, et al., "Overview of the Hydra Operating System", Proc. 5th Symp. Operating System Principles, pp.122-131, November 1975.

103. R. Yang and H. Aiso, "P-Prolog: A Parallel Logic Language Based on Exclusive Relation", New Generation Computing, Vol.5, No.1, pp.79-95, 1987.