Transcript of A Dynamically Partitionable Multicomputer System

IEEE TRANSACTIONS ON COMPUTERS, VOL. C-35, NO. 9, SEPTEMBER 1986

The Architecture of SM3: A Dynamically Partitionable Multicomputer System

CHAITANYA K. BARU, MEMBER, IEEE AND STANLEY Y. W. SU

Abstract-The architecture of a multicomputer system with switchable main memory modules (SM3) is presented. This architecture supports the efficient execution of parallel algorithms for nonnumeric processing by 1) allowing the sharing of switchable main memory modules between computers, 2) supporting dynamic partitioning of the system, and 3) employing global control lines to efficiently support interprocessor communication. Data transfer time is reduced to memory switching time by allowing some main memory modules to be switched between processors. Dynamic partitioning gives a common bus system the capability of an MIMD machine while performing global operations. The global control lines establish a quick and efficient high-level protocol in the system. The network is supervised by a control computer which oversees network partitioning and other global functions. The hardware involved is quite simple and the network is easily extensible. A simulation study using discrete event simulation techniques has been carried out and the results of the study are presented. The architecture of this system is compared to those of conventional local area networks and shared-memory systems in order to establish the distinct nature and characteristics of a multicomputer system based on the SM3 concept.

Index Terms-Computer architecture, database machines, multicomputer systems, multiprocessors, parallel algorithms, parallel database processing, performance evaluation.

I. INTRODUCTION

THE benefits of resource sharing and the need for fast system response are sufficient justifications for interconnecting computers. A variety of computer interconnection schemes are in existence, e.g., large "long-haul" networks, geographically restricted local area networks [18], [26], and tightly coupled multiprocessor systems [3], [4], [9], [12], [17], [19], [28]. Our research interest has been in the design and implementation of a multicomputer system for supporting database management applications and the study of efficient algorithms for nonnumeric processing that will take advantage of the special properties of such a system. This paper describes a multicomputer system which is dynamically partitionable and has a number of switchable main memory modules (SM3).

Manuscript received February 4, 1985; revised July 9, 1985 and November 5, 1985. This work was supported by the Department of Energy under Contract DE-FG05-84ER13285.

C. K. Baru was with the Database Systems Research and Development Center, University of Florida, Gainesville, FL 32611. He is now with the Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109.

S. Y. W. Su is with the Database Systems Research and Development Center, University of Florida, Gainesville, FL 32611.

IEEE Log Number 8610074.

The nodes in this system are placed closer to each other than in a typical local area network, and the SM3 system does not employ LAN communication protocols like CSMA/CD, token passing, etc. At the same time, the system is not as closely coupled as a multiprocessor system since each node is a complete computer with its own memory and peripheral devices. Thus, the term multicomputer system is used. The architectural features and hardware facilities support 1) network data exchange, 2) network communication and synchronization, and 3) parallel execution of concurrent processes to allow intra- and interquery concurrencies using a common-bus architecture. A description of the preliminary architecture and algorithm implementation is given in [11].

A. Network Data Exchange

In a network system data files are generally stored in a distributed fashion among many nodes. In performing many common database operations (like the relational join of two global files), large quantities of data have to be moved among these nodes. The conventional approach of moving data is through a data link (e.g., a data bus), typically using input/output instructions. This either ties up the data and control lines and the processors involved in the transfer until the transfer is complete, or burdens the processors involved with interrupt processing, error checking, and synchronization tasks. The transfer time can become very significant when large amounts of data are transferred, and data movement among nodes often becomes a major bottleneck. We propose to alleviate this problem by using main memory modules which can be switched among nodes. Data to be transferred are stored in these modules and switched to other nodes for exclusive access. Transfer time is, therefore, reduced to the module switching time. This concept of switchable memory is different, for example, from shared memories used for communication and scheduling [1], global synchronization [20], and creation of virtual address space [25] in that the main memory modules are physically switched between processors and, once switched, are accessed exclusively by different processors without the usual memory contention problem.

B. Network Communication and Synchronization

Since data files in a network are typically dispersed across nodes, queries for retrieving or manipulating the distributed files are issued to the entire set or to a selected subset of processors. Two types of network operations are necessary to process such queries: 1) communication among the nodes to transmit messages, status, and commands, and 2) network synchronization among the parallel processes to ensure orderly and meaningful computations.

Packet-switched systems achieve network communication and synchronization by forming packets in the source nodes and routing them to the proper destination nodes. This is rather time-consuming since it involves not only packet transfer, packet routing, and interrupt processing, but also the detection of errors in transmission and possible retransmissions. Multiprocessor systems typically use shared global memories for message transfer and synchronization and suffer in performance due to bus and memory contention even when a small number of processors is involved. The SM3 system uses hardware to achieve some of the functions required in network communication and synchronization. A number of control lines are used to provide a means for synchronizing parallel processes.

C. Parallel Execution of Concurrent Processes

A database query can, typically, be represented as a collection of serial and parallel subtasks. In order to achieve maximum parallelism, one attempts to process these subtasks concurrently (intraquery concurrency). Also, query response times may be improved by concurrently executing multiple queries issued by different users (interquery concurrency). In order to achieve both intra- and interquery concurrencies and to increase the system throughput, a system should be able to assign different amounts of resources (processors, memories, etc.) to different query subtasks (or queries), depending on the complexity of the subtasks (or queries). We propose to support parallel execution of concurrent processes by employing dynamic physical partitioning of a common-bus network in order to reconfigure it into a number of independent subnetworks (or clusters).

The concepts of physical partitioning and reconfiguration presented in this paper bear some resemblance to several existing works [2], [9], [15], but with several important differences. Reconfigurability, introduced in [15], allows computers to vary their word sizes (vertical partitioning) to meet the requirements of the task(s) under execution. In SM3, reconfiguration is done at a higher level, by allowing the grouping and regrouping of processors, and is employed in order to exploit the large grain parallelism of the tasks at hand (horizontal partitioning).

Logical partitioning and dynamic allocation of processors operating on multiple queries were features first introduced in DIRECT [9] and also in the newer dataflow architecture [8]. Similar to these two designs, the SM3 system can dynamically reconfigure the network and assign processors to support inter- and intraquery concurrencies. Nevertheless, the architecture and techniques used to achieve this in SM3 are quite different from those used in [8] and [9]. First, the concept of switchable main memory modules is unique to the SM3 system. Second, SM3 employs physical rather than logical partitioning of processors in order to form subnetworks. The physical isolation of processors reduces interference and interruption among subnetworks, thereby increasing throughput, security, and reliability. Third, each SM3 processor has its own secondary storage devices. Databases are stored in a distributed fashion and accessed in parallel by the processors. Thus, each node of SM3 is an independent computer system capable of local as well as global processing, i.e., a multicomputer system.

Some aspects of memory sharing and network partitioning used in SM3 are similar to those used in the MP/C system [2]. For example, the MP/C system uses switches to physically partition processors, and the memory modules of each processor form a contiguous memory space. Nevertheless, there are a number of significant differences: 1) the individual nodes of SM3 are more powerful and independent than those of MP/C, which do not have their own secondary storage devices, 2) each so-called partition in MP/C has only one active processor as opposed to many in SM3, and 3) the MP/C system does not have the concept of local plus global memory-all memory is either local or global. Some more differences are listed in [24]. Essentially, the two systems have very different design goals: MP/C is a distributed architecture approached from a multiprocessor standpoint, whereas SM3 is a distributed architecture arrived at from the standpoint of a network of independent computer systems.

The remainder of this paper is organized as follows. In Section II, we introduce the concepts of network partitioning, global control lines, and the novel memory switching scheme. A more detailed description of the hardware units and the functioning of the switchable memories is provided in Section III. Some parallel algorithms for database operations are discussed in Section IV along with the results obtained from simulation of these algorithms. A comparison, in terms of performance and cost, of SM3 to conventional local area networks and shared-memory systems is provided in Section V, which is followed by a conclusion in Section VI.

II. MEMORY SWITCHING AND NETWORK PARTITIONING

Each node of the SM3 system contains a processing unit Pi, some local main memory LMi, a switchable main memory SMi, a memory switch SMSi, and secondary storage devices. The following notation will be used in the rest of the paper for the various system components:

CC           control computer
Pi           a local processor
SM1i, SM2i   two switchable memory modules at Pi
LMi          local memory of Pi
SMSi         switch used for connecting SMi to Pi, CCP, or CC
CCP          cluster control processor
CCPSi        cluster control processor switch for Pi; designates Pi as current CCP
Si           switch associated with Pi, used for partitioning
SW11i, SW12i, SW2i, SW3i, NSWi
             status words associated with Pi
CCB          cluster control bus
SMB          switchable memory bus.

A. A Switchable Memory Network


Let two independent processors P1 and P2, each with its own local memory, share a common main memory module SM via a dual-throw switch SMS. Let the SM module be mapped into the memory space of either processor P1 or processor P2. When the SMS switch is thrown to one side the SM becomes part of one processor, say P1, which can either read from or write into the module. Conversely, when the switch is thrown to the other side the other processor P2 assumes control over the memory module. Data transfer between the processors is achieved simply by writing data into this memory module and controlling the position of the switch, thereby reducing data transfer time to the switching time of the switch SMS. The switchable main memory and the local memory occupy different address spaces. If the local memories are in the address space 0 through m, then the switchable memories would occupy, say, m + 1 through p. This allows processors P1 and P2 to function either with or without the SM module.

Using the basic switchable memory concept one can build a system with many processors centrally controlled by a control computer CC. The CC can have the capability of accessing any switchable memory module via a system-wide bus called the switchable memory bus (SMB). The SM modules can be used to transfer data either among local processors or between a processor and CC. Data transfer between processors, say Pi and Pm, can be done either by switching the memory module or by letting CC read the contents of module SMi belonging to Pi and write them into module SMm of Pm (the latter is, obviously, a slower process).
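To make the switching idea concrete, the following small C program (illustrative only; the names owner_t, sm_module, and sm_switch are not from the paper) models an SM module as a buffer with an owner field, so that a transfer between P1 and P2 amounts to changing the owner rather than copying bytes.

#include <stdio.h>
#include <string.h>

#define SM_BYTES 4096

typedef enum { OWNER_P1, OWNER_P2, OWNER_CC, TRI_STATED } owner_t;

typedef struct {
    owner_t owner;                  /* position of the SMS switch        */
    unsigned char data[SM_BYTES];
} sm_module;

/* Throwing the switch is the whole "transfer": no bytes move.           */
static void sm_switch(sm_module *sm, owner_t new_owner)
{
    sm->owner = new_owner;
}

int main(void)
{
    sm_module sm = { OWNER_P1, {0} };

    /* P1 writes the data it wants to hand over ...                      */
    strcpy((char *)sm.data, "tuples produced by P1");

    /* ... and the transfer to P2 costs only the switching time.         */
    sm_switch(&sm, OWNER_P2);
    printf("P2 now reads: %s\n", sm.data);
    return 0;
}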

B. A Hardware Communication Scheme

Apart from the SMB bus which connects all the SM modules together, the SM3 system has another bus, CCB, which connects together all the processors (Pi's). The CCB passes through on/off switches S1, S2, ..., Si which are normally off or closed. The control computer CC has direct control over the individual switches (Si's) and can break the bus at any point by turning them on. The CCB is very similar in construction to the common bus employed in the MICRONET system [21]-[23]. It consists of address, data, and control lines, which allow processors to communicate with one another and synchronize their execution. The global control lines of MICRONET are adopted in the SM3 system in order to provide efficient interprocessor communication. Just as in the MICRONET system, each processor has five local control lines which are globally ANDed together to form the five global control lines. When a local line is set in all processors, the corresponding global line will be set automatically. If a processor in the SM3 system is waiting for the other processors to complete a particular action, it can simply sense a predetermined global line. When this line is set the processing can either be synchronized and continued or an interrupt can be issued to the CC. Thus, a high-level protocol for processor synchronization is implemented via the control lines.
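The wired-AND behavior of a global control line can be pictured with a few lines of C. The array local_line and the loop in main below are illustrative stand-ins for the hardware, not part of the SM3 design.

#include <stdbool.h>
#include <stdio.h>

#define NPROC 8

static bool local_line[NPROC];          /* L(1) at each Pi                 */

static bool global_line(void)           /* G(1) = wired AND of all L(1)'s  */
{
    for (int i = 0; i < NPROC; i++)
        if (!local_line[i])
            return false;
    return true;
}

int main(void)
{
    for (int i = 0; i < NPROC; i++) {
        local_line[i] = true;           /* Pi finishes its local phase     */
        printf("P%d done, G(1) = %d\n", i + 1, global_line());
    }
    return 0;                           /* G(1) becomes 1 only after all Pi's have set L(1) */
}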

C. A Partitionable Network

In MICRONET, the global query processing capability is highly restricted since the set of control lines (which are key to most algorithm implementations) can only be used for a single query at a time and cannot support several queries concurrently. In the SM3 system, as mentioned above, turning the Si's on will break the CCB and allow the system to be partitioned into several independent groups or clusters of adjacent processors. Thus, the global control lines can be used independently by each cluster. All the processors in a cluster work on the same database command (in the case of intraquery concurrency) or query (in the case of interquery concurrency). The method of partitioning employed here, using on/off switches on a common bus, requires physical adjacency of data. If data are dispersed over many nonadjacent processors, then either a large cluster is formed containing some processors that do not have the required data or the data are moved around in the system to ensure physical adjacency. Physical adjacency is not a serious constraint since 1) data can be initially loaded into adjacent processors and 2) data can be moved at high speeds using the switchable memory facility to make them adjacent. Studying the effects of physical adjacency (rather, nonadjacency) of data is part of our current research effort.

The advantages of this scheme, on the other hand, are 1) the interconnection scheme has a simple common bus structure, 2) fewer switches are needed (as compared to, say, crossbar switches) so that it is cheaper to build and less likely to fail, 3) since more than one independent cluster can be formed, the system supports both intra- and interquery concurrencies, and 4) the software for controlling the clustering and reclustering is very simple and the software programs running in the clusters are identical. To summarize, a query Q may be executed as follows. Suppose the query contains commands A, B, and C where A is the root of a tree and B and C are the leaves. The system can be partitioned into two clusters, one working on B and the other on C. Each cluster can utilize the global control lines in the same manner as the MICRONET system. When both B and C are completed, the two clusters can be recombined and all the processors can now work together on A (again utilizing the global control lines), to complete the query.
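The following C sketch mimics the query example above with an array of Si switches. The helpers form_clusters and recombine are hypothetical; only the idea that opening and closing the switches repartitions the CCB comes from the text.

#include <stdbool.h>
#include <stdio.h>

#define NPROC 6

static bool si_open[NPROC];   /* Si open => CCB broken after Pi            */

static void form_clusters(void)
{
    si_open[2] = true;        /* break CCB after P3: {P1..P3} run B,
                                 {P4..P6} run C                             */
}

static void recombine(void)
{
    si_open[2] = false;       /* close S3 again: all of P1..P6 run A       */
}

static void print_clusters(const char *phase)
{
    printf("%s: cluster starts at P1", phase);
    for (int i = 0; i < NPROC - 1; i++)
        if (si_open[i])
            printf(" | new cluster starts at P%d", i + 2);
    printf("\n");
}

int main(void)
{
    form_clusters();  print_clusters("leaves B and C");
    recombine();      print_clusters("root A");
    return 0;
}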

III. SWITCHABLE MEMORY SYSTEM ARCHITECTURE

A schematic of the SM3 multicomputer system is shown in Fig. 1. As mentioned earlier, the system has two distinct buses-a switchable memory bus SMB, which connects all SM modules in the system with CC, and a cluster control bus CCB, consisting of address, data, and global control lines as described in Section II-B. The SMB is not partitionable and is used for transferring data between a Pi and the CC, while the CCB is partitionable and is used solely for intracluster communication. The use of two buses allows the system to support multiple clusters operating concurrently.

A. Network Communication Modes

Several communication modes are possible in a multicomputer system which is centrally controlled by a single control computer. For example, the control computer can establish one-to-one communication with a single network processor in order to send special messages to it or to test its status, etc.


Fig. 1. The switchable main memory modules system.

It can also invoke the one-to-all communication mode in order to broadcast commands, "bring up" the system, and the like. On the other hand, a local processor Pi may have completed an operation (like MIN, MAX, COUNT, etc.) and may want to transfer the result to CC; it can then establish a one-to-one communication with CC. If Pi needs to communicate with another processor Pm in order to transfer data or messages, then it would establish one-to-one communication with Pm. Pi can also establish one-to-all communication (required in operations like JOIN, statistical aggregation, etc.) with all the other processors of the cluster, in order to facilitate intracluster communication.

Since data transfer is via the main memory modules, in order to support the above communication requirements the architecture should i) allow the SMi's to be switched between the Pi's and CC, ii) allow all SMi's to be mapped to the same address space to implement the one-to-all communication mode, iii) allow data to be read from only a single module when all the modules are mapped to the same address space, and iv) synchronize accesses to the switchable memories and ensure that any contention is resolved. These features are supported by the use of status words associated with each switchable memory module and the processor.

B. The Status Words

Each switchable memory module is associated with a status word, SW1, which is used to record the current status of the module in the one-to-one communication mode. Since two switchable memory modules are used per node, there are two status words, SW11 and SW12, per node. Every node is also associated with a status word, SW2, which is used in the one-to-all communication mode. The SW11's and SW12's are mapped to distinct addresses in the status word address space, thereby enabling them to be addressed individually by either the Pi's or the CC.


The SW2's, on the other hand, are all mapped to the same address and can, therefore, be accessed simultaneously. The current status of both memory modules is available in the status word SW3, which derives its bit values from SW11, SW12, and SW2. A node command/status word NSW is used at every node in order to control the Si and CCPSi switches at the node and to specify which one of the two SM's is to be used in broadcast communication. The status word address space, which is unique to the entire system, may be viewed as shown in Fig. 2.
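For concreteness, the C declarations below give one possible software view of these words. The bit names are taken from the protocols of Figs. 3 and 4; the exact layout is an illustrative assumption, not the hardware definition.

typedef enum { SMS_LOCAL, SMS_GLOBAL_CLUSTER, SMS_TRI_STATED } sms_pos;

typedef struct {          /* SW11i / SW12i: one per SM module              */
    unsigned busy_p : 1;  /* module busy, set by the local Pi              */
    unsigned busy_c : 1;  /* module busy, set by CC or the CCP             */
    unsigned wd_p   : 1;  /* write-done flag set by the Pi side            */
    unsigned wd_c   : 1;  /* write-done flag set by the CC/CCP side        */
    unsigned int_p  : 1;  /* interrupt pending toward the receiver         */
    sms_pos  sms;         /* current position of switch SMSi               */
} sw1_word;

typedef struct {          /* one set per node Pi                           */
    sw1_word sw11, sw12;  /* status of SM1i and SM2i                       */
    unsigned sw2;         /* broadcast-mode word, same address in every node */
    unsigned sw3;         /* read-only mirror of SW11/SW12/SW2 bits        */
    unsigned nsw;         /* node command/status word: controls Si, CCPSi, */
                          /* and the SM used for broadcast                  */
} node_status;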

C. Usage of Synchronization Bits

Access to the switchable memories is always preceded by the setting of appropriate bits in the various status words to ensure that there is no memory access conflict. If an SM module required by a processor is already in use, then the processor has to wait for the SM to be free. The following two subsections describe the SM module write and read protocols followed by the sending and receiving processors, respectively.

1) Sending Processor Protocol: The sending processor sets the write, busy, and SMS bits in the appropriate SW1 after ensuring that the corresponding SM module is free. It then transfers the data to the SM. The SM is tri-stated at the end of the operation and the receiving processor is issued an interrupt. The notation SW1ki specifies the SW1 status word of the kth memory module (k = 1, 2 for SM1, SM2) of the ith network node. Similarly, SW2i, SW3i, and NSWi indicate the corresponding status words of the ith node. The bits in SW3 are subscripted by k (k = 1 or 2) to indicate the two sets of bits for SM1 and SM2, respectively. The pseudocode which implements this protocol is shown in Fig. 3.
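The single-threaded C fragment below is a rough, hypothetical rendering of this handshake: the status-word bits are modeled as plain flags and the data transfer is reduced to a counter. It is meant only to make the bit sequence concrete, not to reproduce the Fig. 3 pseudocode or the real multiprocessor timing.

#include <stdbool.h>
#include <stdio.h>

/* Flags standing in for the bits of SW1ki / SW3i of one SM module.        */
static bool busy_p, busy_c, wd_p, wd_c, int_p;
enum sms_pos { SMS_LOCAL, SMS_TRI_STATED };
static enum sms_pos sms = SMS_TRI_STATED;

static int blocks_left = 3;                 /* pretend output size          */

static void send_one_module(void)
{
    while (busy_c || wd_p || wd_c)          /* wait until the module is free */
        ;
    busy_p = true;                          /* claim the module ...          */
    sms = SMS_LOCAL;                        /* ... and switch it local       */
    blocks_left--;                          /* "write data into SMk"         */
    busy_p = false;  wd_p = true;           /* mark write done               */
    sms = SMS_TRI_STATED;                   /* tri-state the module ...      */
    int_p = true;                           /* ... and raise the interrupt   */
    printf("module handed over, %d block(s) still to send\n", blocks_left);
    wd_p = false;  int_p = false;           /* (the receiver would clear these) */
}

int main(void)
{
    while (blocks_left > 0)
        send_one_module();
    return 0;
}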

2) Receiving Processor Protocol: The receiving processor is interrupted when the data become available in an SM. Upon receiving an interrupt, the processor switches the appropriate SM module to itself and proceeds to read the data. The pseudocode in Fig. 4 illustrates this protocol. As the code implies, the CC needs to poll each Pi in order to service the proper SM module(s).


Fig. 2. Status words address space.

/* Protocol in Pi to obtain write access to SMk.              */
/* To obtain the corresponding protocol in CC or CCP,         */
/* change all (c)'s to (p)'s and vice versa.                  */
/* Since the status words are in multi-ported memory, a       */
/* WRITE into this memory space is followed by a READ to      */
/* ensure that the WRITE was successful.                      */

repeat  /* until all the data are transferred */
    repeat  /* until Pi has gained access to SMk */
        repeat  /* until SMk is free for access */
            READ SW3i(Busy(c)k, WD(p)k, WD(c)k)
        until (Busy(c)k = 0) AND (WD(p)k = 0) AND (WD(c)k = 0);
        WRITE SW1ki(Busy(p) := 1, SMS := local);
        READ SW3i(Busy(p)k, SMSk)
    until (Busy(p)k = 1) AND (SMSk = local);
    repeat
        Write data into SMk;
    until (End of Data) OR (SMk is full);
    WRITE SW1ki(Busy(p) := 0, WD(p) := 1);
    WRITE SW1ki(INT(p) := 1, SMS := tri-stated);
until (End of Data);

Fig. 3. Sending processor protocol.


D. Bus Organization and Interface Hardware

In the following two subsections, we shall first describe the bus organization in each node of the SM3 system and then provide a more detailed description of the SM module interface logic.

1) Bus Organization: Fig. 1 shows the bus structures of the overall system. The switchable memory bus (SMB) and the cluster control bus (CCB) are system-wide buses. The internal processor bus, on the other hand, connects the CPU, memory, and other devices at each node. All three buses have access to the status word address space and to the switchable memories. The CCB is partitionable via the Si switches. In each partition (or cluster) one of the Pi's is designated as the cluster control processor (CCP).

Fig. 5 shows the details of the bus organization for a single network node. Address references on the internal bus originate from the local Pi while those on the SMB originate from CC. References on the CCB normally originate from the CCP. In the node which is designated as the CCP, the internal bus is connected to CCB; thus both have identical data.

/* Routine in CC/CCP for reading data from either SM1 or SM2. */
/* For the corresponding routine in Pi, change all (p)'s to   */
/* (c)'s and vice versa.                                      */

READ NSW(INT(p));
if (INT(p) <> 1)
    then (Try next node, if CC or CCP)
else
    READ SW3i(Busy(p)1, WD(p)1, INT(p)1);
    READ SW3i(Busy(p)2, WD(p)2, INT(p)2);

    /* Check if data is in SM1. If so, verify status of bits  */
    /* in SW1. If status is OK, set status to indicate that   */
    /* SM1 is in use. If not, then error in accessing.        */

    if (INT(p)1 = 1)
        then if (Busy(p)1 = 0) AND (WD(p)1 = 1)
            then begin
                WRITE SW11i(INT(p) := 0, Busy(c) := 1, SMS := global/cluster);
                repeat
                    Read data from SM1
                until (End of Data);
                WRITE SW11i(Busy(c) := 0, WD(p) := 0, SMS := tri-stated)
            end
            else ("Error in memory access protocol!")

    /* If interrupt not from SM1 then check if from SM2.      */
    /* Check if data is in SM2; if so, verify status of bits  */
    /* in SW1. If status is OK, set status to indicate that   */
    /* SM2 is in use. If not, then error in accessing the SM  */
    /* module.                                                 */

    else if (INT(p)2 = 1)
        then if (Busy(p)2 = 0) AND (WD(p)2 = 1)
            then begin
                WRITE SW12i(INT(p) := 0, Busy(c) := 1, SMS := global/cluster);
                repeat
                    Read data from SM2
                until (End of Data);
                WRITE SW12i(Busy(c) := 0, WD(p) := 0, SMS := tri-stated)
            end
            else ("Error in memory access protocol!")
    else ("Error in Interrupts!");
end  /* end of ELSE part of IF (INT(p)<>1) */

Fig. 4. Receiving processor protocol.

Fig. 5. Bus organization at a node.


This requires special attention while accessing the status word address space, as described below in Section III-D-2). Fig. 5 also shows the Si and CCPSi switches, controlled by lines from the SM module interface unit; the interrupt line from Pi to CC, connected via the interface unit to the SMB; and the interrupt from CC to Pi, connected to the Pi via the (MICRONET-like) global control lines interface.

In addition to the interrupt from CC, the five global control lines of CCB (see Section II-B) are also connected to the Pi via the global control lines interface. The MICRONET interface was built in two parts [21]. One part interfaces to the control lines and the other to the system bus. Thus, the control lines may be connected to any computer system by changing only the system-specific part of the interface.

2) SM Module Interface: A block schematic of the SM module interface unit is shown in Fig. 6. The three system buses, consisting of address, data, and control lines, are the input to this unit, while the output consists of two buses-one to SM1 and another to SM2-along with the various control lines from the status words. The interface contains the status words along with the logic to provide shared memory access to them; the switching logic for switching two out of the three buses to the two SM modules; the logic to implement broadcast communication; and other relevant hardware.

The interface hardware may be divided into two sections-one to manage accesses to the status words and another to provide access to the SM modules. The higher order bits of the incoming address lines are decoded to determine if the status word address space is being accessed. Since the status words are mapped to a globally unique address space, any processor may access any status word. Requests from the three input buses to the status word space are queued in sequence by an arbitration logic which resolves simultaneous requests by random assignment. The bus with the first memory access request is switched to the status words by a 1-of-3 port multiplexer unit.

The port multiplexer is controlled via two control bits either from status word SW1 or from SW2. When both bits are set to zero, the SM module is tri-stated. The node designated as CCP carries the same memory references on its internal bus and CCB, thereby generating duplicate requests to the status word address space, as mentioned above in Section III-D-1). In this case, the arbitration logic is adjusted to accept only one of the two requests and to ignore the other. There is no duplicate requests problem while accessing the SM modules since, in this case, one out of the three input buses is switched explicitly by the two control bits. This illustrates a basic difference between a shared memory and a switchable memory.

The decode and arbitration logic and the bus multiplexers of the interface hardware, in conjunction with the status word bits, provide true shared memory access to the status word address space and switchable memory access to the SM modules. The appropriate bus, with all the address, data, and control lines, is switched to each SM. In addition, the hardware also supplies the processor interrupts and the control lines to control the various switches. Using this hardware the SM3 system can be configured as shown in Fig. 1.

All the SMSi switches are in cluster mode (SMS = 10); CCPS1, CCPS4, and CCPS6 are closed (CCPS = 1); and S3 and S5 are open (Si = 1). Thus, three clusters are formed, consisting of 1) P1, P2, and P3, 2) P4 and P5, and 3) P6 to Pn, with P1, P4, and P6 designated as the respective CCP's.
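As an illustration of this configuration, the C sketch below encodes the Fig. 1 example as a table of per-node switch settings, assuming n = 8 processors. Only the encodings SMS = 10, CCPS = 1, and Si = 1 come from the text; the struct and the cluster printout are assumptions.

#include <stdio.h>

#define NPROC 8                           /* P1..Pn with n = 8 here        */

struct node_cfg {
    unsigned sms;                         /* 10 (binary) = cluster mode    */
    unsigned ccps;                        /* 1 = this node is a CCP        */
    unsigned si;                          /* 1 = CCB broken after this Pi  */
};

int main(void)
{
    struct node_cfg cfg[NPROC] = {0};

    for (int i = 0; i < NPROC; i++) cfg[i].sms = 2;   /* SMS = 10b         */
    cfg[0].ccps = cfg[3].ccps = cfg[5].ccps = 1;      /* P1, P4, P6 are CCP's */
    cfg[2].si = cfg[4].si = 1;                        /* S3 and S5 open    */

    /* A new cluster starts after every open Si switch.                    */
    printf("cluster: P1");
    for (int i = 1; i < NPROC; i++) {
        if (cfg[i - 1].si) printf("\ncluster: P%d", i + 1);
        else               printf(" P%d", i + 1);
    }
    printf("\n");
    return 0;
}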

A preliminary architecture of the SM3 system was presented in [11]. The system has undergone some major changes since then, although the underlying principles still remain the same. Each node of the system is now equipped with dual SM modules in order to provide an overlap between data transfer and CPU operations. In the previous design the CC was required to have simultaneous access to all the SM modules in the system, which restricted the number of processors based on the addressing capability of CC. This has been modified to allow the CC to access only one SM module at a time. If the CC needs to access some other module, then it switches out the current module and switches in the new module. Also, unlike the previous design, we now permit individual processors to set their own switches at each node. This is useful in certain operations, as described in [6] and [7]. A more detailed description of the architecture is available in [6].

IV. SOFTWARE ORGANIZATION AND PARALLEL ALGORITHMS

A. General Software Organization

Each processor in the SM3 multicomputer system is a multiuser, stand-alone computer system with its own secondary storage, the required operating system, related utilities, etc. The software required for performing nonlocal tasks would be repeated in each processor. Queries entered at any one of the local processors in the system may be either local or global. A local query can be fully supported by a local processor, but a global query requires communication with some other processor(s). All global queries are sent to the CC, which decides on the processors that should participate in the query execution and correspondingly forms a cluster. The CC optimizes on cluster size by moving isolated segments of data, if any exist, into processors which are closer to the main body of the data. This reduces the average size of a cluster and allows a higher degree of parallelism in the system. To demonstrate how parallel, nonnumeric algorithms may be implemented in the SM3 system, we briefly discuss two operations of differing complexities (as classified in [13]), used in a DBMS environment-SELECT and JOIN. In addition, we also present results for the statistical aggregation by categorization operation.

B. Parallel Algorithm Implementation

1) Select: The SELECT operation starts with CC forming a cluster and sending the operation to the cluster. The processors use a local line (see Section II-B), say L(1), to indicate the end of local processing. The command is globally complete when the global line G(1) is set. Each processor reads a block of the source relation and applies the selection criteria. Dual, switchable I/O buffers allow each block of the source relation R to be read in asynchronously by an independent I/O controller. This permits overlapping of I/O and CPU operations.


Fig. 6. Block schematic of SM module interface unit.

The line L(1) is set when the source reaches end of file (EOF). The results from the SELECT operation may be required either as temporary data at the local node or as the final output to be routed to CC. In the latter case, the selected tuples are transferred to CC via the dual SM modules. When a module is full or the source relation is at EOF, the Pi uses the sending processor protocol shown in Fig. 3 to switch the SM over to CC and continues with the selection operation until L(1) is set. A detailed analysis and evaluation of the SELECT operation is available in [24].
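A much simplified, single-node C sketch of this loop is given below. Tuples are plain integers, the dual I/O buffers and the L(1) signaling are reduced to comments, and ship_sm_to_cc stands in for the Fig. 3 protocol; none of this is the paper's implementation.

#include <stdio.h>

#define BLOCK_TUPLES 4          /* tuples per I/O block                     */
#define SM_TUPLES    3          /* capacity of the SM output area           */

static int source[] = { 5, 12, 7, 30, 2, 18, 25, 9, 40, 11, 6, 33 };
static int sm[SM_TUPLES];       /* switchable memory output area            */
static int sm_used = 0;

static void ship_sm_to_cc(void) /* the Fig. 3 protocol would run here       */
{
    printf("SM switched to CC with %d tuple(s)\n", sm_used);
    sm_used = 0;
}

int main(void)
{
    int n = (int)(sizeof source / sizeof source[0]);

    for (int b = 0; b < n; b += BLOCK_TUPLES) {        /* one I/O block     */
        for (int i = b; i < b + BLOCK_TUPLES && i < n; i++) {
            if (source[i] > 10) {                      /* selection test    */
                sm[sm_used++] = source[i];
                if (sm_used == SM_TUPLES)              /* module full       */
                    ship_sm_to_cc();
            }
        }
    }
    if (sm_used > 0) ship_sm_to_cc();                  /* flush at EOF      */
    /* the node would now set its local line L(1)                           */
    return 0;
}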

2) JOIN: The JOIN of two relations A and B, over attributes a from A and b from B, is obtained by concatenating each tuple from A with a tuple from B, such that a θ b, where θ is a member of the set {=, <=, <, >, >=, <>}. We consider only the equijoin operation and use the well-known nested-loop algorithm for JOIN to highlight the features of the SM3 system. We shall assume that relation B is the smaller of the two relations and, hence, is broadcast block-by-block to the entire cluster. Each processor JOINs every block of B with all its blocks of relation A.

The control lines are used extensively to synchronize the various phases of the operation. To start with, every processor reads in a block of relation A and a block of relation B from secondary storage. A phase consists of a broadcast of a block of B by the CCP, followed by the local JOINs in each processor. Each phase is synchronized by the use of a control line, say L(1). When all the blocks of B have been broadcast from the current CCP, the next processor is designated as CCP. Since, in our case, the time required to JOIN two blocks is greater than that required to read a block of data from I/O, the I/O controller is in general "ahead" of the CPU. We use two control lines, L(2) and L(3), to separately indicate the end of I/O operations and the end of the CPU operations in CCP.

Initially, all L(1)'s and L(3)'s are set to 0 (G(1) = G(3) = 0) and all L(2)'s, except the one in CCP, are set to 1. As each processor finishes joining the current block of B with all its blocks of A, it sets line L(1) = 1. Thus, G(1) = 1 implies that all processors have completed the previous phase. At this point the processor whose L(2) = 0, viz. the CCP, can broadcast its next block of B. When the I/O controller in CCP reaches the end of B, it sets L(2) = 1 (and, therefore, G(2) = 1). This prompts the processor that is to be the next CCP to set its L(2) to 0 and to start reading its block of B and prepare for broadcast. Finally, the L(3) line is set to 1 in each CCP after it has finished joining its last block of B. Thus, L(3) = 1 implies that the current CCP has broadcast and joined all its blocks of B. Therefore, G(3) = 1 indicates the global completion of the JOIN operation.
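The local work performed in one phase is a plain nested-loop equijoin, sketched below for a single node. The tuple layout is hypothetical, and the broadcast of B and the L(1)/L(2)/L(3) signaling are only indicated by comments.

#include <stdio.h>

struct tuple { int key; const char *rest; };

static struct tuple rel_a[]   = { {1, "a1"}, {2, "a2"}, {2, "a3"}, {4, "a4"} };
static struct tuple b_block[] = { {2, "b1"}, {4, "b2"} };  /* broadcast block of B */

int main(void)
{
    int na = (int)(sizeof rel_a / sizeof rel_a[0]);
    int nb = (int)(sizeof b_block / sizeof b_block[0]);

    /* nested-loop equijoin of the local blocks of A with this block of B  */
    for (int i = 0; i < na; i++)
        for (int j = 0; j < nb; j++)
            if (rel_a[i].key == b_block[j].key)
                printf("result tuple: (%d, %s, %s)\n",
                       rel_a[i].key, rel_a[i].rest, b_block[j].rest);

    /* the node would now set L(1) = 1 to signal the end of this phase     */
    return 0;
}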

C. Simulation of Database Operations

An analytical evaluation of the SM3 multicomputer system has been carried out using timing equations. The objectives behind this analysis were i) to provide a comparison to other systems/architectures and ii) to identify activities which consume more time than others, with a view toward optimizing overall response time. Various studies have used different parameter values for evaluation [10], [14], [27], depending upon the specific implementation. For the sake of equal comparison, we use values similar to those employed in [10], while the actual values would ultimately depend on the specific hardware used.


The details of parameter value selection, analysis of algorithms and performance evaluation using the timing equations, and comparison to other systems are available in [5], [24]. Here we shall present more recent results obtained from a simulation experiment.

Simulation programs were written in the C language under Unix® for the SELECT and JOIN operations using the discrete-event simulation technique. A simulation study should, in general, provide a better approximation of the system than timing equations. The simulation models each node of the SM3 system as having an I/O processor, CPU, dual I/O buffers, and dual SM's with a status word SW3, used to indicate interrupt status to the CC. In addition, the system has a single control computer, CC. The dual I/O buffers are assumed to be simple switchable memories. They can be switched between the I/O processor and the CPU, thereby supporting concurrent data transfer and CPU operations. The algorithms for SELECT and JOIN are implemented as described in Sections IV-B-1) and IV-B-2), respectively.

The simulation results for the SELECT operation show the same trends as shown by a previous timing equation analysis [24]. Fig. 7 shows a comparison of results obtained from simulation versus those from timing equations. The x parameter, the number of processors in a cluster, is varied from 3 to 20 for a given source relation size and selectivity. A key factor that determines operation time is the number of SM transfers between Pi and CC. The number of SM transfers is the ceiling function of the result obtained by dividing the size of the output relation at Pi (in blocks) by the size of the SM module (in blocks). Thus, if the SM module size is 1 block, then, whether the output is 1.09 or 1.99 blocks, the number of transfers is still 2.0. In the figure, the number of SM transfers is seen to remain constant, due to this property, for the cluster sizes in the intervals (8, 9), (10, 12), and (13, 19). For these intervals the operation time increases with increase in cluster size. For the intervals (3, 8), (9, 10), (12, 13), and (19, 20) the operation time decreases since there is a decrease in the number of SM transfers. Thus, for a fixed source relation size, increasing the number of processors in a cluster (assuming uniform data distribution) would increase operation time unless there is a decrease in the number of SM transfers.
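The transfer count is just a ceiling, as the following small helper illustrates (sm_transfers is an illustrative name, not taken from the simulator).

#include <math.h>
#include <stdio.h>

static int sm_transfers(double output_blocks, double sm_blocks)
{
    return (int)ceil(output_blocks / sm_blocks);
}

int main(void)
{
    /* With a 1-block SM module, 1.09 and 1.99 output blocks both need     */
    /* 2 transfers, as noted in the text.                                  */
    printf("%d %d\n", sm_transfers(1.09, 1.0), sm_transfers(1.99, 1.0));
    return 0;
}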

The fact that the SELECT operation is more sensitive to the result relation size than to the source relation size is illustrated in Fig. 8, where the operation time is plotted against source relation size (fixed selectivity). For source relation sizes (in tuples) in the intervals (30 000, 40 000), (50 000, 70 000), and (80 000, 90 000), the operation times remain the same. In these cases, as shown in the figure, the number of SM transfers required per node to transfer the results to CC is the same even though the source size has increased. Thus, the output size has a greater impact on operation time than the input size. Also, in these cases the CPU and I/O idle times decrease since, even though the operation time remains constant, both I/O and CPU have to do more work in order to read and process the larger source relations.

Unix® is a registered trademark of AT&T Bell Laboratories.


Fig. 7. SELECT: Simulation and timing equations results versus the number of processors.


Fig. 8. SELECT: Effect of source relation size on operation time, number of SM transfers, CPU, and I/O idle times.

The output from the SELECT operation is routed back to CC via the SM modules. The CC services SM requests from the Pi's using a round-robin (RR) schedule. Thus, if CC is currently servicing the SM of, say, node P4 and if SM's from P1, P3, and P5 are queued for service, they will receive service from CC in the order P5, P1, and P3. A possible modification is to adapt the CC service schedule to the load offered by each node. Rather than a simple RR scheme, the CC may first service one SM module from each node where both the modules are filled with data (i.e., each node which is blocked due to the unavailability of SM's) and then service the remaining nodes.

The modified RR scheme has some effect on operation time, especially in cases of highly unbalanced output, i.e., where a few nodes of the cluster account for a large part of the output.


In the simulation, the selectivity factor of a single node (of a 19-node cluster) was varied from 0.1 to 1.0. Fig. 9 shows the effect of using a plain RR scheme versus an RR scheme modified as stated above. For lower selectivities, i.e., 0.1 and 0.2, the service pattern has no effect on operation time. For higher values of selectivity, though the modified RR scheme performs better than the plain RR scheme, the improvement is found to be only in the range of five to ten percent for the given parameter values.

The JOIN operation is inherently more complex and the timing equations approach does not adequately capture all the details of this operation. This is clear from our simulation results, which are not very close to the results from the timing equation analysis, unlike the SELECT case. The discrepancy in the results is attributed to the following factors. First, the simulation was able to represent, much more realistically, the sharing of the dual I/O buffers between blocks of relations A and B, unlike the timing equations which made simplifying assumptions in this regard. Second, the assumption that I/O operations are totally overlapped by CPU computations is not always true. When a block of B is broadcast, the CCP needs to wait on I/O for the next block of A (since broadcast time is less than I/O access time), thereby adding to the overall time. The operation can be easily optimized by reusing the last block of A for joining with the current block of B, as suggested in [16].

A point to be noted is the definition of the selectivity factor for the JOIN operation. In [10] and [24], the maximum output of the JOIN operation was estimated as twice the product of the number of blocks in relations A and B. The factor of two accounts for the concatenation of tuples. The selectivity factor was defined as the size of the output expressed as a fraction of this maximum size. This method, in fact, results in an underestimation of the actual size of the output. Since the worst case output of a JOIN operation is the cross-product of the two relations, the maximum size should be computed as follows. Let the number and size (in bytes) of tuples in relations A and B be represented by Ta, Ba and Tb, Bb, respectively. The maximum possible output is then (Ta*Tb*(Ba + Bb)/BLKSIZE) blocks (where BLKSIZE is the block size in bytes) rather than 2*(Ta*Ba/BLKSIZE)*(Tb*Bb/BLKSIZE) blocks, as computed using the previous method. The maximum possible output size computed using the new method is then ((Ba + Bb)/(2*Ba*Bb))*BLKSIZE times the value obtained using the old method. This ratio reduces to BLKSIZE/B when Ba = Bb = B. In this particular simulation the output size is 265.4 blocks (19 759 tuples of 175 bytes each), which corresponds to a 0.1 selectivity factor using the old method and a 0.0007 selectivity factor using the new method. In either case, the JOIN operation clearly produces a large amount of output which needs to be transferred to a single output processor (either CC or a Pi).
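The two estimates and their ratio can be checked numerically as below; the parameter values are hypothetical and are not the simulation parameters used in the paper, so the printed numbers only illustrate the formulas.

#include <stdio.h>

int main(void)
{
    double Ta = 10000.0, Ba = 100.0;   /* tuples and tuple size of A (hypothetical) */
    double Tb = 3000.0,  Bb = 75.0;    /* tuples and tuple size of B (hypothetical) */
    double BLKSIZE = 13000.0;          /* block size in bytes (hypothetical)        */

    /* old estimate: 2 * blocks(A) * blocks(B)                                      */
    double old_max = 2.0 * (Ta * Ba / BLKSIZE) * (Tb * Bb / BLKSIZE);

    /* new estimate: full cross-product of concatenated tuples                      */
    double new_max = Ta * Tb * (Ba + Bb) / BLKSIZE;

    printf("old max = %.1f blocks, new max = %.1f blocks\n", old_max, new_max);
    printf("ratio new/old = %.2f  (formula gives %.2f)\n",
           new_max / old_max, (Ba + Bb) / (2.0 * Ba * Bb) * BLKSIZE);
    return 0;
}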

Fig. 10 shows the plot for the JOIN operation for fixed source relation sizes and varying (10 to 100) number of processors. The graphs in the figure can be divided into three ranges, (10, 40), (50, 70), and (80, 100). In the range (10, 40), the CPU component of the operation is relatively high and increasing the number of processors decreases the CPU time to an extent that an overall decrease in operation time is observed.


Fig. 9. SELECT: Effect of SM service policy on operation time (plain RR versus modified RR; selectivity factor in node 19 varied; 19 processors, SM output area size 1.0 block, source relation 50,000 tuples).


Fig. 10. JOIN: Simulation and timing equations results versus the number of processors (relation A 10,000 tuples, relation B 3,000 tuples, selectivity 0.1).

This phenomenon is captured both by the timing equations and the simulation. In the (50, 70) range, the I/O component becomes the determining factor. In this range it is found that (the ceiling of) the number of blocks of the source relations remains the same (2 for relation A and 1 for relation B, for the given parameter values) even when the number of processors is increased. Thus, the number of iterations in the nested-loop algorithm increases with increasing number of processors, as shown in the figure, and the secondary storage access associated with each iteration of the loop adds to the overall operation time. Since the timing equations were derived assuming that I/O is always transparent to the CPU, the results from the timing equations are different from those of the simulation for this range. In the range (80, 100) each processor has less than one block each of relations A and B, and a single I/O access per relation is sufficient to load both the relations into main memory.


Thus, the operation time drops significantly when going from 70 to 80 processors and stays relatively low beyond 80 processors. The timing equations did not account for this special case; thus, the operation times that they compute are relatively high in this range.

Finally, the simulation shows that a major factor in determining the operation time is the availability of SM modules for the purpose of broadcasting blocks of B. If even a fraction of the module contains output tuples, the SM cannot be used for broadcast, since the module should be completely empty in order to broadcast a full block of B. Thus, long delays are introduced at broadcast times. This problem is solved by increasing the size of each SM module and designating one part of the module for output and the other part for broadcast. With such an arrangement an SM module is always available for broadcast purposes and no delay is incurred in this stage.

The JOIN operation time is sensitive to the output size (i.e., the number of times an SM is switched to CC) and also to the number of iterations in the nested-loop algorithm. Fig. 10 illustrates the strong correlation that exists between the number of scans of relation A and the JOIN operation time (in the figure, the term Block Reads is defined as follows: Block Reads = Number of Blocks in relation * Number of Scans of relation). The number of times that relation A is scanned can be reduced by broadcasting more than one block of B at a time. Also, the relative size of the output and broadcast areas in the SM's can be made flexible according to the needs of the specific queries. A larger broadcast area will reduce the number of broadcasts of relation B from a node (and, hence, the number of scans of relation A), whereas a larger output area will reduce the frequency of service required from CC by a specific node.

The effect on operation time of the relative sizes of the broadcast and output areas is shown in Fig. 11. The total SM size (output + broadcast) is fixed at 4.0 blocks and the output area size is varied from 0.5 to 3.5 blocks. For an output area size of 0.5 blocks, the operation time is adversely affected by the number of SM transfers required to move the output to CC. Increasing the output area size to 1.0 blocks improves the balance between the output and broadcast areas and the overall operation time decreases. A further increase in the output area reduces the broadcast area and increases the number of cycles required to broadcast all of relation B. Since the entire relation A needs to be scanned for each broadcast cycle, this has the overall effect of increasing the number of block reads of A, which directly affects the operation time. This is shown clearly in Fig. 11, where increasing the output area from 1.0 to 1.5 blocks and from 2.5 to 3.5 blocks increases the block reads on A and, consequently, the JOIN operation time.
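A crude model of this trade-off is sketched below; it is our own simplification, not the simulator's logic. Each node is assumed to need ceil(bi / broadcast area) broadcast cycles for its share of B, and every cycle forces a full scan of the local blocks of A.

#include <math.h>
#include <stdio.h>

int main(void)
{
    double blocks_a_per_node = 20.0;   /* hypothetical sizes                */
    double blocks_b_per_node = 6.0;
    int    nodes             = 6;

    for (double bcast = 0.5; bcast <= 3.5; bcast += 0.5) {
        double cycles_total  = nodes * ceil(blocks_b_per_node / bcast);
        double block_reads_a = blocks_a_per_node * cycles_total;
        printf("broadcast area %.1f blocks -> %.0f block reads of A per node\n",
               bcast, block_reads_a);
    }
    return 0;
}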

V. COMPARISON TO OTHER ARCHITECTURES

Based on architectural features, the SM3 system can be placed in a class between local area networks (LAN's) on the one hand and tightly coupled microprocessors on the other. Thus, the term multicomputer has been used to describe this system. In this section we provide a qualitative and quantitative comparison of the SM3 architecture to conventional local area networks and typical shared memory systems.


Fig. 11. JOIN: Effect of varying the ratio of SM broadcast area size to SM output area size on operation time (6 processors, relation A 10,000 tuples, relation B 3,000 tuples, selectivity 0.1, output area size + broadcast area size = 4.0 blocks).

A. SM3 Versus Local Area Networks

Just as in any LAN, each node in SM3 is an independent computer capable of supporting local operations. In addition, SM3 can support global operations which involve two or more processors. The type of operation, local or global, is transparent to the lay user, who has a monolithic, single-system view of the entire multicomputer system. In the global mode, SM3 supports asynchronous, SIMD processing inside a single cluster and MIMD processing with multiple clusters. The architectural features of SM3 provide efficient support for this mode of operation by allowing physical partitioning of processors into clusters and supporting fast interprocessor communication and data transfer. In a LAN, processors can only be logically partitioned by using software means (e.g., the one-to-few broadcast technique in Ethernet) and interprocessor communication is via packets transferred on the communication medium. The inherent data transfer, packet communication, etc., overheads make LAN's unsuitable for supporting SIMD and MIMD processing.

In order to provide a rough comparison between SM3 and a common bus LAN, we modified our simulation programs to simulate conditions for a LAN. The time taken to transfer a block of data between computers in SM3 is TSWITCH + TMOVE, which accounts for the memory switching time and the time taken to read and write data from and to main memory. In a LAN, for the same operation, we have TPKT + TMOVE + TXFER, where TPKT is the packet assembly/disassembly overhead, TMOVE is as described above, and TXFER is the time taken to transfer a block of data on the interconnection medium. Unlike SM3, a conventional LAN has no control lines and all interprocessor communication and synchronization is achieved by transferring packets on the network, for which a time of TPKT has been assumed. This communication and synchronization overhead increases at least linearly with the number of processors per cluster. Also, since the transmission medium in a LAN is shared by all the network processors, data transfer among Pi's cannot be overlapped with the transfer of data between a Pi and CC. Finally, the architecture of conventional LAN's does not permit dynamic partitioning. Therefore, all processors in a network, regardless of the tasks that they are involved in, contend for the same bus, thereby further degrading the performance of parallel algorithms.




Finally, the architecture of conventional LAN's does not permit dynamic partitioning. Therefore, all processors in a network, regardless of the tasks that they are involved in, contend for the same bus, thereby further degrading the performance of parallel algorithms.

The simulation of the JOIN operation was modified in order to represent an architecture equivalent to that of a common bus LAN. The SELECT operation is very straightforward and, apart from the additional time required to transfer output to CC, the operation times for the SELECT operation in a LAN would be similar to those obtained in SM3. For JOIN, we increased the time taken to transfer data as indicated above. In this operation the processors synchronize after every block of relation B has been broadcast and joined. If bi represents the number of blocks of relation B in the ith node, then in a cluster of P processors, (P - 1) Σi ⌈bi⌉ packets need to be transmitted for the purpose of interprocessor synchronization. Fig. 12 shows the JOIN operation times for SM3 and a LAN, along with the number of packets required for synchronization in a LAN, versus the number of processors. Two graphs have been plotted for the JOIN operation time in a LAN, assuming packet overhead times of 1 ms/packet and 5 ms/packet, respectively. Even in the optimistic case of 1 ms/packet overhead, SM3 outperforms the LAN, and the performance of the LAN degrades steadily with increasing number of processors. The graph for the LAN is, in any case, optimistic since the simulation is not very detailed and we have made assumptions in favor of the LAN (e.g., the media access time is not accounted for, CPU and network operations are assumed to occur concurrently without affecting each other, etc.).
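Stated as a one-line calculation, the synchronization packet count assumed for the LAN case is sketched below; the per-node block counts in the example are hypothetical.

import math

def lan_sync_packets(blocks_B_per_node, P):
    """Packets transmitted for interprocessor synchronization in the LAN
    version of JOIN: the cluster synchronizes after every broadcast block
    of B, and each synchronization involves the other P - 1 processors."""
    return (P - 1) * sum(math.ceil(b) for b in blocks_B_per_node)

# e.g., 6 processors, each holding 5 blocks of B: 30 blocks in total,
# hence (6 - 1) * 30 = 150 synchronization packets.
print(lan_sync_packets([5, 5, 5, 5, 5, 5], P=6))  # 150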

SM3 provide certain capabilities which LAN's do not possess. For example, in some applications it is advantageous to store certain data objects (e.g., matrices) in physical memory, rather than virtual memory, in order to support fast processing of such objects. If these data are very large in size, they may need to be stored across many processors. In this case, the SM modules can be used as expanded memory where, depending on the switching of modules, the same address space refers to different data. Since SM modules can be switched in and out of the address space by simply writing into a status word, this does not constitute much of an overhead. A more detailed analysis of the use of switchable memories for certain matrix operations is available in [7]. The dynamic partitioning of the bus (CCB) is also a very useful feature of SM3 which is not available in conventional LAN's. One possible algorithm for statistical aggregation by categorization [5], for example, requires a global aggregation phase where all the local aggregates are combined together to obtain the global aggregate. This phase can be done as a binary-tree structured operation which, as indicated by a simulation study [6], drastically cuts down the overall operation time.
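The tree-structured combining step of such a global aggregation phase can be sketched as follows. The combining function and the local aggregate values are hypothetical, and the sketch ignores the partitioning and communication steps that the actual algorithm in [5] would perform.

def tree_aggregate(local_aggregates, combine):
    """Combine per-processor local aggregates pairwise, level by level,
    so that a cluster of P processors finishes in about ceil(log2(P))
    combining steps instead of P - 1 sequential steps."""
    values = list(local_aggregates)
    while len(values) > 1:
        next_level = [combine(values[i], values[i + 1])
                      for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:            # an odd value passes up unchanged
            next_level.append(values[-1])
        values = next_level
    return values[0]

# e.g., a global SUM over six hypothetical local sums:
print(tree_aggregate([12, 7, 30, 4, 9, 11], lambda x, y: x + y))  # 73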

B. SM3 Versus Shared Memory Systems

Many variations of a conventional shared memory system are in existence. Rather than attempting to simulate all these systems, we provide a qualitative comparison and some discussion of the similarities and differences between SM3 and shared memory systems.

Fig. 12. JOIN: Comparison of operation times obtained in SM3 versus a typical common bus LAN, plotted against the number of processors (10 to 100).

A key difference is that in shared memory systems a fixed amount of physical memory is mapped to a given address space which is shared by all processors. There is a 1-1 mapping between the logical addresses of the shared space and the physical locations of the shared module. In contrast, in SM3 there is a 1-n mapping between logical addresses and physical locations: the setting of the SMS switches determines which particular module (physical location) will be referenced by a given logical address. Most shared memory systems fit one of the following models, as shown in Fig. 13: 1) the shared module is multiported so that each node has a port to memory; 2) many distinct shared modules (mapped to different address spaces) are connected to processors via multiple buses so that any processor may access any memory via any one of the buses; or 3) processors and memories are interconnected via some interconnection (switching) network, again to allow processors to access any memory. In the first model, the number of ports required for the memory module equals the number of processors in the system. The complexity of a multiported memory interface restricts the number of ports that can be made available to any single memory module. This, therefore, precludes the use of such a memory scheme in the case of SM3, where we expect the system to typically have hundreds of processors. In the second model, many parallel buses are required to connect the memory modules to processors. In SM3, all the SM modules are connected to two system-wide parallel buses. As long as only two buses are used in model 2), the hardware complexity is equivalent to that of SM3. Using more than two buses increases the bus count and the number of ports required per memory module. In any case, the memory mapping employed in this scheme does not support efficient broadcast of data in the system. Finally, the third model uses an interconnection network to connect processors to memory.




Fig. 13. Typical shared memory systems: a multiported shared memory, multiple shared modules on parallel buses, and processors and memories connected through an interconnection network.

Such interconnection schemes are typically much more complex than the common bus structure, especially when there is a requirement to support broadcast communication.

All of the above schemes are more complex than the switchable memory scheme, where each SM module is interfaced to three buses via a simple 1-of-3 multiplexer. Also, it should be noted that the SM modules in SM3 are used mainly to support data transfer among nodes. The above memory schemes are very useful and efficient for supporting other functions, e.g., variable sharing, synchronization, support of small-grain parallelism, etc., but in general are much too complex for data transfer purposes. Finally, as in the case of common-bus LAN's, most shared memory schemes do not support dynamic partitioning of the network into independent clusters. In SM3, we permit such partitioning in order to create clusters which, in fact, are clones of the original network.
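To make the contrast with the shared-memory models concrete, the following sketch mimics the 1-n mapping and switch-based attachment described above: the same logical addresses refer to whichever physical SM module the current switch setting selects. The class and the status-word encoding are illustrative assumptions, not the actual SM3 hardware interface.

class SwitchableModuleView:
    """Illustrative model of the 1-n logical-to-physical mapping: one
    logical address range, several candidate SM modules, and a switch
    setting (status word) that selects which module is attached."""

    def __init__(self, modules):
        self.modules = modules        # dict: module id -> bytearray (physical memory)
        self.attached = None          # current switch setting

    def switch_to(self, module_id):
        # In SM3 this corresponds to writing a status word; once switched,
        # the module is accessed exclusively, so no memory contention arises.
        self.attached = module_id

    def read(self, logical_addr):
        return self.modules[self.attached][logical_addr]

    def write(self, logical_addr, value):
        self.modules[self.attached][logical_addr] = value

# Two hypothetical 1-kbyte modules behind the same logical addresses.
view = SwitchableModuleView({"SM0": bytearray(1024), "SM1": bytearray(1024)})
view.switch_to("SM0"); view.write(0, 42)
view.switch_to("SM1"); print(view.read(0))   # 0: same address, different module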

VI. CONCLUSION

The architectural design and implementation of a multicomputer system, called the switchable main memory modules (SM3) system, have been described. The system makes use of main memory switching and network partitioning to enhance system throughput. It consists of a number of stand-alone, multiuser computers which share some memory modules. It has a control computer which is responsible for supervising global queries, maintaining system directories, keeping control over the switches in the system, etc. The network can be dynamically partitioned into independent clusters which work on single queries/commands, thereby creating an MIMD structure at the global (system) level. Data transfers inside a cluster and between processors and the CC are enhanced by the use of switchable memories. Data are written into a memory module and the module is switched between processors to achieve data transfer; this is more efficient than data transmission over a bus as in conventional systems. Access to the switchable memories is via a set of status words which reside in a physically distributed, globally shared address space. The network partitioning and memory switching features are implemented using simple hardware switches, the design of which is very straightforward. The network design has been kept as simple as possible in order to provide easy expandability and modularity.

A study of two database operations, SELECT and JOIN, using discrete-event simulation techniques has revealed some interesting results. The execution time in the case of the SELECT operation is more sensitive to the output relation size than to the input size. The simulation of the JOIN operation has shown that the SM module should be logically structured into an output area and a broadcast area. The former is used for transferring results and the latter for broadcasting data within a cluster. The relative sizes of the two areas have a bearing on the performance of the operation.

The study of parallel algorithms for database operations in SM3 has given rise to many problems for future investigation. Issues related to the distribution of data for parallelism, the allocation of clusters/processors to queries, and query processing strategies in multicomputer systems are all promising areas for future research. We are also studying the feasibility of using SM3 for supporting large-scale numeric computations.

REFERENCES

[1] S. R. Ahuja and A. Asthana, "A multi-microprocessor architecture with hardware support for communication and scheduling," in Proc. Architect. Support Prog. Lang. Operat. Syst., Mar. 1982, pp. 205-209.
[2] B. W. Arden and R. Ginosar, "MP/C: A multiprocessor/computer architecture," in Proc. 8th Int. Conf. Comput. Architect., 1981, pp. 3-19.
[3] H. Auer et al., "RDBM-A relational database machine," Inform. Syst., vol. 6, no. 2, pp. 91-100, 1981.
[4] J. Banerjee, D. K. Hsiao, and K. Kannan, "DBC-A database computer for very large databases," IEEE Trans. Comput., vol. C-28, June 1979.
[5] C. K. Baru and S. Y. W. Su, "Performance of statistical aggregation operations in the SM3 system," in Proc. ACM/SIGMOD Int. Conf. Management Data, June 1984.
[6] C. K. Baru, "A multicomputer approach to non-numeric computation," Ph.D. dissertation, Dep. Elec. Eng., Univ. Florida, Gainesville, Dec. 1985.
[7] C. K. Baru, A. K. Thakore, and S. Y. W. Su, "Matrix multiplication on a multicomputer system with switchable main memory modules," in Proc. IEEE 1st Int. Conf. Supercomput. Syst., Tarpon Springs, FL, Dec. 1985.
[8] H. Boral and D. J. DeWitt, "Design considerations for data-flow database machines," in Proc. ACM/SIGMOD Int. Conf. Management Data, May 1980.
[9] D. J. DeWitt, "DIRECT-A multiprocessor organization for supporting relational database management systems," IEEE Trans. Comput., vol. C-28, pp. 395-406, June 1979.
[10] D. J. DeWitt and P. B. Hawthorn, "A performance evaluation of database machine architectures," in Proc. 7th Int. Conf. Very Large Data Bases, France, Sept. 1981, pp. 199-213.
[11] T. Fei, C. K. Baru, and S. Y. W. Su, "A dynamically partitionable multicomputer system with switchable main memory modules," in Proc. COMDEC, Int. Conf. Comput. Data Eng., Apr. 1984, pp. 42-49.
[12] G. Gardarin, "An introduction to SABRE-A multiprocessor database computer," SIRIUS BIR-I-005, Publication SABRE, Aug. 1980.
[13] P. Hawthorn and M. Stonebraker, "Performance analysis of a relational database management system," in Proc. ACM/SIGMOD Int. Conf. Management Data, May 1979.
[14] P. Hawthorn and D. J. DeWitt, "Performance evaluation of alternative database machines," IEEE Trans. Software Eng., vol. SE-8, no. 1, pp. 61-75, Jan. 1982.
[15] S. I. Kartashev and S. P. Kartashev, "Dynamic architectures: Problems and solutions," Computer, pp. 26-41, July 1978.




[16] W. Kim, "A new way to compute the product and join of relations," in Proc. ACM/SIGMOD Int. Conf. Management Data, May 1980, pp. 179-187.
[17] H. O. Leilich, G. Stiege, and H. C. Zeidler, "A search processor for data management systems," in Proc. 4th Int. Conf. Very Large Data Bases, West Germany, Sept. 1978, pp. 280-287.
[18] R. M. Metcalfe and D. R. Boggs, "Ethernet: Distributed packet switching for local computer networks," Commun. ACM, vol. 19, pp. 395-403, July 1976.
[19] M. Missikoff and M. Terranova, "An overview of the project DBMAC for a relational database machine," IASI-CNR, Rome, Italy, Tech. Rep., 1982.
[20] M. Missikoff, "A domain based internal schema for relational database machines," in Proc. ACM/SIGMOD Int. Conf. Management Data, June 1982.
[21] D. O. Nickens, T. B. Genduso, and S. Y. W. Su, "The architecture and hardware implementation of a prototype MICRONET," in Proc. 5th Int. Conf. Local Comput. Networks, Oct. 1980.
[22] S. Y. W. Su et al., "MICRONET-A microcomputer network system for managing distributed relational databases," in Proc. 4th Int. Conf. Very Large Data Bases, West Germany, Sept. 1978.
[23] S. Y. W. Su, "A microcomputer network system for distributed relational databases: Design, implementation, and analysis," J. Telecommun. Networks, vol. 3, no. 2, Fall 1983.
[24] S. Y. W. Su and C. K. Baru, "Dynamically partitionable multicomputers with switchable memory," J. Parallel Distrib. Comput., vol. 1, no. 2, Nov. 1984.
[25] R. J. Swan, S. H. Fuller, and D. P. Siewiorek, "Cm*-A modular multimicroprocessor," in Proc. Nat. Comput. Conf., 1977, pp. 637-644.
[26] K. J. Thurber and H. A. Freeman, Tutorial on Local Computer Networks, 2nd ed. Los Alamitos, CA: IEEE Computer Society Press, 1981.
[27] P. Valduriez, "Semi-join algorithms for multiprocessor systems," in Proc. ACM/SIGMOD Int. Conf. Management Data, June 1982, pp. 225-233.
[28] S. B. Yao, F. Tong, and Y. Z. Sheng, "The system architecture of a database machine (DBM)," IEEE Database Eng., vol. 4, Dec. 1981.

Chaitanya K. Baru (S'83-S'84-M'85) received the B.Tech. degree from the Indian Institute of Technology, Madras, in 1979 and the M.E. and Ph.D. degrees from the University of Florida, Gainesville, in 1983 and 1985, respectively, all in electrical engineering.

Since September 1985, he has been an Assistant Professor in the Department of Electrical Engineering and Computer Science at the University of Michigan, Ann Arbor. His research interests include database machines, parallel database processing, multicomputer architectures, and database management systems.

Stanley Y. W. Su received the M.S. and Ph.D. degrees in computer science from the University of Wisconsin, Madison, in 1965 and 1968, respectively.

He is a Professor of Computer and Information Sciences and of Electrical Engineering, and is the Director of the Database Systems Research and Development Center, University of Florida, Gainesville. He was one of the founding members of the IEEE Computer Society's Technical Committee on Database Engineering. He served as the Co-chairman of the Second Workshop on Computer Architecture for Nonnumeric Processing in 1976, the Program Chairman and organizer of the Workshop on Database Management in Taiwan in 1977, the U.S. Conference Chairman of the Fifth VLDB Conference in 1979, and the General Chairman of the ACM SIGMOD International Conference on Management of Data in 1982. He is an Editor of the IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, the International Journal on Computer Languages, and Information Sciences, and an Area Editor for the Journal of Parallel and Distributed Computing.
