
Parallel algorithms based on expander graphs for optical computing

Ramamohan Paturi, Dau-Tsuong Lu, Joseph E. Ford, Sadik C. Esener, and Sing H. Lee

We consider the task of interconnecting processors to realize efficient parallel algorithms. We propose interconnecting processors using certain graphs called expander graphs, which can provide fast communication from any group of processors to the rest of the network. We show that these interconnections would result in a number of efficient parallel algorithms for sorting, routing, associative memory, and fault-tolerance networks. As the interconnections based on expander graphs are global and irregular, we reason that optical interconnections are preferred to electronic and propose implementation of these interconnections using the programmable optoelectronic multiprocessor architecture. Key words: Optical interconnection, optical computing, expander graphs.

I. Introduction

To cope with the ever increasing demand on computing power, it is not enough to rely on faster device technology. It is necessary to utilize parallel processing. Assuming that the communication overhead is small and that the algorithm can be fully parallelized, a task requiring T sequential time steps can be performed in T/p steps by distributing the task among p processors. These two assumptions are the most important considerations in designing efficient parallel algorithms. We consider the design and implementation of such algorithms. More specifically, we investigate the interconnection properties of very large scale processor networks necessary to support efficient parallel algorithms. We find certain interconnection networks called expander graphs very useful for this purpose. Interconnections based on expander graphs can achieve global communication in constant time. This property of expander graphs is successfully exploited in the design of several efficient parallel algorithms.1-10

However, it has been unclear how to construct and implement good expander graphs.

We take the position that the interconnection networks based on expander graphs are the key to implementing significantly efficient parallel algorithms.

All authors are with University of California, San Diego, La Jolla, California 92093; R. Paturi is in the Department of Computer Science & Engineering, the other authors are in the Department of Electrical & Computer Engineering.

Received 3 January 1990. 0003-6935/91/080917-11$05.00/0. © 1991 Optical Society of America.

To this end, we consider the design and the construction of expander graphs. We describe a probabilistic approach to construct and evaluate good expander graphs. We then try to convince the reader that expander graphs can indeed result in efficient algorithms in a variety of situations. We then discuss an optoelectronic implementation of these interconnection networks which combines the optical interconnection technology with very large scale integration (VLSI) technology,11 thus overcoming the difficulties encountered with pure VLSI technology.

In Sec. II we explain the definition and theory of expander graphs. Using this theory, we present a probabilistic approach to the construction of expander graphs. In Sec. III we show that expander graphs give rise to efficient parallel algorithms in a number of application domains. In particular, we explain how expander graphs can be used to construct approximate halvers, the basic building block of an optimal sorting algorithm. We show how expanders can result in a lower delay in routing applications. We also describe applications in associative memory, object distribution, fault-tolerant networks, and error correcting codes. In Sec. IV, we discuss approaches to implementing irregular interconnections and propose an optoelectronic system. Finally, in Sec. V, we discuss our conclusions and suggest future research directions.

II. Expander Graphs

Efficient parallel algorithms rely on the fast transfer of information within the processor network.12 Although a fully interconnected crossbar network can accomplish such communication in a single time step, the number and length of interconnections required ($n^2$ for n nodes) make the implementation of such large

10 March 1991 / Vol. 30, No. 8 / APPLIED OPTICS 917

Fig. 1. Communication distance on a hypercube with n nodes. The nodes are denoted by (log n)-digit binary addresses. For n = 8, the longest communication distance is between (000) and (111). If information needs to be transmitted from the group of nodes in the shaded region to the remaining nodes, it requires $\frac{1}{2}\log n - c(\log n)^{1/2}$ steps, which grows with log n.

scale networks prohibitively expensive. Often we are limited to networks with a smaller number of interconnections per processor. An example of such a network is the hypercube, with $\log_2 n$ interconnections per processor. Two processors are connected if the Hamming distance between their $\log_2 n$-bit addresses is 1. The hypercube and its variants, such as the shuffle exchange and cube connected cycles, have become popular because of their relative ease of implementation and because a number of algorithms have been implemented on them with satisfactory performance. Such networks cannot give us optimal parallel algorithms, however, since they lack sufficient connectivity to facilitate fast parallel communication. For example, consider a communication task in which some small group, say 10%, of the processors have information which they need to transmit to the entire network. Assuming we have no control over the initial information distribution among the processors, any group of 10% of the processors must be considered. Clearly, a crossbar network can accomplish this task in one step but is technologically expensive. A hypercube can also accomplish this task but requires $O(\log n)$ steps, which scales with the size of the network (Fig. 1). We need a network which can interconnect an arbitrary number of processors in a constant number of steps.
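As a rough illustration of the communication task above, the following Python sketch builds hypercube neighbors by flipping one address bit and counts the synchronous steps needed for the information held by a starting group to reach every node. The function names and the choice of starting group (the nodes of lowest Hamming weight, in the spirit of the shaded region of Fig. 1) are our assumptions, not part of the original experiments.

```python
def hypercube_neighbors(node, bits):
    """Nodes whose binary address differs from `node` in exactly one bit."""
    return [node ^ (1 << b) for b in range(bits)]


def broadcast_steps(group, bits):
    """Synchronous steps until every node has received information from `group`."""
    informed = set(group)
    total = 1 << bits
    steps = 0
    while len(informed) < total:
        frontier = set()
        for u in informed:
            frontier.update(hypercube_neighbors(u, bits))
        informed |= frontier
        steps += 1
    return steps


def low_weight_group(bits, fraction):
    """Hypothetical unfavorable group: the nodes with the fewest 1 bits."""
    nodes = sorted(range(1 << bits), key=lambda v: (bin(v).count("1"), v))
    size = max(1, int(fraction * (1 << bits)))
    return nodes[:size]
```

Calling `broadcast_steps(low_weight_group(bits, 0.1), bits)` for increasing `bits` shows the step count growing with the network size, which is the behavior an expander-based network is meant to avoid.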

Interconnection networks based on expander graphs can provide the solution. They can accomplish this task in a number of steps dependent only on the fractional size of the group and the graph's expansion factor. The number of steps is independent of network size. We call this property global communication with a constant number of steps. It has been shown that for any given expansion there exist expander graphs with a constant number of fanouts per node (see Theorem 1 of Sec. II.B).13 This global communication ability is of fundamental importance for interconnection networks because a number of optimal algorithms can be developed based on such networks. In fact, there is some theoretical evidence that this property is indispensable for designing optimal algorithms.1,2 One such algorithm is an $O(\log n)$ routing algorithm which can be used as a basis for a general purpose parallel computer.2

A. Definition of Expander Graphs

Expander graphs are defined in terms of their properties. For convenience, we describe them as bipartite graphs, i.e., graphs with connections between two discrete regions. A bipartite graph G = (I,O,E) has a set I of input nodes and a set O of output nodes with E as the set of edges between input and output nodes. We consider only those bipartite graphs for which $|I| = |O|$ and define $|I| = |O| = n$. Edge $(i,j) \in E$ connects input node i with output node j. For any subset A of inputs, we define the neighborhood $\Gamma(A) = \{ j \in O \mid (i,j) \in E \text{ for some } i \in A \}$. The same applies to any subset of outputs, with its neighborhood in I. We also define a bipartite graph G to be d-regular if the degree, i.e., fanout, of every node in the graph G equals d.

For $0 < \epsilon < 1$ and $\beta > 1$, a d-regular bipartite graph G = (I,O,E) is called a $(d,\epsilon,\beta)$ expander if, for all $A \subseteq I$ (or $A \subseteq O$) such that $|A| \le \epsilon n$, the neighborhood $\Gamma(A)$ of A in G is such that $|\Gamma(A)| \ge \beta |A|$. In other words, a graph expands if every subset of nodes up to a given size has a large neighborhood. We call $\beta$ the expansion factor of the graph. Typically, we use expander graphs with $\beta = (1-\epsilon)/\epsilon$. In the following subsections, we look at the existence and the construction of expander graphs in greater detail.
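To make the definition concrete, here is a minimal brute-force sketch in Python that tests the condition $|\Gamma(A)| \ge \beta|A|$ for every input subset A with $|A| \le \epsilon n$. The names `neighborhood` and `is_expander` are ours, the graph is given as a list of output-neighbor lists, and only the input side is checked. The cost is exponential in n, so this is only meant for tiny examples.

```python
import math
from itertools import combinations


def neighborhood(adj, subset):
    """Union of the output neighbors of the input nodes in `subset`."""
    out = set()
    for i in subset:
        out.update(adj[i])
    return out


def is_expander(adj, eps, beta):
    """Check |Gamma(A)| >= beta * |A| for all input subsets with |A| <= eps * n."""
    n = len(adj)
    max_size = math.floor(eps * n + 1e-9)  # small tolerance against rounding
    for k in range(1, max_size + 1):
        for subset in combinations(range(n), k):
            if len(neighborhood(adj, subset)) < beta * k:
                return False
    return True


# Toy 3-regular bipartite graph on 6 + 6 nodes: input i connects to outputs i, i+1, i+3 (mod 6).
toy = [[i, (i + 1) % 6, (i + 3) % 6] for i in range(6)]
```

For this toy graph, every pair of inputs reaches at least four outputs, so it passes the test with $\epsilon = 1/3$ and $\beta = 2$ but fails for $\beta = 2.5$.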

B. Properties of Expander Graphs

In an expander graph, the size of the neighborhood of every sufficiently small set is larger than that set by a constant factor. This expanding property gives rise to a number of interesting computation and communication properties. For example, it gives rise to approximate halving in a constant number of steps using compare and exchange operations. It also offers the means of realizing a trade-off between storage capacity for the interconnection patterns and the degree of the network while retaining the error correction, exponential convergence, and robustness properties in associative memory. Furthermore, this property also ensures multiple paths between the processing elements, providing the necessary redundancy for fault-tolerant communication networks. These properties together with applications are presented in further detail in Sec. III.

The large neighborhood criterion is a strong requirement. Consequently, even the proof of existence of such graphs is nontrivial. Such a proof is in fact provided by a nonconstructive (probabilistic) argument. In this argument, we look at the set of all d-regular bipartite graphs with each side having n nodes together with a uniform distribution on it. We then show that the fraction of these graphs that fails to have the given expanding property is <1. To make this fraction small, we select a suitable d as a function of the expansion. Since the fraction of graphs which does not have the required expanding property is <1, certain graphs of degree d must exist that meet the given expansion requirements. This result is stated more precisely in the following theorem.13

Theorem 1: Suppose $0 < \epsilon < \epsilon/(1-\epsilon) < 1$. Let d be an integer satisfying

$$d > \frac{H(\epsilon) + H(1-\epsilon)}{H(\epsilon) - (1-\epsilon)\,H[\epsilon/(1-\epsilon)]}, \qquad (1)$$

where $H(x) = -x \log_2 x - (1-x)\log_2(1-x)$ is the binary entropy function. Let I and O be two sets of vertices, $|I| = |O| = n$, and let G be a random d-regular bipartite graph on the classes of vertices I and O, obtained by choosing randomly d permutations from I to O. Then, with probability approaching 1 as n tends to $\infty$, G is a $[d,\epsilon,(1-\epsilon)/\epsilon]$ expander.
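The degree bound of Theorem 1 is easy to evaluate numerically. The sketch below assumes that Eq. (1) reads $d > [H(\epsilon) + H(1-\epsilon)] / \{H(\epsilon) - (1-\epsilon)H[\epsilon/(1-\epsilon)]\}$; this reading of the printed equation, and the function names, are our assumptions.

```python
import math


def binary_entropy(x):
    """H(x) = -x log2 x - (1 - x) log2 (1 - x), with H(0) = H(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)


def min_degree(eps):
    """Smallest integer d strictly above the bound of Eq. (1), as read above."""
    if not 0.0 < eps < 0.5:
        raise ValueError("Theorem 1 needs 0 < eps < 1/2")
    numerator = binary_entropy(eps) + binary_entropy(1.0 - eps)
    denominator = binary_entropy(eps) - (1.0 - eps) * binary_entropy(eps / (1.0 - eps))
    return math.floor(numerator / denominator) + 1
```

Under this reading, $\epsilon = 1/3$ (expansion $\beta = 2$) gives a bound of about 7.3, so `min_degree(1 / 3)` returns 8, independent of n.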

Notice that we allow multiple edges here. This theorem guarantees expander graphs of any given expansion whose degree is bounded as a function of $\epsilon$ as in the above equation. Notice that the degree bound is independent of n. The problem is that this probabilistic argument does not give us a clue as to how to construct such expander graphs explicitly. Also, it seems that the explicit construction of expander graphs is difficult due to our requirement that every subset of nodes have a large neighborhood. Although there are some explicit constructions of expander graphs,14,15 none of these constructions offers high expansion with a small degree bound. On the other hand, the random construction shows that the degree need not be higher than $-\log\epsilon/\epsilon$.3 Also, the probabilistic argument shows that for a suitably chosen d and for all large n, almost all the d-regular bipartite graphs will have the required expansion. This suggests that we should use randomly generated d-regular graphs. Such an approach would be successful only when we have the means to determine if a given d-regular bipartite graph has the necessary expansion. Computing the expansion of a bipartite graph is a co-NP complete problem,16 but estimating a lower bound on the expansion is not difficult, as shown by Tanner.17 We only need to compute certain eigenvalues of the incidence matrix of the expander graph for this estimation. The precise result of Tanner is given below.

Let G = (I,O,E) be a d-regular bipartite graph. Let M be the real valued incidence matrix of the bipartite graph: $M = [m_{ij}]$, $m_{ij} = 1$ if the ith node in I is connected to the jth node in O and zero otherwise. Since $MM^T$ is a real symmetric non-negative definite matrix, it is diagonalizable and has real non-negative eigenvalues and orthogonal eigenvectors. Let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$ be the ordered eigenvalues of $MM^T$. Note that for d-regular graphs $\lambda_1 = d^2$. Then, the following theorem can be used to find a lower bound on the expansion of G.17

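Theorem 2: If $\lambda_1 > \lambda_2$, then for any $\epsilon < 1$, G is a $(d,\epsilon,\beta)$ expander with

$$\beta \ge \frac{d^2}{\epsilon\,(d^2 - \lambda_2) + \lambda_2}. \qquad (2)$$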
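A minimal sketch of how Theorem 2 can be evaluated, assuming the bound reads $\beta \ge d^2 / [\epsilon(d^2 - \lambda_2) + \lambda_2]$. The paper uses standard numerical eigenvalue routines; here we substitute a plain power iteration restricted to the subspace orthogonal to the all-ones vector (the eigenvector of $\lambda_1 = d^2$ when every row and column of M sums to d). The Rayleigh quotient it returns approaches $\lambda_2$ from below, so an unconverged run makes the bound look better than it is. All names are ours.

```python
import math
import random


def second_eigenvalue(M, iters=500, seed=0):
    """Approximate lambda_2 of M M^T by power iteration orthogonal to the all-ones vector."""
    n = len(M)
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    lam = 0.0
    for _ in range(iters):
        mean = sum(v) / n
        v = [x - mean for x in v]                  # project out the all-ones direction
        norm = math.sqrt(sum(x * x for x in v))
        if norm < 1e-12:                           # nothing left: lambda_2 is numerically zero
            return 0.0
        v = [x / norm for x in v]
        u = [sum(M[i][j] * v[i] for i in range(n)) for j in range(n)]  # u = M^T v
        w = [sum(M[i][j] * u[j] for j in range(n)) for i in range(n)]  # w = M u
        lam = sum(a * b for a, b in zip(v, w))     # Rayleigh quotient of the unit vector v
        v = w
    return lam


def tanner_bound(d, lam2, eps):
    """Lower bound on the expansion factor from Eq. (2), as read above."""
    return d * d / (eps * (d * d - lam2) + lam2)
```

When $\lambda_2 = 0$ (for example the complete bipartite graph) the bound reduces to $1/\epsilon$, and when $\lambda_2 = d^2$ it degenerates to 1, i.e., no guaranteed expansion.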

As the computation of eigenvalues of a matrix is relatively easy, this theorem provides an efficient tool for estimating the expansion. However, such a probabilistic approach gives rise to very irregular interconnection networks which create severe routing difficulties for VLSI implementation. Our investigation of a suitable implementation technology revealed free space optical interconnection as the preferred choice. In addition, optical interconnection is also superior for such global interconnection from both power and speed considerations. In Sec. IV we present the specific optical interconnection techniques to realize expander graph interconnections. In Sec. II.C we give our method for constructing expander graphs with a given expansion using the theoretical results mentioned here.

C. Probabilistic Construction

To generate an expander graph with a given expansion factor, the three primary tasks are (a) generating random d-regular graphs with a given degree d, (b) estimating their expansion by applying Theorem 2, and (c) selecting the graph with the best expansion over a large number of iterations. We present this algorithm below in a step-by-step approach, assuming that n is the number of input or output nodes on each side of the bipartite graphs and $\beta = (1-\epsilon)/\epsilon$ is the expansion:

Step 1: Generate Random d-Regular Graph.

We use a random number generator to first generate a random permutation of the first n integers. This random permutation will be interpreted as a one-to-one connection between the input and output nodes. Then we select d using Eq. (1). This d is the minimum degree required to achieve the given expansion. We generate d random permutations and construct the n × n incidence matrix of the corresponding d-regular graph. Theorem 1 guarantees that there exist expander graphs with the given expansion, provided that we select d according to Eq. (1) for the desired expansion.
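Step 1 can be sketched in a few lines of Python. The function name is ours, and we make one assumption explicit: because Theorem 1 allows multiple edges, the matrix entry counts how many of the d permutations connect input i to output j, rather than being strictly 0 or 1.

```python
import random


def random_regular_bipartite(n, d, seed=None):
    """Incidence matrix of a random d-regular bipartite graph built from d permutations."""
    rng = random.Random(seed)
    M = [[0] * n for _ in range(n)]
    for _ in range(d):
        perm = list(range(n))
        rng.shuffle(perm)          # one random one-to-one connection from I to O
        for i, j in enumerate(perm):
            M[i][j] += 1           # multiple edges are counted, not merged
    return M
```

Each permutation contributes exactly one edge to every input and every output, so every row and every column of the result sums to d.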

Step 2: Estimate the Expansion Using Theorem 2.

We use standard numerical routines to compute the eigenvalues of this matrix. These eigenvalues together with d and $\epsilon$ are substituted into Eq. (2) to obtain a lower bound on the expansion of the graph.

Step 3: Selecting the Best Graph.

Steps 1 and 2 are repeated for many iterations. We select the graph with the largest expansion.

Figure 2 graphically illustrates this algorithm. Using this algorithm, for various values of n and d, we found the network with the least second largest eigenvalue and computed its expansion using Theorem 2. We then plotted the relationship between the expansion $\beta$ [for $\epsilon = 1/(1+\beta)$] and the degree d of the network for various values of the number of vertices n (Fig. 3). From this figure, it is clear that the relationship between the expansion and the degree is largely independent of the size n of the network for larger values of n (n > 128). The discrepancy for smaller values of n can be explained by the fact that the theoretical results3 are asymptotic. Hence we demonstrated that, for a given expansion $\beta$, the degree of the network does not grow with the number of vertices.

Fig. 2. Flow of the probabilistic construction algorithm.

Fig. 3. Plot of the expansion factor against degree. See that $\beta$ increases monotonically with d independent of n. [Plot: expansion factor (beta) versus degree (d), for N = 32, 64, 128, 256, and 1024.]
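The pairing of $\beta$ with $\epsilon = 1/(1+\beta)$ can be made explicit with a short derivation. It assumes that Eq. (2) reads $\beta = d^2/[\epsilon(d^2-\lambda_2)+\lambda_2]$ and that the bound is taken with equality; both are our assumptions, offered only as a sketch of why selecting the graph with the least second largest eigenvalue maximizes the estimated expansion.

```latex
\begin{align*}
\beta\left[\frac{d^2-\lambda_2}{1+\beta} + \lambda_2\right] &= d^2
  && \text{substitute } \epsilon = \tfrac{1}{1+\beta} \\
\beta\,(d^2-\lambda_2) + \beta(1+\beta)\,\lambda_2 &= (1+\beta)\,d^2
  && \text{multiply by } 1+\beta \\
\beta^2 \lambda_2 &= d^2
  && \text{cancel } \beta d^2 \text{ and } \beta\lambda_2 \\
\beta &= \frac{d}{\sqrt{\lambda_2}}, \qquad
\epsilon = \frac{\sqrt{\lambda_2}}{d+\sqrt{\lambda_2}}
\end{align*}
```

Under these assumptions the estimated expansion depends on the graph only through $d/\sqrt{\lambda_2}$.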

Our initial experiments were conducted for values of n up to 1024. The incidence matrix is stored in the full storage mode. This approach will work for n up to 5000 using a Cray Y-MP with 32 million memory words. For larger values of n, n >> d, the incidence matrix is sparse. We also recall that it is a real symmetric non-negative definite matrix. Consequently we can use skyline methods to cut down dramatically the storage requirement for large incidence matrices.18
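The storage figures above follow from simple arithmetic, sketched below. The sparse estimate assumes a plain adjacency-list layout with one word per edge; this is a simpler stand-in for the skyline storage mentioned in the text, not a model of it, and the function names are ours.

```python
def full_storage_words(n):
    """Words needed to store the n x n incidence matrix in full storage mode."""
    return n * n


def adjacency_list_words(n, d):
    """Words needed if only the d output indices of each input node are stored."""
    return n * d


MEMORY_WORDS = 32_000_000  # memory size quoted in the text
```

With these definitions, n = 5000 needs 25 million words in full storage and fits in the quoted memory, while n = 6000 does not; the adjacency-list count grows only linearly in n for fixed d.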

III. Applications of Expander Graphs

In this section, we give a few applications where the connectivity of expander graphs is successfully exploited to yield fast parallel algorithms and efficient designs.

A. Parallel Sorting

It has been a long-standing problem to find an optimal sorting network with $O(\log n)$ stages. It is easy to see that we need at least log n stages of comparators with each stage performing $O(n)$ comparisons, since we have a lower bound of n log n on the total number of comparisons required for sorting. The credit for discovering an optimal algorithm goes to Ajtai, Komlós, and Szemerédi (AKS), who came up with an $O(\log n)$ stage sorting network, thereby matching the lower bound to within a constant factor.4 The basic idea of the AKS algorithm is to halve recursively the given sequence of numbers. Such a recursive halving requires $O(\log n)$ stages. Unlike a naive recursive halving scheme in which it would take $O(\log^2 n)$ steps (since each exact halving requires log n time), the novelty of the AKS sorting network is that it uses only approximate halving instead of exact halving. They handle these approximately halved sequences using an efficient error-management scheme to obtain the sorted sequence. Such an approach can result in an $O(\log n)$ algorithm provided approximate halving can be done quickly. This is how expander graphs enter the picture. It turns out that we can do approximate halving in constant time using expander graphs. We discuss the relationship between approximate halving and expander graphs in greater detail.

The idea is that we can use bipartite graphs to model the computation of an approximate halver network. The nodes in the bipartite graph represent the wires in the comparator network. We partition the wires into two groups of equal sizes with the node sets I and O corresponding to these two groups so that, at the end of the computation, most of the elements of the lower (higher) half of the inputs end up in the nodes of I (O). Hence we can assume that compare-and-exchange operations are only made between the nodes from different parts of the graph. Each such compare-and-exchange operation can be denoted by an edge in the bipartite graph. In this model, we discard the timing information and collapse the stages. A comparator network of depth d would then be modeled by a bipartite graph, each of whose nodes has degree d (counting multiplicities of edges). Also, given a d-regular bipartite graph, one can devise a comparator network with d stages that performs the same compare-and-exchange operations in some order. Note that the algorithms we design will be insensitive to the order of these comparisons.

Fig. 4. Unfolded view of an n = 32, d = 9 expander graph, with $\epsilon \le 0.33$ and expansion factor $\ge 2.0$.
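The correspondence between a d-regular bipartite graph and a depth-d comparator network can be sketched as follows. We assume the graph is given as d permutations (as in the random construction of Sec. II.C): each permutation is a perfect matching between the two blocks, so its n comparators touch disjoint wires and form one parallel stage. The function name is ours.

```python
def apply_stages(values, perms):
    """Run a comparator network on 2n values.

    values[:n] are the wires of block I (lower), values[n:] those of block O (higher).
    Each permutation in `perms` is one stage: wire i of I is compared with wire
    perm[i] of O, and the smaller element is kept in I.
    """
    n = len(values) // 2
    lower, upper = list(values[:n]), list(values[n:])
    for perm in perms:
        for i, j in enumerate(perm):
            if lower[i] > upper[j]:
                lower[i], upper[j] = upper[j], lower[i]
    return lower, upper
```

The inner loop is written sequentially here, but within one stage the comparisons are independent, so the parallel depth equals the number of permutations.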

We now define an $\epsilon$-halver as a comparator network which takes the inputs $a_1, a_2, \ldots, a_{2n}$ and produces two blocks (lower and higher) of outputs of equal length n. The idea is that the lower block contains all but an $\epsilon$ fraction of the n small elements of the input, and the higher block contains all but an $\epsilon$ fraction of the n large elements of the input. In fact, an $\epsilon$-halver satisfies a stricter condition. An $\epsilon$-halver has the property that, for any inputs and $k \le n$, the number of elements from the k smallest elements of the input which are output in the higher block is $\le \epsilon k$. Similarly, the number of elements from the k largest elements of the input that are output in the lower block is $\le \epsilon k$. In other words, the elements that are output in the wrong block are distributed evenly in their respective halves.
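The stricter condition can be checked directly on a pair of output blocks. The following Python sketch (our own helper, assuming distinct input values) tests, for every $k \le n$, that at most $\epsilon k$ of the k smallest elements sit in the higher block and at most $\epsilon k$ of the k largest sit in the lower block.

```python
def satisfies_halver_property(lower, upper, eps):
    """Check the epsilon-halver condition on one output (distinct values assumed)."""
    n = len(lower)
    rank = {v: r for r, v in enumerate(sorted(lower + upper))}
    lower_ranks = [rank[v] for v in lower]
    upper_ranks = [rank[v] for v in upper]
    for k in range(1, n + 1):
        small_in_upper = sum(1 for r in upper_ranks if r < k)
        large_in_lower = sum(1 for r in lower_ranks if r >= 2 * n - k)
        if small_in_upper > eps * k or large_in_lower > eps * k:
            return False
    return True
```

For example, the output `([1, 2, 3, 4], [0, 5, 6, 7])` has only one misplaced element per block, yet it fails for any $\epsilon < 1$ because the misplaced element is the smallest one (k = 1). This is what distinguishes the stricter condition from a simple count of errors.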

It turns out that the bipartite graphs that can be used as an $\epsilon$-halver are expander graphs. The close connection between expander graphs and $\epsilon$-halvers is established by the following fact:

Fact 1: The comparator network whose compare-and-exchange operations are given by a $[d,\epsilon,(1-\epsilon)/\epsilon]$ expander is an $\epsilon$-halver with depth d.

Since the depth in our comparator networks corresponds to parallel time, our goal would then be to design $\epsilon$-halvers with minimal depth for any given $\epsilon$ and n. From Fact 1, it can be seen that this goal is equivalent to finding expander graphs of a given expansion with the minimal degree. The significant fact is that Theorem 1 implies that, for any given $\epsilon$, we can find an $\epsilon$-halver network whose depth is independent of the size of the network. The depth is determined only by $\epsilon$. It is the existence of such bipartite graphs that makes efficient parallel algorithms possible.

In Sec. II.C we constructed several expander graphs of varying n and d and estimated their expansion $\beta$. The $\epsilon$ can be easily computed from the relation $\epsilon = 1/(1+\beta)$ (see Fig. 4). In other words, when we use the expander graphs that we generated as approximate halvers, we can guarantee that the number of elements in the wrong half is at most $\epsilon n$ (see Fig. 5). This is the maximum error we have to consider when designing an error-management scheme to achieve the AKS sorting algorithm. Note that the expansions computed for Fig. 3 are lower bound estimates; therefore, the $\epsilon$ obtained would be upper bound estimates (see Fig. 6). To see how these expander graphs perform as $\epsilon$-halvers, we used them to halve 1000 random permutations of n distinct integers and recorded the number of integers that ended up in the wrong halves. The highest error count was divided by n to give an empirical estimate of $\epsilon$ (see Fig. 7). This empirical $\epsilon$ is significantly less than the upper bound estimated, confirming, together with Fig. 5, that $\epsilon$-halvers are very efficient for a large majority of the input.

Fig. 5. Error distribution and cumulative error for an n = 32, d = 5 $\epsilon$-halver. See that 98% of all possible inputs result in two or fewer errors. The possibility of five errors is only 0.000025%. [Plot: percentage of input (percentage error distribution and cumulative error) versus number of misclassifications after halving.]

Fig. 6. Upper bound estimates of $\epsilon$ for the expander graphs in Fig. 3. [Plot: theoretical epsilon upper bound versus degree (d), for N = 32, 64, 128, 256, and 1024.]

Fig. 7. Empirical $\epsilon$ obtained by halving 1000 permutations of n distinct integers. [Plot: amount of misclassification (epsilon) versus degree (d), for N = 32, 64, 128, 256, and 1024.]
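The empirical test described above can be imitated with a short simulation. This is only a sketch under our own assumptions: the graph is a single random draw of d permutations (no selection by eigenvalue), each trial halves a random permutation of 2n distinct integers, and the error count is the number of large elements left in the lower block (by symmetry, the same number of small elements is left in the higher block). Function names are ours.

```python
import random


def halve(values, perms):
    """Apply the comparator stages given by `perms` to 2n values; return both blocks."""
    n = len(values) // 2
    lower, upper = list(values[:n]), list(values[n:])
    for perm in perms:
        for i, j in enumerate(perm):
            if lower[i] > upper[j]:
                lower[i], upper[j] = upper[j], lower[i]
    return lower, upper


def empirical_epsilon(n, d, trials=1000, seed=0):
    """Worst observed fraction of misplaced elements over random inputs."""
    rng = random.Random(seed)
    perms = [rng.sample(range(n), n) for _ in range(d)]
    worst = 0
    for _ in range(trials):
        values = rng.sample(range(2 * n), 2 * n)   # random permutation of 0 .. 2n-1
        lower, _ = halve(values, perms)
        errors = sum(1 for v in lower if v >= n)   # large elements stuck in the lower block
        worst = max(worst, errors)
    return worst / n
```

Running `empirical_epsilon(n, d)` for a range of d gives a curve comparable in spirit to Fig. 7, although the exact values depend on the random graph drawn.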

In addition to the sorting algorithm, researchers have developed optimal algorithms for other related problems, e.g., finding the maximum and median.5,6 The problem with all these algorithms is that they are only optimal in an asymptotic sense. This means that these algorithms can perform better only when we consider problems of very large sizes, e.g., sorting billions of numbers.

For smaller problem sizes, other existing algorithms would be more efficient. Even if the AKS algorithm is not presently competitive for practical problem sizes, one can, with improved algorithm analysis techniques and advances in the theory of parallel algorithms, hope for the development of parallel algorithms that are optimal for practical problem sizes.3 The technological feasibility of irregular interconnections would give impetus for the development of better parallel algorithms.

B. Routing

One of the principal and immediate applications of expander graphs is for constructing efficient routing networks. It has been shown by Valiant19 that if a message is routed to a random destination and then to its real destination, the delay can be made proportional to the diameter of the network. One intuitive reason for smaller delays is that random destination routing would tend to distribute the packets evenly across all the edges in the network, thereby minimizing the traffic on each edge. It turns out that one can dispense with random destination routing and achieve the same effect by using expander graphs. This was shown by Upfal.2 Recently, Leighton and Maggs showed that they can achieve significantly smaller delays and higher fault tolerance by augmenting a butterfly network with an expanding graph. In particular, they showed that such an augmented butterfly is better than even a dilated butterfly, which has the same amount of hardware. These results suggest that expander graphs will play a significant role in the development of parallel computers.
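The two-phase random destination idea can be sketched on a hypercube. The choice of the hypercube, the bit-fixing rule used within each phase, and the function names are our assumptions for illustration; the sketch only builds the path of a single message and does not model queueing delay.

```python
import random


def bit_fixing_path(src, dst, bits):
    """Path from src to dst that corrects differing address bits from low to high."""
    path = [src]
    cur = src
    for b in range(bits):
        if ((cur ^ dst) >> b) & 1:
            cur ^= 1 << b
            path.append(cur)
    return path


def two_phase_path(src, dst, bits, rng):
    """Route to a uniformly random intermediate node, then on to the real destination."""
    mid = rng.randrange(1 << bits)
    first = bit_fixing_path(src, mid, bits)
    second = bit_fixing_path(mid, dst, bits)
    return first + second[1:]   # drop the duplicated intermediate node
```

Each phase uses at most `bits` hops, so the path length stays within twice the network diameter, while the random intermediate node spreads traffic across the edges.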

C. Associative Memory

Associative memory is the ability to recall data given partial information. One of the well understood models of associative memory is that of Hopfield.20 This model assumes a fully interconnected network of neurons. Information is stored in this system by adjusting the weights of the interconnections. This model has a number of remarkable properties, which include error correction, exponential convergence, and robustness with respect to errors in the weights. However, in practice it is hard to implement such a network since the number of interconnections grows quadratically with the number of neurons. To retain many of the nice properties exhibited by the Hopfield model, Komlós and Paturi7 have shown that one can use certain sparse networks, which should have global communication properties similar to those of expander graphs. Nonexpanding networks like the hypercube would lack the error-correction properties of the Hopfield model. In essence, expander graphs would give us the means to realize a trade-off between the storage capacity and the degree of the network while retaining the error correction, exponential convergence, and robustness properties. This shows that the connectivity provided by expander graph interconnections is versatile. We generated a 256-node expander graph with d = 20 and used it for interconnecting 256 neurons in a simple Hopfield network. Using <10% of the interconnections required by a crossbar, we stored two 16 × 16 binary images and demonstrated the error-correction property for 20% random error (Fig. 8).
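A sparsely connected Hopfield-style memory can be sketched as below. This is a simplified stand-in, not the experiment of Fig. 8: each neuron listens to d randomly chosen other neurons (rather than to the neighbors given by a selected expander graph), weights follow the usual Hebbian rule restricted to those links, and updates are synchronous. All names are ours.

```python
import random


def random_neighbors(n, d, rng):
    """For each neuron, d distinct other neurons it receives input from."""
    return [rng.sample([j for j in range(n) if j != i], d) for i in range(n)]


def train(patterns, neighbors):
    """Hebbian weights on the sparse links only: w[i][j] = sum over patterns of p[i] * p[j]."""
    return [{j: sum(p[i] * p[j] for p in patterns) for j in nbrs}
            for i, nbrs in enumerate(neighbors)]


def recall(weights, state, steps=5):
    """Synchronous threshold updates; ties are resolved to +1."""
    state = list(state)
    for _ in range(steps):
        state = [1 if sum(w * state[j] for j, w in row.items()) >= 0 else -1
                 for row in weights]
    return state
```

With a single stored pattern, the pattern is always a fixed point of `recall`, since each neuron receives d times its own stored value. How many random errors are corrected, and how many patterns can be stored, depends on d and on the connectivity, which is the trade-off discussed above.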

D. Object Redistribution

The object distribution algorithm is the central part of Cole and Vishkin's solutions to the $O(\log n)$ time task scheduling problem and the $O(1)$ time processor scheduling problem.8 We have a set of objects representing the tasks to be performed in parallel. The goal is to divide this set of objects into collections of objects with approximately equal sizes so that these collections can be executed in a minimum number of parallel steps. This problem is encountered when the objects are distributed unevenly in the network, and no one processor has access to all the objects. Such an uneven distribution can be made more balanced if the objects are redistributed. The scheduling problems can be solved optimally if the redistribution can be done in constant time. The proposed solution uses an expander graph to interconnect these collections. As we shuffle the objects between pairs of interconnected collections to achieve local balance between them, the global communication property of expander graphs assures more even global object distribution in a constant number of steps.

Fig. 8. Convergence of a 16 × 16 input with 20% random error to one of the two stored images. The 256-neuron Hopfield network is interconnected by an expander graph that uses <10% of the crossbar interconnection. [Panels: Input, Step 2, Step 3.]
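The local balancing step of the redistribution idea can be sketched as follows. We assume, for illustration only, that the collections sit on the two sides of a bipartite graph given by permutations and that a pair of connected collections simply splits its combined objects as evenly as possible; the function name is ours and the sketch tracks object counts, not the objects themselves.

```python
def balance_step(left, right, perm):
    """Pair collection i on one side with collection perm[i] on the other and split evenly."""
    left, right = list(left), list(right)
    for i, j in enumerate(perm):
        total = left[i] + right[j]
        left[i] = total // 2
        right[j] = total - total // 2
    return left, right
```

For example, starting from all 8 objects in one collection, `balance_step([8, 0], [0, 0], [0, 1])` gives `([4, 0], [4, 0])`, and a second step with the permutation `[1, 0]` gives `([2, 2], [2, 2])`. The total number of objects is conserved at every step.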

E. Fault-Tolerant Networks and Error-Correction Codes

Achieving consensus in the presence of faults is a basic problem in distributed computing. Hardware or software faults can prevent a processor from cooperating in the consensus process. In such a case, the goal is to obtain unanimity among the nonfaulty processors. The problem is that faulty processors can prevent communication among the nonfaulty ones. It is also possible that faulty processors can introduce misleading messages into the network. To achieve unanimity, $\Omega(t)$ connectivity is necessary where t is the number of faults to be tolerated. This high connectivity requirement can be relaxed if we are willing to lose some nonfaulty processors and settle for cooperation among the vast majority of the nonfaulty processors. In such a case, one can use expander graphs to interconnect the processors. The communication properties of expander graphs guarantee that we lose only a few nonfaulty processors. Hence expander graphs can be used as good fault-tolerant networks.9

Economic memory storage requires a refresh or restoration mechanism to counteract the accumulation of errors. Such a mechanism must rely on redundancy and voting for restoration. This added computational requirement increases the possibility for device error. Thus the problem of information storage in the presence of noise leads to the problem of computation in the presence of noise. This problem is similar to that of fault-tolerant computing. Here again global communication properties of expander graphs can be used to implement a voting mechanism economically.10

IV. Optoelectronic Implementation

We have now described the irregular interconnection approach to parallel computation and discussed some of its advantages. In this section, we examine the implementation technology. We show that a system combining local electronic computation with global optical communications provides an excellent match to the system requirements. We describe the programmable optoelectronic multiprocessor (POEM) system being developed at UCSD and discuss how it can support expander graphs. Two implementations are discussed, one using fixed computer generated holographic optical interconnections, the other using reconfigurable volume (photorefractive) holographic interconnections.

A. Why Optoelectronics?

Electronic VLSI technology is well established, inexpensive, and reliable. It is excellent for logic operations and local communications, as in a single processing element. However, as the length and density of the communication links increase, the disadvantages of a purely electronic approach become significant. In particular, the irregular global communications described are catastrophic for VLSI. Time delay, energy dissipation, and potential clock skew all grow with increasing length, a problem for global communications. Electronic crosstalk and reliability considerations limit the allowable number of line crossings, making the layout of irregular interconnection links difficult. As a result, the problem of communications becomes critical in chip layout. Valuable silicon real estate is expended on connections, reducing the amount available for processing. For example, in most VLSI chips, 70% of the silicon area is devoted to communications and related tasks, although most of the chip layout time is spent trying to minimize this percentage.

The communications problem is basically topological. In VLSI electronics, all the processors lie in a 2-D plane. As long as the communications between them are also restricted to that plane, there is competition between communications and processing for the same limited area. By introducing free-space optical interconnection, communication links can be taken into the third dimension above the processing plane. There are costs in power, speed, and complexity in converting the electronic signals to optical. However, it has been shown21 that for links longer than a (technology dependent) break-even length $l_c$ the optical link is more efficient in terms of both power and speed. The $l_c$ was calculated using realistic optical and electronic performance parameters and found to be as small as 1-2 mm. This means that optical links are preferred for wafer-scale integration implementation of parallel processors using global interconnections.

Fig. 9. Idealized POEM system, which combines local electronic processing with global optical communication. [Diagram labels: optoelectronic processor arrays, interconnection with control signal, parallel data input/output, clock/control, instructions, processing element, parallel-access optoelectronic memory.]

Based only on power and speed considerations, the optical link is already preferred to the electronic wire for long distance communications, but there are other significant advantages offered by the optoelectronic combination. The area of the chip expended on the long distance wires and any associated electronics (amplifiers, signal boosters, etc.) is made available for processing. The VLSI layout is simplified. Problems arising from clock skew are reduced, since all long distance links have approximately the same length. In addition, there are two potential advantages which come from the physical separation of the interconnect technology from the processing plane: fault tolerance and reconfigurability. VLSI fabrication faults can be corrected after chip testing by selecting the connection links to replace faulty processors with working spares. This reduces production costs without sacrificing efficiency. As a consequence of the optical long distance communication, all processors are effectively adjacent. If the connection can be changed during operation, the system becomes more versatile, efficient, and operation fault tolerant. The type and time scale of reconfiguration are technology dependent, but in general the better the connection pattern matches the problem requirements, the more efficiently the available processing power can be applied to a variety of problems.

B. Programmable Optoelectronic Multiprocessor

The POEM is a generalized system approach to parallel computing derived from these considerations. The POEM system was described in detail in Ref. 22. We briefly describe it here, then discuss its application to parallel processing with irregular interconnection. An idealized POEM system is shown in Fig. 9. The VLSI wafer is divided into optoelectronic processing elements (PEs), which perform computations and local communications electronically. Each PE also has one or more optical detectors and modulators with which it can receive data and control instructions and communicate to other processors. These optoelectronic PEs communicate among each other through electrooptic (EO) modulation of coherent light (generated off-chip). Modulators are preferred to integrated laser sources for reduced on-chip power dissipation, simplified fabrication, and increased reliability. The processing planes can be manufactured using, for example, silicon electronic processors fabricated on transparent EO PLZT substrates (Ref. 23). The system is controlled in single instruction multiple data (SIMD) fashion by a serial host computer, which distributes the clock signal, determines the tasks of the PEs on various wafers, and (for reconfigurable systems) controls the interconnection pattern. Data transfer is bit serial, but computations are made in parallel planes. Interconnections are made using holography (see Sec. IV.C) and may be fixed or reconfigurable depending on the technology and application. The POEM architecture is a generalized approach to optoelectronic processing, describing any system using holographic interconnection of electronic processing arrays. It is intended as a framework to be adapted into specific systems matching the application requirements.

Fig. 10. Unfolded geometry POEM system. Information enters from the left and flows to the right as it is processed.

Fig. 11. Folded geometry POEM system. Information circulates between the two processing planes.

The POEM system can have either an unfolded (Fig. 10) or folded (Fig. 11) geometry. The unfolded system can use fixed interconnects, which can be implemented with thin computer generated holograms (CGHs) or, for certain regular interconnection patterns, with refractive optics. A large number of processing planes are required to perform the computation. As a result, the hardware cost is placed on the processing electronics rather than the interconnection technology. A fixed-interconnection unfolded system can be efficient for some computations, but a more versatile computer will require reconfigurable interconnections. The folded system in Fig. 11 uses reconfigurable interconnections to perform general purpose processing with only two processing planes. Information is transferred back and forth between the planes. The connections can be bidirectional, as shown, or they can be different for the forward and return paths. The reconfigurable interconnects increase the computer's versatility and efficiency at a cost of increased optical system complexity. Using only two processing planes increases hardware utilization but does not support pipelined operations.

The speed and nature of reconfiguration determine which algorithms can be efficiently implemented. Clearly, a system which can update the connections in less than a single clock cycle is ideal, but some computations and algorithms need to update connections relatively infrequently, after many clock cycles. We have found it convenient to categorize reconfigurable interconnection systems according to their range and speed of reconfiguration. A preprogrammed connection system can switch at high speed between a limited set of prerecorded patterns. These patterns must be chosen and stored before operation and can be updated slowly (compared to the computer's run time) if at all. A reprogrammable connection system is completely general; any desired connection pattern can be constructed and implemented at the reconfiguration rate. Finally, an adaptive connection system produces a continuous incremental change in the interconnection pattern in response to the algorithm's needs.

C. Algorithm Implementation on POEM

The technology of interconnection depends heavily on the needs of the algorithms to be implemented. Some highly regular interconnection patterns can be performed using space invariant refractive optics such as lenses, masks, and mirrors. For more general patterns, including the completely irregular connections discussed in this paper, space variance is needed. A promising approach to space variant interconnection is to use holography. Each connection can be stored as a single hologram. The input beam reads the hologram, reconstructing a wavefront propagating toward the desired destinations. Holographic storage is dense and distributed, storing large amounts of information in a defect-tolerant manner. Most important, any desired connection pattern with arbitrary fanout, fanin, and direction can in principle be stored. Holograms are divided into two major types, thick (volume) and thin, according to whether their thickness is large or small compared to the features of the recorded interference pattern (the grating wavelength). In Secs. IV.C.1 and IV.C.2 we describe two POEM systems implementing parallel irregularly connected algorithms. The first system is unfolded, using fixed thin holographic optical interconnects. The second is folded with preprogrammed volume holographic interconnections. We outline the procedure for approximate halving on each system to illustrate operational differences.

Fig. 12. POEM system using a fixed CGH. The EO modulator output from one processing plane is interconnected by the fixed CGH to the next processing plane.

1. Unfolded POEM with Fixed CGH Interconnects

Thin holograms can be used to perform fixed interconnections. They may be fabricated by recording optical interference patterns or computer generated masks in either phase or amplitude. A CGH with submicron features can be written by electron-beam lithography, then etched into glass plates (Ref. 24). Multilevel-phase CGHs can be designed to produce up to 100% diffraction efficiency, although transmission (amplitude modulation) CGHs are much less efficient (Ref. 25).

Figure 12 shows a POEM system which uses a fixed CGH to interconnect a series of parallel optoelectronic processor arrays.

The system shown uses a double pass faceted CGH architecture (Ref. 26) with one facet devoted to each detector and modulator. Coherent light entering the modulators from below is polarization modulated and analyzed. This output is collimated and directed to one or more locations in the next plane by modulator facets. Detector facets focus the incident light into detectors. The area of modulators and detectors is minimized to reduce device capacitance and response time (Ref. 23). Data can be input electronically or in parallel using spatial light modulators (not shown) imaged onto detectors in the processing planes. Assuming a 5 × 5 cm diffraction-limited CGH with 0.5-μm features and 700-nm light, 128 × 128 processor arrays could be interconnected (Ref. 26). Optoelectronic Si/PLZT processing arrays of this size are certainly feasible. More sophisticated interconnection hologram design and fabrication techniques currently under investigation should be able to accommodate larger PE arrays.
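To give a feel for the numbers quoted above, the following Python sketch works out a back-of-envelope feature budget for the CGH. It is only arithmetic on the stated dimensions (5 × 5 cm plate, 0.5-μm features, 128 × 128 PEs); it is not the derivation of Ref. 26, and the function name and the number of facets per PE are assumptions made for illustration.

```python
def cgh_feature_budget(side_um, feature_um, array_side, facets_per_pe):
    """Divide the resolvable features of a square CGH among its facets.

    side_um       : CGH side length in micrometers
    feature_um    : minimum feature size in micrometers
    array_side    : number of PEs along one side of the square array
    facets_per_pe : facets assigned to each PE (an assumption; one per
                    detector and one per modulator)
    """
    features_per_side = round(side_um / feature_um)
    total_features = features_per_side ** 2
    num_facets = array_side ** 2 * facets_per_pe
    return {
        "features_per_side": features_per_side,
        "total_features": total_features,
        "num_facets": num_facets,
        "features_per_facet": total_features // num_facets,
    }


# 5 cm = 50,000 um; one detector facet and one modulator facet per PE.
budget = cgh_feature_budget(side_um=50_000, feature_um=0.5,
                            array_side=128, facets_per_pe=2)
for name, value in budget.items():
    print(f"{name}: {value}")
```

With these inputs the plate holds 100,000 features per side, so each facet has a budget of a few hundred thousand features with which to encode its beam steering or focusing function. Setting `facets_per_pe=4` models the halver case described next, where each PE needs two detectors and two modulators.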

To perform the approximate halver algorithm described in Sec. III.A, each processing element requires two detector inputs (the two values to be compared) and two modulator outputs. The planes are divided into two halves, higher and lower. One output from each PE connects directly to that PE's corresponding location in the next plane, while the other output connects to a quasirandom destination on the next plane's other half. Each PE receives and compares the two input values. The higher half of the processors passes the higher of the two values straight across and switches the lower. The lower half of the processors does the opposite. The input values are loaded from the left and propagate in parallel to the right, sorted more accurately in each step into the higher and lower half-planes.

Fig. 13. POEM system using reconfigurable volume holograms. Photorefractive crystals contain several interconnection patterns distinguished by frequency or phase multiplexing.
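The unfolded halving procedure of Sec. IV.C.1 can be sketched in Python as follows. This is a simplified model under stated assumptions, not a description of the hardware: each plane-to-plane step is treated as one layer of compare-exchange operations between the lower and higher halves, and the quasirandom fixed CGH pattern of each stage is replaced by a seeded pseudorandom 1-1 matching (the expander-based patterns themselves are not reproduced here). All function names are hypothetical.

```python
import random


def misplaced(values):
    """Count values in the lower half that belong to the higher half."""
    n = len(values)
    threshold = sorted(values)[n // 2]  # smallest value of the true top half
    return sum(1 for v in values[: n // 2] if v >= threshold)


def halver_stage(values, rng):
    """One plane-to-plane step: pair each lower-half position with a
    pseudorandom higher-half position and compare-exchange the pair."""
    n = len(values)
    half = n // 2
    partners = list(range(half, n))
    rng.shuffle(partners)  # stand-in for one fixed quasirandom CGH pattern
    out = list(values)
    for lo, hi in zip(range(half), partners):
        a, b = values[lo], values[hi]
        out[lo], out[hi] = min(a, b), max(a, b)
    return out


def run_unfolded(values, stages, seed=0):
    """Propagate the values through `stages` planes; record the error count."""
    if len(values) % 2:
        raise ValueError("expected an even number of distinct values")
    rng = random.Random(seed)
    history = [misplaced(values)]
    for _ in range(stages):
        values = halver_stage(values, rng)
        history.append(misplaced(values))
    return values, history


data = list(range(64))
random.Random(1).shuffle(data)
result, errors = run_unfolded(data, stages=6, seed=2)
print("misplaced values per plane:", errors)
```

Because a compare-exchange between the two halves can only move a value toward its correct half, the error count never increases from one plane to the next. Each stage uses a different pattern, which mirrors the hardware cost noted above: the unfolded system needs one processing plane and one fixed CGH per stage.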

2. Folded POEM with Programmable Photorefractive Interconnects

Thick (volume) holograms are dramatically different from planar holograms in that they exhibit readout selectivity. When the readout beam mismatches the stored hologram in either the optical wavelength or the phase pattern, the diffraction efficiency decreases dramatically. The degree of selectivity depends on the thickness of the hologram; for a 1-mm thick hologram, an angular mismatch of 0.1° (Ref. 27) cuts diffraction to nearly zero. This behavior allows the superposition of multiple volume holograms, each coded with its own reference wavefront. When one or more of the reference wavefronts illuminates the hologram, the corresponding images are simultaneously recalled. Each of these volume holograms can in theory have high (approaching 100%) diffraction efficiency. The principal volume recording media are photorefractive crystals, which develop an index modulation (phase grating) in continuous response to incident light.

Figure 13 shows a POEM system using multiplexed volume holographic interconnects recorded in a photorefractive crystal. The processing planes are similar to those of the preceding example, except that now, because the connection pattern is reconfigurable, the information is exchanged back and forth between a single pair of processing array planes, PA1 and PA2. The optical system works by retrieving interconnection patterns prestored as volume holograms superimposed on a photorefractive crystal. In Fig. 13, a recording source array is used to record each processor's interconnection pattern sequentially. Computer-controlled scanning directs the recording images to spatially discrete crystal subvolumes. Multiple interconnection patterns are superimposed on each subvolume using phase or wavelength multiplexing. After all the holograms are recorded, one complete interconnection pattern can be recalled in parallel using the input coded with the proper phase or frequency. The system is preprogrammed, reconfigurable between prestored patterns. Assuming diffraction-limited holograms and 10% diffraction efficiency, two 50 × 50 × 2.5 mm lithium niobate crystals could interconnect a 128 × 128 input array with ten prestored patterns. Again, more sophisticated approaches should increase the possible performance. In particular, prefabricated CGH patterns could be used to provide wavefronts for volume storage, decreasing programming time and possibly increasing array size.

To perform the approximate halving of n values, the input is arbitrarily divided into two halves. Each half is sent into one of two n/2 element processor arrays, PA1 and PA2. Each processor stores its value, then sends a copy to the other plane along a skewed 1-1 connection pattern. The forward and reverse connection patterns are identical. In the next step, each processor compares the received value with the one it was originally given. Processors in PA1 store the higher value and send the lower to PA2 using a new irregular connection. Processors in PA2 perform the same operation in reverse, storing the lower of the two values. As the process continues, planes PA1 and PA2 hold in storage the higher and lower half, respectively, of the values with a steadily decreasing probability of error. In this folded implementation a total of only n processors was required, each with a single detector and modulator.
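The folded procedure can be sketched in the same spirit. The Python model below is a simplification under stated assumptions: each round is treated as a compare-exchange between a PA1 processor and its partner in PA2, with PA1 keeping the higher value and PA2 the lower, and the prestored volume holograms are represented by a small set of seeded pseudorandom 1-1 patterns recalled by index. The class and function names are hypothetical, and the random permutations merely stand in for the skewed and irregular patterns of the text.

```python
import random


class PreprogrammedInterconnect:
    """A limited set of prestored 1-1 patterns, recalled by index."""

    def __init__(self, size, num_patterns, seed=0):
        rng = random.Random(seed)
        self.patterns = []
        for _ in range(num_patterns):
            perm = list(range(size))
            rng.shuffle(perm)  # stand-in for one prestored hologram
            self.patterns.append(perm)

    def recall(self, k):
        """Select pattern k, cycling through the prestored set."""
        return self.patterns[k % len(self.patterns)]


def count_errors(pa1, pa2):
    """Count values held in PA2 that belong to the higher half."""
    threshold = sorted(pa1 + pa2)[len(pa2)]
    return sum(1 for v in pa2 if v >= threshold)


def folded_halving(values, rounds, num_patterns=10, seed=0):
    """Exchange values between two n/2 arrays; PA1 keeps the higher one."""
    half = len(values) // 2
    pa1, pa2 = list(values[:half]), list(values[half:])
    link = PreprogrammedInterconnect(half, num_patterns, seed)
    history = [count_errors(pa1, pa2)]
    for r in range(rounds):
        pattern = link.recall(r)
        # The pattern is 1-1, so the pairs are disjoint and this loop is
        # equivalent to all processors acting in parallel.
        for i, j in enumerate(pattern):
            a, b = pa1[i], pa2[j]
            pa1[i], pa2[j] = max(a, b), min(a, b)
        history.append(count_errors(pa1, pa2))
    return pa1, pa2, history


data = list(range(64))
random.Random(3).shuffle(data)
pa1, pa2, errors = folded_halving(data, rounds=12, num_patterns=10, seed=4)
print("values misplaced in PA2 per round:", errors)
```

The operational difference from the unfolded sketch is that the same two arrays are reused in every round and only the recalled pattern changes, so the number of distinct patterns is bounded by what the crystal can store (ten in the estimate above) rather than by the number of processing planes.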

The two systems we have described are intended only to indicate the potential of optoelectronic processing for implementing irregularly interconnected parallel algorithms. Both optical and electronic components may be replaced as more advanced versions become available. For example, the correlation matrix-tensor multiplier system currently being investigated at UCSD may provide a more versatile reprogrammable interconnection system (Ref. 28). The Si/PLZT processor planes may be replaced with faster switching multiple quantum well modulators. Most important, we have shown that optoelectronics is a technology well suited to implementing these algorithms.

V. Conclusions and Further Work

We proposed an interconnection architecture based on expander graphs and have shown how these interconnections could lead to efficient parallel algorithms. We have also reasoned that such graphs cannot be implemented with existing VLSI technology but can be made practical with optoelectronic computing technology using free space optical interconnects.

Our further work will focus on experimentally demonstrating the feasibility of implementing expanders on optoelectronic computers and on finding new and more efficient ways of using irregular interconnections.

The authors would like to acknowledge support by AFOSR grant 89-0440 and DARPA administered by AFOSR grant 88-0022.

References

1. T. Leighton and B. Maggs, "Expanders Might be Practical: Fast Algorithms for Routing Around Faults on Multibutterflies," in Proceedings, IEEE Symposium on Foundations of Computer Science (1989), pp. 384-389.

2. E. Upfal, "An O(log N) Deterministic Packet Routing Scheme," in Proceedings, Twenty First Annual ACM Symposium on Theory of Computing (May 1989), pp. 241-250.

3. M. S. Paterson, "Improved Sorting Networks with O(log n) Depth," Research Report 89, Department of Computer Science, U. Warwick, Coventry, CV4 7AL, U.K. (1987).

4. M. Ajtai, J. Komlos, and E. Szemerédi, "Sorting in c log n Parallel Steps," Combinatorica 3, 1-19 (1983).

5. M. Ajtai, J. Komlos, W. L. Steiger, and E. Szemerédi, "Deterministic Selection in O(log log n) Parallel Time," in ACM Symposium on Theory of Computing, Vol. 18 (1986), pp. 188-195.

6. N. Pippenger, "Sorting and Selection in Rounds," SIAM J. Comput. 16, 1032-1038 (1987).

7. J. Komlos and R. Paturi, "Effect of Connectivity in an Associative Memory Model," in Proceedings, IEEE Symposium on Foundations of Computer Science (1988), pp. 138-147.

8. R. Cole and U. Vishkin, "Approximate and Exact Parallel Scheduling with Applications to List, Tree and Graph Problems," in Twenty-Seventh Annual Symposium on Foundations of Computer Science (1986), pp. 478-491.

9. C. Dwork, D. Peleg, N. Pippenger, and E. Upfal, "Fault Tolerance in Networks of Bounded Degree," SIAM J. Comput. 17, 975-988 (1988).

10. N. Pippenger, "The Memory Refresh Problem," in Advanced Research in VLSI, Proceedings, Fifth MIT Conference, J. Allen and F. T. Leighton, Eds. (MIT Press, Cambridge, 1988).

11. J. W. Goodman, "Optics as an Interconnect Technology," in Optical Processing and Computing, H. H. Arsenault, T. Szoplik, and B. Macukow, Eds. (Academic, New York, 1989), Chap. 1, pp. 1-32.

12. L. G. Valiant, "Graph-Theoretic Properties in Computational Complexity," J. Comput. Syst. Sci. 13, 278-285 (1976).

13. N. Alon, "Eigenvalues and Expanders," Combinatorica 6, 83-96 (1986).

14. G. A. Margulis, "Explicit Construction of Concentrators," Prob. Peredachi Inf. 9, 71-80 (1973) [Probl. Inf. Transm. 325-332 (1973)].

15. O. Gabber and Z. Galil, "Explicit Construction of Linear Sized Superconcentrators," J. Comput. Syst. Sci. 22, 407-420 (1981).

16. M. Blum, R. M. Karp, O. Vornberger, C. H. Papadimitriou, and M. Yannakakis, "The Complexity of Testing Whether a Graph is a Superconcentrator," Inf. Process. Lett. 13, 164-167 (1981).

17. R. M. Tanner, "Explicit Concentrators from Generalized N-Gons," SIAM J. Alg. Discrete Meth. 5, 287-293 (1984).

18. N. E. Gibbs, W. G. Poole, Jr., and P. K. Stockmeyer, "A Comparison of Several Bandwidth and Profile Reduction Algorithms," ACM Trans. Math. Software 2, 322-330 (1976).

19. L. G. Valiant, "A Scheme for Fast Parallel Communication," SIAM J. Comput. 11, 350-361 (1982).

20. J. J. Hopfield, "Neural Networks and Physical Systems with Emergent Collective Computational Abilities," Proc. Natl. Acad. Sci. USA 79, 2554-2558 (1982).

21. M. R. Feldman, S. C. Esener, C. C. Guest, and S. H. Lee, "Comparisons Between Optical and Electrical Interconnects Based on Power and Speed Considerations," Appl. Opt. 27, 1742-1751 (1988).

22. F. Kiamilev et al., "Programmable Opto-Electronic Multiprocessors and Their Comparison with Symbolic Substitution for Digital Optical Computing," Opt. Eng. 28, 396-409 (1989).

23. S. H. Lee, S. C. Esener, M. A. Title, and T. J. Drabik, "Two-Dimensional Silicon/PLZT Spatial Light Modulators: Design Considerations and Technology," Opt. Eng. 25, 250-260 (1986).

24. K. S. Urquhart, S. H. Lee, C. C. Guest, M. R. Feldman, and H. Farhoosh, "Computer Aided Design of Computer Generated Holograms for Electron Beam Fabrication," Appl. Opt. 28, 3387-3396 (1989).

25. G. L. Swanson, "Binary Optics Technology: The Theory and Design of Multi-Level Diffractive Optical Elements," MIT Lincoln Laboratory Technical Report 854 (1989).

26. M. R. Feldman and C. C. Guest, "Interconnect Density Capabilities of Computer Generated Holograms for Optical Interconnection of Very Large Scale Integrated Circuits," Appl. Opt. 28, 3134-3137 (1989).

27. R. J. Collier, C. B. Burckhardt, and L. H. Lin, Optical Holography (Academic, New York, 1971), Chap. 9.

28. J. E. Ford, Y. Fainman, and S. H. Lee, "Array Interconnection by Phase-Coded Optical Correlation," Opt. Lett. 15, 1088-1090 (1990).

