

Finite Elements and the Method of Conjugate Gradients on a Concurrent Processor

Gregory A. Lyzenga and Arthur Raefsky
Jet Propulsion Laboratory, California Institute of Technology
Pasadena, CA 91109

Bradford H. Hager
Seismological Laboratory, California Institute of Technology
Pasadena, CA 91125


ABSTRACT

We present an algorithm for the iterative solution of finite element problems on a concurrent processor. The method of conjugate gradients is used to solve the system of matrix equations, which is distributed among the processors of a MIMD computer according to an element-based spatial decomposition. This algorithm is implemented in a two-dimensional elastostatics program on the Caltech Hypercube concurrent processor. The results of tests on up to 32 processors show nearly linear concurrent speedup, with efficiencies over 90% for sufficiently large problems.

INTRODUCTION

Finite Elements and Elliptic Boundary Value Problems

The need to solve boundary value problems involving elliptic partial differential equations on geometrically complex domains arises in many engineering contexts, and increasingly in a wide variety of scientific fields. The finite element method, which has been treated in a large number of texts (e.g., Zienkiewicz, 1), provides a flexible and powerful numerical technique for the solution of such problems. The physical problems which have been solved by finite element methods come from such diverse fields as structural and continuum mechanics, fluid dynamics, hydrology, heat flow analysis and many others.

Among the common features of most finite element applications is the need to form and to solve a matrix equation of the form:

A x = b     (1)

Here, x is a vector of unknown quantities, to be solved for at each of some number of 'node' locations within the problem domain. The grid of nodes (usually representing a discretization of some physical spatial domain) also defines a subdivision of the domain into volumes (or areas) called 'elements' which jointly comprise the entire problem domain and share nodes along their boundaries (Fig. 1).

Fig. 1 Finite element discretization of a boundary value problem
(The figure indicates the domain of the problem and the quantities x = vector of nodal degrees of freedom, b = vector of effective 'forces', and A = assembly of element 'stiffness' matrices.)

The stiffness matrix A contains the terms which determine the interaction among the different unknown degrees of freedom in x, and the forcing terms or boundary values are introduced through the right hand side, b. Operationally, entries of A are computed by performing integrals over the elements which include the node in question. The stiffness matrix therefore consists of an assembly of individual element matrices, which are mapped into the 'global' matrix through a master equation bookkeeping scheme.
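As an illustration of this bookkeeping, the sketch below scatter-adds one element stiffness matrix into a dense global matrix through a location (lm) array. It is a minimal C sketch; the names, the element size NDOF_PER_ELEM, and the dense storage are illustrative assumptions rather than the authors' code.

    #include <stddef.h>

    #define NDOF_PER_ELEM 8   /* e.g. a 4-node quadrilateral with 2 dof per node (assumed) */

    /* Scatter-add one element stiffness ke into the dense n-by-n global matrix A,
       using the element's location array lm[], which holds the global equation
       number of each local degree of freedom (or -1 for a constrained dof that
       is not assembled). */
    void assemble_element(double *A, size_t n,
                          const double ke[NDOF_PER_ELEM][NDOF_PER_ELEM],
                          const int lm[NDOF_PER_ELEM])
    {
        for (int i = 0; i < NDOF_PER_ELEM; i++) {
            if (lm[i] < 0) continue;                 /* skip constrained dof */
            for (int j = 0; j < NDOF_PER_ELEM; j++) {
                if (lm[j] < 0) continue;
                A[(size_t)lm[i] * n + (size_t)lm[j]] += ke[i][j];
            }
        }
    }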


Common to most such finite element applications is the problem of solving the system given in equation (1). Since the matrix A may be very large (especially in higher-dimensional problems), considerable thought must go into solving it efficiently. This computation-intensive step is the primary motivation for looking to concurrent computation techniques.

A stiffness matrix A typically has a number of properties which influence the choice of solution technique. For example, many physical problems give rise to a symmetric positive definite matrix. For the remainder of the present discussion we shall restrict ourselves to this class of finite element problems. This does not mean that more general cases cannot be treated using concurrent techniques similar to those described below, but such problems are not within the scope of this work.

Traditionally, derivatives of Gaussian elimination have been used to solve these systems, giving robust performance under a wide range of matrix ill-conditioning circumstances. Matrix ill-conditioning may occur if a given system is very nearly singular, or exhibits a very large ratio between the largest and smallest eigenvalues, in which case certain matrix operations which are sensitive to round-off error may be inaccurate. While direct techniques such as Gaussian elimination are not particularly sensitive to moderate ill-conditioning, they entail a large amount of computational work. Whether dealing with iterative or direct techniques, the matrix problem nearly always represents the dominant computational cost of a finite element calculation. In the case of Gaussian elimination, this cost increases rapidly with problem size, being proportional to n b^2, where n is the total number of degrees of freedom and b is the mean diagonal bandwidth of non-zero elements in A.

In a typical assembled finite element matrix, the entries in a given column or row are nonzero only within a given distance from the diagonal. This is because a given node only contributes interaction terms from those nodes with which it shares an element. Although judicious node numbering can minimize the average column height and thus the mean diagonal bandwidth of the matrix, problems with large numbers of nodes in more than one spatial direction will unavoidably have very large bandwidths. This means that, especially for grids in higher dimensions, the work necessary to solve the system will increase much faster than the number of nodes as we attempt to solve larger and larger systems.
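To make this growth concrete, consider (as an illustration not drawn from the paper) a cubical grid of N x N x N nodes with one unknown per node, numbered plane by plane. Then

    n ≈ N^3   and   b ≈ N^2,

so, taking the n b^2 estimate above, the elimination work grows roughly as N^7, while the number of unknowns grows only as N^3.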

Certain classes of structural analysis problems may have considerably smaller mean bandwidths than the most general three-dimensional problem, and for these, direct techniques remain a viable approach. Research into concurrent methods for these techniques is already well advanced (2,3,4). The present authors have been most interested in continuum mechanics problems, for which direct methods are more limited in their utility.

For the above reasons, we have pursued the implementation of an increasingly popular iterative technique, the method of conjugate gradients. Assuming that it can be made to converge, the conjugate gradient method offers the potential for much improved performance on large three-dimensional grids. This, combined with a rather straightforward concurrent generalization, made this an attractive area to explore. In addition, the symmetric positive-definite nature of the matrices considered makes the method of conjugate gradients a good first choice (see for example Jennings, 5). In the discussion which follows, we will present the basic concurrent algorithm developed for this application.

Figure 2 shows schematically how a two-dimensional finite element grid might be spread over the processors of an 8-node MIMD machine. Each subset of elements (delineated by the heavy lines) is treated effectively as a separate smaller finite element problem within each processor. Adjoining subdomains need only exchange boundary information with neighboring processors in order to complete calculations for each region concurrently. The next section discusses in more detail how this concurrency is managed.
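A minimal sketch of such a block decomposition follows; it assigns each element of a rectangular grid to one processor of a logical px-by-py mesh (4 x 2 would reproduce the 8-node example of Fig. 2). The function and its row-major processor numbering are illustrative assumptions, not the authors' code.

    /* Assign element (ex, ey) of an nex-by-ney element grid to a processor in a
       px-by-py logical processor mesh (row-major processor numbering). */
    int element_owner(int ex, int ey, int nex, int ney, int px, int py)
    {
        int ipx = ex * px / nex;   /* block column containing the element */
        int ipy = ey * py / ney;   /* block row containing the element    */
        return ipy * px + ipx;
    }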

Fig. 2 A possible distribution of a two-dimensional grid on an 8-processor concurrent computer
(The figure shows the element grid partitioned into a 4 x 2 arrangement of blocks assigned to processors 0 through 7, with x and y axes indicated.)

The Concurrent Decomposition

The conjugate gradient method represents a technique for iteratively searching the space of vectors x in such a way as to minimize a function of the residual errors. Briefly, the conjugate gradient procedure consists of the following algorithm. Initially,

p^(0) = r^(0) = b - A x^(0)     (2)

Then, for the kth step,

1.  α_k = ( r^(k) · r^(k) ) / ( p^(k) · A p^(k) )

2.  x^(k+1) = x^(k) + α_k p^(k)

3.  r^(k+1) = r^(k) - α_k A p^(k)     (3)

4.  β_k = ( r^(k+1) · r^(k+1) ) / ( r^(k) · r^(k) )

5.  p^(k+1) = r^(k+1) + β_k p^(k)

6.  k = k + 1; go to 1. (Repeat until the residual is as small as required.)
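For reference, the following is a minimal sequential C sketch of the iteration (2)-(3). The dense matrix storage and the tolerance/iteration-count arguments are illustrative assumptions; the authors' program stores element matrices and distributes the work as described in the sections that follow.

    #include <math.h>
    #include <stddef.h>

    /* Dot product of two vectors of length n. */
    static double dot(const double *u, const double *v, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += u[i] * v[i];
        return s;
    }

    /* y = A x for a dense n-by-n matrix stored row-major. */
    static void matvec(const double *A, const double *x, double *y, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            double s = 0.0;
            for (size_t j = 0; j < n; j++)
                s += A[i * n + j] * x[j];
            y[i] = s;
        }
    }

    /* Conjugate gradients for A x = b, A symmetric positive definite.
       x holds the initial guess on entry and the solution on return;
       r, p and Ap are caller-supplied work arrays of length n.        */
    void cg_solve(const double *A, const double *b, double *x,
                  double *r, double *p, double *Ap, size_t n,
                  double tol, int max_iter)
    {
        matvec(A, x, Ap, n);
        for (size_t i = 0; i < n; i++) {
            r[i] = b[i] - Ap[i];                  /* r(0) = b - A x(0) */
            p[i] = r[i];                          /* p(0) = r(0)       */
        }
        double rr = dot(r, r, n);

        for (int k = 0; k < max_iter && sqrt(rr) > tol; k++) {
            matvec(A, p, Ap, n);
            double alpha = rr / dot(p, Ap, n);                        /* step 1 */
            for (size_t i = 0; i < n; i++) x[i] += alpha * p[i];      /* step 2 */
            for (size_t i = 0; i < n; i++) r[i] -= alpha * Ap[i];     /* step 3 */
            double rr_new = dot(r, r, n);
            double beta = rr_new / rr;                                /* step 4 */
            for (size_t i = 0; i < n; i++) p[i] = r[i] + beta * p[i]; /* step 5 */
            rr = rr_new;
        }
    }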

As may be seen by inspecting (3) above, this algorithm involves two basic kinds of operations. The first of these is the vector dot product, for example r^(k) · r^(k). The second basic operation is the matrix-vector product, A p^(k). Both of these primitive operations can be done in parallel by decomposing the problem into regions of the physical domain space. A given 'global' vector such as r or p is spread out among the processors of a concurrent ensemble, with concurrent operations being performed on the various 'pieces' of the vectors in each processor.


The only need for information from outside a given processor occurs when nodes on the boundary between the 'jurisdictions' of two processors are computed.
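As a sketch of how the matrix-vector product decomposes over elements, the loop below forms the local contribution to Ap within one processor from its stored element stiffness matrices (NDOF_PER_ELEM and the lm arrays are as in the earlier assembly sketch and remain illustrative assumptions). The entries at shared boundary nodes are still only partial sums at this point; they require the interprocessor accumulation described next.

    /* Local part of the product Ap within one processor: loop over the resident
       elements, apply each stored element stiffness to the local entries of p,
       and scatter-add the result into the local Ap array.  Entries belonging to
       shared boundary nodes are only partial sums until the communication cycle
       of Fig. 3 accumulates the contributions held by neighboring processors.   */
    void local_matvec(int nelem,
                      const double ke[][NDOF_PER_ELEM][NDOF_PER_ELEM],
                      const int lm[][NDOF_PER_ELEM],
                      const double *p, double *Ap, int ndof_local)
    {
        for (int i = 0; i < ndof_local; i++)
            Ap[i] = 0.0;

        for (int e = 0; e < nelem; e++) {
            for (int i = 0; i < NDOF_PER_ELEM; i++) {
                double s = 0.0;
                for (int j = 0; j < NDOF_PER_ELEM; j++)
                    s += ke[e][i][j] * p[lm[e][j]];
                Ap[lm[e][i]] += s;
            }
        }
    }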

Figure 3 illustrates schematically the protocol used for handling shared degrees of freedom between processors in the two-dimensional case. Each processor obeys a convention whereby it accumulates contributions to global vector quantities (x, r, p) from neighboring processors along the 'right' and 'down' edges of its region. Contributions arising from degrees of freedom along a processor's 'left' and 'up' edges are sent to the respective neighboring processor for accumulation. The 'lower right' and 'upper left' corner degrees of freedom are passed twice in reaching their destination.

Fig. 3 Two-dimensional protocol for passing shared nodal values
(The figure labels the boundary slots of a processor's region: SU = send to up, SX = send across, SL = send left, AR = accept from right, AX = accept from across, AD = accept from down, and marks the edges for which the local processor has accumulation responsibility.)

While only one processor has responsibility for accumulating a given shared degree of freedom, each neighboring processor must be provided with a copy of the accumulated result for subsequent calculation. Thus, the process of updating a quantity such as Ap, which involves adding contributions from different elements in different processors, proceeds as follows:

1. The 'local' contributions are calculated within each processor independently. This means calculating the contributions to the given matrix or vector product while neglecting the effect of any neighboring processor.

2. A right-to-left pass of edge contributions is done. Vector contributions along the left edge of a processor (i.e., those labeled SL, AD, and SX) are sent to the neighboring processor on the left, at the same time that the corresponding right edge degrees of freedom (AR, AX, and SU) are received from the right. The received contributions are added to the locally calculated contributions already resident in those storage 'slots'.

3. This is followed by a down-to-up pass. The contributions labeled SU and SX are sent up, while the AD and AX slots receive contributions to accumulate. Note that the corner degrees of freedom labeled SX are transmitted and accumulated twice before reaching their final destination in an AX slot located diagonally from their starting point.

4. Finally, an up-to-down shift, followed by a left-to-right shift which overwrites rather than adds to the current contents of the boundary arrays, serves to distribute the final sum of all contributions to all the involved processors. This ends the communication cycle, and the processors then return to concurrent internal computation; a code sketch of this cycle follows below.
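The following minimal sketch shows the add-then-overwrite pattern of this cycle for one pair of edges, assuming a hypothetical neighbor_shift() exchange primitive and boundary arrays named after the slots of Fig. 3; none of these names are taken from the authors' code, and the up/down pass and the corner slots would repeat the same pattern.

    #define NB 16   /* boundary values on one edge (illustrative size) */

    enum { LEFT, RIGHT, UP, DOWN };   /* lattice neighbor directions */

    /* Assumed machine-dependent primitive (not from the paper): send a buffer
       to the neighbor in direction 'to' while receiving one from the neighbor
       in direction 'from'.  On the hypercube this would map onto nearest-
       neighbor channel I/O. */
    extern void neighbor_shift(int to, const void *sendbuf,
                               int from, void *recvbuf, int nbytes);

    /* Accumulation pattern for one edge pair of a distributed vector: the SL
       (send-left) contributions are shipped away while the AR (accept-right)
       slots accumulate what arrives from the right-hand neighbor; the return
       shift then distributes the finished sums, overwriting rather than adding. */
    void accumulate_left_right(double sl[NB], double ar[NB])
    {
        double recv[NB];
        int i, nbytes = NB * (int)sizeof(double);

        /* Pass 2 (right-to-left): send SL left, add arrivals into AR. */
        neighbor_shift(LEFT, sl, RIGHT, recv, nbytes);
        for (i = 0; i < NB; i++)
            ar[i] += recv[i];

        /* Pass 4 (left-to-right return shift): the accumulated AR values travel
           back and overwrite (do not add to) the sender's SL slots. */
        neighbor_shift(RIGHT, ar, LEFT, recv, nbytes);
        for (i = 0; i < NB; i++)
            sl[i] = recv[i];
    }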

Aside from this kind of calculation, in which a global vector is updated on the basis of local (element) information, there is one other situation in which processor 'responsibility' for global degrees of freedom is significant. Global scalar products are calculated by forming partial dot products within each processor, which are in turn forwarded to a single 'control process' (which in this case resides in a separate external processor) which performs the summation and takes action on the result (e.g., terminating the iterative loop). In this case, in order to avoid 'double-counting' any entry, each processor calculates its partial dot product using only its internal and accumulated degrees of freedom.
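A sketch of the partial dot product under these ownership rules is given below; the 'owned' flag array and the send_to_control() routine are illustrative assumptions.

    /* Assumed node-side message primitive (see the interface sketch below). */
    extern void send_to_control(const void *buf, int nbytes);

    /* Partial dot product u.v over the degrees of freedom this processor 'owns':
       its purely internal entries plus the shared entries it is responsible for
       accumulating (right and down edges and the lower right corner).  Shared
       entries owned by a neighbor are skipped so that no entry is counted twice;
       the control process sums the partial results from all processors.          */
    void partial_dot(const double *u, const double *v,
                     const char *owned, int ndof_local)
    {
        double s = 0.0;
        for (int i = 0; i < ndof_local; i++)
            if (owned[i])
                s += u[i] * v[i];
        send_to_control(&s, (int)sizeof s);
    }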

The scheme described above is applicable to the hypercube machine architecture specifically, and more generally to any MIMD computer which supports the following communications operations: (1) global broadcast of data from a designated controlling processor to the concurrent array, (2) transmission of unique data 'messages' between array elements and the control process, and (3) element-to-element data transmission between lattice nearest neighbors.
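Expressed as a hypothetical C interface (the names and signatures are illustrative assumptions consistent with the sketches above, not the actual hypercube communication library), the three required operations might be declared as follows.

    /* (1) Global broadcast from the control process to every node of the array. */
    void broadcast(void *buf, int nbytes);

    /* (2) Unique messages between an array node and the control process,
           e.g. partial dot products sent up, convergence decisions sent back. */
    void send_to_control(const void *buf, int nbytes);
    void recv_from_control(void *buf, int nbytes);

    /* (3) Nearest-neighbor exchange within the processor lattice
           (directions LEFT, RIGHT, UP, DOWN as in the sketch above). */
    void neighbor_shift(int to, const void *sendbuf,
                        int from, void *recvbuf, int nbytes);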

IMPLEMENTATION

Programming Considerations

The example discussed here was written in the C language, and cross-compiled on a VAX-11/750 system for execution on the 8086 microprocessor-based 32-node Caltech Hypercube (Mark II) machine. Listings and further information on this code may be obtained by contacting one of the authors.

There are no basic algorithmic differences between the concurrent conjugate gradient algorithm discussed here and its equivalent counterpart on a sequential machine. Thus the concurrent program should, assuming infinite precision of calculation, yield results identical to the sequential version.

In actual practice however, the exact result obtained by the conjugate gradient method, and the number of iterations required for it to converge, are rather sensitive to the finite precision of the numerical calculations. The accumulation steps which occur during interprocessor communications represent additional arithmetic operations of finite precision, which introduce some additional round-off error not present in the sequential equivalent. This causes the concurrent code to produce slightly different (but numerically equivalent) results.

The computation/communication structure of the concurrent algorithm is quite regular, in that each processor is simultaneously doing the same type of task as every other, and the only load imbalance or processor waiting is caused by giving processors responsibility for different numbers of elements or degrees of freedom. The problem of how best to decompose an arbitrarily shaped finite element grid presents a problem in the preparation of the input data, but this preprocessing step has nothing to do directly with the actual operation of the concurrent algorithm (6).


Highly irregular and/or non-rectilinear grids are handled transparently, just as long as all shared boundary nodes are properly identified (as per Fig. 3) in the input data stream.

Among the special considerations to note for this application are restrictions on available memory. In the present example, we have chosen to calculate and store each element stiffness entry. In such a case, it becomes possible to exhaust the relatively modest memory available in the present generation of machines. This problem is particularly acute in moving from two- to three-dimensional problems. One approach to overcoming this limitation is not to store, but to recalculate stiffnesses at each iteration. In such a case, the extra arithmetic operations may be minimized by utilizing one-point quadrature with mesh stabilization techniques (7).
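A sketch of this storage-free alternative is shown below: the element stiffness is regenerated inside the matrix-vector product instead of being read from memory. The element_stiffness() routine is a placeholder assumption standing in for the one-point quadrature and stabilization of reference 7, and NDOF_PER_ELEM and lm are as in the earlier sketches.

    /* Assumed placeholder routine: regenerate the stiffness of element e from its
       nodal coordinates and material data, e.g. by one-point quadrature with
       hourglass (mesh) stabilization as in reference 7. */
    extern void element_stiffness(int e, double ke[NDOF_PER_ELEM][NDOF_PER_ELEM]);

    /* Matrix-free local Ap: the same loop as before, except that each element
       stiffness is recomputed on the fly instead of being read from storage,
       trading arithmetic for memory. */
    void local_matvec_nostore(int nelem, const int lm[][NDOF_PER_ELEM],
                              const double *p, double *Ap, int ndof_local)
    {
        double ke[NDOF_PER_ELEM][NDOF_PER_ELEM];

        for (int i = 0; i < ndof_local; i++)
            Ap[i] = 0.0;

        for (int e = 0; e < nelem; e++) {
            element_stiffness(e, ke);
            for (int i = 0; i < NDOF_PER_ELEM; i++) {
                double s = 0.0;
                for (int j = 0; j < NDOF_PER_ELEM; j++)
                    s += ke[i][j] * p[lm[e][j]];
                Ap[lm[e][i]] += s;
            }
        }
    }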

Results of Initial Tests

The above-described finite element/conjugate gradient program was tested on a series of two-dimensional plane strain elastostatics problems, intended to a) verify the proper execution of the program and b) provide benchmark measurements of the program's concurrent efficiency as a function of problem size and number of processors.

The Mark II hypercube was used in these tests in configurations from one node (0-dimensional cube) through the full 32 nodes (5-dimensional hypercube). The actual problems solved were rectangular arrays of elements with boundary displacements imposed to produce a simple shear field solution. Concurrent efficiency is a measure of how nearly a concurrent processor with m processors approaches speeding a given calculation by a factor of m. Thus we obtain experimental efficiencies by dividing the time required to solve a given problem in a single processor by the concurrent time and the number of processors. As long as the sequential and concurrent algorithms are truly equivalent, this efficiency E has an upper bound of unity.

Table 1 - Two-dimensional conjugate gradient run times and efficiencies

Number of elements | Number of processors | Time per iteration (sec.) | Efficiency

(The tabulated values are not fully legible in this reproduction.)

In Table 1, we summarize the results of these test runs on a variety of problem sizes and numbers of processors. Since the maximum number of elements we could accommodate within a single processor was on the order of 600, the efficiencies listed for the larger 32-node runs are based upon extrapolations of the smaller problems' execution times. In this extrapolation, we have assumed a strictly linear relation between number of equations and execution time per iteration. The single-processor proportionality constant is assumed here to be 6.41 milliseconds per iteration per degree of freedom. The quoted experimental efficiencies are valid to the degree that this scaling assumption holds.
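As a worked example of this extrapolation (the numbers below are illustrative and not taken from Table 1): for a problem with N total degrees of freedom run on m processors with a measured time T_m per iteration, the extrapolated single-processor time and the resulting efficiency are

    T_1 ≈ 0.00641 N seconds per iteration,   E = T_1 / (m T_m),

so a hypothetical run with N = 10000, m = 32 and T_m = 2.2 s would give E ≈ 64.1 / (32 x 2.2) ≈ 0.91.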

In a problem which is perfectly load balanced, the concurrent efficiency will be governed by the ratio of time spent communicating to time spent calculating concurrently. Since in a two-dimensional array the number of communicated edge values varies as n^(1/2), where n is the total number of equations per processor, this ratio should vary as n^(-1/2) and the calculation should obey an efficiency relation,

E = 1 - C n^(-1/2) ,     (4)

where C is a constant. A log-log plot such as Figure 4 may be used to compare the above efficiencies with this theoretical prediction.
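A brief justification of (4), given here as a sketch rather than as the authors' own derivation: if each processor holds n equations of a two-dimensional subdomain, the calculation per iteration grows in proportion to n, while the communicated boundary data grow with the subdomain perimeter, roughly as n^(1/2). Writing t_calc = a n and t_comm = c n^(1/2) for machine-dependent constants a and c,

    E = t_calc / (t_calc + t_comm) = 1 / (1 + (c/a) n^(-1/2)) ≈ 1 - (c/a) n^(-1/2)

for large n, which is of the form (4) with C = c/a.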

Fig. 4 Concurrent speedup efficiency as a function of problem size and number of processors
(The figure plots efficiency E against log n, where n is the number of degrees of freedom per processor, with separate symbols for runs on 2, 4, 8, 16, and 32 processors and a comparison line of the form E = 1 - n^(-1/2).)

On this plot, star symbols indicate the 32-node efficiencies at various problem sizes. The results are seen to agree well with the predicted slope -1/2 line. The points representing those cases with fewer than 16 processors plot above the 32-node trend because they do not involve processors communicating on all four boundaries. In summary, the concurrent conjugate gradient algorithm exhibits performance characteristics which are theoretically understood, and well suited for application to large ensembles of processors.


It is interesting to note that the experimental value for the constant C in (4) is apparently approximately 1. This result, which agrees with the results of Meier (8) for a two-dimensional finite difference scheme on the same machine, is a reflection of the fact that the times for calculation and communication of a single degree of freedom on this hardware are roughly equal.


CONCLUSIONS

We find that the concurrent conjugate gradient algorithm outlined here meets our expectations in providing large, nearly linear speedup in finite element system solutions. Results from actual runs on the 32-node Mark II hypercube system yield net efficiencies upwards of 90%. From this we can conclude that future major use of this and related algorithms on large finite element applications will hinge upon the applicability of iterative techniques, and not upon any issue of concurrency or efficiency.

Several areas are now indicated for future research in this area. Among the most immediate needs are: (1) effective preconditioners to speed convergence of the CG algorithm, (2) extension to three dimensions, (3) investigation of methods to reduce or eliminate stiffness storage requirements, and perhaps most importantly, (4) automated procedures for breaking up and balancing an arbitrary finite element problem among processors. We anticipate addressing these problems in the near future.

ACKNOWLEDGEMENTS

The research in this publication was carried out by the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. The authors also gratefully acknowledge the valuable advice and assistance of Prof. T. J. R. Hughes (Stanford Univ.) and that of S. W. Otto and M. A. Johnson (Caltech).

REFERENCES

1. Zienkiewicz, O. C., The Finite Element Method, 3rd ed., McGraw-Hill, London, 1977.

2. Heller, D., 'Some Aspects of the Cyclic Reduction Algorithm for Block Tridiagonal Linear Systems,' SIAM Journal on Numerical Analysis, vol. 13, 1976, pp. 484-496.

3. Salama, M., Utku, S., and Melosh, R., 'A Family of Permutations for Concurrent Factorization of Block Tridiagonal Matrices,' 2nd Conference on Vector and Parallel Processors in Computational Science, Oxford, August 1984 (proceedings in press).

4. Sameh, A. and Kuck, D., 'Parallel Direct Linear System Solvers - A Survey,' Parallel Computers - Parallel Mathematics, Int'l. Assoc. for Mathematics and Computers in Simulation, 1977.

5. Jennings, A., Matrix Computation for Engineers and Scientists, John Wiley and Sons, Chichester, 1977, pp. 212-244.

6. Fox, G. C. and Otto, S. W., 'Algorithms for Concurrent Processors,' Physics Today, vol. 37, no. 5, May 1984.

7. Belytschko, T., Ong, J. S., Liu, W. K., and Kennedy, J. M., 'Hourglass Control in Linear and Nonlinear Problems,' Computer Methods in Applied Mechanics and Engineering, vol. 43, 1984, pp. 251-276.

8. Meier, D. L., 'Two-Dimensional, One-Fluid Hydrodynamics: An Astrophysical Test Problem for the Nearest Neighbor Concurrent Processor,' Hm-90, Caltech Concurrent Computation Project, Pasadena, Calif., July 1984.