
A Highly Scalable Parallel Algorithm for Sparse Matrix Factorization

Anshul Gupta
IBM T. J. Watson Research Center
P.O. Box 218
Yorktown Heights, NY 10598
[email protected]

George Karypis and Vipin Kumar
Department of Computer Science
University of Minnesota
Minneapolis, MN 55455
karypis/kumar @cs.umn.edu

Appears in IEEE Transactions on Parallel and Distributed Systems, 8(5):502–520, May 1997

Abstract

In this paper, we describe a scalable parallel algorithm for sparse matrix factorization, analyze its performance and scalability, and present experimental results for up to 1024 processors on a Cray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algorithm substantially improves the state of the art in parallel direct solution of sparse linear systems—both in terms of scalability and overall performance. It is a well known fact that dense matrix factorization scales well and can be implemented efficiently on parallel computers. In this paper, we present the first algorithm to factor a wide class of sparse matrices (including those arising from two- and three-dimensional finite element problems) that is asymptotically as scalable as dense matrix factorization algorithms on a variety of parallel architectures. Our algorithm incurs less communication overhead and is more scalable than any previously known parallel formulation of sparse matrix factorization. Although, in this paper, we discuss Cholesky factorization of symmetric positive definite matrices, the algorithms can be adapted for solving sparse linear least squares problems and for Gaussian elimination of diagonally dominant matrices that are almost symmetric in structure. An implementation of our sparse Cholesky factorization algorithm delivers up to 20 GFlops on a Cray T3D for medium-size structural engineering and linear programming problems. To the best of our knowledge, this is the highest performance ever obtained for sparse Cholesky factorization on any supercomputer.

This work was supported by IST/BMDO through Army Research Office contract DA/DAAH04-93-G-0080, NSF grant NSG/IRI-9216941, and by the Army High Performance Computing Research Center under the auspices of the Department of the Army, Army Research Laboratory cooperative agreement number DAAH04-95-2-0003/contract number DAAH04-95-C-0008, the content of which does not necessarily reflect the position or the policy of the government, and no official endorsement should be inferred. Access to computing facilities was provided by the Minnesota Supercomputer Institute, Cray Research Inc., and the Pittsburgh Supercomputing Center.


1 Introduction

Solving large sparse systems of linear equations is at the core of many problems in engineering and scientific computing. Such systems are typically solved by two different types of methods—iterative methods and direct methods. The nature of the problem at hand determines which method is more suitable. A direct method for solving a sparse linear system of the form Ax = b involves explicit factorization of the sparse coefficient matrix A into the product of lower and upper triangular matrices L and U. This is a highly time- and memory-consuming step; nevertheless, direct methods are important because of their generality and robustness. For linear systems arising in certain applications, such as linear programming and structural engineering applications, they are the only feasible solution methods. In many other applications too, direct methods are often preferred because the effort involved in determining and computing a good preconditioner for an iterative solution may outweigh the cost of direct factorization. Furthermore, direct methods provide an effective means for solving multiple systems with the same coefficient matrix and different right-hand side vectors because the factorization needs to be performed only once.

A wide class of sparse linear systems have a symmetric positive definite (SPD) coefficient matrix that is factored using Cholesky factorization. Although Cholesky factorization is used extensively in practice, its use for solving large sparse systems has been mostly confined to big vector supercomputers due to its high time and memory requirements. As a result, parallelization of sparse Cholesky factorization has been the subject of intensive research [26, 55, 12, 15, 14, 18, 54, 40, 41, 3, 49, 50, 57, 9, 28, 26, 27, 51, 2, 1, 44, 58, 16, 55, 43, 33, 5, 42, 4, 59]. We have developed highly scalable formulations of sparse Cholesky factorization that substantially improve the state of the art in parallel direct solution of sparse linear systems—both in terms of scalability and overall performance. It is well known that dense matrix factorization can be implemented efficiently on distributed-memory parallel computers [8, 46, 10, 34]. We show that the parallel Cholesky factorization algorithm described here is as scalable as the best parallel formulation of dense matrix factorization on both mesh and hypercube architectures for a wide class of sparse matrices, including those arising in two- and three-dimensional finite element problems. This algorithm incurs less communication overhead than any known parallel formulation of sparse matrix factorization, and hence, can utilize a higher number of processors effectively. The algorithm presented here can deliver speedups in proportion to an increasing number of processors while requiring almost constant memory per processor.

It is difficult to derive analytical expressions for the number of arithmetic operations in factorization and for the size (in terms of number of nonzero entries) of the factor for general sparse matrices. This is because the computation and fill-in during the factorization of a sparse matrix are a function of the number and position of nonzeros in the original matrix. In the context of the important class of sparse matrices that are adjacency matrices of graphs whose n-node subgraphs have O(√n)-node separators (this class includes sparse matrices arising out of all two-dimensional finite difference and finite element problems), the contribution of this work can be summarized by Figure 1. A simple fan-out algorithm [12] with column-wise partitioning of an N × N matrix of this type on p processors results in a Θ(Np log p) total communication volume [15] (box A). The communication volume of the column-based schemes represented in box A has been improved using smarter ways of mapping the matrix columns onto processors, such as the subtree-to-subcube mapping [14] (box B).

Footnote 1: In [47], Pan and Reif describe a parallel sparse matrix factorization algorithm for a PRAM-type architecture. This algorithm is not cost-optimal (i.e., the processor-time product exceeds the serial complexity of sparse matrix factorization) and is not included in the classification given in Figure 1.


Figure 1: An overview of the performance and scalability of parallel algorithms for factorization of sparse matrices resulting from two-dimensional N-node grid graphs. Box A: column-wise 1-D partitioning with global mapping (communication overhead Θ(Np log p); scalability Ω((p log p)^3)). Box B: column-wise 1-D partitioning with subtree-to-subcube mapping (communication overhead Θ(Np); scalability Ω(p^3)). Box C: 2-D partitioning of the matrix with global mapping (communication overhead Θ(N√p log p); scalability Ω(p^1.5 (log p)^3)). Box D: 2-D partitioning with subtree-to-subcube mapping, with dense submatrices partitioned dynamically in both dimensions (communication overhead Θ(N√p); scalability Θ(p^1.5)). Box D represents our algorithm, which is a significant improvement over the other known classes of algorithms for this problem.


A number of column-based parallel factorization algorithms [40, 41, 3, 49, 50, 57, 12, 9, 28, 26, 55, 43, 5] have a lower bound of Ω(Np) on the total communication volume [15]. Since the overall computation is only Θ(N^1.5) [13], the ratio of communication to computation of column-based schemes is quite high. As a result, these column-based schemes scale very poorly as the number of processors is increased [55, 53]. In [2], Ashcraft proposes a fan-both family of parallel Cholesky factorization algorithms whose total communication volume is smaller than that of the other column-based partitioning schemes; however, the isoefficiency function of Ashcraft's algorithm is still Ω(p^3) due to concurrency constraints, because the algorithm cannot effectively utilize more than O(√N) processors for matrices arising from two-dimensional constant node-degree graphs. Recently, a number of schemes with two-dimensional partitioning of the matrix have been proposed [18, 53, 52, 18, 54, 1, 44, 58, 16, 33, 42, 4, 59]. The least total communication volume in most of these schemes is Θ(N√p log p) (box C).

Most researchers so far have analyzed parallel sparse matrix factorization in terms of the total communication volume. It is noteworthy that, on any parallel architecture, the total communication volume is only a lower bound on the overall communication overhead. It is the total communication overhead that actually determines the overall efficiency and speedup, and is defined as the difference between the parallel processor-time product and the serial run time [23, 34]. The communication overhead can be asymptotically higher than the communication volume. For example, a one-to-all broadcast algorithm based on a binary tree communication pattern has a total communication volume of m(p − 1) for broadcasting m words of data among p processors. However, the broadcast takes log p steps of Θ(m) time each; hence, the total communication overhead is Θ(mp log p) (on a hypercube). In the context of matrix factorization, the experimental study by Ashcraft et al. [3] serves to demonstrate the importance of studying the total communication overhead rather than the volume. In [3], the fan-in algorithm, which has a lower communication volume than the distributed multifrontal algorithm, has a higher overhead (and hence, a lower efficiency) than the multifrontal algorithm for the same distribution of the matrix among the processors.
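To make the distinction concrete, the broadcast example of the preceding paragraph can be written out as a one-line comparison (this merely restates the quantities given above; m, p, and the broadcast time are the same symbols as in the text):

\[
\underbrace{m(p-1)}_{\text{volume}}=\Theta(mp),
\qquad
T_{\text{broadcast}}=\Theta(m\log p),
\qquad
\underbrace{p\,T_{\text{broadcast}}}_{\text{overhead}}=\Theta(mp\log p).
\]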

The performance and scalability analysis of our algorithm is supported by experimental results on up to 1024 processors of nCUBE2 [24] and Cray T3D parallel computers. We have been able to achieve speedups of up to 364 on 1024 processors and 230 on 512 processors over a highly efficient sequential implementation for moderately sized problems from the Harwell-Boeing collection [6]. In [29], we have applied this algorithm to obtain a highly scalable parallel formulation of interior point algorithms and have observed significant speedups in solving linear programming problems. On the Cray T3D, we have been able to achieve up to 20 GFlops on medium-size structural engineering and linear programming problems. To the best of our knowledge, this is the first parallel implementation of sparse Cholesky factorization that has delivered speedups of this magnitude and has been able to benefit from several hundred processors.

In summary, researchers have improved the simple parallel algorithm with Θ(Np log p) communication volume (box A) along two directions—one by improving the mapping of matrix columns onto processors (box B) and the other by splitting the matrix along both rows and columns (box C). In this paper, we describe a parallel implementation of sparse matrix factorization that combines the benefits of improvements along both these lines. The total communication overhead of our algorithm is only O(N√p) for factoring an N × N matrix on p processors if it corresponds to a graph that satisfies the O(√n)-separator criterion. Our algorithm reduces the communication overhead by a factor of O(log p) over the best algorithm implemented to date. Furthermore, as we show in Section 5, this reduction in communication overhead by a factor of O(log p) results in an improvement in the scalability of the algorithm by a factor of O((log p)^3); i.e., the rate at which the problem size must increase with the number of processors to maintain a constant efficiency is lower by a factor of O((log p)^3).

Footnote 2: [4] and [59] could be possible exceptions, but neither a detailed communication analysis nor any experimental results are available.


This can make the difference between the feasibility and non-feasibility of parallel sparse factorization on highly parallel computers.

The remainder of the paper is organized as follows. Section 2 describes the serial multifrontal algorithm for sparse matrix factorization. Section 3 describes our parallel algorithm based on multifrontal elimination. In Section 4, we derive expressions for the communication overhead of the parallel algorithm. In Section 5, we use the isoefficiency analysis [34, 36, 17] to determine the scalability of our algorithm and compare it with the scalability of other parallel algorithms for sparse matrix factorization. Section 6 contains experimental results on a Cray T3D parallel computer. Section 7 contains concluding remarks.

2 The Multifrontal Algorithm for Sparse Matrix Factorization

The multifrontal algorithm for sparse matrix factorization was proposed independently, and in somewhat different forms, by Speelpening [56] and Duff and Reid [7], and later elucidated in a tutorial by Liu [39]. In this section, we briefly describe a condensed version of multifrontal sparse Cholesky factorization.

Given a sparse matrix and the associated elimination tree, the multifrontal algorithm can be recursively formulated as shown in Figure 2. Consider the Cholesky factorization of an N × N sparse symmetric positive definite matrix A into LL^T, where L is a lower triangular matrix. The algorithm performs a postorder traversal of the elimination tree associated with A. There is a frontal matrix F_k and an update matrix U_k associated with any node k. The row and column indices of F_k correspond to the indices of row and column k of L in increasing order.

In the beginning, F_k is initialized to an (s + 1) × (s + 1) matrix, where s + 1 is the number of nonzeros in the lower triangular part of column k of A. The first row and column of this initial F_k is simply the upper triangular part of row k and the lower triangular part of column k of A. The remainder of F_k is initialized to all zeros. Line 2 of Figure 2 illustrates the initial F_k.

After the algorithm has traversed all the subtrees rooted at a node k, it ends up with a (t + 1) × (t + 1) frontal matrix F_k, where t is the number of nonzeros in the strictly lower triangular part of column k in L. The row and column indices of the final assembled F_k correspond to t + 1 (possibly) noncontiguous indices of row and column k of L in increasing order. If k is a leaf in the elimination tree of A, then the final F_k is the same as the initial F_k. Otherwise, the final F_k for eliminating node k is obtained by merging the initial F_k with the update matrices obtained from all the subtrees rooted at k via an extend-add operation. The extend-add is an associative and commutative operator on two update matrices such that the index set of the result is the union of the index sets of the original update matrices. Each entry in the original update matrices is mapped onto some location in the accumulated matrix. If entries from both matrices overlap on a location, they are added. Empty entries are assigned a value of zero. Figure 3 illustrates the extend-add operation.
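As an illustration of the extend-add operator defined above, the following C sketch merges two update matrices stored as dense lower-triangular blocks with sorted global index lists. The storage layout, the capacity constant MAXN, and the function name extend_add are assumptions made for this sketch, not part of the paper's implementation.

#include <assert.h>
#include <limits.h>
#include <string.h>

#define MAXN 64   /* capacity of one update matrix in this sketch (illustrative) */

/* An update matrix: a dense lower-triangular block whose rows and columns are
   labeled by 'n' sorted global indices; entry (i,j) with i >= j is stored at
   val[i*(i+1)/2 + j]. */
typedef struct {
    int    n;
    int    idx[2 * MAXN];
    double val[MAXN * (2 * MAXN + 1)];
} update_t;

/* r := extend-add of a and b.  The index set of r is the union of the index
   sets of a and b; entries that fall on the same global location are added,
   and all remaining locations are zero. */
static void extend_add(const update_t *a, const update_t *b, update_t *r)
{
    int map_a[2 * MAXN], map_b[2 * MAXN];   /* position of each source index in r */
    int ia = 0, ib = 0;

    assert(a->n + b->n <= 2 * MAXN);        /* the union must fit in r */
    r->n = 0;

    /* Merge the two sorted index lists to form their union. */
    while (ia < a->n || ib < b->n) {
        int ga = (ia < a->n) ? a->idx[ia] : INT_MAX;
        int gb = (ib < b->n) ? b->idx[ib] : INT_MAX;
        int g  = (ga < gb) ? ga : gb;
        if (ga == g) map_a[ia++] = r->n;
        if (gb == g) map_b[ib++] = r->n;
        r->idx[r->n++] = g;
    }

    /* Empty entries of the accumulated matrix start out as zero. */
    memset(r->val, 0, sizeof(double) * (size_t)(r->n * (r->n + 1) / 2));

    /* Scatter-add every entry of a and b onto its location in r. */
    for (int i = 0; i < a->n; i++)
        for (int j = 0; j <= i; j++)
            r->val[map_a[i] * (map_a[i] + 1) / 2 + map_a[j]] += a->val[i * (i + 1) / 2 + j];
    for (int i = 0; i < b->n; i++)
        for (int j = 0; j <= i; j++)
            r->val[map_b[i] * (map_b[i] + 1) / 2 + map_b[j]] += b->val[i * (i + 1) / 2 + j];
}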

After F_k has been assembled through the steps of lines 3–7 of Figure 2, a single step of the standard dense Cholesky factorization is performed with node k as the pivot (lines 8–12). At the end of the elimination step, the column with index k is removed from F_k and forms column k of L. The remaining t × t matrix is called the update matrix U_k and is passed on to the parent of k in the elimination tree.

The multifrontal algorithm is further illustrated in a step-by-step fashion in Figure 5 for factoring the matrix of Figure 4(a).


/* A is the sparse N x N symmetric positive definite matrix to be factored.  L is the
   lower triangular matrix such that A = LL^T after factorization.  A = (a_{i,j}) and
   L = (l_{i,j}), where 0 <= i, j < N.  Initially, l_{i,j} = 0 for all i, j. */

1.  begin function Factor(k)
2.      F_k :=  [ a_{k,k}     a_{k,q_1}   a_{k,q_2}   ...   a_{k,q_t}
                  a_{q_1,k}   0           0           ...   0
                  a_{q_2,k}   0           0           ...   0
                  ...         ...         ...         ...   ...
                  a_{q_t,k}   0           0           ...   0        ];
3.      for all i such that Parent(i) = k in the elimination tree of A, do
4.      begin
5.          Factor(i);
6.          F_k := Extend_add(F_k, U_i);
7.      end
        /* At this stage, F_k is a (t+1) x (t+1) matrix, where t is the number of
           nonzeros in the subdiagonal part of column k of L.  U_k is a t x t matrix.
           Assume that an index i of F_k or U_k corresponds to the index q_i of A and L. */
8.      for i := 0 to t do
9.          l_{q_i,k} := F_k(i,0) / sqrt(F_k(0,0));
10.     for j := 1 to t do
11.         for i := j to t do
12.             U_k(i,j) := F_k(i,j) - l_{q_i,k} * l_{q_j,k};
13. end function Factor.

Figure 2: An elimination-tree-guided recursive formulation of the serial multifrontal algorithm for Cholesky factorization of a sparse symmetric positive definite matrix A into LL^T, where L is a lower-triangular matrix. If r is the root of the postordered elimination tree of A, then a call to Factor(r) factors the matrix A.


Figure 3: The extend-add operation on two 3 × 3 triangular matrices. It is assumed that i_0 < i_1 < i_2 < i_3.

Figure 4: A symmetric sparse matrix and the associated elimination tree with subtree-to-subcube mapping onto 8 processors. The nonzeros in the original matrix and the fill-ins generated during factorization are marked with distinct symbols.

3 A Parallel Multifrontal Algorithm

In this section we describe the parallel multifrontal algorithm. We assume a hypercube interconnection network; however, the algorithm can also be adapted for a mesh topology (Section 3.2) without any increase in the asymptotic communication overhead. On other architectures as well, such as those of the CM-5, Cray T3D, and IBM SP-2, the asymptotic expression for the communication overhead remains the same. In this paper, we use the term relaxed supernode for a group of consecutive nodes in the elimination tree with one child. Henceforth, any reference to the height, depth, or levels of the tree will be with respect to the relaxed supernodal tree. For the sake of simplicity, we assume that the relaxed supernodal elimination tree is a binary tree up to the top log p relaxed supernodal levels. Any elimination tree can be converted to a binary relaxed supernodal tree suitable for parallel multifrontal elimination by a simple preprocessing step described in detail in [24].

In order to factorize the sparse matrix in parallel, portions of the elimination tree are assigned to processors using the standard subtree-to-subcube assignment strategy. This assignment is illustrated in Figure 4(b) for eight processors.


Figure 5: Steps in serial multifrontal Cholesky factorization of the matrix shown in Figure 4(a): (a) the initial frontal matrices; (b) the update matrices after one step of elimination in each frontal matrix; (c) the frontal matrices at the second level of the elimination tree; (d) the update matrices after elimination of nodes 2, 5, 11, and 14; (e) the factorization of the remainder of the matrix. The symbol "+" denotes an extend-add operation. The nonzeros in the original matrix and the fill-ins are marked with distinct symbols.


With subtree-to-subcube assignment, all processors in the system cooperate to factorize the frontal matrix associated with the topmost relaxed supernode of the elimination tree. The two subtrees of the root are assigned to subcubes of p/2 processors each. Each subtree is further partitioned recursively using the same strategy. Thus, the subtrees at a depth of log p relaxed supernodal levels are each assigned to individual processors. Each processor can work on this part of the tree completely independently without any communication overhead. A call to the function Factor given in Figure 2 with the root of a subtree as the argument generates the update matrix associated with that subtree. This update matrix contains all the information that needs to be communicated from the subtree in question to other columns of the matrix.
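A minimal sketch of the subtree-to-subcube assignment just described, under the assumption that the relaxed supernodal tree is binary for the top log p levels; the node structure and the half-open processor ranges are illustrative choices, not the paper's data structures.

/* Relaxed supernodal elimination tree node; assumed binary for the top log p levels. */
typedef struct node {
    struct node *left, *right;   /* NULL below the top log p levels */
    int first_proc, nprocs;      /* processors [first_proc, first_proc + nprocs) */
} node_t;

/* Assign the processor range [first, first + n) to the subtree rooted at t:
   all n processors cooperate on t's relaxed supernode, and each half of the
   range recurses into one child.  When n == 1, the whole subtree below t is
   factored sequentially by that single processor (the independent phase). */
static void assign_subtree_to_subcube(node_t *t, int first, int n)
{
    if (t == NULL) return;
    t->first_proc = first;
    t->nprocs     = n;
    if (n == 1) return;                       /* local, communication-free phase */
    assign_subtree_to_subcube(t->left,  first,         n / 2);
    assign_subtree_to_subcube(t->right, first + n / 2, n / 2);
}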

After the independent factorization phase, pairs of processors (P_{2j} and P_{2j+1} for 0 ≤ j < p/2) perform a parallel extend-add on their update matrices, say X and Y, respectively. At the end of this parallel extend-add operation, P_{2j} and P_{2j+1} roughly equally share X + Y. Here, and in the remainder of this paper, the sign "+" in the context of matrices denotes an extend-add operation. More precisely, all even columns of X + Y go to P_{2j} and all odd columns of X + Y go to P_{2j+1}. At the next level, subcubes of two processors each perform a parallel extend-add. Each subcube initially has one update matrix. The matrix resulting from the extend-add on these two update matrices is now merged and split among four processors. To effect this split, all even rows are moved to the subcube with the lower processor labels, and all odd rows are moved to the subcube with the higher processor labels. During this process, each processor needs to communicate only once with its counterpart in the other subcube. After this (second) parallel extend-add, each of the processors has a block of the update matrix roughly one-fourth the size of the whole update matrix. Note that both the rows and the columns of the update matrix are distributed among the processors in a cyclic fashion. Similarly, in subsequent parallel extend-add operations, the update matrices are alternately split along the columns and rows.

Assume that the levels of the binary relaxed supernodal elimination tree are labeled starting with 0 at the top, as shown in Figure 4(b). In general, at level l of the relaxed supernodal elimination tree, p/2^l processors work on a single frontal or update matrix. These processors form a logical 2^⌈(log p − l)/2⌉ × 2^⌊(log p − l)/2⌋ mesh. All update and frontal matrices at this level are distributed on this mesh of processors. The cyclic distribution of rows and columns of these matrices among the processors helps maintain load balance. The distribution also ensures that a parallel extend-add operation can be performed with each processor exchanging roughly half of its data only with its counterpart processor in the other subcube. This distribution is fairly straightforward to maintain. For example, during the first two parallel extend-add operations, columns and rows of the update matrices are distributed depending on whether their least significant bit (LSB) is 0 or 1. Indices with LSB = 0 go to the lower subcube and those with LSB = 1 go to the higher subcube. Similarly, in the next two parallel extend-add operations, columns and rows of the update matrices are exchanged among the processors depending on the second LSB of their indices.
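The LSB-based redistribution rule of this paragraph can be captured by a small helper (the function name and the step counter s are illustrative): at the s-th parallel extend-add, bit ⌊s/2⌋ of a global index decides between the lower and the higher subcube, with columns split at even s and rows at odd s, matching the alternating column/row splits described above.

/* Returns 1 if the given global row/column index moves to (or stays in) the
   higher-numbered subcube at the s-th parallel extend-add, 0 for the lower one.
   Steps s = 0, 1 use the least significant bit of the index, steps s = 2, 3 use
   the second least significant bit, and so on. */
static int goes_to_higher_subcube(int global_index, int s)
{
    return (global_index >> (s / 2)) & 1;
}

/* At even s the columns of the merged update matrix are split by this rule, at
   odd s the rows are split; after 2d extend-adds both rows and columns are
   distributed cyclically over a 2^d x 2^d logical processor mesh. */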

Figure 6 illustrates all the parallel extend-add operations that take place during parallel multifrontal factorization of the matrix shown in Figure 4. The portion of an update matrix that is sent out by its original owner processor is shown in grey. Hence, if processors P_i and P_j with respective update matrices X and Y perform a parallel extend-add, then the final result at P_i will be the add-extension of the white portion of X and the grey portion of Y. Similarly, the final result at P_j will be the add-extension of the grey portion of X and the white portion of Y. Figure 7 further illustrates this process by showing four consecutive extend-add operations on hypothetical update matrices to distribute the result among 16 processors.

Between two successive parallel extend-add operations, several steps of dense Cholesky elimination may be performed. The number of such successive elimination steps is equal to the number of nodes in the relaxed supernode being processed. The communication that takes place in this phase is the standard communication in pipelined grid-based dense Cholesky factorization [46, 34].


Figure 6: Extend-add operations on the update matrices during parallel multifrontal factorization of the matrix shown in Figure 4(a) on eight processors: (a) update matrices before the first parallel extend-add operation; (b) after the first parallel extend-add operation; (c) before the second parallel extend-add operation; (d) after the second parallel extend-add operation; (e) before the third parallel extend-add operation; (f) after the third parallel extend-add operation. P_i M denotes the part of the matrix M that resides on processor number i. M may be an update matrix or the result of performing an extend-add on two update matrices. The shaded portions of a matrix are sent out by a processor to its communication partner in that step.


Figure 7: Four successive parallel extend-add operations (denoted by "+") on hypothetical update matrices for multifrontal factorization on 16 processors, numbered from 0 to 15: (a) update matrices before the first parallel extend-add; (b) before the second parallel extend-add; (c) before the third parallel extend-add; (d) before the fourth parallel extend-add; (e) the final distribution on 16 processors. The number inside a box denotes the number of the processor that owns the matrix element represented by the box.


Figure 8: The two communication operations (horizontal and vertical) involved in a single elimination step (index of pivot = 0 here) of Cholesky factorization on a 12 × 12 frontal matrix distributed over 16 processors. The communication takes place along the rows and columns of the logical processor mesh, which form subcubes (0,1,4,5), (2,3,6,7), (8,9,12,13), (10,11,14,15) and (0,2,8,10), (1,3,9,11), (4,6,12,14), (5,7,13,15) of the hypercube.

If the average size of the frontal matrices is t × t during the processing of a relaxed supernode with m nodes on a q-processor subcube, then O(m) messages of size O(t/√q) are passed through the grid in a pipelined fashion. Figure 8 shows the communication for one step of dense Cholesky factorization of a hypothetical frontal matrix for q = 16. It is shown in [35] that although this communication does not take place between the nearest neighbors on a subcube, the paths of all communications on any subcube are conflict-free with e-cube routing [45, 34] and cut-through or worm-hole flow control. This is a direct consequence of the fact that a circular shift is conflict-free on a hypercube with e-cube routing. Thus, a communication pipeline can be maintained among the processors of a subcube during the dense Cholesky factorization of frontal matrices.

3.1 Block-cyclic mapping of matrices onto processors

In the parallel multifrontal algorithm described in this section, the rows and columns of frontal and update matrices are distributed among the processors of a subcube in a cyclic manner. For example, the distribution of a matrix with indices from 0 to 11 on a 16-processor subcube is shown in Figure 7(e). The 16 processors form a logical mesh. The arrangement of the processors in the logical mesh is shown in Figure 9(a). In the distribution of Figure 7(e), consecutive rows and columns of the matrix are mapped onto neighboring processors of the logical mesh. If there are more rows and columns in the matrix than the number of processors in a row or column of the processor mesh, then the rows and columns of the matrix are wrapped around on the mesh.

Although the mapping shown in Figure 7(e) results in a very good load balance among the processors, it has a disadvantage. Notice that while performing the steps of Cholesky factorization on the matrix shown in Figure 7(e), the computation corresponding to consecutive pivots starts on different processors; for example, pivot 0 on processor 0, pivot 1 on processor 3, pivot 2 on processor 12, pivot 3 on processor 15, and so on. If the message startup time is high, this may lead to significant delays between the stages of the pipeline. Furthermore, on cache-based processors, the use of BLAS-3 for eliminating multiple columns simultaneously yields much higher performance than the use of BLAS-2 for eliminating one column at a time.


Figure 9: Block-cyclic mapping of a 12 × 12 matrix on a logical processor mesh of 16 processors: (a) a logical mesh of 16 processors, arranged in rows 0 1 4 5; 2 3 6 7; 8 9 12 13; 10 11 14 15; (b) block-cyclic mapping with 2 × 2 blocks.

Figure 9(b) shows a variation of the cyclic mapping, called block-cyclic mapping [34], that can alleviate these problems at the cost of some added load imbalance.

Recall that in the mapping of Figure 7(e), the least significant ⌈(log p)/2⌉ bits of a row or column index of the matrix determine the processor to which that row or column belongs. Now if we disregard the least significant bit, and determine the distribution of rows and columns by the ⌈(log p)/2⌉ bits starting with the second least significant bit, then the mapping of Figure 9(b) will result. In general, we can disregard the first k least significant bits, and arrive at a block-cyclic mapping with a block size of 2^k × 2^k. The optimal value of k depends on the ratio of the computation time and the communication latency of the parallel computer in use and may vary from one computer to another for the same problem. In addition, increasing the block size too much may cause too much load imbalance during the dense Cholesky steps and may offset the advantage of using BLAS-3.
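A sketch of the owner computation implied by this discussion, assuming a square dim × dim logical mesh (dim a power of two) and returning only mesh coordinates rather than a physical processor label; with k = 0 it reduces to the purely cyclic mapping of Figure 7(e).

/* Mesh coordinates of the processor that owns element (row, col) under a
   block-cyclic mapping with block size 2^k on a dim x dim logical mesh
   (dim a power of two).  With k = 0 this is the plain cyclic mapping. */
static void block_cyclic_owner(int row, int col, int k, int dim,
                               int *mesh_row, int *mesh_col)
{
    *mesh_row = (row >> k) & (dim - 1);   /* drop the k least significant bits,  */
    *mesh_col = (col >> k) & (dim - 1);   /* then wrap around the mesh dimension */
}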

3.2 Subtree-to-submesh mapping for the 2-D mesh architecture

The mapping of rows and columns described so far works fine for the hypercube network. At each level, the update and frontal matrices are distributed on a logical mesh of processors (e.g., Figure 9(a)) such that each row and column of this mesh is a subcube of the hypercube. However, if the underlying architecture is a mesh, then a row or a column of the logical mesh may not correspond to a row or a column of the physical mesh. This will lead to contention for communication channels during the pipelined dense Cholesky steps of Figure 8 on a physical mesh. To avoid this contention for communication channels, we define a subtree-to-submesh mapping in this subsection. The subtree-to-subcube mapping described in Figure 4(b) ensures that any subtree of the relaxed supernodal elimination tree is mapped onto a subcube of the physical hypercube. This helps in localizing communication at each stage of factorization among groups of as few processors as possible. Similarly, the subtree-to-submesh mapping ensures that a subtree is mapped entirely within a submesh of the physical mesh.

Note that in subtree-to-subcube mapping for a p-processor hypercube, all level-(log p) subtrees of the relaxed supernodal elimination tree are numbered in increasing order from left to right, and a subtree labeled i is mapped onto processor i. For example, the subtree labeling of Figure 10(a) results in the update and frontal matrices for the supernodes in the topmost (level-0) relaxed supernode being distributed among 16 processors as shown in Figure 9(a).


Figure 10: Labeling of subtrees in subtree-to-subcube (a) and subtree-to-submesh (b) mappings. In (a), the level-4 subtrees are labeled 0 through 15 from left to right, and the level-0 supernode is distributed on the logical mesh with rows 0 1 4 5; 2 3 6 7; 8 9 12 13; 10 11 14 15. In (b), the relabeled level-4 subtrees result in the level-0 supernode being distributed on a 4 × 4 mesh labeled in row-major order, with rows 0 1 2 3; 4 5 6 7; 8 9 10 11; 12 13 14 15.

The subtree-to-submesh mapping starts with a different initial labeling of the level-(log p) subtrees. Figure 10(b) shows this labeling for 16 processors, which will result in the update and frontal matrices of the topmost relaxed supernode being partitioned on a 4 × 4 array of processors labeled in a row-major fashion.

We now define a function map() such that replacing every reference to processor i in the subtree-to-subcube mapping by a reference to processor map(i, m, n) results in a subtree-to-submesh mapping on an m × n mesh. We assume that both m and n are powers of two. We also assume that either n = m or n = 2m (this configuration maximizes the cross-section width and minimizes the diameter of an mn-processor mesh). The function map(i, m, n) is given by the following recurrence:

map(i, m, n) = i,                                                         if i < 2;
map(i, m, n) = map(i, m/2, n),                                            if m = n and i < mn/2;
map(i, m, n) = mn/2 + map(i − mn/2, m/2, n),                              if m = n and i ≥ mn/2;
map(i, m, n) = m⌊map(i, m, n/2)/m⌋ + map(i, m, n/2),                      if n = 2m and i < mn/2;
map(i, m, n) = m(⌊map(i − mn/2, m, n/2)/m⌋ + 1) + map(i − mn/2, m, n/2),  if n = 2m and i ≥ mn/2.

The above recurrence always maps a level-l relaxed supernode of a binary relaxed supernodal elimination tree onto an (mn/2^l)-processor submesh of the mn-processor two-dimensional mesh.
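The recurrence translates directly into the following C routine (a transcription of the recurrence as given above, under the same assumptions that m and n are powers of two and either n = m or n = 2m); for a 4 × 4 mesh it relabels the logical mesh of Figure 9(a) into the row-major arrangement described for Figure 10(b).

/* Subtree-to-submesh relabeling: processor i of the subtree-to-subcube mapping
   is replaced by processor map_submesh(i, m, n) on an m x n mesh whose
   processors are numbered in row-major order. */
static int map_submesh(int i, int m, int n)
{
    int x;
    if (i < 2)
        return i;
    if (m == n) {                               /* split into two (m/2) x n halves */
        if (i < m * n / 2)
            return map_submesh(i, m / 2, n);
        return m * n / 2 + map_submesh(i - m * n / 2, m / 2, n);
    }
    /* n == 2m: split into two m x (n/2) halves placed side by side */
    if (i < m * n / 2) {
        x = map_submesh(i, m, n / 2);
        return m * (x / m) + x;                 /* left  m x m submesh */
    }
    x = map_submesh(i - m * n / 2, m, n / 2);
    return m * (x / m + 1) + x;                 /* right m x m submesh */
}

/* Example: map_submesh(i, 4, 4) for i = 0..15 yields
   0 1 4 5 2 3 6 7 8 9 12 13 10 11 14 15, so the logical mesh of
   Figure 9(a) becomes the row-major 4 x 4 physical mesh. */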


4 Analysis of Communication Overhead

In this section, we derive expressions for the communication overhead of our algorithm for sparse matrices resulting from a finite difference operator on regular two- and three-dimensional grids. Within constant factors, these expressions can be generalized to all sparse matrices that are adjacency matrices of graphs whose n-node subgraphs have O(√n)-node and O(n^(2/3))-node separators, respectively. This is because the properties of separators can be generalized from grids to all such graphs within the same order of magnitude bounds [38, 37, 13]. We derive these expressions for both hypercube and mesh architectures, and also extend the results to sparse matrices resulting from three-dimensional graphs whose n-node subgraphs have O(n^(2/3))-node separators.

The parallel multifrontal algorithm described in Section 3 incurs two types of communication overhead: one during the parallel extend-add operations (Figure 7) and the other during the steps of dense Cholesky factorization while processing the supernodes (Figure 8). Crucial to estimating the communication overhead is estimating the sizes of the frontal and update matrices at any level of the supernodal elimination tree.

Consider a √N × √N regular finite difference grid. We analyze the communication overhead for factorizing the N × N sparse matrix associated with this grid on p processors. In order to simplify the analysis, we assume a somewhat different form of nested dissection than the one used in the actual implementation. This method of analyzing the communication complexity of sparse Cholesky factorization has been used in [15] in the context of a column-based subtree-to-subcube scheme. Within very small constant factors, the analysis holds for the standard nested dissection [11] of grid graphs. We consider a cross-shaped separator (described in [15]) consisting of 2√N − 1 nodes that partitions the N-node square grid into four square subgrids of size (√N − 1)/2 × (√N − 1)/2. We call this the level-0 separator that partitions the original grid (or the level-0 grid) into four level-1 grids. The nodes in the separator are numbered after the nodes in each subgrid have been numbered. To number the nodes in the subgrids, they are further partitioned in the same way, and the process is applied recursively until all nodes of the original grid are numbered. The supernodal elimination tree corresponding to this ordering is such that each non-leaf supernode has four children. The topmost supernode has 2√N − 1 (≈ 2√N) nodes, and the size of the supernodes at each subsequent level of the tree is half of the supernode size at the previous level. Clearly, the number of supernodes increases by a factor of four at each level, starting with one at the top (level 0).

The nested dissection scheme described above has the following properties: (1) the size of the level-l subgrids is approximately √N/2^l × √N/2^l, and (2) the number of nodes in a level-l separator is approximately 2√N/2^l; hence, the length of a supernode at level l of the supernodal elimination tree is approximately 2√N/2^l. It has been proved in [15] that the number of nonzeros that an i × i subgrid can contribute to the nodes of its bordering separators is bounded by ki^2, where k = 341/12. Hence, a level-l subgrid can contribute at most kN/4^l nonzeros to its bordering nodes. These nonzeros are in the form of the triangular update matrix that is passed along from the root of the subtree corresponding to the subgrid to its parent in the elimination tree. The dimension of a matrix with a dense triangular part containing kN/4^l entries is roughly √(2kN)/2^l × √(2kN)/2^l. Thus, the size of an update matrix passed on to level l − 1 of the supernodal elimination tree from level l is roughly upper-bounded by √(2kN)/2^l × √(2kN)/2^l for l ≥ 1.

The size of a level-l supernode is 2√N/2^l; hence, a total of 2√N/2^l elimination steps take place while the computation proceeds from the bottom of a level-l supernode to its top. A single elimination step on a frontal matrix of size (t + 1) × (t + 1) produces an update matrix of size t × t.


Since the size of an update matrix at the top of a level-l supernode is at most √(2kN)/2^l × √(2kN)/2^l, the size of the frontal matrix at the bottom of the same supernode is upper-bounded by (√(2kN) + 2√N)/2^l × (√(2kN) + 2√N)/2^l. Hence, the average size of a frontal matrix at level l of the supernodal elimination tree is upper-bounded by (√(2k) + 1)√N/2^l × (√(2k) + 1)√N/2^l. Let c = √(2k) + 1. Then c√N/2^l × c√N/2^l is an upper bound on the average size of a frontal matrix at level l.

We are now ready to derive expressions for the communication overhead due to the parallel extend-add operations and the elimination steps of dense Cholesky on the frontal matrices.

4.1 Overhead in parallel extend-add

Before the computation corresponding to level l − 1 of the supernodal elimination tree starts, a parallel extend-add operation is performed on the lower triangular portions of the update matrices of size √(2kN)/2^l × √(2kN)/2^l, each of which is distributed on a (√p/2^l) × (√p/2^l) logical mesh of processors. Thus, each processor holds roughly (kN/4^l)/(p/4^l) = kN/p elements of an update matrix. Assuming that each processor exchanges roughly half of its data with the corresponding processor of another subcube, approximately t_s + t_w kN/(2p) time is spent in communication, where t_s is the message startup time and t_w is the per-word transfer time. Note that this time is independent of l. Since there are (log p)/2 levels at which parallel extend-add operations take place, the total communication time for these operations is O((N/p) log p) on a hypercube. The total communication overhead due to the parallel extend-add operations is O(N log p) on a hypercube.

4.2 Overhead in factorization steps

We have shown earlier that the average size of a frontal matrix at level l of the supernodal elimination tree is bounded by c√N/2^l × c√N/2^l, where c = √(2k) + 1. This matrix is distributed on a (√p/2^l) × (√p/2^l) logical mesh of processors. As shown in Figure 8, there are two communication operations involved with each elimination step of dense Cholesky. The average size of a message is (c√N/2^l)/(√p/2^l) = c√(N/p). It can be shown [46, 34] that in a pipelined implementation on a √q × √q mesh of processors, the communication time for s elimination steps with an average message size of m is O(ms). The reason is that although each message must go to √q processors, messages corresponding to √q elimination steps are active simultaneously in different parts of the mesh. Hence, each message effectively contributes only O(m) to the total communication time. In our case, at level l of the supernodal elimination tree, the number of steps of dense Cholesky is 2√N/2^l. Thus the total communication time at level l is O(c√(N/p) × 2√N/2^l) = O((N/√p)(1/2^(l−1))). The total communication time for the elimination steps at the top (log p)/2 levels of the supernodal elimination tree is O((N/√p) Σ_{l=0}^{(log p)/2 − 1} 1/2^(l−1)). This has an upper bound of O(N/√p). Hence, the total communication overhead due to the elimination steps is O(p × N/√p) = O(N√p).
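Writing the geometric sum of the last few sentences out explicitly (the same quantities as in the text, with no new assumptions):

\[
\sum_{l=0}^{\frac{\log p}{2}-1} O\!\left(c\sqrt{\frac{N}{p}}\cdot\frac{2\sqrt{N}}{2^{l}}\right)
= O\!\left(\frac{N}{\sqrt{p}}\sum_{l=0}^{\frac{\log p}{2}-1}\frac{1}{2^{\,l-1}}\right)
= O\!\left(\frac{N}{\sqrt{p}}\right),
\qquad
p\cdot O\!\left(\frac{N}{\sqrt{p}}\right)=O\!\left(N\sqrt{p}\right).
\]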

The parallel multifrontal algorithm incurs an additional overhead of emptying the pipeline O(log p) times (once before each parallel extend-add) and then refilling it. It can be easily shown that the time spent refilling the pipeline before the supernodes at level l is O(√N/2^l). Hence, the overall overhead due to restarting the pipeline O(log p) times is O(√N p), which is smaller in magnitude than the O(N√p) communication overhead of the dense Cholesky elimination steps.

4.3 Communication overhead for 3-D problems

The analysis of the communication complexity for the sparse matrices arising out of three-dimensional finite element problems can be performed along the lines of the analysis for the case of two-dimensional grids.


Consider an N^(1/3) × N^(1/3) × N^(1/3) grid that is recursively partitioned into eight subgrids by a separator that consists of three orthogonal N^(1/3) × N^(1/3) planes. The number of nonzeros that an i × i × i subgrid contributes to the nodes of its bordering separators is O(i^4) [15]. At level l, due to l bisections, i is no more than N^(1/3)/2^l. As a result, an update or a frontal matrix at level l of the supernodal elimination tree will contain O(N^(4/3)/2^(4l)) entries distributed among p/8^l processors. Thus, the communication time for the parallel extend-add operation at level l is O(N^(4/3)/(2^l p)). The total communication time for all parallel extend-add operations is O((N^(4/3)/p) Σ_{l=0}^{(log p)/3 − 1} 1/2^l), which is O(N^(4/3)/p). For the dense Cholesky elimination steps at any level, the message size is O(N^(2/3)/√p). Since there are O(N^(2/3)/4^l) nodes in a level-l separator, the total communication time for the elimination steps is O((N^(4/3)/√p) Σ_{l=0}^{(log p)/3 − 1} 1/4^l), which is O(N^(4/3)/√p). Hence, the total communication overhead due to the parallel extend-add operations is O(N^(4/3)) and that due to the dense Cholesky elimination steps is O(N^(4/3)√p). As in the 2-D case, these asymptotic expressions can be generalized to sparse matrices resulting from three-dimensional graphs whose n-node subgraphs have O(n^(2/3))-node separators. This class includes the linear systems arising out of three-dimensional finite element problems.

4.4 Communication overhead on a mesh

The communication overhead due to the dense Cholesky elimination steps is the same on both the mesh and the hypercube architectures because the frontal matrices are distributed on a logical mesh of processors. However, the parallel extend-add operations use the entire cross-section bandwidth of a hypercube, and the communication overhead due to them will increase on a mesh due to channel contention.

Recall from Section 4.1 that the communication time for a parallel extend-add at any level is O(N/p) on a hypercube. The extend-add is performed among groups of p/4^l processors at level l of the supernodal elimination tree. Therefore, at level l, the communication time for a parallel extend-add on a (√p/2^l) × (√p/2^l) submesh is O((N/p)(√p/2^l)) = O(N/(2^l √p)). The total communication time for all the levels is O((N/√p) Σ_{l=0}^{(log p)/2 − 1} 1/2^l). This has an upper bound of O(N/√p), and the upper bound on the corresponding communication overhead term is O(N√p). This is the same as the total communication overhead for the elimination steps. Hence, for two-dimensional problems, the overall asymptotic communication overhead is the same for both mesh and hypercube architectures.

The communication time on a hypercube for the parallel extend-add operation at level l is O(N^(4/3)/(2^l p)) for three-dimensional problems (Section 4.3). The corresponding communication time on a mesh would be O(N^(4/3)/(4^l √p)). The total communication time for all the parallel extend-add operations is O((N^(4/3)/√p) Σ_{l=0}^{(log p)/3 − 1} 1/4^l), which is O(N^(4/3)/√p). As in the case of two-dimensional problems, this is asymptotically equal to the communication time for the elimination steps.

5 Scalability Analysis

The scalability of a parallel algorithm on a parallel architecture refers to the capacity of the algorithm-architecture combination to effectively utilize an increasing number of processors. In this section we use the isoefficiency metric [34, 36, 17] to characterize the scalability of our algorithm. The isoefficiency function of a combination of a parallel algorithm and a parallel architecture relates the problem size to the number of processors necessary to maintain a fixed efficiency or to deliver speedups increasing proportionally with the increasing number of processors.


5.1 The isoefficiency metric of scalability

Let W be the size of a problem in terms of the total number of basic operations required to solve the problem on a serial computer. For example, W = Θ(N^2) for multiplying a dense N × N matrix with an N-vector. The serial run time of a problem of size W is given by T_S = t_c W, where t_c is the time to perform a single basic computation step. If T_P is the parallel run time of the same problem on p processors, then we define an overhead function T_O as pT_P − T_S. Both T_P and T_O are functions of W and p, and we often write them as T_P(W, p) and T_O(W, p), respectively. The efficiency of a parallel system with p processors is given by E = T_S/(T_S + T_O(W, p)). If a parallel system is used to solve a problem instance of a fixed size W, then the efficiency decreases as p increases. This is because the total overhead T_O(W, p) increases with p. For many parallel systems, for a fixed p, if the problem size W is increased, then the efficiency increases because, for a given p, T_O(W, p) grows slower than Θ(W). For these parallel systems, the efficiency can be maintained at a desired value (between 0 and 1) for increasing p, provided W is also increased. We call such systems scalable parallel systems. Note that for a given parallel algorithm, for different parallel architectures, W may have to increase at different rates with respect to p in order to maintain a fixed efficiency. As the number of processors is increased, the smaller the growth rate of problem size required to maintain a fixed efficiency, the more scalable the parallel system is.

Given that E = 1/(1 + T_O(W, p)/(t_c W)), in order to maintain a fixed efficiency, W should be proportional to T_O(W, p). In other words, the following relation must be satisfied in order to maintain a fixed efficiency:

    W = (K/t_c) T_O(W, p),        (1)

where K = E/(1 − E) is a constant depending on the efficiency to be maintained. Equation (1) is the central relation that is used to determine the isoefficiency function of a parallel algorithm-architecture combination. This is accomplished by abstracting W as a function of p through algebraic manipulations on Equation (1). If the problem size needs to grow as fast as f_E(p) to maintain an efficiency E, then f_E(p) is defined as the isoefficiency function of the parallel algorithm-architecture combination for efficiency E.

5.2 Scalability of the parallel multifrontal algorithm

It is well known [13] that the total work involved in factoring the adjacency matrix of an N-node graph with an O(√N)-node separator using nested dissection ordering of nodes is Θ(N^1.5). We have shown in Section 4 that the overall communication overhead of our scheme is O(N√p). From Equation (1), a fixed efficiency can be maintained if and only if N^1.5 ∝ N√p, or √N ∝ √p, or N^1.5 ∝ p^1.5. In other words, the problem size must be increased as Θ(p^1.5) to maintain a constant efficiency as p is increased. In comparison, a lower bound on the isoefficiency function of Rothberg and Gupta's scheme [53, 18], with a communication overhead of at least Θ(N√p log p), is Ω(p^1.5 (log p)^3). The isoefficiency function of any column-based scheme is at least Ω(p^3) because the total communication overhead has a lower bound of Ω(Np). Thus, the scalability of our algorithm is superior to that of the other schemes.

It is easy to show that the scalability of our algorithm is $O(p^{1.5})$ even for the sparse matrices arising out of three-dimensional finite element grids. The problem size in the case of an $n \times n$ sparse matrix resulting from a three-dimensional grid is $O(n^2)$ [15]. We have shown in Section 4.3 that the overall communication overhead in this case is $O(n^{4/3}\sqrt{p})$. To maintain a fixed efficiency, $n^{2} \propto n^{4/3}\sqrt{p}$, or $n^{2/3} \propto \sqrt{p}$, or $n^{2} \propto p^{1.5}$.
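
The two isoefficiency derivations above can be summarized compactly as follows (these are the same relations restated, not new results):

Two-dimensional case: $W = O(n^{1.5})$, $T_o = O(n\sqrt{p})$; $n^{1.5} \propto n\sqrt{p} \Rightarrow n \propto p \Rightarrow W \propto p^{1.5}$.
Three-dimensional case: $W = O(n^{2})$, $T_o = O(n^{4/3}\sqrt{p})$; $n^{2} \propto n^{4/3}\sqrt{p} \Rightarrow n \propto p^{3/4} \Rightarrow W \propto p^{1.5}$.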

A lower bound on the isoefficiency function of dense matrix factorization is $\Omega(p^{1.5})$ [34, 35] if the number of rank-1 updates performed by the serial algorithm is proportional to the rank of the matrix. The factorization of a sparse matrix derived from an $n$-node graph with an $S(n)$-node separator involves the factorization of a dense $S(n) \times S(n)$ matrix. $S(n)$ is $\Theta(\sqrt{n})$ and $\Theta(n^{2/3})$ for two- and three-dimensional constant node-degree graphs, respectively. Thus, the complexity of the dense portion of the factorization for these two types of matrices is $\Theta(n^{1.5})$ and $\Theta(n^{2})$, respectively, which is of the same order as the computation required to factor the entire sparse matrix [13, 15]. Therefore, the isoefficiency function of sparse factorization of such matrices is bounded from below by the isoefficiency function of dense matrix factorization, which is $\Omega(p^{1.5})$. As we have shown earlier in this section, our algorithm achieves this lower bound for both the two- and three-dimensional cases.

5.3 Scalability with respect to memory requirement

We have shown that the problem size must increase in proportion to $p^{1.5}$ for our algorithm to achieve a fixed efficiency. As the overall problem size increases, so does the overall memory requirement. For an $n$-node two-dimensional constant node-degree graph, the size of the lower triangular factor $L$ is $\Theta(n\log n)$ [13]. For a fixed efficiency, $n^{1.5} \propto p^{1.5}$, which implies $n \propto p$ and $n\log n \propto p\log p$. As a result, if we increase the number of processors while solving bigger problems to maintain a fixed efficiency, the overall memory requirement increases at the rate of $\Theta(p\log p)$, and the memory requirement per processor increases logarithmically with the number of processors.

In the three-dimensional case, the size of the lower triangular factor $L$ is $\Theta(n^{4/3})$ [13]. For a fixed efficiency, $n^{2} \propto p^{1.5}$, which implies $n \propto p^{3/4}$ and $n^{4/3} \propto p$. Hence, in this case the overall memory requirement increases linearly with the number of processors, and the per-processor memory requirement is constant for maintaining a fixed efficiency. It can easily be shown that, for the three-dimensional case, the isoefficiency function should not be of a higher order than $\Theta(p^{1.5})$ if speedups proportional to the number of processors are desired without increasing the memory requirement per processor. To the best of our knowledge, the algorithm described in Section 3 is the only parallel algorithm for sparse Cholesky factorization with an isoefficiency function of $\Theta(p^{1.5})$.
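
The per-processor memory growth rates quoted in this subsection follow directly from dividing the factor size by the number of processors (a restatement of the relations above):

Two-dimensional case: $|L| = \Theta(n\log n)$ and $n \propto p$, so $|L|/p = \Theta((p\log p)/p) = \Theta(\log p)$.
Three-dimensional case: $|L| = \Theta(n^{4/3})$ and $n \propto p^{3/4}$, so $|L|/p = \Theta(p/p) = \Theta(1)$.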

6 Experimental Results

We implemented our new parallel sparse multifrontal algorithm on a 1024-processor Cray T3D parallel computer. Each processor on the T3D is a 150 MHz DEC Alpha chip with a peak performance of 150 MFlops for 64-bit operations (double precision). However, the peak performance of most level-three BLAS routines is around 50 MFlops. The processors are interconnected via a three-dimensional torus network that has a peak unidirectional bandwidth of 150 MBytes per second and a very small latency. Even though the memory on the T3D is physically distributed, it can be addressed globally. That is, processors can directly access (read and/or write) other processors' memory. The T3D provides a library interface to this capability called SHMEM. We used SHMEM to develop a lightweight message passing system. Using this system, we were able to achieve unidirectional data transfer rates of up to 70 MBytes per second. This is significantly higher than the 35 MBytes per second channel bandwidth usually obtained when using the T3D's PVM.
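
The text does not show the message layer itself; the following is only a minimal sketch of the general idea (a one-sided put followed by a completion flag), written in C against the modern OpenSHMEM interface rather than the original Cray SHMEM library, with all function and variable names (msg_send, msg_recv, msg_buf) invented for illustration.

    /* Minimal sketch of a one-sided "send/receive" built on SHMEM puts.
     * Illustrative only: names are hypothetical and the original Cray
     * SHMEM calls (e.g., shmem_put64) differ from modern OpenSHMEM. */
    #include <shmem.h>
    #include <string.h>

    #define MSG_MAX 4096

    static double msg_buf[MSG_MAX];   /* symmetric buffer, exists on every PE */
    static long   msg_flag = 0;       /* set to 1 when a message has arrived  */

    /* Deposit n doubles directly into the destination PE's buffer. */
    static void msg_send(const double *data, size_t n, int dest_pe)
    {
        shmem_double_put(msg_buf, data, n, dest_pe);  /* one-sided copy     */
        shmem_fence();                                /* order put vs. flag */
        shmem_long_p(&msg_flag, 1, dest_pe);          /* signal completion  */
    }

    /* Wait until a message lands, then copy it out of the symmetric buffer. */
    static void msg_recv(double *data, size_t n)
    {
        shmem_long_wait_until(&msg_flag, SHMEM_CMP_EQ, 1);
        memcpy(data, msg_buf, n * sizeof(double));
        msg_flag = 0;                                 /* ready for the next message */
    }

    int main(void)
    {
        double x[4] = { 1.0, 2.0, 3.0, 4.0 }, y[4];

        shmem_init();
        if (shmem_n_pes() >= 2) {
            if (shmem_my_pe() == 0) msg_send(x, 4, 1);  /* PE 0 pushes to PE 1 */
            if (shmem_my_pe() == 1) msg_recv(y, 4);
        }
        shmem_finalize();
        return 0;
    }

The property exploited by such a layer is that a put completes without any action by the destination processor, which keeps the per-message software overhead low.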

For the computation performed during the dense Cholesky factorization, we used single-processor implementations of BLAS primitives. These routines are part of the standard scientific library on the T3D, and they have been fine-tuned for the Alpha chip. The new algorithm was tested on matrices from a variety of sources. Four matrices (BCSSTK30, BCSSTK31, BCSSTK32, and BCSSTK33) come from the Boeing-Harwell matrix set. MAROS-R7 is from a linear programming problem taken from NETLIB. COPTER2 comes from a model of a helicopter rotor. CUBE35 is a $35 \times 35 \times 35$ regular three-dimensional grid. NUG15 is from a linear programming problem derived from a quadratic assignment problem obtained from AT&T. In all of our experiments, we used spectral nested dissection [48] to order the matrices. The factorization algorithm described in this paper will work well with any type of nested dissection. In [21, 22, 20, 32, 30], we show that nested dissection orderings with proper selection of separators can yield better quality orderings than traditional heuristics, such as the multiple minimum degree heuristic.

The performance obtained by this algorithm on some of these matrices is shown in Table 1. The operation count shows only the number of operations required to factor the nodes of the elimination tree.

Figure 11 graphically represents the data shown in Table 1. Since all these problems run out of memory on one processor, the standard speedup and efficiency could not be computed experimentally.

                                                     GFlops on p processors
    Problem      n       |A|       |L|       OPC     16    32    64    128   256   512   1024
    PILOT87      2030   122550    504060     240M   0.32  0.44  0.73  1.05
    MAROS-R7     3136   330472   1345241     720M   0.48  0.83  1.41  2.14  3.02  4.07   4.48
    FLAP        51537   479620   4192304     940M   0.48  0.75  1.27  1.85  2.87  3.83   4.25
    BCSSTK33     8738   291583   2295377    1000M   0.49  0.76  1.30  1.94  2.90  4.36   6.02
    BCSSTK30    28924  1007284   5796797    2400M               1.48  2.42  3.59  5.56   7.54
    BCSSTK31    35588   572914   6415883    3100M         0.80  1.45  2.48  3.97  6.26   7.93
    BCSSTK32    44609   985046   8582414    4200M               1.51  2.63  4.16  6.91   8.90
    COPTER2     55476   352238  12681357    9200M   0.64  1.10  1.94  3.31  5.76  9.55  14.78
    CUBE35      42875   124950  11427033   10300M   0.67  1.27  2.26  3.92  6.46 10.33  15.70
    NUG15        6330   186075  10771554   29670M                     4.32  7.54 12.53  19.92

Table 1: The performance of sparse direct factorization on the Cray T3D. For each problem, the table contains the number of equations $n$ of the matrix $A$, the original number of nonzeros in $A$, the number of nonzeros in the Cholesky factor $L$, the number of operations (OPC) required to factor the nodes, and the performance in gigaflops for different numbers of processors.

The highest performance of 19.9 GFlops was obtained for NUG15, which is a fairly dense problem. Among the sparse problems, a performance of 15.7 GFlops was obtained for CUBE35, which is a regular three-dimensional problem. Nearly as high a performance (14.78 GFlops) was also obtained for COPTER2, which is irregular. Since both problems have a similar operation count, this shows that our algorithm performs equally well in factoring matrices arising in irregular problems. Focusing our attention on the other problems shown in Table 1, we see that even on smaller problems our algorithm performs quite well. For example, BCSSTK33 was able to achieve 2.90 GFlops on 256 processors, and BCSSTK30 achieved 3.59 GFlops.
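
Since a gigaflop rating is simply the operation count divided by the measured factorization time (and by $10^9$), approximate factorization times can be recovered from Table 1. For example, assuming the reported rates are computed from the listed operation counts, the 19.92 GFlops obtained for NUG15 together with its operation count of 29,670 million corresponds to a numerical factorization time of roughly $29.67\times 10^9 / (19.92\times 10^9) \approx 1.5$ seconds.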

To further illustrate how the various components of our algorithm work, we have included a breakdown of the various phases for BCSSTK31 and CUBE35 in Table 2. This table shows the average time spent by all the processors in the local computation and in the distributed computation. Furthermore, we break down the time taken by the distributed computation into two major phases: (a) dense Cholesky factorization, and (b) extend-add overhead. The latter includes the cost of performing the extend-add operation, splitting the stacks, transferring the stacks, and idling due to load imbalances in the subforest-to-subcube partitioning. Note that the figures in this table are averages over all processors, and they should be used only as an approximate indication of the time required for each phase.

Figure 11: Plot of the total gigaflops obtained by our parallel sparse multifrontal algorithm on various problems on a Cray T3D. (Plot not reproduced; x-axis: processors, 32 to 1024; y-axis: GigaFlops; one curve per problem of Table 1.)

A number of interesting observations can be made from this table. First, as the number of processors increases, the time spent processing the local tree on each processor decreases substantially because the subforest assigned to each processor becomes smaller. This trend is more pronounced for three-dimensional problems, because they tend to have fairly shallow trees. The cost of the distributed extend-add phase decreases almost linearly as the number of processors increases. This is consistent with the analysis presented in Section 4, since the time spent in the distributed extend-add steps is proportional to $(\log p)/p$ for a fixed problem. Since the expression for the time spent during the extend-add steps also includes the idling due to load imbalance, the almost linear decrease also shows that the load imbalance is quite small.

The time spent in the distributed dense Cholesky factorization decreases as the number of processors increases. This reduction is not linear with respect to the number of processors for two reasons: (a) the ratio of communication to computation during the dense Cholesky factorization steps increases, and (b) for a fixed-size problem, the load imbalance due to the block-cyclic mapping becomes worse as $p$ increases.

                                  Distributed Computation
    Problem       p    Local Comp.    Factorization    Extend-Add
    BCSSTK31     64       0.17            1.34            0.58
                128       0.06            0.90            0.32
                256       0.02            0.61            0.18
    CUBE35       64       0.15            3.74            0.71
                128       0.06            2.25            0.43
                256       0.01            1.44            0.24

Table 2: A breakdown of the various phases of the sparse multifrontal algorithm for BCSSTK31 and CUBE35. Each number represents time in seconds.

For the reasons discussed in Section 3.1, we distributed the frontal matrices in a block-cyclic fashion. To get good performance out of the level-three BLAS routines on the Cray T3D, we used a block size of sixteen (block sizes of less than sixteen result in a degradation of level-3 BLAS performance on the Cray T3D). For small problems, such a large block size results in a significant load imbalance within the dense factorization phase. This load imbalance becomes worse as the number of processors increases. However, as the size of the problem increases, both the communication overhead during the dense Cholesky factorization and the load imbalance due to the block-cyclic mapping become less significant. The reason is that larger problems usually have larger frontal matrices at the top levels of the elimination tree, so even large processor grids can be effectively utilized to factor them. This is illustrated by comparing how the various overheads decrease for BCSSTK31 and CUBE35. For example, for BCSSTK31, the factorization on 128 processors is only 48% faster than on 64 processors, while for CUBE35, the factorization on 128 processors is 66% faster than on 64 processors.
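
For illustration, the two-dimensional block-cyclic assignment with block size 16 amounts to a simple index computation; the helper below is a hypothetical sketch (not code from our implementation) that maps an entry of a frontal matrix to its owner on a logical $p_r \times p_c$ processor grid.

    /* Hypothetical sketch of a 2-D block-cyclic owner computation for a dense
     * frontal matrix on a pr x pc logical processor grid with block size b
     * (the implementation described above uses b = 16). */
    #include <stdio.h>

    typedef struct { int prow, pcol; } Owner;

    static Owner owner_of(int i, int j, int b, int pr, int pc)
    {
        Owner o;
        o.prow = (i / b) % pr;   /* block row  i/b dealt cyclically over grid rows    */
        o.pcol = (j / b) % pc;   /* block col  j/b dealt cyclically over grid columns */
        return o;
    }

    int main(void)
    {
        const int b = 16, pr = 4, pc = 4;   /* 16 processors arranged as a 4 x 4 grid */
        const int probe[3][2] = { { 0, 0 }, { 17, 40 }, { 100, 63 } };

        for (int k = 0; k < 3; k++) {
            Owner o = owner_of(probe[k][0], probe[k][1], b, pr, pc);
            printf("entry (%3d,%3d) -> processor (%d,%d)\n",
                   probe[k][0], probe[k][1], o.prow, o.pcol);
        }
        return 0;
    }

With a block size of 16, any frontal matrix with fewer than $16 p_r$ rows or $16 p_c$ columns leaves part of the processor grid without blocks, which is the source of the load imbalance discussed above.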

7 Concluding Remarks

In this paper, we analytically and experimentally demonstrate that scalable parallel implementations of direct methods for solving large sparse systems are possible. We describe an implementation on the Cray T3D that yields up to 20 GFlops on medium-size problems. We use the isoefficiency metric [34, 36, 17] to characterize the scalability of our algorithm. We show that the isoefficiency function of our algorithm is $O(p^{1.5})$ on hypercube and mesh architectures for sparse matrices arising out of both two- and three-dimensional problems. We also show that $O(p^{1.5})$ is asymptotically the best possible isoefficiency function for a parallel implementation of any direct method for solving a system of linear equations, either sparse or dense. In [55], Schreiber concludes that it is not yet clear whether sparse direct solvers can be made competitive at all for highly and massively parallel computers. We hope that, through this paper, we have given an affirmative answer to at least a part of the query.

The process of obtaining a direct solution of a sparse system of linear equations of the form $Ax = b$ consists of the following four phases: Ordering, which determines a permutation of the coefficient matrix $A$ such that the factorization incurs low fill-in and is numerically stable; Symbolic Factorization, which determines the structure of the triangular matrices that would result from factorizing the coefficient matrix resulting from the ordering step; Numerical Factorization, which is the actual factorization step that performs arithmetic operations on the coefficient matrix $A$ to produce a lower triangular matrix $L$ and an upper triangular matrix $U$; and Solution of Triangular Systems, which produces the solution vector $x$ by performing forward and backward eliminations on the triangular matrices resulting from numerical factorization. Numerical factorization is the most time-consuming of these four phases. However, in order to maintain the scalability of the entire solution process and to get around single-processor memory constraints, the other three phases need to be parallelized as well. We have developed parallel algorithms for the other phases that are tailored to work in conjunction with the numerical factorization algorithm. In [31], we describe an efficient parallel algorithm for determining fill-reducing orderings for the parallel factorization of sparse matrices. This algorithm, while performing the ordering in parallel, also distributes the data among the processors in a way that the remaining steps can be carried out with minimum data movement. At the end of the parallel ordering step, the parallel symbolic factorization algorithm described in [19] can proceed without any redistribution. In [19, 25], we present efficient parallel algorithms for solving the upper and lower triangular systems. The experimental results in [19, 25] show that the data mapping scheme described in Section 3 works well for triangular solutions. We hope that the work presented in this paper, along with [19, 25, 31], will enable the development of efficient practical parallel solvers for a broad range of scientific computing problems.

References

[1] Cleve Ashcraft. The domain/segment partition for the factorization of sparse symmetric positive definite matrices. Technical Report ECA-TR-148, Boeing Computer Services, Seattle, WA, 1990.

[2] Cleve Ashcraft. The fan-both family of column-based distributed Cholesky factorization algorithms. In Alan George, John R. Gilbert, and Joseph W.-H. Liu, editors, Graph Theory and Sparse Matrix Computations. Springer-Verlag, New York, NY, 1993.

[3] Cleve Ashcraft, Stanley C. Eisenstat, Joseph W.-H. Liu, and A. H. Sherman. A comparison of three column-based distributed sparse factorization schemes. Technical Report YALEU/DCS/RR-810, Yale University, New Haven, CT, 1990. Also appears in Proceedings of the Fifth SIAM Conference on Parallel Processing for Scientific Computing, 1991.

[4] J. M. Conroy. Parallel nested dissection. Parallel Computing, 16:139-156, 1990.

[5] J. M. Conroy, S. G. Kratzer, and R. F. Lucas. Multifrontal sparse solvers in message passing and data parallel environments - a comparative study. In Proceedings of PARCO, 1993.

[6] Iain S. Duff, Roger G. Grimes, and John G. Lewis. Users' guide for the Harwell-Boeing sparse matrix collection (release I). Technical Report TR/PA/92/86, Research and Technology Division, Boeing Computer Services, Seattle, WA, 1992.

[7] Iain S. Duff and John K. Reid. The multifrontal solution of indefinite sparse symmetric linear equations. ACM Transactions on Mathematical Software, 9(3):302-325, 1983.

[8] K. A. Gallivan, R. J. Plemmons, and A. H. Sameh. Parallel algorithms for dense linear algebra computations. SIAM Review, 32(1):54-135, March 1990. Also appears in K. A. Gallivan et al. Parallel Algorithms for Matrix Computations. SIAM, Philadelphia, PA, 1990.

[9] G. A. Geist and Esmond G.-Y. Ng. Task scheduling for parallel sparse Cholesky factorization. International Journal of Parallel Programming, 18(4):291-314, 1989.

[10] G. A. Geist and C. H. Romine. LU factorization algorithms on distributed-memory multiprocessor architectures. SIAM Journal on Scientific and Statistical Computing, 9(4):639-649, 1988. Also available as Technical Report ORNL/TM-10383, Oak Ridge National Laboratory, Oak Ridge, TN, 1987.

[11] Alan George. Nested dissection of a regular finite-element mesh. SIAM Journal on Numerical Analysis, 10:345-363, 1973.

[12] Alan George, Michael T. Heath, Joseph W.-H. Liu, and Esmond G.-Y. Ng. Sparse Cholesky factorization on a local memory multiprocessor. SIAM Journal on Scientific and Statistical Computing, 9:327-340, 1988.

[13] Alan George and Joseph W.-H. Liu. Computer Solution of Large Sparse Positive Definite Systems. Prentice-Hall, NJ, 1981.

[14] Alan George, Joseph W.-H. Liu, and Esmond G.-Y. Ng. Communication reduction in parallel sparse Cholesky factorization on a hypercube. In Michael T. Heath, editor, Hypercube Multiprocessors 1987, pages 576-586. SIAM, Philadelphia, PA, 1987.

[15] Alan George, Joseph W.-H. Liu, and Esmond G.-Y. Ng. Communication results for parallel sparse Cholesky factorization on a hypercube. Parallel Computing, 10(3):287-298, May 1989.

[16] John R. Gilbert and Robert Schreiber. Highly parallel sparse Cholesky factorization. SIAM Journal on Scientific and Statistical Computing, 13:1151-1172, 1992.

[17] Ananth Y. Grama, Anshul Gupta, and Vipin Kumar. Isoefficiency: Measuring the scalability of parallel algorithms and architectures. IEEE Parallel and Distributed Technology, 1(3):12-21, August 1993.

[18] Anoop Gupta and Edward Rothberg. An efficient block-oriented approach to parallel sparse Cholesky factorization. In Supercomputing '93 Proceedings, 1993.

[19] Anshul Gupta. Analysis and Design of Scalable Parallel Algorithms for Scientific Computing. PhD thesis, University of Minnesota, Minneapolis, MN, 1995.

[20] Anshul Gupta. Fast and effective algorithms for graph partitioning and sparse matrix ordering. IBM Journal of Research and Development, 41(1/2):171-183, January/March 1997.

[21] Anshul Gupta. Graph partitioning based sparse matrix ordering algorithms for interior-point methods. Technical Report RC 20467, IBM T. J. Watson Research Center, Yorktown Heights, NY, May 1996.

[22] Anshul Gupta. WGPP: Watson graph partitioning (and sparse matrix ordering) package: Users manual. Technical Report RC 20453, IBM T. J. Watson Research Center, Yorktown Heights, NY, May 1996.

[23] Anshul Gupta and Vipin Kumar. Performance properties of large scale parallel systems. Journal of Parallel and Distributed Computing, 19:234-244, 1993. Also available as Technical Report TR 92-32, Department of Computer Science, University of Minnesota, Minneapolis, MN.

[24] Anshul Gupta and Vipin Kumar. A scalable parallel algorithm for sparse matrix factorization. Technical Report 94-19, Department of Computer Science, University of Minnesota, Minneapolis, MN, 1994. A short version appears in Supercomputing '94 Proceedings.

[25] Anshul Gupta and Vipin Kumar. Parallel algorithms for forward and back substitution in direct solution of sparse linear systems. In Supercomputing '95 Proceedings, December 1995.

[26] Michael T. Heath, Esmond G.-Y. Ng, and Barry W. Peyton. Parallel algorithms for sparse linear systems. SIAM Review, 33:420-460, 1991. Also appears in K. A. Gallivan et al. Parallel Algorithms for Matrix Computations. SIAM, Philadelphia, PA, 1990.

[27] Michael T. Heath and Padma Raghavan. Distributed solution of sparse linear systems. Technical Report 93-1793, Department of Computer Science, University of Illinois, Urbana, IL, 1993.

[28] Laurie Hulbert and Earl Zmijewski. Limiting communication in parallel sparse Cholesky factorization. SIAM Journal on Scientific and Statistical Computing, 12(5):1184-1197, September 1991.

[29] George Karypis, Anshul Gupta, and Vipin Kumar. Parallel formulation of interior point algorithms. Technical Report 94-20, Department of Computer Science, University of Minnesota, Minneapolis, MN, April 1994. Shorter versions appear in Supercomputing '94 Proceedings and Proceedings of the 1995 DIMACS Workshop on Parallel Processing of Discrete Optimization Problems.

[30] George Karypis and Vipin Kumar. Analysis of multilevel graph partitioning. Technical Report TR 95-037, Department of Computer Science, University of Minnesota, 1995.

[31] George Karypis and Vipin Kumar. Parallel algorithms for multilevel graph partitioning and sparse matrix ordering. Journal of Parallel and Distributed Computing, 48:71-95, 1998.

[32] George Karypis and Vipin Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1), 1999.

[33] S. G. Kratzer and A. J. Cleary. Sparse matrix factorization on SIMD parallel computers. In Alan George, John R. Gilbert, and Joseph W.-H. Liu, editors, Graph Theory and Sparse Matrix Computations. Springer-Verlag, New York, NY, 1993.

[34] Vipin Kumar, Ananth Y. Grama, Anshul Gupta, and George Karypis. Introduction to Parallel Computing: Design and Analysis of Algorithms. Benjamin/Cummings, Redwood City, CA, 1994.

[35] Vipin Kumar, Ananth Y. Grama, Anshul Gupta, and George Karypis. Solutions Manual for Introduction to Parallel Computing. Benjamin/Cummings, Redwood City, CA, 1994.

[36] Vipin Kumar and Anshul Gupta. Analyzing scalability of parallel algorithms and architectures. Journal of Parallel and Distributed Computing, 22(3):379-391, 1994. Also available as Technical Report TR 91-18, Department of Computer Science, University of Minnesota, Minneapolis, MN.

[37] R. J. Lipton, D. J. Rose, and Robert E. Tarjan. Generalized nested dissection. SIAM Journal on Numerical Analysis, 16:346-358, 1979.

[38] R. J. Lipton and Robert E. Tarjan. A separator theorem for planar graphs. SIAM Journal on Applied Mathematics, 36:177-189, 1979.

[39] Joseph W.-H. Liu. The multifrontal method for sparse matrix solution: Theory and practice. SIAM Review, 34:82-109, 1992.

[40] Robert F. Lucas. Solving planar systems of equations on distributed-memory multiprocessors. PhD thesis, Department of Electrical Engineering, Stanford University, Palo Alto, CA, 1987.

[41] Robert F. Lucas, Tom Blank, and Jerome J. Tiemann. A parallel solution method for large sparse systems of equations. IEEE Transactions on Computer Aided Design, CAD-6(6):981-991, November 1987.

[42] F. Manne. Load Balancing in Parallel Sparse Matrix Computations. PhD thesis, University of Bergen, Norway, 1993.

[43] Mo Mu and John R. Rice. A grid-based subtree-subcube assignment strategy for solving partial differential equations on hypercubes. SIAM Journal on Scientific and Statistical Computing, 13(3):826-839, May 1992.

[44] Vijay K. Naik and M. Patrick. Data traffic reduction schemes for Cholesky factorization on asynchronous multiprocessor systems. In Supercomputing '89 Proceedings, 1989. Also available as Technical Report RC 14500, IBM T. J. Watson Research Center, Yorktown Heights, NY.

[45] S. F. Nugent. The iPSC/2 direct-connect communications technology. In Proceedings of the Third Conference on Hypercubes, Concurrent Computers, and Applications, pages 51-60, 1988.

[46] Dianne P. O'Leary and G. W. Stewart. Assignment and scheduling in parallel matrix factorization. Linear Algebra and its Applications, 77:275-299, 1986.

[47] V. Pan and J. H. Reif. Efficient parallel solution of linear systems. In 17th Annual ACM Symposium on Theory of Computing, pages 143-152, 1985.

[48] Alex Pothen, Horst D. Simon, and K.-P. Liou. Partitioning sparse matrices with eigenvectors of graphs. SIAM Journal on Matrix Analysis and Applications, 11(3):430-452, 1990.

[49] Alex Pothen and Chunguang Sun. Distributed multifrontal factorization using clique trees. In Proceedings of the Fifth SIAM Conference on Parallel Processing for Scientific Computing, pages 34-40, 1991.

[50] Roland Pozo and Sharon L. Smith. Performance evaluation of the parallel multifrontal method in a distributed-memory environment. In Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computing, pages 453-456, 1993.

[51] Padma Raghavan. Distributed sparse matrix factorization: QR and Cholesky factorizations. PhD thesis, Pennsylvania State University, University Park, PA, 1991.

[52] Edward Rothberg. Performance of panel and block approaches to sparse Cholesky factorization on the iPSC/860 and Paragon systems. In Proceedings of the 1994 Scalable High Performance Computing Conference, May 1994.

[53] Edward Rothberg and Anoop Gupta. An efficient block-oriented approach to parallel sparse Cholesky factorization. In Supercomputing '92 Proceedings, 1992.

[54] Edward Rothberg and Robert Schreiber. Improved load distribution in parallel sparse Cholesky factorization. In Supercomputing '94 Proceedings, 1994.

[55] Robert Schreiber. Scalability of sparse direct solvers. In Alan George, John R. Gilbert, and Joseph W.-H. Liu, editors, Sparse Matrix Computations: Graph Theory Issues and Algorithms. Springer-Verlag, New York, NY, 1993. Also available as RIACS TR 92.13, NASA Ames Research Center.

[56] B. Speelpening. The generalized element method. Technical Report UIUCDCS-R-78-946, Department of Computer Science, University of Illinois, Urbana, IL, November 1978.

[57] Chunguang Sun. Efficient parallel solutions of large sparse SPD systems on distributed-memory multiprocessors. Technical Report CTC92TR102, Advanced Computing Research Institute, Center for Theory and Simulation in Science and Engineering, Cornell University, Ithaca, NY, August 1992.

[58] Sesh Venugopal and Vijay K. Naik. Effects of partitioning and scheduling sparse matrix factorization on communication and load balance. In Supercomputing '91 Proceedings, pages 866-875, 1991.

[59] P. H. Worley and Robert Schreiber. Nested dissection on a mesh connected processor array. In Arthur Wouk, editor, New Computing Environments: Parallel, Vector, and Systolic, pages 8-38. SIAM, Philadelphia, PA, 1986.
