
Graph Partitioning for FEM Applications: Reducing the Communication Volume with DSHEM

José Luis González García
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)
Göttingen, Niedersachsen, Germany
[email protected]

Ramin Yahyapour
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)
Göttingen, Niedersachsen, Germany
[email protected]

Andrei Tchernykh
Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE)
Ensenada, Baja California, Mexico
[email protected]

Abstract—The finite element method is primarily utilized to numerically solve partial differential equations, most commonly by the use of iterative methods, over a compact domain. The domain of the PDEs is represented by a mesh of information which needs to be distributed among all available processors in a parallel computer. Distributing the mesh, known as the mesh partitioning problem, is NP-complete. Much effort focuses on graph partitioning and parallelization to address it. We introduce a new vertex matching model called Directed Sorted Heavy Edge Matching (DSHEM) intended to reduce the communication volume during FEM simulations and ensure efficient execution on distributed systems. Finally, we provide a performance analysis of DSHEM and comment on its benefits.

Index Terms—Graph partitioning, Mesh partitioning, Vertex matching, Load balancing, Finite element method, Communication Volume

I. BACKGROUND

The Finite Element Method (FEM), or Finite Element Analysis (FEA), is a technique used in different areas of scientific computation such as engineering and physics. It helps to analyze the behavior of real-life objects, or physical phenomena, under different environmental conditions [1], [2]. The problem is modeled with a finite number of discrete elements (a mesh of information), resulting in a system of algebraic equations [3]. The system is then solved to compute a numerical solution that approximates the conditions in the real world. The discretization may result in large and sparse matrices requiring huge amounts of processing power to be solved. This situation makes sequential implementations useless in practice and parallel systems come into play, but they bring new challenges in terms of efficiency [4]. Adaptive techniques allow the solution error to be kept under control while costs are minimized [5].

The mesh of information has to be distributed amongst all available processors in a parallel system [2], [6]. Then, multiple processors execute the same code on different mesh elements to compute the final solution. The computation generally employs an iterative approach to approximate the solution [1], [7]. With dynamic problems, the refinement step introduces an imbalance to the system and the mesh must be redistributed.

This work was partially supported by the Consejo Nacional de Ciencia y Tecnología (CONACYT) under grant number 309370.

Since it is not possible to know in advance which regions will be refined, it is a difficult task to keep the workload in balance. Distributing the mesh among the processors in a parallel computer, also known as the mesh-partitioning problem, is known to be NP-complete [8]–[10].

When parallel systems are employed, efficiency becomes an important concern. The load balancing problem needs to be addressed in order to improve the efficiency of a parallel system. One effective approach is through graph partitioning, where the load is modeled as a graph. The graph is then partitioned and mapped onto the processors; it defines the distribution of the load in the parallel system. This is a more difficult task with dynamic problems, as redistribution must be performed regularly when the imbalance reaches a certain threshold. With hundreds of thousands of processors becoming cheaper every day, the load balancing problem shifts its focus to communication costs. The transmission of data over network links is considerably slower than data processing, and a trade-off between communication and computation is required to increase the efficiency of parallel computations [11], [12].

Often, FEM libraries are designed for relatively small systems. When hundreds of thousands of processors are available, their design becomes an important limitation to scaling accordingly. This situation leads to an important mismatch between software and hardware, which translates into a decrease in efficiency. New technologies bring new challenges, and current solutions become obsolete. We address the load balancing problem through graph partitioning, with a particular emphasis on the reduction of the communication volume, to improve the efficiency of parallel FEM applications.

The paper is structured as follows: Section II provides a detailed background and defines the problem. It introduces the load balancing problem in FEM computations and describes how to address it. Drawbacks of current methodologies are also identified. The most important graph partitioning software is listed as well. Section III introduces the central idea behind the proposed algorithm DSHEM. It provides an analysis of the limitations of current solutions and introduces the concept of directional communication on undirected graphs. Section IV presents one of the main contributions of this paper.

It introduces the algorithm DSHEM with a description of its implementation within METIS and an example of how it works. Section V presents the evaluation of DSHEM, the second main contribution of this paper. It provides information on the metrics and input graphs used for the evaluation. DSHEM is compared to SHEM, the original matching algorithm in METIS. Finally, Section VI summarizes the findings and concludes the paper.

II. THE LOAD BALANCING PROBLEM IN PARALLEL FEM APPLICATIONS

Load balancing is important in parallel computations; it maximizes application performance by keeping processor idle time and interprocessor communication overhead as low as possible. To minimize the overall computation time, all processors should contain the same amount of computational work, and data dependencies between processors should be minimized. The most important causes of load imbalance in FEM parallel applications are the dynamic nature of the problem through time and the adaptive refinement or coarsening of the mesh during the simulation. The interference from other users in a shared system and the heterogeneity in either the hardware or the solver can also affect the load balance and performance.

Numerous static and dynamic methods have been developed for load balancing. The dynamic problem has not been as extensively studied as the static one. Devine et al. [13] provide ideas to address the dynamic problem. Willebeek-LeMair and Reeves [14] provide a comparison study of dynamic load balancing strategies. Chen and Taylor [15] achieved improvements of up to 36% when heterogeneity is considered in distributed systems. Some other attempts to address this issue are presented in [13], [16]–[19]. Olas et al. [20] have introduced a dynamic load balancer to the existing library NuscaS [21] which includes a performance model of their own. The model accurately estimates the cost, measured in time, of every load balancing and computational step with or without a balanced workload.

New techniques are needed to reduce the time spent on FEM simulations [22], [23]. Furthermore, speed is commonly the main objective in dynamic load balancing, while the quality of the partition (its balance) comes in second place. A less balanced distribution of work does not necessarily mean an increase in computing time; it may allow other metrics, such as communication overhead, to improve.

A. Load Balancing Through Graph Partitioning

Due to the fact that the mesh of information can be characterized by a graph, much effort focuses on graph partitioning algorithms to address the load balancing problem of parallel FEM simulations. Graph partitioning algorithms generate arrays of information containing the location for every graph vertex, i.e., which mesh element is assigned to which processor. The mesh is converted into a weighted graph: the vertex weights correspond to calculation costs and the edge weights correspond to potential communication costs. Different graph representations can be used; we refer the reader to [24] for details. The goal of graph partitioning, then, is to assign equal total vertex weight to the partitions while minimizing the weight of the cut edges. An increasing variety of general-purpose techniques and libraries has been, and is being, developed, providing great effectiveness; we refer the reader to the work by Buluç et al. [25] and Fjällström [26] for more information.

A limitation of the use of graphs is the type of systems they can represent [27]. Because edges in the graph model are non-directional, they imply symmetry in all relationships, making them appropriate only for problems represented by square, symmetric matrices. To address this drawback, hypergraphs are used to model the PDE problems [28]. The effectiveness of hypergraph partitioning has been demonstrated in many areas, including VLSI layout, sparse matrix decompositions, and database storage and data mining. However, it has also been shown that hypergraph partitioning is considerably slower than traditional graph partitioning [29]. This is reflected in the widespread use of graph partitioning algorithms and libraries to restore the load balance in parallel FEM computations.

Briefly, the graph partitioning problem involves the creation of subdomains, or smaller groups, from a collection of vertices in a graph, according to some objectives such as the minimization of a cost function. The problem becomes more complex when the number of objectives increases or when they oppose each other. For the purposes of this paper, we use the definition of graph partitioning presented by Walshaw and Cross [30]; we refer the reader to their work for more details.

To date, there is a wide variety of libraries designed to improve the efficiency of parallel systems during FEM simulations. A number of them are free of charge for academic use, such as METIS [31], [32], Chaco [33], [34], and SCOTCH [35]–[37]. We refer the reader to [38], which provides more details on each of them. Over the years, a number of studies have compared this software [30], [39]–[41]. However, it is difficult to reach a clear conclusion due to the different execution parameters and input graphs available.

III. ANALYSIS AND CONCEPT

Undirected graphs are widely used to represent FE meshes. Efficient data structures have been developed to store such graphs, as have many partitioning libraries. We propose a new vertex matching model for the multilevel graph partitioning technique which aims to reduce the communication volume by simulating a directed graph.

A. Analysis

Unfortunately, the undirected graphs used to characterize FE meshes do not represent the communication costs properly, leading to inefficient partitions. Minimizing the edge cut does not guarantee a reduction in the communication volume, as explained in the next example. Figure 1 shows two different partitions for the same small graph on the left, both of them with three edges in the cut. The communication volume differs due to the fact that each boundary vertex, represented by dotted lines, has to send and receive data.

It can be seen in Figure 1 that vertices 1, 2 and 3 in (a) send data from Processor 1 to Processor 2; the same applies for vertices 4, 5 and 6. A different scenario is shown in (b), where the communication volume is reduced to only 4 with the same edge cut value. Vertex 2 sends data from Processor 1 to Processor 2 only once, regardless of its two edges in the cut. The previous example leads us to think that the reduction of boundary vertices also reduces the communication volume, and the larger graph in Figure 1 supports the hypothesis, where (c) and (d) have the same communication volume. However, reducing the number of boundary vertices does not always reduce the communication volume, as depicted next. Figure 2 shows some examples with different metrics. In (b) the number of boundary vertices is 7 and the communication volume is 10, as in (c), but (c) has 8 boundary vertices. Partitions (c) and (d) have the best edge cut with only 6 edges; however, (d) has the greatest communication volume of all.

The correct vertices and edges must be in the boundary in order to reduce the communication volume. The edge cut is not an accurate metric to reduce the communication volume.
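To make the example concrete, the short C program below computes the edge cut and the communication volume of two bipartitions of a small six-vertex graph. The graph is assumed to be a 2×3 grid, which is consistent with the vertex numbering and the counts described above but is only an illustration, not necessarily the exact graph of Figure 1; vertex sizes and edge weights are taken as 1.

```c
#include <stdio.h>

/* Illustrative 2x3 grid with 6 vertices and 7 undirected edges
 * (an assumption; the exact graph of Fig. 1 is not reproduced here). */
#define NV 6
#define NE 7
static const int edges[NE][2] = {
    {1, 2}, {2, 3}, {4, 5}, {5, 6},  /* horizontal edges */
    {1, 4}, {2, 5}, {3, 6}           /* vertical edges   */
};

/* Edge cut: number of edges whose endpoints lie in different parts
 * (unit edge weights). */
static int edge_cut(const int part[NV + 1])
{
    int cut = 0;
    for (int e = 0; e < NE; e++)
        if (part[edges[e][0]] != part[edges[e][1]])
            cut++;
    return cut;
}

/* Communication volume: every boundary vertex contributes once per
 * neighboring part, independently of how many cut edges it has
 * (unit vertex sizes, two parts). */
static int comm_volume(const int part[NV + 1])
{
    int vol = 0;
    for (int v = 1; v <= NV; v++) {
        int talks_to_other_part = 0;
        for (int e = 0; e < NE; e++) {
            int u = -1;
            if (edges[e][0] == v) u = edges[e][1];
            if (edges[e][1] == v) u = edges[e][0];
            if (u != -1 && part[u] != part[v])
                talks_to_other_part = 1;
        }
        vol += talks_to_other_part;  /* |v_i| = 1, one neighboring part */
    }
    return vol;
}

int main(void)
{
    /* index 0 unused; vertices are numbered 1..6 as in the text */
    const int part_a[NV + 1] = {0, 0, 0, 0, 1, 1, 1}; /* {1,2,3} vs {4,5,6} */
    const int part_b[NV + 1] = {0, 0, 0, 1, 0, 1, 1}; /* {1,2,4} vs {3,5,6} */

    printf("partition (a): edge cut = %d, comm. volume = %d\n",
           edge_cut(part_a), comm_volume(part_a));
    printf("partition (b): edge cut = %d, comm. volume = %d\n",
           edge_cut(part_b), comm_volume(part_b));
    return 0;
}
```

Under these assumptions the program reports an edge cut of 3 for both partitions, while the communication volume drops from 6 in the first partition to 4 in the second, which is exactly the effect discussed above.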

Fig. 1. Reduction of communication volume with an edge cut of the same size on the left. Communication volume with the same number of boundary vertices on the right.

B. Concept

METIS uses the compressed storage format (CSR) to store the adjacency structure of the graph. CSR is a widely used scheme for storing sparse graphs due to its reduced space requirements. However, METIS uses this format for undirected graphs only, i.e., for symmetric sparse matrices. We refer the reader to [32] for more details on CSR. Using the data structures in METIS, which are designed for undirected graphs, we can emulate directed versions of them. This is possible because each edge and its weight are stored twice; i.e., (u, v) is stored independently of (v, u). We can take advantage of this to mimic the direction and source of the communication between every pair of vertices connected by an edge during the coarsening process. By using independent values for edges (u, v) and (v, u), we can designate the amount and direction of the communication. This is most evident in the coarsest graph, and it affects the initial partition. Later, when the partition is projected back to the original graph, the benefits of this bidirectional representation become evident.

Fig. 2. Three different graph partitions with different characteristics.
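As a minimal sketch of the idea, the fragment below uses the usual CSR arrays (named xadj, adjncy, and adjwgt, following the description in [32]) and gives the two stored copies of one undirected edge different weights, thereby encoding an amount and a direction of communication inside an undirected data structure. It is only an illustration of the storage trick, not code taken from METIS.

```c
#include <stdio.h>

/* CSR adjacency structure of the undirected triangle {0,1,2}.
 * Each undirected edge appears twice, once in the list of each endpoint,
 * so (u,v) and (v,u) occupy distinct slots in adjncy/adjwgt. */
static const int xadj[]   = {0, 2, 4, 6};          /* n = 3 vertices       */
static const int adjncy[] = {1, 2,  0, 2,  0, 1};  /* neighbors of 0, 1, 2 */
static int       adjwgt[] = {1, 1,  1, 1,  1, 1};  /* one weight per slot  */

/* Locate the slot that stores edge (u,v) from u's side. */
static int slot(int u, int v)
{
    for (int j = xadj[u]; j < xadj[u + 1]; j++)
        if (adjncy[j] == v)
            return j;
    return -1;
}

int main(void)
{
    /* Give the two copies of edge (0,1) different weights: this encodes
     * "0 sends 3 units to 1" and "1 sends 1 unit to 0", i.e. a directed
     * view stored in an undirected data structure. */
    adjwgt[slot(0, 1)] = 3;
    adjwgt[slot(1, 0)] = 1;

    printf("weight of (0,1) = %d, weight of (1,0) = %d\n",
           adjwgt[slot(0, 1)], adjwgt[slot(1, 0)]);
    return 0;
}
```

Because both slots already exist in the undirected representation, no additional memory is required, as noted above.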

We follow the original process to partition a graph in METIS, with some key differences. When two vertices are matched, the communication dependencies are considered in the decision. This new condition is implemented in combination with the coarsening process, where the weights of the edges and the direction of communication are calculated for the coarser graph. The values for the new edges, in the coarser graph, are calculated based on the communication dependencies. In the next coarsening level, these values are used to create the matching and the new coarser graph. This process is described in detail in the next section.

IV. DIRECTED SORTED HEAVY EDGE MATCHING

We propose a new vertex matching algorithm called Directed Sorted Heavy Edge Matching (DSHEM); it introduces the concept of bidirectional communication to the matching phase. The aim is to represent more accurately the communication between the parts while creating the final partition of the graph. The utility function uses this new data to create a more efficient matching which leads, in the end, to a better partition. DSHEM includes a mechanism to further improve the results by changing a few parameters during the execution of METIS. It is based on SHEM, an improvement of HEM [42], [43].

A. Description

The data structures in METIS are designed for undirected graphs, but they can be used to emulate directed versions of them. Each edge and its weight are stored twice; i.e., (u, v) is stored independently of (v, u). DSHEM takes advantage of this situation to mimic the direction and source of the communication between every pair of vertices connected by an edge. As in SHEM, all vertices are visited in a sorted manner in DSHEM. Vertex u is matched with an unmatched vertex v such that the weight of the edge (u, v) is maximum over all valid incident edges and the communication volume is reduced. This new condition is implemented in combination with the coarsening process, where the weights of the edges and the direction of communication are calculated for the coarser graph.
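The skeleton below sketches this matching loop under our own naming (heavy_edge_matching, dshem_prefers): it visits the vertices in a given order, picks the unmatched neighbor with the heaviest incident edge, and leaves a placeholder hook where DSHEM additionally checks the communication dependencies. It is a simplified reconstruction of the procedure described here, not the actual METIS routine.

```c
#define UNMATCHED (-1)

/* Placeholder for the DSHEM criterion: returns nonzero when matching
 * u with v does not worsen the communication dependencies.  The real
 * decision (see Section IV-B) also involves the -dshem percentages. */
static int dshem_prefers(int u, int v) { (void)u; (void)v; return 1; }

/* Sorted heavy-edge matching over a CSR graph (xadj, adjncy, adjwgt).
 * perm holds the vertices in visiting order (e.g. sorted by degree);
 * match[v] receives the mate of v, or v itself if it stays unmatched.
 * Returns the number of coarse vertices. */
int heavy_edge_matching(int nvtxs, const int *xadj, const int *adjncy,
                        const int *adjwgt, const int *perm, int *match)
{
    int ncoarse = 0;
    for (int v = 0; v < nvtxs; v++)
        match[v] = UNMATCHED;

    for (int i = 0; i < nvtxs; i++) {
        int u = perm[i];
        if (match[u] != UNMATCHED)
            continue;
        int best = u, bestw = -1;
        for (int j = xadj[u]; j < xadj[u + 1]; j++) {
            int v = adjncy[j];
            /* heaviest valid edge, filtered by the DSHEM criterion */
            if (match[v] == UNMATCHED && adjwgt[j] > bestw &&
                dshem_prefers(u, v)) {
                best = v;
                bestw = adjwgt[j];
            }
        }
        match[u] = best;     /* best == u means u stays single      */
        match[best] = u;     /* collapse u and best into one vertex */
        ncoarse++;
    }
    return ncoarse;
}
```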

An example of the matching and coarsening process of DSHEM is depicted in Figure 3. The original graph in (a) can be represented as in (c), based on the real bidirectional communication shown in (b). The values in parentheses represent the weights of the edges. Vertex 1 sends a communication volume of 1 to vertex 2, and vertex 2 sends a volume of 1 to vertex 1. METIS stores the weight of each edge twice; therefore, we do not incur extra memory usage or extra computation for this new representation. Having the simulated directed graph, the matching and coarsening process can be performed. The first matching produced by DSHEM, in (d), would be similar to that produced by SHEM. The matched vertices are enclosed by dotted lines. At this point, the simulated graph does not have full direction or source information of communications. Vertices 1 and 2 are collapsed to form a new coarse vertex, called super vertex A. All edges incident to both vertices are preserved to form the coarser graph in (e); only edge (1, 2) is removed. A similar process is followed to collapse vertices 5 and 6, as shown in (f). Now, the coarsening process updates the weight of all edges in the coarser graph by adding information concerning the direction and source.

To clarify how this new information is calculated, the arrows originating from each super vertex are grouped into two categories according to their real origin. In (e), observing the arrows originating from super vertex A, one can see that some are solid black and others are white. Solid black arrows are those belonging to the original vertex 1, while white arrows belong to vertex 2. This grouping is translated into (f); note that some edges have negative values. The minus sign (−) is used to identify the real source of the edge, not to indicate a negative value. DSHEM collapses two vertices, and only two, to form a new super vertex in the coarser graph. This makes it easy to indicate the source without using extra memory during the coarsening process. The edge (1, 4) incident to vertex 1 in (d) is preserved as edge (A, 4) in (e); its communication volume remains the same. To differentiate the edges incident to vertex 2 in (d) which are preserved in (e), the minus sign is added to their communication volume. The resulting values are then stored in the coarser graph (f) as edges (A, 3) and (A, B). To collapse vertices 5 and 6 in (e), the same process is followed. The new coarser graph (g) includes all the necessary information to reduce the communication volume in the next step of the partitioning process. It can be seen that super vertex A in (g) has three edges: one from the original vertex 1 and two from the original vertex 2 in (d), as indicated by the minus signs. In fact, the real source cannot be established, only the fact that one edge comes from one vertex and the other two edges from the other vertex in the finer graph. Vertex B in (f) has similar characteristics.

Fig. 3. Coarsening process with DSHEM.

The final matching depicted in (h) is important as it describes the idea behind DSHEM. With the complete information of direction and source, we can improve the partition by reducing the volume of information to be transferred. If SHEM were used, then all edges would have a weight of 1; their current weight is (1, 1) with source information (the minus signs). If we visit vertex A to find a match, in (h), all three neighbors are inspected (vertices 3, 4 and B). If vertex A is matched with vertex 3 or B, the resulting edges incident to the collapsed vertex will have three or four different sources (two sources from vertex A, and one from vertex 3 or two from vertex B), but if vertex A is matched to vertex 4 the resulting edges will have only two sources (one source from vertex A and one from vertex 4). DSHEM will match vertex A to vertex 4, reducing the number of sources in the resulting coarser vertex; SHEM could match vertex A with any vertex because it only considers the weight of the edge that will be removed.

If we take vertex 4 in (h) as reference, DSHEM will match it with vertex A, while SHEM could choose any of the neighbors, resulting in a degradation of the communication volume. The next coarsening step produces the coarsest graph shown in (i) with a communication volume of 4. Vertices A and 4 in (h) are collapsed to form super vertex C in (i). The edges incident to vertex A are also collapsed and their weights added. Due to the fact that both edges incident to vertex A have the same source, we count them as one edge going out from it. The resulting weight for the edge (C, D) is 2 (1 from the edge incident to vertex A and 1 from the edge incident to vertex 4). The same process is done for the contraction of vertices 3 and B.
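The fragment below sketches the bookkeeping of one such contraction step: the edge between the matched pair is dropped, edges inherited from the first vertex keep their weight, and edges inherited from the second vertex are stored with a negated weight as the source mark. The data layout, the names, and the reconstructed neighborhoods of vertices 1 and 2 are illustrative assumptions, not the actual DSHEM implementation.

```c
#include <stdio.h>

/* One adjacency entry of the coarser graph: neighbor id and a signed
 * weight, where a negative sign only marks that the edge was inherited
 * from the second vertex of the matched pair (it is not a negative cost). */
typedef struct { int nbr; int wgt; } edge_t;

/* Collapse the matched pair (u, v) into one super vertex.
 * u_adj/v_adj are the fine-level adjacency lists of u and v; the edge
 * between u and v itself is removed, edges of u keep their weight, and
 * edges of v are stored with a minus sign as the source mark. */
static int collapse_pair(int u, int v,
                         const edge_t *u_adj, int u_deg,
                         const edge_t *v_adj, int v_deg,
                         edge_t *out)
{
    int n = 0;
    for (int j = 0; j < u_deg; j++)
        if (u_adj[j].nbr != v)
            out[n++] = u_adj[j];                      /* source: u */
    for (int j = 0; j < v_deg; j++)
        if (v_adj[j].nbr != u) {
            out[n].nbr = v_adj[j].nbr;
            out[n].wgt = -v_adj[j].wgt;               /* source: v */
            n++;
        }
    return n;   /* degree of the new super vertex */
}

int main(void)
{
    /* Assumed fine-level neighborhoods of the pair (1, 2) from Fig. 3(d):
     * vertex 1 connected to 2 and 4, vertex 2 connected to 1, 3 and 5;
     * all edge weights are 1 (an illustrative reconstruction). */
    const edge_t adj1[] = { {2, 1}, {4, 1} };
    const edge_t adj2[] = { {1, 1}, {3, 1}, {5, 1} };
    edge_t super_a[8];

    int deg = collapse_pair(1, 2, adj1, 2, adj2, 3, super_a);
    for (int j = 0; j < deg; j++)
        printf("edge (A, %d) with marked weight %d\n",
               super_a[j].nbr, super_a[j].wgt);
    return 0;
}
```

Running it prints edge (A, 4) with weight 1 and edges (A, 3) and (A, 5) with weight −1, the latter becoming (A, B) once vertices 5 and 6 are collapsed, mirroring the marking described for Figure 3(f).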

B. Implementation

Although DSHEM belongs to the matching phase of the partitioning process, its implementation involves every step of the process due to its nature. By using independent values for edges (u, v) and (v, u), DSHEM changes the paradigm within METIS and complicates its implementation. The code is adapted to handle new values for edges (u, v) and (v, u), and new parameters are added to control the execution of METIS. The multiplier -maxvwtm limits the size of vertices during the coarsening process. Percentages -dshem p1, -dshem p2, and -dshem p3 are used to modify the behavior of the cost function in DSHEM.

The information required by DSHEM is the structure of the graph, the maximum weight allowed for a vertex during the coarsening process, and three percentage values; this is similar to the original SHEM, with the exception of the three new percentage values. These percentages are used to fine-tune how much the weight of the edge, combined with the number of sources, affects the decision to match vertex j to vertex i. The algorithm returns the array with the matching information and the number of coarse vertices, again as in the original SHEM.

DSHEM can make a more informed decision on whether vertices i and j should be matched together. It considers the three possible scenarios: the number of sources is reduced, preserved, or increased. As an example, the first scenario is addressed by the cost function (nsources < minsources) ∧ ((maxwgt × dshem p1) < bidiradjwgt). The weight of the edge, in combination with the percentages, affects the decision of whether increasing the number of sources is preferred over reducing them. It could be more beneficial to remove a heavily weighted edge, at the cost of increasing the sources, rather than removing a very light edge. Removing heavy edges in the coarsening process also reduces the overall communication in the coarsest graph.
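To illustrate how such a three-way decision can be expressed, the snippet below mirrors the quoted condition for the first scenario and adds assumed analogues for the other two; the variable names follow the text (nsources, minsources, maxwgt, bidiradjwgt and the three percentages, here passed as fractions), but the overall structure is our sketch, not the exact DSHEM code.

```c
/* Decision sketch for matching vertex j to vertex i in DSHEM.
 * nsources     - number of distinct communication sources after the match
 * minsources   - smallest number of sources seen so far for vertex i
 * bidiradjwgt  - weight of the (bidirectional) edge under consideration
 * maxwgt       - heaviest valid edge incident to vertex i
 * dshem_p1..p3 - the three percentages, given here as fractions (e.g. 0.95)
 * The first branch mirrors the condition quoted in the text; the other
 * two branches are assumed analogues for the remaining scenarios. */
int dshem_accept_match(int nsources, int minsources,
                       double bidiradjwgt, double maxwgt,
                       double dshem_p1, double dshem_p2, double dshem_p3)
{
    if (nsources < minsources)                       /* sources reduced   */
        return (maxwgt * dshem_p1) < bidiradjwgt;
    if (nsources == minsources)                      /* sources preserved */
        return (maxwgt * dshem_p2) < bidiradjwgt;
    /* sources increased: only worthwhile to remove a very heavy edge */
    return (maxwgt * dshem_p3) < bidiradjwgt;
}
```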

V. EXPERIMENTAL EVALUATION OF DSHEM

This section presents in detail the metrics and the real-life and synthetic graphs used to evaluate DSHEM. METIS has been executed with real-life graphs, as well as synthetic regular and irregular graphs. The refinement process is also studied to assess its impact on the pure partitions generated by DSHEM and SHEM. DSHEM is by nature slower than SHEM when matching vertices; its impact on the overall partitioning time is also evaluated. Due to space constraints, we only present the most relevant results.

A. Metrics

The total edge cut is given by ‖C‖ = Σ_{e∈C} ‖e‖, the sum of the weights of the edges in C. If the edges have unitary weights (i.e., |C| = ‖C‖), then the number of edges in the cut is minimized. As stated before, and by studies such as [44], [45], the edge cut is not the best choice for certain problems; it does not model the real communication costs in FEM applications [27]. Studies such as [45]–[48] focus on the total communication volume.

Let B_{v_i} = {jj | (v_i, v_j) ∈ C ∧ v_j ∈ S_{jj}} be the boundary subdomains of vertex v_i. The amount of data to be transferred by vertex v_i to a neighboring subdomain is given by |v_i|, its size. Then, the communication volume coming from vertex v_i, going into all its neighboring subdomains, is given by CommVol_{v_i} = |v_i| × |B_{v_i}|. Hence, the total communication volume induced by the partition π is defined as CommVol_π = Σ_i (|v_i| × |B_{v_i}|), the sum of the communication from all vertices. Note that |B_{v_i}| = 0 for non-boundary vertices.

The reduction of the total communication volume is not necessarily the best approach in FEM applications; the different subdomains may have an unbalanced volume of communication. We use the maximum communication volume over all subdomains to quantify that imbalance. Let B_{S_{ii}} = {v_i | (v_i, v_j) ∈ C ∧ v_i ∈ S_{ii}} be the boundary vertices in S_{ii}. Then the communication volume produced by subdomain S_{ii} is given by CommVol_{S_{ii}} = Σ_{v ∈ B_{S_{ii}}} CommVol_v, the sum of the communication of all its boundary vertices, and the maximum communication volume over all subdomains is then given by max_{1≤ii≤k} CommVol_{S_{ii}}.
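A direct transcription of these definitions into code could look like the following sketch, which computes CommVol_π and the maximum per-subdomain communication volume from a CSR graph, a vertex-size array, and a partition vector; the parameter names and layout are assumptions chosen for the illustration.

```c
#include <stdlib.h>

/* Total and maximum per-subdomain communication volume for a k-way
 * partition of a CSR graph.  part[v] is the subdomain of vertex v and
 * vsize[v] its data size |v|.  Returns the total volume and stores the
 * maximum over the subdomains in *max_subdomain. */
long comm_volumes(int nvtxs, int nparts,
                  const int *xadj, const int *adjncy,
                  const int *vsize, const int *part,
                  long *max_subdomain)
{
    long total = 0;
    long *per_part = calloc((size_t)nparts, sizeof *per_part);
    char *seen = malloc((size_t)nparts);          /* neighbor parts of v  */

    for (int v = 0; v < nvtxs; v++) {
        for (int p = 0; p < nparts; p++)
            seen[p] = 0;
        int nbparts = 0;                          /* |B_v|                */
        for (int j = xadj[v]; j < xadj[v + 1]; j++) {
            int p = part[adjncy[j]];
            if (p != part[v] && !seen[p]) {       /* new boundary part    */
                seen[p] = 1;
                nbparts++;
            }
        }
        long vol = (long)vsize[v] * nbparts;      /* |v| * |B_v|          */
        total += vol;                             /* CommVol_pi           */
        per_part[part[v]] += vol;                 /* CommVol_S(part[v])   */
    }

    *max_subdomain = 0;
    for (int p = 0; p < nparts; p++)
        if (per_part[p] > *max_subdomain)
            *max_subdomain = per_part[p];

    free(per_part);
    free(seen);
    return total;
}
```

With unit vertex sizes, the totals reduce to counting, for every vertex, the number of distinct neighboring subdomains, as in the earlier examples.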

B. Graphs

Several methods designed to generate synthetic graphs have been proposed in the literature. They produce graphs with different properties, depending on their intended use. The experimental evaluation of DSHEM does not include any synthetic random graph generated by those methods; they are not suited for the evaluation of DSHEM. It is essential to have full control over the graph geometry to investigate how it influences the quality of the partition and the performance of the algorithm.

Synthetic Graphs: The synthetic graphs are used to provide a more controlled output and to understand the behavior of DSHEM. They are valuable for categorizing the performance of DSHEM with the different types (geometries) of graphs. They also help in the analysis of the impact of the three percentages, as the output is more predictable. Three different types of graphs have been selected to evaluate DSHEM: one quadrangular and two triangular geometries. 3D versions of all three regular graphs are also created to evaluate DSHEM. Their sizes vary from a few thousand vertices to a million. Figure 4 depicts the 2D synthetic graphs of size 3×3: the square (sm), triangular square (tsm) and dense triangular square (dtsm) graphs. Irregularity is also introduced to the graphs by removing a percentage of the edges randomly; a collection of irregular graphs with different degrees of irregularity is used for the evaluation.
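A hedged sketch of one way to produce such inputs is shown below: it emits the edge list of an n×n quadrangular (sm-style) grid and randomly drops a given percentage of the edges to introduce irregularity; the triangular variants would additionally add diagonal edges. This generator is our reconstruction of the description, not the one used by the authors.

```c
#include <stdio.h>
#include <stdlib.h>

/* Emit the edge list of an n x n quadrangular (sm-style) grid graph and
 * randomly drop `drop_percent` percent of the edges to introduce the
 * kind of irregularity described in the text.  Vertices are numbered
 * row by row, starting at 0. */
static void grid_graph(int n, int drop_percent, unsigned seed)
{
    srand(seed);
    for (int r = 0; r < n; r++) {
        for (int c = 0; c < n; c++) {
            int v = r * n + c;
            /* horizontal and vertical grid edges (each printed once) */
            if (c + 1 < n && rand() % 100 >= drop_percent)
                printf("%d %d\n", v, v + 1);
            if (r + 1 < n && rand() % 100 >= drop_percent)
                printf("%d %d\n", v, v + n);
        }
    }
}

int main(void)
{
    grid_graph(3, 0, 42u);    /* a regular 3x3 square graph as in Fig. 4 */
    return 0;
}
```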


Real Life Graphs: The SuiteSparse matrix collection has a wide variety of sparse matrices that arise from real-life applications; it is frequently used by the community to evaluate the performance of new algorithms. We use three graphs from this collection; however, they originally come from other sources such as the Internet parallel computing archive [49]. The graph ef_4elt is an airfoil with front slat and rear flaps; it is a small 2D triangular graph. It contains around 15 thousand vertices and 45 thousand edges. The graph ef_ocean represents the oceans; it is a medium-size 3D quadrangular graph. It contains over 140 thousand vertices and 400 thousand edges. The graph ef_sphere is a sphere; it is a small 2D triangular graph. It has around 16 thousand vertices and 49 thousand edges.

Fig. 4. Regular graphs of size 3 × 3: square, triangular square and dense triangular square.

C. Evaluation

DSHEM is evaluated with four sets of experiments. The first set of experiments is designed to limit the amount of processing power, storage and time required for the execution. Nevertheless, its purpose is also to give an overview of the performance of DSHEM with an extended range of values for the main parameters. With these extended experiments, it is possible to design a more accurate analysis for bigger instances of the input graphs and confirm, or refute, the initial results. The second set of small graphs is used to focus the study of DSHEM on more specific ranges of values for the parameters. These new values are the product of the analysis of the experimental results from the first set. The set of medium-size synthetic graphs is designed to measure the scalability of DSHEM with much bigger input graphs. The number of subdomains used during experimentation is reduced due to time constraints. Finally, DSHEM is evaluated with a set of real-world graphs, from small to medium sizes and different geometries, to confirm the evaluation with synthetic graphs. The four sets of graphs provide accurate information on the performance of DSHEM. For the purpose of the analysis, the values for the multiplier and percentages follow the format (-maxvwtm, -dshem p1, -dshem p2, -dshem p3). In addition, the results in all figures show the improvement or degradation of the quality of the partition when DSHEM is compared to SHEM.

Reducing the value of the multiplier -maxvwtm produces more balanced initial partitions, and the refinement process is also optimized. However, a balanced partition does not necessarily mean a smaller edge cut or a reduction in communication volume; it only means that the subdomains are more equal in size. Figure 5 shows that the multiplier does not greatly affect the performance of DSHEM; nonetheless, the results also show that the new matching strategy performs better with graphs having a quadrangular geometry. With (-dshem p1, -dshem p2, -dshem p3) set to a fixed value, and modifying the value of the multiplier -maxvwtm, it is possible to see that DSHEM brings a constant benefit to the 2D square (sm2d100p) and the 3D triangular square (tsm3d100p) graphs.

The initial results show that the two percentages -dshem p1 and -dshem p3 do not modify the behavior of the algorithm. This is also confirmed by subsequent experiments, even though those results are not presented in this paper. A close analysis was also performed, with a clear conclusion: the conditionals that use these two percentages evaluate to true only the first time, rendering them superfluous.

Depending on the metric and partitioning objective, percentage -dshem p2 has different effects on the output of the partitioning process. Figure 6 shows an inflection point at 100 for the percentage -dshem p2 for most of the graph types. A value smaller than 100 produces better results for the 3D square (sm3d100p) and the 3D triangular square (tsm3d100p) graphs; the opposite holds for the 2D square (sm2d100p) and the 3D dense triangular square (dtsm3d100p) graphs. With respect to the total communication volume, DSHEM in general provides better partitions than the original SHEM with the same types of graphs. It is also interesting to note that the percentage -dshem p2 does not affect the output gradually according to its value.

The results with real-life graphs confirm the previous findings. Figure 7 shows that the graph ef_ocean, with quadrangular geometry, benefits from DSHEM.

The study without the refinement process gives an insight into how DSHEM can outperform SHEM under certain circumstances. METIS was originally designed to minimize the edge cut; the matching and refinement strategies were built accordingly. Later, the new objective was introduced, but it optimizes the communication volume starting from an edge-cut-based partition. The evaluation with the real-life graphs confirms the initial findings. DSHEM produces better partitions with graphs having a quadrangular geometry.

Fig. 5. DSHEM vs. SHEM: effect of -maxvwtm on the maximum communication volume of all subdomains with synthetic regular graphs and communication volume as partitioning objective.

Fig. 6. DSHEM vs. SHEM: effect of -dshem p2 on the maximum communication volume of all subdomains with synthetic regular graphs and communication volume as partitioning objective.

Fig. 7. DSHEM vs. SHEM: effect of -maxvwtm (left) and -dshem p2 (right) on the maximum communication volume of all subdomains with real-life graphs and communication volume as partitioning objective.

D. Discussion

The initial results suggest that the multiplier -maxvwtm could improve the partitions generated by DSHEM. However, subsequent experiments show that there is no clear pattern; it greatly depends on the type of graph, the objective to optimize and the metric to consider. Despite the lack of correlation between the value of the multiplier and the quality of the final partition, it seems that DSHEM produces better results more frequently when the multiplier is smaller than 150.

The original design of DSHEM contemplates three different percentages that work together. The initial results show that the two percentages -dshem p1 and -dshem p3 do not modify the behavior of the algorithm. A close analysis was also performed, with a clear conclusion: the conditionals that use these two percentages evaluate to true only the first time, rendering them superfluous. When the refinement is not part of the partitioning process, the percentage -dshem p2 has a measurable and predictable impact on DSHEM. The value 100 is an inflection point for most of the synthetic graphs, some showing improvement when the percentage is 100 or higher and others when it is lower. When the refinement is performed, that pattern is modified and less evident; nonetheless, it is still possible to infer a change around the value 100.

Upon the study of the experimental results, there is no clear correlation between the irregularity of the graphs and the quality of the partitions DSHEM produces. The irregularity introduced to the synthetic graphs affects Random, SHEM and DSHEM in a similar way, without a clear pattern; it is more the instance of the problem that changes the results than the amount of irregularity. As the results show, DSHEM tends to perform better with certain types of geometries. If the degree of irregularity reaches the point where the geometry of the graph is lost, then DSHEM no longer guarantees a good partition. DSHEM can perform well within a reasonable degree of irregularity introduced to the graphs.

VI. CONCLUSIONS AND FUTURE WORK

We introduced a new vertex matching model called DSHEM aimed at reducing the communication volume during FEM simulations. It is based on SHEM, implemented in the graph partitioning library METIS. DSHEM improves partitions generated from undirected graphs with quadrangular geometries when the communication volume is considered. The extra complexity of the algorithm is negligible compared to the original algorithm SHEM, and no extra memory is needed to emulate the directed graph. DSHEM can improve the system efficiency of parallel FEM computations under certain circumstances.

We continue the development and implementation of the proposed matching model to improve its performance regarding running times and quality of the partition. The refinement process needs to be further studied and adapted to the new data generated by DSHEM.

REFERENCES

[1] S. Blazy, W. Borchers, and U. Dralle, "Parallelization methods for a characteristic's pressure correction scheme," in Flow Simulation with High-Performance Computers II, ser. Notes on numerical fluid mechanics, E. H. Hirschel, Ed. Braunschweig/Wiesbaden, Germany: Friedrich Vieweg & Sohn Verlagsgesellschaft mbH, 1996, p. 576.

[2] R. Diekmann, U. Dralle, F. Neugebauer, and T. Römke, "PadFEM: A portable parallel FEM-tool," in High-Performance Computing and Networking, ser. Lecture Notes in Computer Science, H. Liddell, A. Colbrook, B. Hertzberger, and P. Sloot, Eds. Berlin, Germany: Springer Berlin / Heidelberg, 1996, vol. 1067, no. 1996, pp. 580–585.

[3] O. C. Zienkiewicz and R. L. Taylor, The finite element method: The basis, 5th ed. Oxford, United Kingdom: Butterworth-Heinemann, 2000, vol. 1.

[4] T. Olas, K. Karczewski, A. Tomas, and R. Wyrzykowski, "FEM computations on clusters using different models of parallel programming," in Parallel Processing and Applied Mathematics, ser. Lecture Notes in Computer Science, R. Wyrzykowski, J. Dongarra, M. Paprzycki, and J. Waśniewski, Eds. Berlin, Germany: Springer-Verlag, 2006, vol. 2328, no. 2006, pp. 170–182.

[5] R. Verfürth, "A posteriori error estimation and adaptive mesh-refinement techniques," Journal of Computational and Applied Mathematics, vol. 50, no. 1-3, pp. 67–83, 1994.

[6] R. Diekmann, D. Meyer, and B. Monien, "Parallel decomposition of unstructured FEM-meshes," in Parallel Algorithms for Irregularly Structured Problems, ser. Lecture Notes in Computer Science, A. Ferreira and J. Rolim, Eds. Springer Berlin Heidelberg, 1995, vol. 980, no. 1995, pp. 199–215.

[7] Y. Saad, Iterative methods for sparse linear systems, 2nd ed. Philadelphia, PA, United States of America: Society for Industrial and Applied Mathematics, 2003.

[8] M. R. Garey, D. S. Johnson, and L. Stockmeyer, "Some simplified NP-complete problems," in Proceedings of the 6th Annual ACM Symposium on Theory of Computing (STOC'74), ser. STOC '74. Seattle, WA, USA: ACM Press, apr 1974, pp. 47–63.

[9] ——, "Some simplified NP-complete graph problems," Theoretical Computer Science, vol. 1, no. 3, pp. 237–267, 1976.

[10] M. R. Garey and D. S. Johnson, Computers and intractability: A guide to the theory of NP-completeness, ser. A Series of Books in the Mathematical Sciences, V. Klee, Ed. San Francisco, CA, United States of America: W. H. Freeman and Company, 1979.

[11] B. J. Lint and T. K. M. Agerwala, "Communication issues in the design and analysis of parallel algorithms," IEEE Transactions on Software Engineering, vol. SE-7, no. 2, pp. 174–188, 1981.

[12] S. Moore and D. Greenfield, "The next resource war: computation vs. communication," in Proceedings of the 2008 international workshop on System level interconnect prediction, ser. SLIP '08. Newcastle, United Kingdom: ACM Press, apr 2008, pp. 81–86.

[13] K. D. Devine, E. G. Boman, R. T. Heaphy, B. A. Hendrickson, J. D. Teresco, J. Faik, J. E. Flaherty, and L. G. Gervasio, "New challenges in dynamic load balancing," Applied Numerical Mathematics, vol. 52, no. 2-3, pp. 133–152, feb 2005.

[14] M. H. Willebeek-LeMair and A. P. Reeves, "Strategies for dynamic load balancing on highly parallel computers," IEEE Transactions on Parallel and Distributed Systems, vol. 4, no. 9, pp. 979–993, 1993.

[15] J. Chen and V. E. Taylor, "Mesh partitioning for efficient use of distributed systems," IEEE Transactions on Parallel and Distributed Systems, vol. 13, no. 1, pp. 67–79, 2002.

[16] S. Sinha and M. Parashar, "Adaptive system sensitive partitioning of AMR applications on heterogeneous clusters," Cluster Computing, vol. 5, no. 4, pp. 343–352, 2002.

[17] C. H. Walshaw and M. Cross, "Multilevel mesh partitioning for heterogeneous communication networks," Future Generation Computer Systems, vol. 17, no. 5, pp. 601–623, 2001.

[18] T. Minyard and Y. Kallinderis, "Parallel load balancing for dynamic execution environments," Computer Methods in Applied Mechanics and Engineering, vol. 189, no. 4, pp. 1295–1309, 2000.

[19] J. D. Teresco, M. W. Beall, J. E. Flaherty, and M. S. Shephard, "A hierarchical partition model for adaptive finite element computation," Computer Methods in Applied Mechanics and Engineering, vol. 184, no. 2-4, pp. 269–285, 2000.

[20] T. Olas, R. Lesniak, R. Wyrzykowski, and P. Gepner, "Parallel adaptive finite element package with dynamic load balancing for 3D thermo-mechanical problems," in Parallel Processing and Applied Mathematics, ser. Lecture Notes in Computer Science, R. Wyrzykowski, J. Dongarra, K. Karczewski, and J. Waśniewski, Eds. Springer Berlin Heidelberg, 2010, vol. 6067, no. 2010, pp. 299–311.

[21] R. Wyrzykowski, T. Olas, and N. Sczygiol, "Object-oriented approach to finite element modeling on clusters," in Applied Parallel Computing. New Paradigms for HPC in Industry and Academia, ser. Lecture Notes in Computer Science, T. Sørevik, F. Manne, A. H. Gebremedhin, and R. Moe, Eds. Berlin, Germany: Springer-Verlag, 2001, vol. 1947, no. 2001, pp. 250–257.

[22] J. L. González García, R. Yahyapour, and A. Tchernykh, "Load balancing for parallel computations with the finite element method," in 3rd International Supercomputing Conference in Mexico, Guanajuato, Guanajuato, Mexico, 2012, p. 9.

[23] ——, "Load balancing for parallel computations with the finite element method," Computación y Sistemas, vol. 3, no. 17, pp. 299–316, sep 2013.

[24] A. Basermann, J. Clinckemaillie, T. Coupez, J. Fingberg, H. Digonnet, R. Ducloux, J.-M. Gratien, U. Hartmann, G. Lonsdale, B. Maerten, D. Roose, and C. H. Walshaw, "Dynamic load-balancing of finite element applications with the DRAMA library," Applied Mathematical Modelling, vol. 25, no. 2, pp. 83–98, dec 2000.

[25] A. Buluç, H. Meyerhenke, I. Safro, P. Sanders, and C. Schulz, "Recent advances in graph partitioning," in Algorithm Engineering: Selected Results and Surveys, ser. Lecture Notes in Computer Science, L. Kliemann and P. Sanders, Eds. Springer-Verlag, nov 2016, vol. 9220, pp. 117–158.

[26] P.-O. Fjällström, "Algorithms for graph partitioning: A survey," Linköping Electronic Articles in Computer and Information Science, vol. 3, no. 1998, sep 1998.

[27] B. A. Hendrickson, "Graph partitioning and parallel solvers: Has the emperor no clothes?" in Proceedings of the 5th International Symposium on Solving Irregularly Structured Problems in Parallel. London, United Kingdom: Springer-Verlag, 1998, pp. 218–225.

[28] Ü. V. Çatalyürek and C. Aykanat, "Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication," IEEE Transactions on Parallel and Distributed Systems, vol. 10, no. 7, pp. 673–693, 1999.

[29] P. Miettinen, M. Honkala, and J. Roos, "Using METIS and hMETIS algorithms in circuit partitioning," Helsinki University of Technology, Espoo, Finland, Tech. Rep., 2006.

[30] C. H. Walshaw and M. Cross, "Mesh partitioning: A multilevel balancing and refinement algorithm," SIAM Journal on Scientific Computing, vol. 22, no. 1, pp. 63–80, jun 2000.

[31] G. Karypis and V. Kumar, "METIS - Serial graph partitioning and fill-reducing matrix ordering," 2012. [Online]. Available: http://glaros.dtc.umn.edu/gkhome/metis/metis/overview

[32] G. Karypis, "METIS: A software package for partitioning unstructured graphs, partitioning meshes, and computing fill-reducing orderings of sparse matrices," University of Minnesota, Minneapolis, MN, United States of America, Tech. Rep., 2011.

[33] B. A. Hendrickson and R. Leland, "Chaco: Software for partitioning graphs," 2015. [Online]. Available: https://cfwebprod.sandia.gov/cfdocs/CompResearch/templates/insert/softwre.cfm?sw=36

[34] ——, "The Chaco user's guide: Version 2.0," Sandia National Laboratories, Albuquerque, NM, United States of America, Tech. Rep., 1995.

[35] F. Pellegrini, "SCOTCH: Static mapping, graph, mesh and hypergraph partitioning, and parallel and sequential sparse matrix ordering package," 2012. [Online]. Available: http://www.labri.u-bordeaux.fr/perso/pelegrin/scotch/

[36] ——, "Scotch and libScotch 5.1 user's guide," Université Bordeaux I, Talence, France, Tech. Rep., 2010.

[37] F. Pellegrini and J. Roman, "SCOTCH: A software package for static mapping by dual recursive bipartitioning of process and architecture graphs," in High-Performance Computing and Networking, ser. Lecture Notes in Computer Science, H. Liddell, A. Colbrook, B. Hertzberger, and P. Sloot, Eds. Springer Berlin Heidelberg, 1996, vol. 1067, no. 1996, pp. 493–498.

[38] R. Baños Navarro and C. Gil Montoya, "Graph and mesh partitioning: An overview of the current state-of-the-art," in Mesh Partitioning Techniques and Domain Decomposition Methods, F. Magoulès, Ed. Stirlingshire, United Kingdom: Saxe-Coburg Publications, 2007, pp. 1–26.

[39] G. Karypis and V. Kumar, "Multilevel k-way partitioning scheme for irregular graphs," Journal of Parallel and Distributed Computing, vol. 48, no. 1, pp. 96–129, jan 1998.

[40] R. Diekmann, R. Preis, F. Schlimbach, and C. H. Walshaw, "Shape-optimized mesh partitioning and load balancing for parallel adaptive FEM," Parallel Computing, vol. 26, no. 12, pp. 1555–1581, 2000.

[41] R. Battiti and A. A. Bertossi, "Greedy, prohibition, and reactive heuristics for graph partitioning," IEEE Transactions on Computers, vol. 48, no. 4, pp. 361–385, apr 1999.

[42] G. Karypis and V. Kumar, "Analysis of multilevel graph partitioning," in Proceedings of the 1995 ACM/IEEE Conference on Supercomputing (CDROM), ser. Supercomputing '95. New York, NY, United States of America: ACM, 1995, p. 29.

[43] ——, "Analysis of multilevel graph partitioning," University of Minnesota, Minneapolis, MN, United States of America, Tech. Rep., 1995.

[44] H. Meyerhenke, B. Monien, and S. Schamberger, "Graph partitioning and disturbed diffusion," Parallel Computing, vol. 35, no. 10-11, pp. 544–569, 2009.

[45] M. Deveci, K. Kaya, B. Uçar, and Ü. V. Çatalyürek, "Hypergraph partitioning for multiple communication cost metrics: Model and methods," Journal of Parallel and Distributed Computing, vol. 77, pp. 69–83, 2015.

[46] B. A. Hendrickson and T. G. Kolda, "Graph partitioning models for parallel computing," Parallel Computing, vol. 26, no. 12, pp. 1519–1534, nov 2000.

[47] Ü. V. Çatalyürek, M. Deveci, K. Kaya, and B. Uçar, "UMPa: A multi-objective, multi-level partitioner for communication minimization," in Graph Partitioning and Graph Clustering, D. A. Bader, H. Meyerhenke, P. Sanders, and D. Wagner, Eds. Providence, RI, United States of America: American Mathematical Society, 2013, vol. 588, pp. 53–66.

[48] D. P. Woodruff and Q. Zhang, "When distributed computation is communication expensive," in Distributed Computing, ser. Lecture Notes in Computer Science, Y. Afek, Ed. Springer Berlin Heidelberg, 2013, vol. 8205, pp. 16–30.

[49] D. Beckett and University of Kent at Canterbury, "Internet parallel computing archive," 2000. [Online]. Available: http://wotug.org/parallel/
