A Parallel Sparse Direct Solver via Hierarchical DAG Scheduling

Kyungjoo Kim, The University of Texas at Austin
Victor Eijkhout, The University of Texas at Austin

We present a parallel sparse direct solver for multi-core architectures based on Directed Acyclic Graph (DAG) scheduling. Recently, DAG scheduling has become popular in advanced Dense Linear Algebra libraries due to its efficient asynchronous parallel execution of tasks. However, its application to sparse matrix problems is more challenging as it has to deal with an enormous number of highly irregular tasks. This typically results in substantial scheduling overhead both in time and space, which causes overall parallel performance to be suboptimal. We describe a parallel solver based on two-level task parallelism: tasks are first generated from a parallel tree traversal on the assembly tree; next, those tasks are further refined by using algorithms-by-blocks to gain fine-grained parallelism. The resulting fine-grained tasks are asynchronously executed after their dependencies are analyzed. Our approach is distinct from others in that we adopt two-level task scheduling to mirror the two-level parallelism. As a result we reduce scheduling overhead, and increase efficiency and flexibility. The proposed parallel sparse direct solver is evaluated for the particular problems arising from the hp-Finite Element Method where conventional sparse direct solvers do not scale well.

Categories and Subject Descriptors: G.4 [Mathematical Software]: Efficiency

General Terms: Performance

Additional Key Words and Phrases: Gaussian elimination, Directed Acyclic Graph, Direct method, LU, Multi-core, Multi-frontal, OpenMP, Sparse matrix, Supernodes, Task parallelism, Unassembled HyperMatrix

ACM Reference Format:
Kim, K., Eijkhout, V., 2012. A Parallel Sparse Direct Solver via Hierarchical DAG Scheduling. ACM Trans. Math. Softw. 0, 0, Article 00 (2012), 26 pages.
DOI = 10.1145/0000000.0000000 http://doi.acm.org/10.1145/0000000.0000000

1. INTRODUCTION

Many scientific applications devote a considerable amount of time to solving linear systems of equations Ax = b, where A is usually a large and sparse matrix. Large sparse systems can be solved by either direct or iterative methods. The major disadvantage of iterative methods is that they may not converge to the solution: iterative methods usually depend on the construction of good preconditioners to ensure convergence. On the other hand, direct methods based on Gaussian elimination are robust but expensive. The memory required for the solution of 2D problems is O(N log N), and 3D problems generally increase the space complexity to O(N^{4/3}) when the matrix is permuted by a nested dissection ordering [Duff et al. 1976; Gilbert and Tarjan 1987]. The performance of sparse direct methods generally varies according to the sparsity of the problems [Gould et al. 2007]. No single approach is the best for solving all types of sparse matrices. The approach selected should be based on characteristics of the sparsity pattern such as bandedness, structural or numerical symmetry, or the presence of cliques in the matrix graph. In our previous work [Bientinesi et al. 2010], a new sparse direct solver using Unassembled HyperMatrices (UHMs) was designed and developed for the Finite Element Method (FEM) with an hp-adaptive strategy of mesh refinements.

Authors’ addresses: Kyungjoo Kim, Department of Aerospace Engineering and Engineering Mechanics, The University of Texas at Austin, Austin, TX 78712, [email protected]. Victor Eijkhout, Texas Advanced Computing Center, The University of Texas at Austin, Austin, TX 78758, [email protected].
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected].
© 2012 ACM 0098-3500/2012/-ART00 $15.00

DOI 10.1145/0000000.0000000 http://doi.acm.org/10.1145/0000000.0000000


In this adaptive context, the solver effectively uses the application information in solving a sequence of linear systems that are locally updated: the solver stores partial factors previously computed and exploits them to factor the current sparse system. Extending our previous work, we present a fully asynchronous parallel sparse direct solver targeting large-scale sparse matrices associated with three-dimensional hp-meshes that range from 100k to a few million unknowns.¹

There are two important aspects to hp-adaptive FEM matrices that we use in our solver. First of all, the linear systems derive from a sequence of consecutive meshes that are derived from an initial coarse mesh by local updates. This means that, below the level corresponding to the coarse mesh, elements are organized in a refinement tree. While our overall factorization approach is similar to the multi-frontal method [Duff 1986; Gupta and Kumar 1994; Liu 1992], the solver is able to preserve and reuse the previously constructed assembly tree and corresponding partial factors to solve the locally updated system (see our paper [Bientinesi et al. 2010] for more details).

Secondly, matrices from hp-adaptive problems have dense subblocks, implying that there are essentially no scalar operations: all operations are dense, which allows highly efficient level 3 Basic Linear Algebra Subprogram (BLAS) functions [Dongarra et al. 1990] to be utilized. The block structures produced by the high order discretization are directly utilized throughout the entire solution procedure. On the other hand, existing sparse direct solvers reconstruct such block structures using graph partitioning algorithms.

Since we target our solver to a multicore and shared memory model, we develop a solution based on Directed Acyclic Graph (DAG) scheduling. Unlike the classic fork-join parallel model, this approach adopts asynchronous parallel task scheduling using the DAG of tasks, where nodes stand for tasks and edges indicate dependencies among the tasks.

While these methods have been successful in the context of dense matrices, the application of DAG scheduling to sparse matrices is not trivial for the following reasons:

(1) The overall factorization has a large number of tasks, which can increase scheduling overhead if tasks are scheduled out-of-order to improve performance.

(2) Tasks are inherently irregular owing to the sparse matrix structure, the hp-adaptivity, and the growing block size during the factorization.

(3) Numerical pivoting of sparse matrices may create additional fill, and even additional tasks, causing dynamic changes to the workflow. Consequently, out-of-order scheduling based on a complete DAG becomes less efficient.

Possibly due to the above difficulties, we have found only a few cases which extend DAG-based task scheduling to multi-frontal Cholesky [Hogg et al. 2010], LDL^T and LU with static pivoting [Bosilca et al. 2012; Lacoste et al. 2012], and QR factorization [Buttari 2013]. In this paper, we present LU factorization with pivoting for matrices having a symmetric sparse structure.

In our scheme, irregular coarse-grain tasks are decomposed into regular fine-grained tasks using algorithms-by-blocks. Such algorithms view matrices as collections of blocks (submatrices), which become units of data. Computations with blocks then become units of computation. Refined tasks are scheduled in a fully asynchronous manner via multiple DAG schedulers. A key aspect of this solver is that a DAG scheduler locally analyzes a series of block operations associated with a set of dense subproblems.

The DAG scheduler developed here and its interface have the following features that are distinct from other advanced Dense Linear Algebra (DLA) libraries such as SuperMatrix [Chan et al. 2007; Quintana-Ortí et al. 2009] and Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA) [Buttari et al. 2009]:

— OpenMP explicit tasking. Our DAG scheduling is implemented based on the OpenMP framework, which is considered the de facto standard in parallel processing on shared-memory processors. The recently added feature of explicit task parallelism [OpenMP Architecture Review Board 2008] allows us to leave the lowest level of scheduling to the OpenMP runtime, rather than incorporating it explicitly in our scheduler, as is done in SuperMatrix and PLASMA. This means that our software is expressed at a high level of task abstraction and provides portable parallel performance.

¹ The advanced hp-FEM typically provides an order of magnitude higher solution resolution than conventional linear FEMs by using both variable element size (h) and higher order of approximation (p).


Fig. 1: (a) High order FE; (b) Element matrix. The left figure shows the number of DOFs related to topological nodes with a polynomial order p (one DOF per vertex, p−1 per edge, (p−1)(p−2)/2 per face); the figure on the right illustrates the shape of the corresponding element matrix.


— Nested parallelism. Our task scheduler is implemented based on the nested parallelism supported by OpenMP tasking. This feature enables nested DAG scheduling. While we explicitly manage the correct dependencies between dense subproblems during the multi-frontal factorization, each subproblem uses its own scheduler for the fine-grained tasks. These tasks are executed asynchronously through the unified tasking environment in OpenMP. In effect we are scheduling schedulers; a minimal sketch of this nesting is given below.
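To make this nesting concrete, here is a minimal, self-contained OpenMP sketch (not taken from the UHM code; the function names and printed messages are purely illustrative) in which each tree-level task spawns and waits on its own group of fine-grained tasks, so inner task groups are scheduled independently inside outer tasks.

#include <omp.h>
#include <cstdio>

// Illustrative inner "scheduler": a tree-level task spawns fine-grained tasks
// for its own blocks and waits only for those tasks.
static void factorize_blocks(int element, int nblocks) {
  for (int b = 0; b < nblocks; ++b) {
    #pragma omp task firstprivate(element, b)
    std::printf("element %d: block task %d on thread %d\n",
                element, b, omp_get_thread_num());
  }
  #pragma omp taskwait  // local synchronization: this element's tasks only
}

int main() {
  #pragma omp parallel
  #pragma omp single
  {
    for (int e = 0; e < 4; ++e) {
      #pragma omp task firstprivate(e)  // tree-level task
      factorize_blocks(e, 3);           // spawns nested fine-grained tasks
    }
    #pragma omp taskwait                // wait for all tree-level tasks
  }
  return 0;
}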

We will compare the performance of our solver against other sparse solvers. Additionally, we will evaluate our task scheduler strategy by also applying it to dense problems. Obtaining performance comparable to state-of-the-art DLA libraries will argue for the efficiency of our strategy.

2. SPARSE MATRICES FROM THE HP-FINITE ELEMENT METHOD

The FEM is widely used for solving engineering and scientific problems. The method approximates solutions using piecewise polynomial basis functions in a discretization of the physical problem domain, which creates a large sparse system of equations. The system of equations must be solved, and containing the numerical cost is a trade-off with the quality of the solution. For an efficient solution process it is essential to understand the characteristics of the derived sparse system.

The advanced hp-FEM uses an adaptive strategy to deliver highly accurate solutions while keeping the cost low. In the hp-FEM, the adaptive procedure controls both mesh size (h) and polynomial order of approximation (p). A large body of mathematical literature [Babuska and Suri 1994; Babuska et al. 1981; Szabo 1990] on the theoretical analysis of the hp-FEM proves its superior performance compared to the conventional linear FEM or the fixed p-FEM.

To solve problems formulated by the hp-FEM, a direct method is often preferred to iterative methods. The difficulty with iterative methods is an increasing condition number with the approximation order [Babuska et al. 1989; Carnevali et al. 1993], which leads to slow convergence or failure to find solutions.

Compared to other linear methods, an hp-discretization creates a more complex sparsity pattern due to the variable order of approximation.


Fig. 2: Characteristic sparsity patterns respectively obtained from p = 1 and p = 4, keeping the same system DOFs on a regular box mesh. (a) p = 1, nz = 226,981; (b) p = 4, nz = 1,771,561.

In the hp-discretization, multiple DOFs, corresponding to the polynomial order, are associated with the same topological node, as illustrated in Fig. 1a. The number of DOFs increases with the order of approximation² as follows:

Node Type    Edge    Face    Volume
# of DOFs    O(p)    O(p²)   O(p³)

This gives an element matrix shape as depicted in Fig. 1b.

A sparse system derived from an hp-discretization can be characterized in terms of topological nodes and associated DOFs rather than individual DOFs. Formally, we can construct a quotient graph of the graph of unknowns by dividing out the equivalence classes of basis functions associated with the same topological node. Our method of UHMs (see next) makes it possible to formulate the factorization algorithm entirely in terms of this quotient graph. This makes for an efficient solver, since the graph to be analyzed is considerably reduced, as well as giving an increased opportunity for using Level 3 BLAS operations.

Our solver is based on keeping element matrices unassembled as long as possible during the factorization. In the hp-FEM, these element matrices are fairly large, leading to a very efficient factorization; see our earlier work [Bientinesi et al. 2010]. Fig. 2 shows the global system of a linear element method and a higher order FEM; one sees that the hp-discretization gives a denser structure.

Our use of dense element matrices is somewhat similar to existing work on supernodes [Liu et al. 1993] and node amalgamation [Ashcraft and Grimes 1989; Davis and Hager 2009; Demmel et al. 1999]; it differs in that supernodes are not (re)discovered from the linearized matrix, but rather given a priori. Tests (as in our earlier work) show that this strategy can give higher efficiency than existing general sparse direct solvers.

3. FACTORIZATION ALGORITHMS

A general procedure for direct methods consists of four phases: ordering, analysis, factorization, and forward/backward substitution. In the first phase, a fill-reducing order of the sparse matrix is constructed. Next, a symbolic factorization is performed in the analysis phase to determine workspaces and supernodes. The heaviest workload is encountered within the numerical factorization. Once the factors are computed, the solution is obtained via forward/backward substitution.

² In general, the order of approximation selected by the hp-FEM is 3–4 for 2D and 5–6 for 3D problems.


Fig. 3: A fictitious refinement tree is obtained through recursive bisections on the hp-mesh. (Panels: Domain, 1st Refinement, 2nd Refinement.)

Fig. 4: An element assembly scheme and its corresponding factorization tree. On the right, a variant factorization for a subtree is given, and the tree is dynamically modified when pivots are delayed to control the element growth. The superscripts represent ownership of temporary storage for partially assembled Schur complements.

Our factorization adheres to the same schema, but several aspects differ from the general case because of our application area. For an overview of existing algorithms and literature, we refer to the book by Davis [Davis 2006].

In this section we consider aspects of the basic factorization algorithm; the next section will then discuss its efficient parallel execution. We consider the ordering scheme, the factorization in terms of unassembled element matrices, and the second factorization level which arises from the use of algorithms-by-blocks.

3.1. Ordering strategy

In our ordering phase we construct a partial order of the elements and subdomains. To arrive at a partial ordering of all elements we construct a posteriori a refinement tree for the initial mesh. This gives us a tree where all elements are recursively defined as refinements from a (fictitious) top element. This tree structure corresponds to a partial ordering of elements, and the factorization tasks will be forced to obey this ordering. This recursive procedure for constructing a mesh hierarchy is described in Fig. 3. (We have already remarked that the connectivity graph of the hp-mesh is considerably smaller than the graph of matrix elements; in this particular example, edge and interior faces are each associated with multiple DOFs.) Inside each element on a level all unknowns are treated together as a block; in between levels the factorization obeys the partial ordering just described: the coarse level subdomain can only be factored and eliminated after its child subdomains on the finer levels.


3.2. Unassembled factorization

In our hp-FEM context we never form a global matrix, or even a linear ordering of the unknowns. Instead, we store a tree structure of elements and super elements from which a matrix could be constructed. We call this an Unassembled HyperMatrix (UHM), since it is a generalization of the concept of a matrix, and since all elements are left unassembled. The dense matrices corresponding to elements are stored on the leaves of the tree. As is normal with FEM, internal variables of the finite elements are fully assembled, while inter-element boundary variables remain unassembled; the boundary variables also appear in multiple element matrices [Damhaug and Reid 1996; Duff and Reid 1983].

In our previous work [Bientinesi et al. 2010] we showed how the UHM idea provides an opportunity to reuse previously computed partial factors for the solution of a new, updated system. In this paper we use the UHM idea, but focus on the efficiency of solving a single linear system.

Our solver strategy is illustrated for a simple example in Fig. 4. The UHM solver organizes unassembled matrices as a full tree structure as described in this figure. The factorization is recursively driven by a partial order tree traversal; more formally we describe this recursion as follows:

(1) Assembly. An element on any but the leaf level is assembled from its children:

\[ A := \mathrm{assemble}\left( A_{BR}^{\mathrm{left}},\, A_{BR}^{\mathrm{right}} \right) \tag{1} \]

where A_BR^left and A_BR^right represent the Schur complements from the children.

(2) Identification of interior variables. The element is conformally partitioned into quadrants with square diagonal blocks. The ‘TL’ (top left) block corresponds to the fully assembled internal variables and the ‘BR’ (bottom right) block corresponds to boundary variables.

\[ A \rightarrow \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} \tag{2} \]

(3) Partial LU factorization with partial pivoting of the block A_TL. From its definition, A_TL contains the fully assembled nodes of an element and can therefore be eliminated. Block LU factorization is applied to the matrix A, where pivots are selected within the submatrix A_TL such that

\[ \begin{pmatrix} P & 0 \\ 0 & I \end{pmatrix} \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} L_{TL} & 0 \\ L_{BL} & I \end{pmatrix} \begin{pmatrix} U_{TL} & U_{TR} \\ 0 & S \end{pmatrix} \rightarrow A \tag{3} \]

where P is a permutation matrix, L_TL is normalized to have unit diagonal entries, U_TL is an upper triangular matrix, and S represents the Schur complement resulting from the elimination of the interior variables. The factors are computed in place and overwrite the corresponding submatrices of A. By equating corresponding submatrices on the left and right, the partial elimination proceeds as follows:

— Interior variables within the submatrix A_TL are eliminated by a standard right-looking LU algorithm with partial pivoting (LAPACK routine xgetrf), giving the factors A_TL ← {L_TL\U_TL}.
— Submatrices A_TR and A_BL are updated to A_TR ← L_TL^{-1}(P A_TR) and A_BL ← A_BL U_TL^{-1}, respectively.
— Submatrix A_BR is updated to the Schur complement, A_BR ← S := A_BR − A_BL A_TL^{-1} A_TR. Note that, since A_BR consisted of unassembled matrix elements, this Schur complement is itself unassembled; it will become assembled when it is merged into its parent on the next higher level.

The recursive factorization finishes at the root, where A_BR is an empty matrix.
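To make step (3) concrete, the following sketch performs the partial elimination of one element matrix stored in column-major order, using LAPACK and BLAS calls; the function name, the layout assumptions (A_TL of order n_tl in the top-left corner, boundary block of order n_br, leading dimension lda), and the error handling are illustrative only and are not the solver's actual interface.

#include <lapacke.h>
#include <cblas.h>
#include <cstddef>
#include <vector>

// Partial LU of one element: pivots are restricted to the fully assembled
// block A_TL (order n_tl); the boundary block A_BR (order n_br) receives the
// unassembled Schur complement. A is column-major with leading dimension lda.
int partial_factorize(double *A, int n_tl, int n_br, int lda,
                      std::vector<lapack_int> &ipiv) {
  double *ATL = A;
  double *ATR = A + (std::size_t)n_tl * lda;
  double *ABL = A + n_tl;
  double *ABR = A + (std::size_t)n_tl * lda + n_tl;
  ipiv.resize(n_tl);

  // A_TL <- {L_TL \ U_TL} with partial pivoting inside A_TL only (xgetrf)
  lapack_int info = LAPACKE_dgetrf(LAPACK_COL_MAJOR, n_tl, n_tl, ATL, lda, ipiv.data());
  if (info != 0) return (int)info;

  // apply the row permutation P of Eq. (3) to A_TR
  LAPACKE_dlaswp(LAPACK_COL_MAJOR, n_br, ATR, lda, 1, n_tl, ipiv.data(), 1);

  // A_TR <- L_TL^{-1} (P A_TR)
  cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans, CblasUnit,
              n_tl, n_br, 1.0, ATL, lda, ATR, lda);
  // A_BL <- A_BL U_TL^{-1}
  cblas_dtrsm(CblasColMajor, CblasRight, CblasUpper, CblasNoTrans, CblasNonUnit,
              n_br, n_tl, 1.0, ATL, lda, ABL, lda);
  // A_BR <- A_BR - A_BL A_TR : the (still unassembled) Schur complement
  cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
              n_br, n_br, n_tl, -1.0, ABL, lda, ATR, lda, 1.0, ABR, lda);
  return 0;
}

After the call, the Schur complement left in A_BR is exactly the unassembled block that is passed to the parent's assembly step (1).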

3.3. Pivoting strategy

Our pivot strategy is designed to mitigate element growth while limiting additional fill-in. Above we already remarked that we use a pivoting LU factorization for the internal variables of an element.


After this factorization, we measure the element growth [Erisman and Reid 1974] to ensure the factorization is stable with a given threshold α:

\[ \max_{i,j} \bigl| \{L\backslash U\}_{ij} \bigr| < \alpha \cdot \max_{i,j} \bigl| A_{ij} \bigr| \tag{4} \]

where α is a threshold value to monitor excessive element growth. In general α is a function of a prescribed constant value and the size of the element. For instance, we use the following criterion:

\[ \alpha = (1.0 + u_0^{-1}) \cdot \mathrm{size}(A_{TL}) \tag{5} \]

where the criterion is slightly modified from [Duff and Reid 1983].

If unacceptable element growth is found, the factorization is discarded, the whole matrix is merged with its sibling in Step (1) on the next higher level, and the elimination of the interior variables is thus delayed to the parent level, as depicted in Fig. 4. The effect of this is that the solver can then select pivots from a larger column, presumably attaining greater stability. We use this cautious approach to preserve the block structures derived from the hp-discretization in the factorization phase. We do not use any scaling techniques.
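A minimal sketch of this growth test, assuming the factored element is held in a column-major array and that max|A_ij| was recorded before the elimination; the function name, argument list, and the caller's reaction to a failed test are illustrative assumptions rather than the solver's actual code.

#include <algorithm>
#include <cmath>
#include <cstddef>

// Returns true when the partial factorization passes the growth test of
// Eqs. (4)-(5); false means the elimination should be discarded and delayed
// to the parent element.
bool growth_test(const double *factor,  // {L\U} overwriting A_TL, column-major
                 int n, int lda,        // n = size(A_TL)
                 double max_abs_A,      // max |A_ij| recorded before factorization
                 double u0) {           // prescribed constant, e.g. 0.1 or 0.001
  double max_abs_LU = 0.0;
  for (int j = 0; j < n; ++j)
    for (int i = 0; i < n; ++i)
      max_abs_LU = std::max(max_abs_LU, std::fabs(factor[i + (std::size_t)j * lda]));

  const double alpha = (1.0 + 1.0 / u0) * n;  // Eq. (5): (1 + u0^{-1}) * size(A_TL)
  return max_abs_LU < alpha * max_abs_A;      // Eq. (4)
}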

Fig. 5: Element growth, max|{L\U}_ij| / max|A_ij| (logarithmic scale, 10^0 to 10^6), as a function of the dense subproblem size (200 to 800), for three settings: no pivot delay, u0 = 0.001, and u0 = 0.1.

The graph depicted in Fig. 5 describes the element growth during factorization for two different stability thresholds. We use random element matrices, whose entries are scaled in (−1.0, 1.0), to form a system of equations. The connectivities of the element matrices are based on an unstructured tetrahedral mesh with p = 4 discretizing a spherical domain. In practice, as the element matrix size increases at the upper levels of the hierarchy, we observed that partial pivoting using just the rows in the elimination block becomes stable. On the other hand, when delayed pivots occur for small-sized elements, the overhead in handling (recomputing) those pivots slightly increases the factorization cost.

3.4. Algorithms-by-blocks

The previous subsections discussed the factorization, focusing on the natural parallelism from its recursive tree traversal: tasks on separate branches can be processed simultaneously. However, there is a second source of parallelism. Closer to the root, where task parallelism decreases, the blocks to be factored grow. Thus, identifying further parallelism in processing these dense blocks is essential for improving the efficiency of the overall algorithm.


Fig. 6: LU factorization with partial pivoting on a 3×3 block matrix, where pivots are selected in a whole column vector. Each of the three iterations performs an unblocked PLU on the current column of blocks, applies the pivots, and then issues TRSM and GEMM tasks for the remaining blocks.

For processing these blocks, we use so-called algorithms-by-blocks [Low and van de Geijn 2004; Quintana-Ortí et al. 2009], which reformulate DLA algorithms in terms of blocks. Consider for example the partial factorization (step 3) described above. The factorization algorithm is presented in terms of blocks, but only two independent tasks are available:

A_TL ← {L\U}_TL                                        (PLU)
      ↓
A_TR ← L_TL^{-1}(P A_TR),   A_BL ← A_BL U_TL^{-1}      (TRSM)
      ↓
A_BR ← A_BR − A_BL A_TR                                (GEMM)
                                                        (6)

Further task-level parallelism can be pursued by organizing a matrix by blocks (submatrices). For instance, consider a matrix A_TL with 3×3 blocks, where each block A_ij has conforming dimensions with adjacent blocks:

\[ A_{TL} = \begin{pmatrix} A_{00} & A_{01} & A_{02} \\ A_{10} & A_{11} & A_{12} \\ A_{20} & A_{21} & A_{22} \end{pmatrix} \tag{7} \]

The LU factorization can be reformulated as an algorithm-by-blocks by changing the unit of data from a scalar to a block. A number of tasks are identified from the resulting workflow. For example, Fig. 6 describes a block LU factorization with partial pivoting. In the first iteration of the algorithm, four independent TRSM and four independent GEMM tasks are created after the unblocked LU factorization with partial pivoting is performed on the first merged column block. We will discuss implementation aspects of this operation later in Section 5. The algorithm generates tasks by repeating this process. In the same way, coarse-grain TRSM and GEMM tasks associated with A_BL, A_TR and A_BR are also decomposed into fine-grained tasks. These fine-grained tasks are mostly regular and are related through input/output data dependencies. After their dependencies are analyzed, tasks are scheduled asynchronously, which leads to highly efficient task parallelism on modern multi-core architectures. This has been explored in the past for parallelizing a sequential workflow of dense matrices [Buttari et al. 2009; Chan et al. 2007; Quintana-Ortí et al. 2009].
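The loop nest below sketches how such fine-grained tasks can be enumerated for a right-looking block LU on an N×N grid of blocks; the Task record and the omission of the pivot-application tasks are illustrative simplifications, not the solver's FLAME-based task interface.

#include <string>
#include <vector>

// One block of the matrix is identified by its block-row and block-column index.
struct Blk { int i, j; };
// An illustrative task record: kernel name, input blocks, and the output block.
struct Task { std::string kind; std::vector<Blk> in; Blk out; };

// Enumerate PLU/TRSM/GEMM tasks of a right-looking block LU on an N x N grid
// of blocks (pivot-application tasks and dependency analysis are omitted).
std::vector<Task> create_block_lu_tasks(int N) {
  std::vector<Task> tasks;
  for (int k = 0; k < N; ++k) {
    tasks.push_back({"PLU", {}, {k, k}});                     // factor diagonal block
    for (int j = k + 1; j < N; ++j)
      tasks.push_back({"TRSM", {{k, k}}, {k, j}});            // update block row k
    for (int i = k + 1; i < N; ++i)
      tasks.push_back({"TRSM", {{k, k}}, {i, k}});            // update block column k
    for (int i = k + 1; i < N; ++i)
      for (int j = k + 1; j < N; ++j)
        tasks.push_back({"GEMM", {{i, k}, {k, j}}, {i, j}});  // A(i,j) -= A(i,k) A(k,j)
  }
  return tasks;
}

For N = 3 this enumeration yields 14 tasks (3 PLU, 6 TRSM, 5 GEMM), matching the pivot-free task list tabulated later in Table I.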

In the next section, we explore how to incorporate such algorithms in the context of a task-parallel sparse solver.

4. A FULLY ASYNCHRONOUS PARALLEL SPARSE DIRECT SOLVER

In the course of a sparse factorization, as depicted in Fig. 7, we have two oppositely behaving types of parallelism: on the one hand, a decreasing amount of task parallelism as the factorization progresses from the leaves to the root of the tree; on the other hand, an increasing opportunity for parallelism inside the blocks as their sizes grow. The question is how to exploit the two-level parallelism in harmony to extract near-optimal utilization from multi-core architectures while avoiding excessive complexity in the implementation.


Fig. 7: Recursive factorization on the assembly tree. Irregular sized tasks are created and they are hierarchically related.

4.1. The limits of existing approaches

Sparse linear systems from hp-adaptive problems have been solved (see for instance [Demkowicz et al. 2007]) using general purpose direct solvers. Our earlier work [Bientinesi et al. 2010] showed that our UHM approach compares favorably on a single processor; we now argue that on multicore architectures our problem area also allows us to develop an efficient parallel solver that compares favorably to others.

Adapting our factorization to the multicore case, we recognize that the two-level factorization, of refinement (assembly tree) levels and elements inside a level, leads to a large number of tasks, which we could schedule through a package like SuperMatrix, PLASMA or QUeuing And Runtime for Kernels (QUARK) [Yarkhan et al. 2011]. There are two problems with this.

First, the number of tasks in our application is very large, which would lead to large scheduling overhead in analyzing task dependencies. For example, a problem discussed in Section 6 with p = 4 creates 523 thousand tasks, and 1.76 million tasks are invoked for the same mesh approximated by p = 6. A scheduler such as QUARK uses a task window approach³ to reduce the scheduling overhead, but this may incur inefficiency because of the repeated termination of one window and start-up of the next. Our scheduler avoids this problem since the tree-level tasks have their ordering given a priori, and only tasks resulting from subdividing dense blocks need to be scheduled explicitly. Thus we have both a global schedule and low overhead.

Second, a further objection is that the task list is dynamic because of numerical pivoting, and these packages can not yet deal with that.

Other approaches could recognize the two-level structure, such as using multi-threaded BLAS or a DAG scheduler such as QUARK for the nodes, coupled with a simple post-order tree traversal. This approach [Geist and Ng 1989; Pothen and Sun 1993] suffers from imperfect load balance, for instance because each node in the graph has to be assigned to a fixed number of cores, which can then not participate in processing other nodes. Also, completion of a node subtask is typically a global synchronization point, diminishing parallel efficiency.

For these reasons we advocate an approach where a single task list is formed from the subtasks of all the compute nodes. Our approach does not suffer from excessive scheduling overhead, since we do not analyze global dependencies: we combine the partial ordering of tasks in the tree with a dependency analysis on the subtasks from each tree node factorization. One might say we use our a priori knowledge of the factorization algorithm for the large scale scheduling, and only use a runtime scheduler where the algorithm does not dictate any particular ordering.

Finally, our approach can deal with pivoting and its dynamic insertions and deletions in the task queue, as will be discussed below.

³ In principle, users are responsible for the window size in QUARK. The default window size is set to 50 times the number of threads.


Fig. 8: Internal workflow in the tree-level task associated with partial factorization: assemble child Schur complements (matrix addition); then {L_TL\U_TL} := LU(A_TL), A_BL := TRSM(L_TL, A_BL), A_TR := TRSM(U_TL, A_TR), A_BR := GEMM(A_BL, A_TR, A_BR) (level 3 BLAS, with matrix blocking setting the task granularity); finally a stability check that either passes or delays pivots.

4.2. Parallelization strategy

Our solver strategy is driven by the parallel tree traversal of the elimination tree: each node in the tree is a task that handles one partially assembled element matrix, and it can not proceed until its children have been similarly processed. Recursively, this defines the overall algorithm.

We illustrate a tree-level task in Fig. 8. It consists of an assembly (at O(n²) cost) of the Schur complements that resulted from the factorization of the children elements, followed by level 3 BLAS operations for the factorization (all of which are O(n³)). While the submatrices are assembled using irregular hp-blocks, matrix factorization proceeds with uniformly sized blocks to gain efficient fine-grained parallelism. Although some assembly and factorization tasks can be executed concurrently after their task dependencies are analyzed [Buttari 2013; Hogg et al. 2010], scheduling many tiny assembly tasks incurs more overhead than it gains in parallel efficiency for our application problems. For this reason, tasks that correspond to the matrix assembly are aggregated into a single task.

The block algorithms in each node-level task lead to finer-grained tasks that form a DAG; the entire factorization can then be represented by hierarchically related subgraphs as depicted in Fig. 9. We first schedule tasks related to corresponding elements in the assembly tree. Those tree-level tasks are scheduled together with local DAG schedulers. In effect, we schedule schedulers. Next, each group of fine-grained tasks is dispatched by a local scheduling policy, which is guided by the associated local DAG. To reduce excessive overhead in scheduling a large dense problem, we also use a window mechanism like the one adopted by QUARK. Our window mechanism is separately applied to tasks within a scheduler; thus, the window does not limit scheduling of tasks generated from other schedulers. Since no global DAG is used, scheduling overhead is fairly modest.

4.3. Parallel tree traversal

In the factorization phase we create and schedule a large number of tasks. However, we do not use an explicit data structure to reflect the schedule according to which tasks are to be executed. Instead, we rely on OpenMP mechanisms: by declaring tasks in the right execution order with OMP pragmas, they are entered into the internal OMP scheduler with their dependencies indicated.

We also do not use the omp parallel for pragma around loops; rather, we let the factorization generate OMP tasks. A reason for not using OMP parallel loops is that they are not suited to complex nested parallelism. A major challenge in implementing nested parallelism is load balancing: tasks have to be divided into separate groups, and threads must be properly allocated to the groups. However, during multi-frontal factorization workloads are dynamically defined; thus, it is difficult to match tasks with computing resources to achieve good load balance.


Fig. 9: A DAG that illustrates how fine-grained tasks (LU, Pivot, Trsm, Gemm, with HEAD and TAIL markers) are globally organized in the factorization phase. Fine-grained tasks are inter-related to each other within each subgraph, and the subgraphs are in turn related by the assembly tree. The graphs in this figure are rotated.


// ** Sparse multifrontal factorization via UHMs
int factorize_sparse_matrix(Tree::Node *root) {
  // begin with the root node
  recursive_tree_traverse(root);
  return SUCCESS;
}

// ** Recursive tree traversal
int recursive_tree_traverse(Tree::Node *me) {
  for (int i=0; i<me->get_n_children(); ++i) {
    // tree-level task generation for child tree-nodes
    #pragma omp task firstprivate(i)
    recursive_tree_traverse(me->get_child(i));
  }

  // the current task is suspended
  #pragma omp taskwait

  // process the function (local scheduling)
  factorize_uhm(me);

  return SUCCESS;
}

// ** Partial factorization in UHM
int factorize_uhm(Tree::Node *nod) {
  // merge the Schur complements from child tree-nodes
  nod->assemble();

  // local DAG scheduling for LU factorization
  Scheduler s;

  // tasks are created using algorithms-by-blocks
  // and they are associated with a local scheduler
  create_lu_tasks(nod->ATL, s);
  create_trsm_tasks(nod->ATL, nod->ABL, s);
  create_trsm_tasks(nod->ATL, nod->ATR, s);
  create_gemm_tasks(nod->ABL, nod->ATR, nod->ABR, s);

  // local parallel execution (nested parallel tasking)
  s.flush();

  // monitoring element growth;
  // if necessary, UHM is re-assembled and eliminations are delayed
  nod->stability_check();

  return SUCCESS;
}

Figure 10: Tree-level tasks are generated using the OpenMP framework.

The mechanisms we use are part of the explicit task management that was added to OpenMP 3.0, released in 2008 [OpenMP Architecture Review Board 2008]. The new task scheme includes two compiler directives:

(1) #pragma omp task creates a new task;
(2) #pragma omp taskwait is used to synchronize invoked (nested) tasks.

In this work, nested parallelism is supported by OpenMP tasking; a task can recursively create descendant tasks.

When a task spawns descendant tasks, #pragma omp taskwait can suspend the task until those tasks are completed. For example, Fig. 10 outlines the parallel multi-frontal factorization through parallel tree traversal with OpenMP tasking. Invoked tasks are scheduled based on a partial order in the tree structure; tasks in the current recursion level are executed before their parent task is processed.


// ** Fine-grained tasks are generated using algorithms-by-blocks
int create_gemm_task(int transa, int transb,
                     FLA_Obj alpha, FLA_Obj A, FLA_Obj B,
                     FLA_Obj beta,  FLA_Obj C,
                     Scheduler s) {
  // no transpose A, no transpose B
  // Matrix objects A, B, and C are matrices consisting of blocks
  for (int p=0; p<A.width(); ++p)
    for (int k2=0; k2<C.width(); ++k2)
      for (int k1=0; k1<C.length(); ++k1)
        // tasks are enqueued with in/out arguments
        s.enqueue(Task(name="Gemm", op=blas_gemm,                  // function pointer
                       n_int_args=2, transa, transb,               // 2 - integer variables
                       n_fla_in=4, alpha, A(k1,p), B(p,k2), beta,  // 4 - FLA input matrices
                       n_fla_out=1, C(k1,k2)));                    // 1 - FLA output matrix
  return SUCCESS;
}

Figure 11: Fine-grained tasks are generated and gathered for out-of-order scheduling.

For low-level thread binding, we rely on OpenMP primitives. OpenMP utilizes a thread pool to execute tasks; when a task is ready to execute, an idle thread picks it up and processes it. Hence, the programming burden for dispatching tasks is removed, and we can achieve portable performance on various architectures that offer an OpenMP implementation.

4.4. Memory usage in tree traversal

During the tree traversal, the solver allocates and frees temporary storage for Schur complements. The necessary amount of memory for the factorization mostly depends on the shape of the assembly tree and the traversal order [Guermouche et al. 2003].

In our recursive factorization depicted in Fig. 10, elements are dynamically created and destroyed by working threads. We also note that our solver does not pre-allocate storage for non-leaf elements before the factorization begins.⁴ The active memory usage is therefore determined by the OpenMP internal task scheduler. This also implies that our memory consumption could be arbitrarily bad depending on the OpenMP implementation. In practice, the OpenMP task scheduler provides good memory usage, as major compilers use task scheduling policies that improve data locality, which probably results in Depth First Search (DFS) tree traversal. Since our assembly tree is well-balanced, all DFS-like tree traversals are good.

4.5. Scheduling strategy of the block matrix operations

In Fig. 10 we showed how the block matrix operations are properly scheduled in a partial order. If we properly schedule the tasks that make up the block matrix factorization, we recursively guarantee the correct execution of all tasks. While it would be possible to schedule the block matrix tasks similarly, this would not give a sufficiently fast ramp up to full utilization of all cores. Therefore, we use a slightly more complicated strategy.

We execute the block matrix factorization symbolically, generating a list of tasks. This list is then analyzed to find the predecessors of each task. We then schedule the tasks using the OpenMP task mechanism:

(1) A task is only executed after all its predecessors are executed. This is implemented with an OpenMP task barrier, which leaves the low-level task binding to the OpenMP runtime.

(2) Since a task may be a predecessor to several other tasks, it may be invoked multiple times. However, it is only executed once.

⁴ This dynamic memory (de)allocation is also well suited to Non-Uniform Memory Access (NUMA) architectures, as physical memory is allocated on the NUMA node on which the thread is running by a ‘first-touch’ rule.


// ** Invoke all tasks
int Scheduler::flush() {
  while (tasks_not_empty()) {
    open_window();  // set a range of tasks for analysis
    analyze();      // construct a DAG for those tasks
    execute();      // execute tasks in parallel
  }
  tasks.clear();    // clean-up task queue
  return SUCCESS;
}

// ** Execute tasks in an active window
int Scheduler::execute() {
  for (int i=_begin; i<_end; ++i) {
    #pragma omp task firstprivate(i)  // schedule fine-grained tasks using OpenMP
    this->tasklist.at(i).execute_once();
  }
  #pragma omp taskwait  // complete the execution of a set of tasks
  close_window();       // close the window
  return SUCCESS;
}

Figure 12: The first-level task scheduling.

We illustrate the details of our approach using the example of a dense LU factorization without pivoting. Consider a block LU factorization on a 3×3 block matrix (Fig. 6). A list of tasks is generated by the LU algorithm, and tasks are sequentially enqueued into a scheduler (GEMM example in Fig. 11).

Task:   0 LU   1 Tr   2 Tr   3 Tr   4 Tr   5 Gm   6 Gm   7 Gm   8 Gm   9 LU   10 Tr   11 Tr   12 Gm   13 LU
Dep 1:  -      0 LU   0 LU   0 LU   0 LU   1 Tr   1 Tr   2 Tr   2 Tr   5 Gm   5 Gm    5 Gm    6 Gm    8 Gm
Dep 2:  -      -      -      -      -      3 Tr   4 Tr   3 Tr   4 Tr   -      7 Gm    6 Gm    7 Gm    12 Gm
Dep 3:  -      -      -      -      -      -      -      -      -      -      9 LU    9 LU    8 Gm    -
Dep 4:  -      -      -      -      -      -      -      -      -      -      -       -       10 Tr   -
Dep 5:  -      -      -      -      -      -      -      -      -      -      -       -       11 Tr   -

Table I: A list of tasks generated by an LU factorization without pivoting on a 3×3 block matrix; for simplicity we remove the pivoting operations of Fig. 6. The first row represents the list of enqueued tasks, where the numbers imply the enqueuing order, and each task records its dependent tasks in its column.

After task dependencies are analyzed, tasks are organized in the scheduler as tabulated in Table I. Details of the scheduler are given in Fig. 12. Notably, the execute function invokes the list of tasks in the first row of the table. Next, the first rule drives a recursion on the dependent tasks. At any given task in the list, OpenMP can execute the task in the right order satisfying the dependencies. For instance, the column of Task 6 creates a recursive call stack:

Task 6 -> Task 1,4 -> Task 0

This recursive process is naturally suited to nested parallelism using omp task. However, the recursion can also invoke the same task multiple times. The second rule prevents this situation and enforces that a task is executed once, by the first-reached thread. In this example, both Task 1 and Task 4 invoke Task 0, but Task 0 is exclusively executed by the first-reached thread; the other encountering thread is redirected to other available tasks by omp taskyield. This approach can also increase scheduling overhead, as the same tasks are submitted multiple times into the OpenMP runtime task scheduler. In practice, such overhead can be controlled by using the task window that limits the number of active tasks.


// ** Tasking policies
int Task_::execute_once() {
  // return flag
  int r_val = SUCCESS, status;

  #pragma omp atomic capture
  {
    status = this->_once_execute;
    ++this->_once_execute;
  }

  // Rule 2: execute the current task ``once''
  if (status == 1) {
    // Rule 1: recursive calls on dependent tasks
    for (int i=0; i<this->get_n_dependency(); ++i) {
      #pragma omp task firstprivate(i)
      this->get_dependent_task(i)->execute_once();
    }
    // execute the current task after all dependent tasks are processed
    #pragma omp taskwait
    r_val = this->execute();
  }

  // yield the current thread until this task is completed
  while (!this->is_completed()) {
    #pragma omp taskyield
  }
  return r_val;
}

Figure 13: Implementation of tasking policies.

Pseudo code for this scheduling is illustrated in Fig. 13. In this scheduling mechanism, a task table such as the one depicted in Table I can be interpreted as scheduling hints, representing the concurrent tasks (in a row) and data locality (in a column):

— tasks in the first row can be independently executed in an arbitrary order, and
— dependent tasks in the same column can be tied to the current working thread in favor of reusing data.

However, there exist several different implementations of OpenMP tasking, as the OpenMP standard does not specify implementation details. Various workqueue models and scheduling policies are studied in [Duran et al. 2008; Olivier and Prins 2010; Shah et al. 2000; Terboven et al. 2008]. Thanks to these efforts,⁵ the current OpenMP implementation shows reasonable performance in the dense matrix factorization discussed in the next section.

5. DENSE MATRIX HANDLING

Our target application leads to a sparse problem that features a large number of subproblems. It would be possible to handle these with SuperMatrix or PLASMA, but we argue that this is suboptimal because of the context in which these matrices appear. In this section we show that our custom dense matrix solver performs comparably to state-of-the-art packages.

Since we handle many dense matrices in a row, resource management becomes an issue: when multiple task schedulers (namely, for the different dense blocks) are used, compute resources should be efficiently shared between the schedulers. Our solution to this problem is to not manage the dispatching of tasks explicitly, but to let this be handled by OpenMP. As the reader can see in Fig. 13, we limit ourselves to declaring taskwait pragmas so that dependencies are observed.

5 We were not able to find references for specific implementation details of the current GNU and Intel compilers.
6 The benchmark measures sustainable memory bandwidth for simple vector kernels, e.g., triad: a(i) = b(i) + q*c(i).


  Processors         24 processing cores, Intel Dunnington 2.66 GHz
  Memory             96 GB ccNUMA, 1066 MHz FSB
  Cache              16 MB L3, three 3 MB shared L2 caches
  Compiler & OpenMP  Intel 11.1
  DLA                FLAME ver 10519, PLASMA ver 2.0
  BLAS               Intel MKL 10.2
  Theoretical Peak   256 GFLOPS

(a) Hardware specifications

  Threads             1     2     4     8      10    12     16     20     24
  Bandwidth (GB/sec)  2.19  4.28  8.41  10.39  8.62  10.31  10.33  10.44  10.46

(b) Stream benchmark [McCalpin 1995] (triad)6

Table II: Characteristics of target architecture, Clarksville

[Figure: GFLOPS vs. dimension n (in thousands) for Dense UHM (DAG), SuperMatrix, SuperMatrix with packing, PLASMA, and PLASMA with packing.]

Fig. 14: [24 cores] Dense Cholesky factorization using a fixed blocksize 256.

Here, we show with two examples that our dense subproblem solver performs as well as some state-of-the-art DLA packages. First, we report on a Cholesky factorization to show the efficiency of our task handling scheme; then we report on LU with partial pivoting to show that pivoting does not detract from it. All tests were run on the machine with the specification tabulated in Table II.

As depicted in Fig. 14, our dense solver performs similarly to, or slightly worse than, SuperMatrix and PLASMA on the dense Cholesky factorization. In this comparison, SuperMatrix and PLASMA are interfaced to a format of storage-by-blocks [Gustavson et al. 1999] (also called tile layout); a matrix is divided in blocks, and each block is contiguously laid out in memory. The block format may have some performance advantages because the storage scheme provides better data locality and more concurrency of multi-threaded operations as false sharing is reduced. However, the block format incurs more complexity in assembling Schur complements. For this reason, we use the traditional (column-major) matrix format; our experiment also shows that the use of the standard format does not adversely affect performance.
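To make the layout difference concrete, the following addressing sketch (illustrative only; the indexing conventions are assumed, and boundary blocks are padded to full nb x nb tiles, which is not necessarily what PLASMA, SuperMatrix, or libflame do) contrasts the two formats.

#include <cstddef>

// Column-major storage (used by the UHM dense kernels here): a column is
// contiguous, which keeps pivot search and Schur-complement assembly simple.
inline double &elem_colmajor(double *A, int lda, int i, int j) {
  return A[(std::size_t)j * lda + i];
}

// Storage-by-blocks / tile layout (as used by PLASMA and SuperMatrix in this
// comparison): each nb x nb block is contiguous, which improves locality and
// reduces false sharing for block kernels, but a matrix column is scattered
// across several blocks.
inline double &elem_tiled(double *A, int n, int nb, int i, int j) {
  const int bi = i / nb, bj = j / nb;               // block coordinates
  const int ii = i % nb, jj = j % nb;               // offsets inside the block
  const int mb = (n + nb - 1) / nb;                 // blocks per block-column
  double *block = A + ((std::size_t)bj * mb + bi) * (std::size_t)nb * nb;
  return block[(std::size_t)jj * nb + ii];          // block itself is column-major
}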

In Fig. 15, we compare our dense LU factorization with partial pivoting against SuperMatrix [Chan et al. 2010] and PLASMA [Dongarra et al. 2011].


[Figure: GFLOPS vs. dimension n (in thousands) for Dense UHM (DAG), SuperMatrix (block format), and PLASMA (column-major format).]

Fig. 15: [24 cores] Dense LU factorization with partial pivoting using a fixed blocksize 256.

The figure shows that the performance of our task scheduler on the dense LU factorization matches that of PLASMA for small matrices, but PLASMA achieves higher performance asymptotically. By contrast, SuperMatrix does not scale well, because its block format makes it inefficient to gather the column vector for pivoting. The UHM solver and PLASMA use column storage, which makes this gathering unnecessary.

Alternatively, we could also consider the incremental pivoting scheme [Quintana-Ortí et al. 2009; Chan et al. 2010]; this approach provides more efficient task parallelism at the expense of diminished numerical stability [Quintana-Ortí and van de Geijn 2008].

6. RESULTS

In this section we compare our proposed solver against the state-of-the-art parallel sparse direct solvers MUMPS and PARDISO. We only focus on performance aspects, ignoring solution accuracy, as all solvers produce a similar relative residual.

A brief characterization of the solvers we compare against:

— MUMPS [Amestoy et al. 2002; Amestoy et al. 2006] has been developed for distributed architectures via the Message Passing Interface (MPI) library since 1996. For this comparison, MUMPS version 4.10.0 is interfaced to the Scalable Linear Algebra PACKage (ScaLAPACK) and Basic Linear Algebra Communication Subprograms (BLACS) provided by Intel Math Kernel Library (MKL) version 12.1. Single-threaded BLAS and Linear Algebra PACKage (LAPACK) are interfaced with 24 MPI processes. Test problems are provided in the assembled matrix format.7

— PARDISO [Schenk and Gartner 2002; Schenk and Gartner 2006] was developed for multi-core architectures in 2004, and is part of Intel's MKL. For this comparison, we use version 12.1.

For all cases, we use a 24-core 'large memory' node of the Lonestar machine, with properties tabulated in Table III. As user access to the hardware memory counters is not allowed on this machine, we were not able to check whether our current implementation reaches the memory bandwidth limit of the QuickPath Interconnect (QPI). However, our evaluation indicates that the QPI bandwidth is not a limiting factor for the parallel performance.

7 Regarding the MUMPS setup, we tested several configurations of MPI processes x threads, e.g., 24x1, 12x2, 6x4, 3x8, and 1x24. The performance reported in this paper is the best result, which is 24x1 for the test problems in the paper. MUMPS supports both assembled matrix format and element-based input; in this study, the solver performs slightly better with the assembled matrix format.


  Processors             4x6 Intel Xeon E7540 2.0 GHz
  Memory                 1 TB, 64x16 GB, DDR3-1066 MHz, 4x QPI
  Compiler & OpenMP      GNU 4.4.5
  DLA                    FLAME ver 10519
  BLAS                   Intel MKL 12.1
  Peak performance/core  8.0 GFLOPS

(a) Lonestar

  Threads             1     2      4      8      10     12     16     20     24
  Bandwidth (GB/sec)  5.21  10.16  20.29  34.56  42.44  46.33  58.83  61.19  71.17

(b) Stream benchmark (triad)

Table III: Machine specification, Lonestar.

6.1. Analysis phase

Test problems

  Order p   # of DOFs   # of non-zeros
  1         6,017       524,288
  2         45,825      3,276,800
  3         152,193     13,107,200
  4         357,889     40,140,800
  5         695,681     102,760,448
  6         1,198,337   231,211,008

Table IV: Sparse matrices are obtained from unstructured tetrahedral meshes on a unit sphere domain, varying the approximation order from 1 to 6. The maximum # of DOFs reaches 1.2 million, and the smallest problem has about 6,000 unknowns.

We construct a sequence of test problems based on the same tetrahedral hp-mesh by varying the polynomial order of approximation p from 1 to 6; problem sizes (all tests are in double precision) and sparsity are described in Table IV. The test problems produce structurally symmetric matrices; the matrices are reordered using nested dissection in Metis version 4.0, and symbolic factorization is performed to determine the supernodal structures.
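For illustration, the reordering step can be invoked roughly as follows, assuming the Metis 4.0 C interface (METIS_NodeND on a CSR adjacency structure with 0-based indexing); the wrapper and variable names are ours and hypothetical, not the solver's.

#include <vector>

// Metis 4.0 nested dissection interface (idxtype is int in the default build).
extern "C" void METIS_NodeND(int *n, int *xadj, int *adjncy,
                             int *numflag, int *options,
                             int *perm, int *iperm);

// Compute fill-reducing permutation perm/iperm for a structurally symmetric
// matrix whose off-diagonal pattern is given in CSR form (no diagonal entries).
void reorder_nested_dissection(int n,
                               std::vector<int> &xadj,
                               std::vector<int> &adjncy,
                               std::vector<int> &perm,
                               std::vector<int> &iperm) {
  perm.resize(n);
  iperm.resize(n);
  int numflag = 0;        // 0-based (C-style) indexing
  int options[8] = {0};   // options[0] = 0 selects the Metis defaults
  METIS_NodeND(&n, &xadj[0], &adjncy[0], &numflag, options,
               &perm[0], &iperm[0]);
}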

Fig. 16 shows that our solver is highly efficient when the sparse system is derived with a high p. As pointed out above (Section 3.1), we operate on the graph of the topological nodes rather than the individual DOFs. Hence, the analysis time does not vary if we only change the polynomial order. On the other hand, MUMPS and PARDISO spend a considerable amount of time in reordering and analyzing the matrix, which increases with higher p values. More specifically, the time complexity of MUMPS increases proportionally to the increased number of DOFs, as expected. PARDISO performs poorly for sparse systems based on higher order approximations; in contrast, it performs best for the linear order approximation.

Our solver preserves the hp-discretization and creates a supernodal assembly tree different from the one generated by MUMPS, which uses node-based nested dissection ordering. Table V compares the distribution of frontal matrices in the constructed assembly trees.8 An interesting observation is that the two solvers report similar maximum front sizes. This implies that the asymptotic FLoating Point Operation (FLOP) count and required memory would be almost the same for the two solvers. However, the number of fronts in the assembly tree is very different.

8 For this comparison, we do not evaluate PARDISO as it is implemented with left-looking supernodal factorization.


[Figure: time (sec, log scale) vs. # of DOFs (in thousands) for UHM, MUMPS, and PARDISO.]

Fig. 16: Time (lower is better) measured in the analysis phase with increase in the polynomial order from 1 to 6.

             # of fronts          Max front size
  Order p    UHM       MUMPS      UHM       MUMPS
  1          1,104     658        694       594
  2          9,527     3,613      2,508     2,661
  3          32,149    8,829      5,746     6,025
  4          62,955    18,646     10,197    10,343
  5          63,033    30,583     16,171    16,541
  6          63,042    65,400     22,336    23,380

Table V: Summary of frontal matrices. MUMPS is interfaced with Metis, and the UHM solver uses a fictitious mesh refinement hierarchy preserving the hp-mesh structure.

For the example of p ≥ 4, the hp-mesh populates DOFs on all topological nodes, i.e., vertices, edges, faces and element interiors. As we analyze the matrix with respect to the mesh topology and the associated DOFs, our solver reports a similar number of available unassembled element matrices for p ≥ 4. In contrast, MUMPS does not recognize the topological structure of the hp-mesh.
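For reference, the sketch below lists the DOF counts per topological node of a standard hierarchical H1-conforming tetrahedral basis of order p; the exact shape-function set used by the hp-code may differ, but it shows why only p ≥ 4 populates every node type (interior DOFs first appear at p = 4).

// DOF counts per topological node for a hierarchical H1-conforming
// tetrahedral basis of order p (standard counts; assumed here).
struct TetDofCount {
  int vertex, edge, face, interior;
};

inline TetDofCount dofs_per_node(int p) {
  TetDofCount d;
  d.vertex   = 1;                                // always present
  d.edge     = p - 1;                            // nonzero for p >= 2
  d.face     = (p - 1) * (p - 2) / 2;            // nonzero for p >= 3
  d.interior = (p - 1) * (p - 2) * (p - 3) / 6;  // nonzero for p >= 4
  return d;
}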

None of this immediately implies that one solver is better than the other. Since the estimated FLOP count for the factorization has the same order of magnitude, the performance then depends on the specific parallel implementations utilizing the characteristic tree structures.

6.2. Scalability

We compare the strong scalability of our solver against PARDISO and MUMPS; we consider the speed-up obtained by increasing the number of processing units for a fixed problem derived from a high order discretization (p = 4). In this comparison, Metis version 4.0 is used to reorder the sparse matrix.

Detailed performance data is tabulated in Table VI. Some observations are:

— While the UHM solver and MUMPS report almost the same number of FLOPs in the factorization phase, the PARDISO estimates are almost twice those of the others. Note that this difference only shows up for the higher order problems.


UHM, Total cost 3476 GFLOP

  Threads   Time (sec)   GFLOP/sec   Memory (GB)
  1         473.65       6.84        7.37
  2         257.36       13.43       8.25
  4         189.77       17.42       8.14
  8         93.02        35.03       8.14
  10        66.29        51.91       8.15
  12        54.59        59.32       8.25
  16        45.00        73.81       8.14
  20        37.63        87.43       8.26
  24        29.44        113.29      8.18

(a) UHM

MUMPS, Total cost 3264 GFLOP

  MPI processes   Time (sec)   GFLOP/sec   Memory (GB)
  1               454.14       7.19        8.32
  2               278.40       11.73       10.10
  4               166.78       19.58       10.11
  8               104.19       31.36       12.16
  10              85.90        38.03       12.72
  12              71.89        45.45       12.72
  16              60.07        54.39       12.86
  20              50.64        64.52       12.61
  24              43.83        74.83       12.79

(b) MUMPS

PARDISO, Total cost 7490 GFLOP

  Threads   Time (sec)   GFLOP/sec   Memory (GB)
  1         1153.07      6.50        9.57
  2         610.16       12.28       9.59
  4         309.32       24.22       9.65
  8         173.55       43.16       9.68
  10        142.04       52.74       9.71
  12        119.79       62.53       9.77
  16        92.38        81.08       9.83
  20        77.15        97.09       9.89
  24        66.28        113.53      9.89

(c) PARDISO

            Total GFLOP   GFLOP/sec   Time (sec)   Memory (GB)
  UHM       3476          113.29      29.44        8.18
  MUMPS     3264          74.83       43.83        12.79
  PARDISO   7490          113.53      66.28        9.89

(d) Comparison (24 threads/processes)

Table VI: Summary of solver performance with varying number of threads. The test problem has 357,889 DOFs with p = 4.

— As for the utilization of multicore resources, the PARDISO and UHM solvers show higher performance than MUMPS.

— Due to the controlled computational cost and high performance per FLOP, the UHM solver with 24 threads is faster than MUMPS and PARDISO by factors of 1.48x and 2.25x, respectively.

Fig. 17 compares the strong scaling of the solvers and their memory usage with respect to the increasing number of threads. The total memory usage is monitored by using atomic counter operations for each memory (de)allocation.


[Figure: two panels; speed-up vs. # of cores, and memory (GB) vs. # of cores, for UHM (blocksize 256), MUMPS, and PARDISO.]

Fig. 17: Factorization phase for fixed p = 4. In this benchmark, MUMPS is interfaced to 24 MPI processes with the sequential MKL; the speed-up graph is based on the sequential factorization time for the UHM solver.

Our solver achieves a 16x speed-up on 24 threads, while MUMPS and PARDISO achieve 11x and 7x speed-ups, respectively. The graph also shows that the memory usage of our solver does not grow with the number of threads being used.
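The memory accounting mentioned above can be realized with a thin allocation wrapper; the sketch below is an assumed illustration of that idea (tracked_malloc, tracked_free, and the global counters are hypothetical names), not the solver's actual instrumentation.

#include <cstddef>
#include <cstdlib>

static std::size_t g_current_bytes = 0;  // bytes currently allocated
static std::size_t g_peak_bytes    = 0;  // high-water mark

void *tracked_malloc(std::size_t nbytes) {
  void *ptr = std::malloc(nbytes);
  if (ptr != NULL) {
    std::size_t now;
#pragma omp atomic capture
    { g_current_bytes += nbytes; now = g_current_bytes; }
    // OpenMP has no atomic max, so the high-water mark is updated
    // inside a named critical section
#pragma omp critical (uhm_peak_update)
    if (now > g_peak_bytes) g_peak_bytes = now;
  }
  return ptr;
}

void tracked_free(void *ptr, std::size_t nbytes) {  // caller knows the block size
  std::free(ptr);
#pragma omp atomic update
  g_current_bytes -= nbytes;
}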

6.3. Factorization

Fig. 18 compares the time complexity of our parallel factorization against MUMPS and PARDISO for problems derived with higher p. Our solver is slower than the others for matrices from low order approximations, with PARDISO being the fastest. However, the number of FLOPs in the factorization phase of PARDISO is twice that of the others, so its performance becomes poor for higher p. This result does not match what others have observed; in discussion with the Intel MKL PARDISO developer team, we have not been able to find an explanation for it.


[Figure: two panels; time (sec) vs. # of DOFs (in thousands) for UHM, MUMPS, and PARDISO, and relative speed-up of UHM over MUMPS and over PARDISO vs. # of DOFs, with a reference line.]

Fig. 18: Time complexity in the factorization phase. The problem size increases varying the polynomial order from 1 to 6.

MUMPS and the UHM solver report roughly the same FLOP estimates for the factorization. Hence, the performance gain over MUMPS is mainly due to the efficient use of OpenMP for the asynchronous parallel execution of fine-grained tasks.

Fig. 19 compares the space complexity of the solvers: the graph compares the peak memory used for solving the problems. Our solver shows substantial memory savings compared to the other solvers: for high polynomial degrees, it uses 30% less memory than MUMPS and 20% less than PARDISO. This is probably due to the fact that the other sparse direct solvers over-allocate their workspace to accommodate delayed pivots. In contrast, our solver dynamically creates or destroys matrices as elements are assembled during the multi-frontal factorization.


[Figure: ratio of peak memory used (UHM / MUMPS and UHM / PARDISO) vs. # of DOFs (in thousands), with a reference line.]

Fig. 19: Peak memory used in the factorization phase with increase in the polynomial order from 1 to 6.

7. RELATED WORK

We briefly summarize other sparse direct solvers which use DAG-based task scheduling.

TAUCS [Irony et al. 2004] uses recursive block storage in its multi-frontal Cholesky factorization in conjunction with a parallel recursive BLAS. The combination of the recursive matrix format (storage-by-blocks) and dense subroutines naturally schedules fine-grained tasks exploiting the memory hierarchy of modern computing architectures. The basic approach is similar to ours in that the fine-grained tasks are created within a two-level parallelism. The solver is parallelized through Cilk [Blumofe et al. 1995].

In MA87 [Hogg et al. 2010], DAG-based task scheduling is implemented for a left-looking sparse Cholesky factorization. A global DAG is implicitly created and loosely guides task scheduling. A task whose dependencies are satisfied is moved from the task pool to a local thread stack. The thread associated with that stack executes the task and updates the dependencies related to it. This procedure is repeated until the task pool is empty. The task pool maintains tasks with a priority based on their type; for example, factorization of diagonal blocks has the highest priority, and updating blocks has a lower priority. MA87 incorporates and improves the in-house dense matrix kernels implemented in MP54 [Hogg 2008], which also uses DAG-based task scheduling.

Recently, multi-frontal QR factorization using DAGs was studied in [Buttari 2013]. The proposed approach uses block-column partitioning and exploits both tree-level and matrix-level parallelism. A global DAG is formed and exposes the concurrent tasks during the multi-frontal QR factorization. The block-column partitioning of the frontal matrices makes it possible to schedule panel factorization and matrix assembly asynchronously.

8. CONCLUSION

We have presented a novel design for a parallel sparse direct solver that exploits the features of hp-FEM problems. The solver outperforms state-of-the-art parallel direct solvers when the problem domain is discretized with a higher order of approximation. The proposed direct solver uses a two-level task scheme corresponding to the two levels of parallelism in the multi-frontal factorization. A first level of tasks is generated during the parallel tree traversal on the assembly tree; next, dense subproblems encountered during this traversal are decomposed into fine-grained tasks using algorithms-by-blocks. We identify treewise dependencies among the first-level tasks, and we explicitly manage the fine-grained tasks by local DAG schedulers.


Relying on the OpenMP task facilities, these partial task orderings are sufficient, and we achieve efficient scheduling without the need to explicitly construct a global DAG.

The high performance of the proposed solver is mainly attributable to the presence of these fine-grained tasks, which give OpenMP more opportunities to utilize the processor cores. Our strategy of analyzing only local task dependencies, and leaving the scheduling to OpenMP, also makes our solver more suitable for solving multiple instances of dense problems that are related to each other. By contrast, the currently available advanced DLA libraries strictly control all computing resources when exploiting a single dense problem (or a sequence of dense operations on the same matrix), and do not allow application-level resource management. The lack of such application-level resource management in those DLA libraries may significantly limit overall efficiency, since it introduces synchronization points and limits parallel utilization throughout the whole application run.

This solver also has a potential use for applications such as Navier-Stokes, which feature small leaf-level blocks from multiple equations (e.g., velocity, temperature, pressure, density, chemical reactions, etc.) per mesh node.

Acknowledgement

The authors greatly appreciate the associate editor and the referees for their valuable insights and helpful suggestions to improve this paper. We also thank the Texas Advanced Computing Center (TACC) at The University of Texas at Austin (http://www.tacc.utexas.edu) for providing the HPC resources that were used in this work.

The code that this paper describes has been developed based on libflame and OpenMP. Sources are available under the GNU Lesser General Public License (LGPL) for non-commercial use at http://code.google.com/p/uhm.

This research was sponsored by the National Science Foundation (NSF) under grant no. 0904907. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.

REFERENCES

AMESTOY, P. R., DUFF, I. S., L'EXCELLENT, J. Y., AND KOSTER, J. 2002. A fully asynchronous multifrontal solver using distributed dynamic scheduling. SIAM Journal on Matrix Analysis and Applications 23, 1, 15–41.
AMESTOY, P. R., GUERMOUCHE, A., L'EXCELLENT, J.-Y., AND PRALET, S. 2006. Hybrid scheduling for the parallel solution of linear systems. Parallel Comput. 32, 2, 136–156.
ASHCRAFT, C. AND GRIMES, R. 1989. The influence of relaxed supernode partitions on the multifrontal method. ACM Transactions on Mathematical Software 15, 4, 291–309.
BABUSKA, I., GRIEBEL, M., AND PITKARANTA, J. 1989. The problem of selecting the shape functions for a p-type finite element. International Journal for Numerical Methods in Engineering 28, 8, 1891–1908.
BABUSKA, I. AND SURI, M. 1994. The p and hp versions of the finite element method, basic principles and properties. SIAM Review 36, 4, 578–632.
BABUSKA, I., SZABO, B. A., AND KATZ, I. N. 1981. The p-version of the finite element method. SIAM Journal on Numerical Analysis 18, 3, 515–545.
BIENTINESI, P., EIJKHOUT, V., KIM, K., KURTZ, J., AND VAN DE GEIJN, R. 2010. Sparse direct factorizations through unassembled hyper-matrices. Computer Methods in Applied Mechanics and Engineering 199, 430–438.
BLUMOFE, R. D., JOERG, C. F., KUSZMAUL, B. C., LEISERSON, C. E., RANDALL, K. H., AND ZHOU, Y. 1995. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. Vol. 37. 207–216.
BOSILCA, G., FAVERGE, M., LACOSTE, X., YAMAZAKI, I., AND RAMET, P. 2012. Toward a supernodal sparse direct solver over DAG runtimes. In PMAA '2012.
BUTTARI, A. 2013. Fine-grained multithreading for the multifrontal QR factorization of sparse matrices. SIAM Journal on Scientific Computing 35, 4, 323–345.
BUTTARI, A., LANGOU, J., KURZAK, J., AND DONGARRA, J. 2009. A class of parallel tiled linear algebra algorithms for multicore architectures. Parallel Computing 35, 1, 38–53.


CARNEVALI, P., MORRIS, R. B., TSUJI, Y., AND TAYLOR, G. 1993. New basis functions and computational procedures for p-version finite element analysis. International Journal for Numerical Methods in Engineering 36, 22, 3759–3779.
CHAN, E., VAN DE GEIJN, R. A., AND CHAPMAN, A. 2010. Managing the complexity of lookahead for LU factorization with pivoting. In Proceedings of the 22nd ACM Symposium on Parallelism in Algorithms and Architectures - SPAA '10. ACM Press, New York, NY, USA, 200–208.
CHAN, E., VAN ZEE, F. G., QUINTANA-ORTI, E. S., QUINTANA-ORTI, G., AND VAN DE GEIJN, R. A. 2007. Satisfying your dependencies with SuperMatrix. In 2007 IEEE International Conference on Cluster Computing. IEEE, 91–99.
DAMHAUG, A. C. AND REID, J. K. 1996. MA46, a Fortran code for direct solution of sparse unsymmetric linear systems of equations from finite-element applications. Tech. Rep. RAL-TR-96-010, Computing and Information Systems Department, Rutherford Appleton Laboratory.
DAVIS, T. A. 2006. Direct Methods for Sparse Linear Systems (Fundamentals of Algorithms 2). SIAM, Philadelphia, PA, USA.
DAVIS, T. A. AND HAGER, W. W. 2009. Dynamic supernodes in sparse Cholesky update/downdate and triangular solves. ACM Transactions on Mathematical Software (TOMS) 35, 4, 1–23.
DEMKOWICZ, L., KURTZ, J., PARDO, D., PASZYNSKI, M., RACHOWICZ, W., AND ZDUNEK, A. 2007. Computing with Hp-Adaptive Finite Elements, Vol. 2: Frontiers: Three Dimensional Elliptic and Maxwell Problems with Applications. Chapman & Hall/CRC.
DEMMEL, J. W., EISENSTAT, S. C., GILBERT, J. R., LI, X. S., AND LIU, J. W. H. 1999. A supernodal approach to sparse partial pivoting. SIAM J. Matrix Analysis and Applications 20, 3, 720–755.
DONGARRA, J., FAVERGE, M., LTAIEF, H., AND LUSZCZEK, P. 2011. Achieving numerical accuracy and high performance using recursive tile LU factorization. Tech. rep., LAPACK Working Note 259.
DONGARRA, J. J., DU CROZ, J., HAMMARLING, S., AND DUFF, I. 1990. A set of level 3 Basic Linear Algebra Subprograms. ACM Trans. Math. Soft. 16, 1, 1–17.
DUFF, I. S. 1986. Parallel implementation of multifrontal schemes. Parallel Comput. 3, 3, 193–204.
DUFF, I. S., ERISMAN, A. M., AND REID, J. K. 1976. On George's nested dissection method. SIAM Journal on Numerical Analysis 13, 5, 686–695.
DUFF, I. S. AND REID, J. K. 1983. The multifrontal solution of indefinite sparse symmetric linear equations. ACM Transactions on Mathematical Software (TOMS) 9, 3, 302–325.
DURAN, A., CORBALAN, J., AND AYGUADE, E. 2008. Evaluation of OpenMP task scheduling strategies. In Proceedings of the 4th International Conference on OpenMP in a New Era of Parallelism. Springer-Verlag, 100–110.
ERISMAN, A. M. AND REID, J. K. 1974. Monitoring the stability of the triangular factorization of a sparse matrix. Numerische Mathematik 22, 3, 183–186.
GEIST, G. A. AND NG, E. 1989. Task scheduling for parallel sparse Cholesky factorization. International Journal of Parallel Programming 18, 4, 291–314.
GILBERT, J. R. AND TARJAN, R. E. 1987. The analysis of a nested dissection algorithm. Numer. Math. 50, 4, 377–404.
GOULD, N. I. M., SCOTT, J. A., AND HU, Y. 2007. A numerical evaluation of sparse direct solvers for the solution of large sparse symmetric linear systems of equations. ACM Transactions on Mathematical Software 33, 2, 10:1–32.
GUERMOUCHE, A., L'EXCELLENT, J.-Y., AND UTARD, G. 2003. Impact of reordering on the memory of a multifrontal solver. Parallel Computing 29, 9, 1191–1218.
GUPTA, A. AND KUMAR, V. 1994. A scalable parallel algorithm for sparse Cholesky factorization. In Proceedings of the 1994 ACM/IEEE Conference on Supercomputing. ACM, 793–802.
GUSTAVSON, F. G., JONSSON, I., KÅGSTRÖM, B., AND LING, P. 1999. Towards peak performance on hierarchical SMP memory architectures - new recursive blocked data formats and BLAS. In Parallel Processing for Scientific Computing. 1–4.
HOGG, J. D. 2008. A DAG-based parallel Cholesky factorization for multicore systems. Tech. Rep. RAL-TR-2008-029, STFC Rutherford Appleton Laboratory, Harwell Science and Innovation Campus.
HOGG, J. D., REID, J. K., AND SCOTT, J. A. 2010. Design of a multicore sparse Cholesky factorization using DAGs. SIAM Journal on Scientific Computing 32, 6, 3627–3649.
IRONY, D., SHKLARSKI, G., AND TOLEDO, S. 2004. Parallel and fully recursive multifrontal sparse Cholesky. Future Generation Computer Systems 20, 3, 425–440.
LACOSTE, X., RAMET, P., FAVERGE, M., YAMAZAKI, I., AND DONGARRA, J. 2012. Sparse direct solvers with accelerators over DAG runtimes. Tech. rep.
LIU, J. W. H. 1992. The multifrontal method for sparse matrix solution: theory and practice. SIAM Review 34, 1, 82–109.
LIU, J. W. H., NG, E. G., AND PEYTON, B. W. 1993. On finding supernodes for sparse matrix computations. SIAM Journal on Matrix Analysis and Applications 14, 1, 242–252.
LOW, T. M. AND VAN DE GEIJN, R. A. 2004. An API for manipulating matrices stored by blocks. Tech. rep., FLAME Working Note 12, TR-2004-15, The University of Texas at Austin.


MCCALPIN, J. D. 1995. Memory bandwidth and machine balance in current high performance computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, 19–25.
OLIVIER, S. L. AND PRINS, J. F. 2010. Comparison of OpenMP 3.0 and other task parallel frameworks on unbalanced task graphs. International Journal of Parallel Programming 38, 5-6, 341–360.
OPENMP ARCHITECTURE REVIEW BOARD. 2008. OpenMP Application Program Interface, Version 3.0. http://www.openmp.org.
POTHEN, A. AND SUN, C. 1993. A mapping algorithm for parallel sparse Cholesky factorization. SIAM Journal on Scientific Computing 14, 5, 1253–1257.
QUINTANA-ORTI, E. S. AND VAN DE GEIJN, R. A. 2008. Updating an LU factorization with pivoting. ACM Transactions on Mathematical Software 35, 2, 1–16.
QUINTANA-ORTI, G., QUINTANA-ORTI, E. S., VAN DE GEIJN, R. A., VAN ZEE, F. G., AND CHAN, E. 2009. Programming matrix algorithms-by-blocks for thread-level parallelism. ACM Transactions on Mathematical Software 36, 3, 1–26.
SCHENK, O. AND GARTNER, K. 2002. Solving unsymmetric sparse systems of linear equations with PARDISO. In Proceedings of the International Conference on Computational Science - Part II. Springer-Verlag, 335–363.
SCHENK, O. AND GARTNER, K. 2006. On fast factorization pivoting methods for sparse symmetric indefinite systems. Electronic Transactions on Numerical Analysis 23, 158–179.
SHAH, S., HAAB, G., PETERSEN, P., AND THROOP, J. 2000. Flexible control structures for parallelism in OpenMP. Concurrency: Practice and Experience 12, 12, 1219–1239.
SZABO, B. A. 1990. The p and hp versions of the finite element method in solid mechanics. Computer Methods in Applied Mechanics and Engineering 80, 1-3, 185–195.
TERBOVEN, C., AN MEY, D., SCHMIDL, D., JIN, H., AND REICHSTEIN, T. 2008. Data and thread affinity in OpenMP programs. In Proceedings of the 2008 Workshop on Memory Access on Future Processors. New York, USA, 377–384.
YARKHAN, A., KURZAK, J., AND DONGARRA, J. 2011. QUARK Users' Guide. Tech. rep., Electrical Engineering and Computer Science, Innovative Computing Laboratory, University of Tennessee.
