Parallel Numerical Algorithms
solomon2.web.engr.illinois.edu/teaching/cs554/slides/slides_08.pdf

Transcript of slides

Parallel Numerical Algorithms
Chapter 4 – Sparse Linear Systems
Section 4.1 – Direct Methods

Michael T. Heath and Edgar Solomonik
Department of Computer Science
University of Illinois at Urbana-Champaign
CS 554 / CSE 512

Outline

1 Sparse Matrices
2 Sparse Triangular Solve
3 Cholesky Factorization
4 Sparse Cholesky Factorization


    Sparse Matrices

Matrix is sparse if most of its entries are zero

For efficiency, store and operate on only nonzero entries, e.g., the product a_jk · x_k need not be done if a_jk = 0

But more complicated data structures required incur extra overhead in storage and arithmetic operations

Matrix is “usefully” sparse if it contains enough zero entries to be worth taking advantage of them to reduce storage and work required

In practice, grid discretizations often yield matrices with Θ(n) nonzero entries, i.e., (small) constant number of nonzeros per row or column


    Graph Model

Adjacency graph G(A) of symmetric n × n matrix A is undirected graph having n vertices, with edge between vertices i and j if a_ij ≠ 0

Number of edges in G(A) is the number of nonzeros in A

For a grid-based discretization, G(A) is the grid

Adjacency graph provides visual representation of algorithms and highlights connections between numerical and combinatorial algorithms

For nonsymmetric A, G(A) would be directed

Often convenient to think of a_ij as the weight of edge (i, j)


    Sparse Matrix Representations

Coordinate (COO) (naive) format – store each nonzero along with its row and column index

Compressed-sparse-row (CSR) format

    Store value and column index for each nonzero

    Store index of first nonzero for each row

Storage for CSR is less than COO and CSR ordering is often convenient

CSC (compressed-sparse-column), blocked versions (e.g., CSB), and other storage formats are also used
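To make the two formats concrete, here is a small sketch (not from the slides; the function and array names coo_to_csr, row_ptr, col_ind, and val are mine) that converts COO triples into the three CSR arrays:

    import numpy as np

    def coo_to_csr(n, rows, cols, vals):
        # Sort nonzeros by row (then column) so each row's entries are contiguous
        rows, cols, vals = map(np.asarray, (rows, cols, vals))
        order = np.lexsort((cols, rows))
        rows, cols, vals = rows[order], cols[order], vals[order]
        # row_ptr[i] = index of first nonzero of row i; row_ptr[n] = total nonzeros
        row_ptr = np.zeros(n + 1, dtype=np.int64)
        np.add.at(row_ptr, rows + 1, 1)      # count nonzeros per row
        row_ptr = np.cumsum(row_ptr)         # prefix sum turns counts into offsets
        return row_ptr, cols, vals

    # 3 x 3 matrix with nonzeros (0,0)=4, (0,2)=1, (2,1)=2
    row_ptr, col_ind, val = coo_to_csr(3, [0, 0, 2], [0, 2, 1], [4.0, 1.0, 2.0])
    # row_ptr = [0 2 2 3]: row 0 holds entries 0..1, row 1 is empty, row 2 holds entry 2

COO stores 3m words for m nonzeros, while CSR stores 2m + n + 1; CSR's row grouping is what the SpMV loop below exploits.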


    Sparse Matrix Distributions

    Dense matrix mappings can be adapted to sparse matrices

1-D blocked mapping – store all nonzeros in n/p consecutive rows on each processor

1-D cyclic or randomized mapping – store all nonzeros in some subset of n/p rows on each processor

2-D block mapping – store all nonzeros in an (n/√p) × (n/√p) block of the matrix

1-D blocked mappings are best for exploiting locality in graph, especially when there are Θ(n) nonzeros

Row ordering matters for all mappings; randomization and cyclicity yield load balance, blocking can yield locality


    Sparse Matrix Vector Multiplication

    Sparse matrix vector multiplication (SpMV) is

    y = Ax

    where A is sparse and x is dense

CSR-based matrix-vector product, for all i (in parallel) do

    y_i = Σ_j a_{i,c(j)} · x_{c(j)} = Σ_{j=1}^{n} a_ij x_j

where c(j) is the index of the jth nonzero in row i

For random 1-D or 2-D mapping, cost of vector communication is same as in corresponding dense case
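Continuing the hedged CSR sketch above (numpy is imported there), the per-row loop is exactly the formula on this slide; each outer iteration is independent, which is where "for all i (in parallel)" comes from:

    def spmv_csr(row_ptr, col_ind, val, x):
        n = len(row_ptr) - 1
        y = np.zeros(n)
        for i in range(n):                            # rows are independent tasks
            for k in range(row_ptr[i], row_ptr[i + 1]):
                y[i] += val[k] * x[col_ind[k]]        # y_i += a_{i,c(k)} * x_{c(k)}
        return y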


    SpMV with 1-D Mapping

For 1-D blocking (each processor owns n/p rows), number of elements of x needed by a processor is the number of columns with a nonzero in the rows it owns

In general, want to order rows to minimize maximum number of vector elements needed on any processor

Graphically, we want to partition the graph into p subsets of n/p nodes, to minimize the maximum number of nodes to which any subset is connected, i.e., for G(A) = (V, E), a partition

    V = V_1 ∪ · · · ∪ V_p,   |V_i| = n/p

is selected to minimize

    max_i |{v : v ∈ V \ V_i, ∃w ∈ V_i, (v, w) ∈ E}|
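A small sketch of this objective (the adjacency-dict layout and the name max_external_vertices are mine), counting for each part the external vertices adjacent to it:

    def max_external_vertices(adj, parts):
        # adj: vertex -> set of neighbors; parts: the sets V_1, ..., V_p
        worst = 0
        for Vi in parts:
            boundary = set()
            for w in Vi:
                boundary |= adj[w] - Vi    # neighbors of V_i lying outside V_i
            worst = max(worst, len(boundary))
        return worst

    # 1-D grid (path) on 6 vertices split into two halves: one external vertex each
    adj = {i: {j for j in (i - 1, i + 1) if 0 <= j < 6} for i in range(6)}
    print(max_external_vertices(adj, [{0, 1, 2}, {3, 4, 5}]))   # 1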


    Surface Area to Volume Ratio in SpMV

The number of external vertices the maximum partition is adjacent to depends on the expansion of the graph

Expansion can be interpreted as a measure of the surface-area to volume ratio of the subgraphs

For example, for a k × k × k grid, a subvolume of (k/p^(1/3)) × (k/p^(1/3)) × (k/p^(1/3)) has surface area Θ(k²/p^(2/3))

Communication for this case becomes a neighbor halo exchange on a 3-D processor mesh

Thus, finding the best 1-D partitioning for SpMV often corresponds to domain partitioning and depends on the physical geometry of the problem


    Other Sparse Matrix Products

SpMV is of critical importance to many numerical methods, but suffers from a low flop-to-byte ratio and a potentially high communication bandwidth cost

In graph algorithms SpMSpV (x and y are sparse) is prevalent, which is even harder to perform efficiently (e.g., to minimize work, a layout other than CSR, such as CSC, is needed)

SpMM (x becomes dense matrix X) provides a higher flop-to-byte ratio and is much easier to do efficiently

SpGEMM (SpMSpM, matrix multiplication where all matrices are sparse) arises in, e.g., algebraic multigrid and graph algorithms; efficiency is highly dependent on sparsity


    Solving Triangular Sparse Linear Systems

    Given sparse lower-triangular matrix L and vector b, solve

    Lx = b

    all nonzeros of L must be in its lower-triangular part

Sequential algorithm: take x_i = b_i / l_ii, update

    b_j = b_j − l_ji · x_i   for all j ∈ {i + 1, . . . , n}

If L has m > n nonzeros, requires Q_1 ≈ 2m operations
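A minimal sequential sketch of this algorithm, assuming L is stored by columns (CSC arrays col_ptr, row_ind, val, names mine) with each column's diagonal entry stored first:

    def sparse_lower_solve(col_ptr, row_ind, val, b):
        x = np.array(b, dtype=float)       # x starts as b and is updated in place
        n = len(col_ptr) - 1
        for i in range(n):
            x[i] /= val[col_ptr[i]]                    # x_i = b_i / l_ii
            for k in range(col_ptr[i] + 1, col_ptr[i + 1]):
                x[row_ind[k]] -= val[k] * x[i]         # b_j = b_j - l_ji * x_i
        return x

Each nonzero of L is touched once, with one multiply and one subtract, matching Q_1 ≈ 2m.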


    Parallelism in Sparse Triangular Solve

We can adapt any dense parallel triangular solve algorithm to the sparse case

Again have fan-in (left-looking) and fan-out (right-looking) variants

Communication cost stays the same, computational cost decreases

In fact there may be additional sources of parallelism, e.g., if l_21 = 0, we can solve for x_1 and x_2 concurrently

More generally, can concurrently prune leaves of directed acyclic adjacency graph (DAG) G(A) = (V, E), where (i, j) ∈ E if l_ij ≠ 0

Depth of algorithm corresponds to diameter of this DAG
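One way to expose that DAG parallelism, sketched here on the same hypothetical CSC layout as before, is level scheduling: unknowns at the same depth of the DAG can be solved concurrently.

    def level_schedule(col_ptr, row_ind):
        # level[i] = length of longest dependence chain ending at x_i
        n = len(col_ptr) - 1
        level = [0] * n
        for j in range(n):                     # columns in increasing order
            for k in range(col_ptr[j] + 1, col_ptr[j + 1]):
                i = row_ind[k]                 # l_ij != 0 means x_i waits on x_j
                level[i] = max(level[i], level[j] + 1)
        return level                           # group equal levels into parallel steps

The number of distinct levels is the DAG depth referred to above.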


    Parallel Algorithm for Sparse Triangular Solve

Partition: associate fine-grain tasks with each (i, j) such that l_ij ≠ 0

Communicate: task (i, i) communicates with task (j, i) and (i, j) for all possible j

Agglomerate: form coarse-grain tasks for each column of L, i.e., do 1-D agglomeration, combining fine-grain tasks (∗, i) into agglomerated task i

Map: assign coarse-grain tasks (columns of L) to processors with blocking (for locality) and/or cyclicity (for load balance and concurrency)


    Analysis of 1-D Parallel Sparse Triangular Solve

Cost of 1-D algorithm will clearly be less than the corresponding algorithm for the dense case

Load balance depends on distribution of nonzeros; cyclicity can help distribute dense blocks

Naive algorithm with 1-D column blocking exploits concurrency only in fan-out updates

Communication bandwidth cost depends on surface-to-volume ratio of each subset of vertices associated with a block of columns

Higher concurrency and better performance possible with dynamic/adaptive algorithms


    Cholesky Factorization

Symmetric positive definite matrix A has Cholesky factorization

    A = LL^T

where L is lower triangular matrix with positive diagonal entries

Linear system

    Ax = b

can then be solved by forward-substitution in lower triangular system Ly = b, followed by back-substitution in upper triangular system L^T x = y
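As a concrete, hedged illustration of this solve sequence using standard library routines (the 2 × 2 example data is mine):

    import numpy as np
    from scipy.linalg import solve_triangular

    A = np.array([[4.0, 2.0],
                  [2.0, 3.0]])               # symmetric positive definite
    b = np.array([2.0, 3.0])

    L = np.linalg.cholesky(A)                # lower triangular, A = L L^T
    y = solve_triangular(L, b, lower=True)   # forward-substitution: L y = b
    x = solve_triangular(L.T, y)             # back-substitution: L^T x = y
    assert np.allclose(A @ x, b)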


    Computing Cholesky Factorization

Algorithm for computing Cholesky factorization can be derived by equating corresponding entries of A and LL^T and generating them in correct order

For example, in 2 × 2 case

    [ a_11  a_21 ]   [ ℓ_11   0   ] [ ℓ_11  ℓ_21 ]
    [ a_21  a_22 ] = [ ℓ_21  ℓ_22 ] [  0    ℓ_22 ]

so we have

    ℓ_11 = √a_11,   ℓ_21 = a_21 / ℓ_11,   ℓ_22 = √(a_22 − ℓ_21²)


    Cholesky Factorization Algorithm

for k = 1 to n
    a_kk = √a_kk
    for i = k + 1 to n
        a_ik = a_ik / a_kk
    end
    for j = k + 1 to n
        for i = j to n
            a_ij = a_ij − a_ik · a_jk
        end
    end
end
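The pseudocode translates almost line for line into NumPy; this sketch (function name mine) factors in place in the lower triangle:

    import numpy as np

    def cholesky_inplace(A):
        n = A.shape[0]
        for k in range(n):
            A[k, k] = np.sqrt(A[k, k])
            A[k+1:, k] /= A[k, k]                  # a_ik = a_ik / a_kk
            for j in range(k + 1, n):
                A[j:, j] -= A[j:, k] * A[j, k]     # a_ij = a_ij - a_ik * a_jk
        return A                                   # lower triangle now holds L

    A = np.array([[4.0, 2.0], [2.0, 3.0]])
    assert np.allclose(np.tril(cholesky_inplace(A.copy())), np.linalg.cholesky(A))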


    Cholesky Factorization Algorithm

All n square roots are of positive numbers, so algorithm well defined

Only lower triangle of A is accessed, so strict upper triangular portion need not be stored

Factor L computed in place, overwriting lower triangle of A

Pivoting is not required for numerical stability

About n³/6 multiplications and similar number of additions are required (about half as many as for LU)


    Parallel Algorithm

Partition

For i, j = 1, . . . , n, fine-grain task (i, j) stores a_ij and computes and stores

    ℓ_ij, if i ≥ j
    ℓ_ji, if i < j

yielding 2-D array of n² fine-grain tasks

Zero entries in upper triangle of L need not be computed or stored, so for convenience in using 2-D mesh network, ℓ_ij can be redundantly computed by both task (i, j) and task (j, i) for i > j


    Fine-Grain Tasks and Communication

[Figure: 6 × 6 array of fine-grain tasks; task (i, j) holds a_ij and computes ℓ_ij, with each off-diagonal value held redundantly by tasks (i, j) and (j, i)]


Fine-Grain Parallel Algorithm

for k = 1 to min(i, j) − 1
    recv broadcast of a_kj from task (k, j)
    recv broadcast of a_ik from task (i, k)
    a_ij = a_ij − a_ik · a_kj
end
if i = j then
    a_ii = √a_ii
    broadcast a_ii to tasks (k, i) and (i, k), k = i + 1, . . . , n
else if i < j then
    recv broadcast of a_ii from task (i, i)
    a_ij = a_ij / a_ii
    broadcast a_ij to tasks (k, j), k = i + 1, . . . , n
else
    recv broadcast of a_jj from task (j, j)
    a_ij = a_ij / a_jj
    broadcast a_ij to tasks (i, k), k = j + 1, . . . , n
end


    Agglomeration Schemes

Agglomerate

Agglomeration of fine-grain tasks produces

    2-D
    1-D column
    1-D row

parallel algorithms analogous to those for LU factorization, with similar performance and scalability


    Loop Orderings for Cholesky

Each choice of i, j, or k index in outer loop yields different Cholesky algorithm, named for portion of matrix updated by basic operation in inner loops

Submatrix-Cholesky (fan-out): with k in outer loop, inner loops perform rank-1 update of remaining unreduced submatrix using current column

Column-Cholesky (fan-in): with j in outer loop, inner loops compute current column using matrix-vector product that accumulates effects of previous columns

Row-Cholesky (fan-in): with i in outer loop, inner loops compute current row by solving triangular system involving previous rows


    Memory Access Patterns

[Figure: matrix diagrams contrasting the read-only and read-and-write regions for Submatrix-Cholesky, Column-Cholesky, and Row-Cholesky]


    Column-Oriented Cholesky Algorithms

Submatrix-Cholesky

for k = 1 to n
    a_kk = √a_kk
    for i = k + 1 to n
        a_ik = a_ik / a_kk
    end
    for j = k + 1 to n
        for i = j to n
            a_ij = a_ij − a_ik · a_jk
        end
    end
end

Column-Cholesky

for j = 1 to n
    for k = 1 to j − 1
        for i = j to n
            a_ij = a_ij − a_ik · a_jk
        end
    end
    a_jj = √a_jj
    for i = j + 1 to n
        a_ij = a_ij / a_jj
    end
end


    Column Operations

Column-oriented algorithms can be stated more compactly by introducing column operations

cdiv(j): column j is divided by square root of its diagonal entry

    a_jj = √a_jj
    for i = j + 1 to n
        a_ij = a_ij / a_jj
    end

cmod(j, k): column j is modified by multiple of column k, with k < j

    for i = j to n
        a_ij = a_ij − a_ik · a_jk
    end
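A hedged NumPy rendering of the two column operations, and the left-looking algorithm built from them (names mine; A is a dense array whose lower triangle is factored in place):

    def cdiv(A, j):
        A[j, j] = np.sqrt(A[j, j])
        A[j+1:, j] /= A[j, j]          # divide column j by sqrt of its diagonal

    def cmod(A, j, k):
        A[j:, j] -= A[j:, k] * A[j, k] # subtract a multiple of column k (k < j)

    def column_cholesky(A):
        for j in range(A.shape[0]):    # left-looking: all cmods, then one cdiv
            for k in range(j):
                cmod(A, j, k)
            cdiv(A, j)
        return A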


    Column-Oriented Cholesky Algorithms

Submatrix-Cholesky

for k = 1 to n
    cdiv(k)
    for j = k + 1 to n
        cmod(j, k)
    end
end

(right-looking, immediate-update, data-driven, fan-out)

Column-Cholesky

for j = 1 to n
    for k = 1 to j − 1
        cmod(j, k)
    end
    cdiv(j)
end

(left-looking, delayed-update, demand-driven, fan-in)


    Data Dependences

[Figure: dependence diagram for column k: cdiv(k) consumes cmod(k, 1), cmod(k, 2), . . . , cmod(k, k − 1) and enables cmod(k + 1, k), cmod(k + 2, k), . . . , cmod(n, k)]


    Data Dependences

cmod(k, ∗) operations along bottom can be done in any order, but they all have same target column, so updating must be coordinated to preserve data integrity

cmod(∗, k) operations along top can be done in any order, and they all have different target columns, so updating can be done simultaneously

Performing cmods concurrently is most important source of parallelism in column-oriented factorization algorithms

For dense matrix, each cdiv(k) depends on immediately preceding column, so cdivs must be done sequentially


    Sparsity Structure

For sparse matrix M, let M_i∗ denote its ith row and M_∗j its jth column

Define Struct(M_i∗) = {k < i | m_ik ≠ 0}, nonzero structure of row i of strict lower triangle of M

Define Struct(M_∗j) = {k > j | m_kj ≠ 0}, nonzero structure of column j of strict lower triangle of M


    Sparse Cholesky Algorithms

Submatrix-Cholesky

for k = 1 to n
    cdiv(k)
    for j ∈ Struct(L_∗k)
        cmod(j, k)
    end
end

(right-looking, immediate-update, data-driven, fan-out)

Column-Cholesky

for j = 1 to n
    for k ∈ Struct(L_j∗)
        cmod(j, k)
    end
    cdiv(j)
end

(left-looking, delayed-update, demand-driven, fan-in)


    Graph Model

Recall that adjacency graph G(A) of symmetric n × n matrix A is undirected graph with edge between vertices i and j if a_ij ≠ 0

At each step of Cholesky factorization algorithm, corresponding vertex is eliminated from graph

Neighbors of eliminated vertex in previous graph become clique (fully connected subgraph) in modified graph

Entries of A that were initially zero may become nonzero entries, called fill


    Example: Graph Model of Elimination

[Figure: 9 × 9 example matrix A and its Cholesky factor L with fill entries marked +, alongside the sequence of elimination graphs as vertices 1 through 9 are eliminated in turn]


    Elimination Tree

parent(j) is row index of first off-diagonal nonzero in column j of L, if any, and j otherwise

Elimination tree T(A) is graph having n vertices, with edge between vertices i and j, for i > j, if i = parent(j)

If matrix is irreducible, then elimination tree is single tree with root at vertex n; otherwise, it is more accurately termed elimination forest

T(A) is spanning tree for filled graph F(A), which is G(A) with all fill edges added

Each column of Cholesky factor L depends only on its descendants in elimination tree
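The parent function is easy to read off a column-oriented structure of L; a hedged sketch on the same hypothetical CSC layout as before (in practice the tree is computed from A during symbolic factorization, without forming L first):

    def elimination_tree(col_ptr, row_ind):
        # parent[j] = smallest row index of a subdiagonal nonzero in column j of L,
        # or j itself if column j has no subdiagonal nonzeros (a root)
        n = len(col_ptr) - 1
        parent = list(range(n))
        for j in range(n):
            below = row_ind[col_ptr[j] + 1 : col_ptr[j + 1]]   # diagonal stored first
            if len(below) > 0:
                parent[j] = min(below)
        return parent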


    Example: Elimination Tree

[Figure: the same 9 × 9 example, showing matrix A, factor L with fill marked +, graph G(A), filled graph F(A), and elimination tree T(A)]


    Effect of Matrix Ordering

Amount of fill depends on order in which variables are eliminated

Example: “arrow” matrix: if first row and column are dense, then factor fills in completely, but if last row and column are dense, then they cause no fill

[Figure: the two arrow-matrix orderings and their factors: dense first row and column (complete fill) versus dense last row and column (no fill)]


    Ordering Heuristics

General problem of finding ordering that minimizes fill is NP-complete, but there are relatively cheap heuristics that limit fill effectively

Bandwidth or profile reduction: reduce distance of nonzero diagonals from main diagonal (e.g., RCM)

Minimum degree: eliminate node having fewest neighbors first

Nested dissection: recursively split graph into pieces using a vertex separator, numbering separator vertices last
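For the simplest case, a 1-D grid (a path graph), nested dissection is a few lines; this hedged sketch (names mine) orders each half recursively and numbers the single-vertex separator last:

    def nested_dissection_path(lo, hi, order=None):
        # Order vertices lo..hi-1 of a path graph by nested dissection
        if order is None:
            order = []
        if hi - lo <= 2:
            order.extend(range(lo, hi))          # base case: order tiny pieces directly
            return order
        mid = (lo + hi) // 2                     # middle vertex separates the two halves
        nested_dissection_path(lo, mid, order)
        nested_dissection_path(mid + 1, hi, order)
        order.append(mid)                        # separator is numbered last
        return order

    print(nested_dissection_path(0, 7))   # [0, 2, 1, 4, 6, 5, 3]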


    Symbolic Factorization

For symmetric positive definite (SPD) matrices, ordering can be determined in advance of numeric factorization

Only locations of nonzeros matter, not their numerical values, since pivoting is not required for numerical stability

Once ordering is selected, locations of all fill entries in L can be anticipated and efficient static data structure set up to accommodate them prior to numeric factorization

Structure of column j of L is given by union of structures of lower triangular portion of column j of A and prior columns of L whose first nonzero below diagonal is in row j
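That union rule and the elimination-tree parents can be computed together in one pass; a hedged sketch (struct_A[j] is assumed to hold the subdiagonal row indices of column j of A; names mine):

    from collections import defaultdict

    def symbolic_factorization(struct_A):
        n = len(struct_A)
        struct_L = [set(s) for s in struct_A]    # start from the structure of A
        parent = list(range(n))
        children = defaultdict(list)
        for j in range(n):
            for k in children[j]:                # columns whose first subdiagonal
                struct_L[j] |= struct_L[k] - {j} # nonzero is in row j contribute fill
            if struct_L[j]:
                parent[j] = min(struct_L[j])     # first nonzero below the diagonal
                children[parent[j]].append(j)
        return struct_L, parent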


    Solving Sparse SPD Systems

Basic steps in solving sparse SPD systems by Cholesky factorization

1 Ordering: Symmetrically reorder rows and columns of matrix so Cholesky factor suffers relatively little fill

2 Symbolic factorization: Determine locations of all fill entries and allocate data structures in advance to accommodate them

3 Numeric factorization: Compute numeric values of entries of Cholesky factor

4 Triangular solve: Compute solution by forward- and back-substitution


    Parallel Sparse Cholesky

In sparse submatrix- or column-Cholesky, if ℓ_jk = 0, then cmod(j, k) is omitted

Sparse factorization thus has additional source of parallelism, since “missing” cmods may permit multiple cdivs to be done simultaneously

Elimination tree shows data dependences among columns of Cholesky factor L, and hence identifies potential parallelism

At any point in factorization process, all factor columns corresponding to leaves in the elimination tree can be computed simultaneously


    Parallel Sparse Cholesky

Height of elimination tree determines longest serial path through computation, and hence parallel execution time

Width of elimination tree determines degree of parallelism available

Short, wide, well-balanced elimination tree desirable for parallel factorization

Structure of elimination tree depends on ordering of matrix

So ordering should be chosen both to preserve sparsity and to enhance parallelism


    Levels of Parallelism in Sparse Cholesky

Fine-grain
    Task is one multiply-add pair
    Available in either dense or sparse case
    Difficult to exploit effectively in practice

Medium-grain
    Task is one cmod or cdiv
    Available in either dense or sparse case
    Accounts for most of speedup in dense case

Large-grain
    Task computes entire set of columns in subtree of elimination tree
    Available only in sparse case


    Example: Band Ordering, 1-D Grid

[Figure: 1-D grid (7 vertices) under band ordering: matrix A, factor L (no fill), graph G(A), and elimination tree T(A)]


    Example: Minimum Degree, 1-D Grid

[Figure: 1-D grid (7 vertices) under minimum degree ordering: matrix A, factor L, graph G(A), and elimination tree T(A)]


    Example: Nested Dissection, 1-D Grid

[Figure: 1-D grid (7 vertices) under nested dissection ordering: matrix A, factor L with fill marked +, graph G(A), and a short, balanced elimination tree T(A)]


    Example: Band Ordering, 2-D Grid

[Figure: 3 × 3 grid (2-D) under band ordering: matrix A, factor L with fill marked +, graph G(A), and elimination tree T(A)]


    Example: Minimum Degree, 2-D Grid

[Figure: 3 × 3 grid under minimum degree ordering: matrix A, factor L with fill marked +, graph G(A), and elimination tree T(A)]


    Example: Nested Dissection, 2-D Grid

[Figure: 3 × 3 grid under nested dissection ordering: matrix A, factor L with fill marked +, graph G(A), and elimination tree T(A)]


    Mapping

Cyclic mapping of columns to processors works well for dense problems, because it balances load and communication is global anyway

To exploit locality in communication for sparse factorization, better approach is to map columns in subtree of elimination tree onto local subset of processors

Still use cyclic mapping within dense submatrices (“supernodes”)


    Example: Subtree Mapping

[Figure: subtree mapping example: each of four subtrees of the elimination tree is assigned to one of processors 0–3, while columns near the root are shared among processors]


Fan-Out Sparse Cholesky

for j ∈ mycols
    if j is leaf node in T(A) then
        cdiv(j)
        send L_∗j to processes in map(Struct(L_∗j))
        mycols = mycols − {j}
    end
end
while mycols ≠ ∅
    receive any column of L, say L_∗k
    for j ∈ mycols ∩ Struct(L_∗k)
        cmod(j, k)
        if column j requires no more cmods then
            cdiv(j)
            send L_∗j to processes in map(Struct(L_∗j))
            mycols = mycols − {j}
        end
    end
end


Fan-In Sparse Cholesky

for j = 1 to n
    if j ∈ mycols or mycols ∩ Struct(L_j∗) ≠ ∅ then
        u = 0
        for k ∈ mycols ∩ Struct(L_j∗)
            u = u + ℓ_jk · L_∗k
        end
        if j ∈ mycols then
            incorporate u into factor column j
            while any aggregated update column for column j remains
                receive one and incorporate it into factor column j
            end
            cdiv(j)
        else
            send u to process map(j)
        end
    end
end


    Multifrontal Sparse Cholesky

Multifrontal algorithm operates recursively, starting from root of elimination tree for A

Dense frontal matrix F_j is initialized to have nonzero entries from corresponding row and column of A as its first row and column, and zeros elsewhere

F_j is then updated by extend_add operations with update matrices from its children in elimination tree

extend_add operation, denoted by ⊕, merges matrices by taking union of their subscript sets and summing entries for any common subscripts

After updating of F_j is complete, its partial Cholesky factorization is computed, producing corresponding row and column of L as well as update matrix U_j


    Example: extend_add

    [ a11 a13 a15 a18 ]     [ b11 b12 b15 b17 ]
    [ a31 a33 a35 a38 ]  ⊕  [ b21 b22 b25 b27 ]
    [ a51 a53 a55 a58 ]     [ b51 b52 b55 b57 ]
    [ a81 a83 a85 a88 ]     [ b71 b72 b75 b77 ]

        [ a11+b11  b12  a13  a15+b15  b17  a18 ]
        [ b21      b22  0    b25      b27  0   ]
    =   [ a31      0    a33  a35      0    a38 ]
        [ a51+b51  b52  a53  a55+b55  b57  a58 ]
        [ b71      b72  0    b75      b77  0   ]
        [ a81      0    a83  a85      0    a88 ]

(the first operand has subscript set {1, 3, 5, 8}, the second {1, 2, 5, 7}, and the result their union {1, 2, 3, 5, 7, 8})
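A hedged NumPy sketch of ⊕ (function and variable names mine): take the union of the subscript sets and scatter-add both operands into the union-indexed result.

    import numpy as np

    def extend_add(F1, idx1, F2, idx2):
        idx = sorted(set(idx1) | set(idx2))      # union of subscript sets
        pos = {v: i for i, v in enumerate(idx)}
        F = np.zeros((len(idx), len(idx)))
        for M, ix in ((F1, idx1), (F2, idx2)):
            p = [pos[v] for v in ix]
            F[np.ix_(p, p)] += M                 # sum entries at common subscripts
        return F, idx

    # subscript sets from the example above: {1,3,5,8} and {1,2,5,7}
    F, idx = extend_add(np.ones((4, 4)), [1, 3, 5, 8],
                        np.ones((4, 4)), [1, 2, 5, 7])
    print(idx)   # [1, 2, 3, 5, 7, 8]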


Multifrontal Sparse Cholesky

Factor(j):

    Let {i_1, . . . , i_r} = Struct(L_∗j)

    Let

        F_j = [ a_{j,j}    a_{j,i_1}  . . .  a_{j,i_r} ]
              [ a_{i_1,j}  0          . . .  0         ]
              [ . . .                                  ]
              [ a_{i_r,j}  0          . . .  0         ]

    for each child i of j in elimination tree
        Factor(i)
        F_j = F_j ⊕ U_i
    end

    Perform one step of dense Cholesky:

        F_j = [ ℓ_{j,j}  0 ] [ 1  0   ] [ ℓ_{j,j}  c^T ]
              [ c        I ] [ 0  U_j ] [ 0        I   ]

    where c = (ℓ_{i_1,j}, . . . , ℓ_{i_r,j})^T, yielding column j of L and the update matrix U_j
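Putting the pieces together, a hedged sequential sketch of Factor(j) in NumPy: it reuses extend_add from above, takes struct_L and children from a symbolic phase like the earlier sketch, treats A as a dense symmetric array for simplicity, and records column j of L in a dict; a real solver would work on packed fronts and exploit supernodes.

    def factor(j, A, struct_L, children, Lcols):
        idx = [j] + sorted(struct_L[j])            # subscripts of frontal matrix F_j
        F = np.zeros((len(idx), len(idx)))
        F[0, :] = [A[j, i] for i in idx]           # row/column of A, zeros elsewhere
        F[:, 0] = F[0, :]
        for c in children[j]:
            U, uidx = factor(c, A, struct_L, children, Lcols)
            F, idx = extend_add(F, idx, U, uidx)   # F_j = F_j (+) U_c
        F[0, 0] = np.sqrt(F[0, 0])                 # one step of dense Cholesky:
        F[1:, 0] /= F[0, 0]                        #   column j of L ...
        U = F[1:, 1:] - np.outer(F[1:, 0], F[1:, 0])   # ... and update matrix U_j
        Lcols[j] = (idx, F[:, 0].copy())           # rows idx of column j of L
        return U, idx[1:]                          # U_j is indexed by Struct(L_*j)

Calling factor(r, A, struct_L, children, Lcols) on each root r of the elimination forest, with Lcols = {}, factors the whole matrix.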


    Advantages of Multifrontal Method

Most arithmetic operations performed on dense matrices, which reduces indexing overhead and indirect addressing

Can take advantage of loop unrolling, vectorization, and optimized BLAS to run at near peak speed on many types of processors

Data locality good for memory hierarchies, such as cache, virtual memory with paging, or explicit out-of-core solvers

Naturally adaptable to parallel implementation by processing multiple independent fronts simultaneously on different processors

Parallelism can also be exploited in dense matrix computations within each front


    Summary for Parallel Sparse Cholesky

Principal ingredients in efficient parallel algorithm for sparse Cholesky factorization

Reordering matrix to obtain relatively short and well-balanced elimination tree while also limiting fill

Multifrontal or supernodal approach to exploit dense subproblems effectively

Subtree mapping to localize communication

Cyclic mapping of dense subproblems to achieve good load balance

2-D algorithm for dense subproblems to enhance scalability


    Scalability of Sparse Cholesky

Performance and scalability of sparse Cholesky depend on sparsity structure of particular matrix

Sparse factorization with nested dissection requires factorization of dense matrix of dimension Θ(√n) for 2-D grid problem with n grid points (√n is the size of the root vertex separator), for which unconditional weak scalability is possible

However, efficiency often deteriorates as a result of the rest of the sparse factorization taking more time


    References – Dense Cholesky

G. Ballard, J. Demmel, O. Holtz, and O. Schwartz, Communication-optimal parallel and sequential Cholesky decomposition, SIAM J. Sci. Comput. 32:3495-3523, 2010

J. W. Demmel, M. T. Heath, and H. A. van der Vorst, Parallel numerical linear algebra, Acta Numerica 2:111-197, 1993

D. O’Leary and G. W. Stewart, Data-flow algorithms for parallel matrix computations, Comm. ACM 28:840-853, 1985

D. O’Leary and G. W. Stewart, Assignment and scheduling in parallel matrix factorization, Linear Algebra Appl. 77:275-299, 1986


References – Sparse Cholesky

M. T. Heath, Parallel direct methods for sparse linear systems, D. E. Keyes, A. Sameh, and V. Venkatakrishnan, eds., Parallel Numerical Algorithms, pp. 55-90, Kluwer, 1997

M. T. Heath, E. Ng, and B. W. Peyton, Parallel algorithms for sparse linear systems, SIAM Review 33:420-460, 1991

J. Liu, Computational models and task scheduling for parallel sparse Cholesky factorization, Parallel Computing 3:327-342, 1986

J. Liu, Reordering sparse matrices for parallel elimination, Parallel Computing 11:73-91, 1989

J. Liu, The role of elimination trees in sparse factorization, SIAM J. Matrix Anal. Appl. 11:134-172, 1990


    References – Multifrontal Methods

I. S. Duff, Parallel implementation of multifrontal schemes, Parallel Computing 3:193-204, 1986

A. Gupta, Parallel sparse direct methods: a short tutorial, IBM Research Report RC 25076, November 2010

J. Liu, The multifrontal method for sparse matrix solution: theory and practice, SIAM Review 34:82-109, 1992

J. A. Scott, Parallel frontal solvers for large sparse linear systems, ACM Trans. Math. Software 29:395-417, 2003


    References – Scalability

A. George, J. Liu, and E. Ng, Communication results for parallel sparse Cholesky factorization on a hypercube, Parallel Computing 10:287-298, 1989

A. Gupta, G. Karypis, and V. Kumar, Highly scalable parallel algorithms for sparse matrix factorization, IEEE Trans. Parallel Distrib. Systems 8:502-520, 1997

T. Rauber, G. Runger, and C. Scholtes, Scalability of sparse Cholesky factorization, Internat. J. High Speed Computing 10:19-52, 1999

R. Schreiber, Scalability of sparse direct solvers, A. George, J. R. Gilbert, and J. Liu, eds., Graph Theory and Sparse Matrix Computation, pp. 191-209, Springer-Verlag, 1993


    References – Nonsymmetric Sparse Systems

I. S. Duff and J. A. Scott, A parallel direct solver for large sparse highly unsymmetric linear systems, ACM Trans. Math. Software 30:95-117, 2004

A. Gupta, A shared- and distributed-memory parallel general sparse direct solver, Appl. Algebra Engrg. Commun. Comput. 18(3):263-277, 2007

X. S. Li and J. W. Demmel, SuperLU_DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems, ACM Trans. Math. Software 29:110-140, 2003

K. Shen, T. Yang, and X. Jiao, S+: Efficient 2D sparse LU factorization on parallel machines, SIAM J. Matrix Anal. Appl. 22:282-305, 2000
