TMA4280—Introduction to Supercomputing · 2017. 3. 13. · 1 Discussion on Direct Solvers...

1

Discussion on Direct SolversTMA4280—Introduction to Supercomputing

Based on 2016v slides by Eivind Fonn

NTNU, IMF

February 27. 2017

The problem

— We want to solve

Ax = b, b,x ∈ RN , A ∈ RN×N

where A is the system resulting from discretiziing a Poisson problemusing finite differences (see the previous slides).

— We use standard notation for matrices and vectors, i.e.a1,1 a1,2 · · · a1,Na2,1 a2,2 · · · a2,N

......

. . ....

aN,1 aN,2 · · · aN,N

x1x2...

xN

=

b1b2...

bN

2

Computer algorithms

— We know that the solution is given by

x = A−1b.

— Explicitly forming the inverse A−1 is expensive, prone to round-offerrors, and something we seldom do on computers.

— Exception: Very small and frequently re-used sub-problems that areknown to be well conditioned.

— Matrix inversion has the same time complexity as matrixmultiplication (typically O(n3)).

— Instead, we implement algorithms that solve a linear system given aspecific right-hand-side vector.

— The structure and properties of the matrix A determine whichalgorithms we can use.

3

Computer algorithms

— Trivial example: if A is known to be orthogonal, then

x = Aᵀb.

— Orthogonal matrices are the exception, and not the rule.— We are more likely to find and exploit properties such as

• Symmetry• Definiteness• Sparsity• Bandedness

— We will now consider a number of different algorithms, theirimplementation and usability in a parallel context.

4

General matrices (the hammer)

— In introductory linear algebra we learned two ways to invert generalmatrices.

— First is Cramer’s rule. The solution to a linear system

Ax = b

can be found by repeated application of

xi =det Ai

det A

where Ai is A with the i th column replaced with b.— Naive implementations scale as N!, which makes Cramer’s rule

useless.

5


— Instead we tend to resort to Gaussian elimination.— A systematic procedure that allows us to transform the matrix A into

triangular form, which allows for easy inversion using backwardsubstitution.

× × × × ×× × × ×× × ×× ××

— The last equation is trivial. Solve that, plug the value in the

next-to-last equation, which makes that trivial, etc.

6


— The equations are on the form

a1,1x1 + a1,2x2 + . . .+ a1,NxN = b1

a2,1x1 + a2,2x2 + . . .+ a2,NxN = b2

... =...

— We want to get rid of the term a2,1x1. We do this by applying therow-operation (row 2)− a2,1/a1,1 (row 1).

— This yields in the second row

0 +

(a2,2 −

a2,1

a1,1a1,2

)x2 + . . .+

(a2,N −

a2,1

a1,1a1,N

)xN = b2 −

a2,1

a1,1b1

7


— We repeat this for all rows and all columns beneath the diagonal.— Note: we need ak,k 6= 0 when eliminating column k . If that’s not the

case, we have to exchange two rows. This is called pivoting.— To limit cancellation errors due to limited precision, we also want

ak,k 0. Therefore, choose the row with the largest element. This iscalled partial pivoting. (Full pivoting also interchanges columns.)

— This is a procedure with relatively simple rules, which makes itsuitable for implementation.

8


There are two problems with Gaussian elimination.— It modifies the right-hand-side vector b. If we want to solve the same

linear system with a different b we have to redo the whole procedure.— It is still prone to round-off errors, even with partial pivoting.— Therefore, the typical implementation of Gaussian eliminiation is in

terms of the LU decomposition: we seek two matrices L and U, lowerand upper triangular, such that A = LU.

— This decomposition is independent of b.

9


— We can then find the solution to a system

Ax = LUx = b

in this way.— First, solve Lv = b for v . (Forward substitution.)— Then, solve Ux = v for x . (Backward substitution.)— Forward substitution works the same way as backward substitution.

Just backward. . . in other words, forward.

10


— The LU decomposition has redundancy: there are two sets ofdiagonal elements. There are two popular implementations:Doolittle’s method (unit diagonal on L) and Crout’s method (unitdiagonal on U).

— Doolittle’s algorithm can be stated briefly as

u1,k = a1,k k = 1, . . . ,N

`j,1 =aj,1

u1,1j = 2, . . . ,N

uj,k = aj,k −j−1∑s=1

`j,sus,k k = j , . . . ,N

`j,k =1

uk,k

(aj,k −

k−1∑s=1

`j,sus,k

)j = k + 1, . . . ,N

11

LAPACK

— LU decomposition is somewhat tedious to implement, particularly inan efficient way.

— Smart people have done it for you.— LAPACK: The Linear Algebra Pack, which builds on BLAS.— Same naming conventions and matrix formats.— Just like with BLAS and CBLAS there is a LAPACK and a CLAPACK

(and LAPACKE).— Here we will stick to LAPACK (Fortran numbering).

12

LAPACK

Function prototype:

void dgesv(const int *n, const int *rhs ,

double *A, const int *lda , int *ipiv ,

double *B, const int *ldb , int *info);

Usage:

void lusolve(Matrix A, Vector x)

int *ipiv = malloc(x->len * sizeof(int));

int one = 1, info;

dgesv (&x->len , &one , A->data[0], &x->len ,

ipiv , x->data , &x->len , &info);

free(ipiv);

13

LAPACK

— dgesv overwrites the matrix A with the factorization L,U.— dgesv actually calls two functions:

• dgetrf to compute the decomposition,• dgetrs to solve the system.

— Therefore we can solve for different right-hand-sides after the firstcall, by calling dgetrs ourselves.

— It would be a mistake to call dgesv more than once!

14

Improving performance

— While LAPACK is efficient, it cannot do wonders.— LU decomposition has the same time complexity as matrix

multiplication: O(N3). It is asymptotically as bad as matrix inversion(although in realistic terms much better).

— Even if A is sparse, in general L and U aren’t. (Although for bandedmatrices they actually are.)

15

Symmetry

— If A is symmetric and positive definite, we find that U = Lᵀ.— Thus we can save a factor of two in memory and in floating point

operations.— No pivoting is required for such systems.— This is termed Cholesky factorization.— Note that all conditions are satisfied by the Poisson matrices.

16

LAPACK

Function prototype:

void dposv(char *uplo , const int *n,

const int *nrhs , double *A,

const int *lda , double *B,

const int *ldb , int *info);

Usage:

void llsolve(Matrix A, Vector x)

int one = 1, info;

char uplo = 'L';

dposv (&uplo , &x->len , &one , A->data[0],

&x->len , x->data , &x->len , &info);

17

Improving performance

— LU decomposition is unfit for parallel implementation• It is sequential in nature (you must eliminate for row 2 before row 3).• Pivoting requires synchronization for every row.• The substitution phase is completely sequential (but this is not so

important, since it is O(N2)).• Smart people have tried to make it work by identifying independent

blocks in the matrix.

— For most systems the results are mediocre and with limitedscalability.

— ⇒ LU decomposition is unusable in a parallel context.— However, it is useful as a component in other algorithms.

18

Better approaches

— Solution methods can be categorized in two classes.— Direct methods yield the solution in a predictable number of

operations. Cramer’s rule, Gaussian elimination and LUdecomposition are good examples of direct methods.

— Iterative methods converge to the solution through some iterativeprocedure with an unpredictable number of operations.

— Let us now consider another example of a direct method.

19

Diagonalization

— This is not a hammer. We must exploit known properties about thematrix A.

— We recall the five-point stencil:

−1 −1

−1

−1

+4

— This results from a double application of the three-point stencil in twodirections.

−1 +2 −1

20

Diagonalization

— In the following, the matrix A results from the five-point stencil (2Dproblem), while the matrix T results from the three-point stencil (T ).

— For a symmetric positive definite matrix C, we know that we canperform an eigendecomposition

CQ = QΛ

where Q is the matrix of eigenvectors (as columns) and Λ is thediagonal matrix of eigenvalues.

— Since C is SPD, Q is orthogonal, so

C = QΛQᵀ.

21

Diagonalization

— Using this in the linear system, we get

Cx = b =⇒ QΛQᵀ = b.

— Multiply from the left by Qᵀ, recalling that Qᵀ = Q−1.

ΛQᵀx︸︷︷︸x

= Qᵀb︸︷︷︸b

.

— This reduces the problem to three relatively easy steps.

22

Diagonalization

1. Calculate b with a matrix-vector product

b = Qᵀb

in O(N2) operations.2. Solve the system

Λx = b

in O(N) operations (Λ is diagonal).3. Calculate x with another matrix-vector product

x = Qx

in O(N2) operations.

23

Diagonalization

— This looks promising: it seems we have found a way to solve thesystem in O(N2) operations instead of O(N3)!

— Unfortunately this is not true, as the computation of theeigendecomposition itself (Q and Λ) requires O(N3) operations.

— However, in certain cases we can still exploit this method.

24

Diagonalization

— We constructed the matrix A by applying the three-point stencil in twodirections.

— In the language of linear algebra, this translates to a tensor product.

A = I ⊗ T + T ⊗ I

— The linear system(I ⊗ T + T ⊗ I) x = b

in a global numbering scheme, can equivalently be stated

TX + XT ᵀ = B

in a local numbering scheme. Here, the unknown X and theright-hand-side B are matrices.

25

Diagonalization

— A global numbering scheme is a scheme where we number theunknowns using a single index. This naturally maps to a vector.

— A local numbering scheme is a scheme where we number theunknowns using one index for each spatial direction. This naturallymaps to a matrix (in 2D) or more generally a tensor.

— Thus, we consider in this case a system of matrix equations.— Alternative way of thinking about it: we operate with T along the

columns of X , and then along the rows.

TX︸︷︷︸columns

+ (TXᵀ)ᵀ︸︷︷︸rows

= TX + XT ᵀ

26

Diagonalization

— Now let us diagonalize T , recalling that it is SPD.

TX + XT ᵀ = B ⇒QΛQᵀX + XQΛQᵀ = B.

— Multiply with Q from the right,

QΛQᵀXQ + XQΛ = BQ

— and by Qᵀ from the left,

ΛQᵀXQ︸︷︷︸X

+ QᵀXQ︸︷︷︸X

Λ = QᵀBQ︸︷︷︸B

,

orΛX + XΛ = B.

27

Diagonalization

We find the solution in three steps1. Calculate B with two matrix-matrix products

B = QᵀBQ

in O(n3) operations.2. Solve the system

xi,j =bi,j

λi + λj

in O(n2) operations.3. Recover the solution with two matrix-matrix products

X = QXQᵀ

in O(n3) operations.

28

Diagonalization

— Note that n =√

N in two dimensions.— That gives O(N3/2) operations in total, and a space complexity ofO(N). A very fast algorithm!

— Computing the eigendecomposition is also O(N3/2). Since the matrixis smaller, this no longer dominates the running time.

— It is parallelizable: a few large, global data exchanges are needed(transposing matrices). Multiplying matrices can be done withrelatively little communication. (More on this in the big project.)

— LU decomposition would require O(N3) operations and O(N2)storage. Bandewd LU decomposition would require O(N2)operations and O(N3/2) storage.

29

Diagonalization

— The diagonalization method is quite general, applicable to any SPDsystem with a tensor-product operator.

— We used Poisson to make it more easily understandable (I hope).— It turns out that we can do even better by exploiting even more

structure in the Poisson problem.

30

Diagonalization

— The continuous eigenvalue problem

−uxx = λu, in Ω = (0,1),

u(0) = u(1) = 0

has solutions for j = 1,2, . . .

uj (x) = sin(jπx),

λj = j2π2

— We now consider operating with T on vectors consisting of ujsampled at the gridpoints, i.e.

q(j)i = uj (xi ) = sin

(ijπn

)

31

Diagonalization

— This yields

(T q(j))i = 2(

1− cos(

jπn

))︸︷︷︸

λj

sin(

ijπn

)︸︷︷︸

q(j)i

— Thus, the eigenvectors of T are the eigenfunctions of −∂2/∂x2

sampled at the gridpoints. (Lucky!)— Let us normalize to obtain the matrix Q.

qᵀj qj = 1⇒ Qi,j =

√2n

sin(

ijπn

)Λi,i = 2

(1− cos

(jπn

))

32

Discrete Fourier Transform

— Now for something completely different. (But not really. It’s a magictrick.)

— Consider a periodic function v(x) with period 2π.— Sample this function at equidistant points xj , j = 0,1, . . . ,n.— As with the finite difference grid, xj = jh where h = 2π/N.— We name the samples vj = v(xj ).

x0 x1 xn

33


— Consider the vectors ϕk , where

(ϕk )j = eikxj ,

whereeikxj = cos(kxj ) + i sin(kxj ).

— These vectors form a basis for the complex n-dimensional space Cn.In particular, they are orthogonal:

ϕHkϕ` =

n, k = `,

0, k 6= `,k , ` = 0,1, . . . ,n − 1.

34


— Since they form a basis, any vector in the space can be expressed asa linear combination of these vectors.

— The vectorv =

(v0 · · · vn−1

)ᵀ ∈ Rn

can be expressed as

v =n−1∑k=0

vkϕk ,

or element by element,

vj =n−1∑k=0

vk (ϕk )j =n−1∑k=0

vk eikxj .

35


— Here vk are the discrete Fourier coefficients, which are given by theinverse relation. (Since ϕk ) are orthogonal, this is easy to compute.)

vk =1n

n∑j=0

vjeikxj .

— The Fourier transform is extremely useful and has been extensivelystudied.

— It is useful in, among other things, signal analysis, audio and videocompression.

— Important property: the DFT coefficients of an odd, real signal arepurely imaginary.

36

Discrete Sine Transform

— A transform closely related to DFT is the discrete sine transform, theDST.

— It applies to a function v(x), periodic with period 2π, and odd. That is,v(x) = −v(−x)

— We discretize this function on an equidistant mesh between 0 and π.(The values between π and 2π aren’t needed because it is odd.)

— Since v is odd, we also know that v(x0) = v(xn) = 0. (These are ourPoisson boundary conditions.)

— Thus the discrete function is represented by the vector

v =(v1 · · · vn−1

)ᵀ ∈ Rn−1.

37

Discrete Sine Transform

— As a basis for this space, let us use the sines

(ψk )j = sin(

kjπn

), j = 1, . . . ,n − 1,

and note that

ψᵀkψ` =

n/2, k = `,

0, k 6= `.

— Thus

vj =n−1∑k=1

vk sin(

kjπn

)=(S−1v

)j ,

and

vk =2n

n−1∑j=1

vj sin(

jkπn

)= (Sv)k

38

Tying it all toghether

— In particular, we have

Q =

√n2

S, Qᵀ =

√2n

S−1.

Therefore we can compute a matrix-vector product involving Q or Qᵀ

by using the DST.— How to compute the DST quickly? Consider a vector

v =(v1 · · · vn−1

)ᵀ ∈ Rn−1.

Construct the odd extended vector

w =(0 v1 · · · vn−1 0 −vn−1 · · · −v1

)ᵀ ∈ R2n.

39

Tying it all toghether

— It turns out that the DST coefficients of w and the DFT coefficients ofw are related by

vk = 2iwk , k = 1, . . . ,n − 1

— Thus, we can find the DST by computing the DFT of the extendedvector, multiplying the first n − 1 coefficients (after the constantmode) by 2i, and throwing away the rest of them.

— This is good because there are very fast algorithms for computing theDFT: the infamous Fast Fourier Transform (FFT).

— One FFT is O(n log n), so a matrix-matrix product using the FFT isO(n2 log n).

40

Tying it all together

We find the solution in three steps1. Calculate B with two matrix-matrix products

Bᵀ = S−1(SB)ᵀ

in O(n2 log n) operations.2. Solve the system

xi,j =bi,j

λi + λj

in O(n2) operations.3. Recover the solution with two matrix-matrix products

X = S−1(SXᵀ)ᵀ

in O(n2 log n) operations.

41

TMA4280—Introduction to Supercomputing · 2017. 3. 13. · 1 Discussion on Direct Solvers...

Documents

Transcript of TMA4280—Introduction to Supercomputing · 2017. 3. 13. · 1 Discussion on Direct Solvers...