TMA4280—Introduction to Supercomputing · 2017. 3. 13. · 1 Discussion on Direct Solvers...
Transcript of TMA4280—Introduction to Supercomputing · 2017. 3. 13. · 1 Discussion on Direct Solvers...
1
Discussion on Direct SolversTMA4280—Introduction to Supercomputing
Based on 2016v slides by Eivind Fonn
NTNU, IMF
February 27. 2017
The problem
— We want to solve
Ax = b, b,x ∈ RN , A ∈ RN×N
where A is the system resulting from discretiziing a Poisson problemusing finite differences (see the previous slides).
— We use standard notation for matrices and vectors, i.e.a1,1 a1,2 · · · a1,Na2,1 a2,2 · · · a2,N
......
. . ....
aN,1 aN,2 · · · aN,N
x1x2...
xN
=
b1b2...
bN
2
Computer algorithms
— We know that the solution is given by
x = A−1b.
— Explicitly forming the inverse A−1 is expensive, prone to round-offerrors, and something we seldom do on computers.
— Exception: Very small and frequently re-used sub-problems that areknown to be well conditioned.
— Matrix inversion has the same time complexity as matrixmultiplication (typically O(n3)).
— Instead, we implement algorithms that solve a linear system given aspecific right-hand-side vector.
— The structure and properties of the matrix A determine whichalgorithms we can use.
3
Computer algorithms
— Trivial example: if A is known to be orthogonal, then
x = Aᵀb.
— Orthogonal matrices are the exception, and not the rule.— We are more likely to find and exploit properties such as
• Symmetry• Definiteness• Sparsity• Bandedness
— We will now consider a number of different algorithms, theirimplementation and usability in a parallel context.
4
General matrices (the hammer)
— In introductory linear algebra we learned two ways to invert generalmatrices.
— First is Cramer’s rule. The solution to a linear system
Ax = b
can be found by repeated application of
xi =det Ai
det A
where Ai is A with the i th column replaced with b.— Naive implementations scale as N!, which makes Cramer’s rule
useless.
5
General matrices (the hammer)
— Instead we tend to resort to Gaussian elimination.— A systematic procedure that allows us to transform the matrix A into
triangular form, which allows for easy inversion using backwardsubstitution.
× × × × ×× × × ×× × ×× ××
— The last equation is trivial. Solve that, plug the value in the
next-to-last equation, which makes that trivial, etc.
6
General matrices (the hammer)
— The equations are on the form
a1,1x1 + a1,2x2 + . . .+ a1,NxN = b1
a2,1x1 + a2,2x2 + . . .+ a2,NxN = b2
... =...
— We want to get rid of the term a2,1x1. We do this by applying therow-operation (row 2)− a2,1/a1,1 (row 1).
— This yields in the second row
0 +
(a2,2 −
a2,1
a1,1a1,2
)x2 + . . .+
(a2,N −
a2,1
a1,1a1,N
)xN = b2 −
a2,1
a1,1b1
7
General matrices (the hammer)
— We repeat this for all rows and all columns beneath the diagonal.— Note: we need ak,k 6= 0 when eliminating column k . If that’s not the
case, we have to exchange two rows. This is called pivoting.— To limit cancellation errors due to limited precision, we also want
ak,k 0. Therefore, choose the row with the largest element. This iscalled partial pivoting. (Full pivoting also interchanges columns.)
— This is a procedure with relatively simple rules, which makes itsuitable for implementation.
8
General matrices (the hammer)
There are two problems with Gaussian elimination.— It modifies the right-hand-side vector b. If we want to solve the same
linear system with a different b we have to redo the whole procedure.— It is still prone to round-off errors, even with partial pivoting.— Therefore, the typical implementation of Gaussian eliminiation is in
terms of the LU decomposition: we seek two matrices L and U, lowerand upper triangular, such that A = LU.
— This decomposition is independent of b.
9
General matrices (the hammer)
— We can then find the solution to a system
Ax = LUx = b
in this way.— First, solve Lv = b for v . (Forward substitution.)— Then, solve Ux = v for x . (Backward substitution.)— Forward substitution works the same way as backward substitution.
Just backward. . . in other words, forward.
10
General matrices (the hammer)
— The LU decomposition has redundancy: there are two sets ofdiagonal elements. There are two popular implementations:Doolittle’s method (unit diagonal on L) and Crout’s method (unitdiagonal on U).
— Doolittle’s algorithm can be stated briefly as
u1,k = a1,k k = 1, . . . ,N
`j,1 =aj,1
u1,1j = 2, . . . ,N
uj,k = aj,k −j−1∑s=1
`j,sus,k k = j , . . . ,N
`j,k =1
uk,k
(aj,k −
k−1∑s=1
`j,sus,k
)j = k + 1, . . . ,N
11
LAPACK
— LU decomposition is somewhat tedious to implement, particularly inan efficient way.
— Smart people have done it for you.— LAPACK: The Linear Algebra Pack, which builds on BLAS.— Same naming conventions and matrix formats.— Just like with BLAS and CBLAS there is a LAPACK and a CLAPACK
(and LAPACKE).— Here we will stick to LAPACK (Fortran numbering).
12
LAPACK
Function prototype:
void dgesv(const int *n, const int *rhs ,
double *A, const int *lda , int *ipiv ,
double *B, const int *ldb , int *info);
Usage:
void lusolve(Matrix A, Vector x)
int *ipiv = malloc(x->len * sizeof(int));
int one = 1, info;
dgesv (&x->len , &one , A->data[0], &x->len ,
ipiv , x->data , &x->len , &info);
free(ipiv);
13
LAPACK
— dgesv overwrites the matrix A with the factorization L,U.— dgesv actually calls two functions:
• dgetrf to compute the decomposition,• dgetrs to solve the system.
— Therefore we can solve for different right-hand-sides after the firstcall, by calling dgetrs ourselves.
— It would be a mistake to call dgesv more than once!
14
Improving performance
— While LAPACK is efficient, it cannot do wonders.— LU decomposition has the same time complexity as matrix
multiplication: O(N3). It is asymptotically as bad as matrix inversion(although in realistic terms much better).
— Even if A is sparse, in general L and U aren’t. (Although for bandedmatrices they actually are.)
15
Symmetry
— If A is symmetric and positive definite, we find that U = Lᵀ.— Thus we can save a factor of two in memory and in floating point
operations.— No pivoting is required for such systems.— This is termed Cholesky factorization.— Note that all conditions are satisfied by the Poisson matrices.
16
LAPACK
Function prototype:
void dposv(char *uplo , const int *n,
const int *nrhs , double *A,
const int *lda , double *B,
const int *ldb , int *info);
Usage:
void llsolve(Matrix A, Vector x)
int one = 1, info;
char uplo = 'L';
dposv (&uplo , &x->len , &one , A->data[0],
&x->len , x->data , &x->len , &info);
17
Improving performance
— LU decomposition is unfit for parallel implementation• It is sequential in nature (you must eliminate for row 2 before row 3).• Pivoting requires synchronization for every row.• The substitution phase is completely sequential (but this is not so
important, since it is O(N2)).• Smart people have tried to make it work by identifying independent
blocks in the matrix.
— For most systems the results are mediocre and with limitedscalability.
— ⇒ LU decomposition is unusable in a parallel context.— However, it is useful as a component in other algorithms.
18
Better approaches
— Solution methods can be categorized in two classes.— Direct methods yield the solution in a predictable number of
operations. Cramer’s rule, Gaussian elimination and LUdecomposition are good examples of direct methods.
— Iterative methods converge to the solution through some iterativeprocedure with an unpredictable number of operations.
— Let us now consider another example of a direct method.
19
Diagonalization
— This is not a hammer. We must exploit known properties about thematrix A.
— We recall the five-point stencil:
−1 −1
−1
−1
+4
— This results from a double application of the three-point stencil in twodirections.
−1 +2 −1
20
Diagonalization
— In the following, the matrix A results from the five-point stencil (2Dproblem), while the matrix T results from the three-point stencil (T ).
— For a symmetric positive definite matrix C, we know that we canperform an eigendecomposition
CQ = QΛ
where Q is the matrix of eigenvectors (as columns) and Λ is thediagonal matrix of eigenvalues.
— Since C is SPD, Q is orthogonal, so
C = QΛQᵀ.
21
Diagonalization
— Using this in the linear system, we get
Cx = b =⇒ QΛQᵀ = b.
— Multiply from the left by Qᵀ, recalling that Qᵀ = Q−1.
ΛQᵀx︸︷︷︸x
= Qᵀb︸︷︷︸b
.
— This reduces the problem to three relatively easy steps.
22
Diagonalization
1. Calculate b with a matrix-vector product
b = Qᵀb
in O(N2) operations.2. Solve the system
Λx = b
in O(N) operations (Λ is diagonal).3. Calculate x with another matrix-vector product
x = Qx
in O(N2) operations.
23
Diagonalization
— This looks promising: it seems we have found a way to solve thesystem in O(N2) operations instead of O(N3)!
— Unfortunately this is not true, as the computation of theeigendecomposition itself (Q and Λ) requires O(N3) operations.
— However, in certain cases we can still exploit this method.
24
Diagonalization
— We constructed the matrix A by applying the three-point stencil in twodirections.
— In the language of linear algebra, this translates to a tensor product.
A = I ⊗ T + T ⊗ I
— The linear system(I ⊗ T + T ⊗ I) x = b
in a global numbering scheme, can equivalently be stated
TX + XT ᵀ = B
in a local numbering scheme. Here, the unknown X and theright-hand-side B are matrices.
25
Diagonalization
— A global numbering scheme is a scheme where we number theunknowns using a single index. This naturally maps to a vector.
— A local numbering scheme is a scheme where we number theunknowns using one index for each spatial direction. This naturallymaps to a matrix (in 2D) or more generally a tensor.
— Thus, we consider in this case a system of matrix equations.— Alternative way of thinking about it: we operate with T along the
columns of X , and then along the rows.
TX︸︷︷︸columns
+ (TXᵀ)ᵀ︸ ︷︷ ︸rows
= TX + XT ᵀ
26
Diagonalization
— Now let us diagonalize T , recalling that it is SPD.
TX + XT ᵀ = B ⇒QΛQᵀX + XQΛQᵀ = B.
— Multiply with Q from the right,
QΛQᵀXQ + XQΛ = BQ
— and by Qᵀ from the left,
ΛQᵀXQ︸ ︷︷ ︸X
+ QᵀXQ︸ ︷︷ ︸X
Λ = QᵀBQ︸ ︷︷ ︸B
,
orΛX + XΛ = B.
27
Diagonalization
We find the solution in three steps1. Calculate B with two matrix-matrix products
B = QᵀBQ
in O(n3) operations.2. Solve the system
xi,j =bi,j
λi + λj
in O(n2) operations.3. Recover the solution with two matrix-matrix products
X = QXQᵀ
in O(n3) operations.
28
Diagonalization
— Note that n =√
N in two dimensions.— That gives O(N3/2) operations in total, and a space complexity ofO(N). A very fast algorithm!
— Computing the eigendecomposition is also O(N3/2). Since the matrixis smaller, this no longer dominates the running time.
— It is parallelizable: a few large, global data exchanges are needed(transposing matrices). Multiplying matrices can be done withrelatively little communication. (More on this in the big project.)
— LU decomposition would require O(N3) operations and O(N2)storage. Bandewd LU decomposition would require O(N2)operations and O(N3/2) storage.
29
Diagonalization
— The diagonalization method is quite general, applicable to any SPDsystem with a tensor-product operator.
— We used Poisson to make it more easily understandable (I hope).— It turns out that we can do even better by exploiting even more
structure in the Poisson problem.
30
Diagonalization
— The continuous eigenvalue problem
−uxx = λu, in Ω = (0,1),
u(0) = u(1) = 0
has solutions for j = 1,2, . . .
uj (x) = sin(jπx),
λj = j2π2
— We now consider operating with T on vectors consisting of ujsampled at the gridpoints, i.e.
q(j)i = uj (xi ) = sin
(ijπn
)
31
Diagonalization
— This yields
(T q(j))i = 2(
1− cos(
jπn
))︸ ︷︷ ︸
λj
sin(
ijπn
)︸ ︷︷ ︸
q(j)i
— Thus, the eigenvectors of T are the eigenfunctions of −∂2/∂x2
sampled at the gridpoints. (Lucky!)— Let us normalize to obtain the matrix Q.
qᵀj qj = 1⇒ Qi,j =
√2n
sin(
ijπn
)Λi,i = 2
(1− cos
(jπn
))
32
Discrete Fourier Transform
— Now for something completely different. (But not really. It’s a magictrick.)
— Consider a periodic function v(x) with period 2π.— Sample this function at equidistant points xj , j = 0,1, . . . ,n.— As with the finite difference grid, xj = jh where h = 2π/N.— We name the samples vj = v(xj ).
x0 x1 xn
33
Discrete Fourier Transform
— Consider the vectors ϕk , where
(ϕk )j = eikxj ,
whereeikxj = cos(kxj ) + i sin(kxj ).
— These vectors form a basis for the complex n-dimensional space Cn.In particular, they are orthogonal:
ϕHkϕ` =
n, k = `,
0, k 6= `,k , ` = 0,1, . . . ,n − 1.
34
Discrete Fourier Transform
— Since they form a basis, any vector in the space can be expressed asa linear combination of these vectors.
— The vectorv =
(v0 · · · vn−1
)ᵀ ∈ Rn
can be expressed as
v =n−1∑k=0
vkϕk ,
or element by element,
vj =n−1∑k=0
vk (ϕk )j =n−1∑k=0
vk eikxj .
35
Discrete Fourier Transform
— Here vk are the discrete Fourier coefficients, which are given by theinverse relation. (Since ϕk ) are orthogonal, this is easy to compute.)
vk =1n
n∑j=0
vjeikxj .
— The Fourier transform is extremely useful and has been extensivelystudied.
— It is useful in, among other things, signal analysis, audio and videocompression.
— Important property: the DFT coefficients of an odd, real signal arepurely imaginary.
36
Discrete Sine Transform
— A transform closely related to DFT is the discrete sine transform, theDST.
— It applies to a function v(x), periodic with period 2π, and odd. That is,v(x) = −v(−x)
— We discretize this function on an equidistant mesh between 0 and π.(The values between π and 2π aren’t needed because it is odd.)
— Since v is odd, we also know that v(x0) = v(xn) = 0. (These are ourPoisson boundary conditions.)
— Thus the discrete function is represented by the vector
v =(v1 · · · vn−1
)ᵀ ∈ Rn−1.
37
Discrete Sine Transform
— As a basis for this space, let us use the sines
(ψk )j = sin(
kjπn
), j = 1, . . . ,n − 1,
and note that
ψᵀkψ` =
n/2, k = `,
0, k 6= `.
— Thus
vj =n−1∑k=1
vk sin(
kjπn
)=(S−1v
)j ,
and
vk =2n
n−1∑j=1
vj sin(
jkπn
)= (Sv)k
38
Tying it all toghether
— In particular, we have
Q =
√n2
S, Qᵀ =
√2n
S−1.
Therefore we can compute a matrix-vector product involving Q or Qᵀ
by using the DST.— How to compute the DST quickly? Consider a vector
v =(v1 · · · vn−1
)ᵀ ∈ Rn−1.
Construct the odd extended vector
w =(0 v1 · · · vn−1 0 −vn−1 · · · −v1
)ᵀ ∈ R2n.
39
Tying it all toghether
— It turns out that the DST coefficients of w and the DFT coefficients ofw are related by
vk = 2iwk , k = 1, . . . ,n − 1
— Thus, we can find the DST by computing the DFT of the extendedvector, multiplying the first n − 1 coefficients (after the constantmode) by 2i, and throwing away the rest of them.
— This is good because there are very fast algorithms for computing theDFT: the infamous Fast Fourier Transform (FFT).
— One FFT is O(n log n), so a matrix-matrix product using the FFT isO(n2 log n).
40
Tying it all together
We find the solution in three steps1. Calculate B with two matrix-matrix products
Bᵀ = S−1(SB)ᵀ
in O(n2 log n) operations.2. Solve the system
xi,j =bi,j
λi + λj
in O(n2) operations.3. Recover the solution with two matrix-matrix products
X = S−1(SXᵀ)ᵀ
in O(n2 log n) operations.
41