Lecture 14
Solving systems of linear equations
Scott B. Baden, CSE 260, Fall 2006 (11/6/06)
Linear systems of equations
• A common task in scientific computation is to solve a system of linear equations
• These systems often result from discretizing a differential equation
• Example: linear system of 2 equations in 2 unknowns
(1) 2x + 3y = 8
(2) 3x + 2y = 7
• Rewriting equation (1): x = (8 - 3y)/2
• Substituting x into the LHS of equation (2): 3(8 - 3y)/2 + 2y = (24 - 9y)/2 + 2y
  (24 - 9y) + 4y = 14
  24 - 5y = 14
  5y = 10
  y = 2
• Back substituting the value of y into equation (1): x = 1
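As a quick check, the same 2 x 2 system can be solved numerically. A minimal sketch using NumPy (my illustration, not part of the original lecture):

    import numpy as np

    # the system above: 2x + 3y = 8, 3x + 2y = 7
    A = np.array([[2.0, 3.0],
                  [3.0, 2.0]])
    b = np.array([8.0, 7.0])

    print(np.linalg.solve(A, b))   # [1. 2.], i.e. x = 1, y = 2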
Matrix vector equations
• Our linear system of 2 equations in 2 unknowns …
  2x1 + 3x2 = 8
  3x1 + 2x2 = 7
• may be conveniently expressed in matrix notation: Ax = b
• When we solved for x1 = (8 - 3x2)/2 and substituted into the 2nd equation, we reduced the matrix to an equivalent form
• We multiplied row 1 of A by 3/2 and subtracted the scaled version from row 2 of A and b
  A = [ 2  3 ],   x = [ x1 ],   b = [ 8 ]
      [ 3  2 ]        [ x2 ]        [ 7 ]

  After the elimination step:

  A = [ 2    3  ],   b = [  8 ]
      [ 0  -2.5 ]        [ -5 ]
Rank 1 updates
• We call this a rank-1 update
• Multiplying row 1 by 3/2: [3 9/2]
• Subtracting from row 2:

  A = [ 2  3 ]   becomes   A = [ 2    3  ]
      [ 3  2 ]                 [ 0  -2.5 ]

• Similarly for b
• An applet for looking at 2x2 systems: www.freeboundaries.org/java/la_applets/GaussElim/
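The same update is easy to express directly on the array rows. A minimal NumPy sketch of this 2 x 2 rank-1 update (my illustration):

    import numpy as np

    A = np.array([[2.0, 3.0],
                  [3.0, 2.0]])
    b = np.array([8.0, 7.0])

    m = A[1, 0] / A[0, 0]    # multiplier 3/2
    A[1, :] -= m * A[0, :]   # row 2 of A becomes [0, -2.5]
    b[1] -= m * b[0]         # b becomes [8, -5]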
Gaussian Elimination
• The process of eliminating the non-zero values under the main diagonal is called Gaussian elimination, named after the mathematician Johann Carl Friedrich Gauss (1777-1855)
• Input: an n x n matrix corresponding to a linear system of n equations in n unknowns (must have a non-trivial solution)
• Eliminate the non-zero values under the main diagonal to produce an upper triangular matrix U

[Figure: upper triangular matrix U, with zeros below the main diagonal]
Solving the system of linear equations
• Step 1: obtain the upper triangular matrix U …
• Step 2: solve the corresponding upper triangular system Ux = c by back substitution

[Figure: upper triangular matrix U, with zeros below the main diagonal]
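Back substitution itself is short. A sketch of step 2 in Python (my illustration, assuming U is nonsingular):

    import numpy as np

    def back_substitute(U, c):
        """Solve Ux = c for upper triangular U by back substitution."""
        n = len(c)
        x = np.zeros(n)
        for i in range(n - 1, -1, -1):
            # subtract the contributions of the already-solved unknowns,
            # then divide by the diagonal entry
            x[i] = (c[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
        return x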
What are we computing?
• GE computes the LU factorization A = L U, where L is a lower triangular matrix and U is upper triangular
• Plugging LU into the original equation Ax = b:
  Ax = (LU) x = L (Ux) = Ly = b, where y = Ux
[Figure: A = L·U, with L lower triangular (zeros above the diagonal) and U upper triangular (zeros below the diagonal)]
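Libraries expose exactly this pair of triangular solves. A hedged sketch with SciPy's LU routines (my example, not the lecture's code):

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    A = np.array([[2.0, 3.0],
                  [3.0, 2.0]])
    b = np.array([8.0, 7.0])

    lu, piv = lu_factor(A)        # GE with partial pivoting
    x = lu_solve((lu, piv), b)    # solve Ly = b, then Ux = y
    print(x)                      # [1. 2.]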
Cost
• To solve Ax = b
  – Factorize A = LU using GE (2/3 n^3 flops)
  – Solve Ly = b for y using forward substitution (n^2 flops)
  – Solve Ux = y for x using back substitution (n^2 flops)
• We don't need to keep the multipliers (the L factor) explicitly unless we are solving with multiple right-hand sides b
Visualizing the algorithm
• Eliminating non-zeroes below the diagonal …
  – one column at a time
  – scanning from left to right

[Figure: snapshots after eliminating Column 0, Column 1, Column 2, …, Column n-1]
A 3 x 3 example
• Consider the following system of equations:

   x0 +  x1 +  x2 = 3
  4x0 + 3x1 + 4x2 = 8
  9x0 + 3x1 + 4x2 = 7

• We usually write the system as an augmented matrix:

  [ 1  1  1 | 3 ]
  [ 4  3  4 | 8 ]
  [ 9  3  4 | 7 ]
3 x 3 example
• Multiply row 0 by 4, and subtract from row 1:

  [ 1  1  1 | 3 ]
  [ 4  3  4 | 8 ]
  [ 9  3  4 | 7 ]

  [4 3 4 8] - 4*[1 1 1 3] = [0 -1 0 -4]

  [ 1   1  1 |  3 ]
  [ 0  -1  0 | -4 ]
  [ 9   3  4 |  7 ]
3 x 3 example
• Multiply row 0 by 9, and subtract from row 2:

  [ 1   1  1 |  3 ]
  [ 0  -1  0 | -4 ]
  [ 9   3  4 |  7 ]

  [9 3 4 7] - 9*[1 1 1 3] = [0 -6 -5 -20]

  [ 1   1   1 |   3 ]
  [ 0  -1   0 |  -4 ]
  [ 0  -6  -5 | -20 ]
3 x 3 example
• Eliminate the second column
• Multiply row 1 by 6, and subtract from row 2:

  [ 1   1   1 |   3 ]
  [ 0  -1   0 |  -4 ]
  [ 0  -6  -5 | -20 ]

  [0 -6 -5 -20] - 6*[0 -1 0 -4] = [0 0 -5 4]

  [ 1   1   1 |  3 ]
  [ 0  -1   0 | -4 ]
  [ 0   0  -5 |  4 ]
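The three row operations above can be replayed in a few lines of NumPy (my sketch, reproducing the matrices shown on these slides):

    import numpy as np

    # augmented matrix [A | b] for the 3 x 3 example
    M = np.array([[1., 1., 1., 3.],
                  [4., 3., 4., 8.],
                  [9., 3., 4., 7.]])

    M[1] -= 4 * M[0]                      # row 1: [0, -1,  0,  -4]
    M[2] -= 9 * M[0]                      # row 2: [0, -6, -5, -20]
    M[2] -= (M[2, 1] / M[1, 1]) * M[1]    # multiplier 6; row 2: [0, 0, -5, 4]
    print(M)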
Overview
for k = 0 to n-1        // for each column k
    for i = k+1 to n-1  // eliminate entries below the diagonal:
                        // subtract a multiple of row k from succeeding rows i
        A[i, k+1:n] -= (A[i,k] / A[k,k]) * A[k, k+1:n]
    end for
end for

[Figure: snapshots after eliminating Column 0, Column 1, Column 2, …, Column n-1]
Eliminating the entries below the diagonal

• For each column k = 0 to n-1 …
• … subtract multiples of row k, A[k, k+1:n] …
• … from rows i = k+1 to n-1
• Multipliers: m_ik = A[i,k] / A[k,k]
• … these cancel the elements below the diagonal, A[k+1:n-1, k]
• We only update to the right of and below A[k,k]

for i = k+1 to n-1
    A[i, k+1:n] -= m_ik * A[k, k+1:n]
end for

[Figure: row k holds the pivot A[k,k] and entries A[k,j]; row i holds A[i,k] and A[i,j]; only entries to the right of and below the pivot are updated]
Putting it all together
for k = 0 to n-1                          // for each column k
    m[k+1:n-1] = A[k+1:n-1, k] / A[k,k]   // compute multipliers
    for i = k+1 to n-1                    // for each row i > k
        A[i, k+1:n] -= m[i] * A[k, k+1:n] // scale row k by m[i] and subtract from row i
    end for
end for
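A direct Python/NumPy transcription of this pseudocode (a sketch for illustration; note it has no pivoting, so it breaks when a pivot A[k,k] is zero or tiny, which motivates the slides that follow):

    import numpy as np

    def gaussian_eliminate(A, b):
        """In-place GE without pivoting: reduce A to upper triangular form."""
        n = A.shape[0]
        for k in range(n - 1):
            m = A[k+1:, k] / A[k, k]                   # multipliers for column k
            A[k+1:, k+1:] -= np.outer(m, A[k, k+1:])   # rank-1 update of trailing block
            A[k+1:, k] = 0.0                           # entries below the pivot vanish
            b[k+1:] -= m * b[k]                        # same operations applied to b
        return A, b

Combined with the earlier back_substitute sketch, gaussian_eliminate followed by back_substitute(A, b) solves Ax = b.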
Problems with roundoff
• The rank-1 update step uses division …
  A[i, k+1:n] -= (A[i,k] / A[k,k]) * A[k, k+1:n]
• How do we handle denominators that vanish, or that are very small?
• Gaussian elimination will fail with this matrix:

  [ 0  1 ]
  [ 1  0 ]

• But we can avoid the problem if we swap rows:

  [ 1  0 ]
  [ 0  1 ]
Pivoting to avoid stability problems
• We call this process of swapping rows partial pivoting
• Assume we carry 3 decimal digits of precision
• Consider the following A matrix and RHS b:

  A = [ 10^-4  1 ],   b = [ 1 ]
      [   1    1 ]        [ 2 ]

• The correct solution (to 3 digits) is

  x = [ 1 ]
      [ 1 ]
Roundoff
• Eliminate the nonzero below the diagonal by subtracting 10^4 x row 0 from row 1:

  U | b = [ 10^-4      1     |    1     ]
          [   0     1 - 10^4 | 2 - 10^4 ]

• But 1 - 10^4 rounds to -10^4 (and so does 2 - 10^4):

  U | b = [ 10^-4    1    |    1   ]
          [   0    -10^4  |  -10^4 ]
Stability problems due to roundoff
• Thus

  U | b = [ 10^-4    1    |    1   ]
          [   0    -10^4  |  -10^4 ]

• We now back substitute to solve for x2 and then x1:
  -10^4 x2 = -10^4, so x2 = 1
• Substituting the value of x2 into the first equation:
  10^-4 x1 + 1*x2 = 1, so 10^-4 x1 = 0, so x1 = 0
• But the correct solution is x1 ≈ x2 ≈ 1
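This failure is easy to reproduce by rounding every intermediate result to 3 significant digits. A small Python sketch (my illustration; rnd is a hypothetical helper simulating the limited precision):

    import math

    def rnd(x, sig=3):
        # round x to `sig` significant decimal digits
        if x == 0.0:
            return 0.0
        return round(x, sig - 1 - int(math.floor(math.log10(abs(x)))))

    # the system: 1e-4*x1 + x2 = 1,  x1 + x2 = 2
    m   = rnd(1.0 / 1e-4)               # multiplier 1e4
    a22 = rnd(1.0 - m * 1.0)            # 1 - 1e4 rounds to -1e4
    b2  = rnd(2.0 - m * 1.0)            # 2 - 1e4 also rounds to -1e4
    x2  = rnd(b2 / a22)                 # x2 = 1
    x1  = rnd((1.0 - 1.0 * x2) / 1e-4)  # x1 = 0, though the true answer is ~1
    print(x1, x2)                       # 0.0 1.0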
Partial Pivoting
• Rule: pick the largest value (in magnitude) in the column, on or below the diagonal
• This is called partial pivoting, because only rows are swapped
• It can be shown that with partial pivoting we compute A = P L U, where P is a permutation matrix expressing the row swaps
• We can also swap columns: full pivoting
• But full pivoting is more expensive to implement
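Partial pivoting adds one step per column to the earlier sketch: find the row with the largest magnitude entry in the pivot column and swap it up (again my sketch, not the lecture's code):

    import numpy as np

    def ge_partial_pivot(A, b):
        """In-place GE with partial pivoting on A and b."""
        n = A.shape[0]
        for k in range(n - 1):
            # pick the row with the largest |A[i,k]| for i >= k
            p = k + np.argmax(np.abs(A[k:, k]))
            if p != k:
                A[[k, p]] = A[[p, k]]    # swap rows of A ...
                b[[k, p]] = b[[p, k]]    # ... and of b
            m = A[k+1:, k] / A[k, k]
            A[k+1:, k+1:] -= np.outer(m, A[k, k+1:])
            A[k+1:, k] = 0.0
            b[k+1:] -= m * b[k]
        return A, b

On the 10^-4 example above, the swap makes the multiplier 10^-4 instead of 10^4, and the computed solution stays near [1, 1].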
Parallelization
[Figure: 1D vertical strips owned by processes P0 … P5; shading shows the remaining work at successive steps]

• We'll use 1D vertical strip partitioning
• Each process owns N/p columns
• Consider the case where p = N = 6
• The shading represents outstanding work in succeeding iterations of k
Analyzing the code
• Let's look at the data parallel code to understand where the communication is occurring
• Assume a blocked decomposition on the 2nd axis
• Parallelism occurs in the array statements

for k = 0 to n-1                          // for each column k
    m[k+1:n-1] = A[k+1:n-1, k] / A[k,k]   // compute multipliers
    for i = k+1 to n-1                    // for each row i > k
        A[i, k+1:n] -= m[i] * A[k, k+1:n] // scale row k by m[i] and subtract from row i
    end for
end for
Determining communication requirements
• At each step k of the elimination, the processor owning column k (processor k div (n/p) under the blocked decomposition) is in charge: it computes the multipliers
• No communication is needed for this step: all the required data are owned by that processor

for k = 0 to n-1
    m[k+1:n-1] = A[k+1:n-1, k] / A[k,k]
    for i = k+1 to n-1
        A[i, k+1:n] -= m[i] * A[k, k+1:n]
    end for
end for
Determining communication requirements
• But the elements A[k, :] can have different owners
• The processor owning column j (processor j div (n/p)) owns A[k, j]
• What operation is needed to carry out the multiplication m[j] * A[k, :]?

for k = 0 to n-1
    m[k+1:n-1] = A[k+1:n-1, k] / A[k,k]
    for j = k+1 to n-1
        A[j, :] -= m[j] * A[k, :]
    end for
end for
Communication and control
• Each processor is in charge of eliminating N/p columns
• One processor chooses the pivot row and computes the multipliers
• The multipliers are then broadcast
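A hedged sketch of one elimination step under this 1D column partitioning, using mpi4py (my illustration of the communication pattern; A_loc, first_col, and col_owner are assumptions set up by the surrounding program):

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    def eliminate_step(k, n, A_loc, first_col, col_owner):
        root = col_owner(k)                   # process that owns column k
        if rank == root:
            j = k - first_col                 # local index of column k
            m = A_loc[k+1:, j] / A_loc[k, j]  # compute multipliers locally
        else:
            m = np.empty(n - k - 1)
        comm.Bcast(m, root=root)              # broadcast the multipliers
        # every process updates its own columns to the right of column k
        mine = slice(max(0, k + 1 - first_col), A_loc.shape[1])
        A_loc[k+1:, mine] -= np.outer(m, A_loc[k, mine])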
Updates
• All processes carry out updates
• The elements A[k, k+1:n] can have different owners
• The process owning column j (process j div (n/p)) owns the A[k, j] appearing in m[i] * A[k, k+1:n]

for i = k+1 to n-1
    A[i, k+1:n] -= m[i] * A[k, k+1:n]
end for
Performance
• Finding the pivot row is a serial bottleneck
  – Only one process owns the intersecting column
• Another bottleneck is load imbalance
  – When eliminating a column, processors to the left of it are idle
Cyclic decomposition improves load balance
• A cyclic decomposition evens out the workload
• A blocked cyclic decomposition improves locality and reduces communication overhead
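The difference between these mappings is just the owner function for a column. A small sketch contrasting them (illustrative names, not from the slides):

    def owner_block(j, n, p):
        # block: each process owns a contiguous strip of n/p columns
        return j // (n // p)

    def owner_cyclic(j, p):
        # cyclic: columns dealt out round-robin, evening out the load
        return j % p

    def owner_block_cyclic(j, b, p):
        # block cyclic: blocks of b columns dealt out round-robin
        return (j // b) % p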
In practice
• 1D is not scalable; 2D block cyclic decompositions are required
• More complicated, since additional communication steps are required
• The algorithm is blocked, as with matrix multiply
• ScaLAPACK is a well-known library that performs GE and many other useful operations involving matrices
• See http://www.netlib.org/scalapack
2D Block CYCLIC decomposition improves performance

A cyclic mapping tessellates a template of the processes over the entire domain

[Figure: a 2 x 2 template of processes (0 1 / 2 3) tiled repeatedly over the matrix]
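In owner-function form, the 2D block cyclic layout on a pr x pc process grid looks like this (my sketch, following the tiling just described):

    def owner_2d_block_cyclic(i, j, b, pr, pc):
        # coordinates of the process owning A[i, j] with b-by-b blocks
        return (i // b) % pr, (j // b) % pc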
Distributed GE with a 2D Block Cyclic Layout
• Choose pivot, compute multipliers, make swap decision
• Broadcast the swap decision and multipliers so each block column can participate
• Carry out the elimination
Sparse Matrices
• A matrix where knowledge about the location of the non-zeroes is useful
• Consider Jacobi's method with a 5-point stencil:

  u'[i,j] = (u[i-1,j] + u[i+1,j] + u[i,j-1] + u[i,j+1] - h^2 f[i,j]) / 4
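The same stencil can be viewed as a sparse matrix acting on the vector of unknowns. A sketch building the 5-point Laplacian with SciPy (my example, matching the 7 x 7 grid on the next slide):

    import scipy.sparse as sp

    n = 7                                  # 7 x 7 interior grid
    I = sp.identity(n)
    T = sp.diags([-1, 4, -1], [-1, 0, 1], shape=(n, n))
    # 2D 5-point operator via Kronecker products: 49 x 49, at most 5 nonzeros/row
    A = sp.kron(I, T) + sp.kron(sp.diags([-1, -1], [-1, 1], shape=(n, n)), I)
    print(A.shape, A.nnz)                  # (49, 49) 217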
Savings with Sparse Matrices
• 7 x 7 grid: 49 x 49 matrix; use Cholesky's method
• Nonzeroes: 349, 30% of the space
• Flops: 2643, 6.7%
• Fill-in: a zero entry that becomes nonzero during factorization
Managing fill in
• Fill-in and computation are reduced with an improved ordering
• The advantage increases with N: for n = 63 (N = n^2 = 3969)
  – 250,109 non-zeroes → 85,416
  – 16 million flops → 3.6 million

[Figure: Natural order: nonzeros = 233; Min. Degree order: nonzeros = 207]
What is different about sparse factorization?
• Minimize fill in
– Reordering
– Permutations not only maintain stability, but also minimize fill-in
– Trade off stability against sparseness
– We may be able to predict fill in
• Update only where there are non-zeroes
– Structure changes dynamically
– Elimination tree
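Sparse direct solvers expose these fill-reducing orderings directly. A hedged sketch with SciPy's SuperLU wrapper (my example; the matrix here is random, just to make the call runnable):

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import splu

    # a nonsingular sparse matrix in CSC form
    A = sp.random(100, 100, density=0.05, format='csc') \
        + 10 * sp.identity(100, format='csc')
    lu = splu(A, permc_spec='COLAMD')   # LU with a fill-reducing column ordering
    x = lu.solve(np.ones(100))          # triangular solves with the sparse factors
    print(lu.L.nnz, lu.U.nnz)           # fill-in shows up in the factor sizes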
Cholesky
• In some cases the matrix is both
  – Symmetric: A = A^T
  – Positive definite
• No need for pivoting
• We can use the Cholesky factorization: A = L * L^T
• Flop count cut in half
• What complications does this avoid?
• Result: many matrices can be transformed into this form
• SuperLU: Demmel and Li
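For the symmetric positive definite case, a minimal sketch of a Cholesky-based solve with SciPy (my example, not from the slides):

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    # a small symmetric positive definite matrix
    A = np.array([[4.0, 1.0],
                  [1.0, 3.0]])
    b = np.array([1.0, 2.0])

    c, lower = cho_factor(A)       # A = L L^T, no pivoting needed
    x = cho_solve((c, lower), b)   # two triangular solves
    print(np.allclose(A @ x, b))   # True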