Lecture 14
Solving systems of linear equations
Scott B. Baden, CSE 260, Fall 2006 (11/6/06)
Linear systems of equations
• A common task in scientific computation is to solve a system of linear equations
• These systems often result from discretizing a differential equation
• Example: linear system of 2 equations in 2 unknowns
(1) 2x + 3y = 8
(2) 3x + 2y = 7
• Rewriting equation (1): x = (8 - 3y)/2
• Substituting x into the LHS of equation (2): 3(8 - 3y)/2 + 2y = (24 - 9y)/2 + 2y
  (24 - 9y) + 4y = 14
  24 - 5y = 14
  5y = 10
  y = 2
• Back substituting the value of y into equation (1): x = 1
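As a quick check, the same 2 x 2 system can be solved numerically. A minimal sketch using NumPy (my illustration, not part of the original lecture):

    import numpy as np

    # the system above: 2x + 3y = 8, 3x + 2y = 7
    A = np.array([[2.0, 3.0],
                  [3.0, 2.0]])
    b = np.array([8.0, 7.0])

    print(np.linalg.solve(A, b))   # [1. 2.], i.e. x = 1, y = 2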
Matrix vector equations
• Our linear system of 2 equations in 2 unknowns …
  2x1 + 3x2 = 8
  3x1 + 2x2 = 7
• may be conveniently expressed in matrix notation: Ax = b
• When we solved for x1 = (8 - 3x2)/2 and substituted into the 2nd equation, we reduced the matrix to an equivalent form
• We multiplied row 1 of A by 3/2 and subtracted the scaled version from row 2 of A and b
  A = [ 2  3 ],   x = [ x1 ],   b = [ 8 ]
      [ 3  2 ]        [ x2 ]        [ 7 ]

  After the elimination step:

  A = [ 2    3  ],   b = [  8 ]
      [ 0  -2.5 ]        [ -5 ]
Rank 1 updates
• We call this a rank-1 update
• Multiplying row 1 by 3/2: [3 9/2]
• Subtracting from row 2:

  A = [ 2  3 ]   becomes   A = [ 2    3  ]
      [ 3  2 ]                 [ 0  -2.5 ]

• Similarly for b
• An applet for looking at 2x2 systems: www.freeboundaries.org/java/la_applets/GaussElim/
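The same update is easy to express directly on the array rows. A minimal NumPy sketch of this 2 x 2 rank-1 update (my illustration):

    import numpy as np

    A = np.array([[2.0, 3.0],
                  [3.0, 2.0]])
    b = np.array([8.0, 7.0])

    m = A[1, 0] / A[0, 0]    # multiplier 3/2
    A[1, :] -= m * A[0, :]   # row 2 of A becomes [0, -2.5]
    b[1] -= m * b[0]         # b becomes [8, -5]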
Gaussian Elimination
• The process of eliminating the non-zero values under the main diagonal is called Gaussian elimination, named after the mathematician Johann Carl Friedrich Gauss (1777-1855)
• Input: an n x n matrix corresponding to a linear system of n equations in n unknowns (must have a non-trivial solution)
• Eliminate the non-zero values under the main diagonal to produce an upper triangular matrix U

[Figure: upper triangular matrix U, with zeros below the main diagonal]
Solving the system of linear equations
• Step 1: obtain the upper triangular matrix U …
• Step 2: solve the corresponding upper triangular system Ux = c by back substitution

[Figure: upper triangular matrix U, with zeros below the main diagonal]
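Back substitution itself is short. A sketch of step 2 in Python (my illustration, assuming U is nonsingular):

    import numpy as np

    def back_substitute(U, c):
        """Solve Ux = c for upper triangular U by back substitution."""
        n = len(c)
        x = np.zeros(n)
        for i in range(n - 1, -1, -1):
            # subtract the contributions of the already-solved unknowns,
            # then divide by the diagonal entry
            x[i] = (c[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
        return x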
What are we computing?
• GE computes the LU factorization A = L U, where L is a lower triangular matrix and U is upper triangular
• Plugging LU into the original equation Ax = b:
  Ax = (LU) x = L (Ux) = Ly = b, where y = Ux
[Figure: A = L·U, with L lower triangular (zeros above the diagonal) and U upper triangular (zeros below the diagonal)]
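Libraries expose exactly this pair of triangular solves. A hedged sketch with SciPy's LU routines (my example, not the lecture's code):

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    A = np.array([[2.0, 3.0],
                  [3.0, 2.0]])
    b = np.array([8.0, 7.0])

    lu, piv = lu_factor(A)        # GE with partial pivoting
    x = lu_solve((lu, piv), b)    # solve Ly = b, then Ux = y
    print(x)                      # [1. 2.]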
Cost
• To solve Ax = b
  – Factorize A = LU using GE (2/3 n^3 flops)
  – Solve Ly = b for y using forward substitution (n^2 flops)
  – Solve Ux = y for x using back substitution (n^2 flops)
• We don't need to keep the multipliers (the L factor) explicitly unless we are solving with multiple right-hand sides b
Visualizing the algorithm
• Eliminating non-zeroes below the diagonal …
  – one column at a time
  – scanning from left to right

[Figure: snapshots after eliminating Column 0, Column 1, Column 2, …, Column n-1]
A 3 x 3 example
• Consider the following system of equations:

   x0 +  x1 +  x2 = 3
  4x0 + 3x1 + 4x2 = 8
  9x0 + 3x1 + 4x2 = 7

• We usually write the system as an augmented matrix:

  [ 1  1  1 | 3 ]
  [ 4  3  4 | 8 ]
  [ 9  3  4 | 7 ]
3 x 3 example
• Multiply row 0 by 4, and subtract from row 1:

  [ 1  1  1 | 3 ]
  [ 4  3  4 | 8 ]
  [ 9  3  4 | 7 ]

  [4 3 4 8] - 4*[1 1 1 3] = [0 -1 0 -4]

  [ 1   1  1 |  3 ]
  [ 0  -1  0 | -4 ]
  [ 9   3  4 |  7 ]
3 x 3 example
• Multiply row 0 by 9, and subtract from row 2:

  [ 1   1  1 |  3 ]
  [ 0  -1  0 | -4 ]
  [ 9   3  4 |  7 ]

  [9 3 4 7] - 9*[1 1 1 3] = [0 -6 -5 -20]

  [ 1   1   1 |   3 ]
  [ 0  -1   0 |  -4 ]
  [ 0  -6  -5 | -20 ]
3 x 3 example
• Eliminate the second column
• Multiply row 1 by 6, and subtract from row 2:

  [ 1   1   1 |   3 ]
  [ 0  -1   0 |  -4 ]
  [ 0  -6  -5 | -20 ]

  [0 -6 -5 -20] - 6*[0 -1 0 -4] = [0 0 -5 4]

  [ 1   1   1 |  3 ]
  [ 0  -1   0 | -4 ]
  [ 0   0  -5 |  4 ]
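The three row operations above can be replayed in a few lines of NumPy (my sketch, reproducing the matrices shown on these slides):

    import numpy as np

    # augmented matrix [A | b] for the 3 x 3 example
    M = np.array([[1., 1., 1., 3.],
                  [4., 3., 4., 8.],
                  [9., 3., 4., 7.]])

    M[1] -= 4 * M[0]                      # row 1: [0, -1,  0,  -4]
    M[2] -= 9 * M[0]                      # row 2: [0, -6, -5, -20]
    M[2] -= (M[2, 1] / M[1, 1]) * M[1]    # multiplier 6; row 2: [0, 0, -5, 4]
    print(M)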
Overview
for k = 0 to n-1        // for each column k
    for i = k+1 to n-1  // eliminate entries below the diagonal:
                        // subtract a multiple of row k from succeeding rows i
        A[i, k+1:n] -= (A[i,k] / A[k,k]) * A[k, k+1:n]
    end for
end for

[Figure: snapshots after eliminating Column 0, Column 1, Column 2, …, Column n-1]
Eliminating the entries below the diagonal

• For each column k = 0 to n-1 …
• … subtract multiples of row k, A[k, k+1:n] …
• … from rows i = k+1 to n-1
• Multipliers: m_ik = A[i,k] / A[k,k]
• … these cancel the elements below the diagonal, A[k+1:n-1, k]
• We only update to the right of and below A[k,k]

for i = k+1 to n-1
    A[i, k+1:n] -= m_ik * A[k, k+1:n]
end for

[Figure: row k holds the pivot A[k,k] and entries A[k,j]; row i holds A[i,k] and A[i,j]; only entries to the right of and below the pivot are updated]
Putting it all together
for k = 0 to n-1                          // for each column k
    m[k+1:n-1] = A[k+1:n-1, k] / A[k,k]   // compute multipliers
    for i = k+1 to n-1                    // for each row i > k
        A[i, k+1:n] -= m[i] * A[k, k+1:n] // scale row k by m[i] and subtract from row i
    end for
end for
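A direct Python/NumPy transcription of this pseudocode (a sketch for illustration; note it has no pivoting, so it breaks when a pivot A[k,k] is zero or tiny, which motivates the slides that follow):

    import numpy as np

    def gaussian_eliminate(A, b):
        """In-place GE without pivoting: reduce A to upper triangular form."""
        n = A.shape[0]
        for k in range(n - 1):
            m = A[k+1:, k] / A[k, k]                   # multipliers for column k
            A[k+1:, k+1:] -= np.outer(m, A[k, k+1:])   # rank-1 update of trailing block
            A[k+1:, k] = 0.0                           # entries below the pivot vanish
            b[k+1:] -= m * b[k]                        # same operations applied to b
        return A, b

Combined with the earlier back_substitute sketch, gaussian_eliminate followed by back_substitute(A, b) solves Ax = b.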
Problems with roundoff
• The rank-1 update step uses division …
  A[i, k+1:n] -= (A[i,k] / A[k,k]) * A[k, k+1:n]
• How do we handle denominators that vanish, or that are very small?
• Gaussian elimination will fail with this matrix:

  [ 0  1 ]
  [ 1  0 ]

• But we can avoid the problem if we swap rows:

  [ 1  0 ]
  [ 0  1 ]
Pivoting to avoid stability problems
• We call this process of swapping rows partial pivoting
• Assume we carry 3 decimal digits of precision
• Consider the following A matrix and RHS b:

  A = [ 10^-4  1 ],   b = [ 1 ]
      [   1    1 ]        [ 2 ]

• The correct solution (to 3 digits) is

  x = [ 1 ]
      [ 1 ]
Roundoff
• Eliminate the nonzero below the diagonal by subtracting 10^4 x row 0 from row 1:

  U | b = [ 10^-4      1     |    1     ]
          [   0     1 - 10^4 | 2 - 10^4 ]

• But 1 - 10^4 rounds to -10^4 (and so does 2 - 10^4):

  U | b = [ 10^-4    1    |    1   ]
          [   0    -10^4  |  -10^4 ]
Stability problems due to roundoff
• Thus

  U | b = [ 10^-4    1    |    1   ]
          [   0    -10^4  |  -10^4 ]

• We now back substitute to solve for x2 and then x1:
  -10^4 x2 = -10^4, so x2 = 1
• Substituting the value of x2 into the first equation:
  10^-4 x1 + 1*x2 = 1, so 10^-4 x1 = 0, so x1 = 0
• But the correct solution is x1 ≈ x2 ≈ 1
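This failure is easy to reproduce by rounding every intermediate result to 3 significant digits. A small Python sketch (my illustration; rnd is a hypothetical helper simulating the limited precision):

    import math

    def rnd(x, sig=3):
        # round x to `sig` significant decimal digits
        if x == 0.0:
            return 0.0
        return round(x, sig - 1 - int(math.floor(math.log10(abs(x)))))

    # the system: 1e-4*x1 + x2 = 1,  x1 + x2 = 2
    m   = rnd(1.0 / 1e-4)               # multiplier 1e4
    a22 = rnd(1.0 - m * 1.0)            # 1 - 1e4 rounds to -1e4
    b2  = rnd(2.0 - m * 1.0)            # 2 - 1e4 also rounds to -1e4
    x2  = rnd(b2 / a22)                 # x2 = 1
    x1  = rnd((1.0 - 1.0 * x2) / 1e-4)  # x1 = 0, though the true answer is ~1
    print(x1, x2)                       # 0.0 1.0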
Partial Pivoting
• Rule: pick the largest value (in magnitude) in the column, on or below the diagonal
• This is called partial pivoting, because only rows are swapped
• It can be shown that with partial pivoting we compute A = P L U, where P is a permutation matrix expressing the row swaps
• We can also swap columns: full pivoting
• But full pivoting is more expensive to implement
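Partial pivoting adds one step per column to the earlier sketch: find the row with the largest magnitude entry in the pivot column and swap it up (again my sketch, not the lecture's code):

    import numpy as np

    def ge_partial_pivot(A, b):
        """In-place GE with partial pivoting on A and b."""
        n = A.shape[0]
        for k in range(n - 1):
            # pick the row with the largest |A[i,k]| for i >= k
            p = k + np.argmax(np.abs(A[k:, k]))
            if p != k:
                A[[k, p]] = A[[p, k]]    # swap rows of A ...
                b[[k, p]] = b[[p, k]]    # ... and of b
            m = A[k+1:, k] / A[k, k]
            A[k+1:, k+1:] -= np.outer(m, A[k, k+1:])
            A[k+1:, k] = 0.0
            b[k+1:] -= m * b[k]
        return A, b

On the 10^-4 example above, the swap makes the multiplier 10^-4 instead of 10^4, and the computed solution stays near [1, 1].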
Parallelization
[Figure: 1D vertical strips owned by processes P0 … P5; shading shows the remaining work at successive steps]

• We'll use 1D vertical strip partitioning
• Each process owns N/p columns
• Consider the case where p = N = 6
• The shading represents outstanding work in succeeding iterations of k
Analyzing the code
• Let's look at the data parallel code to understand where the communication is occurring
• Assume a blocked decomposition on the 2nd axis
• Parallelism occurs in the array statements

for k = 0 to n-1                          // for each column k
    m[k+1:n-1] = A[k+1:n-1, k] / A[k,k]   // compute multipliers
    for i = k+1 to n-1                    // for each row i > k
        A[i, k+1:n] -= m[i] * A[k, k+1:n] // scale row k by m[i] and subtract from row i
    end for
end for
Determining communication requirements
• At each step k of the elimination, the processor owning column k (processor k div (n/p) under the blocked decomposition) is in charge: it computes the multipliers
• No communication is needed for this step: all the required data are owned by that processor

for k = 0 to n-1
    m[k+1:n-1] = A[k+1:n-1, k] / A[k,k]
    for i = k+1 to n-1
        A[i, k+1:n] -= m[i] * A[k, k+1:n]
    end for
end for
Determining communication requirements
• But the elements A[k, :] can have different owners
• The processor owning column j (processor j div (n/p)) owns A[k, j]
• What operation is needed to carry out the multiplication m[j] * A[k, :]?

for k = 0 to n-1
    m[k+1:n-1] = A[k+1:n-1, k] / A[k,k]
    for j = k+1 to n-1
        A[j, :] -= m[j] * A[k, :]
    end for
end for
Communication and control
• Each processor is in charge of eliminating N/p columns
• One processor chooses the pivot row and computes the multipliers
• The multipliers are then broadcast
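A hedged sketch of one elimination step under this 1D column partitioning, using mpi4py (my illustration of the communication pattern; A_loc, first_col, and col_owner are assumptions set up by the surrounding program):

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    def eliminate_step(k, n, A_loc, first_col, col_owner):
        root = col_owner(k)                   # process that owns column k
        if rank == root:
            j = k - first_col                 # local index of column k
            m = A_loc[k+1:, j] / A_loc[k, j]  # compute multipliers locally
        else:
            m = np.empty(n - k - 1)
        comm.Bcast(m, root=root)              # broadcast the multipliers
        # every process updates its own columns to the right of column k
        mine = slice(max(0, k + 1 - first_col), A_loc.shape[1])
        A_loc[k+1:, mine] -= np.outer(m, A_loc[k, mine])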
Updates
• All processes carry out updates
• The elements A[k, k+1:n] can have different owners
• The process owning column j (process j div (n/p)) owns the A[k, j] appearing in m[i] * A[k, k+1:n]

for i = k+1 to n-1
    A[i, k+1:n] -= m[i] * A[k, k+1:n]
end for
Performance
• Finding the pivot row is a serial bottleneck
  – Only one process owns the intersecting column
• Another bottleneck is load imbalance
  – When eliminating a column, processors to the left of it are idle
Cyclic decomposition improves load balance
• A cyclic decomposition evens out the workload
• A blocked cyclic decomposition improves locality and reduces communication overhead
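The difference between these mappings is just the owner function for a column. A small sketch contrasting them (illustrative names, not from the slides):

    def owner_block(j, n, p):
        # block: each process owns a contiguous strip of n/p columns
        return j // (n // p)

    def owner_cyclic(j, p):
        # cyclic: columns dealt out round-robin, evening out the load
        return j % p

    def owner_block_cyclic(j, b, p):
        # block cyclic: blocks of b columns dealt out round-robin
        return (j // b) % p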
In practice
• 1D is not scalable; 2D block cyclic decompositions are required
• More complicated, since additional communication steps are required
• The algorithm is blocked, as with matrix multiply
• ScaLAPACK is a well-known library that performs GE and many other useful operations involving matrices
• See http://www.netlib.org/scalapack
2D Block CYCLIC decomposition improves performance

A cyclic mapping tessellates a template of the processes over the entire domain

[Figure: a 2 x 2 template of processes (0 1 / 2 3) tiled repeatedly over the matrix]
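In owner-function form, the 2D block cyclic layout on a pr x pc process grid looks like this (my sketch, following the tiling just described):

    def owner_2d_block_cyclic(i, j, b, pr, pc):
        # coordinates of the process owning A[i, j] with b-by-b blocks
        return (i // b) % pr, (j // b) % pc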
Distributed GE with a 2D Block Cyclic Layout
• Choose pivot, compute multipliers, make swap decision
• Broadcast the swap decision and multipliers so each block column can participate
• Carry out the elimination
Sparse Matrices
• A matrix where knowledge about the location of the non-zeroes is useful
• Consider Jacobi's method with a 5-point stencil:

  u'[i,j] = (u[i-1,j] + u[i+1,j] + u[i,j-1] + u[i,j+1] - h^2 f[i,j]) / 4
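The same stencil can be viewed as a sparse matrix acting on the vector of unknowns. A sketch building the 5-point Laplacian with SciPy (my example, matching the 7 x 7 grid on the next slide):

    import scipy.sparse as sp

    n = 7                                  # 7 x 7 interior grid
    I = sp.identity(n)
    T = sp.diags([-1, 4, -1], [-1, 0, 1], shape=(n, n))
    # 2D 5-point operator via Kronecker products: 49 x 49, at most 5 nonzeros/row
    A = sp.kron(I, T) + sp.kron(sp.diags([-1, -1], [-1, 1], shape=(n, n)), I)
    print(A.shape, A.nnz)                  # (49, 49) 217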
Savings with Sparse Matrices
• 7 x 7 grid: 49 x 49 matrix; use Cholesky's method
• Nonzeroes: 349, 30% of the space
• Flops: 2643, 6.7%
• Fill-in: a zero entry that becomes nonzero during factorization
Managing fill in
• Fill-in and computation are reduced with an improved ordering
• The advantage increases with N: for n = 63 (N = n^2 = 3969)
  – 250,109 non-zeroes → 85,416
  – 16 million flops → 3.6 million

[Figure: Natural order: nonzeros = 233; Min. Degree order: nonzeros = 207]
What is different about sparse factorization?
• Minimize fill in
– Reordering
– Permutations not only maintain stability, but also minimize fill-in
– Trade off stability against sparseness
– We may be able to predict fill in
• Update only where there are non-zeroes
– Structure changes dynamically
– Elimination tree
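Sparse direct solvers expose these fill-reducing orderings directly. A hedged sketch with SciPy's SuperLU wrapper (my example; the matrix here is random, just to make the call runnable):

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import splu

    # a nonsingular sparse matrix in CSC form
    A = sp.random(100, 100, density=0.05, format='csc') \
        + 10 * sp.identity(100, format='csc')
    lu = splu(A, permc_spec='COLAMD')   # LU with a fill-reducing column ordering
    x = lu.solve(np.ones(100))          # triangular solves with the sparse factors
    print(lu.L.nnz, lu.U.nnz)           # fill-in shows up in the factor sizes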
Cholesky
• In some cases the matrix is both
  – Symmetric: A = A^T
  – Positive definite
• No need for pivoting
• We can use the Cholesky factorization: A = L * L^T
• Flop count cut in half
• What complications does this avoid?
• Result: many matrices can be transformed into this form
• SuperLU: Demmel and Li
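For the symmetric positive definite case, a minimal sketch of a Cholesky-based solve with SciPy (my example, not from the slides):

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    # a small symmetric positive definite matrix
    A = np.array([[4.0, 1.0],
                  [1.0, 3.0]])
    b = np.array([1.0, 2.0])

    c, lower = cho_factor(A)       # A = L L^T, no pivoting needed
    x = cho_solve((c, lower), b)   # two triangular solves
    print(np.allclose(A @ x, b))   # True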