Method of Conjugate Gradients

Dragica Vasileska

CG Method - Conjugate Gradient (CG) Method

M. R. Hestenes & E. Stiefel, 1952

BiCG Method - Biconjugate Gradient (BiCG) Method

C. Lanczos, 1952; D. A. H. Jacobs, 1981; C. F. Smith et al., 1990; R. Barrett et al., 1994

CGS Method - Conjugate Gradient Squared (CGS) Method (MATLAB Function)

P. Sonneveld, 1989

GMRES Method - Generalized Minimal Residual (GMRES) Method

Y. Saad & M. H. Schultz, 1986; R. Barrett et al., 1994; Y. Saad, 1996

QMR Method - Quasi-Minimal-Residual (QMR) Method

R. Freund & N. Nachtigal, 1990; N. Nachtigal, 1991; R. Barrett et al., 1994; Y. Saad, 1996

Conjugate Gradient Method

1. Introduction, Notation and Basic Terms
2. Eigenvalues and Eigenvectors
3. The Method of Steepest Descent
4. Convergence Analysis of the Method of Steepest Descent
5. The Method of Conjugate Directions
6. The Method of Conjugate Gradients
7. Convergence Analysis of Conjugate Gradient Method
8. Complexity of the Conjugate Gradient Method
9. Preconditioning Techniques
10. Conjugate Gradient Type Algorithms for Non-Symmetric Matrices (CGS, Bi-CGSTAB Method)

1. Introduction, Notation and Basic Terms

• The CG method is one of the most popular iterative methods for solving large systems of linear equations

$$Ax = b,$$

which arise in many important settings, such as finite difference and finite element methods for solving partial differential equations, circuit analysis, etc.

• It is suited for use with sparse matrices. If A is dense, the best choice is to factor A and solve the equation by back substitution.

• Almost all descent algorithms share a fundamental underlying structure: (1) one starts with an initial point; (2) determines, according to a fixed rule, a direction of movement; (3) moves in that direction to a relative minimum of the objective function; (4) at the new point, a new direction is determined and the process is repeated. The algorithms differ in the rule by which successive directions of movement are selected.

1. Introduction, Notation and Basic Terms (Cont'd)

• A matrix is a rectangular array of numbers, called elements.

• The transpose of an $m \times n$ matrix $A$ is the $n \times m$ matrix $A^T$ with elements

$$a^T_{ij} = a_{ji}$$

• Two square $n \times n$ matrices $A$ and $B$ are similar if there is a nonsingular matrix $S$ such that

$$B = S^{-1} A S$$

• Matrices having a single row are called row vectors; matrices having a single column are called column vectors.

Row vector: $a = [a_1, a_2, \ldots, a_n]$
Column vector: $a = (a_1, a_2, \ldots, a_n)$

• The inner product of two vectors is written as

$$x^T y = \sum_{i=1}^{n} x_i y_i$$

• A matrix is called symmetric if $a_{ij} = a_{ji}$.

• A matrix is positive definite if, for every nonzero vector $x$,

$$x^T A x > 0$$

• Basic matrix identities: $(AB)^T = B^T A^T$ and $(AB)^{-1} = B^{-1} A^{-1}$

• A quadratic form is a scalar, quadratic function of a vector with the form

$$f(x) = \tfrac{1}{2} x^T A x - b^T x + c$$

• The gradient of a quadratic form is

$$f'(x) = \begin{bmatrix} \dfrac{\partial}{\partial x_1} f(x) \\ \vdots \\ \dfrac{\partial}{\partial x_n} f(x) \end{bmatrix} = \tfrac{1}{2} A^T x + \tfrac{1}{2} A x - b$$

• Various quadratic forms for an arbitrary $2 \times 2$ matrix:

[Figure: four panels showing the quadratic form for a positive-definite matrix, a negative-definite matrix, a singular positive-indefinite matrix, and an indefinite matrix]

• Specific example of a $2 \times 2$ symmetric positive definite matrix:

$$A = \begin{bmatrix} 3 & 2 \\ 2 & 6 \end{bmatrix}, \quad b = \begin{bmatrix} 2 \\ -8 \end{bmatrix}, \quad c = 0, \quad x = \begin{bmatrix} 2 \\ -2 \end{bmatrix}$$

[Figure: three panels showing the graph of the quadratic form f(x), the contours of the quadratic form, and the gradient of the quadratic form]

2. Eigenvalues and Eigenvectors

• For any $n \times n$ matrix $B$, a scalar $\lambda$ and a nonzero vector $v$ that satisfy the equation

$$Bv = \lambda v$$

are said to be an eigenvalue and an eigenvector of $B$.

• If the matrix is symmetric, then the following properties hold:

(a) the eigenvalues of $B$ are real
(b) eigenvectors associated with distinct eigenvalues are orthogonal

• A symmetric matrix $B$ is positive definite (or positive semidefinite) if and only if all eigenvalues of $B$ are positive (or nonnegative).

• Why should we care about the eigenvalues? Iterative methods often depend on applying $B$ to a vector over and over again:

(a) If $|\lambda| < 1$, then $B^i v = \lambda^i v$ vanishes as $i$ approaches infinity
(b) If $|\lambda| > 1$, then $B^i v = \lambda^i v$ will grow to infinity.

2. Eigenvalues and Eigenvectors (Cont'd)

• Examples for $|\lambda| < 1$ and $|\lambda| > 1$

[Figure: repeated application of B to eigenvectors; surviving panel labels: $\lambda = -0.5$, $\lambda = 0.7$, $\lambda = -2$]

• The spectral radius of a matrix is: $\rho(B) = \max_i |\lambda_i|$

• Example: The Jacobi Method for the solution of $Ax = b$

- Split the matrix $A$ such that $A = D + E$, where $D$ holds the diagonal of $A$ and $E$ the off-diagonal entries

- Instead of solving the system $Ax = b$, one solves the modified system $x = Bx + z$, where $B = -D^{-1}E$ and $z = D^{-1}b$

- The iterative method is $x_{(i+1)} = Bx_{(i)} + z$, or, in terms of the error vector, $e_{(i+1)} = Be_{(i)}$

- For our particular example, the eigenvalues and the eigenvectors of the matrix $A$ are:

$\lambda_1 = 7$, $v_1 = [1, 2]^T$

$\lambda_2 = 2$, $v_2 = [-2, 1]^T$

- The eigenvalues and eigenvectors of the matrix $B$ are:

$\lambda_1 = -\sqrt{2}/3 \approx -0.47$, $v_1 = [\sqrt{2}, 1]^T$

$\lambda_2 = \sqrt{2}/3 \approx 0.47$, $v_2 = [-\sqrt{2}, 1]^T$

• Graphical representation of the convergence of the Jacobi method

[Figure: eigenvectors of B together with their eigenvalues; convergence of the Jacobi method, which starts at $[-2, -2]^T$ and converges to $[2, -2]^T$; the error vectors $e_{(0)}$, $e_{(1)}$ and $e_{(2)}$]
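
The Jacobi iteration for this example can be tried numerically. The following is a minimal sketch, assuming the example system $A = [[3, 2], [2, 6]]$, $b = [2, -8]^T$ and the starting point $[-2, -2]^T$ from the slides; the function name `jacobi` is illustrative and not part of the original material.

```python
# Jacobi iteration x_(i+1) = B x_(i) + z, written component-wise:
# x_new[i] = (b[i] - sum_{j != i} A[i][j] * x[j]) / A[i][i]
A = [[3.0, 2.0], [2.0, 6.0]]
b = [2.0, -8.0]

def jacobi(A, b, x, iterations):
    """Run a fixed number of Jacobi sweeps starting from x (hypothetical helper)."""
    n = len(b)
    for _ in range(iterations):
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
    return x

x2 = jacobi(A, b, [-2.0, -2.0], 2)    # two sweeps
x = jacobi(A, b, [-2.0, -2.0], 60)    # many sweeps, close to [2, -2]
print(x2, x)
```

For this matrix $B^2 = (2/9) I$, so every two sweeps scale the error by $2/9$, which agrees with the spectral radius $\rho(B) = \sqrt{2}/3$.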

3. The Method of Steepest Descent

• In the method of steepest descent, one starts with an arbitrary point $x_{(0)}$ and takes a series of steps $x_{(1)}, x_{(2)}, \ldots$ until one is satisfied that the iterate is close enough to the solution.

• When taking the step, one chooses the direction in which $f$ decreases most quickly, i.e.

$$-f'(x_{(i)}) = b - Ax_{(i)}$$

• Definitions:

error vector: $e_{(i)} = x_{(i)} - x$
residual: $r_{(i)} = b - Ax_{(i)}$

• From $Ax = b$, it follows that

$$r_{(i)} = -Ae_{(i)} = -f'(x_{(i)})$$

The residual is actually the direction of steepest descent.

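3. The Method of Steepest Descent (Cont'd)

• Start with some vector $x_{(0)}$. The next vector falls along the solid line

$$x_{(1)} = x_{(0)} + \alpha r_{(0)}$$

• The magnitude of the step is determined with a line search procedure that minimizes $f$ along the line:

$$\begin{aligned}
\frac{d}{d\alpha} f(x_{(1)}) &= f'(x_{(1)})^T \frac{d}{d\alpha} x_{(1)} = f'(x_{(1)})^T r_{(0)} = 0 \\
r_{(1)}^T r_{(0)} &= 0 \\
(b - Ax_{(1)})^T r_{(0)} &= 0 \\
\left(b - A(x_{(0)} + \alpha r_{(0)})\right)^T r_{(0)} &= 0 \\
(b - Ax_{(0)})^T r_{(0)} - \alpha (Ar_{(0)})^T r_{(0)} &= 0 \\
r_{(0)}^T r_{(0)} &= \alpha\, r_{(0)}^T A r_{(0)} \\
\alpha &= \frac{r_{(0)}^T r_{(0)}}{r_{(0)}^T A r_{(0)}}
\end{aligned}$$

• Geometrical representation

[Figure: geometrical representation of the line search, panels (e) and (f)]

• The algorithm

$$\begin{aligned}
r_{(i)} &= b - Ax_{(i)} \\
\alpha_{(i)} &= \frac{r_{(i)}^T r_{(i)}}{r_{(i)}^T A r_{(i)}} \\
x_{(i+1)} &= x_{(i)} + \alpha_{(i)} r_{(i)}, \qquad e_{(i+1)} = e_{(i)} + \alpha_{(i)} r_{(i)}
\end{aligned}$$

Two matrix-vector multiplications are required.

• To avoid one matrix-vector multiplication, one uses

$$r_{(i+1)} = r_{(i)} - \alpha_{(i)} A r_{(i)}$$

The disadvantage of using this recurrence is that the residual sequence is determined without any feedback from the value of $x_{(i)}$, so that round-off errors may cause $x_{(i)}$ to converge to some point near $x$ rather than to $x$ itself.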



• Efficient implementation of the method of steepest descent

    i ← 0
    r ← b - Ax
    δ ← r^T r
    δ_0 ← δ
    While i < i_max and δ > ε² δ_0 do
        q ← Ar
        α ← δ / (r^T q)
        x ← x + αr
        If i is divisible by 50
            r ← b - Ax
        else
            r ← r - αq
        endif
        δ ← r^T r
        i ← i + 1
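
The steepest descent loop above can be sketched in plain Python as follows. This is an illustrative transcription under stated assumptions, not a tuned implementation: matrices are lists of rows, and the helper names `matvec`, `dot` and `steepest_descent` are made up for this example.

```python
def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * vj for a, vj in zip(row, v)) for row in A]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def steepest_descent(A, b, x, imax=1000, eps=1e-10):
    """Steepest descent for SPD A; returns (x, number of iterations)."""
    r = [bi - ai for bi, ai in zip(b, matvec(A, x))]
    delta = dot(r, r)
    delta0 = delta
    i = 0
    while i < imax and delta > eps ** 2 * delta0:
        q = matvec(A, r)
        alpha = delta / dot(r, q)              # exact line search step
        x = [xi + alpha * ri for xi, ri in zip(x, r)]
        if i % 50 == 0:
            # recompute the exact residual to remove accumulated round-off
            r = [bi - ai for bi, ai in zip(b, matvec(A, x))]
        else:
            r = [ri - alpha * qi for ri, qi in zip(r, q)]
        delta = dot(r, r)
        i += 1
    return x, i

A = [[3.0, 2.0], [2.0, 6.0]]
b = [2.0, -8.0]
x_sd, its_sd = steepest_descent(A, b, [-2.0, -2.0])
print(x_sd, its_sd)
```

Recomputing $r = b - Ax$ every 50 iterations costs one extra matrix-vector product but ties the residual back to the current iterate.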

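4. Convergence Analysis of the Method of Steepest Descent

• Case 1: the error vector $e_{(i)}$ is equal to an eigenvector $v_j$ of $A$

$$\begin{aligned}
r_{(i)} &= -Ae_{(i)} = -Av_j = -\lambda_j v_j \\
\alpha_{(i)} &= \frac{r_{(i)}^T r_{(i)}}{r_{(i)}^T A r_{(i)}} = \frac{1}{\lambda_j} \\
e_{(i+1)} &= e_{(i)} + \alpha_{(i)} r_{(i)} = 0
\end{aligned}$$

• Case 2: the error vector $e_{(i)}$ is a linear combination of the eigenvectors of $A$, and all eigenvalues are the same ($\lambda_j = \lambda$)

$$\begin{aligned}
e_{(i)} &= \sum_{j=1}^{n} \xi_j v_j \\
r_{(i)} &= -Ae_{(i)} = -\sum_{j=1}^{n} \xi_j \lambda_j v_j = -\lambda \sum_{j=1}^{n} \xi_j v_j \\
\alpha_{(i)} &= \frac{r_{(i)}^T r_{(i)}}{r_{(i)}^T A r_{(i)}} = \frac{1}{\lambda} \\
e_{(i+1)} &= e_{(i)} + \alpha_{(i)} r_{(i)} = 0
\end{aligned}$$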

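4. Convergence Analysis of the Method of Steepest Descent (Cont'd)

• General convergence of the method can be proven by calculating the energy norm

$$\|e_{(i+1)}\|_A^2 = e_{(i+1)}^T A e_{(i+1)} = \|e_{(i)}\|_A^2\, \omega^2, \qquad \omega^2 = 1 - \frac{\left(\sum_j \xi_j^2 \lambda_j^2\right)^2}{\left(\sum_j \xi_j^2 \lambda_j^3\right)\left(\sum_j \xi_j^2 \lambda_j\right)}$$

• For $n = 2$, one has that

$$\omega^2 = 1 - \frac{(\kappa^2 + \mu^2)^2}{(\kappa + \mu^2)(\kappa^3 + \mu^2)}, \qquad \kappa = \lambda_{\max}/\lambda_{\min} = \lambda_1/\lambda_2, \quad \mu = \xi_2/\xi_1$$

where $\kappa$ is the condition number.

[Figure: the influence of the spectral condition number on the convergence]

Worst-case scenario: $\mu = \pm\kappa$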

5. The Method of Conjugate Directions

• Basic idea:

1. Pick a set of orthogonal search directions $d_{(0)}, d_{(1)}, \ldots, d_{(n-1)}$
2. Take exactly one step in each search direction to line up with $x$

• Mathematical formulation:

1. For each step we choose a point

$$x_{(i+1)} = x_{(i)} + \alpha_{(i)} d_{(i)}$$

2. To find $\alpha_{(i)}$, we use the fact that $e_{(i+1)}$ is orthogonal to $d_{(i)}$

$$\begin{aligned}
d_{(i)}^T e_{(i+1)} &= 0 \\
d_{(i)}^T \left(e_{(i)} + \alpha_{(i)} d_{(i)}\right) &= 0 \\
\alpha_{(i)} &= -\frac{d_{(i)}^T e_{(i)}}{d_{(i)}^T d_{(i)}}
\end{aligned}$$

5. The Method of Conjugate Directions (Cont'd)

• To solve the problem of not knowing $e_{(i)}$, one makes the search directions A-orthogonal rather than orthogonal to each other, i.e.:

$$d_{(i)}^T A d_{(j)} = 0, \qquad i \neq j$$

• The new requirement is now that $e_{(i+1)}$ is A-orthogonal to $d_{(i)}$. This is equivalent to minimizing $f$ along the search direction:

$$\begin{aligned}
\frac{d}{d\alpha} f(x_{(i+1)}) &= 0 \\
f'(x_{(i+1)})^T \frac{d}{d\alpha} x_{(i+1)} &= 0 \\
-r_{(i+1)}^T d_{(i)} &= 0 \\
d_{(i)}^T A e_{(i+1)} &= 0 \\
d_{(i)}^T A \left(e_{(i)} + \alpha_{(i)} d_{(i)}\right) &= 0 \\
\alpha_{(i)} &= -\frac{d_{(i)}^T A e_{(i)}}{d_{(i)}^T A d_{(i)}} = \frac{d_{(i)}^T r_{(i)}}{d_{(i)}^T A d_{(i)}}
\end{aligned}$$

If the search vectors were the residuals, this formula would be identical to the one used in the method of steepest descent.

• Proof that this process computes $x$ in $n$ steps:

- express the error term as a linear combination of the search directions

$$e_{(0)} = \sum_{j=0}^{n-1} \delta_j d_{(j)}$$

- use the fact that the search directions are A-orthogonal

• Calculation of the A-orthogonal search directions by a conjugate Gram-Schmidt process

1. Take a set of linearly independent vectors $u_0, u_1, \ldots, u_{n-1}$
2. Set $d_{(0)} = u_0$
3. For $i > 0$, take $u_i$ and subtract from it all components that are not A-orthogonal to the previous search directions

$$d_{(i)} = u_i + \sum_{j=0}^{i-1} \beta_{ij} d_{(j)}, \qquad \beta_{ij} = -\frac{u_i^T A d_{(j)}}{d_{(j)}^T A d_{(j)}}$$

[Figure: the method of Conjugate Directions using the axial unit vectors as basis vectors]

• Important observations:

1. The error vector is A-orthogonal to all previous search directions
2. The residual vector is orthogonal to all previous search directions
3. The residual vector is also orthogonal to all previous basis vectors.

$$d_{(i)}^T A e_{(j)} = 0, \qquad d_{(i)}^T r_{(j)} = 0, \qquad u_i^T r_{(j)} = 0 \qquad \text{for } i < j$$
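
The conjugate Gram-Schmidt construction and the resulting $n$-step solve can be sketched as follows. This is a minimal illustration, assuming the axial unit vectors as the basis $u_i$ and the $2 \times 2$ example system from the introduction; the helper names are hypothetical.

```python
def matvec(A, v):
    return [sum(a * vj for a, vj in zip(row, v)) for row in A]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def conjugate_directions(A, b, x):
    """Solve Ax = b in n steps with A-orthogonalized unit vectors."""
    n = len(b)
    dirs = []
    for i in range(n):
        u = [1.0 if k == i else 0.0 for k in range(n)]   # axial unit vector u_i
        d = u[:]
        for dj in dirs:
            Adj = matvec(A, dj)
            beta = -dot(u, Adj) / dot(dj, Adj)           # beta_ij
            d = [dk + beta * djk for dk, djk in zip(d, dj)]
        dirs.append(d)
        r = [bi - ai for bi, ai in zip(b, matvec(A, x))]
        alpha = dot(d, r) / dot(d, matvec(A, d))         # step length alpha_(i)
        x = [xi + alpha * di for xi, di in zip(x, d)]
    return x, dirs

A = [[3.0, 2.0], [2.0, 6.0]]
b = [2.0, -8.0]
x_cd, dirs = conjugate_directions(A, b, [0.0, 0.0])
print(x_cd, dirs)
```

Storing all previous directions is what makes plain conjugate directions expensive; the conjugate gradient method removes this cost.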

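6. The Method of Conjugate Gradients

• The method of Conjugate Gradients is simply the method of conjugate directions where the search directions are constructed by conjugation of the residuals, i.e. $u_i = r_{(i)}$

• This allows us to simplify the calculation of the new search direction because

$$\beta_{ij} = \begin{cases} \dfrac{1}{\alpha_{(i-1)}} \dfrac{r_{(i)}^T r_{(i)}}{d_{(i-1)}^T A d_{(i-1)}} = \dfrac{r_{(i)}^T r_{(i)}}{r_{(i-1)}^T r_{(i-1)}}, & i = j + 1 \\[2ex] 0, & i > j + 1 \end{cases}$$

• The new search direction is determined as a linear combination of the previous search direction and the new residual

$$d_{(i+1)} = r_{(i+1)} + \beta_{(i+1)} d_{(i)}, \qquad \beta_{(i+1)} = \frac{r_{(i+1)}^T r_{(i+1)}}{r_{(i)}^T r_{(i)}}$$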

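6. The Method of Conjugate Gradients (Cont'd)

• Efficient implementation

    i ← 0
    r ← b - Ax
    d ← r
    δ_new ← r^T r
    δ_0 ← δ_new
    While i < i_max and δ_new > ε² δ_0 do
        q ← Ad
        α ← δ_new / (d^T q)
        x ← x + αd
        If i is divisible by 50
            r ← b - Ax
        else
            r ← r - αq
        endif
        δ_old ← δ_new
        δ_new ← r^T r
        β ← δ_new / δ_old
        d ← r + βd
        i ← i + 1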
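
A plain-Python sketch of the conjugate gradient loop above is given below. It is an illustrative transcription under the assumption that A is symmetric positive definite and stored as a list of rows; `conjugate_gradient`, `matvec` and `dot` are hypothetical names.

```python
def matvec(A, v):
    return [sum(a * vj for a, vj in zip(row, v)) for row in A]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def conjugate_gradient(A, b, x, imax=1000, eps=1e-10):
    """Conjugate gradients for SPD A; returns (x, number of iterations)."""
    r = [bi - ai for bi, ai in zip(b, matvec(A, x))]
    d = r[:]
    delta_new = dot(r, r)
    delta0 = delta_new
    i = 0
    while i < imax and delta_new > eps ** 2 * delta0:
        q = matvec(A, d)
        alpha = delta_new / dot(d, q)
        x = [xi + alpha * di for xi, di in zip(x, d)]
        if i % 50 == 0:
            r = [bi - ai for bi, ai in zip(b, matvec(A, x))]   # exact residual
        else:
            r = [ri - alpha * qi for ri, qi in zip(r, q)]      # cheap recurrence
        delta_old = delta_new
        delta_new = dot(r, r)
        beta = delta_new / delta_old
        d = [ri + beta * di for ri, di in zip(r, d)]           # new search direction
        i += 1
    return x, i

A = [[3.0, 2.0], [2.0, 6.0]]
b = [2.0, -8.0]
x_cg, its_cg = conjugate_gradient(A, b, [-2.0, -2.0])
print(x_cg, its_cg)
```

For the $2 \times 2$ example the loop should stop after at most two iterations, in line with the finite termination property.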

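The Landscape of Ax=b Solvers

                               Direct (A = LU)    Iterative (y' = Ay)
Non-symmetric                  Pivoting LU        GMRES, BiCGSTAB, ...
Symmetric positive definite    Cholesky           Conjugate gradient

Direct methods are more robust; iterative methods need less storage (if sparse). Methods for symmetric positive definite matrices are more robust; methods for non-symmetric matrices are more general.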

Conjugate gradient iteration

• One matrix-vector multiplication per iteration
• Two vector dot products per iteration
• Four n-vectors of working storage

    x_0 = 0, r_0 = b, d_0 = r_0
    for k = 1, 2, 3, . . .
        α_k = (r_{k-1}^T r_{k-1}) / (d_{k-1}^T A d_{k-1})    step length
        x_k = x_{k-1} + α_k d_{k-1}                          approx solution
        r_k = r_{k-1} - α_k A d_{k-1}                        residual
        β_k = (r_k^T r_k) / (r_{k-1}^T r_{k-1})              improvement
        d_k = r_k + β_k d_{k-1}                              search direction

7. Convergence Analysis of the CG Method

• If the algorithm is performed in exact arithmetic, the exact solution is obtained in at most n steps.

• When finite precision arithmetic is used, rounding errors lead to gradual loss of orthogonality among the residuals, and the finite termination property of the method is lost.

• If the matrix A has only m distinct eigenvalues, then the CG method will converge in at most m iterations.

• An error bound on the CG method can be obtained in terms of the A-norm. After k iterations:

$$\|x - x_k\|_A \le 2\, \|x - x_0\|_A \left(\frac{\sqrt{\kappa(A)} - 1}{\sqrt{\kappa(A)} + 1}\right)^k$$

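8. Complexity of the CG Method

• The dominant operations of the Steepest Descent and Conjugate Gradient methods are the matrix-vector products, which cost $O(m)$ operations each, where $m$ is the number of nonzero entries of $A$. Both methods have spatial complexity $O(m)$.

• Suppose that after $i$ iterations one requires that $\|e_{(i)}\|_A \le \varepsilon \|e_{(0)}\|_A$. The number of iterations needed is:

(a) Steepest Descent: $i \le \left\lceil \tfrac{1}{2} \kappa \ln\left(\tfrac{1}{\varepsilon}\right) \right\rceil$

(b) Conjugate Gradient: $i \le \left\lceil \tfrac{1}{2} \sqrt{\kappa} \ln\left(\tfrac{2}{\varepsilon}\right) \right\rceil$

• The time complexity of these two methods is:

(a) Steepest Descent: $O(m\kappa)$

(b) Conjugate Gradient: $O(m\sqrt{\kappa})$

• Stopping criterion: $\|r_{(i)}\| \le \varepsilon \|r_{(0)}\|$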
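
As a worked example of the two iteration bounds, the short sketch below evaluates them for the $2 \times 2$ example matrix, whose condition number is $\kappa = 7/2$, with a tolerance $\varepsilon = 10^{-6}$ chosen purely for illustration. The function names are hypothetical.

```python
import math

def sd_iteration_bound(kappa, eps):
    """Upper bound on steepest descent iterations: ceil(kappa/2 * ln(1/eps))."""
    return math.ceil(0.5 * kappa * math.log(1.0 / eps))

def cg_iteration_bound(kappa, eps):
    """Upper bound on CG iterations: ceil(sqrt(kappa)/2 * ln(2/eps))."""
    return math.ceil(0.5 * math.sqrt(kappa) * math.log(2.0 / eps))

kappa = 7.0 / 2.0     # lambda_max / lambda_min of the example matrix
eps = 1e-6            # illustrative tolerance
sd_bound = sd_iteration_bound(kappa, eps)
cg_bound = cg_iteration_bound(kappa, eps)
print(sd_bound, cg_bound)
```

These are worst-case bounds; for this small system CG terminates after two steps, far below its bound.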

9. Preconditioning Techniques

References:

• J.A. Meijerink and H.A. van der Vorst, "An iterative solution method for linear systems of which the coefficient matrix is a symmetric M-matrix," Mathematics of Computation, Vol. 31, No. 137, pp. 148-162 (1977).

• J.A. Meijerink and H.A. van der Vorst, "Guidelines for the usage of incomplete decompositions in solving sets of linear equations as they occur in practical problems," Journal of Computational Physics, Vol. 44, pp. 134-155 (1981).

• D.S. Kershaw, "The incomplete Cholesky-conjugate gradient method for the iterative solution of systems of linear equations," Journal of Computational Physics, Vol. 26, pp. 43-65 (1978).

• H.A. van der Vorst, "High performance preconditioning," SIAM Journal on Scientific and Statistical Computing, Vol. 10, No. 6, pp. 1174-1185 (1989).

9. Preconditioning Techniques (Cont'd)

• Preconditioning is a technique for improving the condition number of a matrix. Suppose that $M$ is a symmetric, positive-definite matrix that approximates $A$, but is easier to invert. Then, rather than solving the original system

$$Ax = b$$

one solves the modified system

$$M^{-1}Ax = M^{-1}b$$

• The effectiveness of the preconditioner depends upon the condition number of $M^{-1}A$.

• Intuitively, preconditioning is an attempt to stretch the quadratic form to appear more spherical, so that the eigenvalues become closer to each other.

• Derivation:

$$M = EE^T, \qquad E^{-1}AE^{-T}\tilde{x} = E^{-1}b, \qquad \tilde{x} = E^T x$$

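9. Preconditioning Techniques (Cont'd)

• Transformed Preconditioned CG (PCG) Algorithm

    i ← 0
    r~ ← E^{-1} b - E^{-1} A E^{-T} x~
    d~ ← r~
    δ_new ← r~^T r~
    δ_0 ← δ_new
    While i < i_max and δ_new > ε² δ_0 do
        q~ ← E^{-1} A E^{-T} d~
        α ← δ_new / (d~^T q~)
        x~ ← x~ + α d~
        r~ ← E^{-1} b - E^{-1} A E^{-T} x~   -or-   r~ ← r~ - α q~
        δ_old ← δ_new
        δ_new ← r~^T r~
        β ← δ_new / δ_old
        d~ ← r~ + β d~
        i ← i + 1

• Untransformed PCG Algorithm, obtained with $\tilde{r} = E^{-1} r$ and $\tilde{d} = E^T d$

    i ← 0
    r ← b - Ax
    d ← M^{-1} r
    δ_new ← r^T M^{-1} r
    δ_0 ← δ_new
    While i < i_max and δ_new > ε² δ_0 do
        q ← Ad
        α ← δ_new / (d^T q)
        x ← x + αd
        r ← b - Ax   -or-   r ← r - αq
        δ_old ← δ_new
        δ_new ← r^T M^{-1} r
        β ← δ_new / δ_old
        d ← M^{-1} r + βd
        i ← i + 1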
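
The untransformed PCG algorithm can be sketched in plain Python as below. This is a minimal illustration, assuming a diagonal (Jacobi) preconditioner $M = \mathrm{diag}(A)$ and a small $3 \times 3$ test system invented for this example; the function names and the callback `apply_Minv` are hypothetical.

```python
def matvec(A, v):
    return [sum(a * vj for a, vj in zip(row, v)) for row in A]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def pcg(A, b, x, apply_Minv, imax=1000, eps=1e-10):
    """Untransformed PCG; apply_Minv(r) must return M^{-1} r."""
    r = [bi - ai for bi, ai in zip(b, matvec(A, x))]
    d = apply_Minv(r)
    delta_new = dot(r, d)                  # r^T M^{-1} r
    delta0 = delta_new
    i = 0
    while i < imax and delta_new > eps ** 2 * delta0:
        q = matvec(A, d)
        alpha = delta_new / dot(d, q)
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        s = apply_Minv(r)                  # preconditioned residual
        delta_old = delta_new
        delta_new = dot(r, s)
        beta = delta_new / delta_old
        d = [si + beta * di for si, di in zip(s, d)]
        i += 1
    return x, i

# Illustrative SPD system with known solution [1, 2, 3]
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [6.0, 10.0, 8.0]
jacobi_Minv = lambda r: [ri / A[k][k] for k, ri in enumerate(r)]
x_pcg, its_pcg = pcg(A, b, [0.0, 0.0, 0.0], jacobi_Minv)
print(x_pcg, its_pcg)
```

Passing the preconditioner as a function keeps the loop independent of how $M^{-1}r$ is computed, so a triangular solve with an incomplete factor could be substituted without changing the iteration.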

9. Preconditioning Techniques (Cont'd)

There are several types of preconditioners that can be used:

(a) Preconditioners based on splitting of the matrix A, i.e. $A = M - N$

(b) Complete or incomplete factorization of A, e.g. $A = LL^T + E$

(c) Approximation of the inverse, $M^{-1} \approx A^{-1}$

(d) Reordering of the equations and/or unknowns, e.g. Domain Decomposition

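9. Preconditioning Techniques (Cont'd)

(a) Preconditioning based on splittings of A

• Diagonal Preconditioning: $M = D = \mathrm{diag}\{a_{11}, a_{22}, \ldots, a_{nn}\}$

• Tridiagonal Preconditioning

• Block Diagonal Preconditioning

Example system:

$$A = \begin{bmatrix} 3 & 2 \\ 2 & 6 \end{bmatrix}, \quad b = \begin{bmatrix} 2 \\ -8 \end{bmatrix}, \quad c = 0, \quad x = \begin{bmatrix} 2 \\ -2 \end{bmatrix}$$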

9. Preconditioning Techniques (Cont'd)

(b) Preconditioning based on factorization of A

• Cholesky Factorization (for a symmetric and positive definite matrix) as a direct method:

$$A = LL^T, \qquad x = L^{-T}(L^{-1}b)$$

where:

$$L_{ii} = \left(A_{ii} - \sum_{k=1}^{i-1} L_{ik}^2\right)^{1/2}, \qquad L_{ji} = \frac{A_{ji} - \sum_{k=1}^{i-1} L_{jk} L_{ik}}{L_{ii}}, \quad j = i+1, i+2, \ldots, n$$

• Modified Cholesky factorization: $A = LDL^T$
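
The Cholesky formulas and the two triangular solves $x = L^{-T}(L^{-1}b)$ can be sketched as follows. This is a dense, unoptimized illustration for small matrices, using the $2 \times 2$ example system; the function names are hypothetical.

```python
import math

def cholesky(A):
    """Return lower-triangular L with A = L L^T (A must be SPD)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # diagonal entry L_ii
        L[i][i] = math.sqrt(A[i][i] - sum(L[i][k] ** 2 for k in range(i)))
        # entries below the diagonal in column i
        for j in range(i + 1, n):
            L[j][i] = (A[j][i] - sum(L[j][k] * L[i][k] for k in range(i))) / L[i][i]
    return L

def cholesky_solve(L, b):
    """Solve L L^T x = b by forward and back substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):                       # forward: L y = b
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):             # backward: L^T x = y
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

A = [[3.0, 2.0], [2.0, 6.0]]
b = [2.0, -8.0]
L = cholesky(A)
x_chol = cholesky_solve(L, b)
print(L, x_chol)
```

An incomplete factorization would follow the same loops but skip entries outside a prescribed sparsity pattern.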

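9. Preconditioning Techniques (Cont'd)

(b) Preconditioning based on factorization of A

• Incomplete Cholesky Factorization:

$A = LL^T + E$ or $A = LDL^T + E$, where $L_{ij} = 0$ if $(i,j) \in P$

Simplest choice is: $P = \{(i,j) \mid a_{ij} = 0;\ i,j = 1, 2, \ldots, n\}$ => ICCG(0)

[Figure: sparsity patterns of $A$ (diagonals $a_i$, $b_i$, $c_i$), of $L^T$ (diagonals $\tilde{a}_i$, $\tilde{b}_i$, $\tilde{c}_i$) and of $D$ (diagonal $\tilde{d}_i$)]

$$\tilde{a}_i = \tilde{d}_i^{-1} = a_i - \tilde{b}_{i-1}^2 \tilde{d}_{i-1} - \tilde{c}_{i-m}^2 \tilde{d}_{i-m}, \qquad \tilde{b}_i = b_i, \quad \tilde{c}_i = c_i, \qquad i = 1, 2, \ldots, n$$

• Modified ICCG(0):

$$M = (L + D) D^{-1} (D + L^T), \qquad \mathrm{diag}(M) = \mathrm{diag}(A), \qquad d_i = a_i - \frac{b_{i-1}^2}{d_{i-1}} - \frac{c_{i-m}^2}{d_{i-m}}$$

with $L$ the strictly lower triangular part of $A$.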

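9. Preconditioning Techniques (Cont'd)

(c) Preconditioning based on approximation of the inverse of A

• If $\rho(J) < 1$, then the inverse of $I - J$ is

$$(I - J)^{-1} = \sum_{k=0}^{\infty} J^k = I + J + J^2 + J^3 + \cdots$$

• Write matrix A as:

$$A = D + L + U = D\left(I + D^{-1}(L + U)\right) \quad\Rightarrow\quad J = -D^{-1}(L + U)$$

• The inverse of A can be expressed as:

$$A^{-1} = \left(D(I - J)\right)^{-1} = (I - J)^{-1} D^{-1} = \sum_{k=0}^{\infty} (-1)^k \left(D^{-1}(L + U)\right)^k D^{-1}$$

• For k=1, one has that:

$$M^{-1} = D^{-1} - D^{-1}(L + U) D^{-1}$$

• For k=m, one usually uses a weighted truncation with coefficients $\gamma_k$:

$$M_m^{-1} = \sum_{k=0}^{m} \gamma_k (-1)^k \left(D^{-1}(L + U)\right)^k D^{-1}$$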

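9. Preconditioning Techniques (Cont'd)

(d) Domain Decomposition Preconditioning

[Figure: computational domain split into Domain I and Domain II]

$$Ax = b: \qquad \begin{bmatrix} A_1 & 0 & B_1 \\ 0 & A_2 & B_2 \\ B_1^T & B_2^T & Q \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ z \end{bmatrix} = \begin{bmatrix} d_1 \\ d_2 \\ f \end{bmatrix}$$

$$M = L \begin{bmatrix} M_1^{-1} & 0 & 0 \\ 0 & M_2^{-1} & 0 \\ 0 & 0 & S^{-1} \end{bmatrix} L^T, \qquad \text{where} \quad L = \begin{bmatrix} M_1 & 0 & 0 \\ 0 & M_2 & 0 \\ B_1^T & B_2^T & S \end{bmatrix}$$

$$M = \begin{bmatrix} M_1 & 0 & B_1 \\ 0 & M_2 & B_2 \\ B_1^T & B_2^T & S^* \end{bmatrix}, \qquad S^* = S + B_1^T M_1^{-1} B_1 + B_2^T M_2^{-1} B_2$$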

10. Conjugate Gradient Type Algorithms for Nonsymmetric Matrices

• The Bi-Conjugate Gradient (BCG) Algorithm was proposed by Lanczos in 1952.

• It solves not only the original system $Ax = b$ but also the dual linear system $A^T x^* = b^*$.

• Each step of this algorithm requires a matrix-by-vector product with both $A$ and $A^T$.

• The search direction $p_j^*$ does not contribute to the solution directly.

    i ← 0
    r ← b - Ax,   r* chosen such that r*^T r ≠ 0
    p ← r,   p* ← r*
    δ_new ← r*^T r
    δ_0 ← δ_new
    While i < i_max and δ_new > ε² δ_0 do
        q ← Ap,   q* ← A^T p*
        α ← δ_new / (p*^T q)
        x ← x + αp
        r ← r - αq,   r* ← r* - αq*
        δ_old ← δ_new
        δ_new ← r*^T r
        β ← δ_new / δ_old
        p ← r + βp,   p* ← r* + βp*
        i ← i + 1

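10. Conjugate Gradient Type Algorithms for Nonsymmetric Matrices (Cont'd)

• The Conjugate Gradient Squared (CGS) Algorithm was developed by Sonneveld in 1984.

• Main idea of the algorithm: in BCG the residuals and search directions are polynomials in $A$ (or $A^T$) applied to the initial residuals,

$$r_j = \phi_j(A) r_0, \quad p_j = \pi_j(A) r_0, \qquad r_j^* = \phi_j(A^T) r_0^*, \quad p_j^* = \pi_j(A^T) r_0^*$$

    i ← 0
    r ← b - Ax,   r_0* is arbitrary
    p ← r,   u ← r
    δ_new ← r^T r_0*
    δ_0 ← δ_new
    While i < i_max and δ_new > ε² δ_0 do
        α ← δ_new / ((Ap)^T r_0*)
        q ← u - αAp
        x ← x + α(u + q)
        r ← r - αA(u + q)
        δ_old ← δ_new
        δ_new ← r^T r_0*
        β ← δ_new / δ_old
        u ← r + βq
        p ← u + β(q + βp)
        i ← i + 1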

10. Conjugate Gradient Type Algorithms for Nonsymmetric Matrices (Cont'd)

• The Bi-Conjugate Gradient Stabilized (Bi-CGSTAB) Algorithm was developed by van der Vorst in 1992.

• Rather than producing iterates whose residual vectors are of the form

$$r_j = \phi_j^2(A) r_0$$

• it produces iterates with residual vectors of the form

$$r_j = \psi_j(A) \phi_j(A) r_0, \qquad p_j = \psi_j(A) \pi_j(A) r_0$$

    i ← 0
    r ← b - Ax,   r_0* is arbitrary
    p ← r
    δ_new ← r^T r_0*
    δ_0 ← δ_new
    While i < i_max and δ_new > ε² δ_0 do
        α ← δ_new / ((Ap)^T r_0*)
        s ← r - αAp
        ω ← ((As)^T s) / ((As)^T As)
        x ← x + αp + ωs
        r ← s - ωAs
        δ_old ← δ_new
        δ_new ← r^T r_0*
        β ← (δ_new / δ_old)(α / ω)
        p ← r + β(p - ωAp)
        i ← i + 1
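
A plain-Python sketch of the Bi-CGSTAB iteration is shown below. It follows the standard textbook form of the algorithm under several assumptions made for illustration: the shadow residual is chosen as $r_0^* = r_0$, the loop stops on the residual norm rather than on $r^T r_0^*$, and the small nonsymmetric test system is invented for this example. No breakdown handling beyond a zero-residual check is included.

```python
def matvec(A, v):
    return [sum(a * vj for a, vj in zip(row, v)) for row in A]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def bicgstab(A, b, x, imax=1000, eps=1e-10):
    """Bi-CGSTAB for a general square A; returns (x, number of iterations)."""
    r = [bi - ai for bi, ai in zip(b, matvec(A, x))]
    r0s = r[:]                              # shadow residual r_0* (assumed = r_0)
    p = r[:]
    delta_new = dot(r, r0s)
    norm0 = dot(r, r)
    i = 0
    while i < imax and dot(r, r) > eps ** 2 * norm0:
        Ap = matvec(A, p)
        alpha = delta_new / dot(Ap, r0s)
        s = [ri - alpha * api for ri, api in zip(r, Ap)]
        As = matvec(A, s)
        if dot(As, As) == 0.0:              # s vanished: x + alpha p is exact
            x = [xi + alpha * pi for xi, pi in zip(x, p)]
            r = s
            i += 1
            break
        omega = dot(As, s) / dot(As, As)    # local residual minimization
        x = [xi + alpha * pi + omega * si for xi, pi, si in zip(x, p, s)]
        r = [si - omega * asi for si, asi in zip(s, As)]
        delta_old = delta_new
        delta_new = dot(r, r0s)
        beta = (delta_new / delta_old) * (alpha / omega)
        p = [ri + beta * (pi - omega * api) for ri, pi, api in zip(r, p, Ap)]
        i += 1
    return x, i

# Illustrative nonsymmetric system with known solution [1, 2, 3]
A = [[4.0, 1.0, 0.0], [2.0, 5.0, 1.0], [0.0, 1.0, 3.0]]
b = [6.0, 15.0, 11.0]
x_bi, its_bi = bicgstab(A, b, [0.0, 0.0, 0.0])
print(x_bi, its_bi)
```

Unlike BCG, the loop needs no products with $A^T$, at the cost of two products with $A$ per iteration.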

Complexity of linear solvers

Time to solve model problem (Poisson's equation) on regular mesh

                          2D                       3D
Mesh side length          n^(1/2)                  n^(1/3)
Sparse Cholesky:          O(n^1.5)                 O(n^2)
CG, exact arithmetic:     O(n^2)                   O(n^2)
CG, no precond:           O(n^1.5)                 O(n^1.33)
CG, modified IC:          O(n^1.25)                O(n^1.17)
CG, support trees:        O(n^1.20) -> O(n^(1+))   O(n^1.75) -> O(n^1.31)
Multigrid:                O(n)                     O(n)