Method of Conjugate Gradients
Dragica Vasileska
CG Method � Conjugate Gradient (CG) Method
M. R. Hestenes & E. Stiefel, 1952
BiCG Method � Biconjugate Gradient (BiCG) Method
C. Lanczos, 1952D. A. H. Jacobs, 1981C. F. Smith et al., 1990R. Barret et al., 1994
CGS Method � Conjugate Gradient Squared (CGS) Method (MATLAB Function)
P. Sonneveld, 1989
GMRES Method � Generalized Minimal � Residual (GMRES) Method
Y. Saad & M. H. Schultz, 1986R. Barret et al., 1994
Y. Saad, 1996
QMR Method � Quasi�Minimal�Residual (QMR) Method
R. Freund & N. Nachtigal, 1990N. Nachtigal, 1991R. Barret et al., 1994Y. Saad, 1996
Conjugate Gradient Method
1. Introduction, Notation and Basic Terms2. Eigenvalues and Eigenvectors3. The Method of Steepest Descent4. Convergence Analysis of the Method of Steepest Descent5. The Method of Conjugate Directions6. The Method of Conjugate Gradients7. Convergence Analysis of Conjugate Gradient Method8. Complexity of the Conjugate Gradient Method9. Preconditioning Techniques10. Conjugate Gradient Type Algorithms for Non-
Symmetric Matrices (CGS, Bi-CGSTAB Method)
1. Introduction, Notation and Basic Terms
� The CG is one of the most popular iterative methods for solving large systems of linear equations
Ax=bwhich arise in many important settings, such as finite difference and finite element methods for solving partial differential equa-tions, circuit analysis etc.� It is suited for use with sparse matrices. If A is dense, the best
choice is to factor A and solve the equation by backsubstitution.� There is a fundamental underlying structure for almost all the
descent algorithms: (1) one starts with an initial point; (2) deter-mines according to a fixed rule a direction of movement; (3) mo-ves in that direction to a relative minimum of the objective function; (4) at the new point, a new direction is determined and the process is repeated. The difference between different algorithms depends upon the rule by which successive directions of movement are selected.
1. Introduction, Notation and Basic Terms (Cont�d)
� A matrix is a rectangular array of numbers, called elements.
� The transpose of an mn matrix A is the nm matrix AT with elements
aijT = aji
� Two square nn matrices A and B are similar if there is a nonsin-gular matrix S such that
B=S-1AS� Matrices having a single row are called row vectors; matrices
having a single column are called column vectors.Row vector: a = [a1, a2, �, an]Column vector: a = (a1, a2, �, an)
� The inner product of two vectors is written as
n
iii
1
T yxyx
1. Introduction, Notation and Basic Terms (Cont�d)
� A matrix is called symmetric if aij = aji .� A matrix is positive definite if, for every nonzero vector x
xTAx > 0� Basic matrix identities: (AB)T=BTAT and (AB)-1=B-1A-1
� A quadratic form is a scalar, quadratic function of a vector with the form
� The gradient of a quadratic form is
c21
)f( TT xbAxx x
bAxxA
x
x
x
21
21
)f(x
)f(x
)(f T
n
1'
1. Introduction, Notation and Basic Terms (Cont�d)
� Various quadratic forms for an arbitrary 22 matrix:
Positive-definitematrix
Negative-definitematrix
Singularpositive-indefinite
matrix
Indefinitematrix
� Specific example of a 22 symmetric positive definite matrix:
1. Introduction, Notation and Basic Terms (Cont�d)
22
0c82
6223
x b A ,,
Graph of a quadratic form f(x)
Contours of thequadratic form
Gradient ofthe quadratic form
2. Eigenvalues and Eigenvectors
� For any nn matrix B, a scalar and a nonzero vector v that satisfy the equation
Bv=vare said to be the eigenvalue and the eigenvector of B.� If the matrix is symmetric, then the following properties hold:
(a) the eigenvalues of B are real(b) eigenvectors associated with distinct eigenvalues are
orthogonal� The matrix B is positive definite (or positive semidefinite) if and
only if all eigenvalues of B are positive (or nonnegative).� Why should we care about the eigenvalues? Iterative methods often
depend on applying B to a vector over and over again:(a) If ||<1, then Biv=iv vanishes as i approaches infinity(b) If ||>1, then Biv=iv will grow to infinity.
2. Eigenvalues and Eigenvectors (Cont�d)
=-0.5
2
� Examples for ||<1 and || >1
� The spectral radius of a matrix is: (B)= max|i|
=0.7=-2
2. Eigenvalues and Eigenvectors (Cont�d)
� Example: The Jacobi Method for the solution of Ax=b- split the matrix A such that A=D+E
- Instead of solving the system Ax=b, one solves the modifi-ed system x=Bx+z, where B=-D-1E and z=D-1b
- The iterative method is: x(i+1)= Bx(i)+z, orit terms of the error vector: e(i+1)= Be(i)
- For our particular example, we have that the eigenvalues and the eigenvectors of the matrix A are:
1=7, v1=[1, 2]T
2=2, v2=[-2, 1]T
- The eigenvalues and eigenvectors of the matrix B are:1=-2/3=-0.47, v1=[2, 1]T
2= 2/3=0.47, v2=[-2, 1]T
� Graphical representation of the convergence of the Jacobi method
2. Eigenvalues and Eigenvectors (Cont�d)
Eigenvectors of Btogether with their
eigenvalues
Convergence of the Jacobi method which starts at [-2,-2]T andconverges to [2, -2]T
The error vector e(0) The error vector e(1)
The error vector e(2)
3. The Method of Steepest Descent
� In the method of steepest descent, one starts with an arbitrary point x(0) and takes a series of steps x(1), x(2), � until we are satis-fied that we are close enough to the solution.
� When taking the step, one chooses the direction in which f decreases most quickly, i.e.
� Definitions:
error vector: e(i)=x(i)-xresidual: r(i)=b-Ax(i)
� From Ax=b, it follows that
r(i)=-Ae(i)=-f�(x(i))
(i)(i))(f Axbx '
The residual is actually the direction of steepest descent
3. The Method of Steepest Descent (Cont�d)
� Start with some vector x(0). The next vector falls along the solid line
� The magnitude of the step is determined with a line search proce-dure that minimizes f along the line:
)0()0()1( rxx
)0(
)0()0(
T)0()0(
)0(T
)0()0(
)0(T
)1(
)0(T
)1(
)1(T)1()1(
)0(
)0()0(
0
0
0
0
0d
d)(f)f(
dd
Arr
rr rArrr
rrxAb
rAxb
rr
xxx
T
TT
'
� Geometrical representation(e)
(f)
3. The Method of Steepest Descent (Cont�d)
3. The Method of Steepest Descent (Cont�d)
� The algorithm
� To avoid one matrix-vector multiplication, one uses
(i)(i)(i))1(i(i)(i)(i))1(i
(i)
(i)(i)
(i)(i)
(i)
(i)
ree rxx
Arr
rr
Axbr
T
T
(i)(i)(i))1(i Arrr
Two matrix-vectormultiplications arerequired.
The disadvantage of using this recurrence is that the residual sequence isdetermined without any feedback from the value of x(i), so that round-off
errors may cause x(i) to converge to some point near x.
4. Convergence Analysis of the Method of Steepest Descent� The error vector e(i) is equal to the
eigenvector vj of A
� The error vector e(i) is a linear combi-nation of the eigenvectors of A, and all eigenvalues are the same
0, 1
(i)(i)(i))1(ij(i)
(i)
(i)
jjj(i)(i)
(i)
(i)
ree Arr
rrvAvAer
T
T
0, 1
(i)(i)(i))1(i(i)
(i)
(i)
n
1jjj
n
1jjjj(i)(i)
n
1jjj(i)
(i)
(i)
ree Arr
rr
vvAer
ve
T
T
3. The Method of Steepest Descent
� In the method of steepest descent, one starts with an arbitrary point x(0) and takes a series of steps x(1), x(2), � until we are satis-fied that we are close enough to the solution.
� When taking the step, one chooses the direction in which f decreases most quickly, i.e.
� Definitions:
error vector: e(i)=x(i)-xresidual: r(i)=b-Ax(i)
� From Ax=b, it follows that
r(i)=-Ae(i)=-f�(x(i))
(i)(i))(f Axbx '
The residual is actually the direction of steepest descent
3. The Method of Steepest Descent (Cont�d)
� Start with some vector x(0). The next vector falls along the solid line
� The magnitude of the step is determined with a line search proce-dure that minimizes f along the line:
)0()0()1( rxx
)0(
)0()0(
T)0()0(
)0(T
)0()0(
)0(T
)1(
)0(T
)1(
)1(T)1()1(
)0(
)0()0(
0
0
0
0
0d
d)(f)f(
dd
Arr
rr rArrr
rrxAb
rAxb
rr
xxx
T
TT
'
� Geometrical representation(e)
(f)
3. The Method of Steepest Descent (Cont�d)
3. The Method of Steepest Descent (Cont�d)
� Summary of the algorithm
� To avoid one matrix-vector multiplication, one uses instead
(i)(i)(i))1(i Arrr
(i)(i)(i))1(i
(i)
(i)(i)
(i)(i)
(i)
(i)
rxx
Arr
rr
Axbr
T
T
� Efficient implementation of the method of steepest descent
1ii
endif-
else-
50le by is divisibi If
andiiWhile
0i
T
T
02
max
T
rr
q r r
Ax br
rxx
qr
Arq
rr
Axbr
4. Convergence Analysis of the Method of Steepest Descent� The error vector e(i) is equal to the
eigenvector vj of A
� The error vector e(i) is a linear combi-nation of the eigenvectors of A, and all eigenvalues are the same
0, 1
(i)(i)(i))1(ij(i)
(i)
(i)
jjj(i)(i)
(i)
(i)
ree Arr
rrvAvAer
T
T
0, 1
(i)(i)(i))1(i(i)
(i)
(i)
n
1jjj
n
1jjjj(i)(i)
n
1jjj(i)
(i)
(i)
ree Arr
rr
vvAer
ve
T
T
4. Convergence Analysis of the Method of Steepest Descent (Cont�d)
� General convergence of the method can be proven by calculating the energy norm
� For n=2, one has that
j j2jj
3j
2j
2j
2j
2j222
A)i()1i(T
)1(i2
A)1i( 1 ,eAeee
12minmax
232
2222
/,/
11
))((
)(1
Conditioning number
4. Convergence Analysis of the Method of Steepest Descent (Cont�d)
The influence of the spectralcondition number on the convergence
Worst-case scenario: =±
5. The Method of Conjugate Directions
� Basic idea:1. Pick a set of orthogonal search directions d(0), d(1), � , d(n-1)
2. Take exactly one step in each search direction to line up with x
� Mathematical formulation:1. For each step we choose a point
x(i+1)=x(i)+ (i) d(i)
2. To find (i), we use the fact that e(i+1) is orthogonal to d(i)
)i(T
)i(
)i(T
)i()i(
)i(T
)i()i()i(T
)i(
)i()i()i(T
)i(
)1i(T
)i(
0
0
0
dd
ed
dded
ded
ed
5. The Method of Conjugate Directions (Cont�d)
� To solve the problem of not knowing e(i), one makes the search directions to be A-orthogonal rather then orthogonal to each other, i.e.:
0A )j(T
)i( dd
5. The Method of Conjugate Directions (Cont�d)
� The new requirement is now that e(i+1) is A-orthogonal to d(i)
(i)
(i))i(
)i()i()i(T
)i(
)1i(T
)i(
(i)T
)1(i
)1(iT)1(i)1(i
(i)
(i)
0
0
0
0d
d)('f)f(
dd
Add
rd
deAd
Aed
dr
xxx
T
T
If the search vectors were the residuals, thisformula would be identical to the method of
steepest descent.
5. The Method of Conjugate Directions (Cont�d)
� Proof that this process computes x in n-steps- express the error terms as a linear combination of the search
directions
- use the fact that the search directions are A-orthogonal
)j(
1n
0j)j()0( de
5. The Method of Conjugate Directions (Cont�d)
� Calculation of the A-ortogonal search directions by a conjugate Gram-Schmidth process1. Take a set of linearly independent vectors u0, u1, � , un-1
2. Assume that d(0)=u0
3. For i>0, take an ui and subtracts all the components from itthat are not A-orthogonal to the previous search directions
)j(T(j)
)j(T(i)
ij)j(
1i
0jij)i()i( ,
Add
Adu dud
5. The Method of Conjugate Directions (Cont�d)
� The method of Conjugate Directions using the axial unit vectors as basis vectors
5. The Method of Conjugate Directions (Cont�d)
1. The error vector is A-orthogonal to all previous search directions2. The residual vector is orthogonal to all previous search directions3. The residual vector is also orthogonal to all previous basis vectors.
jifor0jTij
Tij
Ti rurdAed
� Important observations:
6. The Method of Conjugate Gradients
� The method of Conjugate Gradients is simply the method of conjugate directions where the search directions are constructed by conjugation of the residuals, i.e. ui=r(i)
� This allows us to simplify the calculation of the new search direction because
� The new search direction is determined as a linear combination of the previous search direction and the new residual
1ji0
1ji1
)1i(T
)1i(
)i(T
)i(
)1i(T
)1i(
)i(T
)i(
)1i(ij rr
rr
Add
rr
(i)i)1i()1(i drd
6. The Method of Conjugate Gradients (Cont�d)
� Efficient implementation
1ii
endif-
else-
50y ivisible b If i is d
andiiWhile
0i
old
new
Tnew
newold
Tnew
02
newmax
new0
Tnew
dr d
rr
q r r
Ax br
dxx qd
Adq
rrrd
Axbr
The Landscape of Ax=b Solvers
Pivoting
LU
GMRES,
BiCGSTAB, �
Cholesky
Conjugate gradient
DirectA = LU
Iterativey� = Ay
Non-symmetric
Symmetricpositivedefinite
More Robust Less Storage (if sparse)
More Robust
More General
Conjugate gradient iteration
� One matrix-vector multiplication per iteration� Two vector dot products per iteration� Four n-vectors of working storage
x0 = 0, r0 = b, d0 = r0
for k = 1, 2, 3, . . .
ák = (rTk-1rk-1) / (dT
k-1Adk-1) step length
xk = xk-1 + ák dk-1 approx solution
rk = rk-1 � ák Adk-1 residual
âk = (rTk rk) / (rT
k-1rk-1) improvement
dk = rk + âk dk-1 search direction
7. Convergence Analysis of the CG Method
� If the algorithm is performed in exact arithmetic, the exact solution is obtained in at most n-steps.
� When finite precision arithmetic is used, rounding errors lead to gradual loss of orthogonality among the residuals, and the finite termination property of the method is lost.
� If the matrix A has only m distinct eigenvalues, then the CG will converge in at most m iterations
� An error bound on the CG method can be obtained in terms of the A-norm, and after k-iterations:
k
A0Ak 1)(1)(
2
AA
xxxx
� Dominant operations of Steepest Descent or Conjugate Gradient method are the matrix vector products. Thus, both methods have spatial complexity O(m).
� Suppose that after I-iterations, one requires that ||e(i)||A||e(0)||A(a) Steepest Descent:
(b) Conjugate Gradient:
� The time complexity of these two methods is:
(a) Steepest Descent:
(b) Conjugate Gradient:
� Stopping criterion: ||r(i)||||r(0)||
8. Complexity of the CG Method
121 lni
221 lni
)( mO
)( mO
9. Preconditioning Techniques
References:
� J.A. Meijerink and H.A. van der Vorst, �An iterative solution method for linear systems of which the coefficient matrix is a symmetric M-matrix,�Mathematics of Computation, Vol. 31, No. 137, pp. 148-162 (1977).
� J.A. Meijerink and H.A. van der Vorst, �Guidelines for the usage of incomplete decompositions in solving sets of linear equations as they occur in practical problems,� Journal of Computational Physics, Vol. 44, pp. 134-155 (1981).
� D.S. Kershaw, �The incomplete Cholesky-Conjugate gradient method for the iterative solution of systems of linear equations,� Journal of Computational Physics, Vol. 26, pp. 43-65 (1978).
� H.A. van der Vorst, �High performance preconditioning,� SIAM Journal of Scientific Statistical Computations, Vol. 10, No. 6, pp. 1174-1185 (1989).
9. Preconditioning Techniques (Cont�d)
� Preconditioning is a technique for improving the condition number of a matrix. Suppose that M is a symmetric, positive-definite matrix that approximates A, but is easier to invert. Then, rather than solving the original system
Ax=bone solves the modified system
M-1Ax= M-1b� The effectiveness of the preconditioner depends upon the condition
number of M-1A.
� Intuitively, the preconditioning is an attempt to stretch the quadratic form to appear more spherical, so that the eigenvalues become closer to each other.
� Derivation: xEx bExAEE EEM TTT
~,~, 11
9. Preconditioning Techniques (Cont�d)� Transformed Preconditioned CG
(PCG) Algorithm
1
~~~
~~
~~~~~
~~~~~
~~
,~~
~~,~~,0
11
10
2max
0
11
ii
-or-
andiiWhile
i
old
new
Tnew
newold
T
Tnew
Tnew
newT
new
T
d rd
rr
q rr xAEE bEr
dxx qd
dAEEq
rr
rd xAEEbEr
� Untransformed PCG Algorithm dEd rEr T
~,~ 1
1
,
,,0
1
1
02
max
01
1
ii
-or-
andiiWhile
i
old
new
-Tnew
newold
Tnew
new
new-T
new
dr Md
rMr
q r r Ax br
dxx qd
Adq
rMr
rMd Axbr
9. Preconditioning Techniques (Cont�d)
There are sevaral types of preconditioners that can be used:
(a) Preconditioners based on splitting of the matrix A, i.e.
A = M - N(b) Complete or incomplete factorization of A, e.g.
A = LLT + E
(c) Approximation of M=A-1
(d) Reordering of the equations and/or unknowns, e.g.
Domain Decomposition
9. Preconditioning Techniques (Cont�d)(a) Preconditioning based on splittings of A
� Diagonal Preconditioning:M=D=diag{a11, a22, � ann}
� Tridiagonal Preconditioning
� Block Diagonal Preconditioning
22
0c82
6223
x b A ,,
9. Preconditioning Techniques (Cont�d)(b) Preconditioning based on factorization of A
� Cholesky Factorization (for symmetric and positive definite matrix) as a direct method:
A = LLT
x = (L-T)(L-1b)
where:
� Modified Cholesky factorization: A = LDLT
niijL
LLA
L
LAL
ii
i
kikjkji
ji
i
kikiiii
,,2,1 ,
1
1
1
1
9. Preconditioning Techniques (Cont�d)(b) Preconditioning based on factorization of A
� Incomplete Cholesky Factorization:
A = LLT + E or A = LDLT + E , where Lij=0 if (i,j)P
Simplest choice is: P={(i,j)| aij=0; i,j,=1,2,�,n} => ICCG(0)
� Modified ICCG(0):
Aaibi ci
TL
ãi ib~
ic~
Did
~
niccbbdcdbada iiiimimiiiiii ,,2,1,~,~
,~~~~~~ 2
12
11
mi
mi
i
iii
T
d
c
d
baddiagdiag
2
1
21
1
),()(
)()(
MA
LDDDLM
9. Preconditioning Techniques (Cont�d)(c) Preconditioning based on approximation of A
� If (J)<1, then the inverse of I-J is
� Write matrix A as:
A=D+L+U=D(I+D-1(L+U)) => J= -D-1(L+U))
� The inverse of A can be expressed as:
� For k=1, one has that:
� For k=m, one usually uses:
0
321)(k
k JJJIJJI
11
0
1111
)()1(
)())((
DULD
DJIJIDA
k
k
k
1111 )( DULDDM
11
0
1 )()1(
DULDM
km
k
kkm
9. Preconditioning Techniques (Cont�d)(d) Domain Decomposition Preconditioning
Domain I
Domain II
SBB
0M0
00M
L L
S00
0M0
00M
LMTT
T-
-
where
21
2
1
2
1
1
1
1
,
f
d
d
z
x
x
QBB
BA0
B0A
bAx 2
1
2
1
22
11
21
,TT
21
11*
*22
11
2211
21
, BMBBMBSS
SBB
BM0
B0M
M -T-T
TT
10. Conjugate Gradient Type Algorithms forNonsymmetric Matrices
� The Bi-Conjugate Gradient(BCG) Algorithm was propo-sed by Lanczos in 1954.
� It solves not only the original system Ax=b but also the dual linear system ATx*=b*.
� Each step of this algorithm requires a matrix-by-vector product with both A and AT.
� The search direction pj* does
not contribute to the solution directly.
1
,,,0
***
*
*
*
02
max
0
ii
--
andiiWhile
, i
old
new
Tnew
newold
*Tnew
T*
new
newT
new
p rp
pr p
rr
q rr q r r pxx
qp
pAq Apq
rr rp r,p
0r*rAxbr
**
***
T
10. Conjugate Gradient Type Algorithms forNonsymmetric Matrices (Cont�d)
� The Conjugate Gradient Squared (CGS) Algorithm was developed by Sonneveld in 1984.
� Main idea of the algorithm:
1(
)()(
,
arbitrary is,0
*0
*0
02
max
0*
00
*000
0
ii)
-
andiiWhile
, i
old
new
Tnew
newold
Tnew
new
newT
new
pq up
qr u
rr
quA r r
quxx Apuq
Apr
rr
ru ,rp rAxbr
**
**
rAp
rAr
rAprAr
0
0
)(
)(
)()(
0
0
Tj
Tj
jj
jj
j
j
10. Conjugate Gradient Type Algorithms forNonsymmetric Matrices (Cont�d)
� The Bi-Conjugate Gradient Stabilized (Bi-CGSTAB) Algorithm was developed by van der Vorst in 1992.
� Rather than producing iterates whose residual vectors are of the form
� it produces iterates with residual vectors of the form
1(
,
arbitrary is,0
*0
*0
02
max
0*
0
*000
0
ii)
-
andiiWhile
, i
old
new
Tnew
newold
T
T
Tnew
new
newT
new
Appr p
rr
As sr
spxx AsAs
Ass
Aprs Apr
rr
rp rAxbr
02 )( rArjj
0
0)()(
)()(rAAp
rAAr
jjj
jjj
Complexity of linear solvers
O(n1.17 )O(n1.25 )CG, modified IC:
O(n1.75 ) -> O(n1.31 )O(n1.20 ) -> O(n1+ )CG, support trees:
O(n1.33 )O(n1.5 )CG, no precond:
O(n2 )O(n2 )CG, exact arithmetic:
O(n)O(n)Multigrid:
O(n2 )O(n1.5 )Sparse Cholesky:
3D2D
n1/2 n1/3
Time to solve model problem (Poisson�s equation) on regular mesh
Top Related