Math for CS, Lecture 6: Function Optimization. Newton's Method. Conjugate Gradients.



Newton's Method

In Lecture 5 we have seen that the steepest descent method can suffer from slow convergence. Newton's method fixes this problem for the cases where the function f(x) near x* can be approximated by a paraboloid:

\[ f(x) \approx f(x_k) + g_k^T (x - x_k) + \frac{1}{2} (x - x_k)^T Q_k (x - x_k), \tag{1} \]

where

\[ g_k = g(x_k) = \nabla f(x) \big|_{x = x_k} \]

and

\[ Q_k = Q(x_k) = \nabla^2 f(x) \big|_{x = x_k}. \]

Newton's Method 2

Here g_k is the gradient and Q_k is the Hessian of the function f, evaluated at x_k. They appear in the 2nd and 3rd terms of the Taylor expansion (1) of f around x_k. A minimum of the approximation requires its gradient with respect to the step ∆ = x - x_k to vanish:

\[ g_k + Q_k \Delta = 0 \quad \Longrightarrow \quad \Delta = -Q_k^{-1} g_k. \tag{2} \]

The solution of this equation gives both the step direction and the step size towards the minimum of the paraboloid (1), which is, presumably, close to the minimum of f(x). The minimization algorithm in which x_{k+1} = y(x_k) = x_k + ∆, with ∆ defined by (2), is called Newton's method.
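As a concrete illustration, here is a minimal sketch of this iteration in Python/NumPy. The callables grad and hess, the starting point x0, and the tolerance are assumptions of the sketch, not part of the lecture.

```python
import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-10, max_iter=50):
    """Iterate x_{k+1} = x_k + delta, with delta solving Q_k delta = -g_k, eq. (2)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:      # stationary point reached
            break
        # Solve the linear system rather than inverting the Hessian explicitly.
        delta = np.linalg.solve(hess(x), -g)
        x = x + delta
    return x
```

Solving the system is both cheaper and numerically safer than forming Q_k^{-1}, which is why (2) is usually implemented this way.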

Newton's Method 3

The greater speed of Newton's method over steepest descent is borne out by analysis: while steepest descent has a linear order of convergence, Newton's method is quadratic. In fact, let

\[ y(x) = x - Q(x)^{-1} g(x) \]

be the place reached by a Newton step starting at x, and suppose that at the minimum x* the Hessian Q(x*) is nonsingular. Then

\[ y(x^*) = x^* - Q(x^*)^{-1} g(x^*). \]

Since g(x*) = 0,

\[ y(x^*) = x^*. \]

Newton's Method 4

And, since x_{k+1} = y(x_k),

\[ x_{k+1} - x^* = y(x_k) - y(x^*). \]

We can estimate the difference |x_{k+1} - x*| with a Taylor expansion of y about x*:

\[ |x_{k+1} - x^*| \le \left\| \frac{\partial y}{\partial x}(x^*) \right\| \, |x_k - x^*| + \frac{1}{2} \left\| \frac{\partial^2 y}{\partial x \, \partial x}(\hat{x}) \right\| \, |x_k - x^*|^2, \]

where \( \hat{x} \) is some point on the line between x* and x_k.

Newton's Method is Quadratic

The first derivatives of y at x* are zero: differentiating y(x) = x - Q(x)^{-1} g(x) and using g(x*) = 0 gives ∂y/∂x(x*) = I - Q(x*)^{-1} Q(x*) = 0. Hence the first term in the right-hand side above vanishes, and

\[ |x_{k+1} - x^*| \le c \, |x_k - x^*|^2 \]

for some constant c. Thus, the convergence rate of Newton's method is of order at least two.
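The quadratic rate is easy to observe numerically. The sketch below, an illustration added here rather than taken from the slides, runs the Newton iteration on the Rosenbrock function, whose minimum x* = (1, 1) has a nonsingular Hessian; once the iterates are close, the error roughly squares at every step.

```python
import numpy as np

# Rosenbrock function f(x, y) = (1 - x)^2 + 100 (y - x^2)^2; minimum at (1, 1).
def grad(v):
    x, y = v
    return np.array([-2 * (1 - x) - 400 * x * (y - x**2),
                     200 * (y - x**2)])

def hess(v):
    x, y = v
    return np.array([[2 - 400 * y + 1200 * x**2, -400 * x],
                     [-400 * x, 200.0]])

x_star = np.array([1.0, 1.0])
x = np.array([0.8, 0.9])                          # start near the minimum
for k in range(6):
    x = x + np.linalg.solve(hess(x), -grad(x))    # Newton step, eq. (2)
    print(k, np.linalg.norm(x - x_star))          # error eventually squares
```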

Newton's Method is Quadratic 2

For example, for a quadratic function

\[ f(x) = c + a^T x + \frac{1}{2} x^T Q x, \tag{3} \]

steepest descent takes many iterations to converge, while Newton's method requires only one step. However, this single iteration of Newton's method is more expensive, because it requires both the gradient g_k and the Hessian Q_k to be evaluated, for a total of n + n(n+1)/2, i.e. O(n^2), derivatives (n for the gradient plus n(n+1)/2 for the symmetric Hessian). In addition, the Hessian must be inverted, or, at least, the system (2) must be solved.
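This contrast is easy to reproduce. In the sketch below (the matrix, vectors, and tolerance are illustrative assumptions), one Newton step from any starting point lands exactly on x*, while steepest descent with exact line search needs on the order of a thousand iterations for a condition number of 100.

```python
import numpy as np

Q = np.diag([1.0, 100.0])            # ill-conditioned positive definite Hessian
a = np.array([1.0, 1.0])
grad = lambda x: a + Q @ x           # gradient of f(x) = c + a^T x + 1/2 x^T Q x
x_star = np.linalg.solve(Q, -a)      # true minimum

x = np.array([5.0, 5.0])
x_newton = x + np.linalg.solve(Q, -grad(x))   # a single Newton step
print(np.linalg.norm(x_newton - x_star))      # ~1e-16: exact for a quadratic

# Steepest descent with exact line search: alpha = g^T g / (g^T Q g).
steps = 0
while np.linalg.norm(grad(x)) > 1e-8:
    g = grad(x)
    x = x - (g @ g) / (g @ Q @ g) * g
    steps += 1
print(steps)                                  # on the order of a thousand steps
```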

Newton's Method vs Steepest Descent

In contrast, steepest descent requires only the gradient g_k for selecting the step direction p_k, and a line search in the direction p_k to find the step size. These cheaper steps can be advantageous over the faster convergence of Newton's method when the dimensionality of x is large, which can exceed many thousands.

The method of conjugate gradients, discussed in the following slides, is motivated by the desire to accelerate convergence with respect to the steepest descent method, but without paying the storage cost of Newton's method.

Conjugate Gradients

Suppose that we want to minimize the quadratic function

\[ f(x) = c + a^T x + \frac{1}{2} x^T Q x, \]

where Q is a symmetric, positive definite matrix, and x has n components. As we saw in the explanation of steepest descent, the minimum x* is the solution to the linear system

\[ Q x = -a. \]

The explicit solution of this system requires about O(n^3) operations and O(n^2) memory, which is very expensive.
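For reference, the direct O(n^3) solution is one line; the values here are hypothetical and only illustrate that the gradient vanishes at the computed x*.

```python
import numpy as np

Q = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive definite
a = np.array([-1.0, -2.0])
x_star = np.linalg.solve(Q, -a)          # direct O(n^3) solution of Q x = -a
print(np.allclose(a + Q @ x_star, 0))    # True: the gradient vanishes at x*
```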

Conjugate Gradients 2

We now consider an alternative solution method that does not need Q, but only the gradient of f,

\[ g_k = \nabla f(x_k) = a + Q x_k, \]

evaluated at n different points x_1, ..., x_n.

Conjugate Gradients 3

Consider the case n = 3, in which the variable x in f(x) is a three-dimensional vector. Then the quadratic function f(x) is constant over ellipsoids, called isosurfaces, centered at the minimum x*. How can we start from a point x_0 on one of these ellipsoids and reach x* by a finite sequence of one-dimensional searches? In steepest descent, for poorly conditioned Hessians, orthogonal directions lead to many small steps, that is, to slow convergence.

Conjugate Gradients: Spherical Case

When the ellipsoids are spheres, on the other hand, convergence is much faster: the first step takes us from x_0 to x_1, and the line between x_0 and x_1 is tangent to an isosurface at x_1. The next step, in the direction of the gradient, takes us to x* right away. Suppose, however, that we cannot afford to compute this special direction p_1 orthogonal to p_0, but that we can only compute some direction p_1 orthogonal to p_0 (there is an (n-1)-dimensional space of such directions!) and reach the minimum of f(x) along this direction.

In that case, n steps will take us to the center x* of the sphere, since the coordinate of the minimum along each of the n directions is independent of the others.

Conjugate Gradients: Elliptical Case

Any set of orthogonal directions, with a line search in each direction, will lead to the minimum for spherical isosurfaces. Given an arbitrary set of ellipsoidal isosurfaces, there is a one-to-one mapping with a spherical system: if Q = U E U^T is the SVD of the symmetric, positive definite matrix Q, then we can write

\[ \frac{1}{2} x^T Q x = \frac{1}{2} y^T y, \tag{4} \]

where

\[ y = E^{1/2} U^T x. \tag{5} \]

Elliptical Case 2

Consequently, there must be a condition for the original problem (in terms of Q) that is equivalent to orthogonality for the spherical problem. If two directions q_i and q_j are orthogonal in the spherical context, that is, if

\[ q_i^T q_j = 0, \]

what does this translate into in terms of the directions p_i and p_j for the ellipsoidal problem? We have

\[ q_i = E^{1/2} U^T p_i, \qquad q_j = E^{1/2} U^T p_j. \tag{6} \]

Elliptical Case 3

Consequently,

\[ q_i^T q_j = p_i^T U E^{1/2} E^{1/2} U^T p_j = p_i^T Q p_j, \]

so orthogonality of q_i and q_j translates into

\[ p_i^T Q p_j = 0. \tag{7} \]

This condition is called Q-conjugacy, or Q-orthogonality: if equation (7) holds, then p_i and p_j are said to be Q-conjugate or Q-orthogonal to each other, or simply conjugate.
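A quick numerical check of this correspondence (all values hypothetical): directions that are orthogonal in the transformed coordinates y = E^{1/2} U^T x are exactly Q-conjugate in the original coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
Q = A @ A.T + 3 * np.eye(3)             # symmetric positive definite

lam, U = np.linalg.eigh(Q)              # Q = U E U^T with E = diag(lam)
E_sqrt = np.diag(np.sqrt(lam))

q1, q2 = np.array([1.0, 0, 0]), np.array([0, 1.0, 0])  # orthogonal in y-space
p1 = U @ np.linalg.solve(E_sqrt, q1)    # map back: p = U E^{-1/2} q
p2 = U @ np.linalg.solve(E_sqrt, q2)
print(p1 @ Q @ p2)                      # ~0: p1, p2 are Q-conjugate, eq. (7)
```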

Elliptical Case 4

In summary, if we can find n directions p_0, ..., p_{n-1} that are mutually conjugate, i.e. comply with (7), and if we do line minimization along each direction p_k, we reach the minimum in at most n steps. Of course, we cannot use the transformation (5) in the algorithm, because E and especially U^T are too expensive to compute and store. So we need to find a method for generating n conjugate directions without using either Q or its SVD.

Hestenes-Stiefel Procedure

The Hestenes-Stiefel procedure generates the conjugate directions one at a time, alternating line minimizations with direction updates. Starting from an arbitrary x_0 with p_0 = -g_0, it iterates for k = 0, ..., n-1:

\[ \alpha_k = -\frac{g_k^T p_k}{p_k^T Q p_k}, \qquad x_{k+1} = x_k + \alpha_k p_k, \]

\[ p_{k+1} = -g_{k+1} + \gamma_k p_k, \qquad \text{where} \qquad \gamma_k = \frac{g_{k+1}^T Q p_k}{p_k^T Q p_k}, \]

where g_k = g(x_k) = a + Q x_k is the gradient at x_k.
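A minimal sketch of the procedure for a quadratic with an explicit Q (the function name, tolerance, and the returned list of directions are illustrative choices, not part of the lecture):

```python
import numpy as np

def conjugate_gradient(Q, a, x0):
    """Hestenes-Stiefel iteration for f(x) = c + a^T x + 1/2 x^T Q x."""
    x = np.asarray(x0, dtype=float)
    g = a + Q @ x                      # gradient at x0
    p = -g                             # first direction: steepest descent
    directions = []
    for _ in range(len(x)):
        if np.linalg.norm(g) < 1e-12:
            break
        Qp = Q @ p
        alpha = -(g @ p) / (p @ Qp)    # exact line minimization along p
        x = x + alpha * p
        g = a + Q @ x                  # new gradient g_{k+1}
        gamma = (g @ Qp) / (p @ Qp)    # Hestenes-Stiefel coefficient
        directions.append(p)
        p = -g + gamma * p             # next direction, Q-conjugate to p_k
    return x, directions
```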

Hestenes-Stiefel Procedure 2

It is simple to see that p_k and p_{k+1} are conjugate. In fact,

\[ p_{k+1}^T Q p_k = (-g_{k+1} + \gamma_k p_k)^T Q p_k = -g_{k+1}^T Q p_k + \frac{g_{k+1}^T Q p_k}{p_k^T Q p_k} \, p_k^T Q p_k = 0. \]

The proof that p_i and p_{k+1} for i = 0, ..., k are also conjugate can be done by induction, based on the observation that the vectors p_k are found by a generalization of Gram-Schmidt, producing conjugate rather than orthogonal vectors.
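One can confirm the claim numerically, reusing the conjugate_gradient sketch above (run in the same session; the random problem instance is hypothetical): all pairs of generated directions are Q-conjugate to machine precision.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
Q = A @ A.T + 5 * np.eye(5)                    # symmetric positive definite
a = rng.standard_normal(5)

_, ps = conjugate_gradient(Q, a, np.zeros(5))  # directions p_0, ..., p_{n-1}
worst = max(abs(ps[i] @ Q @ ps[j])
            for i in range(len(ps)) for j in range(i))
print(worst)                                   # ~1e-12: mutual Q-conjugacy
```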

Removing the Hessian

In the described algorithm the expression for γ_k contains the Hessian Q, which is too large. We now show that γ_k can be rewritten in terms of the gradient values g_k and g_{k+1} only. To this end, we notice that the gradient of the quadratic f is g(x) = a + Q x, so that

\[ g_{k+1} = a + Q x_{k+1} = a + Q (x_k + \alpha_k p_k) = g_k + \alpha_k Q p_k, \]

or

\[ Q p_k = \frac{g_{k+1} - g_k}{\alpha_k}. \]

Removing the Hessian 2

We can therefore write

\[ \gamma_k = \frac{g_{k+1}^T Q p_k}{p_k^T Q p_k} = \frac{g_{k+1}^T (g_{k+1} - g_k)}{p_k^T (g_{k+1} - g_k)}, \]

and Q has disappeared. This expression for γ_k can be further simplified by noticing that

\[ p_k^T g_{k+1} = 0, \]

because the line along p_k is tangent to an isosurface at x_{k+1}, while the gradient g_{k+1} is orthogonal to the isosurface at x_{k+1}.

Polak-Ribière Formula

Similarly,

\[ p_{k-1}^T g_k = 0. \]

Then, using p_k = -g_k + γ_{k-1} p_{k-1}, the denominator of γ_k becomes

\[ p_k^T (g_{k+1} - g_k) = -p_k^T g_k = (g_k - \gamma_{k-1} p_{k-1})^T g_k = g_k^T g_k. \]

In conclusion, we obtain the Polak-Ribière formula

\[ \gamma_k = \frac{g_{k+1}^T (g_{k+1} - g_k)}{g_k^T g_k}. \]
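The payoff is a loop that touches only gradients. In this sketch the step size α_k is recovered by a one-step secant search on φ'(t) = g(x + t p)^T p, which is exact when f is quadratic; the function name and termination rule are assumptions.

```python
import numpy as np

def cg_polak_ribiere(grad, x0, tol=1e-10):
    """Conjugate gradients with the Polak-Ribiere formula: no Hessian needed."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    p = -g
    for _ in range(len(x)):
        if np.linalg.norm(g) < tol:
            break
        d0 = g @ p                               # phi'(0) along p
        d1 = grad(x + p) @ p                     # phi'(1) along p
        alpha = -d0 / (d1 - d0)                  # secant step: exact for quadratics
        x = x + alpha * p
        g_new = grad(x)
        gamma = g_new @ (g_new - g) / (g @ g)    # Polak-Ribiere: gradients only
        p = -g_new + gamma * p
        g = g_new
    return x
```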

General Case

When the function f(x) is arbitrary, the same algorithm can be used, but n iterations will not suffice, since the Hessian, which was constant for the quadratic case, is now a function of x_k. Strictly speaking, we then lose conjugacy, since p_k and p_{k+1} are associated with different Hessians. However, as the algorithm approaches the minimum x*, the quadratic approximation becomes more and more valid, and a few cycles of n iterations each will achieve convergence.
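A sketch of such a cycled scheme for general f: every cycle restarts the direction at the plain negative gradient, and a numerical line search replaces the closed-form α_k. The restart policy, the use of scipy.optimize.minimize_scalar for the line search, and the function names are assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nonlinear_cg(f, grad, x0, n_cycles=10, tol=1e-8):
    """Polak-Ribiere conjugate gradients for general f, restarted every n steps."""
    x = np.asarray(x0, dtype=float)
    n = len(x)
    for _ in range(n_cycles):
        g = grad(x)
        p = -g                                   # restart: steepest descent
        for _ in range(n):                       # one cycle of n CG steps
            if np.linalg.norm(g) < tol:
                return x
            # One-dimensional line minimization of f along p.
            alpha = minimize_scalar(lambda t: f(x + t * p)).x
            x = x + alpha * p
            g_new = grad(x)
            gamma = g_new @ (g_new - g) / (g @ g)
            p = -g_new + gamma * p
            g = g_new
    return x
```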