Multi-Variable Optimization Methods


Higher derivatives of multi-variable functions are defined as in the single-variable case, but note that the number of components increases by a factor of n with each differentiation.

While the gradient of a function of $n$ variables is an $n$-vector, the second derivative of an $n$-variable function is defined by $n^2$ partial derivatives of the $n$ first partial derivatives with respect to the $n$ variables,

$$\frac{\partial^2 f}{\partial x_i \partial x_j}, \; i \ne j; \qquad \frac{\partial^2 f}{\partial x_i^2}, \; i = j. \qquad (2)$$

If the partial derivatives $\partial f/\partial x_i$, $\partial f/\partial x_j$ and $\partial^2 f/\partial x_i \partial x_j$ are continuous, then $\partial^2 f/\partial x_j \partial x_i$ exists and $\partial^2 f/\partial x_i \partial x_j = \partial^2 f/\partial x_j \partial x_i$. These second order partial derivatives can be represented by a square symmetric matrix called the Hessian matrix,

$$\nabla^2 f(x) \equiv H(x) \equiv \begin{bmatrix}
\dfrac{\partial^2 f}{\partial x_1^2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2}
\end{bmatrix} \qquad (3)$$

    If f is a quadratic function, the Hessian of f is constant and f can be expressed as

$$f(x) = \tfrac{1}{2}\, x^T H x + g^T x + \alpha, \qquad (4)$$

where $\alpha$ is a scalar constant.
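For example, differentiating this quadratic term by term (using the symmetry of $H$) gives

$$\nabla f(x) = H x + g, \qquad \nabla^2 f(x) = H,$$

which confirms that the gradient is linear in $x$ and the Hessian is constant.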


As in the single-variable case, the optimality conditions can be derived from the Taylor-series expansion of $f$ about $x^*$,

$$f(x^* + \varepsilon p) = f(x^*) + \varepsilon\, p^T g(x^*) + \tfrac{1}{2}\, \varepsilon^2\, p^T H(x^* + \varepsilon \theta p)\, p, \qquad (5)$$

where $0 \le \theta \le 1$, $\varepsilon$ is a scalar, and $p$ is an $n$-vector.

For $x^*$ to be a local minimum, for any vector $p$ there must be a finite $\varepsilon$ such that $f(x^* + \varepsilon p) \ge f(x^*)$, i.e., there is a neighborhood in which this condition holds. If this condition is satisfied, then $f(x^* + \varepsilon p) - f(x^*) \ge 0$ and the first and second order terms in the Taylor-series expansion must be greater than or equal to zero.

As in the single-variable case, and for the same reason, we start by considering the first order terms. Since $p$ is an arbitrary vector whose components can be either positive or negative, every component of the gradient vector $g(x^*)$ must be zero.

Now we have to consider the second order term, $\tfrac{1}{2}\varepsilon^2 p^T H(x^* + \varepsilon\theta p)\, p$. For this term to be non-negative, $H(x^* + \varepsilon\theta p)$ has to be positive semi-definite, and by continuity, the Hessian at the optimum, $H(x^*)$, must also be positive semi-definite.
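As a quick numerical illustration of these conditions, the sketch below checks that the gradient vanishes and that the Hessian is positive semi-definite at the minimum of a small quadratic; the particular matrix, vector, tolerances, and the use of NumPy are illustrative choices, not part of the notes.

```python
import numpy as np

# Example quadratic: f(x) = 1/2 x^T H x + g^T x, with H symmetric positive definite.
H = np.array([[4.0, 1.0],
              [1.0, 3.0]])
g = np.array([-1.0, -2.0])

def grad_f(x):
    """Gradient of the quadratic: H x + g."""
    return H @ x + g

# Candidate minimum: the stationary point solves H x = -g.
x_star = np.linalg.solve(H, -g)

# First-order condition: every component of the gradient must be (numerically) zero.
assert np.allclose(grad_f(x_star), 0.0, atol=1e-12)

# Second-order condition: the Hessian at the optimum must be positive semi-definite,
# i.e. all of its eigenvalues are >= 0 (here they are strictly positive).
eigvals = np.linalg.eigvalsh(H)
assert np.all(eigvals >= 0.0)
print("x* =", x_star, " eigenvalues of H:", eigvals)
```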


    3.2 General Algorithm for Smooth Functions

All algorithms for unconstrained gradient-based optimization can be described as follows. We start with $k = 0$ and an estimate of $x^*$, $x_k$.

1. Test for convergence. If the conditions for convergence are satisfied, then we can stop and $x_k$ is the solution. Else, go to step 2.

2. Compute a search direction. Compute the vector $p_k$ that defines the direction in $n$-space along which we will search.

3. Compute the step length. Find a positive scalar, $\alpha_k$, such that $f(x_k + \alpha_k p_k) < f(x_k)$.

4. Update the design variables. Set $x_{k+1} = x_k + \alpha_k p_k$, $k = k + 1$ and go back to step 1.

$$x_{k+1} = x_k + \underbrace{\alpha_k p_k}_{\Delta x_k} \qquad (8)$$
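These four steps translate almost directly into code. The sketch below is a generic driver; the helper names `search_direction` and `step_length` are placeholders supplied by the particular method and are assumptions of this example, not part of the notes.

```python
import numpy as np

def minimize_smooth(f, grad, x0, search_direction, step_length,
                    eps_g=1e-6, max_iter=200):
    """Generic gradient-based descent loop following steps 1-4.

    search_direction(x, g) -> p_k : method-specific (steepest descent, Newton, ...)
    step_length(f, x, p)   -> alpha_k > 0 with f(x + alpha_k p) < f(x)
    """
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        # 1. Test for convergence.
        if np.linalg.norm(g) <= eps_g:
            break
        # 2. Compute a search direction.
        p = search_direction(x, g)
        # 3. Compute a step length that decreases f.
        alpha = step_length(f, x, p)
        # 4. Update the design variables.
        x = x + alpha * p
    return x
```

Passing, for instance, `lambda x, g: -g` as `search_direction` recovers steepest descent, while a Newton or quasi-Newton direction slots into the same loop.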


    3.3 Steepest Descent Method

The steepest descent method uses the negative of the gradient vector at each point as the search direction for each iteration. As mentioned previously, the gradient vector is orthogonal to the plane tangent to the isosurfaces of the function.

The gradient vector at a point, $g(x_k)$, is also the direction of maximum rate of change (maximum increase) of the function at that point. This rate of change is given by the norm, $\|g(x_k)\|$.

    Steepest descent algorithm:

1. Select starting point $x_0$, and convergence parameters $\varepsilon_g$, $\varepsilon_a$ and $\varepsilon_r$.
2. Compute $g(x_k) \equiv \nabla f(x_k)$. If $\|g(x_k)\| \le \varepsilon_g$ then stop. Otherwise, compute the normalized search direction $p_k = -g(x_k)/\|g(x_k)\|$.
3. Find the positive step length $\alpha_k$ such that $f(x_k + \alpha_k p_k)$ is minimized.
4. Update the current point, $x_{k+1} = x_k + \alpha_k p_k$.
5. Evaluate $f(x_{k+1})$. If the condition $|f(x_{k+1}) - f(x_k)| \le \varepsilon_a + \varepsilon_r |f(x_k)|$ is satisfied for two successive iterations then stop. Otherwise, set $k = k + 1$ and return to step 2.
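A minimal Python sketch of this algorithm follows. The quadratic test function and the simple backtracking line search (in place of the exact minimization in step 3) are illustrative assumptions of this example.

```python
import numpy as np

def backtracking(f, x, p, g, alpha=1.0, rho=0.5, c=1e-4):
    """Backtracking line search enforcing a sufficient-decrease (Armijo) condition."""
    while f(x + alpha * p) > f(x) + c * alpha * (g @ p):
        alpha *= rho
    return alpha

def steepest_descent(f, grad, x0, eps_g=1e-6, eps_a=1e-12, eps_r=1e-8, max_iter=5000):
    x = np.asarray(x0, dtype=float)
    hits = 0  # counts successive iterations satisfying the function-value test
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps_g:          # gradient-based stopping test
            break
        p = -g / np.linalg.norm(g)              # normalized steepest descent direction
        alpha = backtracking(f, x, p, g)
        x_new = x + alpha * p
        if abs(f(x_new) - f(x)) <= eps_a + eps_r * abs(f(x)):
            hits += 1
            if hits >= 2:                       # satisfied for two successive iterations
                x = x_new
                break
        else:
            hits = 0
        x = x_new
    return x

# Example: minimize f(x) = x1^2 + 10 x2^2 starting from (5, 1).
f = lambda x: x[0]**2 + 10.0 * x[1]**2
grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
print(steepest_descent(f, grad, [5.0, 1.0]))
```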


Note that the steepest descent direction at each iteration is orthogonal to the previous one, i.e., $g^T(x_{k+1})\, g(x_k) = 0$. Therefore the method zigzags in the design space and is rather inefficient.

The algorithm is guaranteed to converge, but it may take an infinite number of iterations. The rate of convergence is linear.

Usually, a substantial decrease is observed in the first few iterations, but the method is very slow after that.


    3.4 Conjugate Gradient Method

A small modification to the steepest descent method takes into account the history of the gradients to move more directly towards the optimum.

1. Select starting point $x_0$, and convergence parameters $\varepsilon_g$, $\varepsilon_a$ and $\varepsilon_r$.
2. Compute $g(x_0) \equiv \nabla f(x_0)$. If $\|g(x_0)\| \le \varepsilon_g$ then stop. Otherwise, set the initial direction to the steepest descent direction, $p_0 = -g(x_0)$, and go to step 5.
3. Compute $g(x_k) \equiv \nabla f(x_k)$. If $\|g(x_k)\| \le \varepsilon_g$ then stop. Otherwise continue.
4. Compute the new conjugate gradient direction $p_k = -g(x_k) + \beta_k p_{k-1}$, where

$$\beta_k = \left( \frac{\|g_k\|}{\|g_{k-1}\|} \right)^2 = \frac{g_k^T g_k}{g_{k-1}^T g_{k-1}}.$$

5. Find the positive step length $\alpha_k$ such that $f(x_k + \alpha_k p_k)$ is minimized.
6. Update the current point, $x_{k+1} = x_k + \alpha_k p_k$.
7. Evaluate $f(x_{k+1})$. If the condition $|f(x_{k+1}) - f(x_k)| \le \varepsilon_a + \varepsilon_r |f(x_k)|$ is satisfied for two successive iterations then stop. Otherwise, set $k = k + 1$ and return to step 3.

Usually, a restart is performed every $n$ iterations for computational stability, i.e., we start again with a steepest descent direction.
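A compact Python sketch of this scheme follows, using the Fletcher-Reeves form of $\beta_k$ and the periodic restart. The backtracking line search standing in for the exact minimization of step 5, the descent-direction safeguard, and the test function are assumptions of this example.

```python
import numpy as np

def conjugate_gradient(f, grad, x0, eps_g=1e-6, max_iter=1000):
    """Nonlinear conjugate gradient (Fletcher-Reeves beta) with periodic restart."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    g = grad(x)
    p = -g                                    # first direction: steepest descent
    for k in range(max_iter):
        if np.linalg.norm(g) <= eps_g:
            break
        if g @ p >= 0:                        # safeguard: fall back to a descent direction
            p = -g
        # Backtracking line search (stand-in for the exact minimization in the notes).
        alpha, f_x = 1.0, f(x)
        while f(x + alpha * p) > f_x + 1e-4 * alpha * (g @ p):
            alpha *= 0.5
        x = x + alpha * p
        g_new = grad(x)
        if (k + 1) % n == 0:                  # restart with steepest descent every n iterations
            p = -g_new
        else:
            beta = (g_new @ g_new) / (g @ g)  # Fletcher-Reeves: ||g_k||^2 / ||g_{k-1}||^2
            p = -g_new + beta * p
        g = g_new
    return x

# Example: a mildly ill-conditioned quadratic.
f = lambda x: 0.5 * (x[0]**2 + 25.0 * x[1]**2)
grad = lambda x: np.array([x[0], 25.0 * x[1]])
print(conjugate_gradient(f, grad, [3.0, 2.0]))
```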


3.5.1 Modified Newton's Method

A small modification to Newton's method is to perform a line search along the Newton direction, rather than accepting the step size that would minimize the quadratic model.

1. Select starting point $x_0$, and convergence parameter $\varepsilon_g$.
2. Compute $g(x_k) \equiv \nabla f(x_k)$. If $\|g(x_k)\| \le \varepsilon_g$ then stop. Otherwise, continue.
3. Compute $H(x_k) \equiv \nabla^2 f(x_k)$ and the search direction, $p_k = -H^{-1} g_k$.
4. Find the positive step length $\alpha_k$ such that $f(x_k + \alpha_k p_k)$ is minimized. (Start with $\alpha_k = 1$.)
5. Update the current point, $x_{k+1} = x_k + \alpha_k p_k$, and return to step 2.

Although this modification increases the probability that $f(x_k + \alpha_k p_k) < f(x_k)$, it is still vulnerable to the problem of a Hessian that is not positive definite, and it retains all the other disadvantages of the pure Newton's method.
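A sketch of the modified Newton iteration in Python follows. The backtracking search starting from $\alpha_k = 1$ replaces an exact line search, `np.linalg.solve` is used rather than forming $H^{-1}$ explicitly, and the steepest-descent fallback is a common safeguard; all three are implementation choices of this example, not part of the notes.

```python
import numpy as np

def modified_newton(f, grad, hess, x0, eps_g=1e-6, max_iter=100):
    """Newton's method with a line search along the Newton direction."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps_g:
            break
        H = hess(x)
        p = np.linalg.solve(H, -g)            # Newton direction p_k = -H^{-1} g_k
        if g @ p >= 0:                        # H not positive definite: step may not descend,
            p = -g                            # so fall back to steepest descent (safeguard)
        alpha = 1.0                           # start with the full Newton step
        while f(x + alpha * p) > f(x) + 1e-4 * alpha * (g @ p):
            alpha *= 0.5
        x = x + alpha * p
    return x

# Example on a simple non-quadratic function f(x) = x1^4 + x2^2.
f = lambda x: x[0]**4 + x[1]**2
grad = lambda x: np.array([4.0 * x[0]**3, 2.0 * x[1]])
hess = lambda x: np.array([[12.0 * x[0]**2, 0.0], [0.0, 2.0]])
print(modified_newton(f, grad, hess, [2.0, 1.0]))
```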


    3.6 Quasi-Newton Methods

This class of methods uses first order information only, but builds second order information, an approximate Hessian, based on the sequence of function values and gradients from previous iterations. The method also forces $H$ to be symmetric and positive definite, greatly improving its convergence properties.

When using quasi-Newton methods, we usually start with the Hessian initialized to the identity matrix and then update it at each iteration. Since what we actually need is the inverse of the Hessian, we will work with $V_k \equiv H_k^{-1}$. The update at each iteration is written as $\Delta V_k$ and is added to the current one,

$$V_{k+1} = V_k + \Delta V_k. \qquad (11)$$

Let $s_k$ be the step taken from $x_k$, and consider the Taylor-series expansion of the gradient function about $x_k$,

$$g(x_k + s_k) = g_k + H_k s_k + \cdots \qquad (12)$$

Truncating this series and setting the variation of the gradient to $y_k = g(x_k + s_k) - g_k$ yields

$$H_k s_k = y_k. \qquad (13)$$

Then, the new approximate inverse of the Hessian, $V_{k+1}$, must satisfy the quasi-Newton condition,

$$V_{k+1} y_k = s_k. \qquad (14)$$


3.6.1 Davidon-Fletcher-Powell (DFP) Method

One of the first quasi-Newton methods was devised by Davidon (in 1959) and modified by Fletcher and Powell (1963).

1. Select starting point $x_0$, and convergence parameter $\varepsilon_g$. Set $k = 0$ and $H_0 = I$ (so $V_0 = I$).
2. Compute $g(x_k) \equiv \nabla f(x_k)$. If $\|g(x_k)\| \le \varepsilon_g$ then stop. Otherwise, continue.
3. Compute the search direction, $p_k = -V_k g_k$.
4. Find the positive step length $\alpha_k$ such that $f(x_k + \alpha_k p_k)$ is minimized. (Start with $\alpha_k = 1$.)
5. Update the current point, $x_{k+1} = x_k + \alpha_k p_k$, set $s_k = \alpha_k p_k$, and compute the change in the gradient, $y_k = g_{k+1} - g_k$.
6. Update $V_{k+1}$ by computing

$$A_k = \frac{V_k y_k y_k^T V_k}{y_k^T V_k y_k}, \qquad B_k = \frac{s_k s_k^T}{s_k^T y_k}, \qquad V_{k+1} = V_k - A_k + B_k.$$

7. Set $k = k + 1$ and return to step 2.
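Written out in code, step 6 is a rank-two update. The short check below verifies that the updated $V_{k+1}$ satisfies the quasi-Newton condition (14); the vectors used are arbitrary illustrative values with $s_k^T y_k > 0$.

```python
import numpy as np

def dfp_update(V, s, y):
    """DFP update of the approximate inverse Hessian: V_{k+1} = V_k - A_k + B_k."""
    Vy = V @ y
    A = np.outer(Vy, Vy) / (y @ Vy)           # A_k = V y y^T V / (y^T V y)
    B = np.outer(s, s) / (s @ y)              # B_k = s s^T / (s^T y)
    return V - A + B

# Arbitrary illustrative data (chosen so that s^T y > 0, as required for the update).
V = np.eye(3)
s = np.array([0.3, -0.1, 0.2])
y = np.array([1.0, 0.5, -0.4])
V_next = dfp_update(V, s, y)

# Quasi-Newton (secant) condition, eq. (14): V_{k+1} y_k = s_k.
assert np.allclose(V_next @ y, s)
```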


3.6.2 Broyden-Fletcher-Goldfarb-Shanno (BFGS) Method

The BFGS method is also a quasi-Newton method with a different updating formula,

$$V_{k+1} = \left( I - \frac{s_k y_k^T}{s_k^T y_k} \right) V_k \left( I - \frac{y_k s_k^T}{s_k^T y_k} \right) + \frac{s_k s_k^T}{s_k^T y_k}. \qquad (15)$$

    The relative performance between the DFP and BFGS methods is problem dependent.
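The corresponding inverse-Hessian update of eq. (15) as a small function; it satisfies the same quasi-Newton condition, which the check below confirms on the same kind of arbitrary illustrative data. The full method plugs this update into the same loop as DFP.

```python
import numpy as np

def bfgs_update(V, s, y):
    """BFGS update of the approximate inverse Hessian, eq. (15)."""
    rho = 1.0 / (s @ y)
    I = np.eye(len(s))
    return (I - rho * np.outer(s, y)) @ V @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)

# Arbitrary illustrative data with s^T y > 0; the update obeys V_{k+1} y_k = s_k.
V = np.eye(3)
s = np.array([0.3, -0.1, 0.2])
y = np.array([1.0, 0.5, -0.4])
assert np.allclose(bfgs_update(V, s, y) @ y, s)
```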


    3.7 Trust Region Methods

Trust region, or restricted-step, methods are a different approach to resolving the weaknesses of the pure form of Newton's method, which arise from a Hessian that is not positive definite or from a highly nonlinear function.

One way to interpret these problems is to say that they arise because we are stepping outside the region for which the quadratic approximation is reasonable. We can overcome these difficulties by minimizing the quadratic model within a region around $x_k$ within which we trust it.


1. Select starting point $x_0$, a convergence parameter $\varepsilon_g$ and the initial size of the trust region, $h_0$.

2. Compute $g(x_k) \equiv \nabla f(x_k)$. If $\|g(x_k)\| \le \varepsilon_g$ then stop. Otherwise, continue.

3. Compute $H(x_k) \equiv \nabla^2 f(x_k)$ and solve the quadratic subproblem

$$\begin{aligned}
\text{minimize} \quad & q(s_k) = f(x_k) + g(x_k)^T s_k + \tfrac{1}{2}\, s_k^T H(x_k)\, s_k & (16) \\
\text{w.r.t.} \quad & s_k & (17) \\
\text{s.t.} \quad & -h_k \le (s_k)_i \le h_k, \quad i = 1, \ldots, n & (18)
\end{aligned}$$

4. Evaluate $f(x_k + s_k)$ and compute the ratio that measures the accuracy of the quadratic model,

$$r_k = \frac{\Delta f}{\Delta q} = \frac{f(x_k) - f(x_k + s_k)}{f(x_k) - q(s_k)}.$$

5. Compute the size of the new trust region as follows:

$$h_{k+1} = \frac{\|s_k\|}{4} \quad \text{if } r_k < 0.25, \qquad (19)$$

$$h_{k+1} = 2 h_k \quad \text{if } r_k > 0.75 \text{ and } h_k = \|s_k\|, \qquad (20)$$

$$h_{k+1} = h_k \quad \text{otherwise.} \qquad (21)$$
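A rough Python sketch of one way to implement this loop follows. The subproblem "solver" simply clips the Newton step to the box, which is a crude stand-in for solving (16)-(18) exactly, it uses the infinity norm to match the box-shaped region, and the step-acceptance rule (accept only if $f$ actually decreased) is the usual one even though it is not shown in the excerpt above; all of these are assumptions of this sketch.

```python
import numpy as np

def clipped_newton_step(g, H, h):
    """Crude approximation to the box-constrained subproblem (16)-(18):
    take the unconstrained minimizer of q and clip it to [-h, h] componentwise.
    A real implementation would solve the bound-constrained QP properly."""
    try:
        s = np.linalg.solve(H, -g)
    except np.linalg.LinAlgError:
        s = -g
    return np.clip(s, -h, h)

def trust_region(f, grad, hess, x0, h0=1.0, eps_g=1e-6, max_iter=200):
    x = np.asarray(x0, dtype=float)
    h = h0
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps_g:
            break
        H = hess(x)
        s = clipped_newton_step(g, H, h)
        q = f(x) + g @ s + 0.5 * s @ H @ s           # quadratic model value, eq. (16)
        r = (f(x) - f(x + s)) / (f(x) - q)           # accuracy ratio r_k
        step_norm = np.linalg.norm(s, np.inf)        # infinity norm matches the box region
        if r < 0.25:                                 # poor model: shrink the trust region
            h = step_norm / 4.0
        elif r > 0.75 and np.isclose(step_norm, h):  # good model, step at the boundary: expand
            h = 2.0 * h
        if r > 0:                                    # accept the step only if f decreased
            x = x + s
    return x
```

Passing a proper bound-constrained QP solver in place of `clipped_newton_step` would recover the subproblem (16)-(18) as stated.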


    3.8 Convergence Characteristics for a Quadratic Function


    3.9 Convergence Characteristics for the Rosenbrock Function
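The convergence plots from the original slides are not reproduced in this transcript. As a rough stand-in, library implementations of the methods discussed above can be compared on the Rosenbrock function; the sketch below uses SciPy, an external library not referenced in the notes, together with its built-in `rosen` helpers.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der, rosen_hess

x0 = np.array([-1.2, 1.0])   # classical starting point for the Rosenbrock function
for method in ("CG", "BFGS", "Newton-CG", "trust-ncg"):
    needs_hess = method in ("Newton-CG", "trust-ncg")
    res = minimize(rosen, x0, method=method, jac=rosen_der,
                   hess=rosen_hess if needs_hess else None)
    print(f"{method:10s} iterations = {res.nit:4d}  f(x*) = {res.fun:.3e}")
```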

