Nonlinear Programming Models
Fabio Schoen
2008
http://gol.dsi.unifi.it/users/schoen
Nonlinear Programming Models – p. 1
Introduction
Nonlinear Programming Models – p. 2
NLP problems
min f(x)
x ∈ S ⊆ Rn
Standard form:
min f(x)
h_i(x) = 0   i = 1, . . . , m
g_j(x) ≤ 0   j = 1, . . . , k
Here S = {x ∈ R^n : h_i(x) = 0 ∀ i, g_j(x) ≤ 0 ∀ j}
Nonlinear Programming Models – p. 3
Local and global optima
A global minimum or global optimum is any x⋆ ∈ S such that
x ∈ S ⇒ f(x) ≥ f(x⋆)
A point x̄ is a local optimum if ∃ ε > 0 such that
x ∈ S ∩ B(x̄, ε) ⇒ f(x) ≥ f(x̄)
where B(x̄, ε) = {x ∈ R^n : ‖x − x̄‖ ≤ ε} is a ball in R^n.
Any global optimum is also a local optimum, but the opposite is generally false.
Nonlinear Programming Models – p. 4
Convex Functions
A set S ⊆ Rn is convex if
x, y ∈ S⇒λx + (1 − λ)y ∈ S
for all choices of λ ∈ [0, 1]. Let Ω ⊆ R^n be a nonempty convex set. A
function f : Ω → R is convex iff
f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)
for all x, y ∈ Ω, λ ∈ [0, 1]
Nonlinear Programming Models – p. 5
Convex Functions
[figure: the graph of a convex function lies below the chord joining the points above x and y]
Nonlinear Programming Models – p. 6
Properties of convex functions
Every convex function is continuous in the interior of Ω. It might be discontinuous, but only on the frontier.
If f is continuously differentiable, then it is convex iff
f(y) ≥ f(x) + (y − x)^T ∇f(x)
for all x, y ∈ Ω
Nonlinear Programming Models – p. 7
Convex functions
[figure: a differentiable convex function lies above its tangent line at x]
Nonlinear Programming Models – p. 8
If f is twice continuously differentiable ⇒ f is convex iff its Hessian matrix is positive semi-definite:
∇²f(x) := [ ∂²f / ∂x_i ∂x_j ]
Then ∇²f(x) ⪰ 0 iff
v^T ∇²f(x) v ≥ 0   ∀ v ∈ R^n
or, equivalently, all eigenvalues of ∇²f(x) are non negative.
Nonlinear Programming Models – p. 9
Example: an affine function is convex (and concave).
For a quadratic function (Q: symmetric matrix):
f(x) = (1/2) x^T Q x + b^T x + c
we have
∇f(x) = Qx + b,   ∇²f(x) = Q
⇒ f is convex iff Q ⪰ 0
Nonlinear Programming Models – p. 10
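The eigenvalue criterion above is easy to check numerically; a minimal sketch with NumPy (the function name is illustrative):

```python
import numpy as np

def is_convex_quadratic(Q, tol=1e-10):
    """A quadratic f(x) = 0.5 x^T Q x + b^T x + c (Q symmetric) is convex
    iff Q is positive semi-definite, i.e. all eigenvalues of Q are >= 0."""
    Q = (Q + Q.T) / 2  # symmetrize to guard against asymmetric input
    return bool(np.all(np.linalg.eigvalsh(Q) >= -tol))

print(is_convex_quadratic(np.array([[2.0, 0.0], [0.0, 1.0]])))   # True
print(is_convex_quadratic(np.array([[1.0, 0.0], [0.0, -0.5]])))  # False
```

A single negative eigenvalue already makes the quadratic non convex, matching the NP-hardness remark two slides below.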
Convex Optimization Problems
min f(x)
x ∈ S
is a convex optimization problem iff S is a convex set and f is convex on S. For a problem in standard form
min f(x)
h_i(x) = 0   i = 1, . . . , m
g_j(x) ≤ 0   j = 1, . . . , k
if f is convex, the h_i(x) are affine functions and the g_j(x) are convex functions, then the problem is convex.
Nonlinear Programming Models – p. 11
Maximization
Slight abuse of notation: a problem
max f(x)
x ∈ S
is called convex iff S is a convex set and f is a concave function (not to be confused with minimization of a concave function, or maximization of a convex function, which are NOT convex optimization problems)
Nonlinear Programming Models – p. 12
Convex and non convex optimization
Convex optimization “is easy”, non convex optimization is usually very hard.
Fundamental property of convex optimization problems: every local optimum is also a global optimum (a proof will be given later).
Minimizing a positive semidefinite quadratic function on a polyhedron is easy (polynomially solvable); if even a single eigenvalue of the Hessian is negative ⇒ the problem becomes NP–hard.
Nonlinear Programming Models – p. 13
Convex functions: examples
Many (of course not all . . . ) functions are convex!
affine functions aT x + b
quadratic functions (1/2) x^T Q x + b^T x + c with Q = Q^T, Q ⪰ 0
any norm is a convex function
x log x (however log x is concave)
f is convex if and only if ∀ x0, d ∈ R^n its restriction to any line, φ(α) = f(x0 + αd), is a convex function
a linear non negative combination of convex functions is convex
g(x, y) convex in x for all y ⇒ ∫ g(x, y) dy is convex
Nonlinear Programming Models – p. 14
more examples . . .
max_i (a_i^T x + b_i) is convex
f, g convex ⇒ max{f(x), g(x)} is convex
f_a convex for any a ∈ A (a possibly uncountable set) ⇒ sup_{a∈A} f_a(x) is convex
f convex ⇒ f(Ax + b) is convex
let S ⊆ R^n be any set ⇒ f(x) = sup_{s∈S} ‖x − s‖ is convex
Trace(A^T X) = Σ_{i,j} A_ij X_ij is convex (it is linear!)
log det X^{−1} is convex over the set of matrices {X ∈ R^{n×n} : X ≻ 0}
λ_max(X) (the largest eigenvalue of a matrix X)
Nonlinear Programming Models – p. 15
Data Approximation
Nonlinear Programming Models – p. 16
Table of contents
norm approximation
maximum likelihood
robust estimation
Nonlinear Programming Models – p. 17
Norm approximation
Problem:
min_x ‖Ax − b‖
where A, b are parameters. Usually the system is over-determined, i.e. b ∉ Range(A).
For example, this happens when A ∈ R^{m×n} with m > n and A has full rank.
r := Ax − b: the “residual”.
Nonlinear Programming Models – p. 18
Examples
‖r‖ = √(r^T r): least squares (or “regression”)
‖r‖ = √(r^T P r) with P ≻ 0: weighted least squares
‖r‖ = max_i |r_i|: minimax, or ℓ∞, or Tchebichev approximation
‖r‖ = Σ_i |r_i|: absolute or ℓ1 approximation
Possible (convex) additional constraints:
maximum deviation from an initial estimate: ‖x − x_est‖ ≤ ε
simple bounds ℓ_i ≤ x_i ≤ u_i
ordering: x1 ≤ x2 ≤ · · · ≤ xn
Nonlinear Programming Models – p. 19
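The ℓ2 problem is ordinary least squares and the ℓ1 problem is an LP; a sketch with NumPy/SciPy on a random instance with A ∈ R^{100×30} (sizes chosen to match the example matrix; variable names are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n = 100, 30
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# ell_2: ordinary least squares
x2, *_ = np.linalg.lstsq(A, b, rcond=None)

# ell_1: min sum(t) s.t. -t <= Ax - b <= t, an LP in the variables (x, t)
c = np.concatenate([np.zeros(n), np.ones(m)])
A_ub = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])
b_ub = np.concatenate([b, -b])
bounds = [(None, None)] * n + [(0, None)] * m
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
x1 = res.x[:n]

r2, r1 = A @ x2 - b, A @ x1 - b
print(np.abs(r1).sum() <= np.abs(r2).sum())  # ell_1 fit: smaller 1-norm residual
print(r2 @ r2 <= r1 @ r1)                    # ell_2 fit: smaller 2-norm residual
```

Each estimate is optimal in its own norm, which is exactly what the two printed comparisons verify.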
Example: ℓ1 norm
Matrix A ∈ R100×30
[figure: histogram of the ℓ1-norm residuals, residual values on [−5, 5]]
Nonlinear Programming Models – p. 20
ℓ∞ norm
[figure: histogram of the ℓ∞-norm residuals, residual values on [−5, 5]]
Nonlinear Programming Models – p. 21
ℓ2 norm
[figure: histogram of the ℓ2-norm residuals, residual values on [−5, 5]]
Nonlinear Programming Models – p. 22
Variants
min Σ_i h(y_i − a_i^T x) where h is a convex function:
linear–quadratic: h(z) = z² if |z| ≤ 1, h(z) = 2|z| − 1 if |z| > 1
“dead zone”: h(z) = 0 if |z| ≤ 1, h(z) = |z| − 1 if |z| > 1
logarithmic barrier: h(z) = − log(1 − z²) if |z| < 1, h(z) = ∞ if |z| ≥ 1
Nonlinear Programming Models – p. 23
comparison
[figure: comparison of the penalty functions — norm 1, norm 2, linear–quadratic, dead zone, logarithmic barrier — on [−2, 2]]
Nonlinear Programming Models – p. 24
Maximum likelihood
Given a sample X1, X2, . . . , Xk and a parametric family of probability density functions L(·; θ), the maximum likelihood estimate of θ given the sample is
θ̂ = arg max_θ L(X1, . . . , Xk; θ)
Example: linear measurements with additive i.i.d. (independent, identically distributed) noise:
X_i = a_i^T θ + ε_i
where the ε_i are i.i.d. random variables with density p(·):
L(X1, . . . , Xk; θ) = ∏_{i=1..k} p(X_i − a_i^T θ)
Nonlinear Programming Models – p. 25
Max likelihood estimate - MLE
(taking the logarithm, which does not change optimum points):
θ̂ = arg max_θ Σ_i log p(X_i − a_i^T θ)
If p is log–concave ⇒ this problem is convex. Examples:
ε ∼ N(0, σ²), i.e. p(z) = (2πσ²)^{−1/2} exp(−z²/2σ²) ⇒ MLE is the ℓ2 estimate: θ̂ = arg min ‖Aθ − X‖₂;
p(z) = (1/(2a)) exp(−|z|/a) ⇒ ℓ1 estimate: θ̂ = arg min_θ ‖Aθ − X‖₁
Nonlinear Programming Models – p. 26
p(z) = (1/a) exp(−z/a) 1_{z≥0} (negative exponential) ⇒ the estimate can be found solving the LP problem:
min 1^T (X − Aθ)
Aθ ≤ X
p uniform on [−a, a] ⇒ the MLE is any θ such that ‖Aθ − X‖∞ ≤ a
Nonlinear Programming Models – p. 27
Ellipsoids
An ellipsoid is a subset of R^n of the form
E = {x ∈ R^n : (x − x0)^T P^{−1} (x − x0) ≤ 1}
where x0 ∈ R^n is the center of the ellipsoid and P is a symmetric positive-definite matrix. Alternative representations:
E = {x ∈ R^n : ‖Ax − b‖₂ ≤ 1}
where A ≻ 0, or
E = {x ∈ R^n : x = x0 + Au, ‖u‖₂ ≤ 1}
where A is square and non singular (affine transformation of the unit ball)
Nonlinear Programming Models – p. 28
Robust Least Squares
Least Squares: x̂ = arg min √(Σ_i (a_i^T x − b_i)²). Hypothesis: the a_i are not known, but it is known that
a_i ∈ E_i = {ā_i + P_i u : ‖u‖ ≤ 1}
where P_i = P_i^T ⪰ 0. Definition: worst case residuals:
max_{a_i∈E_i} √(Σ_i (a_i^T x − b_i)²)
A robust estimate of x is the solution of
x_r = arg min_x max_{a_i∈E_i} √(Σ_i (a_i^T x − b_i)²)
Nonlinear Programming Models – p. 29
RLS
It holds:
|α + β^T y| ≤ |α| + ‖β‖ ‖y‖
Then, choosing y⋆ = β/‖β‖ if α ≥ 0 and y⋆ = −β/‖β‖ if α < 0, we have ‖y⋆‖ = 1 and
|α + β^T y⋆| = |α + β^T β sign(α)/‖β‖| = |α| + ‖β‖
Then:
max_{a_i∈E_i} |a_i^T x − b_i| = max_{‖u‖≤1} |ā_i^T x − b_i + u^T P_i x|
= |ā_i^T x − b_i| + ‖P_i x‖
Nonlinear Programming Models – p. 30
. . .
Thus the Robust Least Squares problem reduces to
min ( Σ_i (|ā_i^T x − b_i| + ‖P_i x‖)² )^{1/2}
(a convex optimization problem). Transformation:
min_{x,t} ‖t‖₂
|ā_i^T x − b_i| + ‖P_i x‖ ≤ t_i   ∀ i
Nonlinear Programming Models – p. 31
. . .
min_{x,t} ‖t‖₂
ā_i^T x − b_i + ‖P_i x‖ ≤ t_i
−ā_i^T x + b_i + ‖P_i x‖ ≤ t_i
(a Second Order Cone Problem). A norm cone is a convex set
C = {(x, t) ∈ R^{n+1} : ‖x‖ ≤ t}
Nonlinear Programming Models – p. 32
Geometrical Problems
Nonlinear Programming Models – p. 33
Geometrical Problems
projections and distances
polyhedral intersection
extremal volume ellipsoids
classification problems
Nonlinear Programming Models – p. 34
Projection on a set
Given a set C the projection of x on C is defined as:
P_C(x) = arg min_{z∈C} ‖z − x‖
[figure: points outside a convex set and their projections onto it]
Nonlinear Programming Models – p. 35
Projection on a convex set
If
C = {x : Ax = b, f_i(x) ≤ 0, i = 1, . . . , m}
where the f_i are convex ⇒ C is a convex set and the problem
P_C(x) = arg min ‖x − z‖
Az = b
f_i(z) ≤ 0   i = 1, . . . , m
is convex.
Nonlinear Programming Models – p. 36
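For some simple convex sets the projection is even available in closed form; a small sketch (function names are illustrative):

```python
import numpy as np

def project_box(x, lo, hi):
    # Projection onto {z : lo <= z <= hi} in the Euclidean norm
    # decouples per coordinate: clip each component.
    return np.clip(x, lo, hi)

def project_hyperplane(x, a, b):
    # Projection onto {z : a^T z = b}: x - ((a^T x - b) / ||a||^2) a
    return x - (a @ x - b) / (a @ a) * a

x = np.array([2.0, -3.0, 0.5])
print(project_box(x, -1.0, 1.0))        # [ 1.  -1.   0.5]
a, b = np.array([1.0, 1.0, 1.0]), 0.0
z = project_hyperplane(x, a, b)
print(abs(a @ z - b) < 1e-12)           # projected point is feasible
```

For a general C described by constraints, the convex program on the slide must be solved numerically instead.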
Distance between convex sets
dist(C(1), C(2)) = min_{x∈C(1), y∈C(2)} ‖x − y‖
Nonlinear Programming Models – p. 37
Distance between convex sets
If C(j) = {x : A(j)x = b(j), f_i(j)(x) ≤ 0}, then the minimum distance can be found through a convex model:
min ‖x(1) − x(2)‖
A(1)x(1) = b(1)
A(2)x(2) = b(2)
f_i(1)(x(1)) ≤ 0
f_i(2)(x(2)) ≤ 0
Nonlinear Programming Models – p. 38
Polyhedral intersection
1: polyhedra described by means of linear inequalities:
P1 = {x : Ax ≤ b},  P2 = {x : Cx ≤ d}
Nonlinear Programming Models – p. 39
Polyhedral intersection
P1 ∩ P2 = ∅? It is a linear feasibility problem: Ax ≤ b, Cx ≤ d
P1 ⊆ P2? Just check
sup {c_k^T x : Ax ≤ b} ≤ d_k   ∀ k
(solution of a finite number of LP’s)
Nonlinear Programming Models – p. 40
Polyhedral intersection (2)
2: polyhedra (polytopes) described through vertices:
P1 = conv{v1, . . . , vk},  P2 = conv{w1, . . . , wh}
P1 ∩ P2 = ∅? Need to find λ1, . . . , λk, µ1, . . . , µh ≥ 0:
Σ_i λ_i = 1,  Σ_j µ_j = 1
Σ_i λ_i v_i = Σ_j µ_j w_j
P1 ⊆ P2? ∀ i = 1, . . . , k check whether ∃ µ_j ≥ 0:
Σ_j µ_j = 1
Σ_j µ_j w_j = v_i
Nonlinear Programming Models – p. 41
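The vertex-representation intersection test is exactly an LP feasibility problem; a sketch with SciPy (function name and example triangles are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def polytopes_intersect(V, W):
    """V: n x k vertex matrix of P1, W: n x h vertex matrix of P2.
    P1 and P2 intersect iff the system V@lam = W@mu, sum(lam) = 1,
    sum(mu) = 1, lam, mu >= 0 has a solution (an LP with zero objective)."""
    n, k = V.shape
    _, h = W.shape
    A_eq = np.zeros((n + 2, k + h))
    A_eq[:n, :k] = V
    A_eq[:n, k:] = -W
    A_eq[n, :k] = 1.0        # sum of lambda = 1
    A_eq[n + 1, k:] = 1.0    # sum of mu = 1
    b_eq = np.concatenate([np.zeros(n), [1.0, 1.0]])
    res = linprog(np.zeros(k + h), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (k + h))
    return res.status == 0   # 0 = feasible/optimal, 2 = infeasible

tri1 = np.array([[0.0, 2.0, 0.0], [0.0, 0.0, 2.0]])  # triangle near the origin
tri2 = np.array([[1.0, 3.0, 1.0], [1.0, 1.0, 3.0]])  # touches tri1 at (1, 1)
tri3 = np.array([[5.0, 6.0, 5.0], [5.0, 5.0, 6.0]])  # far away
print(polytopes_intersect(tri1, tri2))  # True
print(polytopes_intersect(tri1, tri3))  # False
```

The containment test P1 ⊆ P2 works the same way, one such LP per vertex v_i.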
Minimal ellipsoid containing k points
Given v1, . . . , vk ∈ Rn find an ellipsoid
E = {x : ‖Ax − b‖ ≤ 1}
with minimal volume containing the k given points.
[figure: a minimal-volume ellipsoid enclosing a scatter of points]
Nonlinear Programming Models – p. 42
A = A^T ≻ 0. The volume of E is proportional to det A^{−1} ⇒ convex optimization problem (in the unknowns A, b):
min log det A^{−1}
A = A^T, A ≻ 0
‖A v_i − b‖ ≤ 1   i = 1, . . . , k
Nonlinear Programming Models – p. 43
Max. ellipsoid contained in a polyhedron
Given P = {x : Ax ≤ b} find an ellipsoid:
E = {By + d : ‖y‖ ≤ 1}
contained in P with maximum volume.
Nonlinear Programming Models – p. 44
Max. ellipsoid contained in a polyhedron
E ⊆ P ⇔ a_i^T (By + d) ≤ b_i   ∀ y : ‖y‖ ≤ 1
⇔ sup_{‖y‖≤1} a_i^T By + a_i^T d ≤ b_i   ∀ i
⇔ ‖B a_i‖ + a_i^T d ≤ b_i
max_{B,d} log det B
B = B^T ≻ 0
‖B a_i‖ + a_i^T d ≤ b_i   i = 1, . . .
Nonlinear Programming Models – p. 45
Difficult variants
These problems are hard:
find a maximal volume ellipsoid contained in a polyhedrongiven by its vertices
[figure: the vertices of a polytope and a maximal inscribed ellipsoid]
Nonlinear Programming Models – p. 46
find a minimal volume ellipsoid containing a polyhedrondescribed as a system of linear inequalities.
Nonlinear Programming Models – p. 47
It is already a difficult problem to decide whether a given ellipsoid E contains a polyhedron P = {x : Ax ≤ b}. This problem is still difficult even when the ellipsoid is a sphere: it is equivalent to norm maximization over a polyhedron – an NP–hard concave optimization problem.
Nonlinear Programming Models – p. 48
Linear classification (separation)
[figure: two point clouds separated by a hyperplane]
Nonlinear Programming Models – p. 49
Given two point sets X1, . . . , Xk and Y1, . . . , Yh, find a hyperplane a^T x = t such that:
a^T X_i ≥ 1   i = 1, . . . , k
a^T Y_j ≤ 1   j = 1, . . . , h
(an LP feasibility problem).
Nonlinear Programming Models – p. 50
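The separation problem can be posed as an LP feasibility problem; a sketch with SciPy, using the common ±1-margin normalization around a threshold t (an assumption of this sketch, not taken verbatim from the slide; function name and points are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def separate(X, Y):
    """Find (a, t) with a^T x >= t + 1 for all rows of X and
    a^T y <= t - 1 for all rows of Y, as an LP feasibility problem.
    Returns (a, t), or None if no separator with this margin exists."""
    k, n = X.shape
    h, _ = Y.shape
    # variables z = (a, t); constraints written as A_ub @ z <= b_ub
    A_ub = np.vstack([np.hstack([-X, np.ones((k, 1))]),    # -a^T Xi + t <= -1
                      np.hstack([Y, -np.ones((h, 1))])])   #  a^T Yj - t <= -1
    b_ub = -np.ones(k + h)
    res = linprog(np.zeros(n + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (n + 1))
    return (res.x[:n], res.x[n]) if res.status == 0 else None

X = np.array([[2.0, 2.0], [3.0, 2.5]])
Y = np.array([[0.0, 0.0], [-1.0, 0.5]])
a, t = separate(X, Y)
print(np.all(X @ a >= t + 1 - 1e-6) and np.all(Y @ a <= t - 1 + 1e-6))  # True
```

The zero objective means any feasible point is acceptable; the robust variant on the next slides instead maximizes the separation margin.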
Robust separation
[figure: two point clouds separated by a hyperplane with maximal margin]
Nonlinear Programming Models – p. 51
Robust separation
Find a “maximal” separation:
max_{a:‖a‖≤1} ( min_i a^T X_i − max_j a^T Y_j )
equivalent to the convex problem:
max t1 − t2
aT Xi ≥ t1 ∀ i
aT Yj ≤ t2 ∀ j
‖a‖ ≤ 1
Nonlinear Programming Models – p. 52
Optimality Conditions
Fabio Schoen
2008
http://gol.dsi.unifi.it/users/schoen
Optimality Conditions – p. 1
Optimality Conditions: descent directions
Let S ⊆ Rn be a convex set and consider the problem
minx∈S
f(x)
where f : S → R. Let x1, x2 ∈ S and d = x2 − x1: d is a feasible direction.
If there exists ε̄ > 0 such that f(x1 + εd) < f(x1) ∀ ε ∈ (0, ε̄), d is called a descent direction at x1.
Elementary necessary optimality condition: if x⋆ is a local optimum, no descent direction may exist at x⋆
Optimality Conditions – p. 2
Optimality Conditions for Convex Sets
If x⋆ ∈ S is a local optimum for f() and there exists aneighborhood U(x⋆) such that f ∈ C1(U(x⋆)), then
dT∇f(x⋆) ≥ 0 ∀ d : feasible direction
Optimality Conditions – p. 3
proof
Taylor expansion:
f(x⋆ + εd) = f(x⋆) + ε d^T∇f(x⋆) + o(ε)
d cannot be a descent direction, so, if ε is sufficiently small, then f(x⋆ + εd) ≥ f(x⋆). Thus
ε d^T∇f(x⋆) + o(ε) ≥ 0
and dividing by ε,
d^T∇f(x⋆) + o(ε)/ε ≥ 0
Letting ε ↓ 0 the proof is complete.
Optimality Conditions – p. 5
Optimality Conditions: tangent cone
General case:
min f(x)
gi(x) ≤ 0 i = 1, . . . ,m
x ∈ X (X : open set)
Let S = {x ∈ X : g_i(x) ≤ 0, i = 1, . . . , m}.
Tangent cone to S at x: T(x) = { d ∈ R^n :
d/‖d‖ = lim_{x_k→x} (x_k − x)/‖x_k − x‖ }
where x_k ∈ S.
Optimality Conditions – p. 6
[figure: a feasible sequence approaching a boundary point and the resulting tangent cone]
Optimality Conditions – p. 7
Some examples
S = R^n ⇒ T(x) = R^n ∀ x
S = {Ax = b} ⇒ T(x) = {d : Ad = 0}
S = {Ax ≤ b}; let I be the set of active constraints at x:
a_i^T x = b_i   i ∈ I
a_i^T x < b_i   i ∉ I.
Optimality Conditions – p. 8
Let d = lim_k (x_k − x)/‖x_k − x‖ ⇒
a_i^T d = a_i^T lim_k (x_k − x)/‖x_k − x‖   i ∈ I
= lim_k a_i^T (x_k − x)/‖x_k − x‖
= lim_k (a_i^T x_k − b_i)/‖x_k − x‖
≤ 0
Thus if d ∈ T(x) ⇒ a_i^T d ≤ 0 for i ∈ I.
Optimality Conditions – p. 10
Vice versa, let x_k = x + α_k d. If a_i^T d ≤ 0 for i ∈ I ⇒
a_i^T x_k = a_i^T (x + α_k d)   i ∈ I
= b_i + α_k a_i^T d
≤ b_i
a_i^T x_k = a_i^T (x + α_k d)   i ∉ I
< b_i   if α_k is small enough
Thus
T(x) = {d : a_i^T d ≤ 0 ∀ i ∈ I}
Optimality Conditions – p. 11
Example
Let S = {(x, y) ∈ R² : x² − y = 0} (a parabola).
Tangent cone at (0, 0)? Let (x_k, y_k) → (0, 0), i.e. x_k → 0, y_k = x_k²:
‖(x_k, y_k) − (0, 0)‖ = √(x_k² + x_k⁴) = |x_k| √(1 + x_k²)
and
lim_{x_k→0+} x_k / (|x_k| √(1 + x_k²)) = 1,   lim_{x_k→0+} y_k / (|x_k| √(1 + x_k²)) = 0
lim_{x_k→0−} x_k / (|x_k| √(1 + x_k²)) = −1,  lim_{x_k→0−} y_k / (|x_k| √(1 + x_k²)) = 0
thus T(0, 0) is generated by the directions (−1, 0) and (1, 0)
Optimality Conditions – p. 12
Descent direction
d ∈ R^n is a feasible direction at x ∈ S if ∃ ᾱ > 0 :
x + αd ∈ S   ∀ α ∈ [0, ᾱ).
d feasible ⇒ d ∈ T(x), but in general the converse is false. If
f(x + αd) ≤ f(x)   ∀ α ∈ (0, ᾱ)
then d is a descent direction
Optimality Conditions – p. 13
I order necessary opt condition
Let x̄ ∈ S ⊆ R^n be a local optimum for min_{x∈S} f(x); let f ∈ C¹(U(x̄)). Then
d^T∇f(x̄) ≥ 0   ∀ d ∈ T(x̄)
Proof: d = lim_k (x_k − x̄)/‖x_k − x̄‖. Taylor expansion:
f(x_k) = f(x̄) + ∇^T f(x̄)(x_k − x̄) + o(‖x_k − x̄‖)
= f(x̄) + ∇^T f(x̄)(x_k − x̄) + ‖x_k − x̄‖ o(1).
x̄ local optimum ⇒ ∃ U(x̄) : f(x) ≥ f(x̄) ∀ x ∈ U(x̄) ∩ S.
Optimality Conditions – p. 14
. . .
If k is large enough, x_k ∈ U(x̄):
f(x_k) − f(x̄) ≥ 0
thus
∇^T f(x̄)(x_k − x̄) + ‖x_k − x̄‖ o(1) ≥ 0
Dividing by ‖x_k − x̄‖:
∇^T f(x̄)(x_k − x̄)/‖x_k − x̄‖ + o(1) ≥ 0
and in the limit ∇^T f(x̄) d ≥ 0.
Optimality Conditions – p. 15
Examples
Unconstrained problems
Every d ∈ R^n belongs to the tangent cone ⇒ at a local optimum
∇^T f(x̄) d ≥ 0   ∀ d ∈ R^n
Choosing d = e_i and d = −e_i we get
∇f(x̄) = 0
NB: the same is true if x̄ is a local minimum in the relative interior of the feasible region.
Optimality Conditions – p. 16
Linear equality constraints
min f(x)
Ax = b
Tangent cone: {d : Ad = 0}. Necessary conditions:
∇^T f(x̄) d ≥ 0   ∀ d : Ad = 0
equivalent statement:
min_d ∇^T f(x̄) d = 0
Ad = 0
(a linear program).
Optimality Conditions – p. 17
Linear equality constraints
From LP duality ⇒
max 0^T λ = 0
A^T λ = ∇f(x̄)
Thus at a local minimum point there exist Lagrange multipliers:
∃ λ : A^T λ = ∇f(x̄)
Optimality Conditions – p. 18
Linear inequalities
min f(x)
Ax ≤ b
Tangent cone at a local minimum x̄: {d ∈ R^n : a_i^T d ≤ 0 ∀ i ∈ I(x̄)}. Let A_I be the rows of A associated to the active constraints at x̄. Then
min_d ∇^T f(x̄) d = 0
A_I d ≤ 0
Optimality Conditions – p. 19
Linear inequalities
From LP duality:
max 0^T λ = 0
A_I^T λ = ∇f(x̄)
λ ≤ 0
Thus, at a local optimum, the gradient is a non positive linear combination of the coefficients of the active constraints.
Optimality Conditions – p. 20
Farkas’ Lemma
Let A be a matrix in R^{m×n} and b ∈ R^m. One and only one of the following systems:
A^T y ≤ 0,  b^T y > 0
and
Ax = b,  x ≥ 0
is non empty
Optimality Conditions – p. 21
Geometrical interpretation
[figure: Farkas’ lemma — either b lies in the cone {z : z = Ax, x ≥ 0} generated by the columns a1, a2 of A, or some y with A^T y ≤ 0, b^T y > 0 separates b from that cone]
Optimality Conditions – p. 22
Proof
1) If ∃ x ≥ 0 : Ax = b ⇒ b^T y = x^T A^T y. Thus if A^T y ≤ 0 ⇒ b^T y ≤ 0.
2) Premise: separating hyperplane theorem. Let C and D be two nonempty convex sets with C ∩ D = ∅. Then there exist a ≠ 0 and b:
a^T x ≤ b   x ∈ C
a^T x ≥ b   x ∈ D
If C is a point and D is a closed convex set, the separation is strict, i.e.
a^T C < b
a^T x > b   x ∈ D
Optimality Conditions – p. 23
Farkas’ Lemma (proof)
2) Let {x : Ax = b, x ≥ 0} = ∅. Let
S = {y ∈ R^m : ∃ x ≥ 0, Ax = y}
S is closed, convex and b ∉ S. From the separating hyperplane theorem: ∃ α ∈ R^m, α ≠ 0, and β ∈ R:
α^T y ≤ β   ∀ y ∈ S
α^T b > β
0 ∈ S ⇒ β ≥ 0 ⇒ α^T b > 0; α^T Ax ≤ β for all x ≥ 0. This is possible iff α^T A ≤ 0. Letting y = α we obtain a solution of
A^T y ≤ 0,  b^T y > 0
Optimality Conditions – p. 24
First order feasible variations cone
G(x) = {d ∈ R^n : ∇^T g_i(x) d ≤ 0,  i ∈ I}
[figure: the tangent cone and the first order feasible variations cone at a boundary point]
Optimality Conditions – p. 25
First order variations
G(x) ⊇ T(x). In fact, if x_k is feasible and
d = lim_k (x_k − x)/‖x_k − x‖
then g_i(x) ≤ 0 and
g(x + lim_k (x_k − x)) ≤ 0
Optimality Conditions – p. 26
. . .
g(x + lim_k ‖x_k − x‖ (x_k − x)/‖x_k − x‖) ≤ 0
g(x + lim_k ‖x_k − x‖ lim (x_k − x)/‖x_k − x‖) ≤ 0
g(x + lim_k ‖x_k − x‖ d) ≤ 0
Let α_k = ‖x_k − x‖; if α_k ≈ 0:
g(x + α_k d) ≤ 0
Optimality Conditions – p. 27
g_i(x + α_k d) = g_i(x) + α_k ∇^T g_i(x) d + o(α_k)
where α_k > 0 and d belongs to the tangent cone T(x). If the i–th constraint is active, then
g_i(x + α_k d) = α_k ∇^T g_i(x) d + o(α_k) ≤ 0
g_i(x + α_k d)/α_k = ∇^T g_i(x) d + o(α_k)/α_k ≤ 0
Letting α_k → 0 the result is obtained.
Optimality Conditions – p. 28
example
G(x) ≠ T(x) for the feasible set:
−x³ + y ≤ 0
−y ≤ 0
Optimality Conditions – p. 29
KKT necessary conditions
(Karush–Kuhn–Tucker) Let x̄ ∈ X ⊆ R^n, X ≠ ∅, be a local optimum for
min f(x)
g_i(x) ≤ 0   i = 1, . . . , m
x ∈ X
I: indices of the active constraints at x̄. If:
1. f(x), g_i(x) ∈ C¹(x̄) for i ∈ I
2. “constraint qualification” conditions hold at x̄: T(x̄) = G(x̄);
then there exist Lagrange multipliers λ_i ≥ 0, i ∈ I:
∇f(x̄) + Σ_{i∈I} λ_i ∇g_i(x̄) = 0.
Optimality Conditions – p. 30
Proof
x̄ local optimum ⇒ if d ∈ T(x̄) ⇒ d^T∇f(x̄) ≥ 0. But d ∈ T(x̄) ⇒
d^T∇g_i(x̄) ≤ 0   i ∈ I.
Thus it is impossible that
−∇^T f(x̄) d > 0
∇^T g_i(x̄) d ≤ 0   i ∈ I
From Farkas’ Lemma ⇒ there exists a solution of:
Σ_{i∈I} λ_i ∇g_i(x̄) = −∇f(x̄)
λ_i ≥ 0   i ∈ I
Optimality Conditions – p. 31
Constraint qualifications: examples
polyhedra: X = R^n and the g_i(x) are affine functions: Ax ≤ b
linear independence: X open set, g_i(x), i ∉ I, continuous at x̄, and ∇g_i(x̄), i ∈ I, linearly independent
Slater condition: X open set, g_i(x), i ∈ I, convex differentiable functions at x̄, g_i(x), i ∉ I, continuous at x̄, and ∃ x̂ ∈ X strictly feasible:
g_i(x̂) < 0   i ∈ I.
Optimality Conditions – p. 32
Convex problems
An optimization problem
minx∈S
f(x)
is a convex problem if
S is a convex set, i.e.
x, y ∈ S⇒λx + (1 − λ)y ∈ S
∀λ ∈ [0, 1]
f is a convex function on S, i.e.
f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)
∀λ ∈ [0, 1] and x, y ∈ S
Optimality Conditions – p. 33
Standard convex problem
min f(x)
g_i(x) ≤ 0   i = 1, . . . , m
h_j(x) = 0   j = 1, . . . , k
if
f is convex
gi are convex
hj are affine (i.e. of the form αT x + β)
then the problem is convex.
Optimality Conditions – p. 34
Convex problems
Every local optimum is a global one.
Proof: let x̄ be a local optimum for min_S f(x) and x⋆ a global optimum. S convex ⇒ λx⋆ + (1 − λ)x̄ ∈ S. Thus if λ ≈ 0 ⇒
f(x̄) ≤ f(λx⋆ + (1 − λ)x̄)
≤ λf(x⋆) + (1 − λ)f(x̄)
⇒ f(x̄) ≤ f(x⋆)
and x̄ is also a global optimum.
Optimality Conditions – p. 35
Sufficiency of 1st order conditions
For a convex differentiable problem: if d^T∇f(x̄) ≥ 0 ∀ d ∈ T(x̄), then x̄ is a (global) optimum.
Proof:
f(y) ≥ f(x̄) + (y − x̄)^T∇f(x̄)   ∀ y ∈ S
But y − x̄ ∈ T(x̄) ⇒
f(y) ≥ f(x̄) + d^T∇f(x̄) ≥ f(x̄)   ∀ y ∈ S
thus x̄ is a global minimum.
Optimality Conditions – p. 36
Convexity of the set of global optima
(for convex problems) The set of global minima of a convex problem is a convex set. In fact, let x and y be global minima for the convex problem
min_{x∈S} f(x)
Then, choosing λ ∈ [0, 1], we have λx + (1 − λ)y ∈ S, as S is convex. Moreover
f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) = λf⋆ + (1 − λ)f⋆ = f⋆
where f⋆ is the global minimum value. Thus equality holds and the proof is complete.
Optimality Conditions – p. 37
KKT for equality constraints
x̄: local optimum for
min f(x)
g_i(x) ≤ 0   i = 1, . . . , m
h_j(x) = 0   j = 1, . . . , k
x ∈ X ⊆ R^n
Let I be the set of active inequalities at x̄. If f(x), g_i(x), i ∈ I, h_j(x) ∈ C¹ and “constraint qualifications” hold at x̄, then ∃ λ_i ≥ 0 ∀ i ∈ I and µ_j ∈ R ∀ j = 1, . . . , k:
∇f(x̄) + Σ_{i∈I} λ_i ∇g_i(x̄) + Σ_{j=1..k} µ_j ∇h_j(x̄) = 0
Optimality Conditions – p. 38
Complementarity
KKT equivalent formulation:
∇f(x̄) + Σ_{i=1..m} λ_i ∇g_i(x̄) + Σ_{j=1..k} µ_j ∇h_j(x̄) = 0
λ_i g_i(x̄) = 0   i = 1, . . . , m
The condition λ_i g_i(x̄) = 0 is called the complementarity condition
Optimality Conditions – p. 39
II order necessary conditions
If f, g_i, h_j ∈ C² at x̄ and the gradients of the active constraints at x̄ are linearly independent, then there exist multipliers λ_i ≥ 0, i ∈ I, and µ_j, j = 1, . . . , k, such that
∇f(x̄) + Σ_{i∈I} λ_i ∇g_i(x̄) + Σ_{j=1..k} µ_j ∇h_j(x̄) = 0
and
d^T ∇²L(x̄) d ≥ 0
for every direction d: d^T∇g_i(x̄) ≤ 0, d^T∇h_j(x̄) = 0, where
∇²L(x̄) := ∇²f(x̄) + Σ_{i∈I} λ_i ∇²g_i(x̄) + Σ_{j=1..k} µ_j ∇²h_j(x̄)
Optimality Conditions – p. 40
Sufficient conditions
Let f, g_i, h_j be twice continuously differentiable. Let x⋆, λ⋆, µ⋆ satisfy:
∇f(x⋆) + Σ_{i∈I} λ_i⋆ ∇g_i(x⋆) + Σ_{j=1..k} µ_j⋆ ∇h_j(x⋆) = 0
λ_i⋆ g_i(x⋆) = 0
λ_i⋆ ≥ 0
d^T ∇²L(x⋆) d > 0   ∀ d : d^T∇h_j(x⋆) = 0, d^T∇g_i(x⋆) = 0, i ∈ I
then x⋆ is a local minimum.
Optimality Conditions – p. 41
Lagrange Duality
Problem:
f⋆ = min f(x)
g_i(x) ≤ 0
x ∈ X
Definition: Lagrange function:
L(x; λ) = f(x) + Σ_i λ_i g_i(x),   λ ≥ 0, x ∈ X
Optimality Conditions – p. 42
Relaxation
Given an optimization problem
minx∈S
f(x)
a relaxation is a problem
minx∈Q
g(x)
where
S ⊆ Q
g(x) ≤ f(x) ∀x ∈ S.
Weak Duality : The optimal value of a relaxation is a lowerbound on the optimum value of the problem.
Optimality Conditions – p. 43
Lagrange minimization is a relaxation
Proof:
the feasible set of the Lagrange problem is X (it contains the original one)
if g(x) ≤ 0 and λ ≥ 0 ⇒
L(x, λ) = f(x) + λ^T g(x) ≤ f(x)
Optimality Conditions – p. 44
Dual Lagrange function
with respect to the constraints g(x) ≤ 0:
θ(λ) = inf_{x∈X} L(x, λ) = inf_{x∈X} (f(x) + λ^T g(x))
For every choice of λ ≥ 0, θ(λ) is a lower bound on the value of every feasible solution and, in particular, a lower bound on the global minimum value of the problem.
Optimality Conditions – p. 45
Example (circle packing)
min −r
4r² − (x_i − x_j)² − (y_i − y_j)² ≤ 0   1 ≤ i < j ≤ N
x_i, y_i ≤ 1   i = 1, . . . , N
−x_i, −y_i ≤ 0   i = 1, . . . , N
Optimality Conditions – p. 46
When N = 2, relaxing the first constraint:
θ(λ) = min_{x,y,r} −r + λ(4r² − (x1 − x2)² − (y1 − y2)²)
x1, x2, y1, y2 ≥ 0
x1, x2, y1, y2 ≤ 1
Optimality Conditions – p. 47
solution
Minimizing with respect to x, y ⇒ |x1 − x2| = |y1 − y2| = 1, from which
θ(λ) = min_r −r + 4λr² − 2λ
r = 1/(8λ)
θ(λ) = −2λ − 1/(16λ)
This is a lower bound on the optimum value. Best possible lower bound:
θ⋆ = max_λ θ(λ)
λ⋆ = 1/(4√2),   θ⋆ = −√2/2
Optimality Conditions – p. 48
Choosing (x1, y1) = (0, 0) and (x2, y2) = (1, 1) a feasible solutionwith r =
√2/2 is obtained.
The Lagrange dual gives a lower bound equal to −√
2/2: sameas the objective function at a feasible solution ⇒optimalsolution!(an exception, not the rule!)
Optimality Conditions – p. 49
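The dual bound for the N = 2 case can be checked numerically; a small sketch maximizing θ over λ > 0 with SciPy:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Dual function for the N = 2 circle problem after eliminating x, y, r:
# theta(lambda) = -2*lambda - 1/(16*lambda), for lambda > 0.
theta = lambda lam: -2 * lam - 1 / (16 * lam)

# Maximize theta (i.e. minimize -theta) over a bracket for lambda > 0.
res = minimize_scalar(lambda lam: -theta(lam), bounds=(1e-6, 10.0),
                      method="bounded")
lam_star = res.x
print(abs(lam_star - 1 / (4 * np.sqrt(2))) < 1e-4)       # True
print(abs(theta(lam_star) - (-np.sqrt(2) / 2)) < 1e-6)   # True
# The feasible point (0,0), (1,1) with r = sqrt(2)/2 has objective
# -r = -sqrt(2)/2, matching the dual bound, hence it is optimal.
```

Matching primal and dual values certify optimality here, the exceptional case noted on the slide.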
Lagrange Dual
θ⋆ = max θ(λ)
λ ≥ 0
This problem might:
1. be unbounded
2. have a finite sup but no max
3. have a unique maximum attained in correspondence with asingle solution x
4. have many different maxima, each connected with adifferent solution x
Optimality Conditions – p. 50
Equality constraints
f ⋆ = min f(x)
gi(x) ≤ 0 i = 1, . . . ,m
hj(x) = 0 j = 1, . . . , k
x ∈ X
Lagrange function:
L(x; λ, µ) = f(x) + λT g(x) + µT h(x)
where λ ≥ 0, but µ is free.
Optimality Conditions – p. 51
Linear Programming
min cT x
Ax ≤ b
Dual Lagrange function:
θ(λ) = min_x c^T x + λ^T (Ax − b)
= −λ^T b + min_x (c^T + λ^T A) x.
but:
min_x (c^T + λ^T A) x = 0 if c^T + λ^T A = 0, −∞ otherwise.
Optimality Conditions – p. 52
. . .
Lagrange dual function:
θ(λ) = −λ^T b if c^T + λ^T A = 0, −∞ otherwise.
Lagrange dual:
max −λ^T b
λ^T A + c^T = 0
λ ≥ 0
which is equivalent to:
max λ^T b
λ^T A = c^T
λ ≤ 0
Optimality Conditions – p. 53
Quadratic Programming (QP)
min (1/2) x^T Q x + c^T x
Ax = b
(Q: symmetric). Lagrange dual function:
θ(λ) = min_x (1/2) x^T Q x + c^T x + λ^T (Ax − b)
= −λ^T b + min_x (1/2) x^T Q x + (c^T + λ^T A) x
Optimality Conditions – p. 54
QP – Case 1
Q has at least one negative eigenvalue ⇒
min_x (1/2) x^T Q x + (c^T + λ^T A) x = −∞
In fact ∃ d : d^T Q d < 0. Choosing x = αd with α > 0 ⇒
(1/2) x^T Q x + (c^T + λ^T A) x = (1/2) α² d^T Q d + α (c^T + λ^T A) d
and for large values of α this can be made as small as desired.
Optimality Conditions – p. 55
QP – Case 2
Q positive definite ⇒ minimum point of the inner problem of the dual Lagrange function:
Qx + (c + A^T λ) = 0
i.e.
x = −Q^{−1} (c + A^T λ)
Optimality Conditions – p. 56
. . .
Lagrange function value:
θ(λ) = −λ^T b + (1/2) x^T Q x + (c^T + λ^T A) x
= −λ^T b + (1/2)(c + A^T λ)^T Q^{−1} Q Q^{−1} (c + A^T λ) − (c^T + λ^T A) Q^{−1} (c + A^T λ)
= −λ^T b + (1/2)(c + A^T λ)^T Q^{−1} (c + A^T λ) − (c^T + λ^T A) Q^{−1} (c + A^T λ)
= −λ^T b − (1/2)(c + A^T λ)^T Q^{−1} (c + A^T λ)
Optimality Conditions – p. 57
. . .
Lagrange dual (seen as a min problem):
min_λ λ^T b + (1/2)(c + A^T λ)^T Q^{−1} (c + A^T λ)
Optimality conditions:
b + A Q^{−1} (c + A^T λ) = 0
But recalling that x = −Q^{−1} (c + A^T λ) ⇒
b − Ax = 0: feasibility of x
⇒ if we find the optimal multipliers λ (a linear system) ⇒ we get the optimal solution x (thanks to feasibility and weak duality)!
Optimality Conditions – p. 58
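The observation above — optimal multipliers come from a linear system — suggests solving the equality-constrained QP directly from its KKT system; a sketch assuming Q ≻ 0 (function name and data are illustrative):

```python
import numpy as np

def eq_qp(Q, c, A, b):
    """Solve min 0.5 x^T Q x + c^T x  s.t.  Ax = b  (Q positive definite)
    via the KKT linear system
        [Q  A^T] [x     ]   [-c]
        [A   0 ] [lambda] = [ b]
    combining stationarity Qx + c + A^T lambda = 0 with feasibility Ax = b."""
    n, m = Q.shape[0], A.shape[0]
    K = np.block([[Q, A.T], [A, np.zeros((m, m))]])
    sol = np.linalg.solve(K, np.concatenate([-c, b]))
    return sol[:n], sol[n:]

Q = np.array([[2.0, 0.0], [0.0, 2.0]])
c = np.zeros(2)
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
x, lam = eq_qp(Q, c, A, b)   # min ||x||^2 s.t. x1 + x2 = 1
print(x)                     # [0.5 0.5]
print(np.allclose(A @ x, b)) # True
```

For larger problems one would factor the KKT matrix or eliminate x as on the slides, but the one-shot solve shows the structure.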
Properties of the Lagrange dual
For any problem
f ⋆ = min f(x)
gi(x) ≤ 0 i = 1, . . . ,m
x ∈ X
where X is non empty and compact, if f and the g_i are continuous then the Lagrange dual function is concave
Optimality Conditions – p. 59
Proof
From the Weierstrass theorem
θ(λ) = min_{x∈X} f(x) + λ^T g(x)
exists and is finite.
θ(ηa + (1 − η)b) = min_{x∈X} (f(x) + (ηa + (1 − η)b)^T g(x))
= min_{x∈X} (η(f(x) + a^T g(x)) + (1 − η)(f(x) + b^T g(x)))
≥ η min_{x∈X} (f(x) + a^T g(x)) + (1 − η) min_{x∈X} (f(x) + b^T g(x))
= ηθ(a) + (1 − η)θ(b).
Optimality Conditions – p. 60
Solution of the Lagrange dual
max_λ θ(λ) = max_λ min_{x∈X} (f(x) + λ^T g(x))
is equivalent to
max z
z ≤ f(x) + λ^T g(x)   ∀ x ∈ X
λ ≥ 0
After having computed f and g at x1, x2, . . . , xk, a restricted dual can be defined:
max z
z ≤ f(x_j) + λ^T g(x_j)   ∀ j = 1, . . . , k
λ ≥ 0
Optimality Conditions – p. 61
. . .
Let λ̄ be the optimal solution of the restricted dual. Is it an optimal dual solution? Is it true that z̄ ≤ f(x) + λ̄^T g(x) for all x ∈ X? Check: we look for x̄, an optimal solution of
min_{x∈X} f(x) + λ̄^T g(x)
if f(x̄) + λ̄^T g(x̄) ≥ z̄ then we have found the optimal solution of the dual;
otherwise the pair (x̄, f(x̄)) is added to the restricted dual and a new solution is computed.
Optimality Conditions – p. 62
Geometric programming
Unconstrained geometric program:
min_{x>0} Σ_{k=1..m} c_k ∏_{j=1..n} x_j^{α_kj},   α_kj ∈ R, c_k > 0
(non convex). Variable substitution:
x_j = exp(y_j),   y_j ∈ R
Optimality Conditions – p. 63
Transformed problem:
min_y Σ_{k=1..m} ( c_k ∏_{j=1..n} e^{α_kj y_j} ) = min_y Σ_{k=1..m} e^{α_k^T y + β_k},   β_k = log c_k
still non convex, but its logarithm is convex.
Optimality Conditions – p. 64
Duality example
Dual of
min f(x) = min log Σ_{k=1..m} exp(α_k^T x + β_k)
No constraints ⇒ the dual Lagrange function is identical to f(x)! Strong duality holds, but is useless. Simple transformation:
min log Σ_{k=1..m} exp y_k
y_k = α_k^T x + β_k
Optimality Conditions – p. 65
solving the dual
Dual function
L(λ) = min_{x,y} log Σ_{k=1..m} exp y_k + λ^T (Ax + β − y)
Minimization in x is unconstrained: min λ^T Ax ⇒ if λ^T A ≠ 0, L(λ) is unbounded;
if λ^T A = 0 then
L(λ) = min_y log Σ_{k=1..m} exp y_k + λ^T (β − y)
Optimality Conditions – p. 66
First order (unconstrained) optimality conditions w.r.t. y_i:
exp y_i / Σ_k exp y_k − λ_i = 0
⇒ Lagrange multipliers exist provided that
Σ_i λ_i = 1,   λ_i > 0 ∀ i
Optimality Conditions – p. 67
Substituting λ_j = exp y_j / Σ_k exp y_k,
L(λ) = log Σ_j exp y_j − Σ_j λ_j y_j
= log Σ_j exp y_j − Σ_j y_j exp y_j / Σ_k exp y_k
= (1 / Σ_k exp y_k) Σ_k exp y_k (log Σ_j exp y_j − y_k)
= Σ_k ( exp y_k / Σ_j exp y_j ) (log Σ_j exp y_j − y_k)
= −Σ_k λ_k log λ_k
Optimality Conditions – p. 68
Lagrange Dual
The Lagrange Dual becomes:
max_λ β^T λ − Σ_k λ_k log λ_k
Σ_k λ_k = 1
A^T λ = 0
λ ≥ 0
Optimality Conditions – p. 69
Special cases: linear constraints
min f(x)
Ax ≥ b
Lagrange function:
L(x, λ) = f(x) + λT (b − Ax)
Constraint qualifications always hold (polyhedron). If x⋆ is a local optimum, there exists λ⋆ ≥ 0:
Ax⋆ ≥ b
∇f(x⋆) = AT λ⋆
λ⋆T (b − Ax⋆) = 0
Optimality Conditions – p. 70
Non negativity constraints
min f(x)
x ≥ 0
Lagrange function: L(x, λ) = f(x) − λT x. KKT conditions:
∇f(x⋆) = λ⋆
x⋆ ≥ 0
λ⋆ ≥ 0
(λ⋆)T x⋆ = 0
Optimality Conditions – p. 71
λ_j⋆ = ∂f(x⋆)/∂x_j   j = 1, . . . , n
from which
∂f(x⋆)/∂x_j = 0   ∀ j : x_j⋆ > 0
∂f(x⋆)/∂x_j ≥ 0   otherwise
Optimality Conditions – p. 72
Box constraints
min f(x)
ℓ ≤ x ≤ u,   ℓ_i < u_i ∀ i
Lagrange function: L(x, λ, µ) = f(x) + λ^T (ℓ − x) + µ^T (x − u). KKT conditions:
∇f(x⋆) = λ⋆ − µ⋆
(ℓ − x⋆)^T λ⋆ = 0
(x⋆ − u)^T µ⋆ = 0
(λ⋆, µ⋆) ≥ 0
Given x⋆ let
J_ℓ = {j : x_j⋆ = ℓ_j},  J_u = {j : x_j⋆ = u_j},  J_0 = {j : ℓ_j < x_j⋆ < u_j}
Optimality Conditions – p. 73
Box constr. (cont)
Then, from complementarity,
∂f(x⋆)/∂x_j = λ_j⋆    j ∈ J_ℓ
∂f(x⋆)/∂x_j = −µ_j⋆   j ∈ J_u
∂f(x⋆)/∂x_j = 0       j ∈ J_0
Optimality Conditions – p. 74
Thus
∂f(x⋆)/∂x_j ≥ 0   j ∈ J_ℓ
∂f(x⋆)/∂x_j ≤ 0   j ∈ J_u
∂f(x⋆)/∂x_j = 0   j ∈ J_0
with feasibility ℓ ≤ x⋆ ≤ u
Optimality Conditions – p. 75
Optimization over the simplex
min f(x)
1T x = 1
x ≥ 0
Lagrange function: L(x, λ, µ) = f(x) − λT x + µT (1T x − 1). KKT:
∇f(x⋆) = λ⋆ − µ⋆1
1T x⋆ = 1
(x⋆, λ⋆) ≥ 0
(λ⋆)T x⋆ = 0
Optimality Conditions – p. 76
simplex. . .
∂f(x⋆)/∂x_j − λ_j⋆ = −µ⋆
(all equal). Thus, from complementarity, if x_j⋆ > 0 then λ_j⋆ = 0 and ∂f(x⋆)/∂x_j = −µ⋆; otherwise ∂f(x⋆)/∂x_j ≥ −µ⋆. Thus, if j : x_j⋆ > 0,
∂f(x⋆)/∂x_j ≤ ∂f(x⋆)/∂x_k   ∀ k
Optimality Conditions – p. 77
Application: Min var portfolio
Given n assets with random returns R1, . . . , Rn, how to invest 1 € in such a way that the resulting portfolio has minimum variance? If x_j denotes the percentage of the investment in asset j, how to compute the variance of this portfolio P(x)?
Var = E(P(x) − E(P(x)))²
= E( Σ_{j=1..n} (R_j − E(R_j)) x_j )²
= Σ_{i,j} E[(R_i − E(R_i))(R_j − E(R_j))] x_i x_j
= x^T Q x
where Q is the variance-covariance matrix of the n assets.
Optimality Conditions – p. 78
Min var portfolio
Problem (objective multiplied by 1/2 for simpler computations):
min(1/2)xT Qx
1T x = 1
x ≥ 0
Optimality Conditions – p. 79
Optimal portfolio
KKT: for all i : x_i⋆ > 0:
Σ_j Q_ij x_j⋆ ≤ Σ_j Q_kj x_j⋆   ∀ k
The vector Qx may be thought of as the vector of marginal contributions to the total risk (which is a weighted sum of the elements of Qx). Thus in the optimal portfolio, all assets held at a positive level give an equal (and minimal) contribution to the total risk.
Optimality Conditions – p. 80
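The equal-marginal-risk property can be verified numerically; a sketch under stated assumptions (an illustrative covariance matrix and a general-purpose solver rather than a dedicated QP method):

```python
import numpy as np
from scipy.optimize import minimize

# Minimum-variance portfolio: min 0.5 x^T Q x  s.t.  1^T x = 1, x >= 0.
Q = np.array([[0.10, 0.02, 0.01],
              [0.02, 0.08, 0.03],
              [0.01, 0.03, 0.12]])   # illustrative covariance matrix

res = minimize(lambda x: 0.5 * x @ Q @ x,
               x0=np.ones(3) / 3,
               jac=lambda x: Q @ x,
               method="SLSQP",
               bounds=[(0, None)] * 3,
               constraints=[{"type": "eq", "fun": lambda x: x.sum() - 1}])
x = res.x
g = Q @ x            # marginal risk contributions
held = x > 1e-6      # assets held at a positive level
# KKT over the simplex: held assets share an equal, minimal marginal
# contribution to total risk.
print(np.allclose(g[held], g[held][0], atol=1e-4))
print(np.all(g[held].min() <= g + 1e-4))
```

With this Q all three assets end up held, so the marginal contributions Qx are equalized across the whole portfolio.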
Algorithms for unconstrained local optimization
Fabio Schoen
2008
http://gol.dsi.unifi.it/users/schoen
Algorithms for unconstrained local optimization – p. 1
Optimization Algorithms
Most common form for optimization algorithms:
Line search–based methods: given a starting point x0, a sequence is generated:
x_{k+1} = x_k + α_k d_k
where d_k ∈ R^n is the search direction and α_k > 0 the step.
Usually d_k is chosen first and then the step is obtained, often from a 1–dimensional optimization.
Algorithms for unconstrained local optimization – p. 2
Trust-region algorithms
A model m(x) and a confidence (trust) region U(x_k) containing x_k are defined. The new iterate is chosen as the solution of the constrained optimization problem
min_{x∈U(x_k)} m(x)
The model and the confidence region are possibly updated at each iteration.
Algorithms for unconstrained local optimization – p. 3
Speed measures
Let x⋆ be a local optimum. The error at x_k might be measured e.g. as
e(x_k) = ‖x_k − x⋆‖   or   e(x_k) = |f(x_k) − f(x⋆)|.
Given x_k → x⋆, if ∃ q > 0, β ∈ (0, 1) such that (for k large enough)
e(x_k) ≤ q β^k
⇒ x_k is linearly convergent, or converges with order 1; β is the convergence rate. A sufficient condition for linear convergence:
lim sup e(x_{k+1})/e(x_k) ≤ β
Algorithms for unconstrained local optimization – p. 4
super–linear convergence
If for every β ∈ (0, 1) there exists q such that

e(xk) ≤ qβ^k

then convergence is super-linear. Sufficient condition:

lim sup e(xk+1)/e(xk) = 0
Algorithms for unconstrained local optimization – p. 5
Higher order convergence
If, given p > 1, ∃ q > 0, β ∈ (0, 1) :

e(xk) ≤ qβ^(p^k)

then xk is said to converge with order at least p. If p = 2 ⇒ quadratic convergence. Sufficient condition:

lim sup e(xk+1)/e(xk)^p < ∞
Algorithms for unconstrained local optimization – p. 6
Examples
1/k converges to 0 with order 1 (the ratio e(xk+1)/e(xk) → 1: the rate is sublinear)
1/k² converges to 0 with order 1
2^−k converges to 0 with order 1 (linear convergence, with rate β = 1/2)
k^−k converges to 0 with order 1; convergence is super-linear
2^−2^k converges to 0 with order 2: quadratic convergence
Algorithms for unconstrained local optimization – p. 7
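These rates can be checked numerically by computing the ratios e(xk+1)/e(xk)^p for the example sequences; the index ranges below are illustrative cut-offs chosen to stay within floating-point range.

```python
def ratios(seq, p=1.0):
    # Successive ratios e(k+1) / e(k)^p used in the sufficient conditions
    return [seq[k + 1] / seq[k] ** p for k in range(len(seq) - 1)]

ks = range(1, 40)
harmonic = [1.0 / k for k in ks]           # ratio -> 1: sublinear rate
geometric = [2.0 ** (-k) for k in ks]      # ratio -> 1/2: linear convergence
superlin = [float(k) ** (-k) for k in ks]  # ratio -> 0: super-linear
quadratic = [2.0 ** (-(2 ** k)) for k in range(1, 9)]  # ratio with p=2 stays bounded
```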
Descent directions and the gradient
Let f ∈ C1(Rn), xk ∈ Rn : ∇f(xk) ≠ 0.
Let d ∈ Rn. If

dT∇f(xk) < 0

then d is a descent direction. Taylor expansion:

f(xk + αd) − f(xk) = αdT∇f(xk) + o(α)
(f(xk + αd) − f(xk))/α = dT∇f(xk) + o(1)

Thus if α is small enough f(xk + αd) − f(xk) < 0.
NB: d might be a descent direction even if dT∇f(xk) = 0
Algorithms for unconstrained local optimization – p. 8
Convergence of line search methods
If a sequence xk+1 = xk + αkdk is generated in such a way that:
L0 = {x : f(x) ≤ f(x0)} is compact
dk ≠ 0 whenever ∇f(xk) ≠ 0
f(xk+1) ≤ f(xk)
if ∇f(xk) ≠ 0 ∀ k then

lim_{k→∞} dkT∇f(xk)/‖dk‖ = 0
Algorithms for unconstrained local optimization – p. 9
if dk ≠ 0 then

|dkT∇f(xk)|/‖dk‖ ≥ σ(‖∇f(xk)‖)

where σ is such that limk→∞ σ(tk) = 0 ⇒ limk→∞ tk = 0
(σ is called a forcing function)
Algorithms for unconstrained local optimization – p. 10
Then either there exists a finite index k such that ∇f(xk) = 0, or otherwise
xk ∈ L0 and all of its limit points are in L0
f(xk) admits a limit
limk→∞ ∇f(xk) = 0
for every limit point x̄ of xk we have ∇f(x̄) = 0
Algorithms for unconstrained local optimization – p. 11
Comments on the assumptions
f(xk+1) ≤ f(xk): most optimization methods choose dk as a descent direction. If dk is a descent direction, choosing αk "sufficiently small" ensures the validity of the assumption.
lim_{k→∞} dkT∇f(xk)/‖dk‖ = 0: given a normalized direction dk, the scalar product dkT∇f(xk) is the directional derivative of f along dk: it is required that this goes to zero. This can be achieved through precise line searches (choosing the step so that f is minimized along dk).
|dkT∇f(xk)|/‖dk‖ ≥ σ(‖∇f(xk)‖): letting, e.g., σ(t) = ct, c > 0, if dk : dkT∇f(xk) < 0 then the condition becomes

dkT∇f(xk)/(‖dk‖ ‖∇f(xk)‖) ≤ −c
Algorithms for unconstrained local optimization – p. 12
Recalling that

cos θk = dkT∇f(xk)/(‖dk‖ ‖∇f(xk)‖)

then the condition becomes

cos θk ≤ −c

that is, the angle between dk and ∇f(xk) is bounded away from orthogonality.
Algorithms for unconstrained local optimization – p. 13
Gradient Algorithms
General scheme:
xk+1 = xk − αkDk∇f(xk)
with Dk ≻ 0 and αk > 0. If ∇f(xk) ≠ 0 then

dk = −Dk∇f(xk)

is a descent direction. In fact

dkT∇f(xk) = −∇Tf(xk)Dk∇f(xk) < 0
Algorithms for unconstrained local optimization – p. 14
Steepest Descent
or “gradient” method:
Dk := I
i.e. xk+1 = xk − αk∇f(xk). If ∇f(xk) ≠ 0 then dk = −∇f(xk) is a descent direction. Moreover, it is the steepest (w.r.t. the Euclidean norm): it solves

min_{d ∈ Rn, ‖d‖ ≤ 1} ∇Tf(xk) d
Algorithms for unconstrained local optimization – p. 15
(figure: level sets of f and the direction −∇f(xk) at xk)
Algorithms for unconstrained local optimization – p. 16
. . .
min_{d ∈ Rn, √(dTd) ≤ 1} ∇Tf(xk) d

KKT conditions: in the interior ⇒ ∇f(xk) = 0; if the constraint is active ⇒

∇f(xk) + λ d/‖d‖ = 0
√(dTd) = 1
λ ≥ 0

⇒ d = −∇f(xk)/‖∇f(xk)‖.
Algorithms for unconstrained local optimization – p. 17
Newton’s method
Dk := (∇2f(xk))−1
Motivation: Taylor expansion of f:

f(x) ≈ f(xk) + ∇Tf(xk)(x − xk) + (1/2)(x − xk)T∇2f(xk)(x − xk)
Minimizing the approximation:
∇f(xk) + ∇2f(xk)(x − xk) = 0
If the Hessian is non-singular ⇒

x = xk − (∇2f(xk))−1∇f(xk)
Algorithms for unconstrained local optimization – p. 18
Step choice
Given dk, how should αk be chosen in xk+1 = xk + αkdk?
"Optimal" choice (one-dimensional optimization):

αk = arg min_{α ≥ 0} f(xk + αdk).
An analytical expression for the optimal step is available only in a few cases, e.g. if f(x) = (1/2)xT Qx + cT x with Q ≻ 0. Then

f(xk + αdk) = (1/2)(xk + αdk)T Q(xk + αdk) + cT(xk + αdk)
            = (1/2)α² dkT Qdk + α(Qxk + c)T dk + β

where β does not depend on α.
Algorithms for unconstrained local optimization – p. 19
Minimizing w.r.t. α:

α dkT Qdk + (Qxk + c)T dk = 0 ⇒

α = −(Qxk + c)T dk / (dkT Qdk) = −dkT∇f(xk) / (dkT∇2f(xk)dk)

E.g., in steepest descent (dk = −∇f(xk)):

αk = ‖∇f(xk)‖² / (∇Tf(xk)∇2f(xk)∇f(xk))
Algorithms for unconstrained local optimization – p. 20
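The closed-form step can be verified on a small quadratic; Q and the starting point below are arbitrary illustrative choices.

```python
import numpy as np

Q = np.array([[2.0, 0.0], [0.0, 10.0]])  # Hessian of f(x) = (1/2) x^T Q x
x = np.array([9.0, 1.0])

g = Q @ x                       # gradient at x
alpha = (g @ g) / (g @ Q @ g)   # exact line-search step along -g
x_new = x - alpha * g

f = lambda z: 0.5 * z @ Q @ z
```

Perturbing α in either direction should increase f along the ray, since φ(α) = f(x − αg) is a strictly convex quadratic.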
Approximate step size
Rules for choosing a step size (from the sufficient conditions for convergence):

f(xk+1) < f(xk)
lim_{k→∞} dkT∇f(xk)/‖dk‖ = 0

Often it is also required that

‖xk+1 − xk‖ → 0
dkT∇f(xk + αkdk) → 0

In general it is important to ensure a sufficient reduction of f and a sufficiently large step xk+1 − xk.
Algorithms for unconstrained local optimization – p. 21
Avoid too large steps
Algorithms for unconstrained local optimization – p. 22
Avoid too small steps
Algorithms for unconstrained local optimization – p. 23
Armijo’s rule
Input: δ ∈ (0, 1), γ ∈ (0, 1/2), ∆k > 0
α := ∆k;
while f(xk + αdk) > f(xk) + γαdkT∇f(xk) do
    α := δα;
end
return α

Typical values: δ ∈ [0.1, 0.5], γ ∈ [10−4, 10−3]. On exit the returned step is such that

f(xk + αdk) ≤ f(xk) + γαdkT∇f(xk)
Algorithms for unconstrained local optimization – p. 24
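A minimal sketch of the backtracking rule above, assuming a generic differentiable f; the test function and parameter values are illustrative.

```python
import numpy as np

def armijo(f, grad_f, x, d, delta=0.5, gamma=1e-4, Delta=1.0):
    # Shrink alpha geometrically until the sufficient-decrease test holds
    slope = d @ grad_f(x)          # must be negative: d is a descent direction
    alpha = Delta
    while f(x + alpha * d) > f(x) + gamma * alpha * slope:
        alpha *= delta
    return alpha

f = lambda x: 0.5 * x @ x          # simple strictly convex test function
grad_f = lambda x: x
x0 = np.array([3.0, -4.0])
d0 = -grad_f(x0)                   # steepest descent direction
alpha0 = armijo(f, grad_f, x0, d0)
```

Termination is guaranteed when dT∇f(x) < 0, because sufficiently small steps always satisfy the test.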
(figure: the set of acceptable steps α, between the lines αdkT∇f(xk) and γαdkT∇f(xk))
Algorithms for unconstrained local optimization – p. 25
Line search in practice
How should the initial step size ∆k be chosen? Let φ(α) = f(xk + αdk). A possibility is to choose ∆k = α⋆, the minimizer of a quadratic approximation of φ(·). Example:
q(α) = c0 + c1α + (1/2)c2α²
q(0) = c0 := f(xk)
q′(0) = c1 := dTk ∇f(xk)
Then α⋆ = −c1/c2.
Algorithms for unconstrained local optimization – p. 26
Third condition? If an estimate f̄ of the minimum of f(xk + αdk) is available ⇒ choose c2 such that min q(α) = f̄:

min q(α) = q(−c1/c2) = c0 − c1²/(2c2) := f̄
c2 = c1²/(2(c0 − f̄))
α⋆ = −c1/c2 = 2(f̄ − c0)/c1
Algorithms for unconstrained local optimization – p. 27
Thus it is reasonable to start with

∆k = 2(f̄ − f(xk))/(dkT∇f(xk))

A reasonable estimate is f(xk) − f̄ ≈ f(xk−1) − f(xk), which gives ∆k = 2(f(xk) − f(xk−1))/(dkT∇f(xk)).
Algorithms for unconstrained local optimization – p. 28
Convergence of steepest descent
xk+1 = xk − αk∇f(xk)
If a sufficiently accurate step size is used ⇒ the conditions of the theorem on global convergence are satisfied ⇒ the steepest descent algorithm globally converges to a stationary point. "Sufficiently accurate" means exact line search or, e.g., Armijo's rule.
Algorithms for unconstrained local optimization – p. 29
Local analysis of steepest descent
Behaviour of the algorithm when minimizing
f(x) = (1/2) xT Qx

where Q ≻ 0; (local and global) optimum: x⋆ = 0. Steepest descent method:

xk+1 = xk − αk∇f(xk) = xk − αkQxk = (I − αkQ)xk

Error (in x) at step k + 1:

‖xk+1 − 0‖ = ‖(I − αkQ)xk‖ = √(xkT(I − αkQ)²xk)

Algorithms for unconstrained local optimization – p. 30
Analysis
Let A be symmetric with eigenvalues λ1 ≤ · · · ≤ λn. Then

λ1‖v‖² ≤ vT Av ≤ λn‖v‖² ∀ v ∈ Rn

⇒

xkT(I − αkQ)²xk ≤ λ⋆ xkT xk

where λ⋆ is the largest eigenvalue of (I − αkQ)².
Algorithms for unconstrained local optimization – p. 31
. . .
λ is an eigenvalue of A iff αλ is an eigenvalue of αA
λ is an eigenvalue of A iff 1 + λ is an eigenvalue of I + A
Thus the eigenvalues of (I − αkQ) are

1 − αkλi

where λi are the eigenvalues of Q. The maximum eigenvalue of (I − αkQ)² will be

max{(1 − αkλ1)², (1 − αkλn)²}

thus

‖xk+1‖ ≤ √max{(1 − αkλ1)², (1 − αkλn)²} ‖xk‖ = max{|1 − αkλ1|, |1 − αkλn|} ‖xk‖
Algorithms for unconstrained local optimization – p. 32
. . .
Eliminating the dependency on αk:

max{|1 − αλ1|, |1 − αλn|} = max{1 − αλ1, −1 + αλ1, 1 − αλn, −1 + αλn}

(figure: the functions |1 − αλ1| and |1 − αλn| plotted against α)
Algorithms for unconstrained local optimization – p. 33
. . .
Since α ≥ 0 and λ1 ≤ λn ⇒

1 − αλ1 ≥ 1 − αλn
−1 + αλ1 ≤ −1 + αλn

and thus

max{|1 − αλ1|, |1 − αλn|} = max{1 − αλ1, −1 + αλn}

Minimum point:

1 − αλ1 = −1 + αλn

i.e.

α⋆ = 2/(λ1 + λn)

Algorithms for unconstrained local optimization – p. 34
Analysis
In the best possible case

‖xk+1‖/‖xk‖ ≤ |1 − α⋆λ1| = |1 − 2λ1/(λ1 + λn)| = (λn − λ1)/(λn + λ1) = (ρ − 1)/(ρ + 1)

where ρ = λn/λ1 is the condition number of Q:
ρ ≫ 1 (ill-conditioned problem) ⇒ very slow convergence
ρ ≈ 1 ⇒ very fast convergence
Algorithms for unconstrained local optimization – p. 35
Zig–zagging
min (1/2)(x² + My²)

where M > 0. Optimum: x⋆ = y⋆ = 0. Starting point: (M, 1). Iterates:
(xk+1, yk+1) = (xk, yk) − α (xk, Myk)

With optimal step size ⇒

(xk, yk) = ( M((M−1)/(M+1))^k , (−(M−1)/(M+1))^k )
Algorithms for unconstrained local optimization – p. 36
Convergence is
rapid if M ≈ 1
very slow and "zig-zagging" if M ≫ 1 or M ≪ 1
Slow convergence and zig-zagging are general phenomena (especially when the starting point is near the longest axis of the ellipsoidal level sets)
Algorithms for unconstrained local optimization – p. 37
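The closed-form iterates above can be reproduced numerically: the sketch below runs exact-line-search steepest descent on f(x, y) = (1/2)(x² + My²) from (M, 1), with M = 10 as an illustrative value.

```python
import numpy as np

M = 10.0
Q = np.diag([1.0, M])              # f(z) = (1/2) z^T Q z = (1/2)(x^2 + M y^2)
z = np.array([M, 1.0])             # starting point (M, 1)

traj = [z.copy()]
for _ in range(5):
    g = Q @ z                      # gradient
    alpha = (g @ g) / (g @ Q @ g)  # exact line-search step
    z = z - alpha * g
    traj.append(z.copy())

r = (M - 1.0) / (M + 1.0)          # per-iteration contraction factor
```

Each iterate matches the closed form (M·r^k, (−r)^k), alternating sign in y: the zig-zag.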
Zig–zagging
(figure: zig-zagging iterates of steepest descent on an ill-conditioned quadratic)
Algorithms for unconstrained local optimization – p. 38
Analysis of Newton’s method
Newton-Raphson method: xk+1 = xk − (∇2f(xk))−1∇f(xk). Let x⋆ be a local optimum. Taylor expansion of ∇f:

0 = ∇f(x⋆) = ∇f(xk) + ∇2f(xk)(x⋆ − xk) + o(‖x⋆ − xk‖)

If ∇2f(xk) is non-singular and ‖(∇2f(xk))−1‖ is bounded ⇒

0 = (∇2f(xk))−1∇f(xk) + (x⋆ − xk) + (∇2f(xk))−1 o(‖x⋆ − xk‖)
  = x⋆ − xk+1 + o(‖x⋆ − xk‖)
Algorithms for unconstrained local optimization – p. 39
Thus

‖x⋆ − xk+1‖ = o(‖x⋆ − xk‖)

i.e. ‖x⋆ − xk+1‖/‖x⋆ − xk‖ → 0 ⇒ convergence is at least super-linear
Algorithms for unconstrained local optimization – p. 40
Local Convergence of Newton’s Method
Let f ∈ C2(U(x⋆, δ1)), where U is the ball with radius δ1 and center x⋆; let ∇2f(x⋆) be non-singular. Then:
1. ∃ δ > 0 : if x0 ∈ U(x⋆, δ) ⇒ xk is well defined and converges to x⋆ at least superlinearly.
2. If ∃ δ > 0, L > 0, M > 0 :

‖∇2f(x) − ∇2f(y)‖ ≤ L‖x − y‖

and

‖(∇2f(x))−1‖ ≤ M

then, if x0 ∈ U(x⋆, δ), Newton's method converges with order at least 2 and

‖xk+1 − x⋆‖ ≤ (LM/2)‖xk − x⋆‖²
Algorithms for unconstrained local optimization – p. 41
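Quadratic convergence can be observed on a one-dimensional sketch. The function f(x) = x²/2 + x³/3 (minimizer x⋆ = 0, f″(0) = 1) is an illustrative choice with a non-degenerate, locally Lipschitz Hessian; the pure Newton step reduces to x − f′(x)/f″(x).

```python
def grad(x):
    return x + x * x          # f'(x) for f(x) = x^2/2 + x^3/3

def hess(x):
    return 1.0 + 2.0 * x      # f''(x)

x = 0.5                       # close enough to the minimizer x* = 0
errors = [abs(x)]
for _ in range(5):
    x = x - grad(x) / hess(x) # pure Newton step: x_{k+1} = x^2/(1+2x) here
    errors.append(abs(x))
```

The error roughly squares at every step, in agreement with the bound ‖xk+1 − x⋆‖ ≤ (LM/2)‖xk − x⋆‖².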
Difficulties
Many things might go wrong:
at some iteration, ∇2f(xk) might be singular, e.g. if xk belongs to a flat region where f(x) is constant
even if it is non-singular, inverting ∇2f(xk) or, in any case, solving a linear system with coefficient matrix ∇2f(xk) may be numerically unstable and computationally demanding
there is no guarantee that ∇2f(xk) ≻ 0 ⇒ the Newton direction might not be a descent direction
Algorithms for unconstrained local optimization – p. 42
Difficulties
Newton’s method just tries to solve the system
∇f(xk) = 0
and thus might very well be attracted towards a maximum
the method lacks global convergence: it converges only ifstarted “near” a local optimum
Algorithms for unconstrained local optimization – p. 43
Newton–type methods
line search variant: xk+1 = xk − αk(∇2f(xk))−1∇f(xk)
modified Newton method: replace ∇2f(xk) by (∇2f(xk) + Dk), where Dk is chosen so that ∇2f(xk) + Dk is positive definite
Algorithms for unconstrained local optimization – p. 44
Quasi-Newton methods
Consider solving the nonlinear system ∇f(x) = 0. Taylor expansion of the gradient:

∇f(xk) ≈ ∇f(xk+1) + ∇2f(xk+1)(xk − xk+1)

Let Bk+1 be an approximation of the Hessian at xk+1. Quasi-Newton equation:

Bk+1(xk+1 − xk) = ∇f(xk+1) − ∇f(xk)
Algorithms for unconstrained local optimization – p. 45
Quasi–Newton equation
Let:
sk := xk+1 − xk yk := ∇f(xk+1) −∇f(xk)
Quasi-Newton equation: Bk+1sk = yk. If Bk was the previous approximate Hessian, we ask that
1. the variation between Bk and Bk+1 is "small"
2. nothing changes along directions which are normal to the step sk:

Bkz = Bk+1z ∀ z : zT sk = 0

Choosing n − 1 vectors z orthogonal to sk ⇒ n² linearly independent equations in n² unknowns ⇒ ∃ a unique solution.
Algorithms for unconstrained local optimization – p. 46
Broyden updating
It can be shown that the unique solution is given by:

Bk+1 = Bk + (yk − Bksk)skT / (skT sk)

Theorem: let Bk ∈ Rn×n and sk ≠ 0. The unique solution of

min_B ‖Bk − B‖F   s.t.  Bsk = yk

is Broyden's update Bk+1; here ‖X‖F = √Tr(XTX) denotes the Frobenius norm.
Algorithms for unconstrained local optimization – p. 47
proof
For any feasible B (i.e. Bsk = yk):

‖Bk+1 − Bk‖F = ‖(yk − Bksk)skT / (skT sk)‖F
             = ‖(Bsk − Bksk)skT / (skT sk)‖F
             = ‖(B − Bk)skskT / (skT sk)‖F
             ≤ ‖B − Bk‖F ‖skskT‖F / (skT sk)
             = ‖B − Bk‖F √Tr(skskT skskT) / (skT sk)
             = ‖B − Bk‖F (skT sk)/(skT sk)
             = ‖B − Bk‖F

Uniqueness is a consequence of the strict convexity of the norm and the convexity of the feasible region.
Algorithms for unconstrained local optimization – p. 48
Quasi-Newton and optimization
Special situation:
1. the hessian matrix in optimization problems is symmetric;
2. in gradient methods, when we let xk+1 = xk − (Bk+1)−1∇f(xk), it is desirable that Bk+1 be positive definite.
Broyden's update:

Bk+1 = Bk + (yk − Bksk)skT / (skT sk)

is generally not symmetric even if Bk is.
Algorithms for unconstrained local optimization – p. 49
Symmetry
Remedy: let C1 = Bk + (yk − Bksk)skT / (skT sk); symmetrization:

C2 = (1/2)(C1 + C1T)

However, C2 does not satisfy the Quasi-Newton equation. Broyden update of C2:

C3 = C2 + (yk − C2sk)skT / (skT sk)
which is not symmetric, . . .
Algorithms for unconstrained local optimization – p. 50
PSB update
In the limit

Bk+1 = Bk + [(yk − Bksk)skT + sk(yk − Bksk)T] / (skT sk) − [skT(yk − Bksk)] skskT / (skT sk)²

(PSB – Powell-Symmetric-Broyden update).
Imposing also hereditary positive definiteness, DFP (Davidon-Fletcher-Powell) is obtained:

Bk+1 = Bk + [(yk − Bksk)ykT + yk(yk − Bksk)T] / (ykT sk) − [skT(yk − Bksk)] ykykT / (ykT sk)²
     = (I − ykskT/(ykT sk)) Bk (I − skykT/(ykT sk)) + ykykT/(ykT sk)
Algorithms for unconstrained local optimization – p. 51
BFGS
Same ideas, but applied to the approximate inverse Hessian. The inverse Quasi-Newton equation

sk = Hk+1 yk

leads to the most common Quasi-Newton update, BFGS (Broyden-Fletcher-Goldfarb-Shanno):

Hk+1 = (I − skykT/(ykT sk)) Hk (I − ykskT/(ykT sk)) + skskT/(ykT sk)
Algorithms for unconstrained local optimization – p. 52
BFGS method
xk+1 = xk − αkHk∇f(xk)
Hk+1 = (I − skykT/(ykT sk)) Hk (I − ykskT/(ykT sk)) + skskT/(ykT sk)
yk = ∇f(xk+1) − ∇f(xk)
sk = xk+1 − xk
Algorithms for unconstrained local optimization – p. 53
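A minimal sketch of the BFGS iteration above, combined with Armijo backtracking; the convex quadratic test problem and the curvature safeguard threshold are illustrative choices.

```python
import numpy as np

def bfgs(f, grad, x0, iters=50):
    n = len(x0)
    H = np.eye(n)                       # initial inverse-Hessian approximation
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    for _ in range(iters):
        d = -H @ g
        alpha = 1.0
        while f(x + alpha * d) > f(x) + 1e-4 * alpha * (d @ g):
            alpha *= 0.5                # Armijo backtracking from the unit step
        x_new = x + alpha * d
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        sy = s @ y
        if sy > 1e-12:                  # curvature condition keeps H positive definite
            I = np.eye(n)
            V = I - np.outer(s, y) / sy
            H = V @ H @ V.T + np.outer(s, s) / sy
        x, g = x_new, g_new
    return x

# Convex quadratic test problem with known minimizer A^{-1} b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
x_star = np.linalg.solve(A, b)
x_hat = bfgs(f, grad, np.zeros(2))
```

Skipping the update when skTyk is not sufficiently positive is a common safeguard; on strongly convex problems the condition always eventually holds.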
Trust Region methods
Possible defect of the standard Newton method: the approximation becomes less and less precise as we move away from the current point. Long step ⇒ bad approximation.
Idea: constrained minimization of the quadratic approximation:

xk+1 = arg min_{‖x − xk‖ ≤ ∆k} mk(x)   where
mk(x) = f(xk) + ∇Tf(xk)(x − xk) + (1/2)(x − xk)T∇2f(xk)(x − xk)

∆k > 0 is a parameter.
First advantage (over pure Newton): the step is always well defined (thanks to Weierstrass's theorem).
Algorithms for unconstrained local optimization – p. 54
Outline of Trust Region
Let mk(·) be a local model function. E.g. in Newton Trust Region methods

mk(s) = f(xk) + sT∇f(xk) + (1/2) sT∇2f(xk) s

or in a Quasi-Newton Trust Region method

mk(s) = f(xk) + sT∇f(xk) + (1/2) sT Bk s
Algorithms for unconstrained local optimization – p. 55
How should the trust region radius ∆k be chosen and updated? Given a step sk, let

ρk = (f(xk) − f(xk + sk)) / (mk(0) − mk(sk))

the ratio between the actual reduction and the predicted reduction.
Algorithms for unconstrained local optimization – p. 56
Model updating
ρk = (f(xk) − f(xk + sk)) / (mk(0) − mk(sk))

The predicted reduction is always non-negative;
if ρk is small (certainly if it is negative) the model and the function strongly disagree ⇒ the step must be rejected and the trust region reduced
if ρk ≥ 1 it is safe to expand the trust region
intermediate ρk values lead us to keep the region unchanged
Algorithms for unconstrained local optimization – p. 57
Algorithm
Data: ∆̄ > 0, ∆0 ∈ (0, ∆̄), η ∈ [0, 1/4]
for k = 0, 1, . . . do
    find the step sk minimizing the model in the trust region and compute ρk;
    if ρk < 1/4 then
        ∆k+1 = ∆k/4;
    else if ρk > 3/4 and ‖sk‖ = ∆k then
        ∆k+1 = min{2∆k, ∆̄};
    else
        ∆k+1 = ∆k;
    end
    if ρk > η then xk+1 = xk + sk else xk+1 = xk;
end
Algorithms for unconstrained local optimization – p. 58
Solving the model
How to find

min_{‖s‖ ≤ ∆} ∇f(xk)T s + (1/2) sT Bk s

If Bk ≻ 0, KKT conditions are necessary and sufficient; rewriting the constraint as sTs ≤ ∆² ⇒

∇f(xk) + Bks + 2λs = 0
λ(∆² − ‖s‖²) = 0
Algorithms for unconstrained local optimization – p. 59
Thus either s is in the interior of the ball with radius ∆, in which case λ = 0 and we have the (quasi-)Newton step

s = −Bk−1∇f(xk)

or ‖s‖ = ∆ and, if λ > 0, then 2λs = −∇f(xk) − Bks = −∇mk(s) ⇒ s is parallel to the negative gradient of the model and normal to its contour lines.
Algorithms for unconstrained local optimization – p. 60
The Cauchy Point
A strategy to approximately solve the trust region sub-problem: find the "Cauchy point", the minimizer of mk along the direction −∇f(xk) within the trust region. First find the direction:

psk = arg min_{p : ‖p‖ ≤ ∆k} fk + ∇f(xk)T p

Then along this direction find a minimizer:

τk = arg min_{τ ≥ 0} mk(τ psk)   s.t. ‖τ psk‖ ≤ ∆k

The Cauchy point is xk + τk psk.
Algorithms for unconstrained local optimization – p. 61
Finding the Cauchy point
Finding psk is easy; the analytic solution is:

psk = −(∆k/‖∇f(xk)‖) ∇f(xk)

For the step size τk:
if ∇f(xk)T Bk∇f(xk) ≤ 0 ⇒ negative curvature direction ⇒ largest possible step ⇒ τk = 1
otherwise the model along the line is strictly convex, so

τk = min{1, ‖∇f(xk)‖³/(∆k ∇f(xk)T Bk∇f(xk))}

Choosing the Cauchy point ⇒ global but extremely slow convergence (similar to steepest descent). Usually an improved point is searched for, starting from the Cauchy point.
Algorithms for unconstrained local optimization – p. 62
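The Cauchy-point computation can be transcribed directly from the formulas above; the model data g, B and the radius ∆ below are illustrative.

```python
import numpy as np

def cauchy_point(g, B, Delta):
    # Minimizer of m(s) = g^T s + (1/2) s^T B s along -g, within ||s|| <= Delta
    ps = -Delta * g / np.linalg.norm(g)   # boundary step along -g
    curv = g @ B @ g
    if curv <= 0:
        tau = 1.0                          # negative curvature: go to the boundary
    else:
        tau = min(1.0, np.linalg.norm(g) ** 3 / (Delta * curv))
    return tau * ps

g = np.array([1.0, 2.0])
B = np.array([[2.0, 0.0], [0.0, 1.0]])     # positive definite model Hessian
Delta = 0.1
s = cauchy_point(g, B, Delta)
```

The returned step stays inside the trust region and always produces a negative model value, which is what the global convergence theory requires of the Cauchy point.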
Derivative Free Optimization
Algorithms for unconstrained local optimization – p. 63
Pattern Search
For smooth optimization, but without knowledge of derivatives. Elementary idea: if x ∈ R² is not a local minimum of f, then at least one of the directions e1, e2, −e1, −e2 (moving towards E, N, W, S) forms an acute angle with −∇f(x) ⇒ it is a descent direction.
Direct search: explore all the directions in search of one which gives a descent.
Algorithms for unconstrained local optimization – p. 64
Coordinate search
Let D⊕ = {±e1, . . . , ±en} be the set of coordinate directions and their opposites.

Data: k = 0, ∆0 an initial step length, x0 a starting point
while ∆k is large enough do
    if f(xk + ∆kd) < f(xk) for some d ∈ D⊕ then
        xk+1 = xk + ∆kd (step accepted);
    else
        ∆k+1 = 0.5∆k;
    end
    k = k + 1;
end
Algorithms for unconstrained local optimization – p. 65
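A minimal sketch of coordinate search as described above, on an illustrative separable quadratic; the tolerance and iteration cap are arbitrary safeguards.

```python
import numpy as np

def coordinate_search(f, x0, step=1.0, tol=1e-6, max_iter=10000):
    # Try the 2n coordinate directions; halve the step when none improves
    x = np.asarray(x0, dtype=float)
    n = len(x)
    k = 0
    while step > tol and k < max_iter:
        improved = False
        for i in range(n):
            for sign in (1.0, -1.0):
                trial = x.copy()
                trial[i] += sign * step
                if f(trial) < f(x):
                    x, improved = trial, True
                    break
            if improved:
                break
        if not improved:
            step *= 0.5
        k += 1
    return x

f = lambda x: (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2
x_hat = coordinate_search(f, np.zeros(2))
```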
Pattern search
It is not necessary to explore 2n directions. It is sufficient that the set of directions forms a positive span, i.e. every v ∈ Rn should be expressible as a non-negative linear combination of the vectors in the set. Formally, G is a generating set iff

∀ v ≠ 0 ∈ Rn ∃ g ∈ G : vT g > 0

A "good" generating set should be characterized by a sufficiently high cosine measure:

κ(G) := min_{v ≠ 0} max_{d ∈ G} vT d / (‖v‖‖d‖)
Algorithms for unconstrained local optimization – p. 66
Examples
(figure: three generating sets in R²)

In the first case κ ≈ 0.19612, in the second κ = 0.5, in the third κ = √0.5 ≈ 0.7071.
Algorithms for unconstrained local optimization – p. 67
Step Choice
xk+1 = xk + ∆kdk   if f(xk + ∆kdk) < f(xk) − ρ(∆k)   (success)
xk+1 = xk   otherwise   (failure)

where ρ(t) = o(t). We let

∆k+1 = φk∆k

where φk ≥ 1 for successful iterations, φk < 1 otherwise.
Direct-search methods possess good convergence properties.
Algorithms for unconstrained local optimization – p. 68
(figures: successive iterations of a pattern search method)
Nelder-Mead Simplex
Given a simplex S = {v1, . . . , vn+1} in Rn, let vr be the worst point: r = arg max_i f(vi). Let C be the centroid of S \ {vr}:

C = (∑_{i ≠ r} vi) / n

The algorithm performs a sort of line search along the direction C − vr. Let

R = C + (C − vr)

be the reflection of the worst point along this direction. Let f̄ be the best function value in the current simplex. Three cases might occur:
Algorithms for unconstrained local optimization – p. 73
1: Reflection
Check f(R): if it is intermediate, i.e. better than the worst andworse than the best, then accept the reflection, i.e. discard theworst point in the simplex and replace it with R.
Algorithms for unconstrained local optimization – p. 74
Reflection step
(figure: reflection of the worst point through the centroid)
Algorithms for unconstrained local optimization – p. 75
2: improvement
If the trial step is an improvement, i.e.

f(R) < f̄

then attempt an expansion: try to move further, to E = R + (R − C).
If successful (f(E) < f(R)), accept the expansion point E and discard the worst point.
If unsuccessful, accept R as the new point and discard the worst one.
Algorithms for unconstrained local optimization – p. 76
Expansion
(figure: expansion step beyond the reflected point)
Algorithms for unconstrained local optimization – p. 77
3: contraction
If however the reflected point R is worse than all points in the simplex (possibly except the worst vr), then a contraction step is performed:
if f(R) ≥ f(vr) (R is worse than all points in the simplex), add 0.5(vr + C) to the simplex and discard vr
otherwise, if R is better than vr, add 0.5(R + C) to the simplex and discard vr
Algorithms for unconstrained local optimization – p. 78
Contraction
(figure: contraction step towards the centroid)
Algorithms for unconstrained local optimization – p. 79
Nelder-Mead is not a direct search method (only a single direction at a time is explored). It is widely used by practitioners; however, it may fail to converge to a local minimum. There are examples of strictly convex functions in R² on which the method converges to a non-stationary point. The bad convergence properties are connected to the event that the n-dimensional simplex degenerates into a lower-dimensional subspace. Moreover, the method has a strong tendency to generate directions which are almost normal to that of the gradient! Convergent variants of the Nelder-Mead method do exist.
Algorithms for unconstrained local optimization – p. 80
Implicit filtering
Let
f(x) = h(x) + w(x)

where h(x) is a smooth function, while w(x) can be considered as additive, typically random, noise. The method computes a rough estimate of the gradient (finite differences with a "large" step) and proceeds with an Armijo line search. If unsuccessful, the step used for the finite differences is reduced.
Algorithms for unconstrained local optimization – p. 81
Implicit filtering
Data: εk ↓ 0, parameters δ, γ, ∆ of Armijo's rule
repeat
    OuterIteration = false;
    repeat
        compute f(xk) and a finite-difference estimate of ∇f(xk):
        ∇εk f(xk) = [(f(xk + εkei) − f(xk − εkei))/(2εk)]i
        if ‖∇εk f(xk)‖ ≤ εk then
            OuterIteration = true
        else
            Armijo: if successful accept the Armijo step; otherwise let OuterIteration = true
        end
    until OuterIteration;
    k = k + 1;
until convergence criterion;
Algorithms for unconstrained local optimization – p. 82
Convergence properties
If
∇2h(x) is Lipschitz continuous
the sequence xk generated by the method is infinite

lim_{k→∞} (εk² + η(xk; εk)/εk) = 0

where

η(x; ε) = sup_{z : ‖z − x‖∞ ≤ ε} |w(z)|

unsuccessful Armijo steps occur at most a finite number of times
then all limit points of xk are stationary.
Algorithms for unconstrained local optimization – p. 83
Algorithms for constrained localoptimization
Fabio Schoen
2008
http://gol.dsi.unifi.it/users/schoen
Algorithms for constrained local optimization – p. 1
Feasible direction methods
Algorithms for constrained local optimization – p. 2
Frank–Wolfe method
Let X be a convex set. Consider the problem:

min_{x ∈ X} f(x)

Let xk ∈ X ⇒ choosing a feasible direction dk corresponds to choosing a point x ∈ X : dk = x − xk.
"Steepest descent" choice:

min_{x ∈ X} ∇Tf(xk)(x − xk)

(a linear objective with convex constraints, usually easy to solve). Let x̄k be an optimal solution of this problem.
Algorithms for constrained local optimization – p. 3
Frank–Wolfe
If ∇Tf(xk)(x̄k − xk) = 0 then

∇Tf(xk)d ≥ 0

for every feasible direction d ⇒ the first order necessary conditions hold.
Otherwise, letting dk = x̄k − xk, this is a descent direction along which a step αk ∈ (0, 1] might be chosen according to Armijo's rule.
Algorithms for constrained local optimization – p. 4
Convergence of Frank-Wolfe method
Under mild conditions the method converges to a point satisfying first order necessary conditions. However, it is usually extremely slow (convergence may be sub-linear). It might find application in very large scale problems in which solving the sub-problem for direction determination is very easy (e.g. when X is a polytope).
Algorithms for constrained local optimization – p. 5
Gradient Projection methods
Generic iteration:

xk+1 = xk + αk(x̄k − xk)

where the direction dk = x̄k − xk is obtained by finding

x̄k = [xk − sk∇f(xk)]+

where sk ∈ R+ and [·]+ denotes projection onto the feasible set.
Algorithms for constrained local optimization – p. 6
The method is slightly faster than Frank-Wolfe, with a linear convergence rate similar to that of (unconstrained) steepest descent. It can be applied when projection is relatively cheap, e.g. when the feasible set is a box.
A point xk satisfies the first order necessary conditions (dT∇f(xk) ≥ 0 for every feasible direction d) iff

xk = [xk − sk∇f(xk)]+
Algorithms for constrained local optimization – p. 7
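For a box the projection [·]+ is a componentwise clip, so gradient projection takes only a few lines. A sketch with an illustrative objective whose unconstrained minimizer lies outside the box, so the solution sits at a corner:

```python
import numpy as np

def gradient_projection(grad, lo, hi, x0, s=0.1, iters=500):
    # Fixed step-size gradient projection on the box {lo <= x <= hi}
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = np.clip(x - s * grad(x), lo, hi)   # [.]^+ is a clip for box constraints
    return x

# min (x-2)^2 + (y-3)^2 over the box [0,1]^2: solution is the corner (1, 1)
grad = lambda x: 2.0 * (x - np.array([2.0, 3.0]))
x_hat = gradient_projection(grad, 0.0, 1.0, np.array([0.5, 0.5]))
```

At the solution the fixed-point characterization x = [x − s∇f(x)]+ from the slide holds.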
Lagrange Multiplier Algorithms
Algorithms for constrained local optimization – p. 8
Barrier Methods
min f(x)
gj(x) ≤ 0 j = 1, . . . , r
A barrier is a continuous function which tends to +∞ whenever x approaches the boundary of the feasible region. Examples of barrier functions:

B(x) = −∑j log(−gj(x))   (logarithmic barrier)
B(x) = −∑j 1/gj(x)   (inverse barrier)
Algorithms for constrained local optimization – p. 9
Barrier Method
Let εk ↓ 0 and x0 be strictly feasible, i.e. gj(x0) < 0 ∀ j. Then let

xk = arg min_{x ∈ Rn} (f(x) + εkB(x))

Proposition: every limit point of xk is a global minimum of the constrained optimization problem.
Algorithms for constrained local optimization – p. 10
Analysis of Barrier methods
Special case: a single constraint (may be generalized). Let x̄ be a limit point of xk (a global minimum). If the KKT conditions hold, then there exists a unique λ̄ ≥ 0:

∇f(x̄) + λ̄∇g(x̄) = 0

(with λ̄g(x̄) = 0). xk, the solution of the barrier problem

min f(x) + εkB(x)
g(x) < 0

satisfies

∇f(xk) + εk∇B(xk) = 0
Algorithms for constrained local optimization – p. 11
. . .
If B(x) = φ(g(x)) ⇒

∇f(xk) + εkφ′(g(xk))∇g(xk) = 0

In the limit, for k → ∞:

lim εkφ′(g(xk))∇g(xk) = λ̄∇g(x̄)

if limk g(xk) < 0 ⇒ φ′(g(xk))∇g(xk) → K (finite) and Kεk → 0
if limk g(xk) = 0 ⇒ (thanks to the uniqueness of the Lagrange multiplier)

λ̄ = limk εkφ′(g(xk))
Algorithms for constrained local optimization – p. 12
Difficulties in Barrier Methods
strong numerical instability: the condition number of the Hessian matrix grows as εk → 0
need for an initial strictly feasible point x0
(Partial) remedy: εk is decreased very slowly, and the solution of the (k+1)-th problem is obtained by starting an unconstrained optimization from xk.
Algorithms for constrained local optimization – p. 13
Example
min(x − 1)2 + (y − 1)2
x + y ≤ 1
Logarithmic Barrier problem:
min (x − 1)² + (y − 1)² − εk log(1 − x − y)
x + y − 1 < 0

Gradient:

2(x − 1) + εk/(1 − x − y)
2(y − 1) + εk/(1 − x − y)

Stationary points: x = y = 3/4 ± √(1 + 4εk)/4 (only the "−" solution is acceptable).
Algorithms for constrained local optimization – p. 14
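The example can be checked numerically: the sketch below evaluates the closed-form stationary point x = y = (3 − √(1 + 4εk))/4 of the barrier problem for decreasing εk, and verifies that it stays strictly feasible and approaches the constrained optimum (1/2, 1/2). The list of εk values is an illustrative choice.

```python
import math

def barrier_minimizer(eps):
    # Stationary point of (x-1)^2 + (y-1)^2 - eps*log(1 - x - y), on the line x = y
    return (3.0 - math.sqrt(1.0 + 4.0 * eps)) / 4.0

def barrier_grad(x, eps):
    # d/dx of the barrier objective evaluated at the symmetric point (x, x)
    return 2.0 * (x - 1.0) + eps / (1.0 - 2.0 * x)

path = [barrier_minimizer(eps) for eps in (1.0, 0.1, 1e-3, 1e-10)]
```

The points trace the central path of this problem, moving monotonically towards the boundary as εk ↓ 0.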
Barrier methods and L.P.
min cT x
Ax = b
x ≥ 0
Logarithmic Barrier on x ≥ 0:
min cT x − ε ∑j log xj
Ax = b
x > 0
Algorithms for constrained local optimization – p. 15
The central path
The starting point is usually associated with ε = ∞ and is the unique solution of

min −∑j log xj
Ax = b
x > 0
The trajectory x(ε) of solutions to the barrier problem is calledthe central path and leads to an optimal solution of the LP.
Algorithms for constrained local optimization – p. 16
Penalty Methods
Penalized problem:
min f(x) + ρP(x)

where ρ > 0 and P(x) ≥ 0, with P(x) = 0 if x is feasible. Example: for

min f(x)
hi(x) = 0 i = 1, . . . , m

a penalized problem might be:

min f(x) + ρ ∑i hi(x)²
Algorithms for constrained local optimization – p. 17
Convergence of the quadratic penalty method
(for equality constrained problems): let

P(x; ρ) = f(x) + ρ ∑i hi(x)²

Given ρ0 > 0, x0 ∈ Rn, k = 0, let

xk+1 = arg min_x P(x; ρk)

(found with an iterative method initialized at xk); then let ρk+1 > ρk, k := k + 1.
If xk+1 is a global minimizer of P and ρk → ∞, then every limit point of xk is a global optimum of the constrained problem.
Algorithms for constrained local optimization – p. 18
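A one-dimensional sketch (the problem is an illustrative choice): for min x subject to x − 1 = 0, the penalized minimizer of x + ρ(x − 1)² has the closed form x(ρ) = 1 − 1/(2ρ), so the iterates are always infeasible and reach the constrained solution only in the limit ρ → ∞.

```python
def penalty_minimizer(rho):
    # argmin of x + rho*(x - 1)^2, from the stationarity condition
    # 1 + 2*rho*(x - 1) = 0
    return 1.0 - 1.0 / (2.0 * rho)

xs = [penalty_minimizer(rho) for rho in (1.0, 10.0, 100.0, 1e6)]
```

This illustrates why the quadratic penalty is not exact: no finite ρ recovers x⋆ = 1 exactly, unlike the ℓ1 penalty discussed next.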
Exact penalties
Exact penalties: there exists a finite penalty parameter value such that the optimal solution of the penalized problem is the optimal solution of the original one.
ℓ1 penalty function:

P1(x; ρ) = f(x) + ρ ∑i |hi(x)|
Algorithms for constrained local optimization – p. 19
Exact penalties
For problems with both equality and inequality constraints:

min f(x)
hi(x) = 0
gj(x) ≤ 0

the penalized problem is

P1(x; ρ) = f(x) + ρ ∑i |hi(x)| + ρ ∑j max(0, gj(x))
Algorithms for constrained local optimization – p. 20
Augmented Lagrangian method
Given an equality constrained problem, reformulate it as:

min f(x) + (1/2)ρ‖h(x)‖²
h(x) = 0

The Lagrange function of this problem is called the Augmented Lagrangian:

Lρ(x; λ) = f(x) + (1/2)ρ‖h(x)‖² + λTh(x)
Algorithms for constrained local optimization – p. 21
Motivation
min_x f(x) + (1/2)ρ‖h(x)‖² + λTh(x)

∇xLρ(x, λ) = ∇f(x) + ∑i λi∇hi(x) + ρ ∑i hi(x)∇hi(x)
           = ∇xL(x, λ) + ρ ∑i hi(x)∇hi(x)

∇²xxLρ(x, λ) = ∇²f(x) + ∑i λi∇²hi(x) + ρ ∑i hi(x)∇²hi(x) + ρ∇h(x)∇Th(x)
             = ∇²xxL(x, λ) + ρ ∑i hi(x)∇²hi(x) + ρ∇h(x)∇Th(x)
Algorithms for constrained local optimization – p. 22
motivation . . .
Let (x⋆, λ⋆) be an optimal (primal and dual) solution. Necessarily ∇xL(x⋆, λ⋆) = 0; moreover h(x⋆) = 0, thus

∇xLρ(x⋆, λ⋆) = ∇xL(x⋆, λ⋆) + ρ ∑i hi(x⋆)∇hi(x⋆) = 0

⇒ (x⋆, λ⋆) is a stationary point of the augmented Lagrangian.
Algorithms for constrained local optimization – p. 23
motivation . . .
Observe that, at x⋆ (where h(x⋆) = 0):

∇²xxLρ(x⋆, λ) = ∇²xxL(x⋆, λ) + ρ ∑i hi(x⋆)∇²hi(x⋆) + ρ∇h(x⋆)∇Th(x⋆)
              = ∇²xxL(x⋆, λ) + ρ∇h(x⋆)∇Th(x⋆)

Assume that the sufficient optimality conditions hold:

vT∇²xxL(x⋆, λ⋆)v > 0 ∀ v ≠ 0 : vT∇h(x⋆) = 0
Algorithms for constrained local optimization – p. 24
. . .
Let v ≠ 0 : vT∇h(x⋆) = 0. Then

vT∇²xxLρ(x⋆, λ⋆)v = vT∇²xxL(x⋆, λ⋆)v + ρ vT∇h(x⋆)∇Th(x⋆)v
                  = vT∇²xxL(x⋆, λ⋆)v > 0
Algorithms for constrained local optimization – p. 25
. . .
Let v ≠ 0 : vT∇h(x⋆) ≠ 0. Then

vT∇²xxLρ(x⋆, λ⋆)v = vT∇²xxL(x⋆, λ⋆)v + ρ vT∇h(x⋆)∇Th(x⋆)v
                  = vT∇²xxL(x⋆, λ⋆)v + ρ‖vT∇h(x⋆)‖²

which might be negative. However, ∃ ρ̄ > 0 such that if ρ ≥ ρ̄ ⇒ vT∇²xxLρ(x⋆, λ⋆)v > 0.
Thus, if ρ is large enough, the Hessian of the augmented Lagrangian is positive definite and x⋆ is a (strict) local minimum of Lρ(·, λ⋆).
Algorithms for constrained local optimization – p. 26
Inequality constraints
min f(x)
g(x) ≤ 0
Nonlinear transformation of inequalities into equalities:
min_{x,s} f(x)
gj(x) + sj² = 0   j = 1, . . . , p
Algorithms for constrained local optimization – p. 27
Given the problem
min f(x)
hi(x) = 0 i = 1,m
gj(x) ≤ 0 j = 1, p
an Augmented Lagrangian problem might be defined as
min_{x,z} Lρ(x, z; λ, µ) = min_{x,z} f(x) + λTh(x) + (1/2)ρ‖h(x)‖² + ∑j µj(gj(x) + zj²) + (1/2)ρ ∑j (gj(x) + zj²)²
Algorithms for constrained local optimization – p. 28
. . .
Consider minimization with respect to the z variables:

min_z ∑j µj(gj(x) + zj²) + (1/2)ρ ∑j (gj(x) + zj²)²
= min_{u ≥ 0} ∑j µj(gj(x) + uj) + (1/2)ρ(gj(x) + uj)²

(a quadratic minimization over the non-negative orthant). Solution:

u⋆j = max{0, ūj}

where ūj is the unconstrained optimum:

µj + ρ(gj(x) + ūj) = 0
Algorithms for constrained local optimization – p. 29
. . .
Thus:

u⋆j = max{0, −µj/ρ − gj(x)}

Substituting:

Lρ(x; λ, µ) = f(x) + λTh(x) + (1/2)ρ‖h(x)‖² + (1/(2ρ)) ∑j (max{0, µj + ρgj(x)}² − µj²)

This is an Augmented Lagrangian for inequality constrained problems.
Algorithms for constrained local optimization – p. 30
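The derivation above can be exercised numerically. The following is a minimal sketch (not a production method) of the augmented Lagrangian iteration for a single inequality constraint, using the eliminated-slack form Lρ(x; µ) = f(x) + (1/(2ρ))(max{0, µ + ρg(x)}² − µ²) and the standard first-order multiplier update; the toy problem and penalty value are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

def f(x):                      # objective: min x1^2 + x2^2
    return x[0]**2 + x[1]**2

def g(x):                      # constraint x1 + x2 >= 1, written as g(x) <= 0
    return 1.0 - x[0] - x[1]

def L_rho(x, mu, rho):
    # augmented Lagrangian after eliminating the slack variable
    return f(x) + (max(0.0, mu + rho * g(x))**2 - mu**2) / (2.0 * rho)

mu, rho, x = 0.0, 10.0, np.zeros(2)
for _ in range(20):
    # inner unconstrained minimization of L_rho(., mu)
    x = minimize(L_rho, x, args=(mu, rho)).x
    # multiplier update; the max keeps mu >= 0 automatically
    mu = max(0.0, mu + rho * g(x))

print(np.round(x, 4))          # solution of the toy problem is (0.5, 0.5)
```

With ρ fixed at 10 the multiplier converges geometrically to µ⋆ = 1; in practice ρ would be increased when constraint violation decreases too slowly.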
Sequential Quadratic Programming
min f(x)
hi(x) = 0
Idea: apply Newton’s method to solve the KKT equations. Lagrangian function:
L(x; λ) = f(x) + ∑i λihi(x)
Let H(x) = [hi(x)], ∇H(x) = [∇Thi(x)] (the Jacobian of H). KKT conditions:
F(x; λ) = [ ∇f(x) + ∇HT(x)λ ; H(x) ] = 0
Algorithms for constrained local optimization – p. 31
Newton step for SQP
Jacobian of the KKT system:
F′(x, λ) = [ ∇2xxL(x; λ)  ∇HT(x) ; ∇H(x)  0 ]
Newton step:
[ xk+1 ; λk+1 ] = [ xk ; λk ] + [ dk ; ∆k ]
where
[ ∇2xxL(xk; λk)  ∇HT(xk) ; ∇H(xk)  0 ] [ dk ; ∆k ] = [ −∇f(xk) − ∇HT(xk)λk ; −H(xk) ]
Algorithms for constrained local optimization – p. 32
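The Newton step above can be coded directly. Below is a hedged sketch of the full-step Newton-on-KKT iteration on a toy equality-constrained problem, min x1 + x2 s.t. x1² + x2² = 1 (chosen so that all the ingredients are a few lines each); the starting point and iteration count are illustrative.

```python
import numpy as np

def grad_f(x):  return np.array([1.0, 1.0])               # f(x) = x1 + x2
def h(x):       return np.array([x[0]**2 + x[1]**2 - 1.0])
def jac_h(x):   return np.array([[2*x[0], 2*x[1]]])       # nabla H(x)

def hess_L(x, lam):                                       # nabla^2_xx L
    return lam[0] * 2.0 * np.eye(2)                       # f is linear

x, lam = np.array([0.0, -1.0]), np.array([1.0])
for _ in range(20):
    A = jac_h(x)
    # KKT Jacobian [[hess_L, A^T], [A, 0]] and Newton right-hand side
    K = np.block([[hess_L(x, lam), A.T],
                  [A, np.zeros((1, 1))]])
    rhs = np.concatenate([-grad_f(x) - A.T @ lam, -h(x)])
    step = np.linalg.solve(K, rhs)
    x, lam = x + step[:2], lam + step[2:]

print(np.round(x, 6))   # minimizer is (-1/sqrt(2), -1/sqrt(2))
```

Near the solution the iteration is quadratically convergent; globalization (line search or trust region, as in the slides that follow) is what a real SQP code adds on top of this step.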
existence
The Newton step exists if
the Jacobian of the constraints ∇H(xk) has full row rank
the Hessian ∇2xxL(xk; λk) is positive definite
In this case the Newton step is the unique solution of
∇2xxL(xk; λk)dk + ∇HT(xk)∆k + ∇f(xk) + ∇HT(xk)λk = 0
∇H(xk)dk + H(xk) = 0
Algorithms for constrained local optimization – p. 33
Alternative view: SQP
mind f(xk) + ∇f(xk)Td + (1/2) dT∇2xxL(xk; λk)d
∇H(xk)d + H(xk) = 0
KKT conditions:
∇2xxL(xk; λk)d + ∇f(xk) + ∇HT(xk)Λk = 0
Under the same conditions as before this QP has a unique solution dk with Lagrange multipliers Λk = λk+1
Algorithms for constrained local optimization – p. 34
Alternative view: SQP
mind L(xk, λk) + ∇TxL(xk, λk)d + (1/2) dT∇2xxL(xk; λk)d
∇H(xk)d + H(xk) = 0
KKT conditions:
∇2xxL(xk; λk)d + ∇f(xk) + ∇HT(xk)λk + ∇HT(xk)Λk = 0
Under the same conditions as before this QP has a unique solution dk with Lagrange multipliers Λk = ∆k
Algorithms for constrained local optimization – p. 35
Thus SQP can be seen as a method which
minimizes a quadratic approximation to the Lagrangian
subject to a first order approximation of the constraints.
Algorithms for constrained local optimization – p. 36
Inequalities
If the original problem is
min f(x)
hi(x) = 0
gj(x) ≤ 0
then the SQP iteration solves
mind f(xk) + ∇f(xk)Td + (1/2) dT∇2xxL(xk, λk)d
∇Thi(xk)d + hi(xk) = 0
∇Tgj(xk)d + gj(xk) ≤ 0
Algorithms for constrained local optimization – p. 37
Filter Methods
Basic idea:
min f(x)
g(x) ≤ 0
can be considered as a problem with two objectives:
minimize f(x)
minimize g(x)
(the second objective has priority over the first)
Algorithms for constrained local optimization – p. 38
Filter
Given the problem
min f(x)
gj(x) ≤ 0 j = 1, . . . , k
let us consider the bi-criteria optimization problem
min f(x)
min h(x)
where
h(x) = ∑j max{gj(x), 0}
Algorithms for constrained local optimization – p. 39
Let fk, hk, k = 1, 2, . . . be the values of f and h observed at the points x1, x2, . . .. A pair (fk, hk) dominates a pair (fℓ, hℓ) iff
fk ≤ fℓ and
hk ≤ hℓ
A filter is a list of pairs none of which is dominated by another
Algorithms for constrained local optimization – p. 40
[Figure: filter entries plotted in the (h(x), f(x)) plane]
Algorithms for constrained local optimization – p. 41
Trust region SQP
Consider a trust-region SQP method:
mind fk + ∇L(xk; λk)Td + (1/2) dT∇2xxL(xk; λk)d
∇Tgj(xk)d + gj(xk) ≤ 0
‖d‖∞ ≤ ρ
(the ∞ norm is used here in order to keep the problem a QP).
Traditional (unconstrained) trust region methods: if the current step is a failure ⇒reduce the trust region ⇒eventually the step will become a pure gradient step ⇒convergence!
Algorithms for constrained local optimization – p. 42
Trust region SQP
Here diminishing the trust region radius might lead to infeasible QP’s:
gj(x) ≤ 0
∇Tgj(xk)d + gj(xk) ≤ 0
[Figure: a point xk at which the linearized constraints, intersected with a small trust region, have no feasible point]
Algorithms for constrained local optimization – p. 43
Filter methods
Data: x0: starting point, ρ; k = 0
while convergence criterion not satisfied do
    if QP is infeasible then
        find xk+1 minimizing the constraint violation;
    else
        solve the QP and get a step dk; try setting xk+1 = xk + dk;
        if (fk+1, hk+1) is acceptable to the filter then
            accept xk+1 and add (fk+1, hk+1) to the filter;
            remove dominated points from the filter; possibly increase ρ;
        else
            reject the step; reduce ρ;
        end
    end
    set k = k + 1;
end
Algorithms for constrained local optimization – p. 44
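The filter bookkeeping used by the loop above is tiny. The following is a sketch of the data structure only (acceptance test and dominated-entry removal); the sample pairs are made up for illustration.

```python
def dominates(a, b):
    """Pair a = (f_a, h_a) dominates b = (f_b, h_b) iff a is no worse in both."""
    return a[0] <= b[0] and a[1] <= b[1]

def acceptable(pair, filt):
    """A trial pair is acceptable iff no filter entry dominates it."""
    return not any(dominates(entry, pair) for entry in filt)

def add_to_filter(pair, filt):
    """Insert an accepted pair and drop the entries it dominates."""
    return [e for e in filt if not dominates(pair, e)] + [pair]

filt = []
for trial in [(5.0, 2.0), (4.0, 3.0), (6.0, 1.0), (4.5, 2.5), (3.0, 0.5)]:
    if acceptable(trial, filt):
        filt = add_to_filter(trial, filt)

print(sorted(filt))   # the last pair dominates all earlier entries
```

Note that acceptance in actual filter-SQP codes also requires a small margin of improvement over the filter entries; the plain comparison above is the bare idea.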
Comparison with other methods
[Figure: (h(x), f(x)) plane showing the steps acceptable to a “classical” merit-function method and the steps rejected by the filter]
Algorithms for constrained local optimization – p. 45
Introduction to Global Optimization
Fabio Schoen
2008
http://gol.dsi.unifi.it/users/schoen
Introduction to Global Optimization – p. 1
Global Optimization Problems
minx∈S⊆Rn
f(x)
What is meant by global optimization? Of course we should like to find
f∗ = minx∈S⊆Rn f(x)
and
x∗ = arg min f(x), i.e. x∗ : f(x∗) ≤ f(x) ∀x ∈ S
Introduction to Global Optimization – p. 2
This definition is unsatisfactory:
the problem is “ill posed” in x (two objective functions which differ only slightly might have global optima which are arbitrarily far apart)
it is however well posed in the optimal values: ‖f − g‖ ≤ δ⇒|f∗ − g∗| ≤ ε
Introduction to Global Optimization – p. 3
Quite often we are satisfied with looking for f∗, searching for one or more feasible solutions such that
f(x) ≤ f(x∗) + ε
Frequently, however, this is too ambitious a task!
Introduction to Global Optimization – p. 4
Research in Global Optimization
the problem is highly relevant, especially in applications
the problem is very hard (perhaps too hard) to solve
there are plenty of publications on global optimizationalgorithms for specific problem classes
there are only relatively few papers with relevant theoreticalcontents
often, elegant theories have produced weak algorithms and, vice versa, the best computational methods often lack a sound theoretical support
Introduction to Global Optimization – p. 5
many global optimization papers get published on appliedresearch journals
Bazaraa, Sherali, Shetty “Nonlinear Programming: theoryand algorithms”, 1993:the word “global optimum” appears for the first time on page99, the second time at page 132, then at page 247:“A desirable property of an algorithm for solving [anoptimization] problem is that it generates a sequence ofpoints converging to a global optimal solution. In manycases however we may have to be satisfied with lessfavorable outcomes.”after this (in 638 pages) it never appears anymore. “Globaloptimization” is never cited.
Introduction to Global Optimization – p. 6
Similar situation in Bertsekas, Nonlinear Programming (1999): 777 pages, but only the definition of global minima and maxima is given! Nocedal & Wright, “Numerical Optimization”, 2nd edition, 2006: “Global solutions are needed in some applications, but for many problems they are difficult to recognize and even more difficult to locate. . . many successful global optimization algorithms require the solution of many local optimization problems, to which the algorithms described in this book can be applied”
Introduction to Global Optimization – p. 7
Complexity
Global optimization is “hopeless”: without “global” information no algorithm will find a certifiable global optimum unless it generates a dense sample. There exists a rigorous definition of “global” information – some examples:
the number of local optima
the global optimum value
for global optimization problems over a box, (an upper bound on) the Lipschitz constant:
|f(y) − f(x)| ≤ L‖x − y‖ ∀x, y
concavity of the objective function + convexity of the feasible region
an explicit representation of the objective function as the difference between two convex functions (+ convexity of the feasible region)
Introduction to Global Optimization – p. 8
Complexity
Global optimization is computationally intractable also according to classical complexity theory. Special cases: quadratic programming,
minl≤Ax≤u (1/2)xTQx + cTx
is NP–hard [Sahni, 1974] and, when considered as a decision problem, NP-complete [Vavasis, 1990].
Introduction to Global Optimization – p. 9
Many special cases are still NP–hard:
norm maximization on a parallelotope:
max ‖x‖
b ≤ Ax ≤ c
quadratic optimization on a hyper-rectangle (A = I), even when only one eigenvalue of Q is negative
quadratic minimization over a simplex:
minx≥0 (1/2)xTQx + cTx
∑j xj = 1
Even checking that a point is a local optimum is NP-hard
Introduction to Global Optimization – p. 10
Applications of global optimization
concave minimization – quantity discounts, scale economies
fixed charge
combinatorial optimization – binary linear programming:
min cTx + K xT(1 − x)
Ax = b
x ∈ [0, 1]n
or:
min cTx
Ax = b
x ∈ [0, 1]n
xT(1 − x) = 0
Introduction to Global Optimization – p. 11
Minimization of cost functions which are neither convex nor concave. E.g.: finding the minimum-energy conformation of complex molecules – Lennard-Jones micro-clusters, protein folding, protein–ligand docking.
Example: Lennard-Jones pair potential due to two atoms at X1, X2 ∈ R3:
v(r) = 1/r12 − 2/r6
where r = ‖X1 − X2‖. The total energy of a cluster of N atoms located at X1, . . . , XN ∈ R3 is defined as:
∑i=1,...,N ∑j<i v(‖Xi − Xj‖)
This function has a number of local (non-global) minima which grows like exp(N)
Introduction to Global Optimization – p. 12
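The total energy above is a one-liner per pair. A minimal sketch, in the reduced units of the slide (so the pair minimum is v = −1 at r = 1):

```python
import numpy as np

def lj_energy(X):
    """Total Lennard-Jones energy, v(r) = 1/r^12 - 2/r^6, X: (N, 3) positions."""
    E = 0.0
    N = len(X)
    for i in range(N):
        for j in range(i):
            r = np.linalg.norm(X[i] - X[j])
            E += 1.0 / r**12 - 2.0 / r**6
    return E

# two atoms at the pair-equilibrium distance r = 1
X = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
print(lj_energy(X))   # -1.0
```

Evaluating this energy is cheap; it is the exponential growth of the number of local minima with N that makes global minimization of clusters so hard.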
Lennard-Jones potential
[Figure: the Lennard-Jones pair potential lennard-jones(x) and its attractive(x) and repulsive(x) components, plotted for r between 0.5 and 5]
Introduction to Global Optimization – p. 13
Protein folding and docking
Potential energy model: E = El + Ea + Ed + Ev + Ee where:
El = ∑i∈L (1/2) Kbi (ri − r0i)2
(contribution of pairs of bonded atoms)
Ea = ∑i∈A (1/2) Kθi (θi − θ0i)2
(angle between 3 bonded atoms)
Ed = ∑i∈T (1/2) Kφi [1 + cos(nφi − γ)]
(dihedrals)
Introduction to Global Optimization – p. 14
Ev = ∑(i,j)∈C ( Aij/R12ij − Bij/R6ij )
(van der Waals)
Ee = (1/2) ∑(i,j)∈C qiqj/(εRij)
(Coulomb interaction)
Introduction to Global Optimization – p. 15
Docking
Given two macro-molecules M1, M2, find their minimal energy coupling. If no bonds are changed ⇒to find the optimal docking it is sufficient to minimize:
Ev + Ee = ∑i∈M1,j∈M2 ( Aij/R12ij − Bij/R6ij ) + (1/2) ∑i∈M1,j∈M2 qiqj/(εRij)
Introduction to Global Optimization – p. 16
Main algorithmic strategies
Two main families:
1. with global information (“structured problems”)
2. without global information (“unstructured problems”)
Structured problems ⇒stochastic and deterministic methods
Unstructured problems ⇒typically stochastic algorithms
Every global optimization method should try to find a balance between
exploration of the feasible region
approximations of the optimum
Introduction to Global Optimization – p. 17
Example: Lennard Jones
LJN = min LJ(X) = min ∑i=1,...,N−1 ∑j=i+1,...,N [ 1/‖Xi − Xj‖12 − 2/‖Xi − Xj‖6 ]
This is a highly structured problem. But is it easy/convenient to use its structure? And how?
Introduction to Global Optimization – p. 18
LJ
The map
F1 : R3N 7→ RN(N−1)/2+
F1(X1, . . . , XN) = ( ‖X1 − X2‖2, . . . , ‖XN−1 − XN‖2 )
is convex and the function
F2 : RN(N−1)/2+ 7→ R
F2(r12, . . . , rN−1,N) = ∑ 1/r6ij − 2 ∑ 1/r3ij
(here the rij are squared distances) is the difference between two convex functions. Thus LJ(X) can be seen as the difference between two convex functions (a d.c. programming problem)
Introduction to Global Optimization – p. 19
NB: every C2 function is d.c., but often its d.c. decomposition isnot known.D.C. optimization is very elegant, there exists a nice dualitytheory, but algorithms are typically very inefficient.
Introduction to Global Optimization – p. 20
A primal method for d.c. optimization
“cutting plane” method (just an example: not particularly efficient, and useless for high dimensional problems). Any unconstrained d.c. problem can be represented as an equivalent problem with a linear objective, a convex constraint and a reverse convex constraint. If g, h are convex, then min g(x) − h(x) is equivalent to:
min z
g(x) − h(x) ≤ z
which is equivalent to
min z
g(x) ≤ w
h(x) + z ≥ w
Introduction to Global Optimization – p. 21
D.C. canonical form
min cTx
g(x) ≤ 0
h(x) ≥ 0
where h, g: convex. Let
Ω = {x : g(x) ≤ 0}, C = {x : h(x) ≤ 0}
Hp: 0 ∈ int Ω ∩ int C, cTx > 0 ∀x ∈ Ω \ int C
Fundamental property: if a D.C. problem admits an optimum, at least one optimum belongs to ∂Ω ∩ ∂C
Introduction to Global Optimization – p. 22
Discussion of the assumptions
g(0) < 0, h(0) < 0, cTx > 0 ∀ feasible x. Let x̄ be a solution to the convex problem
min cTx : g(x) ≤ 0
If h(x̄) ≥ 0 then x̄ solves the d.c. problem. Otherwise cTx > cTx̄ for all feasible x. Coordinate transformation y = x − x̄:
min cTy
g̃(y) ≤ 0
h̃(y) ≥ 0
where g̃(y) = g(y + x̄), h̃(y) = h(y + x̄). Then cTy > 0 for all feasible solutions and, by continuity, it is possible to choose x̄ so that g̃(0) < 0 and h̃(0) < 0.
Introduction to Global Optimization – p. 23
[Figure: the sets Ω and C, the origin 0 and the line cTx = 0]
Introduction to Global Optimization – p. 24
Let x̄ be the best known solution. Let
D(x̄) = {x ∈ Ω : cTx ≤ cTx̄}
If D(x̄) ⊆ C then x̄ is optimal.
Check: a polytope P (with known vertices) is built which contains D(x̄). If all vertices of P are in C ⇒x̄ is an optimal solution. Otherwise let v be the vertex with largest h value; the intersection of the segment [0, v] with ∂C (if feasible) is an improving point x̄. Otherwise a cut is introduced in P which is tangent to Ω at x̄.
Introduction to Global Optimization – p. 25
[Figure: Ω, C, the line cTx = 0, the incumbent x̄ and the set D(x̄) = {x ∈ Ω : cTx ≤ cTx̄}]
Introduction to Global Optimization – p. 26
Initialization
Given a feasible solution x̄, take a polytope P such that
P ⊇ D(x̄)
i.e.
cTy ≤ cTx̄, y feasible ⇒y ∈ P
If P ⊂ C, i.e. if y ∈ P ⇒h(y) ≤ 0, then x̄ is optimal. Checking is easy if we know the vertices of P.
Introduction to Global Optimization – p. 27
[Figure: a polytope P with D(x̄) ⊆ P and vertices V1, . . . , Vk; V⋆ := arg maxj h(Vj)]
Introduction to Global Optimization – p. 28
Step 1
Let V⋆ be the vertex with the largest h() value. Surely h(V⋆) > 0 (otherwise we stop with an optimal solution). Moreover h(0) < 0 (0 is in the interior of C). Thus the segment from V⋆ to 0 must intersect the boundary of C. Let xk be the intersection point. It might be feasible (⇒improving) or not.
Introduction to Global Optimization – p. 29
[Figure: the point xk = ∂C ∩ [V⋆, 0]]
Introduction to Global Optimization – p. 30
[Figure: if xk ∈ Ω, set x̄ := xk]
Introduction to Global Optimization – p. 31
[Figure: otherwise, if xk ∉ Ω, the polytope is divided by a cut]
Introduction to Global Optimization – p. 32
Duality for d.c. problems
minx∈S g(x) − h(x)
where g, h: convex. Let
h⋆(u) := sup{uTx − h(x) : x ∈ Rn}
g⋆(u) := sup{uTx − g(x) : x ∈ Rn}
be the conjugate functions of h and g. The problem
inf{h⋆(u) − g⋆(u) : u : h⋆(u) < +∞}
is the Fenchel–Rockafellar dual. If min g(x) − h(x) admits an optimum, then the Fenchel dual is a strong dual.
Introduction to Global Optimization – p. 33
If x⋆ ∈ arg min g(x) − h(x) then
u⋆ ∈ ∂h(x⋆)
(∂ denotes subdifferential) is dual optimal and ifu⋆ ∈ arg minh⋆(u) − g⋆(u) then
x⋆ ∈ ∂g⋆(u⋆)
is an optimal primal solution.
Introduction to Global Optimization – p. 34
A primal/dual algorithm
Pk : min g(x) − (h(xk) + (x − xk)Tyk)
and
Dk : min h⋆(y) − (g⋆(yk−1) + xTk(y − yk−1))
Introduction to Global Optimization – p. 35
Exact Global Optimization
Introduction to Global Optimization – p. 36
GlobOpt - relaxations
Consider the global optimization problem (P):
min f(x)
x ∈ X
and assume the min exists and is finite and that we can use arelaxation (R):
min g(y)
y ∈ Y
Usually both X and Y are subsets of the same space Rn.
Recall: (R) is a relaxation of (P) iff:
X ⊆ Y
g(x) ≤ f(x) for all x ∈ X
Introduction to Global Optimization – p. 37
Branch and Bound
1. Solve the relaxation (R) and let L be its (global) optimum value (assume (R) is feasible)
2. (Heuristically) solve the original problem (P) (or, moregenerally, find a “good” feasible solution to (P) in X). Let Ube the best feasible function value known
3. if U − L ≤ ε then stop: U is a certified ε–optimum for (P)
4. otherwise split X and Y into two parts and apply to each ofthem the same method
Introduction to Global Optimization – p. 38
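The four steps above can be sketched on a one-dimensional problem. In this sketch a Lipschitz lower bound on each interval stands in for the relaxation (R) (the Lipschitz constant is exactly the kind of "global information" discussed earlier), and the midpoint evaluation plays the role of the heuristic; all numbers are illustrative.

```python
import heapq, math

def f(x):
    return x * math.sin(x)       # on [0, 8] the global min is near x ~ 4.91

K = 9.0                          # |f'(x)| <= |sin x| + |x cos x| <= 9 on [0, 8]
eps = 1e-4
a, b = 0.0, 8.0

U = min(f(a), f(b))              # incumbent upper bound (step 2)
# lower bound on a box: min of endpoint values minus K * half-width (step 1)
heap = [(min(f(a), f(b)) - K*(b - a)/2, a, b)]
while heap:
    L, a, b = heapq.heappop(heap)
    if U - L <= eps:             # step 3: U is a certified eps-optimum
        break
    m = 0.5 * (a + b)            # step 4: split, bound each half
    U = min(U, f(m))
    for lo, hi in ((a, m), (m, b)):
        lb = min(f(lo), f(hi)) - K*(hi - lo)/2
        if lb < U - eps:         # discard boxes that cannot improve U
            heapq.heappush(heap, (lb, lo, hi))

print(round(U, 3))               # about -4.814
```

Best-first selection (the heap is ordered by lower bound) is one common node-selection rule; depth-first selection trades certification speed for memory.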
Tools
“good relaxations”: easy yet accurate
good upper bounding, i.e., good heuristics for (P)
Good relaxations can be obtained, e.g., through:
convex relaxations
domain reduction
Introduction to Global Optimization – p. 39
Convex relaxations
Assume X is convex and Y = X. If g is the convex envelope of f on X, then solving the convex relaxation (R) gives, in one step, the certified global optimum of (P).
g(x) is a convex under-estimator of f on X if:
g(x) is convex
g(x) ≤ f(x) ∀x ∈ X
g is the convex envelope of f on X if:
g is a convex under-estimator of f
g(x) ≥ h(x) ∀x ∈ X, ∀h : convex under-estimator of f
Introduction to Global Optimization – p. 40
A 1-D example
Introduction to Global Optimization – p. 41
Convex under-estimator
Introduction to Global Optimization – p. 42
Branching
Introduction to Global Optimization – p. 43
Bounding
[Figure: branching tree with the upper bound, the lower bounds and the fathomed nodes]
Introduction to Global Optimization – p. 44
Relaxation of the feasible domain
Let
minx∈S
f(x)
be a GlobOpt problem where f is convex, while S is non convex.A relaxation (outer approximation) is obtained replacing S with alarger set Q. If Q is convex ⇒convex optimization problem.If the optimal solution to
minx∈Q
f(x)
belongs to S ⇒optimal solution to the original problem.
Introduction to Global Optimization – p. 45
Example
minx∈[0,5],y∈[0,3]
−x− 2y
xy ≤ 3
[Figure: the nonconvex feasible region of the example in the box [0, 5] × [0, 3]]
Introduction to Global Optimization – p. 46
Relaxation
minx∈[0,5],y∈[0,3]
−x− 2y
xy ≤ 3
We know that:
(x+ y)2 = x2 + y2 + 2xy
thus
xy = ((x+ y)2 − x2 − y2)/2
and, as x and y are non-negative, x2 ≤ 5x, y2 ≤ 3y, thus a(convex) relaxation of xy ≤ 3 is
(x+ y)2 − 5x− 3y ≤ 6
(a convex constraint)
Introduction to Global Optimization – p. 47
Relaxation
[Figure: the convex relaxed feasible region]
Optimal solution of the relaxed convex problem: (2, 3) (value:−8)
Introduction to Global Optimization – p. 48
Stronger Relaxation
minx∈[0,5],y∈[0,3]
−x− 2y
xy ≤ 3
Thus:
(5 − x)(3 − y) ≥ 0 ⇒15 − 3x − 5y + xy ≥ 0 ⇒ xy ≥ 3x + 5y − 15
and a (convex) relaxation of xy ≤ 3 is
3x + 5y − 15 ≤ 3
i.e.: 3x + 5y ≤ 18
Introduction to Global Optimization – p. 49
Relaxation
[Figure: the linear relaxation 3x + 5y ≤ 18 intersected with the box]
The optimal solution of the convex (linear) relaxation is (1, 3)which is feasible ⇒optimal for the original problem
Introduction to Global Optimization – p. 50
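The claim about the stronger relaxation is easy to verify numerically: the LP min −x − 2y s.t. 3x + 5y ≤ 18 over the box has optimum (1, 3), which also satisfies xy ≤ 3 and is therefore optimal for the original nonconvex problem. A quick check:

```python
from scipy.optimize import linprog

res = linprog(c=[-1.0, -2.0],          # minimize -x - 2y
              A_ub=[[3.0, 5.0]],       # 3x + 5y <= 18
              b_ub=[18.0],
              bounds=[(0, 5), (0, 3)])
x, y = res.x
print(round(x, 6), round(y, 6))        # 1.0 3.0
print(x * y <= 3.0 + 1e-9)             # feasible for the original constraint
```

The first, weaker relaxation gives (2, 3), which violates xy ≤ 3 and hence only yields a lower bound of −8 instead of the optimum −7.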
Convex (concave) envelopes
How to build convex envelopes of a function, or how to relax a non convex constraint?
Convex envelopes ⇒lower bounds
Convex envelopes of −f(x) ⇒upper bounds
Constraint g(x) ≤ 0 ⇒if h(x) is a convex under-estimator of g then h(x) ≤ 0 is a convex relaxation.
Constraint g(x) ≥ 0 ⇒if h(x) is concave and h(x) ≥ g(x), then h(x) ≥ 0 is a “convex” constraint
Introduction to Global Optimization – p. 51
Convex envelopes
Definition: a function is polyhedral if it is the pointwise maximum of a finite number of linear functions. (NB: in general, the convex envelope is the pointwise supremum of affine minorants.)
The generating set X of a function f over a convex set P is the set
X = {x ∈ Rn : (x, f(x)) is a vertex of epi(convP(f))}
I.e., given f we first build its convex envelope on P and then take the epigraph {(x, y) : x ∈ P, y ≥ convP f(x)}. This is a convex set; let V denote its extreme points. X is the set of x coordinates of the points of V.
Introduction to Global Optimization – p. 52
Generating sets
Introduction to Global Optimization – p. 53
Introduction to Global Optimization – p. 54
Characterization
Let f(x) be continuously differentiable on a polytope P. The convex envelope of f on P is polyhedral if and only if
X(f) = Vert(P)
(the generating set is the vertex set of P).
Corollary: let f1, . . . , fm ∈ C1(P) and let ∑i fi(x) possess a polyhedral convex envelope on P. Then
Conv(∑i fi(x)) = ∑i Conv fi(x)
iff the generating set of ∑i Conv(fi(x)) is Vert(P)
Introduction to Global Optimization – p. 55
Characterization
If f(x) is such that Conv f(x) is polyhedral, then an affine function h(x) such that
1. h(x) ≤ f(x) for all x ∈ Vert(P)
2. there exist n + 1 affinely independent vertices of P, V1, . . . , Vn+1, such that
f(Vi) = h(Vi) i = 1, . . . , n + 1
belongs to the polyhedral description of Conv f(x) and
h(x) = Conv f(x)
for any x ∈ conv(V1, . . . , Vn+1).
Introduction to Global Optimization – p. 56
Characterization
The condition may be reversed: given m affine functions h1, . . . , hm such that, for each of them,
1. hj(x) ≤ f(x) for all x ∈ Vert(P)
2. there exist n + 1 affinely independent vertices of P, V1, . . . , Vn+1, such that
f(Vi) = hj(Vi) i = 1, . . . , n + 1
then the function ψ(x) = maxj hj(x) is the (polyhedral) convex envelope of f iff
the generating set of ψ is Vert(P)
for every vertex Vi we have ψ(Vi) = f(Vi)
Introduction to Global Optimization – p. 57
Sufficient condition
If f(x) is lower semi-continuous on P and for all x̄ ∉ Vert(P) there exists a line ℓx̄ such that x̄ is in the interior of P ∩ ℓx̄ and f is concave in a neighborhood of x̄ on ℓx̄, then Conv f(x) is polyhedral.
Application: let
f(x) = ∑i,j αij xi xj
The sufficient condition holds for f on [0, 1]n ⇒bilinear forms are polyhedral on a hypercube
Introduction to Global Optimization – p. 58
Application: a bilinear term
(Al-Khayyal, Falk (1983)): let x ∈ [ℓx, ux], y ∈ [ℓy, uy]. Then the convex envelope of xy on [ℓx, ux] × [ℓy, uy] is
φ(x, y) = max{ℓyx + ℓxy − ℓxℓy; uyx + uxy − uxuy}
In fact φ(x, y) is an under-estimate of xy:
(x − ℓx)(y − ℓy) ≥ 0 ⇒ xy ≥ ℓyx + ℓxy − ℓxℓy
and analogously (x − ux)(y − uy) ≥ 0 ⇒ xy ≥ uyx + uxy − uxuy
Introduction to Global Optimization – p. 59
Bilinear terms
xy ≥ φ(x, y) = max{ℓyx + ℓxy − ℓxℓy; uyx + uxy − uxuy}
No other (polyhedral) function under-estimating xy is tighter. In fact ℓyx + ℓxy − ℓxℓy belongs to the convex envelope: it under-estimates xy and coincides with xy at 3 vertices ((ℓx, ℓy), (ℓx, uy), (ux, ℓy)). Analogously for the other affine function. All vertices are interpolated by these 2 under-estimating hyperplanes ⇒they form the convex envelope of xy
Introduction to Global Optimization – p. 60
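Both defining properties of the envelope (under-estimation everywhere, interpolation at the box vertices) can be checked numerically. A small sketch on an arbitrary box:

```python
import itertools, random

def phi(x, y, lx, ux, ly, uy):
    # Al-Khayyal/Falk lower envelope of x*y on [lx, ux] x [ly, uy]
    return max(ly*x + lx*y - lx*ly, uy*x + ux*y - ux*uy)

lx, ux, ly, uy = -1.0, 2.0, 0.5, 3.0
random.seed(0)

# under-estimation at random points of the box
for _ in range(1000):
    x = random.uniform(lx, ux)
    y = random.uniform(ly, uy)
    assert phi(x, y, lx, ux, ly, uy) <= x*y + 1e-12

# exact interpolation at the four vertices
for x, y in itertools.product((lx, ux), (ly, uy)):
    assert abs(phi(x, y, lx, ux, ly, uy) - x*y) < 1e-12

print("ok")
```

The two symmetric inequalities (x − ℓx)(uy − y) ≥ 0 and (ux − x)(y − ℓy) ≥ 0 give, in the same way, the concave over-estimators needed to relax a constraint of the form xy ≥ c.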
All easy then?
Of course not! Many things can go wrong . . .
It is true that, on the hypercube, a bilinear form ∑i<j αij xi xj is polyhedral (easy to see), but we cannot guarantee in general that the generating set of the envelope is the set of vertices of the hypercube (in particular, if the α’s have opposite signs)
if the set is not a hypercube, even a bilinear term might be non polyhedral: e.g. xy on the triangle 0 ≤ x ≤ y ≤ 1
Finding the (polyhedral) convex envelope of a bilinear form on a generic polytope P is NP–hard!
Introduction to Global Optimization – p. 61
Fractional terms
A convex under-estimate w of a fractional term x/y over a box can be obtained through
w ≥ ℓx/y + x/uy − ℓx/uy if ℓx ≥ 0
w ≥ x/uy − ℓxy/(ℓyuy) + ℓx/ℓy if ℓx < 0
w ≥ ux/y + x/ℓy − ux/ℓy if ℓx ≥ 0
w ≥ x/ℓy − uxy/(ℓyuy) + ux/uy if ℓx < 0
(a better underestimate exists)
Introduction to Global Optimization – p. 62
Univariate concave terms
If f(x), x ∈ [ℓx, ux], is concave, then the convex envelope is simply its linear interpolation at the extremes of the interval:
f(ℓx) + [(f(ux) − f(ℓx))/(ux − ℓx)] (x − ℓx)
Introduction to Global Optimization – p. 63
Underestimating a general nonconvex function
Let f(x) ∈ C2 be a general non convex function. Then a convex under-estimate on a box can be defined as
φ(x) = f(x) − ∑i=1,...,n αi(xi − ℓi)(ui − xi)
where αi > 0 are parameters. The Hessian of φ is
∇2φ(x) = ∇2f(x) + 2 diag(α)
and φ is convex iff ∇2φ(x) is positive semi-definite on the box.
Introduction to Global Optimization – p. 64
How to choose the αi’s? One possibility is the uniform choice αi = α. In this case convexity of φ is obtained iff
α ≥ max{0, −(1/2) minx∈[ℓ,u] λmin(x)}
where λmin(x) is the minimum eigenvalue of ∇2f(x)
Introduction to Global Optimization – p. 65
Key properties
φ(x) ≤ f(x)
φ interpolates f at all vertices of [ℓ, u]
φ is convex
Maximum separation:
max(f(x) − φ(x)) = (α/4) ∑i (ui − ℓi)2
Thus the error in under-estimation decreases when the box is split.
Introduction to Global Optimization – p. 66
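In one dimension all three properties are easy to check by hand. A sketch for f(x) = sin(x) on [0, 2π], where f''(x) = −sin(x) ≥ −1, so α = 1/2 already makes φ convex:

```python
import math

l, u = 0.0, 2.0 * math.pi
alpha = 0.5                     # f'' = -sin(x) >= -1, so phi'' = -sin(x) + 2*alpha >= 0
f = lambda x: math.sin(x)
phi = lambda x: f(x) - alpha * (x - l) * (u - x)

xs = [l + i * (u - l) / 1000 for i in range(1001)]
# under-estimation on the whole interval, interpolation at the endpoints
assert all(phi(x) <= f(x) + 1e-12 for x in xs)
assert abs(phi(l) - f(l)) < 1e-12 and abs(phi(u) - f(u)) < 1e-12

# maximum separation equals (alpha/4)*(u - l)^2, attained at the midpoint
gap = max(f(x) - phi(x) for x in xs)
print(round(gap, 4), round(alpha / 4 * (u - l)**2, 4))
```

The separation (α/4)(u − ℓ)² shrinks by a factor of 4 each time the interval is halved, which is exactly why the bound tightens along a branch-and-bound tree.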
Estimation of α
Compute an interval Hessian [H] on [ℓ, u], with [H(x)]ij = [hLij, hUij]. Find α such that [H] + 2 diag(α) ⪰ 0.
Gerschgorin theorem for real matrices:
λmin ≥ mini ( hii − ∑j≠i |hij| )
Extension to interval matrices:
λmin ≥ mini ( hLii − ∑j≠i max{|hLij|, |hUij|} (uj − ℓj)/(ui − ℓi) )
Introduction to Global Optimization – p. 67
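The interval Gerschgorin bound is a few lines of code. A sketch for the case of uniform box widths, where the (uj − ℓj)/(ui − ℓi) factors are all 1; the interval Hessian entries below are made up for illustration:

```python
def interval_gerschgorin(hL, hU):
    """Lower bound on lambda_min of every H with hL[i][j] <= H[i][j] <= hU[i][j]."""
    n = len(hL)
    return min(
        hL[i][i] - sum(max(abs(hL[i][j]), abs(hU[i][j]))
                       for j in range(n) if j != i)
        for i in range(n)
    )

# interval Hessian [H] with entries [hL_ij, hU_ij]
hL = [[2.0, -1.5], [-1.5, 1.0]]
hU = [[4.0,  0.5], [ 0.5, 3.0]]
lam_bound = interval_gerschgorin(hL, hU)
print(lam_bound)                       # 1.0 - 1.5 = -0.5

# the uniform alpha that guarantees convexity of phi
alpha = max(0.0, -0.5 * lam_bound)
print(alpha)                           # 0.25
```

Tighter (and costlier) eigenvalue bounds for interval matrices exist; Gerschgorin is popular because it is O(n²) and always valid.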
Improvements
new relaxation functions (other than quadratic). Example:
Φ(x; γ) = −∑i=1,...,n (1 − eγi(xi−ℓi))(1 − eγi(ui−xi))
gives a tighter under-estimate than the quadratic function
partitioning: partition the domain into a small number of regions (hyper-rectangles); evaluate a convex under-estimator in each region; join the under-estimators to form a single convex function on the whole domain
Introduction to Global Optimization – p. 68
Domain (range) reduction
Techniques for cutting the feasible region without cutting the global optimum solution. Simplest approaches: feasibility-based and optimality-based range reduction (RR). Let the problem be:
minx∈S f(x)
Feasibility-based RR asks for solving
ℓi = min{xi : x ∈ S}    ui = max{xi : x ∈ S}
for all i ∈ {1, . . . , n} and then adding the constraints x ∈ [ℓ, u] to the problem (or to the sub-problems generated during Branch & Bound)
Introduction to Global Optimization – p. 69
Feasibility Based RR
If S is a polyhedron, RR requires the solution of LP’s:
[ℓ, u] = min /maxx
Ax ≤ b
x ∈ [L,U ]
“Poor man’s” L.P.-based RR: from every constraint ∑j aij xj ≤ bi in which aik > 0 we get
xk ≤ (1/aik) ( bi − ∑j≠k aij xj )
⇒
xk ≤ (1/aik) ( bi − ∑j≠k min{aij Lj, aij Uj} )
Introduction to Global Optimization – p. 70
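The poor man's bound costs one pass over the constraint coefficients. A sketch for a single linear constraint and a box [L, U]:

```python
def poor_mans_rr(a, b, L, U):
    """Tighten upper bounds using sum_j a[j]*x[j] <= b and the box [L, U]."""
    U_new = list(U)
    for k in range(len(a)):
        if a[k] > 0:
            # worst-case (smallest) contribution of the other variables
            rest = sum(min(a[j] * L[j], a[j] * U[j])
                       for j in range(len(a)) if j != k)
            U_new[k] = min(U[k], (b - rest) / a[k])
    return U_new

# constraint 2*x0 + x1 - x2 <= 4 with box [0, 10]^3
a, b = [2.0, 1.0, -1.0], 4.0
L, U = [0.0, 0.0, 0.0], [10.0, 10.0, 10.0]
print(poor_mans_rr(a, b, L, U))   # [7.0, 10.0, 10.0]
```

The symmetric formula with aik < 0 tightens lower bounds; iterating over all constraints until no bound moves is a simple constraint-propagation scheme, much cheaper (but weaker) than solving the 2n LPs.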
Optimality Based RR
Given an incumbent solution x̄ ∈ S, ranges are updated by solving the sequence:
ℓi = min{xi : f̃(x) ≤ f(x̄), x ∈ S}    ui = max{xi : f̃(x) ≤ f(x̄), x ∈ S}
where f̃(x) is a convex under-estimate of f on the current domain. RR can be applied iteratively (i.e., at the end of a complete RR sequence, we might start a new one using the new bounds)
Introduction to Global Optimization – p. 71
generalization
minx∈X f(x) (P)
g(x) ≤ 0
a (non convex) problem; let
minx∈X f̃(x) (R)
g̃(x) ≤ 0
be a convex relaxation of (P):
{x ∈ X : g(x) ≤ 0} ⊆ {x ∈ X : g̃(x) ≤ 0} and
x ∈ X, g̃(x) ≤ 0 ⇒f̃(x) ≤ f(x)
Introduction to Global Optimization – p. 72
R.H.S. perturbation
Let
φ(y) = minx∈X f̃(x) (Ry)
g̃(x) ≤ y
be a perturbation of (R). (R) convex ⇒(Ry) convex for any y. Let x̄ be an optimal solution of (R) and assume that the i–th constraint is active:
g̃i(x̄) = 0
Then, if xy is an optimal solution of (Ry), the constraint g̃i(x) ≤ yi is active at xy whenever yi ≤ 0
Introduction to Global Optimization – p. 73
Duality
Assume (R) has a finite optimum at x with value φ(0) andLagrange multipliers µ. Then the hyperplane
H(y) = φ(0) − µTy
is a supporting hyperplane of the graph of φ(y) at y = 0, i.e.
φ(y) ≥ φ(0) − µTy ∀ y ∈ Rm
Introduction to Global Optimization – p. 74
Main result
If (R) is convex with optimum value φ(0), constraint i is active at the optimum and the corresponding Lagrange multiplier is µi > 0, then, if U is an upper bound for the original problem (P), the constraint
g̃i(x) ≥ −(U − L)/µi
(where L = φ(0)) is valid for the original problem (P), i.e. it does not exclude any feasible solution with value better than U.
Introduction to Global Optimization – p. 75
proof
Problem (Ry) can be seen as a convex relaxation of the perturbed non convex problem
Φ(y) = minx∈X f(x)
g(x) ≤ y
and thus φ(y) ≤ Φ(y); under-estimating (Ry) produces an under-estimate of Φ(y). Let y := ei yi; from duality,
L − µT ei yi ≤ φ(ei yi) ≤ Φ(ei yi)
If yi < 0 then U is an upper bound also for Φ(ei yi), thus L − µi yi ≤ U. But if yi < 0 then constraint i is active: for any feasible x there exists yi < 0 such that g̃i(x) ≤ yi is active, so we may substitute yi with g̃i(x) and deduce L − µi g̃i(x) ≤ U
Introduction to Global Optimization – p. 76
Applications
Range reduction: let x ∈ [ℓ, u] in the convex relaxed problem. If variable xi is at its upper bound in the optimal solution, then we can deduce
xi ≥ max{ℓi, ui − (U − L)/λi}
where λi is the optimal multiplier associated with the i–th upper bound. Analogously, for active lower bounds:
xi ≤ min{ui, ℓi + (U − L)/λi}
Introduction to Global Optimization – p. 77
Let the constraint
aiTx ≤ bi
be active at an optimal solution of the convex relaxation (R). Then we can deduce the valid inequality
aiTx ≥ bi − (U − L)/µi
Introduction to Global Optimization – p. 78
Methods based on “merit functions”
Bayesian algorithm: the objective function is considered as a realization of a stochastic process
f(x) = F(x; ω)
A loss function is defined, e.g.:
L(x1, . . . , xn; ω) = mini=1,...,n F(xi; ω) − minx F(x; ω)
and the next point to sample is placed in order to minimize the expected loss (or risk):
xn+1 = arg min E ( L(x1, . . . , xn, xn+1) | x1, . . . , xn )
= arg min E ( min(F(xn+1; ω), mini F(xi; ω)) − minx F(x; ω) | x1, . . . , xn )
Introduction to Global Optimization – p. 79
Radial basis method
Given k observations (x1, f1), . . . , (xk, fk), an interpolant is built:
s(x) = ∑i=1,...,k λi Φ(‖x − xi‖) + p(x)
p: polynomial of a (prefixed) small degree m. Φ: radial function like, e.g.:
Φ(r) = r linear
Φ(r) = r3 cubic
Φ(r) = r2 log r thin plate spline
Φ(r) = e−γr2 gaussian
The polynomial p is necessary to guarantee the existence of a unique interpolant (i.e., when the matrix Φij = Φ(‖xi − xj‖) alone would be singular)
Introduction to Global Optimization – p. 80
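Fitting such an interpolant amounts to one linear solve. A 1-D sketch with the cubic basis Φ(r) = r³ and a linear polynomial tail; the data points are illustrative, and the extra rows impose the usual orthogonality conditions ∑i λi = 0, ∑i λi xi = 0 that pin down the polynomial:

```python
import numpy as np

def fit_rbf(x, f):
    """Cubic RBF interpolant s(t) = sum_i lam_i |t - x_i|^3 + c0 + c1*t."""
    n = len(x)
    Phi = np.abs(x[:, None] - x[None, :])**3
    P = np.column_stack([np.ones(n), x])        # linear tail [1, t]
    A = np.block([[Phi, P],
                  [P.T, np.zeros((2, 2))]])
    coef = np.linalg.solve(A, np.concatenate([f, np.zeros(2)]))
    lam, c = coef[:n], coef[n:]
    def s(t):
        return np.abs(t - x)**3 @ lam + c[0] + c[1] * t
    return s

xs = np.array([0.0, 1.0, 2.5, 4.0])
fs = np.sin(xs)
s = fit_rbf(xs, fs)
print(np.allclose([s(t) for t in xs], fs))      # interpolates the data
```

For distinct points the augmented system is nonsingular (the cubic kernel is conditionally positive definite of order 2), which is exactly why the polynomial tail is needed.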
“Bumpiness”
Let f⋆k be an estimate of the value of the global optimum after k observations. Let syk be the (unique) interpolant of the data points
(xi, fi), i = 1, . . . , k, and (y, f⋆k)
Idea: the most likely location of y is such that the resulting interpolant has minimum “bumpiness”. Bumpiness measure:
σ(syk) = (−1)m+1 ∑i λi syk(xi)
Introduction to Global Optimization – p. 81
TO BE DONE
Introduction to Global Optimization – p. 82
Stochastic methods
Pure Random Search - random uniform sampling over thefeasible region
Best start: like Pure Random Search, but a local search isstarted from the best observation
Multistart: Local searches started from randomly generatedstarting points
Introduction to Global Optimization – p. 83
[Figure: a multimodal function on [0, 5] with uniformly sampled points and their function values]
Introduction to Global Optimization – p. 84
[Figure: the same sample, with a local search started from the best observation]
Introduction to Global Optimization – p. 85
Clustering methods
Given a uniform sample, evaluate the objective function
Sample Transformation (or concentration): either a fractionof “worst” points are discarded, or a few steps of a gradientmethod are performed
Remaining points are clustered
from the best point in each cluster a single local search isstarted
Introduction to Global Optimization – p. 86
Uniform sample
[Figure: uniform sample over [0, 5]2 with contour lines of the objective (levels −1, −3, −5)]
Introduction to Global Optimization – p. 87
Sample concentration
[Figure: the sample after concentration, i.e. after discarding the worst points and performing a few gradient steps]
Introduction to Global Optimization – p. 88
Clustering
[Figure: the concentrated sample grouped into clusters; the best point of each cluster is marked]
Introduction to Global Optimization – p. 89
Local optimization
[Figure: a single local search started from the best point of each cluster]
Introduction to Global Optimization – p. 90
Clustering: MLSL
Sampling proceeds in batches of N points. Given sample points X1, . . . , Xk ∈ [0, 1]n, label Xj as “clustered” iff ∃Y ∈ {X1, . . . , Xk}:
‖Xj − Y‖ ≤ ∆k := π−1/2 ( σ Γ(1 + n/2) (log k)/k )^(1/n)
and
f(Y) ≤ f(Xj)
Introduction to Global Optimization – p. 91
Simple Linkage
A sequential sample is generated (batches consist of a single observation). A local search is started only from the last sampled point (i.e. there is no “recall”), unless there exists a sufficiently near sampled point with a better function value
Introduction to Global Optimization – p. 92
Smoothing methods
Given f : Rn → R, the Gaussian transform is defined as:
⟨f⟩λ(x) = (1/(πn/2 λn)) ∫Rn f(y) exp(−‖y − x‖2/λ2) dy
When λ is sufficiently large ⇒⟨f⟩λ is convex. Idea: starting with a large enough λ, minimize the smoothed function and slowly decrease λ towards 0.
Introduction to Global Optimization – p. 93
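For simple integrands the transform can be checked against a closed form. In one dimension the kernel has variance λ²/2, so for f(y) = y² the transform is ⟨f⟩λ(x) = x² + λ²/2; a sketch that approximates the integral on a grid:

```python
import math

def gauss_transform(f, x, lam, half_width=10.0, n=20001):
    """1-D Gaussian transform (1/(sqrt(pi)*lam)) * int f(y) exp(-(y-x)^2/lam^2) dy."""
    h = 2.0 * half_width / (n - 1)
    total = 0.0
    for i in range(n):
        y = x - half_width + i * h
        total += f(y) * math.exp(-((y - x) / lam)**2)
    return total * h / (math.sqrt(math.pi) * lam)

val = gauss_transform(lambda y: y * y, x=1.5, lam=2.0)
print(round(val, 4))    # closed form: 1.5^2 + 2^2/2 = 4.25
```

A quadratic stays quadratic under smoothing; for multimodal f the same integral averages away the narrow wells, which is what makes the continuation idea work.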
Smoothing methods
-10-5
05
10 -10
-5
0
5
10
0
0.5
1
1.5
2
2.5
3
Introduction to Global Optimization – p. 94
-10-5
05
10 -10
-5
0
5
10
0
0.5
1
1.5
2
2.5
3
Introduction to Global Optimization – p. 95
-10-5
05
10 -10
-5
0
5
10
0.60.8
11.21.41.61.8
22.22.4
Introduction to Global Optimization – p. 96
[Figure: surface plot of the Gaussian-smoothed function, larger λ]
Introduction to Global Optimization – p. 97
[Figure: surface plot of the Gaussian-smoothed function, largest λ]
2.2
Introduction to Global Optimization – p. 98
Transformed function landscape
Elementary idea: local optimization smooths out many "high-frequency" oscillations
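In one dimension the transformed landscape T(x) = f(local minimum reached from x) can be sketched as follows; the fixed-step descent is a crude illustrative stand-in for a real local optimizer:

```python
def transformed_landscape(f, x, step=1e-3, max_iters=100000):
    # T(x): run a descent from x and return the local minimum value;
    # T is piecewise constant on each basin of attraction
    fx = f(x)
    for _ in range(max_iters):
        if f(x + step) < fx:
            x += step
        elif f(x - step) < fx:
            x -= step
        else:
            break
        fx = f(x)
    return fx
```

Two starting points in the same basin of attraction map to the same transformed value, which is exactly the flattening of high-frequency oscillations described above.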
Introduction to Global Optimization – p. 99
[Figure: one-dimensional multimodal function illustrating the transformed landscape]
Introduction to Global Optimization – p. 100
[Figure: transformed function landscape, continued]
Introduction to Global Optimization – p. 101
[Figure: transformed function landscape, continued]
Introduction to Global Optimization – p. 102
Monotonic Basin-Hopping
k := 0; f⋆ := +∞
while k < MaxIter do
    Xk := random initial solution
    X⋆k := arg min f(x; Xk)   (local minimization started at Xk)
    fk := f(X⋆k)
    if fk < f⋆ then f⋆ := fk
    NoImprove := 0
    while NoImprove < MaxImprove do
        X := random perturbation of Xk
        Y := arg min f(x; X)
        if f(Y) < f⋆ then Xk := Y; NoImprove := 0; f⋆ := f(Y)
        else NoImprove := NoImprove + 1
    end while
    k := k + 1
end while
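A runnable sketch of this scheme; the Rastrigin test function, the crude coordinate-descent local solver, and all parameter values are illustrative assumptions:

```python
import math
import random

def rastrigin(p):
    # Classic multimodal test function; global minimum f = 0 at the origin
    return 20.0 + sum(x * x - 10.0 * math.cos(2.0 * math.pi * x) for x in p)

def local_min(f, p, step=1e-2, iters=10000):
    # Crude coordinate-descent local search, a stand-in for any local solver
    p = list(p)
    fp = f(p)
    for _ in range(iters):
        improved = False
        for i in range(len(p)):
            for d in (step, -step):
                q = list(p)
                q[i] += d
                fq = f(q)
                if fq < fp:
                    p, fp, improved = q, fq, True
                    break
        if not improved:
            break
    return p, fp

def monotonic_basin_hopping(f, dim=2, max_iter=3, max_no_improve=100,
                            pert=0.9, seed=0):
    rng = random.Random(seed)
    f_star, x_star = float("inf"), None
    for _ in range(max_iter):
        xk, fk = local_min(f, [rng.uniform(-5.0, 5.0) for _ in range(dim)])
        if fk < f_star:
            f_star, x_star = fk, xk
        no_improve = 0
        while no_improve < max_no_improve:
            x = [xi + rng.uniform(-pert, pert) for xi in xk]
            y, fy = local_min(f, x)
            if fy < f_star:  # monotonic rule: accept strict improvements only
                xk, x_star, f_star = y, y, fy
                no_improve = 0
            else:
                no_improve += 1
    return x_star, f_star
```

On 2D Rastrigin the monotonic acceptance rule lets the search walk downhill through neighboring local minima toward the global minimum at the origin.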
Introduction to Global Optimization – p. 103
[Figure: monotonic basin-hopping iterations on a one-dimensional multimodal function]
Introduction to Global Optimization – p. 104
[Figure: monotonic basin-hopping iterations, continued]
Introduction to Global Optimization – p. 105
[Figure: monotonic basin-hopping iterations, continued]
Introduction to Global Optimization – p. 106
[Figure: monotonic basin-hopping iterations, continued]
Introduction to Global Optimization – p. 107
[Figure: monotonic basin-hopping iterations, continued]
Introduction to Global Optimization – p. 108
References
In this year's course the global optimization part has been expanded, so some parts on nonlinear optimization may be skipped. Here is an essential reference list for the material covered during the course:
Mokhtar S. Bazaraa, John J. Jarvis and Hanif D. Sherali, Linear Programming and Network Flows, John Wiley & Sons, 1990.
Dimitri P. Bertsekas, Nonlinear Programming, Athena Scientific, 1999.
Jorge Nocedal and Stephen J. Wright, Numerical Optimization, Springer, 2006.
Mohit Tawarmalani and Nikolaos V. Sahinidis, A Polyhedral Branch-and-Cut Approach to Global Optimization, in: Mathematical Programming, volume 103, pages 225–249, 2005.
I.P. Androulakis, C.D. Maranas and C.A. Floudas, αBB: A Global Optimization Method for General Constrained Nonconvex Problems, in: Journal of Global Optimization, volume 7, number 4, pages 337–363, 1995.
A. Rikun, A Convex Envelope Formula for Multilinear Functions, in: Journal of Global Optimization, volume 10, pages 425–437, 1997.
Andrea Grosso, Marco Locatelli and Fabio Schoen, A Population Based Approach for Hard Global Optimization Problems Based on Dissimilarity Measures, in: Mathematical Programming, volume 110, number 2, pages 373–404, 2007.