Nonlinear Programming Models
Fabio Schoen
2008
http://gol.dsi.unifi.it/users/schoen
Nonlinear Programming Models – p. 1
Introduction
Nonlinear Programming Models – p. 2
NLP problems
min f(x)
x ∈ S ⊆ Rn
Standard form:
min f(x)
h_i(x) = 0   i = 1, . . . , m
g_j(x) ≤ 0   j = 1, . . . , k
Here S = {x ∈ R^n : h_i(x) = 0 ∀ i, g_j(x) ≤ 0 ∀ j}
Nonlinear Programming Models – p. 3
Local and global optima
A global minimum or global optimum is any x⋆ ∈ S such that
x ∈ S ⇒ f(x) ≥ f(x⋆)
A point x̄ is a local optimum if ∃ ε > 0 such that
x ∈ S ∩ B(x̄, ε) ⇒ f(x) ≥ f(x̄)
where B(x̄, ε) = {x ∈ R^n : ‖x − x̄‖ ≤ ε} is a ball in R^n.
Any global optimum is also a local optimum, but the opposite is generally false.
Nonlinear Programming Models – p. 4
Convex Functions
A set S ⊆ Rn is convex if
x, y ∈ S⇒λx + (1 − λ)y ∈ S
for all choices of λ ∈ [0, 1]. Let Ω ⊆ R^n be a nonempty convex set. A
function f : Ω → R is convex iff
f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)
for all x, y ∈ Ω, λ ∈ [0, 1]
Nonlinear Programming Models – p. 5
Convex Functions
[figure: the graph of a convex function lies below the chord joining the points above x and y]
Nonlinear Programming Models – p. 6
Properties of convex functions
Every convex function is continuous in the interior of Ω. It might be discontinuous, but only on the frontier.
If f is continuously differentiable, then it is convex iff
f(y) ≥ f(x) + (y − x)^T ∇f(x)
for all x, y ∈ Ω
Nonlinear Programming Models – p. 7
Convex functions
[figure: a differentiable convex function lies above its tangent line at x]
Nonlinear Programming Models – p. 8
If f is twice continuously differentiable ⇒ f is convex iff its Hessian matrix is positive semi-definite:
∇²f(x) := [ ∂²f / ∂x_i ∂x_j ]
Then ∇²f(x) ⪰ 0 iff
v^T ∇²f(x) v ≥ 0   ∀ v ∈ R^n
or, equivalently, all eigenvalues of ∇²f(x) are non negative.
Nonlinear Programming Models – p. 9
Example: an affine function is convex (and concave).
For a quadratic function (Q: symmetric matrix):
f(x) = (1/2) x^T Q x + b^T x + c
we have
∇f(x) = Qx + b,   ∇²f(x) = Q
⇒ f is convex iff Q ⪰ 0
Nonlinear Programming Models – p. 10
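The eigenvalue criterion above is easy to check numerically; a minimal sketch with NumPy (the function name is illustrative):

```python
import numpy as np

def is_convex_quadratic(Q, tol=1e-10):
    """A quadratic f(x) = 0.5 x^T Q x + b^T x + c (Q symmetric) is convex
    iff Q is positive semi-definite, i.e. all eigenvalues of Q are >= 0."""
    Q = (Q + Q.T) / 2  # symmetrize to guard against asymmetric input
    return bool(np.all(np.linalg.eigvalsh(Q) >= -tol))

print(is_convex_quadratic(np.array([[2.0, 0.0], [0.0, 1.0]])))   # True
print(is_convex_quadratic(np.array([[1.0, 0.0], [0.0, -0.5]])))  # False
```

A single negative eigenvalue already makes the quadratic non convex, matching the NP-hardness remark two slides below.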
Convex Optimization Problems
min f(x)
x ∈ S
is a convex optimization problem iff S is a convex set and f is convex on S. For a problem in standard form
min f(x)
h_i(x) = 0   i = 1, . . . , m
g_j(x) ≤ 0   j = 1, . . . , k
if f is convex, the h_i(x) are affine functions and the g_j(x) are convex functions, then the problem is convex.
Nonlinear Programming Models – p. 11
Maximization
Slight abuse of notation: a problem
max f(x)
x ∈ S
is called convex iff S is a convex set and f is a concave function (not to be confused with minimization of a concave function, or maximization of a convex function, which are NOT convex optimization problems)
Nonlinear Programming Models – p. 12
Convex and non convex optimization
Convex optimization “is easy”, non convex optimization is usually very hard.
Fundamental property of convex optimization problems: every local optimum is also a global optimum (a proof will be given later).
Minimizing a positive semidefinite quadratic function on a polyhedron is easy (polynomially solvable); if even a single eigenvalue of the Hessian is negative ⇒ the problem becomes NP–hard.
Nonlinear Programming Models – p. 13
Convex functions: examples
Many (of course not all . . . ) functions are convex!
affine functions aT x + b
quadratic functions (1/2) x^T Q x + b^T x + c with Q = Q^T, Q ⪰ 0
any norm is a convex function
x log x (however log x is concave)
f is convex if and only if ∀ x0, d ∈ R^n its restriction to any line, φ(α) = f(x0 + αd), is a convex function
a linear non negative combination of convex functions is convex
g(x, y) convex in x for all y ⇒ ∫ g(x, y) dy is convex
Nonlinear Programming Models – p. 14
more examples . . .
max_i (a_i^T x + b_i) is convex
f, g convex ⇒ max{f(x), g(x)} is convex
f_a convex for any a ∈ A (a possibly uncountable set) ⇒ sup_{a∈A} f_a(x) is convex
f convex ⇒ f(Ax + b) is convex
let S ⊆ R^n be any set ⇒ f(x) = sup_{s∈S} ‖x − s‖ is convex
Trace(A^T X) = Σ_{i,j} A_ij X_ij is convex (it is linear!)
log det X^{−1} is convex over the set of matrices {X ∈ R^{n×n} : X ≻ 0}
λ_max(X) (the largest eigenvalue of a matrix X)
Nonlinear Programming Models – p. 15
Data Approximation
Nonlinear Programming Models – p. 16
Table of contents
norm approximation
maximum likelihood
robust estimation
Nonlinear Programming Models – p. 17
Norm approximation
Problem:
min_x ‖Ax − b‖
where A, b are parameters. Usually the system is over-determined, i.e. b ∉ Range(A).
For example, this happens when A ∈ R^{m×n} with m > n and A has full rank.
r := Ax − b: the “residual”.
Nonlinear Programming Models – p. 18
Examples
‖r‖ = √(r^T r): least squares (or “regression”)
‖r‖ = √(r^T P r) with P ≻ 0: weighted least squares
‖r‖ = max_i |r_i|: minimax, or ℓ∞, or Tchebichev approximation
‖r‖ = Σ_i |r_i|: absolute or ℓ1 approximation
Possible (convex) additional constraints:
maximum deviation from an initial estimate: ‖x − x_est‖ ≤ ε
simple bounds ℓ_i ≤ x_i ≤ u_i
ordering: x1 ≤ x2 ≤ · · · ≤ xn
Nonlinear Programming Models – p. 19
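The ℓ2 problem is ordinary least squares and the ℓ1 problem is an LP; a sketch with NumPy/SciPy on a random instance with A ∈ R^{100×30} (sizes chosen to match the example matrix; variable names are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n = 100, 30
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# ell_2: ordinary least squares
x2, *_ = np.linalg.lstsq(A, b, rcond=None)

# ell_1: min sum(t) s.t. -t <= Ax - b <= t, an LP in the variables (x, t)
c = np.concatenate([np.zeros(n), np.ones(m)])
A_ub = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])
b_ub = np.concatenate([b, -b])
bounds = [(None, None)] * n + [(0, None)] * m
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
x1 = res.x[:n]

r2, r1 = A @ x2 - b, A @ x1 - b
print(np.abs(r1).sum() <= np.abs(r2).sum())  # ell_1 fit: smaller 1-norm residual
print(r2 @ r2 <= r1 @ r1)                    # ell_2 fit: smaller 2-norm residual
```

Each estimate is optimal in its own norm, which is exactly what the two printed comparisons verify.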
Example: ℓ1 norm
Matrix A ∈ R100×30
[figure: histogram of the ℓ1-norm residuals, residual values on [−5, 5]]
Nonlinear Programming Models – p. 20
ℓ∞ norm
[figure: histogram of the ℓ∞-norm residuals, residual values on [−5, 5]]
Nonlinear Programming Models – p. 21
ℓ2 norm
[figure: histogram of the ℓ2-norm residuals, residual values on [−5, 5]]
Nonlinear Programming Models – p. 22
Variants
min Σ_i h(y_i − a_i^T x) where h is a convex function:
linear–quadratic: h(z) = z² if |z| ≤ 1, h(z) = 2|z| − 1 if |z| > 1
“dead zone”: h(z) = 0 if |z| ≤ 1, h(z) = |z| − 1 if |z| > 1
logarithmic barrier: h(z) = − log(1 − z²) if |z| < 1, h(z) = ∞ if |z| ≥ 1
Nonlinear Programming Models – p. 23
comparison
[figure: comparison of the penalty functions — norm 1, norm 2, linear–quadratic, dead zone, logarithmic barrier — on [−2, 2]]
Nonlinear Programming Models – p. 24
Maximum likelihood
Given a sample X1, X2, . . . , Xk and a parametric family of probability density functions L(·; θ), the maximum likelihood estimate of θ given the sample is
θ̂ = arg max_θ L(X1, . . . , Xk; θ)
Example: linear measurements with additive i.i.d. (independent, identically distributed) noise:
X_i = a_i^T θ + ε_i
where the ε_i are i.i.d. random variables with density p(·):
L(X1, . . . , Xk; θ) = ∏_{i=1..k} p(X_i − a_i^T θ)
Nonlinear Programming Models – p. 25
Max likelihood estimate - MLE
(taking the logarithm, which does not change optimum points):
θ̂ = arg max_θ Σ_i log p(X_i − a_i^T θ)
If p is log–concave ⇒ this problem is convex. Examples:
ε ∼ N(0, σ²), i.e. p(z) = (2πσ²)^{−1/2} exp(−z²/2σ²) ⇒ MLE is the ℓ2 estimate: θ̂ = arg min ‖Aθ − X‖₂;
p(z) = (1/(2a)) exp(−|z|/a) ⇒ ℓ1 estimate: θ̂ = arg min_θ ‖Aθ − X‖₁
Nonlinear Programming Models – p. 26
p(z) = (1/a) exp(−z/a) 1_{z≥0} (negative exponential) ⇒ the estimate can be found solving the LP problem:
min 1^T (X − Aθ)
Aθ ≤ X
p uniform on [−a, a] ⇒ the MLE is any θ such that ‖Aθ − X‖∞ ≤ a
Nonlinear Programming Models – p. 27
Ellipsoids
An ellipsoid is a subset of R^n of the form
E = {x ∈ R^n : (x − x0)^T P^{−1} (x − x0) ≤ 1}
where x0 ∈ R^n is the center of the ellipsoid and P is a symmetric positive-definite matrix. Alternative representations:
E = {x ∈ R^n : ‖Ax − b‖₂ ≤ 1}
where A ≻ 0, or
E = {x ∈ R^n : x = x0 + Au, ‖u‖₂ ≤ 1}
where A is square and non singular (affine transformation of the unit ball)
Nonlinear Programming Models – p. 28
Robust Least Squares
Least Squares: x̂ = arg min √(Σ_i (a_i^T x − b_i)²). Hypothesis: the a_i are not known, but it is known that
a_i ∈ E_i = {ā_i + P_i u : ‖u‖ ≤ 1}
where P_i = P_i^T ⪰ 0. Definition: worst case residuals:
max_{a_i∈E_i} √(Σ_i (a_i^T x − b_i)²)
A robust estimate of x is the solution of
x_r = arg min_x max_{a_i∈E_i} √(Σ_i (a_i^T x − b_i)²)
Nonlinear Programming Models – p. 29
RLS
It holds:
|α + β^T y| ≤ |α| + ‖β‖ ‖y‖
Then, choosing y⋆ = β/‖β‖ if α ≥ 0 and y⋆ = −β/‖β‖ if α < 0, we have ‖y⋆‖ = 1 and
|α + β^T y⋆| = |α + β^T β sign(α)/‖β‖| = |α| + ‖β‖
Then:
max_{a_i∈E_i} |a_i^T x − b_i| = max_{‖u‖≤1} |ā_i^T x − b_i + u^T P_i x|
= |ā_i^T x − b_i| + ‖P_i x‖
Nonlinear Programming Models – p. 30
. . .
Thus the Robust Least Squares problem reduces to
min ( Σ_i (|ā_i^T x − b_i| + ‖P_i x‖)² )^{1/2}
(a convex optimization problem). Transformation:
min_{x,t} ‖t‖₂
|ā_i^T x − b_i| + ‖P_i x‖ ≤ t_i   ∀ i
Nonlinear Programming Models – p. 31
. . .
min_{x,t} ‖t‖₂
ā_i^T x − b_i + ‖P_i x‖ ≤ t_i
−ā_i^T x + b_i + ‖P_i x‖ ≤ t_i
(a Second Order Cone Problem). A norm cone is a convex set
C = {(x, t) ∈ R^{n+1} : ‖x‖ ≤ t}
Nonlinear Programming Models – p. 32
Geometrical Problems
Nonlinear Programming Models – p. 33
Geometrical Problems
projections and distances
polyhedral intersection
extremal volume ellipsoids
classification problems
Nonlinear Programming Models – p. 34
Projection on a set
Given a set C the projection of x on C is defined as:
P_C(x) = arg min_{z∈C} ‖z − x‖
[figure: points outside a convex set and their projections onto it]
Nonlinear Programming Models – p. 35
Projection on a convex set
If
C = {x : Ax = b, f_i(x) ≤ 0, i = 1, . . . , m}
where the f_i are convex ⇒ C is a convex set and the problem
P_C(x) = arg min ‖x − z‖
Az = b
f_i(z) ≤ 0   i = 1, . . . , m
is convex.
Nonlinear Programming Models – p. 36
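For some simple convex sets the projection is even available in closed form; a small sketch (function names are illustrative):

```python
import numpy as np

def project_box(x, lo, hi):
    # Projection onto {z : lo <= z <= hi} in the Euclidean norm
    # decouples per coordinate: clip each component.
    return np.clip(x, lo, hi)

def project_hyperplane(x, a, b):
    # Projection onto {z : a^T z = b}: x - ((a^T x - b) / ||a||^2) a
    return x - (a @ x - b) / (a @ a) * a

x = np.array([2.0, -3.0, 0.5])
print(project_box(x, -1.0, 1.0))        # [ 1.  -1.   0.5]
a, b = np.array([1.0, 1.0, 1.0]), 0.0
z = project_hyperplane(x, a, b)
print(abs(a @ z - b) < 1e-12)           # projected point is feasible
```

For a general C described by constraints, the convex program on the slide must be solved numerically instead.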
Distance between convex sets
dist(C(1), C(2)) = min_{x∈C(1), y∈C(2)} ‖x − y‖
Nonlinear Programming Models – p. 37
Distance between convex sets
If C(j) = {x : A(j)x = b(j), f_i(j)(x) ≤ 0}, then the minimum distance can be found through a convex model:
min ‖x(1) − x(2)‖
A(1)x(1) = b(1)
A(2)x(2) = b(2)
f_i(1)(x(1)) ≤ 0
f_i(2)(x(2)) ≤ 0
Nonlinear Programming Models – p. 38
Polyhedral intersection
1: polyhedra described by means of linear inequalities:
P1 = {x : Ax ≤ b},  P2 = {x : Cx ≤ d}
Nonlinear Programming Models – p. 39
Polyhedral intersection
P1 ∩ P2 = ∅? It is a linear feasibility problem: Ax ≤ b, Cx ≤ d
P1 ⊆ P2? Just check
sup {c_k^T x : Ax ≤ b} ≤ d_k   ∀ k
(solution of a finite number of LP’s)
Nonlinear Programming Models – p. 40
Polyhedral intersection (2)
2: polyhedra (polytopes) described through vertices:
P1 = conv{v1, . . . , vk},  P2 = conv{w1, . . . , wh}
P1 ∩ P2 = ∅? Need to find λ1, . . . , λk, µ1, . . . , µh ≥ 0:
Σ_i λ_i = 1,  Σ_j µ_j = 1
Σ_i λ_i v_i = Σ_j µ_j w_j
P1 ⊆ P2? ∀ i = 1, . . . , k check whether ∃ µ_j ≥ 0:
Σ_j µ_j = 1
Σ_j µ_j w_j = v_i
Nonlinear Programming Models – p. 41
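The vertex-representation intersection test is exactly an LP feasibility problem; a sketch with SciPy (function name and example triangles are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def polytopes_intersect(V, W):
    """V: n x k vertex matrix of P1, W: n x h vertex matrix of P2.
    P1 and P2 intersect iff the system V@lam = W@mu, sum(lam) = 1,
    sum(mu) = 1, lam, mu >= 0 has a solution (an LP with zero objective)."""
    n, k = V.shape
    _, h = W.shape
    A_eq = np.zeros((n + 2, k + h))
    A_eq[:n, :k] = V
    A_eq[:n, k:] = -W
    A_eq[n, :k] = 1.0        # sum of lambda = 1
    A_eq[n + 1, k:] = 1.0    # sum of mu = 1
    b_eq = np.concatenate([np.zeros(n), [1.0, 1.0]])
    res = linprog(np.zeros(k + h), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (k + h))
    return res.status == 0   # 0 = feasible/optimal, 2 = infeasible

tri1 = np.array([[0.0, 2.0, 0.0], [0.0, 0.0, 2.0]])  # triangle near the origin
tri2 = np.array([[1.0, 3.0, 1.0], [1.0, 1.0, 3.0]])  # touches tri1 at (1, 1)
tri3 = np.array([[5.0, 6.0, 5.0], [5.0, 5.0, 6.0]])  # far away
print(polytopes_intersect(tri1, tri2))  # True
print(polytopes_intersect(tri1, tri3))  # False
```

The containment test P1 ⊆ P2 works the same way, one such LP per vertex v_i.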
Minimal ellipsoid containing k points
Given v1, . . . , vk ∈ Rn find an ellipsoid
E = {x : ‖Ax − b‖ ≤ 1}
with minimal volume containing the k given points.
[figure: a minimal-volume ellipsoid enclosing a scatter of points]
Nonlinear Programming Models – p. 42
A = A^T ≻ 0. The volume of E is proportional to det A^{−1} ⇒ convex optimization problem (in the unknowns A, b):
min log det A^{−1}
A = A^T, A ≻ 0
‖A v_i − b‖ ≤ 1   i = 1, . . . , k
Nonlinear Programming Models – p. 43
Max. ellipsoid contained in a polyhedron
Given P = {x : Ax ≤ b} find an ellipsoid:
E = {By + d : ‖y‖ ≤ 1}
contained in P with maximum volume.
Nonlinear Programming Models – p. 44
Max. ellipsoid contained in a polyhedron
E ⊆ P ⇔ a_i^T (By + d) ≤ b_i   ∀ y : ‖y‖ ≤ 1
⇔ sup_{‖y‖≤1} a_i^T By + a_i^T d ≤ b_i   ∀ i
⇔ ‖B a_i‖ + a_i^T d ≤ b_i
max_{B,d} log det B
B = B^T ≻ 0
‖B a_i‖ + a_i^T d ≤ b_i   i = 1, . . .
Nonlinear Programming Models – p. 45
Difficult variants
These problems are hard:
find a maximal volume ellipsoid contained in a polyhedrongiven by its vertices
[figure: the vertices of a polytope and a maximal inscribed ellipsoid]
Nonlinear Programming Models – p. 46
find a minimal volume ellipsoid containing a polyhedrondescribed as a system of linear inequalities.
Nonlinear Programming Models – p. 47
It is already a difficult problem to decide whether a given ellipsoid E contains a polyhedron P = {x : Ax ≤ b}. This problem is still difficult even when the ellipsoid is a sphere: it is equivalent to norm maximization over a polyhedron – an NP–hard concave optimization problem.
Nonlinear Programming Models – p. 48
Linear classification (separation)
[figure: two point clouds separated by a hyperplane]
Nonlinear Programming Models – p. 49
Given two point sets X1, . . . , Xk and Y1, . . . , Yh, find a hyperplane a^T x = t such that:
a^T X_i ≥ 1   i = 1, . . . , k
a^T Y_j ≤ 1   j = 1, . . . , h
(an LP feasibility problem).
Nonlinear Programming Models – p. 50
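The separation problem can be posed as an LP feasibility problem; a sketch with SciPy, using the common ±1-margin normalization around a threshold t (an assumption of this sketch, not taken verbatim from the slide; function name and points are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def separate(X, Y):
    """Find (a, t) with a^T x >= t + 1 for all rows of X and
    a^T y <= t - 1 for all rows of Y, as an LP feasibility problem.
    Returns (a, t), or None if no separator with this margin exists."""
    k, n = X.shape
    h, _ = Y.shape
    # variables z = (a, t); constraints written as A_ub @ z <= b_ub
    A_ub = np.vstack([np.hstack([-X, np.ones((k, 1))]),    # -a^T Xi + t <= -1
                      np.hstack([Y, -np.ones((h, 1))])])   #  a^T Yj - t <= -1
    b_ub = -np.ones(k + h)
    res = linprog(np.zeros(n + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (n + 1))
    return (res.x[:n], res.x[n]) if res.status == 0 else None

X = np.array([[2.0, 2.0], [3.0, 2.5]])
Y = np.array([[0.0, 0.0], [-1.0, 0.5]])
a, t = separate(X, Y)
print(np.all(X @ a >= t + 1 - 1e-6) and np.all(Y @ a <= t - 1 + 1e-6))  # True
```

The zero objective means any feasible point is acceptable; the robust variant on the next slides instead maximizes the separation margin.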
Robust separation
[figure: two point clouds separated by a hyperplane with maximal margin]
Nonlinear Programming Models – p. 51
Robust separation
Find a “maximal” separation:
max_{a:‖a‖≤1} ( min_i a^T X_i − max_j a^T Y_j )
equivalent to the convex problem:
max t1 − t2
aT Xi ≥ t1 ∀ i
aT Yj ≤ t2 ∀ j
‖a‖ ≤ 1
Nonlinear Programming Models – p. 52
Optimality Conditions
Fabio Schoen
2008
http://gol.dsi.unifi.it/users/schoen
Optimality Conditions – p. 1
Optimality Conditions: descent directions
Let S ⊆ Rn be a convex set and consider the problem
minx∈S
f(x)
where f : S → R. Let x1, x2 ∈ S and d = x2 − x1: d is a feasible direction.
If there exists ε̄ > 0 such that f(x1 + εd) < f(x1) ∀ ε ∈ (0, ε̄), d is called a descent direction at x1.
Elementary necessary optimality condition: if x⋆ is a local optimum, no descent direction may exist at x⋆
Optimality Conditions – p. 2
Optimality Conditions for Convex Sets
If x⋆ ∈ S is a local optimum for f() and there exists aneighborhood U(x⋆) such that f ∈ C1(U(x⋆)), then
dT∇f(x⋆) ≥ 0 ∀ d : feasible direction
Optimality Conditions – p. 3
proof
Taylor expansion:
f(x⋆ + εd) = f(x⋆) + ε d^T∇f(x⋆) + o(ε)
d cannot be a descent direction, so, if ε is sufficiently small, then f(x⋆ + εd) ≥ f(x⋆). Thus
ε d^T∇f(x⋆) + o(ε) ≥ 0
and dividing by ε,
d^T∇f(x⋆) + o(ε)/ε ≥ 0
Letting ε ↓ 0 the proof is complete.
Optimality Conditions – p. 5
Optimality Conditions: tangent cone
General case:
min f(x)
gi(x) ≤ 0 i = 1, . . . ,m
x ∈ X (X : open set)
Let S = {x ∈ X : g_i(x) ≤ 0, i = 1, . . . , m}.
Tangent cone to S at x: T(x) = { d ∈ R^n :
d/‖d‖ = lim_{x_k→x} (x_k − x)/‖x_k − x‖ }
where x_k ∈ S.
Optimality Conditions – p. 6
[figure: a feasible sequence approaching a boundary point and the resulting tangent cone]
Optimality Conditions – p. 7
Some examples
S = R^n ⇒ T(x) = R^n ∀ x
S = {Ax = b} ⇒ T(x) = {d : Ad = 0}
S = {Ax ≤ b}; let I be the set of active constraints at x:
a_i^T x = b_i   i ∈ I
a_i^T x < b_i   i ∉ I.
Optimality Conditions – p. 8
Let d = lim_k (x_k − x)/‖x_k − x‖ ⇒
a_i^T d = a_i^T lim_k (x_k − x)/‖x_k − x‖   i ∈ I
= lim_k a_i^T (x_k − x)/‖x_k − x‖
= lim_k (a_i^T x_k − b_i)/‖x_k − x‖
≤ 0
Thus if d ∈ T(x) ⇒ a_i^T d ≤ 0 for i ∈ I.
Optimality Conditions – p. 10
Vice versa, let x_k = x + α_k d. If a_i^T d ≤ 0 for i ∈ I ⇒
a_i^T x_k = a_i^T (x + α_k d)   i ∈ I
= b_i + α_k a_i^T d
≤ b_i
a_i^T x_k = a_i^T (x + α_k d)   i ∉ I
< b_i   if α_k is small enough
Thus
T(x) = {d : a_i^T d ≤ 0 ∀ i ∈ I}
Optimality Conditions – p. 11
Example
Let S = {(x, y) ∈ R² : x² − y = 0} (a parabola).
Tangent cone at (0, 0)? Let (x_k, y_k) → (0, 0), i.e. x_k → 0, y_k = x_k²:
‖(x_k, y_k) − (0, 0)‖ = √(x_k² + x_k⁴) = |x_k| √(1 + x_k²)
and
lim_{x_k→0+} x_k / (|x_k| √(1 + x_k²)) = 1,   lim_{x_k→0+} y_k / (|x_k| √(1 + x_k²)) = 0
lim_{x_k→0−} x_k / (|x_k| √(1 + x_k²)) = −1,  lim_{x_k→0−} y_k / (|x_k| √(1 + x_k²)) = 0
thus T(0, 0) is generated by the directions (−1, 0) and (1, 0)
Optimality Conditions – p. 12
Descent direction
d ∈ R^n is a feasible direction at x ∈ S if ∃ ᾱ > 0 :
x + αd ∈ S   ∀ α ∈ [0, ᾱ).
d feasible ⇒ d ∈ T(x), but in general the converse is false. If
f(x + αd) ≤ f(x)   ∀ α ∈ (0, ᾱ)
then d is a descent direction
Optimality Conditions – p. 13
I order necessary opt condition
Let x̄ ∈ S ⊆ R^n be a local optimum for min_{x∈S} f(x); let f ∈ C¹(U(x̄)). Then
d^T∇f(x̄) ≥ 0   ∀ d ∈ T(x̄)
Proof: d = lim_k (x_k − x̄)/‖x_k − x̄‖. Taylor expansion:
f(x_k) = f(x̄) + ∇^T f(x̄)(x_k − x̄) + o(‖x_k − x̄‖)
= f(x̄) + ∇^T f(x̄)(x_k − x̄) + ‖x_k − x̄‖ o(1).
x̄ local optimum ⇒ ∃ U(x̄) : f(x) ≥ f(x̄) ∀ x ∈ U(x̄) ∩ S.
Optimality Conditions – p. 14
. . .
If k is large enough, x_k ∈ U(x̄):
f(x_k) − f(x̄) ≥ 0
thus
∇^T f(x̄)(x_k − x̄) + ‖x_k − x̄‖ o(1) ≥ 0
Dividing by ‖x_k − x̄‖:
∇^T f(x̄)(x_k − x̄)/‖x_k − x̄‖ + o(1) ≥ 0
and in the limit ∇^T f(x̄) d ≥ 0.
Optimality Conditions – p. 15
Examples
Unconstrained problems
Every d ∈ R^n belongs to the tangent cone ⇒ at a local optimum
∇^T f(x̄) d ≥ 0   ∀ d ∈ R^n
Choosing d = e_i and d = −e_i we get
∇f(x̄) = 0
NB: the same is true if x̄ is a local minimum in the relative interior of the feasible region.
Optimality Conditions – p. 16
Linear equality constraints
min f(x)
Ax = b
Tangent cone: {d : Ad = 0}. Necessary conditions:
∇^T f(x̄) d ≥ 0   ∀ d : Ad = 0
equivalent statement:
min_d ∇^T f(x̄) d = 0
Ad = 0
(a linear program).
Optimality Conditions – p. 17
Linear equality constraints
From LP duality ⇒
max 0^T λ = 0
A^T λ = ∇f(x̄)
Thus at a local minimum point there exist Lagrange multipliers:
∃ λ : A^T λ = ∇f(x̄)
Optimality Conditions – p. 18
Linear inequalities
min f(x)
Ax ≤ b
Tangent cone at a local minimum x̄: {d ∈ R^n : a_i^T d ≤ 0 ∀ i ∈ I(x̄)}. Let A_I be the rows of A associated to the active constraints at x̄. Then
min_d ∇^T f(x̄) d = 0
A_I d ≤ 0
Optimality Conditions – p. 19
Linear inequalities
From LP duality:
max 0^T λ = 0
A_I^T λ = ∇f(x̄)
λ ≤ 0
Thus, at a local optimum, the gradient is a non positive linear combination of the coefficients of the active constraints.
Optimality Conditions – p. 20
Farkas’ Lemma
Let A be a matrix in R^{m×n} and b ∈ R^m. One and only one of the following systems:
A^T y ≤ 0,  b^T y > 0
and
Ax = b,  x ≥ 0
is non empty
Optimality Conditions – p. 21
Geometrical interpretation
[figure: Farkas’ lemma — either b lies in the cone {z : z = Ax, x ≥ 0} generated by the columns a1, a2 of A, or some y with A^T y ≤ 0, b^T y > 0 separates b from that cone]
Optimality Conditions – p. 22
Proof
1) If ∃ x ≥ 0 : Ax = b ⇒ b^T y = x^T A^T y. Thus if A^T y ≤ 0 ⇒ b^T y ≤ 0.
2) Premise: separating hyperplane theorem. Let C and D be two nonempty convex sets with C ∩ D = ∅. Then there exist a ≠ 0 and b:
a^T x ≤ b   x ∈ C
a^T x ≥ b   x ∈ D
If C is a point and D is a closed convex set, the separation is strict, i.e.
a^T C < b
a^T x > b   x ∈ D
Optimality Conditions – p. 23
Farkas’ Lemma (proof)
2) Let {x : Ax = b, x ≥ 0} = ∅. Let
S = {y ∈ R^m : ∃ x ≥ 0, Ax = y}
S is closed, convex and b ∉ S. From the separating hyperplane theorem: ∃ α ∈ R^m, α ≠ 0, and β ∈ R:
α^T y ≤ β   ∀ y ∈ S
α^T b > β
0 ∈ S ⇒ β ≥ 0 ⇒ α^T b > 0; α^T Ax ≤ β for all x ≥ 0. This is possible iff α^T A ≤ 0. Letting y = α we obtain a solution of
A^T y ≤ 0,  b^T y > 0
Optimality Conditions – p. 24
First order feasible variations cone
G(x) = {d ∈ R^n : ∇^T g_i(x) d ≤ 0,  i ∈ I}
[figure: the tangent cone and the first order feasible variations cone at a boundary point]
Optimality Conditions – p. 25
First order variations
G(x) ⊇ T(x). In fact, if x_k is feasible and
d = lim_k (x_k − x)/‖x_k − x‖
then g_i(x) ≤ 0 and
g(x + lim_k (x_k − x)) ≤ 0
Optimality Conditions – p. 26
. . .
g(x + lim_k ‖x_k − x‖ (x_k − x)/‖x_k − x‖) ≤ 0
g(x + lim_k ‖x_k − x‖ lim (x_k − x)/‖x_k − x‖) ≤ 0
g(x + lim_k ‖x_k − x‖ d) ≤ 0
Let α_k = ‖x_k − x‖; if α_k ≈ 0:
g(x + α_k d) ≤ 0
Optimality Conditions – p. 27
g_i(x + α_k d) = g_i(x) + α_k ∇^T g_i(x) d + o(α_k)
where α_k > 0 and d belongs to the tangent cone T(x). If the i–th constraint is active, then
g_i(x + α_k d) = α_k ∇^T g_i(x) d + o(α_k) ≤ 0
g_i(x + α_k d)/α_k = ∇^T g_i(x) d + o(α_k)/α_k ≤ 0
Letting α_k → 0 the result is obtained.
Optimality Conditions – p. 28
example
G(x) ≠ T(x) for the feasible set:
−x³ + y ≤ 0
−y ≤ 0
Optimality Conditions – p. 29
KKT necessary conditions
(Karush–Kuhn–Tucker) Let x̄ ∈ X ⊆ R^n, X ≠ ∅, be a local optimum for
min f(x)
g_i(x) ≤ 0   i = 1, . . . , m
x ∈ X
I: indices of the active constraints at x̄. If:
1. f(x), g_i(x) ∈ C¹(x̄) for i ∈ I
2. “constraint qualification” conditions hold at x̄: T(x̄) = G(x̄);
then there exist Lagrange multipliers λ_i ≥ 0, i ∈ I:
∇f(x̄) + Σ_{i∈I} λ_i ∇g_i(x̄) = 0.
Optimality Conditions – p. 30
Proof
x̄ local optimum ⇒ if d ∈ T(x̄) ⇒ d^T∇f(x̄) ≥ 0. But d ∈ T(x̄) ⇒
d^T∇g_i(x̄) ≤ 0   i ∈ I.
Thus it is impossible that
−∇^T f(x̄) d > 0
∇^T g_i(x̄) d ≤ 0   i ∈ I
From Farkas’ Lemma ⇒ there exists a solution of:
Σ_{i∈I} λ_i ∇g_i(x̄) = −∇f(x̄)
λ_i ≥ 0   i ∈ I
Optimality Conditions – p. 31
Constraint qualifications: examples
polyhedra: X = R^n and the g_i(x) are affine functions: Ax ≤ b
linear independence: X open set, g_i(x), i ∉ I, continuous at x̄, and ∇g_i(x̄), i ∈ I, linearly independent
Slater condition: X open set, g_i(x), i ∈ I, convex differentiable functions at x̄, g_i(x), i ∉ I, continuous at x̄, and ∃ x̂ ∈ X strictly feasible:
g_i(x̂) < 0   i ∈ I.
Optimality Conditions – p. 32
Convex problems
An optimization problem
minx∈S
f(x)
is a convex problem if
S is a convex set, i.e.
x, y ∈ S⇒λx + (1 − λ)y ∈ S
∀λ ∈ [0, 1]
f is a convex function on S, i.e.
f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)
∀λ ∈ [0, 1] and x, y ∈ S
Optimality Conditions – p. 33
Standard convex problem
min f(x)
g_i(x) ≤ 0   i = 1, . . . , m
h_j(x) = 0   j = 1, . . . , k
if
f is convex
gi are convex
hj are affine (i.e. of the form αT x + β)
then the problem is convex.
Optimality Conditions – p. 34
Convex problems
Every local optimum is a global one.
Proof: let x̄ be a local optimum for min_S f(x) and x⋆ a global optimum. S convex ⇒ λx⋆ + (1 − λ)x̄ ∈ S. Thus if λ ≈ 0 ⇒
f(x̄) ≤ f(λx⋆ + (1 − λ)x̄)
≤ λf(x⋆) + (1 − λ)f(x̄)
⇒ f(x̄) ≤ f(x⋆)
and x̄ is also a global optimum.
Optimality Conditions – p. 35
Sufficiency of 1st order conditions
For a convex differentiable problem: if d^T∇f(x̄) ≥ 0 ∀ d ∈ T(x̄), then x̄ is a (global) optimum.
Proof:
f(y) ≥ f(x̄) + (y − x̄)^T∇f(x̄)   ∀ y ∈ S
But y − x̄ ∈ T(x̄) ⇒
f(y) ≥ f(x̄) + d^T∇f(x̄) ≥ f(x̄)   ∀ y ∈ S
thus x̄ is a global minimum.
Optimality Conditions – p. 36
Convexity of the set of global optima
(for convex problems) The set of global minima of a convex problem is a convex set. In fact, let x and y be global minima for the convex problem
min_{x∈S} f(x)
Then, choosing λ ∈ [0, 1], we have λx + (1 − λ)y ∈ S, as S is convex. Moreover
f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) = λf⋆ + (1 − λ)f⋆ = f⋆
where f⋆ is the global minimum value. Thus equality holds and the proof is complete.
Optimality Conditions – p. 37
KKT for equality constraints
x̄: local optimum for
min f(x)
g_i(x) ≤ 0   i = 1, . . . , m
h_j(x) = 0   j = 1, . . . , k
x ∈ X ⊆ R^n
Let I be the set of active inequalities at x̄. If f(x), g_i(x), i ∈ I, h_j(x) ∈ C¹ and “constraint qualifications” hold at x̄, then ∃ λ_i ≥ 0 ∀ i ∈ I and µ_j ∈ R ∀ j = 1, . . . , k:
∇f(x̄) + Σ_{i∈I} λ_i ∇g_i(x̄) + Σ_{j=1..k} µ_j ∇h_j(x̄) = 0
Optimality Conditions – p. 38
Complementarity
KKT equivalent formulation:
∇f(x̄) + Σ_{i=1..m} λ_i ∇g_i(x̄) + Σ_{j=1..k} µ_j ∇h_j(x̄) = 0
λ_i g_i(x̄) = 0   i = 1, . . . , m
The condition λ_i g_i(x̄) = 0 is called the complementarity condition
Optimality Conditions – p. 39
II order necessary conditions
If f, g_i, h_j ∈ C² at x̄ and the gradients of the active constraints at x̄ are linearly independent, then there exist multipliers λ_i ≥ 0, i ∈ I, and µ_j, j = 1, . . . , k, such that
∇f(x̄) + Σ_{i∈I} λ_i ∇g_i(x̄) + Σ_{j=1..k} µ_j ∇h_j(x̄) = 0
and
d^T ∇²L(x̄) d ≥ 0
for every direction d: d^T∇g_i(x̄) ≤ 0, d^T∇h_j(x̄) = 0, where
∇²L(x̄) := ∇²f(x̄) + Σ_{i∈I} λ_i ∇²g_i(x̄) + Σ_{j=1..k} µ_j ∇²h_j(x̄)
Optimality Conditions – p. 40
Sufficient conditions
Let f, g_i, h_j be twice continuously differentiable. Let x⋆, λ⋆, µ⋆ satisfy:
∇f(x⋆) + Σ_{i∈I} λ_i⋆ ∇g_i(x⋆) + Σ_{j=1..k} µ_j⋆ ∇h_j(x⋆) = 0
λ_i⋆ g_i(x⋆) = 0
λ_i⋆ ≥ 0
d^T ∇²L(x⋆) d > 0   ∀ d : d^T∇h_j(x⋆) = 0, d^T∇g_i(x⋆) = 0, i ∈ I
then x⋆ is a local minimum.
Optimality Conditions – p. 41
Lagrange Duality
Problem:
f⋆ = min f(x)
g_i(x) ≤ 0
x ∈ X
Definition: Lagrange function:
L(x; λ) = f(x) + Σ_i λ_i g_i(x),   λ ≥ 0, x ∈ X
Optimality Conditions – p. 42
Relaxation
Given an optimization problem
minx∈S
f(x)
a relaxation is a problem
minx∈Q
g(x)
where
S ⊆ Q
g(x) ≤ f(x) ∀x ∈ S.
Weak Duality : The optimal value of a relaxation is a lowerbound on the optimum value of the problem.
Optimality Conditions – p. 43
Lagrange minimization is a relaxation
Proof:
the feasible set of the Lagrange problem is X (it contains the original one)
if g(x) ≤ 0 and λ ≥ 0 ⇒
L(x, λ) = f(x) + λ^T g(x) ≤ f(x)
Optimality Conditions – p. 44
Dual Lagrange function
with respect to the constraints g(x) ≤ 0:
θ(λ) = inf_{x∈X} L(x, λ) = inf_{x∈X} (f(x) + λ^T g(x))
For every choice of λ ≥ 0, θ(λ) is a lower bound on the value of every feasible solution and, in particular, a lower bound on the global minimum value of the problem.
Optimality Conditions – p. 45
Example (circle packing)
min −r
4r² − (x_i − x_j)² − (y_i − y_j)² ≤ 0   1 ≤ i < j ≤ N
x_i, y_i ≤ 1   i = 1, . . . , N
−x_i, −y_i ≤ 0   i = 1, . . . , N
Optimality Conditions – p. 46
When N = 2, relaxing the first constraint:
θ(λ) = min_{x,y,r} −r + λ(4r² − (x1 − x2)² − (y1 − y2)²)
x1, x2, y1, y2 ≥ 0
x1, x2, y1, y2 ≤ 1
Optimality Conditions – p. 47
solution
Minimizing with respect to x, y ⇒ |x1 − x2| = |y1 − y2| = 1, from which
θ(λ) = min_r −r + 4λr² − 2λ
r = 1/(8λ)
θ(λ) = −2λ − 1/(16λ)
This is a lower bound on the optimum value. Best possible lower bound:
θ⋆ = max_λ θ(λ)
λ⋆ = 1/(4√2),   θ⋆ = −√2/2
Optimality Conditions – p. 48
Choosing (x1, y1) = (0, 0) and (x2, y2) = (1, 1) a feasible solutionwith r =
√2/2 is obtained.
The Lagrange dual gives a lower bound equal to −√
2/2: sameas the objective function at a feasible solution ⇒optimalsolution!(an exception, not the rule!)
Optimality Conditions – p. 49
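The dual bound for the N = 2 case can be checked numerically; a small sketch maximizing θ over λ > 0 with SciPy:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Dual function for the N = 2 circle problem after eliminating x, y, r:
# theta(lambda) = -2*lambda - 1/(16*lambda), for lambda > 0.
theta = lambda lam: -2 * lam - 1 / (16 * lam)

# Maximize theta (i.e. minimize -theta) over a bracket for lambda > 0.
res = minimize_scalar(lambda lam: -theta(lam), bounds=(1e-6, 10.0),
                      method="bounded")
lam_star = res.x
print(abs(lam_star - 1 / (4 * np.sqrt(2))) < 1e-4)       # True
print(abs(theta(lam_star) - (-np.sqrt(2) / 2)) < 1e-6)   # True
# The feasible point (0,0), (1,1) with r = sqrt(2)/2 has objective
# -r = -sqrt(2)/2, matching the dual bound, hence it is optimal.
```

Matching primal and dual values certify optimality here, the exceptional case noted on the slide.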
Lagrange Dual
θ⋆ = max θ(λ)
λ ≥ 0
This problem might:
1. be unbounded
2. have a finite sup but no max
3. have a unique maximum attained in correspondence with asingle solution x
4. have many different maxima, each connected with adifferent solution x
Optimality Conditions – p. 50
Equality constraints
f ⋆ = min f(x)
gi(x) ≤ 0 i = 1, . . . ,m
hj(x) = 0 j = 1, . . . , k
x ∈ X
Lagrange function:
L(x; λ, µ) = f(x) + λT g(x) + µT h(x)
where λ ≥ 0, but µ is free.
Optimality Conditions – p. 51
Linear Programming
min cT x
Ax ≤ b
Dual Lagrange function:
θ(λ) = min_x c^T x + λ^T (Ax − b)
= −λ^T b + min_x (c^T + λ^T A) x.
but:
min_x (c^T + λ^T A) x = 0 if c^T + λ^T A = 0, −∞ otherwise.
Optimality Conditions – p. 52
. . .
Lagrange dual function:
θ(λ) = −λ^T b if c^T + λ^T A = 0, −∞ otherwise.
Lagrange dual:
max −λ^T b
λ^T A + c^T = 0
λ ≥ 0
which is equivalent to:
max λ^T b
λ^T A = c^T
λ ≤ 0
Optimality Conditions – p. 53
Quadratic Programming (QP)
min (1/2) x^T Q x + c^T x
Ax = b
(Q: symmetric). Lagrange dual function:
θ(λ) = min_x (1/2) x^T Q x + c^T x + λ^T (Ax − b)
= −λ^T b + min_x (1/2) x^T Q x + (c^T + λ^T A) x
Optimality Conditions – p. 54
QP – Case 1
Q has at least one negative eigenvalue ⇒
min_x (1/2) x^T Q x + (c^T + λ^T A) x = −∞
In fact ∃ d : d^T Q d < 0. Choosing x = αd with α > 0 ⇒
(1/2) x^T Q x + (c^T + λ^T A) x = (1/2) α² d^T Q d + α (c^T + λ^T A) d
and for large values of α this can be made as small as desired.
Optimality Conditions – p. 55
QP – Case 2
Q positive definite ⇒ minimum point of the inner problem of the dual Lagrange function:
Qx + (c + A^T λ) = 0
i.e.
x = −Q^{−1} (c + A^T λ)
Optimality Conditions – p. 56
. . .
Lagrange function value:
θ(λ) = −λ^T b + (1/2) x^T Q x + (c^T + λ^T A) x
= −λ^T b + (1/2)(c + A^T λ)^T Q^{−1} Q Q^{−1} (c + A^T λ) − (c^T + λ^T A) Q^{−1} (c + A^T λ)
= −λ^T b + (1/2)(c + A^T λ)^T Q^{−1} (c + A^T λ) − (c^T + λ^T A) Q^{−1} (c + A^T λ)
= −λ^T b − (1/2)(c + A^T λ)^T Q^{−1} (c + A^T λ)
Optimality Conditions – p. 57
. . .
Lagrange dual (seen as a min problem):
min_λ λ^T b + (1/2)(c + A^T λ)^T Q^{−1} (c + A^T λ)
Optimality conditions:
b + A Q^{−1} (c + A^T λ) = 0
But recalling that x = −Q^{−1} (c + A^T λ) ⇒
b − Ax = 0: feasibility of x
⇒ if we find the optimal multipliers λ (a linear system) ⇒ we get the optimal solution x (thanks to feasibility and weak duality)!
Optimality Conditions – p. 58
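The observation above — optimal multipliers come from a linear system — suggests solving the equality-constrained QP directly from its KKT system; a sketch assuming Q ≻ 0 (function name and data are illustrative):

```python
import numpy as np

def eq_qp(Q, c, A, b):
    """Solve min 0.5 x^T Q x + c^T x  s.t.  Ax = b  (Q positive definite)
    via the KKT linear system
        [Q  A^T] [x     ]   [-c]
        [A   0 ] [lambda] = [ b]
    combining stationarity Qx + c + A^T lambda = 0 with feasibility Ax = b."""
    n, m = Q.shape[0], A.shape[0]
    K = np.block([[Q, A.T], [A, np.zeros((m, m))]])
    sol = np.linalg.solve(K, np.concatenate([-c, b]))
    return sol[:n], sol[n:]

Q = np.array([[2.0, 0.0], [0.0, 2.0]])
c = np.zeros(2)
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
x, lam = eq_qp(Q, c, A, b)   # min ||x||^2 s.t. x1 + x2 = 1
print(x)                     # [0.5 0.5]
print(np.allclose(A @ x, b)) # True
```

For larger problems one would factor the KKT matrix or eliminate x as on the slides, but the one-shot solve shows the structure.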
Properties of the Lagrange dual
For any problem
f ⋆ = min f(x)
gi(x) ≤ 0 i = 1, . . . ,m
x ∈ X
where X is non empty and compact, if f and the g_i are continuous then the Lagrange dual function is concave
Optimality Conditions – p. 59
Proof
From the Weierstrass theorem
θ(λ) = min_{x∈X} f(x) + λ^T g(x)
exists and is finite.
θ(ηa + (1 − η)b) = min_{x∈X} (f(x) + (ηa + (1 − η)b)^T g(x))
= min_{x∈X} (η(f(x) + a^T g(x)) + (1 − η)(f(x) + b^T g(x)))
≥ η min_{x∈X} (f(x) + a^T g(x)) + (1 − η) min_{x∈X} (f(x) + b^T g(x))
= ηθ(a) + (1 − η)θ(b).
Optimality Conditions – p. 60
Solution of the Lagrange dual
max_λ θ(λ) = max_λ min_{x∈X} (f(x) + λ^T g(x))
is equivalent to
max z
z ≤ f(x) + λ^T g(x)   ∀ x ∈ X
λ ≥ 0
After having computed f and g at x1, x2, . . . , xk, a restricted dual can be defined:
max z
z ≤ f(x_j) + λ^T g(x_j)   ∀ j = 1, . . . , k
λ ≥ 0
Optimality Conditions – p. 61
. . .
Let λ̄ be the optimal solution of the restricted dual. Is it an optimal dual solution? Is it true that z̄ ≤ f(x) + λ̄^T g(x) for all x ∈ X? Check: we look for x̄, an optimal solution of
min_{x∈X} f(x) + λ̄^T g(x)
if f(x̄) + λ̄^T g(x̄) ≥ z̄ then we have found the optimal solution of the dual;
otherwise the pair (x̄, f(x̄)) is added to the restricted dual and a new solution is computed.
Optimality Conditions – p. 62
Geometric programming
Unconstrained geometric program:
min_{x>0} Σ_{k=1..m} c_k ∏_{j=1..n} x_j^{α_kj},   α_kj ∈ R, c_k > 0
(non convex). Variable substitution:
x_j = exp(y_j),   y_j ∈ R
Optimality Conditions – p. 63
Transformed problem:
min_y Σ_{k=1..m} ( c_k ∏_{j=1..n} e^{α_kj y_j} ) = min_y Σ_{k=1..m} e^{α_k^T y + β_k},   β_k = log c_k
still non convex, but its logarithm is convex.
Optimality Conditions – p. 64
Duality example
Dual of
min f(x) = min log Σ_{k=1..m} exp(α_k^T x + β_k)
No constraints ⇒ the dual Lagrange function is identical to f(x)! Strong duality holds, but is useless. Simple transformation:
min log Σ_{k=1..m} exp y_k
y_k = α_k^T x + β_k
Optimality Conditions – p. 65
solving the dual
Dual function
L(λ) = min_{x,y} log Σ_{k=1..m} exp y_k + λ^T (Ax + β − y)
Minimization in x is unconstrained: min λ^T Ax ⇒ if λ^T A ≠ 0, L(λ) is unbounded;
if λ^T A = 0 then
L(λ) = min_y log Σ_{k=1..m} exp y_k + λ^T (β − y)
Optimality Conditions – p. 66
First order (unconstrained) optimality conditions w.r.t. y_i:
exp y_i / Σ_k exp y_k − λ_i = 0
⇒ Lagrange multipliers exist provided that
Σ_i λ_i = 1,   λ_i > 0 ∀ i
Optimality Conditions – p. 67
Substituting λ_j = exp y_j / Σ_k exp y_k,
L(λ) = log Σ_j exp y_j − Σ_j λ_j y_j
= log Σ_j exp y_j − Σ_j y_j exp y_j / Σ_k exp y_k
= (1 / Σ_k exp y_k) Σ_k exp y_k (log Σ_j exp y_j − y_k)
= Σ_k ( exp y_k / Σ_j exp y_j ) (log Σ_j exp y_j − y_k)
= −Σ_k λ_k log λ_k
Optimality Conditions – p. 68
Lagrange Dual
The Lagrange Dual becomes:
max_λ β^T λ − Σ_k λ_k log λ_k
Σ_k λ_k = 1
A^T λ = 0
λ ≥ 0
Optimality Conditions – p. 69
Special cases: linear constraints
min f(x)
Ax ≥ b
Lagrange function:
L(x, λ) = f(x) + λT (b − Ax)
Constraint qualifications always hold (polyhedron). If x⋆ is a local optimum, there exists λ⋆ ≥ 0:
Ax⋆ ≥ b
∇f(x⋆) = AT λ⋆
λ⋆T (b − Ax⋆) = 0
Optimality Conditions – p. 70
Non negativity constraints
min f(x)
x ≥ 0
Lagrange function: L(x, λ) = f(x) − λT x. KKT conditions:
∇f(x⋆) = λ⋆
x⋆ ≥ 0
λ⋆ ≥ 0
(λ⋆)T x⋆ = 0
Optimality Conditions – p. 71
λ_j⋆ = ∂f(x⋆)/∂x_j   j = 1, . . . , n
from which
∂f(x⋆)/∂x_j = 0   ∀ j : x_j⋆ > 0
∂f(x⋆)/∂x_j ≥ 0   otherwise
Optimality Conditions – p. 72
Box constraints
min f(x)
ℓ ≤ x ≤ u,   ℓ_i < u_i ∀ i
Lagrange function: L(x, λ, µ) = f(x) + λ^T (ℓ − x) + µ^T (x − u). KKT conditions:
∇f(x⋆) = λ⋆ − µ⋆
(ℓ − x⋆)^T λ⋆ = 0
(x⋆ − u)^T µ⋆ = 0
(λ⋆, µ⋆) ≥ 0
Given x⋆ let
J_ℓ = {j : x_j⋆ = ℓ_j},  J_u = {j : x_j⋆ = u_j},  J_0 = {j : ℓ_j < x_j⋆ < u_j}
Optimality Conditions – p. 73
Box constr. (cont)
Then, from complementarity,
∂f(x⋆)/∂x_j = λ_j⋆    j ∈ J_ℓ
∂f(x⋆)/∂x_j = −µ_j⋆   j ∈ J_u
∂f(x⋆)/∂x_j = 0       j ∈ J_0
Optimality Conditions – p. 74
Thus
∂f(x⋆)/∂x_j ≥ 0   j ∈ J_ℓ
∂f(x⋆)/∂x_j ≤ 0   j ∈ J_u
∂f(x⋆)/∂x_j = 0   j ∈ J_0
with feasibility ℓ ≤ x⋆ ≤ u
Optimality Conditions – p. 75
Optimization over the simplex
min f(x)
1T x = 1
x ≥ 0
Lagrange function: L(x, λ, µ) = f(x) − λT x + µT (1T x − 1). KKT:
∇f(x⋆) = λ⋆ − µ⋆1
1T x⋆ = 1
(x⋆, λ⋆) ≥ 0
(λ⋆)T x⋆ = 0
Optimality Conditions – p. 76
simplex. . .
∂f(x⋆)/∂x_j − λ_j⋆ = −µ⋆
(all equal). Thus, from complementarity, if x_j⋆ > 0 then λ_j⋆ = 0 and ∂f(x⋆)/∂x_j = −µ⋆; otherwise ∂f(x⋆)/∂x_j ≥ −µ⋆. Thus, if j : x_j⋆ > 0,
∂f(x⋆)/∂x_j ≤ ∂f(x⋆)/∂x_k   ∀ k
Optimality Conditions – p. 77
Application: Min var portfolio
Given n assets with random returns R1, . . . , Rn, how to invest 1 € in such a way that the resulting portfolio has minimum variance? If x_j denotes the percentage of the investment in asset j, how to compute the variance of this portfolio P(x)?
Var = E(P(x) − E(P(x)))²
= E( Σ_{j=1..n} (R_j − E(R_j)) x_j )²
= Σ_{i,j} E[(R_i − E(R_i))(R_j − E(R_j))] x_i x_j
= x^T Q x
where Q is the variance-covariance matrix of the n assets.
Optimality Conditions – p. 78
Min var portfolio
Problem (objective multiplied by 1/2 for simpler computations):
min(1/2)xT Qx
1T x = 1
x ≥ 0
Optimality Conditions – p. 79
Optimal portfolio
KKT: for all i : x_i⋆ > 0:
Σ_j Q_ij x_j⋆ ≤ Σ_j Q_kj x_j⋆   ∀ k
The vector Qx may be thought of as the vector of marginal contributions to the total risk (which is a weighted sum of the elements of Qx). Thus in the optimal portfolio, all assets held at a positive level give an equal (and minimal) contribution to the total risk.
Optimality Conditions – p. 80
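The equal-marginal-risk property can be verified numerically; a sketch under stated assumptions (an illustrative covariance matrix and a general-purpose solver rather than a dedicated QP method):

```python
import numpy as np
from scipy.optimize import minimize

# Minimum-variance portfolio: min 0.5 x^T Q x  s.t.  1^T x = 1, x >= 0.
Q = np.array([[0.10, 0.02, 0.01],
              [0.02, 0.08, 0.03],
              [0.01, 0.03, 0.12]])   # illustrative covariance matrix

res = minimize(lambda x: 0.5 * x @ Q @ x,
               x0=np.ones(3) / 3,
               jac=lambda x: Q @ x,
               method="SLSQP",
               bounds=[(0, None)] * 3,
               constraints=[{"type": "eq", "fun": lambda x: x.sum() - 1}])
x = res.x
g = Q @ x            # marginal risk contributions
held = x > 1e-6      # assets held at a positive level
# KKT over the simplex: held assets share an equal, minimal marginal
# contribution to total risk.
print(np.allclose(g[held], g[held][0], atol=1e-4))
print(np.all(g[held].min() <= g + 1e-4))
```

With this Q all three assets end up held, so the marginal contributions Qx are equalized across the whole portfolio.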
Algorithms for unconstrained local optimization
Fabio Schoen
2008
http://gol.dsi.unifi.it/users/schoen
Algorithms for unconstrained local optimization – p. 1
Optimization Algorithms
Most common form for optimization algorithms:
Line search–based methods: given a starting point x0, a sequence is generated:
x_{k+1} = x_k + α_k d_k
where d_k ∈ R^n is the search direction and α_k > 0 the step.
Usually d_k is chosen first and then the step is obtained, often from a 1–dimensional optimization.
Algorithms for unconstrained local optimization – p. 2
Trust-region algorithms
A model m(x) and a confidence (trust) region U(x_k) containing x_k are defined. The new iterate is chosen as the solution of the constrained optimization problem
min_{x∈U(x_k)} m(x)
The model and the confidence region are possibly updated at each iteration.
Algorithms for unconstrained local optimization – p. 3
Speed measures
Let x⋆ be a local optimum. The error at x_k might be measured e.g. as
e(x_k) = ‖x_k − x⋆‖   or   e(x_k) = |f(x_k) − f(x⋆)|.
Given x_k → x⋆, if ∃ q > 0, β ∈ (0, 1) such that (for k large enough)
e(x_k) ≤ q β^k
⇒ x_k is linearly convergent, or converges with order 1; β is the convergence rate. A sufficient condition for linear convergence:
lim sup e(x_{k+1})/e(x_k) ≤ β
Algorithms for unconstrained local optimization – p. 4
super–linear convergence
If for every β ∈ (0, 1) there exists q such that

e(xk) ≤ qβ^k

then convergence is super-linear. Sufficient condition:

lim sup e(xk+1)/e(xk) = 0
Algorithms for unconstrained local optimization – p. 5
Higher order convergence
If, given p > 1, ∃ q > 0, β ∈ (0, 1) :

e(xk) ≤ qβ^(p^k)

then xk is said to converge with order at least p. If p = 2 ⇒ quadratic convergence. Sufficient condition:

lim sup e(xk+1)/e(xk)^p < ∞
Algorithms for unconstrained local optimization – p. 6
Examples
1/k converges to 0 with order 1 (the ratio e(xk+1)/e(xk) → 1: the rate is sublinear)
1/k² converges to 0 with order 1
2^−k converges to 0 with order 1 (linear convergence, with rate β = 1/2)
k^−k converges to 0 with order 1; convergence is super-linear
2^−2^k converges to 0 with order 2: quadratic convergence
Algorithms for unconstrained local optimization – p. 7
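These rates can be checked numerically by computing the ratios e(xk+1)/e(xk)^p for the example sequences; the index ranges below are illustrative cut-offs chosen to stay within floating-point range.

```python
def ratios(seq, p=1.0):
    # Successive ratios e(k+1) / e(k)^p used in the sufficient conditions
    return [seq[k + 1] / seq[k] ** p for k in range(len(seq) - 1)]

ks = range(1, 40)
harmonic = [1.0 / k for k in ks]           # ratio -> 1: sublinear rate
geometric = [2.0 ** (-k) for k in ks]      # ratio -> 1/2: linear convergence
superlin = [float(k) ** (-k) for k in ks]  # ratio -> 0: super-linear
quadratic = [2.0 ** (-(2 ** k)) for k in range(1, 9)]  # ratio with p=2 stays bounded
```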
Descent directions and the gradient
Let f ∈ C1(Rn), xk ∈ Rn : ∇f(xk) ≠ 0.
Let d ∈ Rn. If

dT∇f(xk) < 0

then d is a descent direction. Taylor expansion:

f(xk + αd) − f(xk) = αdT∇f(xk) + o(α)
(f(xk + αd) − f(xk))/α = dT∇f(xk) + o(1)

Thus if α is small enough f(xk + αd) − f(xk) < 0.
NB: d might be a descent direction even if dT∇f(xk) = 0
Algorithms for unconstrained local optimization – p. 8
Convergence of line search methods
If a sequence xk+1 = xk + αkdk is generated in such a way that:
L0 = {x : f(x) ≤ f(x0)} is compact
dk ≠ 0 whenever ∇f(xk) ≠ 0
f(xk+1) ≤ f(xk)
if ∇f(xk) ≠ 0 ∀ k then

lim_{k→∞} dkT∇f(xk)/‖dk‖ = 0
Algorithms for unconstrained local optimization – p. 9
if dk ≠ 0 then

|dkT∇f(xk)|/‖dk‖ ≥ σ(‖∇f(xk)‖)

where σ is such that limk→∞ σ(tk) = 0 ⇒ limk→∞ tk = 0
(σ is called a forcing function)
Algorithms for unconstrained local optimization – p. 10
Then either there exists a finite index k such that ∇f(xk) = 0, or otherwise
xk ∈ L0 and all of its limit points are in L0
f(xk) admits a limit
limk→∞ ∇f(xk) = 0
for every limit point x̄ of xk we have ∇f(x̄) = 0
Algorithms for unconstrained local optimization – p. 11
Comments on the assumptions
f(xk+1) ≤ f(xk): most optimization methods choose dk as a descent direction. If dk is a descent direction, choosing αk "sufficiently small" ensures the validity of the assumption.
lim_{k→∞} dkT∇f(xk)/‖dk‖ = 0: given a normalized direction dk, the scalar product dkT∇f(xk) is the directional derivative of f along dk: it is required that this goes to zero. This can be achieved through precise line searches (choosing the step so that f is minimized along dk).
|dkT∇f(xk)|/‖dk‖ ≥ σ(‖∇f(xk)‖): letting, e.g., σ(t) = ct, c > 0, if dk : dkT∇f(xk) < 0 then the condition becomes

dkT∇f(xk)/(‖dk‖ ‖∇f(xk)‖) ≤ −c
Algorithms for unconstrained local optimization – p. 12
Recalling that

cos θk = dkT∇f(xk)/(‖dk‖ ‖∇f(xk)‖)

then the condition becomes

cos θk ≤ −c

that is, the angle between dk and ∇f(xk) is bounded away from orthogonality.
Algorithms for unconstrained local optimization – p. 13
Gradient Algorithms
General scheme:
xk+1 = xk − αkDk∇f(xk)
with Dk ≻ 0 and αk > 0. If ∇f(xk) ≠ 0 then

dk = −Dk∇f(xk)

is a descent direction. In fact

dkT∇f(xk) = −∇Tf(xk)Dk∇f(xk) < 0
Algorithms for unconstrained local optimization – p. 14
Steepest Descent
or “gradient” method:
Dk := I
i.e. xk+1 = xk − αk∇f(xk). If ∇f(xk) ≠ 0 then dk = −∇f(xk) is a descent direction. Moreover, it is the steepest (w.r.t. the Euclidean norm): it solves

min_{d ∈ Rn, ‖d‖ ≤ 1} ∇Tf(xk) d
Algorithms for unconstrained local optimization – p. 15
(figure: level sets of f and the direction −∇f(xk) at xk)
Algorithms for unconstrained local optimization – p. 16
. . .
min_{d ∈ Rn, √(dTd) ≤ 1} ∇Tf(xk) d

KKT conditions: in the interior ⇒ ∇f(xk) = 0; if the constraint is active ⇒

∇f(xk) + λ d/‖d‖ = 0
√(dTd) = 1
λ ≥ 0

⇒ d = −∇f(xk)/‖∇f(xk)‖.
Algorithms for unconstrained local optimization – p. 17
Newton’s method
Dk := (∇2f(xk))−1
Motivation: Taylor expansion of f:

f(x) ≈ f(xk) + ∇Tf(xk)(x − xk) + (1/2)(x − xk)T∇2f(xk)(x − xk)
Minimizing the approximation:
∇f(xk) + ∇2f(xk)(x − xk) = 0
If the Hessian is non-singular ⇒

x = xk − (∇2f(xk))−1∇f(xk)
Algorithms for unconstrained local optimization – p. 18
Step choice
Given dk, how should αk be chosen in xk+1 = xk + αkdk?
"Optimal" choice (one-dimensional optimization):

αk = arg min_{α ≥ 0} f(xk + αdk).
An analytical expression for the optimal step is available only in a few cases, e.g. if f(x) = (1/2)xT Qx + cT x with Q ≻ 0. Then

f(xk + αdk) = (1/2)(xk + αdk)T Q(xk + αdk) + cT(xk + αdk)
            = (1/2)α² dkT Qdk + α(Qxk + c)T dk + β

where β does not depend on α.
Algorithms for unconstrained local optimization – p. 19
Minimizing w.r.t. α:

α dkT Qdk + (Qxk + c)T dk = 0 ⇒

α = −(Qxk + c)T dk / (dkT Qdk) = −dkT∇f(xk) / (dkT∇2f(xk)dk)

E.g., in steepest descent (dk = −∇f(xk)):

αk = ‖∇f(xk)‖² / (∇Tf(xk)∇2f(xk)∇f(xk))
Algorithms for unconstrained local optimization – p. 20
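The closed-form step can be verified on a small quadratic; Q and the starting point below are arbitrary illustrative choices.

```python
import numpy as np

Q = np.array([[2.0, 0.0], [0.0, 10.0]])  # Hessian of f(x) = (1/2) x^T Q x
x = np.array([9.0, 1.0])

g = Q @ x                       # gradient at x
alpha = (g @ g) / (g @ Q @ g)   # exact line-search step along -g
x_new = x - alpha * g

f = lambda z: 0.5 * z @ Q @ z
```

Perturbing α in either direction should increase f along the ray, since φ(α) = f(x − αg) is a strictly convex quadratic.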
Approximate step size
Rules for choosing a step size (from the sufficient conditions for convergence):

f(xk+1) < f(xk)
lim_{k→∞} dkT∇f(xk)/‖dk‖ = 0

Often it is also required that

‖xk+1 − xk‖ → 0
dkT∇f(xk + αkdk) → 0

In general it is important to ensure a sufficient reduction of f and a sufficiently large step xk+1 − xk.
Algorithms for unconstrained local optimization – p. 21
Avoid too large steps
Algorithms for unconstrained local optimization – p. 22
Avoid too small steps
Algorithms for unconstrained local optimization – p. 23
Armijo’s rule
Input: δ ∈ (0, 1), γ ∈ (0, 1/2), ∆k > 0
α := ∆k;
while f(xk + αdk) > f(xk) + γαdkT∇f(xk) do
    α := δα;
end
return α

Typical values: δ ∈ [0.1, 0.5], γ ∈ [10−4, 10−3]. On exit the returned step is such that

f(xk + αdk) ≤ f(xk) + γαdkT∇f(xk)
Algorithms for unconstrained local optimization – p. 24
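A minimal sketch of the backtracking rule above, assuming a generic differentiable f; the test function and parameter values are illustrative.

```python
import numpy as np

def armijo(f, grad_f, x, d, delta=0.5, gamma=1e-4, Delta=1.0):
    # Shrink alpha geometrically until the sufficient-decrease test holds
    slope = d @ grad_f(x)          # must be negative: d is a descent direction
    alpha = Delta
    while f(x + alpha * d) > f(x) + gamma * alpha * slope:
        alpha *= delta
    return alpha

f = lambda x: 0.5 * x @ x          # simple strictly convex test function
grad_f = lambda x: x
x0 = np.array([3.0, -4.0])
d0 = -grad_f(x0)                   # steepest descent direction
alpha0 = armijo(f, grad_f, x0, d0)
```

Termination is guaranteed when dT∇f(x) < 0, because sufficiently small steps always satisfy the test.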
(figure: the set of acceptable steps α, between the lines αdkT∇f(xk) and γαdkT∇f(xk))
Algorithms for unconstrained local optimization – p. 25
Line search in practice
How should the initial step size ∆k be chosen? Let φ(α) = f(xk + αdk). A possibility is to choose ∆k = α⋆, the minimizer of a quadratic approximation of φ(·). Example:
q(α) = c0 + c1α + (1/2)c2α²
q(0) = c0 := f(xk)
q′(0) = c1 := dTk ∇f(xk)
Then α⋆ = −c1/c2.
Algorithms for unconstrained local optimization – p. 26
Third condition? If an estimate f̄ of the minimum of f(xk + αdk) is available ⇒ choose c2 such that min q(α) = f̄:

min q(α) = q(−c1/c2) = c0 − c1²/(2c2) := f̄
c2 = c1²/(2(c0 − f̄))
α⋆ = −c1/c2 = 2(f̄ − c0)/c1
Algorithms for unconstrained local optimization – p. 27
Thus it is reasonable to start with

∆k = 2(f̄ − f(xk))/(dkT∇f(xk))

A reasonable estimate is f(xk) − f̄ ≈ f(xk−1) − f(xk), which gives ∆k = 2(f(xk) − f(xk−1))/(dkT∇f(xk)).
Algorithms for unconstrained local optimization – p. 28
Convergence of steepest descent
xk+1 = xk − αk∇f(xk)
If a sufficiently accurate step size is used ⇒ the conditions of the theorem on global convergence are satisfied ⇒ the steepest descent algorithm globally converges to a stationary point. "Sufficiently accurate" means exact line search or, e.g., Armijo's rule.
Algorithms for unconstrained local optimization – p. 29
Local analysis of steepest descent
Behaviour of the algorithm when minimizing
f(x) = (1/2) xT Qx

where Q ≻ 0; (local and global) optimum: x⋆ = 0. Steepest descent method:

xk+1 = xk − αk∇f(xk) = xk − αkQxk = (I − αkQ)xk

Error (in x) at step k + 1:

‖xk+1 − 0‖ = ‖(I − αkQ)xk‖ = √(xkT(I − αkQ)²xk)

Algorithms for unconstrained local optimization – p. 30
Analysis
Let A be symmetric with eigenvalues λ1 ≤ · · · ≤ λn. Then

λ1‖v‖² ≤ vT Av ≤ λn‖v‖² ∀ v ∈ Rn

⇒

xkT(I − αkQ)²xk ≤ λ⋆ xkT xk

where λ⋆ is the largest eigenvalue of (I − αkQ)².
Algorithms for unconstrained local optimization – p. 31
. . .
λ is an eigenvalue of A iff αλ is an eigenvalue of αA
λ is an eigenvalue of A iff 1 + λ is an eigenvalue of I + A
Thus the eigenvalues of (I − αkQ) are

1 − αkλi

where λi are the eigenvalues of Q. The maximum eigenvalue of (I − αkQ)² will be

max{(1 − αkλ1)², (1 − αkλn)²}

thus

‖xk+1‖ ≤ √max{(1 − αkλ1)², (1 − αkλn)²} ‖xk‖ = max{|1 − αkλ1|, |1 − αkλn|} ‖xk‖
Algorithms for unconstrained local optimization – p. 32
. . .
Eliminating the dependency on αk:

max{|1 − αλ1|, |1 − αλn|} = max{1 − αλ1, −1 + αλ1, 1 − αλn, −1 + αλn}

(figure: the functions |1 − αλ1| and |1 − αλn| plotted against α)
Algorithms for unconstrained local optimization – p. 33
. . .
Since α ≥ 0 and λ1 ≤ λn ⇒

1 − αλ1 ≥ 1 − αλn
−1 + αλ1 ≤ −1 + αλn

and thus

max{|1 − αλ1|, |1 − αλn|} = max{1 − αλ1, −1 + αλn}

Minimum point:

1 − αλ1 = −1 + αλn

i.e.

α⋆ = 2/(λ1 + λn)

Algorithms for unconstrained local optimization – p. 34
Analysis
In the best possible case

‖xk+1‖/‖xk‖ ≤ |1 − α⋆λ1| = |1 − 2λ1/(λ1 + λn)| = (λn − λ1)/(λn + λ1) = (ρ − 1)/(ρ + 1)

where ρ = λn/λ1 is the condition number of Q:
ρ ≫ 1 (ill-conditioned problem) ⇒ very slow convergence
ρ ≈ 1 ⇒ very fast convergence
Algorithms for unconstrained local optimization – p. 35
Zig–zagging
min (1/2)(x² + My²)

where M > 0. Optimum: x⋆ = y⋆ = 0. Starting point: (M, 1). Iterates:
(xk+1, yk+1) = (xk, yk) − α (xk, Myk)

With optimal step size ⇒

(xk, yk) = ( M((M−1)/(M+1))^k , (−(M−1)/(M+1))^k )
Algorithms for unconstrained local optimization – p. 36
Convergence is
rapid if M ≈ 1
very slow and "zig-zagging" if M ≫ 1 or M ≪ 1
Slow convergence and zig-zagging are general phenomena (especially when the starting point is near the longest axis of the ellipsoidal level sets)
Algorithms for unconstrained local optimization – p. 37
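The closed-form iterates above can be reproduced numerically: the sketch below runs exact-line-search steepest descent on f(x, y) = (1/2)(x² + My²) from (M, 1), with M = 10 as an illustrative value.

```python
import numpy as np

M = 10.0
Q = np.diag([1.0, M])              # f(z) = (1/2) z^T Q z = (1/2)(x^2 + M y^2)
z = np.array([M, 1.0])             # starting point (M, 1)

traj = [z.copy()]
for _ in range(5):
    g = Q @ z                      # gradient
    alpha = (g @ g) / (g @ Q @ g)  # exact line-search step
    z = z - alpha * g
    traj.append(z.copy())

r = (M - 1.0) / (M + 1.0)          # per-iteration contraction factor
```

Each iterate matches the closed form (M·r^k, (−r)^k), alternating sign in y: the zig-zag.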
Zig–zagging
(figure: zig-zagging iterates of steepest descent on an ill-conditioned quadratic)
Algorithms for unconstrained local optimization – p. 38
Analysis of Newton’s method
Newton-Raphson method: xk+1 = xk − (∇2f(xk))−1∇f(xk). Let x⋆ be a local optimum. Taylor expansion of ∇f:

0 = ∇f(x⋆) = ∇f(xk) + ∇2f(xk)(x⋆ − xk) + o(‖x⋆ − xk‖)

If ∇2f(xk) is non-singular and ‖(∇2f(xk))−1‖ is bounded ⇒

0 = (∇2f(xk))−1∇f(xk) + (x⋆ − xk) + (∇2f(xk))−1 o(‖x⋆ − xk‖)
  = x⋆ − xk+1 + o(‖x⋆ − xk‖)
Algorithms for unconstrained local optimization – p. 39
Thus

‖x⋆ − xk+1‖ = o(‖x⋆ − xk‖)

i.e. ‖x⋆ − xk+1‖/‖x⋆ − xk‖ → 0 ⇒ convergence is at least super-linear
Algorithms for unconstrained local optimization – p. 40
Local Convergence of Newton’s Method
Let f ∈ C2(U(x⋆, δ1)), where U is the ball with radius δ1 and center x⋆; let ∇2f(x⋆) be non-singular. Then:
1. ∃ δ > 0 : if x0 ∈ U(x⋆, δ) ⇒ xk is well defined and converges to x⋆ at least superlinearly.
2. If ∃ δ > 0, L > 0, M > 0 :

‖∇2f(x) − ∇2f(y)‖ ≤ L‖x − y‖

and

‖(∇2f(x))−1‖ ≤ M

then, if x0 ∈ U(x⋆, δ), Newton's method converges with order at least 2 and

‖xk+1 − x⋆‖ ≤ (LM/2)‖xk − x⋆‖²
Algorithms for unconstrained local optimization – p. 41
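Quadratic convergence can be observed on a one-dimensional sketch. The function f(x) = x²/2 + x³/3 (minimizer x⋆ = 0, f″(0) = 1) is an illustrative choice with a non-degenerate, locally Lipschitz Hessian; the pure Newton step reduces to x − f′(x)/f″(x).

```python
def grad(x):
    return x + x * x          # f'(x) for f(x) = x^2/2 + x^3/3

def hess(x):
    return 1.0 + 2.0 * x      # f''(x)

x = 0.5                       # close enough to the minimizer x* = 0
errors = [abs(x)]
for _ in range(5):
    x = x - grad(x) / hess(x) # pure Newton step: x_{k+1} = x^2/(1+2x) here
    errors.append(abs(x))
```

The error roughly squares at every step, in agreement with the bound ‖xk+1 − x⋆‖ ≤ (LM/2)‖xk − x⋆‖².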
Difficulties
Many things might go wrong:
at some iteration, ∇2f(xk) might be singular, e.g. if xk belongs to a flat region where f(x) is constant
even if it is non-singular, inverting ∇2f(xk) or, in any case, solving a linear system with coefficient matrix ∇2f(xk) may be numerically unstable and computationally demanding
there is no guarantee that ∇2f(xk) ≻ 0 ⇒ the Newton direction might not be a descent direction
Algorithms for unconstrained local optimization – p. 42
Difficulties
Newton’s method just tries to solve the system
∇f(xk) = 0
and thus might very well be attracted towards a maximum
the method lacks global convergence: it converges only ifstarted “near” a local optimum
Algorithms for unconstrained local optimization – p. 43
Newton–type methods
line search variant: xk+1 = xk − αk(∇2f(xk))−1∇f(xk)
modified Newton method: replace ∇2f(xk) by (∇2f(xk) + Dk), where Dk is chosen so that ∇2f(xk) + Dk is positive definite
Algorithms for unconstrained local optimization – p. 44
Quasi-Newton methods
Consider solving the nonlinear system ∇f(x) = 0. Taylor expansion of the gradient:

∇f(xk) ≈ ∇f(xk+1) + ∇2f(xk+1)(xk − xk+1)

Let Bk+1 be an approximation of the Hessian at xk+1. Quasi-Newton equation:

Bk+1(xk+1 − xk) = ∇f(xk+1) − ∇f(xk)
Algorithms for unconstrained local optimization – p. 45
Quasi–Newton equation
Let:
sk := xk+1 − xk yk := ∇f(xk+1) −∇f(xk)
Quasi-Newton equation: Bk+1sk = yk. If Bk was the previous approximate Hessian, we ask that
1. the variation between Bk and Bk+1 is "small"
2. nothing changes along directions which are normal to the step sk:

Bkz = Bk+1z ∀ z : zT sk = 0

Choosing n − 1 vectors z orthogonal to sk ⇒ n² linearly independent equations in n² unknowns ⇒ ∃ a unique solution.
Algorithms for unconstrained local optimization – p. 46
Broyden updating
It can be shown that the unique solution is given by:

Bk+1 = Bk + (yk − Bksk)skT / (skT sk)

Theorem: let Bk ∈ Rn×n and sk ≠ 0. The unique solution of

min_B ‖Bk − B‖F   s.t.  Bsk = yk

is Broyden's update Bk+1; here ‖X‖F = √Tr(XTX) denotes the Frobenius norm.
Algorithms for unconstrained local optimization – p. 47
proof
For any feasible B (i.e. Bsk = yk):

‖Bk+1 − Bk‖F = ‖(yk − Bksk)skT / (skT sk)‖F
             = ‖(Bsk − Bksk)skT / (skT sk)‖F
             = ‖(B − Bk)skskT / (skT sk)‖F
             ≤ ‖B − Bk‖F ‖skskT‖F / (skT sk)
             = ‖B − Bk‖F √Tr(skskT skskT) / (skT sk)
             = ‖B − Bk‖F (skT sk)/(skT sk)
             = ‖B − Bk‖F

Uniqueness is a consequence of the strict convexity of the norm and the convexity of the feasible region.
Algorithms for unconstrained local optimization – p. 48
Quasi-Newton and optimization
Special situation:
1. the hessian matrix in optimization problems is symmetric;
2. in gradient methods, when we let xk+1 = xk − (Bk+1)−1∇f(xk), it is desirable that Bk+1 be positive definite.
Broyden's update:

Bk+1 = Bk + (yk − Bksk)skT / (skT sk)

is generally not symmetric even if Bk is.
Algorithms for unconstrained local optimization – p. 49
Symmetry
Remedy: let C1 = Bk + (yk − Bksk)skT / (skT sk); symmetrization:

C2 = (1/2)(C1 + C1T)

However, C2 does not satisfy the Quasi-Newton equation. Broyden update of C2:

C3 = C2 + (yk − C2sk)skT / (skT sk)
which is not symmetric, . . .
Algorithms for unconstrained local optimization – p. 50
PSB update
In the limit

Bk+1 = Bk + [(yk − Bksk)skT + sk(yk − Bksk)T] / (skT sk) − [skT(yk − Bksk)] skskT / (skT sk)²

(PSB – Powell-Symmetric-Broyden update).
Imposing also hereditary positive definiteness, DFP (Davidon-Fletcher-Powell) is obtained:

Bk+1 = Bk + [(yk − Bksk)ykT + yk(yk − Bksk)T] / (ykT sk) − [skT(yk − Bksk)] ykykT / (ykT sk)²
     = (I − ykskT/(ykT sk)) Bk (I − skykT/(ykT sk)) + ykykT/(ykT sk)
Algorithms for unconstrained local optimization – p. 51
BFGS
Same ideas, but applied to the approximate inverse Hessian. The inverse Quasi-Newton equation

sk = Hk+1 yk

leads to the most common Quasi-Newton update, BFGS (Broyden-Fletcher-Goldfarb-Shanno):

Hk+1 = (I − skykT/(ykT sk)) Hk (I − ykskT/(ykT sk)) + skskT/(ykT sk)
Algorithms for unconstrained local optimization – p. 52
BFGS method
xk+1 = xk − αkHk∇f(xk)
Hk+1 = (I − skykT/(ykT sk)) Hk (I − ykskT/(ykT sk)) + skskT/(ykT sk)
yk = ∇f(xk+1) − ∇f(xk)
sk = xk+1 − xk
Algorithms for unconstrained local optimization – p. 53
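A minimal sketch of the BFGS iteration above, combined with Armijo backtracking; the convex quadratic test problem and the curvature safeguard threshold are illustrative choices.

```python
import numpy as np

def bfgs(f, grad, x0, iters=50):
    n = len(x0)
    H = np.eye(n)                       # initial inverse-Hessian approximation
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    for _ in range(iters):
        d = -H @ g
        alpha = 1.0
        while f(x + alpha * d) > f(x) + 1e-4 * alpha * (d @ g):
            alpha *= 0.5                # Armijo backtracking from the unit step
        x_new = x + alpha * d
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        sy = s @ y
        if sy > 1e-12:                  # curvature condition keeps H positive definite
            I = np.eye(n)
            V = I - np.outer(s, y) / sy
            H = V @ H @ V.T + np.outer(s, s) / sy
        x, g = x_new, g_new
    return x

# Convex quadratic test problem with known minimizer A^{-1} b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
x_star = np.linalg.solve(A, b)
x_hat = bfgs(f, grad, np.zeros(2))
```

Skipping the update when skTyk is not sufficiently positive is a common safeguard; on strongly convex problems the condition always eventually holds.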
Trust Region methods
Possible defect of the standard Newton method: the approximation becomes less and less precise as we move away from the current point. Long step ⇒ bad approximation.
Idea: constrained minimization of the quadratic approximation:

xk+1 = arg min_{‖x − xk‖ ≤ ∆k} mk(x)   where
mk(x) = f(xk) + ∇Tf(xk)(x − xk) + (1/2)(x − xk)T∇2f(xk)(x − xk)

∆k > 0 is a parameter.
First advantage (over pure Newton): the step is always well defined (thanks to Weierstrass's theorem).
Algorithms for unconstrained local optimization – p. 54
Outline of Trust Region
Let mk(·) be a local model function. E.g. in Newton Trust Region methods

mk(s) = f(xk) + sT∇f(xk) + (1/2) sT∇2f(xk) s

or in a Quasi-Newton Trust Region method

mk(s) = f(xk) + sT∇f(xk) + (1/2) sT Bk s
Algorithms for unconstrained local optimization – p. 55
How should the trust region radius ∆k be chosen and updated? Given a step sk, let

ρk = (f(xk) − f(xk + sk)) / (mk(0) − mk(sk))

the ratio between the actual reduction and the predicted reduction.
Algorithms for unconstrained local optimization – p. 56
Model updating
ρk = (f(xk) − f(xk + sk)) / (mk(0) − mk(sk))

The predicted reduction is always non-negative;
if ρk is small (certainly if it is negative) the model and the function strongly disagree ⇒ the step must be rejected and the trust region reduced
if ρk ≥ 1 it is safe to expand the trust region
intermediate ρk values lead us to keep the region unchanged
Algorithms for unconstrained local optimization – p. 57
Algorithm
Data: ∆̄ > 0, ∆0 ∈ (0, ∆̄), η ∈ [0, 1/4]
for k = 0, 1, . . . do
    find the step sk minimizing the model in the trust region and compute ρk;
    if ρk < 1/4 then
        ∆k+1 = ∆k/4;
    else if ρk > 3/4 and ‖sk‖ = ∆k then
        ∆k+1 = min{2∆k, ∆̄};
    else
        ∆k+1 = ∆k;
    end
    if ρk > η then xk+1 = xk + sk else xk+1 = xk;
end
Algorithms for unconstrained local optimization – p. 58
Solving the model
How to find

min_{‖s‖ ≤ ∆} ∇f(xk)T s + (1/2) sT Bk s

If Bk ≻ 0, KKT conditions are necessary and sufficient; rewriting the constraint as sTs ≤ ∆² ⇒

∇f(xk) + Bks + 2λs = 0
λ(∆² − ‖s‖²) = 0
Algorithms for unconstrained local optimization – p. 59
Thus either s is in the interior of the ball with radius ∆, in which case λ = 0 and we have the (quasi-)Newton step

s = −Bk−1∇f(xk)

or ‖s‖ = ∆ and, if λ > 0, then 2λs = −∇f(xk) − Bks = −∇mk(s) ⇒ s is parallel to the negative gradient of the model and normal to its contour lines.
Algorithms for unconstrained local optimization – p. 60
The Cauchy Point
A strategy to approximately solve the trust region sub-problem: find the "Cauchy point", the minimizer of mk along the direction −∇f(xk) within the trust region. First find the direction:

psk = arg min_{p : ‖p‖ ≤ ∆k} fk + ∇f(xk)T p

Then along this direction find a minimizer:

τk = arg min_{τ ≥ 0} mk(τ psk)   s.t. ‖τ psk‖ ≤ ∆k

The Cauchy point is xk + τk psk.
Algorithms for unconstrained local optimization – p. 61
Finding the Cauchy point
Finding psk is easy; the analytic solution is:

psk = −(∆k/‖∇f(xk)‖) ∇f(xk)

For the step size τk:
if ∇f(xk)T Bk∇f(xk) ≤ 0 ⇒ negative curvature direction ⇒ largest possible step ⇒ τk = 1
otherwise the model along the line is strictly convex, so

τk = min{1, ‖∇f(xk)‖³/(∆k ∇f(xk)T Bk∇f(xk))}

Choosing the Cauchy point ⇒ global but extremely slow convergence (similar to steepest descent). Usually an improved point is searched for, starting from the Cauchy point.
Algorithms for unconstrained local optimization – p. 62
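The Cauchy-point computation can be transcribed directly from the formulas above; the model data g, B and the radius ∆ below are illustrative.

```python
import numpy as np

def cauchy_point(g, B, Delta):
    # Minimizer of m(s) = g^T s + (1/2) s^T B s along -g, within ||s|| <= Delta
    ps = -Delta * g / np.linalg.norm(g)   # boundary step along -g
    curv = g @ B @ g
    if curv <= 0:
        tau = 1.0                          # negative curvature: go to the boundary
    else:
        tau = min(1.0, np.linalg.norm(g) ** 3 / (Delta * curv))
    return tau * ps

g = np.array([1.0, 2.0])
B = np.array([[2.0, 0.0], [0.0, 1.0]])     # positive definite model Hessian
Delta = 0.1
s = cauchy_point(g, B, Delta)
```

The returned step stays inside the trust region and always produces a negative model value, which is what the global convergence theory requires of the Cauchy point.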
Derivative Free Optimization
Algorithms for unconstrained local optimization – p. 63
Pattern Search
For smooth optimization, but without knowledge of derivatives. Elementary idea: if x ∈ R² is not a local minimum of f, then at least one of the directions e1, e2, −e1, −e2 (moving towards E, N, W, S) forms an acute angle with −∇f(x) ⇒ it is a descent direction.
Direct search: explore all the directions in search of one which gives a descent.
Algorithms for unconstrained local optimization – p. 64
Coordinate search
Let D⊕ = {±e1, . . . , ±en} be the set of coordinate directions and their opposites.

Data: k = 0, ∆0 an initial step length, x0 a starting point
while ∆k is large enough do
    if f(xk + ∆kd) < f(xk) for some d ∈ D⊕ then
        xk+1 = xk + ∆kd (step accepted);
    else
        ∆k+1 = 0.5∆k;
    end
    k = k + 1;
end
Algorithms for unconstrained local optimization – p. 65
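A minimal sketch of coordinate search as described above, on an illustrative separable quadratic; the tolerance and iteration cap are arbitrary safeguards.

```python
import numpy as np

def coordinate_search(f, x0, step=1.0, tol=1e-6, max_iter=10000):
    # Try the 2n coordinate directions; halve the step when none improves
    x = np.asarray(x0, dtype=float)
    n = len(x)
    k = 0
    while step > tol and k < max_iter:
        improved = False
        for i in range(n):
            for sign in (1.0, -1.0):
                trial = x.copy()
                trial[i] += sign * step
                if f(trial) < f(x):
                    x, improved = trial, True
                    break
            if improved:
                break
        if not improved:
            step *= 0.5
        k += 1
    return x

f = lambda x: (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2
x_hat = coordinate_search(f, np.zeros(2))
```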
Pattern search
It is not necessary to explore 2n directions. It is sufficient that the set of directions forms a positive span, i.e. every v ∈ Rn should be expressible as a non-negative linear combination of the vectors in the set. Formally, G is a generating set iff

∀ v ≠ 0 ∈ Rn ∃ g ∈ G : vT g > 0

A "good" generating set should be characterized by a sufficiently high cosine measure:

κ(G) := min_{v ≠ 0} max_{d ∈ G} vT d / (‖v‖‖d‖)
Algorithms for unconstrained local optimization – p. 66
Examples
(figure: three generating sets in R²)

In the first case κ ≈ 0.19612, in the second κ = 0.5, in the third κ = √0.5 ≈ 0.7071.
Algorithms for unconstrained local optimization – p. 67
Step Choice
xk+1 = xk + ∆kdk   if f(xk + ∆kdk) < f(xk) − ρ(∆k)   (success)
xk+1 = xk   otherwise   (failure)

where ρ(t) = o(t). We let

∆k+1 = φk∆k

where φk ≥ 1 for successful iterations, φk < 1 otherwise.
Direct-search methods possess good convergence properties.
Algorithms for unconstrained local optimization – p. 68
(figures: successive iterations of a pattern search method)
Nelder-Mead Simplex
Given a simplex S = {v1, . . . , vn+1} in Rn, let vr be the worst point: r = arg max_i f(vi). Let C be the centroid of S \ {vr}:

C = (∑_{i ≠ r} vi) / n

The algorithm performs a sort of line search along the direction C − vr. Let

R = C + (C − vr)

be the reflection of the worst point along this direction. Let f̄ be the best function value in the current simplex. Three cases might occur:
Algorithms for unconstrained local optimization – p. 73
1: Reflection
Check f(R): if it is intermediate, i.e. better than the worst andworse than the best, then accept the reflection, i.e. discard theworst point in the simplex and replace it with R.
Algorithms for unconstrained local optimization – p. 74
Reflection step
(figure: reflection of the worst point through the centroid)
Algorithms for unconstrained local optimization – p. 75
2: improvement
If the trial step is an improvement, i.e.

f(R) < f̄

then attempt an expansion: try to move further, to E = R + (R − C).
If successful (f(E) < f(R)), accept the expansion point E and discard the worst point.
If unsuccessful, accept R as the new point and discard the worst one.
Algorithms for unconstrained local optimization – p. 76
Expansion
(figure: expansion step beyond the reflected point)
Algorithms for unconstrained local optimization – p. 77
3: contraction
If however the reflected point R is worse than all points in the simplex (possibly except the worst vr), then a contraction step is performed:
if f(R) ≥ f(vr) (R is worse than all points in the simplex), add 0.5(vr + C) to the simplex and discard vr
otherwise, if R is better than vr, add 0.5(R + C) to the simplex and discard vr
Algorithms for unconstrained local optimization – p. 78
Contraction
(figure: contraction step towards the centroid)
Algorithms for unconstrained local optimization – p. 79
Nelder-Mead is not a direct search method (only a single direction at a time is explored). It is widely used by practitioners; however, it may fail to converge to a local minimum. There are examples of strictly convex functions in R² on which the method converges to a non-stationary point. The bad convergence properties are connected to the event that the n-dimensional simplex degenerates into a lower-dimensional subspace. Moreover, the method has a strong tendency to generate directions which are almost normal to that of the gradient! Convergent variants of the Nelder-Mead method do exist.
Algorithms for unconstrained local optimization – p. 80
Implicit filtering
Let
f(x) = h(x) + w(x)

where h(x) is a smooth function, while w(x) can be considered as additive, typically random, noise. The method computes a rough estimate of the gradient (finite differences with a "large" step) and proceeds with an Armijo line search. If unsuccessful, the step used for the finite differences is reduced.
Algorithms for unconstrained local optimization – p. 81
Implicit filtering
Data: εk ↓ 0, parameters δ, γ, ∆ of Armijo's rule
repeat
    OuterIteration = false;
    repeat
        compute f(xk) and a finite-difference estimate of ∇f(xk):
        ∇εk f(xk) = [(f(xk + εkei) − f(xk − εkei))/(2εk)]i
        if ‖∇εk f(xk)‖ ≤ εk then
            OuterIteration = true
        else
            Armijo: if successful accept the Armijo step; otherwise let OuterIteration = true
        end
    until OuterIteration;
    k = k + 1;
until convergence criterion;
Algorithms for unconstrained local optimization – p. 82
Convergence properties
If
∇2h(x) is Lipschitz continuous
the sequence xk generated by the method is infinite

lim_{k→∞} (εk² + η(xk; εk)/εk) = 0

where

η(x; ε) = sup_{z : ‖z − x‖∞ ≤ ε} |w(z)|

unsuccessful Armijo steps occur at most a finite number of times
then all limit points of xk are stationary.
Algorithms for unconstrained local optimization – p. 83
Algorithms for constrained localoptimization
Fabio Schoen
2008
http://gol.dsi.unifi.it/users/schoen
Algorithms for constrained local optimization – p. 1
Feasible direction methods
Algorithms for constrained local optimization – p. 2
Frank–Wolfe method
Let X be a convex set. Consider the problem:

min_{x ∈ X} f(x)

Let xk ∈ X ⇒ choosing a feasible direction dk corresponds to choosing a point x ∈ X : dk = x − xk.
"Steepest descent" choice:

min_{x ∈ X} ∇Tf(xk)(x − xk)

(a linear objective with convex constraints, usually easy to solve). Let x̄k be an optimal solution of this problem.
Algorithms for constrained local optimization – p. 3
Frank–Wolfe
If ∇Tf(xk)(x̄k − xk) = 0 then

∇Tf(xk)d ≥ 0

for every feasible direction d ⇒ the first order necessary conditions hold.
Otherwise, letting dk = x̄k − xk, this is a descent direction along which a step αk ∈ (0, 1] might be chosen according to Armijo's rule.
Algorithms for constrained local optimization – p. 4
Convergence of Frank-Wolfe method
Under mild conditions the method converges to a point satisfying first order necessary conditions. However, it is usually extremely slow (convergence may be sub-linear). It might find application in very large scale problems in which solving the sub-problem for direction determination is very easy (e.g. when X is a polytope).
Algorithms for constrained local optimization – p. 5
Gradient Projection methods
Generic iteration:

xk+1 = xk + αk(x̄k − xk)

where the direction dk = x̄k − xk is obtained by finding

x̄k = [xk − sk∇f(xk)]+

where sk ∈ R+ and [·]+ denotes projection onto the feasible set.
Algorithms for constrained local optimization – p. 6
The method is slightly faster than Frank-Wolfe, with a linear convergence rate similar to that of (unconstrained) steepest descent. It can be applied when projection is relatively cheap, e.g. when the feasible set is a box.
A point xk satisfies the first order necessary conditions (dT∇f(xk) ≥ 0 for every feasible direction d) iff

xk = [xk − sk∇f(xk)]+
Algorithms for constrained local optimization – p. 7
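For a box the projection [·]+ is a componentwise clip, so gradient projection takes only a few lines. A sketch with an illustrative objective whose unconstrained minimizer lies outside the box, so the solution sits at a corner:

```python
import numpy as np

def gradient_projection(grad, lo, hi, x0, s=0.1, iters=500):
    # Fixed step-size gradient projection on the box {lo <= x <= hi}
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = np.clip(x - s * grad(x), lo, hi)   # [.]^+ is a clip for box constraints
    return x

# min (x-2)^2 + (y-3)^2 over the box [0,1]^2: solution is the corner (1, 1)
grad = lambda x: 2.0 * (x - np.array([2.0, 3.0]))
x_hat = gradient_projection(grad, 0.0, 1.0, np.array([0.5, 0.5]))
```

At the solution the fixed-point characterization x = [x − s∇f(x)]+ from the slide holds.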
Lagrange Multiplier Algorithms
Algorithms for constrained local optimization – p. 8
Barrier Methods
min f(x)
gj(x) ≤ 0 j = 1, . . . , r
A barrier is a continuous function which tends to +∞ whenever x approaches the boundary of the feasible region. Examples of barrier functions:

B(x) = −∑j log(−gj(x))   (logarithmic barrier)
B(x) = −∑j 1/gj(x)   (inverse barrier)
Algorithms for constrained local optimization – p. 9
Barrier Method
Let εk ↓ 0 and x0 be strictly feasible, i.e. gj(x0) < 0 ∀ j. Then let

xk = arg min_{x ∈ Rn} (f(x) + εkB(x))

Proposition: every limit point of xk is a global minimum of the constrained optimization problem.
Algorithms for constrained local optimization – p. 10
Analysis of Barrier methods
Special case: a single constraint (may be generalized). Let x̄ be a limit point of xk (a global minimum). If the KKT conditions hold, then there exists a unique λ̄ ≥ 0:

∇f(x̄) + λ̄∇g(x̄) = 0

(with λ̄g(x̄) = 0). xk, the solution of the barrier problem

min f(x) + εkB(x)
g(x) < 0

satisfies

∇f(xk) + εk∇B(xk) = 0
Algorithms for constrained local optimization – p. 11
. . .
If B(x) = φ(g(x)) ⇒

∇f(xk) + εkφ′(g(xk))∇g(xk) = 0

In the limit, for k → ∞:

lim εkφ′(g(xk))∇g(xk) = λ̄∇g(x̄)

if limk g(xk) < 0 ⇒ φ′(g(xk))∇g(xk) → K (finite) and Kεk → 0
if limk g(xk) = 0 ⇒ (thanks to the uniqueness of the Lagrange multiplier)

λ̄ = limk εkφ′(g(xk))
Algorithms for constrained local optimization – p. 12
Difficulties in Barrier Methods
strong numerical instability: the condition number of the Hessian matrix grows as εk → 0
need for an initial strictly feasible point x0
(Partial) remedy: εk is decreased very slowly, and the solution of the (k+1)-th problem is obtained by starting an unconstrained optimization from xk.
Algorithms for constrained local optimization – p. 13
Example
min(x − 1)2 + (y − 1)2
x + y ≤ 1
Logarithmic Barrier problem:
min (x − 1)² + (y − 1)² − εk log(1 − x − y)
x + y − 1 < 0

Gradient:

2(x − 1) + εk/(1 − x − y)
2(y − 1) + εk/(1 − x − y)

Stationary points: x = y = 3/4 ± √(1 + 4εk)/4 (only the "−" solution is acceptable).
Algorithms for constrained local optimization – p. 14
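The example can be checked numerically: the sketch below evaluates the closed-form stationary point x = y = (3 − √(1 + 4εk))/4 of the barrier problem for decreasing εk, and verifies that it stays strictly feasible and approaches the constrained optimum (1/2, 1/2). The list of εk values is an illustrative choice.

```python
import math

def barrier_minimizer(eps):
    # Stationary point of (x-1)^2 + (y-1)^2 - eps*log(1 - x - y), on the line x = y
    return (3.0 - math.sqrt(1.0 + 4.0 * eps)) / 4.0

def barrier_grad(x, eps):
    # d/dx of the barrier objective evaluated at the symmetric point (x, x)
    return 2.0 * (x - 1.0) + eps / (1.0 - 2.0 * x)

path = [barrier_minimizer(eps) for eps in (1.0, 0.1, 1e-3, 1e-10)]
```

The points trace the central path of this problem, moving monotonically towards the boundary as εk ↓ 0.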
Barrier methods and L.P.
min cT x
Ax = b
x ≥ 0
Logarithmic Barrier on x ≥ 0:
min cT x − ε ∑j log xj
Ax = b
x > 0
Algorithms for constrained local optimization – p. 15
The central path
The starting point is usually associated with ε = ∞ and is the unique solution of

min −∑j log xj
Ax = b
x > 0
The trajectory x(ε) of solutions to the barrier problem is calledthe central path and leads to an optimal solution of the LP.
Algorithms for constrained local optimization – p. 16
Penalty Methods
Penalized problem:
min f(x) + ρP(x)

where ρ > 0 and P(x) ≥ 0, with P(x) = 0 if x is feasible. Example: for

min f(x)
hi(x) = 0 i = 1, . . . , m

a penalized problem might be:

min f(x) + ρ ∑i hi(x)²
Algorithms for constrained local optimization – p. 17
Convergence of the quadratic penalty method
(for equality constrained problems): let

P(x; ρ) = f(x) + ρ ∑i hi(x)²

Given ρ0 > 0, x0 ∈ Rn, k = 0, let

xk+1 = arg min_x P(x; ρk)

(found with an iterative method initialized at xk); then let ρk+1 > ρk, k := k + 1.
If xk+1 is a global minimizer of P and ρk → ∞, then every limit point of xk is a global optimum of the constrained problem.
Algorithms for constrained local optimization – p. 18
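A one-dimensional sketch (the problem is an illustrative choice): for min x subject to x − 1 = 0, the penalized minimizer of x + ρ(x − 1)² has the closed form x(ρ) = 1 − 1/(2ρ), so the iterates are always infeasible and reach the constrained solution only in the limit ρ → ∞.

```python
def penalty_minimizer(rho):
    # argmin of x + rho*(x - 1)^2, from the stationarity condition
    # 1 + 2*rho*(x - 1) = 0
    return 1.0 - 1.0 / (2.0 * rho)

xs = [penalty_minimizer(rho) for rho in (1.0, 10.0, 100.0, 1e6)]
```

This illustrates why the quadratic penalty is not exact: no finite ρ recovers x⋆ = 1 exactly, unlike the ℓ1 penalty discussed next.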
Exact penalties
Exact penalties: there exists a finite penalty parameter value such that the optimal solution of the penalized problem is the optimal solution of the original one.
ℓ1 penalty function:

P1(x; ρ) = f(x) + ρ ∑i |hi(x)|
Algorithms for constrained local optimization – p. 19
Exact penalties
For problems with both equality and inequality constraints:

min f(x)
hi(x) = 0
gj(x) ≤ 0

the penalized problem is

P1(x; ρ) = f(x) + ρ ∑i |hi(x)| + ρ ∑j max(0, gj(x))
Algorithms for constrained local optimization – p. 20
Augmented Lagrangian method
Given an equality constrained problem, reformulate it as:

min f(x) + (1/2)ρ‖h(x)‖²
h(x) = 0

The Lagrange function of this problem is called the Augmented Lagrangian:

Lρ(x; λ) = f(x) + (1/2)ρ‖h(x)‖² + λTh(x)
Algorithms for constrained local optimization – p. 21
Motivation
min_x f(x) + (1/2)ρ‖h(x)‖² + λTh(x)

∇xLρ(x, λ) = ∇f(x) + ∑i λi∇hi(x) + ρ ∑i hi(x)∇hi(x)
           = ∇xL(x, λ) + ρ ∑i hi(x)∇hi(x)

∇²xxLρ(x, λ) = ∇²f(x) + ∑i λi∇²hi(x) + ρ ∑i hi(x)∇²hi(x) + ρ∇h(x)∇Th(x)
             = ∇²xxL(x, λ) + ρ ∑i hi(x)∇²hi(x) + ρ∇h(x)∇Th(x)
Algorithms for constrained local optimization – p. 22
motivation . . .
Let (x⋆, λ⋆) be an optimal (primal and dual) solution. Necessarily ∇xL(x⋆, λ⋆) = 0; moreover h(x⋆) = 0, thus

∇xLρ(x⋆, λ⋆) = ∇xL(x⋆, λ⋆) + ρ ∑i hi(x⋆)∇hi(x⋆) = 0

⇒ (x⋆, λ⋆) is a stationary point of the augmented Lagrangian.
Algorithms for constrained local optimization – p. 23
motivation . . .
Observe that, at x⋆ (where h(x⋆) = 0):

∇²xxLρ(x⋆, λ) = ∇²xxL(x⋆, λ) + ρ ∑i hi(x⋆)∇²hi(x⋆) + ρ∇h(x⋆)∇Th(x⋆)
              = ∇²xxL(x⋆, λ) + ρ∇h(x⋆)∇Th(x⋆)

Assume that the sufficient optimality conditions hold:

vT∇²xxL(x⋆, λ⋆)v > 0 ∀ v ≠ 0 : vT∇h(x⋆) = 0
Algorithms for constrained local optimization – p. 24
. . .
Let v ≠ 0 : vT∇h(x⋆) = 0. Then

vT∇²xxLρ(x⋆, λ⋆)v = vT∇²xxL(x⋆, λ⋆)v + ρ vT∇h(x⋆)∇Th(x⋆)v
                  = vT∇²xxL(x⋆, λ⋆)v > 0
Algorithms for constrained local optimization – p. 25
. . .
Let v ≠ 0 : vT∇h(x⋆) ≠ 0. Then

vT∇²xxLρ(x⋆, λ⋆)v = vT∇²xxL(x⋆, λ⋆)v + ρ vT∇h(x⋆)∇Th(x⋆)v
                  = vT∇²xxL(x⋆, λ⋆)v + ρ‖vT∇h(x⋆)‖²

which might be negative. However, ∃ ρ̄ > 0 such that if ρ ≥ ρ̄ ⇒ vT∇²xxLρ(x⋆, λ⋆)v > 0.
Thus, if ρ is large enough, the Hessian of the augmented Lagrangian is positive definite and x⋆ is a (strict) local minimum of Lρ(·, λ⋆).
Algorithms for constrained local optimization – p. 26
Inequality constraints
min f(x)
g(x) ≤ 0
Nonlinear transformation of inequalities into equalities:
min_{x,s} f(x)
gj(x) + sj² = 0   j = 1, . . . , p
Algorithms for constrained local optimization – p. 27
Given the problem
min f(x)
hi(x) = 0 i = 1,m
gj(x) ≤ 0 j = 1, p
an Augmented Lagrangian problem might be defined as
min_{x,z} Lρ(x, z; λ, µ) = min_{x,z} f(x) + λTh(x) + (1/2)ρ‖h(x)‖² + ∑j µj(gj(x) + zj²) + (1/2)ρ ∑j (gj(x) + zj²)²
Algorithms for constrained local optimization – p. 28
. . .
Consider minimization with respect to the z variables:

min_z ∑j µj(gj(x) + zj²) + (1/2)ρ ∑j (gj(x) + zj²)²
= min_{u ≥ 0} ∑j µj(gj(x) + uj) + (1/2)ρ(gj(x) + uj)²

(a quadratic minimization over the non-negative orthant). Solution:

u⋆j = max{0, ūj}

where ūj is the unconstrained optimum:

µj + ρ(gj(x) + ūj) = 0
Algorithms for constrained local optimization – p. 29
. . .
Thus:

u⋆j = max{0, −µj/ρ − gj(x)}

Substituting:

Lρ(x; λ, µ) = f(x) + λTh(x) + (1/2)ρ‖h(x)‖² + (1/(2ρ)) ∑j (max{0, µj + ρgj(x)}² − µj²)

This is an Augmented Lagrangian for inequality constrained problems.
Algorithms for constrained local optimization – p. 30
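The derivation above can be exercised numerically. The following is a minimal sketch (not a production method) of the augmented Lagrangian iteration for a single inequality constraint, using the eliminated-slack form Lρ(x; µ) = f(x) + (1/(2ρ))(max{0, µ + ρg(x)}² − µ²) and the standard first-order multiplier update; the toy problem and penalty value are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

def f(x):                      # objective: min x1^2 + x2^2
    return x[0]**2 + x[1]**2

def g(x):                      # constraint x1 + x2 >= 1, written as g(x) <= 0
    return 1.0 - x[0] - x[1]

def L_rho(x, mu, rho):
    # augmented Lagrangian after eliminating the slack variable
    return f(x) + (max(0.0, mu + rho * g(x))**2 - mu**2) / (2.0 * rho)

mu, rho, x = 0.0, 10.0, np.zeros(2)
for _ in range(20):
    # inner unconstrained minimization of L_rho(., mu)
    x = minimize(L_rho, x, args=(mu, rho)).x
    # multiplier update; the max keeps mu >= 0 automatically
    mu = max(0.0, mu + rho * g(x))

print(np.round(x, 4))          # solution of the toy problem is (0.5, 0.5)
```

With ρ fixed at 10 the multiplier converges geometrically to µ⋆ = 1; in practice ρ would be increased when constraint violation decreases too slowly.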
Sequential Quadratic Programming
min f(x)
hi(x) = 0
Idea: apply Newton’s method to solve the KKT equations. Lagrangian function:
L(x; λ) = f(x) + ∑i λihi(x)
Let H(x) = [hi(x)], ∇H(x) = [∇Thi(x)] (the Jacobian of H). KKT conditions:
F(x; λ) = [ ∇f(x) + ∇HT(x)λ ; H(x) ] = 0
Algorithms for constrained local optimization – p. 31
Newton step for SQP
Jacobian of the KKT system:
F′(x, λ) = [ ∇2xxL(x; λ)  ∇HT(x) ; ∇H(x)  0 ]
Newton step:
[ xk+1 ; λk+1 ] = [ xk ; λk ] + [ dk ; ∆k ]
where
[ ∇2xxL(xk; λk)  ∇HT(xk) ; ∇H(xk)  0 ] [ dk ; ∆k ] = [ −∇f(xk) − ∇HT(xk)λk ; −H(xk) ]
Algorithms for constrained local optimization – p. 32
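The Newton step above can be coded directly. Below is a hedged sketch of the full-step Newton-on-KKT iteration on a toy equality-constrained problem, min x1 + x2 s.t. x1² + x2² = 1 (chosen so that all the ingredients are a few lines each); the starting point and iteration count are illustrative.

```python
import numpy as np

def grad_f(x):  return np.array([1.0, 1.0])               # f(x) = x1 + x2
def h(x):       return np.array([x[0]**2 + x[1]**2 - 1.0])
def jac_h(x):   return np.array([[2*x[0], 2*x[1]]])       # nabla H(x)

def hess_L(x, lam):                                       # nabla^2_xx L
    return lam[0] * 2.0 * np.eye(2)                       # f is linear

x, lam = np.array([0.0, -1.0]), np.array([1.0])
for _ in range(20):
    A = jac_h(x)
    # KKT Jacobian [[hess_L, A^T], [A, 0]] and Newton right-hand side
    K = np.block([[hess_L(x, lam), A.T],
                  [A, np.zeros((1, 1))]])
    rhs = np.concatenate([-grad_f(x) - A.T @ lam, -h(x)])
    step = np.linalg.solve(K, rhs)
    x, lam = x + step[:2], lam + step[2:]

print(np.round(x, 6))   # minimizer is (-1/sqrt(2), -1/sqrt(2))
```

Near the solution the iteration is quadratically convergent; globalization (line search or trust region, as in the slides that follow) is what a real SQP code adds on top of this step.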
existence
The Newton step exists if
the Jacobian of the constraints ∇H(xk) has full row rank
the Hessian ∇2xxL(xk; λk) is positive definite
In this case the Newton step is the unique solution of
∇2xxL(xk; λk)dk + ∇HT(xk)∆k + ∇f(xk) + ∇HT(xk)λk = 0
∇H(xk)dk + H(xk) = 0
Algorithms for constrained local optimization – p. 33
Alternative view: SQP
mind f(xk) + ∇f(xk)Td + (1/2) dT∇2xxL(xk; λk)d
∇H(xk)d + H(xk) = 0
KKT conditions:
∇2xxL(xk; λk)d + ∇f(xk) + ∇HT(xk)Λk = 0
Under the same conditions as before this QP has a unique solution dk with Lagrange multipliers Λk = λk+1
Algorithms for constrained local optimization – p. 34
Alternative view: SQP
mind L(xk, λk) + ∇TxL(xk, λk)d + (1/2) dT∇2xxL(xk; λk)d
∇H(xk)d + H(xk) = 0
KKT conditions:
∇2xxL(xk; λk)d + ∇f(xk) + ∇HT(xk)λk + ∇HT(xk)Λk = 0
Under the same conditions as before this QP has a unique solution dk with Lagrange multipliers Λk = ∆k
Algorithms for constrained local optimization – p. 35
Thus SQP can be seen as a method which
minimizes a quadratic approximation to the Lagrangian
subject to a first order approximation of the constraints.
Algorithms for constrained local optimization – p. 36
Inequalities
If the original problem is
min f(x)
hi(x) = 0
gj(x) ≤ 0
then the SQP iteration solves
mind f(xk) + ∇f(xk)Td + (1/2) dT∇2xxL(xk, λk)d
∇Thi(xk)d + hi(xk) = 0
∇Tgj(xk)d + gj(xk) ≤ 0
Algorithms for constrained local optimization – p. 37
Filter Methods
Basic idea:
min f(x)
g(x) ≤ 0
can be considered as a problem with two objectives:
minimize f(x)
minimize g(x)
(the second objective has priority over the first)
Algorithms for constrained local optimization – p. 38
Filter
Given the problem
min f(x)
gj(x) ≤ 0 j = 1, . . . , k
let us consider the bi-criteria optimization problem
min f(x)
min h(x)
where
h(x) = ∑j max{gj(x), 0}
Algorithms for constrained local optimization – p. 39
Let fk, hk, k = 1, 2, . . . be the values of f and h observed at the points x1, x2, . . .. A pair (fk, hk) dominates a pair (fℓ, hℓ) iff
fk ≤ fℓ and
hk ≤ hℓ
A filter is a list of pairs none of which is dominated by another
Algorithms for constrained local optimization – p. 40
[Figure: filter entries plotted in the (h(x), f(x)) plane]
Algorithms for constrained local optimization – p. 41
Trust region SQP
Consider a trust-region SQP method:
mind fk + ∇L(xk; λk)Td + (1/2) dT∇2xxL(xk; λk)d
∇Tgj(xk)d + gj(xk) ≤ 0
‖d‖∞ ≤ ρ
(the ∞ norm is used here in order to keep the problem a QP).
Traditional (unconstrained) trust region methods: if the current step is a failure ⇒reduce the trust region ⇒eventually the step will become a pure gradient step ⇒convergence!
Algorithms for constrained local optimization – p. 42
Trust region SQP
Here diminishing the trust region radius might lead to infeasible QP’s:
gj(x) ≤ 0
∇Tgj(xk)d + gj(xk) ≤ 0
[Figure: a point xk at which the linearized constraints, intersected with a small trust region, have no feasible point]
Algorithms for constrained local optimization – p. 43
Filter methods
Data: x0: starting point, ρ; k = 0
while convergence criterion not satisfied do
    if QP is infeasible then
        find xk+1 minimizing the constraint violation;
    else
        solve the QP and get a step dk; try setting xk+1 = xk + dk;
        if (fk+1, hk+1) is acceptable to the filter then
            accept xk+1 and add (fk+1, hk+1) to the filter;
            remove dominated points from the filter; possibly increase ρ;
        else
            reject the step; reduce ρ;
        end
    end
    set k = k + 1;
end
Algorithms for constrained local optimization – p. 44
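The filter bookkeeping used by the loop above is tiny. The following is a sketch of the data structure only (acceptance test and dominated-entry removal); the sample pairs are made up for illustration.

```python
def dominates(a, b):
    """Pair a = (f_a, h_a) dominates b = (f_b, h_b) iff a is no worse in both."""
    return a[0] <= b[0] and a[1] <= b[1]

def acceptable(pair, filt):
    """A trial pair is acceptable iff no filter entry dominates it."""
    return not any(dominates(entry, pair) for entry in filt)

def add_to_filter(pair, filt):
    """Insert an accepted pair and drop the entries it dominates."""
    return [e for e in filt if not dominates(pair, e)] + [pair]

filt = []
for trial in [(5.0, 2.0), (4.0, 3.0), (6.0, 1.0), (4.5, 2.5), (3.0, 0.5)]:
    if acceptable(trial, filt):
        filt = add_to_filter(trial, filt)

print(sorted(filt))   # the last pair dominates all earlier entries
```

Note that acceptance in actual filter-SQP codes also requires a small margin of improvement over the filter entries; the plain comparison above is the bare idea.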
Comparison with other methods
[Figure: (h(x), f(x)) plane showing the steps acceptable to a “classical” merit-function method and the steps rejected by the filter]
Algorithms for constrained local optimization – p. 45
Introduction to Global Optimization
Fabio Schoen
2008
http://gol.dsi.unifi.it/users/schoen
Introduction to Global Optimization – p. 1
Global Optimization Problems
minx∈S⊆Rn
f(x)
What is meant by global optimization? Of course we should like to find
f∗ = minx∈S⊆Rn f(x)
and
x∗ = arg min f(x), i.e. x∗ : f(x∗) ≤ f(x) ∀x ∈ S
Introduction to Global Optimization – p. 2
This definition is unsatisfactory:
the problem is “ill posed” in x (two objective functions which differ only slightly might have global optima which are arbitrarily far apart)
it is however well posed in the optimal values: ‖f − g‖ ≤ δ⇒|f∗ − g∗| ≤ ε
Introduction to Global Optimization – p. 3
Quite often we are satisfied with looking for f∗, searching for one or more feasible solutions such that
f(x) ≤ f(x∗) + ε
Frequently, however, this is too ambitious a task!
Introduction to Global Optimization – p. 4
Research in Global Optimization
the problem is highly relevant, especially in applications
the problem is very hard (perhaps too hard) to solve
there are plenty of publications on global optimizationalgorithms for specific problem classes
there are only relatively few papers with relevant theoreticalcontents
often, elegant theories have produced weak algorithms and, vice versa, the best computational methods often lack a sound theoretical support
Introduction to Global Optimization – p. 5
many global optimization papers get published on appliedresearch journals
Bazaraa, Sherali, Shetty “Nonlinear Programming: theoryand algorithms”, 1993:the word “global optimum” appears for the first time on page99, the second time at page 132, then at page 247:“A desirable property of an algorithm for solving [anoptimization] problem is that it generates a sequence ofpoints converging to a global optimal solution. In manycases however we may have to be satisfied with lessfavorable outcomes.”after this (in 638 pages) it never appears anymore. “Globaloptimization” is never cited.
Introduction to Global Optimization – p. 6
Similar situation in Bertsekas, Nonlinear Programming (1999): 777 pages, but only the definition of global minima and maxima is given! Nocedal & Wright, “Numerical Optimization”, 2nd edition, 2006: “Global solutions are needed in some applications, but for many problems they are difficult to recognize and even more difficult to locate. . . many successful global optimization algorithms require the solution of many local optimization problems, to which the algorithms described in this book can be applied”
Introduction to Global Optimization – p. 7
Complexity
Global optimization is “hopeless”: without “global” information no algorithm will find a certifiable global optimum unless it generates a dense sample. There exists a rigorous definition of “global” information – some examples:
the number of local optima
the global optimum value
for global optimization problems over a box, (an upper bound on) the Lipschitz constant:
|f(y) − f(x)| ≤ L‖x − y‖ ∀x, y
concavity of the objective function + convexity of the feasible region
an explicit representation of the objective function as the difference between two convex functions (+ convexity of the feasible region)
Introduction to Global Optimization – p. 8
Complexity
Global optimization is computationally intractable also according to classical complexity theory. Special cases: quadratic programming,
minl≤Ax≤u (1/2)xTQx + cTx
is NP–hard [Sahni, 1974] and, when considered as a decision problem, NP-complete [Vavasis, 1990].
Introduction to Global Optimization – p. 9
Many special cases are still NP–hard:
norm maximization on a parallelotope:
max ‖x‖
b ≤ Ax ≤ c
quadratic optimization on a hyper-rectangle (A = I), even when only one eigenvalue of Q is negative
quadratic minimization over a simplex:
minx≥0 (1/2)xTQx + cTx
∑j xj = 1
Even checking that a point is a local optimum is NP-hard
Introduction to Global Optimization – p. 10
Applications of global optimization
concave minimization – quantity discounts, scale economies
fixed charge
combinatorial optimization – binary linear programming:
min cTx + K xT(1 − x)
Ax = b
x ∈ [0, 1]n
or:
min cTx
Ax = b
x ∈ [0, 1]n
xT(1 − x) = 0
Introduction to Global Optimization – p. 11
Minimization of cost functions which are neither convex nor concave. E.g.: finding the minimum-energy conformation of complex molecules – Lennard-Jones micro-clusters, protein folding, protein–ligand docking.
Example: Lennard-Jones pair potential due to two atoms at X1, X2 ∈ R3:
v(r) = 1/r12 − 2/r6
where r = ‖X1 − X2‖. The total energy of a cluster of N atoms located at X1, . . . , XN ∈ R3 is defined as:
∑i=1,...,N ∑j<i v(‖Xi − Xj‖)
This function has a number of local (non-global) minima which grows like exp(N)
Introduction to Global Optimization – p. 12
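The total energy above is a one-liner per pair. A minimal sketch, in the reduced units of the slide (so the pair minimum is v = −1 at r = 1):

```python
import numpy as np

def lj_energy(X):
    """Total Lennard-Jones energy, v(r) = 1/r^12 - 2/r^6, X: (N, 3) positions."""
    E = 0.0
    N = len(X)
    for i in range(N):
        for j in range(i):
            r = np.linalg.norm(X[i] - X[j])
            E += 1.0 / r**12 - 2.0 / r**6
    return E

# two atoms at the pair-equilibrium distance r = 1
X = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
print(lj_energy(X))   # -1.0
```

Evaluating this energy is cheap; it is the exponential growth of the number of local minima with N that makes global minimization of clusters so hard.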
Lennard-Jones potential
[Figure: the Lennard-Jones pair potential lennard-jones(x) and its attractive(x) and repulsive(x) components, plotted for r between 0.5 and 5]
Introduction to Global Optimization – p. 13
Protein folding and docking
Potential energy model: E = El + Ea + Ed + Ev + Ee where:
El = ∑i∈L (1/2) Kbi (ri − r0i)2
(contribution of pairs of bonded atoms)
Ea = ∑i∈A (1/2) Kθi (θi − θ0i)2
(angle between 3 bonded atoms)
Ed = ∑i∈T (1/2) Kφi [1 + cos(nφi − γ)]
(dihedrals)
Introduction to Global Optimization – p. 14
Ev = ∑(i,j)∈C ( Aij/R12ij − Bij/R6ij )
(van der Waals)
Ee = (1/2) ∑(i,j)∈C qiqj/(εRij)
(Coulomb interaction)
Introduction to Global Optimization – p. 15
Docking
Given two macro-molecules M1, M2, find their minimal energy coupling. If no bonds are changed ⇒to find the optimal docking it is sufficient to minimize:
Ev + Ee = ∑i∈M1,j∈M2 ( Aij/R12ij − Bij/R6ij ) + (1/2) ∑i∈M1,j∈M2 qiqj/(εRij)
Introduction to Global Optimization – p. 16
Main algorithmic strategies
Two main families:
1. with global information (“structured problems”)
2. without global information (“unstructured problems”)
Structured problems ⇒stochastic and deterministic methods
Unstructured problems ⇒typically stochastic algorithms
Every global optimization method should try to find a balance between
exploration of the feasible region
approximations of the optimum
Introduction to Global Optimization – p. 17
Example: Lennard Jones
LJN = min LJ(X) = min ∑i=1,...,N−1 ∑j=i+1,...,N [ 1/‖Xi − Xj‖12 − 2/‖Xi − Xj‖6 ]
This is a highly structured problem. But is it easy/convenient to use its structure? And how?
Introduction to Global Optimization – p. 18
LJ
The map
F1 : R3N 7→ RN(N−1)/2+
F1(X1, . . . , XN) = ( ‖X1 − X2‖2, . . . , ‖XN−1 − XN‖2 )
is convex and the function
F2 : RN(N−1)/2+ 7→ R
F2(r12, . . . , rN−1,N) = ∑ 1/r6ij − 2 ∑ 1/r3ij
(here the rij are squared distances) is the difference between two convex functions. Thus LJ(X) can be seen as the difference between two convex functions (a d.c. programming problem)
Introduction to Global Optimization – p. 19
NB: every C2 function is d.c., but often its d.c. decomposition isnot known.D.C. optimization is very elegant, there exists a nice dualitytheory, but algorithms are typically very inefficient.
Introduction to Global Optimization – p. 20
A primal method for d.c. optimization
“cutting plane” method (just an example: not particularly efficient, and useless for high dimensional problems). Any unconstrained d.c. problem can be represented as an equivalent problem with a linear objective, a convex constraint and a reverse convex constraint. If g, h are convex, then min g(x) − h(x) is equivalent to:
min z
g(x) − h(x) ≤ z
which is equivalent to
min z
g(x) ≤ w
h(x) + z ≥ w
Introduction to Global Optimization – p. 21
D.C. canonical form
min cTx
g(x) ≤ 0
h(x) ≥ 0
where h, g: convex. Let
Ω = {x : g(x) ≤ 0}, C = {x : h(x) ≤ 0}
Hp: 0 ∈ int Ω ∩ int C, cTx > 0 ∀x ∈ Ω \ int C
Fundamental property: if a D.C. problem admits an optimum, at least one optimum belongs to ∂Ω ∩ ∂C
Introduction to Global Optimization – p. 22
Discussion of the assumptions
g(0) < 0, h(0) < 0, cTx > 0 ∀ feasible x. Let x̄ be a solution to the convex problem
min cTx : g(x) ≤ 0
If h(x̄) ≥ 0 then x̄ solves the d.c. problem. Otherwise cTx > cTx̄ for all feasible x. Coordinate transformation y = x − x̄:
min cTy
g̃(y) ≤ 0
h̃(y) ≥ 0
where g̃(y) = g(y + x̄), h̃(y) = h(y + x̄). Then cTy > 0 for all feasible solutions and, by continuity, it is possible to choose x̄ so that g̃(0) < 0 and h̃(0) < 0.
Introduction to Global Optimization – p. 23
[Figure: the sets Ω and C, the origin 0 and the line cTx = 0]
Introduction to Global Optimization – p. 24
Let x̄ be the best known solution. Let
D(x̄) = {x ∈ Ω : cTx ≤ cTx̄}
If D(x̄) ⊆ C then x̄ is optimal.
Check: a polytope P (with known vertices) is built which contains D(x̄). If all vertices of P are in C ⇒x̄ is an optimal solution. Otherwise let v be the vertex with largest h value; the intersection of the segment [0, v] with ∂C (if feasible) is an improving point x̄. Otherwise a cut is introduced in P which is tangent to Ω at x̄.
Introduction to Global Optimization – p. 25
[Figure: Ω, C, the line cTx = 0, the incumbent x̄ and the set D(x̄) = {x ∈ Ω : cTx ≤ cTx̄}]
Introduction to Global Optimization – p. 26
Initialization
Given a feasible solution x̄, take a polytope P such that
P ⊇ D(x̄)
i.e.
cTy ≤ cTx̄, y feasible ⇒y ∈ P
If P ⊂ C, i.e. if y ∈ P ⇒h(y) ≤ 0, then x̄ is optimal. Checking is easy if we know the vertices of P.
Introduction to Global Optimization – p. 27
[Figure: a polytope P with D(x̄) ⊆ P and vertices V1, . . . , Vk; V⋆ := arg maxj h(Vj)]
Introduction to Global Optimization – p. 28
Step 1
Let V⋆ be the vertex with the largest h() value. Surely h(V⋆) > 0 (otherwise we stop with an optimal solution). Moreover h(0) < 0 (0 is in the interior of C). Thus the segment from V⋆ to 0 must intersect the boundary of C. Let xk be the intersection point. It might be feasible (⇒improving) or not.
Introduction to Global Optimization – p. 29
[Figure: the point xk = ∂C ∩ [V⋆, 0]]
Introduction to Global Optimization – p. 30
[Figure: if xk ∈ Ω, set x̄ := xk]
Introduction to Global Optimization – p. 31
[Figure: otherwise, if xk ∉ Ω, the polytope is divided by a cut]
Introduction to Global Optimization – p. 32
Duality for d.c. problems
minx∈S g(x) − h(x)
where g, h: convex. Let
h⋆(u) := sup{uTx − h(x) : x ∈ Rn}
g⋆(u) := sup{uTx − g(x) : x ∈ Rn}
be the conjugate functions of h and g. The problem
inf{h⋆(u) − g⋆(u) : u : h⋆(u) < +∞}
is the Fenchel–Rockafellar dual. If min g(x) − h(x) admits an optimum, then the Fenchel dual is a strong dual.
Introduction to Global Optimization – p. 33
If x⋆ ∈ arg min g(x) − h(x) then
u⋆ ∈ ∂h(x⋆)
(∂ denotes subdifferential) is dual optimal and ifu⋆ ∈ arg minh⋆(u) − g⋆(u) then
x⋆ ∈ ∂g⋆(u⋆)
is an optimal primal solution.
Introduction to Global Optimization – p. 34
A primal/dual algorithm
Pk : min g(x) − (h(xk) + (x − xk)Tyk)
and
Dk : min h⋆(y) − (g⋆(yk−1) + xTk(y − yk−1))
Introduction to Global Optimization – p. 35
Exact Global Optimization
Introduction to Global Optimization – p. 36
GlobOpt - relaxations
Consider the global optimization problem (P):
min f(x)
x ∈ X
and assume the min exists and is finite and that we can use arelaxation (R):
min g(y)
y ∈ Y
Usually both X and Y are subsets of the same space Rn.
Recall: (R) is a relaxation of (P) iff:
X ⊆ Y
g(x) ≤ f(x) for all x ∈ X
Introduction to Global Optimization – p. 37
Branch and Bound
1. Solve the relaxation (R) and let L be its (global) optimum value (assume (R) is feasible)
2. (Heuristically) solve the original problem (P) (or, moregenerally, find a “good” feasible solution to (P) in X). Let Ube the best feasible function value known
3. if U − L ≤ ε then stop: U is a certified ε–optimum for (P)
4. otherwise split X and Y into two parts and apply to each ofthem the same method
Introduction to Global Optimization – p. 38
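The four steps above can be sketched on a one-dimensional problem. In this sketch a Lipschitz lower bound on each interval stands in for the relaxation (R) (the Lipschitz constant is exactly the kind of "global information" discussed earlier), and the midpoint evaluation plays the role of the heuristic; all numbers are illustrative.

```python
import heapq, math

def f(x):
    return x * math.sin(x)       # on [0, 8] the global min is near x ~ 4.91

K = 9.0                          # |f'(x)| <= |sin x| + |x cos x| <= 9 on [0, 8]
eps = 1e-4
a, b = 0.0, 8.0

U = min(f(a), f(b))              # incumbent upper bound (step 2)
# lower bound on a box: min of endpoint values minus K * half-width (step 1)
heap = [(min(f(a), f(b)) - K*(b - a)/2, a, b)]
while heap:
    L, a, b = heapq.heappop(heap)
    if U - L <= eps:             # step 3: U is a certified eps-optimum
        break
    m = 0.5 * (a + b)            # step 4: split, bound each half
    U = min(U, f(m))
    for lo, hi in ((a, m), (m, b)):
        lb = min(f(lo), f(hi)) - K*(hi - lo)/2
        if lb < U - eps:         # discard boxes that cannot improve U
            heapq.heappush(heap, (lb, lo, hi))

print(round(U, 3))               # about -4.814
```

Best-first selection (the heap is ordered by lower bound) is one common node-selection rule; depth-first selection trades certification speed for memory.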
Tools
“good relaxations”: easy yet accurate
good upper bounding, i.e., good heuristics for (P)
Good relaxations can be obtained, e.g., through:
convex relaxations
domain reduction
Introduction to Global Optimization – p. 39
Convex relaxations
Assume X is convex and Y = X. If g is the convex envelope of f on X, then solving the convex relaxation (R) gives, in one step, the certified global optimum of (P).
g(x) is a convex under-estimator of f on X if:
g(x) is convex
g(x) ≤ f(x) ∀x ∈ X
g is the convex envelope of f on X if:
g is a convex under-estimator of f
g(x) ≥ h(x) ∀x ∈ X, ∀h : convex under-estimator of f
Introduction to Global Optimization – p. 40
A 1-D example
Introduction to Global Optimization – p. 41
Convex under-estimator
Introduction to Global Optimization – p. 42
Branching
Introduction to Global Optimization – p. 43
Bounding
[Figure: branching tree with the upper bound, the lower bounds and the fathomed nodes]
Introduction to Global Optimization – p. 44
Relaxation of the feasible domain
Let
minx∈S
f(x)
be a GlobOpt problem where f is convex, while S is non convex.A relaxation (outer approximation) is obtained replacing S with alarger set Q. If Q is convex ⇒convex optimization problem.If the optimal solution to
minx∈Q
f(x)
belongs to S ⇒optimal solution to the original problem.
Introduction to Global Optimization – p. 45
Example
minx∈[0,5],y∈[0,3]
−x− 2y
xy ≤ 3
[Figure: the nonconvex feasible region of the example in the box [0, 5] × [0, 3]]
Introduction to Global Optimization – p. 46
Relaxation
minx∈[0,5],y∈[0,3]
−x− 2y
xy ≤ 3
We know that:
(x+ y)2 = x2 + y2 + 2xy
thus
xy = ((x+ y)2 − x2 − y2)/2
and, as x and y are non-negative, x2 ≤ 5x, y2 ≤ 3y, thus a(convex) relaxation of xy ≤ 3 is
(x+ y)2 − 5x− 3y ≤ 6
(a convex constraint)
Introduction to Global Optimization – p. 47
Relaxation
[Figure: the convex relaxed feasible region]
Optimal solution of the relaxed convex problem: (2, 3) (value:−8)
Introduction to Global Optimization – p. 48
Stronger Relaxation
minx∈[0,5],y∈[0,3]
−x− 2y
xy ≤ 3
Thus:
(5 − x)(3 − y) ≥ 0 ⇒15 − 3x − 5y + xy ≥ 0 ⇒ xy ≥ 3x + 5y − 15
and a (convex) relaxation of xy ≤ 3 is
3x + 5y − 15 ≤ 3
i.e.: 3x + 5y ≤ 18
Introduction to Global Optimization – p. 49
Relaxation
[Figure: the linear relaxation 3x + 5y ≤ 18 intersected with the box]
The optimal solution of the convex (linear) relaxation is (1, 3)which is feasible ⇒optimal for the original problem
Introduction to Global Optimization – p. 50
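The claim about the stronger relaxation is easy to verify numerically: the LP min −x − 2y s.t. 3x + 5y ≤ 18 over the box has optimum (1, 3), which also satisfies xy ≤ 3 and is therefore optimal for the original nonconvex problem. A quick check:

```python
from scipy.optimize import linprog

res = linprog(c=[-1.0, -2.0],          # minimize -x - 2y
              A_ub=[[3.0, 5.0]],       # 3x + 5y <= 18
              b_ub=[18.0],
              bounds=[(0, 5), (0, 3)])
x, y = res.x
print(round(x, 6), round(y, 6))        # 1.0 3.0
print(x * y <= 3.0 + 1e-9)             # feasible for the original constraint
```

The first, weaker relaxation gives (2, 3), which violates xy ≤ 3 and hence only yields a lower bound of −8 instead of the optimum −7.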
Convex (concave) envelopes
How to build convex envelopes of a function, or how to relax a non convex constraint?
Convex envelopes ⇒lower bounds
Convex envelopes of −f(x) ⇒upper bounds
Constraint g(x) ≤ 0 ⇒if h(x) is a convex under-estimator of g then h(x) ≤ 0 is a convex relaxation.
Constraint g(x) ≥ 0 ⇒if h(x) is concave and h(x) ≥ g(x), then h(x) ≥ 0 is a “convex” constraint
Introduction to Global Optimization – p. 51
Convex envelopes
Definition: a function is polyhedral if it is the pointwise maximum of a finite number of linear functions. (NB: in general, the convex envelope is the pointwise supremum of affine minorants.)
The generating set X of a function f over a convex set P is the set
X = {x ∈ Rn : (x, f(x)) is a vertex of epi(convP(f))}
I.e., given f we first build its convex envelope on P and then take the epigraph {(x, y) : x ∈ P, y ≥ convP f(x)}. This is a convex set; let V denote its extreme points. X is the set of x coordinates of the points of V.
Introduction to Global Optimization – p. 52
Generating sets
Introduction to Global Optimization – p. 53
Introduction to Global Optimization – p. 54
Characterization
Let f(x) be continuously differentiable on a polytope P. The convex envelope of f on P is polyhedral if and only if
X(f) = Vert(P)
(the generating set is the vertex set of P).
Corollary: let f1, . . . , fm ∈ C1(P) and let ∑i fi(x) possess a polyhedral convex envelope on P. Then
Conv(∑i fi(x)) = ∑i Conv fi(x)
iff the generating set of ∑i Conv(fi(x)) is Vert(P)
Introduction to Global Optimization – p. 55
Characterization
If f(x) is such that Conv f(x) is polyhedral, then an affine function h(x) such that
1. h(x) ≤ f(x) for all x ∈ Vert(P)
2. there exist n + 1 affinely independent vertices of P, V1, . . . , Vn+1, such that
f(Vi) = h(Vi) i = 1, . . . , n + 1
belongs to the polyhedral description of Conv f(x) and
h(x) = Conv f(x)
for any x ∈ conv(V1, . . . , Vn+1).
Introduction to Global Optimization – p. 56
Characterization
The condition may be reversed: given m affine functions h1, . . . , hm such that, for each of them,
1. hj(x) ≤ f(x) for all x ∈ Vert(P)
2. there exist n + 1 affinely independent vertices of P, V1, . . . , Vn+1, such that
f(Vi) = hj(Vi) i = 1, . . . , n + 1
then the function ψ(x) = maxj hj(x) is the (polyhedral) convex envelope of f iff
the generating set of ψ is Vert(P)
for every vertex Vi we have ψ(Vi) = f(Vi)
Introduction to Global Optimization – p. 57
Sufficient condition
If f(x) is lower semi-continuous on P and for all x̄ ∉ Vert(P) there exists a line ℓx̄ such that x̄ is in the interior of P ∩ ℓx̄ and f is concave in a neighborhood of x̄ on ℓx̄, then Conv f(x) is polyhedral.
Application: let
f(x) = ∑i,j αij xi xj
The sufficient condition holds for f on [0, 1]n ⇒bilinear forms are polyhedral on a hypercube
Introduction to Global Optimization – p. 58
Application: a bilinear term
(Al-Khayyal, Falk (1983)): let x ∈ [ℓx, ux], y ∈ [ℓy, uy]. Then the convex envelope of xy on [ℓx, ux] × [ℓy, uy] is
φ(x, y) = max{ℓyx + ℓxy − ℓxℓy; uyx + uxy − uxuy}
In fact φ(x, y) is an under-estimate of xy:
(x − ℓx)(y − ℓy) ≥ 0 ⇒ xy ≥ ℓyx + ℓxy − ℓxℓy
and analogously (x − ux)(y − uy) ≥ 0 ⇒ xy ≥ uyx + uxy − uxuy
Introduction to Global Optimization – p. 59
Bilinear terms
xy ≥ φ(x, y) = max{ℓyx + ℓxy − ℓxℓy; uyx + uxy − uxuy}
No other (polyhedral) function under-estimating xy is tighter. In fact ℓyx + ℓxy − ℓxℓy belongs to the convex envelope: it under-estimates xy and coincides with xy at 3 vertices ((ℓx, ℓy), (ℓx, uy), (ux, ℓy)). Analogously for the other affine function. All vertices are interpolated by these 2 under-estimating hyperplanes ⇒they form the convex envelope of xy
Introduction to Global Optimization – p. 60
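Both defining properties of the envelope (under-estimation everywhere, interpolation at the box vertices) can be checked numerically. A small sketch on an arbitrary box:

```python
import itertools, random

def phi(x, y, lx, ux, ly, uy):
    # Al-Khayyal/Falk lower envelope of x*y on [lx, ux] x [ly, uy]
    return max(ly*x + lx*y - lx*ly, uy*x + ux*y - ux*uy)

lx, ux, ly, uy = -1.0, 2.0, 0.5, 3.0
random.seed(0)

# under-estimation at random points of the box
for _ in range(1000):
    x = random.uniform(lx, ux)
    y = random.uniform(ly, uy)
    assert phi(x, y, lx, ux, ly, uy) <= x*y + 1e-12

# exact interpolation at the four vertices
for x, y in itertools.product((lx, ux), (ly, uy)):
    assert abs(phi(x, y, lx, ux, ly, uy) - x*y) < 1e-12

print("ok")
```

The two symmetric inequalities (x − ℓx)(uy − y) ≥ 0 and (ux − x)(y − ℓy) ≥ 0 give, in the same way, the concave over-estimators needed to relax a constraint of the form xy ≥ c.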
All easy then?
Of course not! Many things can go wrong . . .
It is true that, on the hypercube, a bilinear form ∑i<j αij xi xj is polyhedral (easy to see), but we cannot guarantee in general that the generating set of the envelope is the set of vertices of the hypercube (in particular, if the α’s have opposite signs)
if the set is not a hypercube, even a bilinear term might be non polyhedral: e.g. xy on the triangle 0 ≤ x ≤ y ≤ 1
Finding the (polyhedral) convex envelope of a bilinear form on a generic polytope P is NP–hard!
Introduction to Global Optimization – p. 61
Fractional terms
A convex under-estimate w of a fractional term x/y over a box can be obtained through
w ≥ ℓx/y + x/uy − ℓx/uy if ℓx ≥ 0
w ≥ x/uy − ℓxy/(ℓyuy) + ℓx/ℓy if ℓx < 0
w ≥ ux/y + x/ℓy − ux/ℓy if ℓx ≥ 0
w ≥ x/ℓy − uxy/(ℓyuy) + ux/uy if ℓx < 0
(a better underestimate exists)
Introduction to Global Optimization – p. 62
Univariate concave terms
If f(x), x ∈ [ℓx, ux], is concave, then the convex envelope is simply its linear interpolation at the extremes of the interval:
f(ℓx) + [(f(ux) − f(ℓx))/(ux − ℓx)] (x − ℓx)
Introduction to Global Optimization – p. 63
Underestimating a general nonconvex function
Let f(x) ∈ C2 be a general non convex function. Then a convex under-estimate on a box can be defined as
φ(x) = f(x) − ∑i=1,...,n αi(xi − ℓi)(ui − xi)
where αi > 0 are parameters. The Hessian of φ is
∇2φ(x) = ∇2f(x) + 2 diag(α)
and φ is convex iff ∇2φ(x) is positive semi-definite on the box.
Introduction to Global Optimization – p. 64
How to choose the αi’s? One possibility is the uniform choice αi = α. In this case convexity of φ is obtained iff
α ≥ max{0, −(1/2) minx∈[ℓ,u] λmin(x)}
where λmin(x) is the minimum eigenvalue of ∇2f(x)
Introduction to Global Optimization – p. 65
Key properties
φ(x) ≤ f(x)
φ interpolates f at all vertices of [ℓ, u]
φ is convex
Maximum separation:
max(f(x) − φ(x)) = (α/4) ∑i (ui − ℓi)2
Thus the error in under-estimation decreases when the box is split.
Introduction to Global Optimization – p. 66
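In one dimension all three properties are easy to check by hand. A sketch for f(x) = sin(x) on [0, 2π], where f''(x) = −sin(x) ≥ −1, so α = 1/2 already makes φ convex:

```python
import math

l, u = 0.0, 2.0 * math.pi
alpha = 0.5                     # f'' = -sin(x) >= -1, so phi'' = -sin(x) + 2*alpha >= 0
f = lambda x: math.sin(x)
phi = lambda x: f(x) - alpha * (x - l) * (u - x)

xs = [l + i * (u - l) / 1000 for i in range(1001)]
# under-estimation on the whole interval, interpolation at the endpoints
assert all(phi(x) <= f(x) + 1e-12 for x in xs)
assert abs(phi(l) - f(l)) < 1e-12 and abs(phi(u) - f(u)) < 1e-12

# maximum separation equals (alpha/4)*(u - l)^2, attained at the midpoint
gap = max(f(x) - phi(x) for x in xs)
print(round(gap, 4), round(alpha / 4 * (u - l)**2, 4))
```

The separation (α/4)(u − ℓ)² shrinks by a factor of 4 each time the interval is halved, which is exactly why the bound tightens along a branch-and-bound tree.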
Estimation of α
Compute an interval Hessian [H] on [ℓ, u], with [H(x)]ij = [hLij, hUij]. Find α such that [H] + 2 diag(α) ⪰ 0.
Gerschgorin theorem for real matrices:
λmin ≥ mini ( hii − ∑j≠i |hij| )
Extension to interval matrices:
λmin ≥ mini ( hLii − ∑j≠i max{|hLij|, |hUij|} (uj − ℓj)/(ui − ℓi) )
Introduction to Global Optimization – p. 67
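The interval Gerschgorin bound is a few lines of code. A sketch for the case of uniform box widths, where the (uj − ℓj)/(ui − ℓi) factors are all 1; the interval Hessian entries below are made up for illustration:

```python
def interval_gerschgorin(hL, hU):
    """Lower bound on lambda_min of every H with hL[i][j] <= H[i][j] <= hU[i][j]."""
    n = len(hL)
    return min(
        hL[i][i] - sum(max(abs(hL[i][j]), abs(hU[i][j]))
                       for j in range(n) if j != i)
        for i in range(n)
    )

# interval Hessian [H] with entries [hL_ij, hU_ij]
hL = [[2.0, -1.5], [-1.5, 1.0]]
hU = [[4.0,  0.5], [ 0.5, 3.0]]
lam_bound = interval_gerschgorin(hL, hU)
print(lam_bound)                       # 1.0 - 1.5 = -0.5

# the uniform alpha that guarantees convexity of phi
alpha = max(0.0, -0.5 * lam_bound)
print(alpha)                           # 0.25
```

Tighter (and costlier) eigenvalue bounds for interval matrices exist; Gerschgorin is popular because it is O(n²) and always valid.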
Improvements
new relaxation functions (other than quadratic). Example:
Φ(x; γ) = −∑i=1,...,n (1 − eγi(xi−ℓi))(1 − eγi(ui−xi))
gives a tighter under-estimate than the quadratic function
partitioning: partition the domain into a small number of regions (hyper-rectangles); evaluate a convex under-estimator in each region; join the under-estimators to form a single convex function on the whole domain
Introduction to Global Optimization – p. 68
Domain (range) reduction
Techniques for cutting the feasible region without cutting the global optimum solution. Simplest approaches: feasibility-based and optimality-based range reduction (RR). Let the problem be:
minx∈S f(x)
Feasibility-based RR asks for solving
ℓi = min{xi : x ∈ S}    ui = max{xi : x ∈ S}
for all i ∈ {1, . . . , n} and then adding the constraints x ∈ [ℓ, u] to the problem (or to the sub-problems generated during Branch & Bound)
Introduction to Global Optimization – p. 69
Feasibility Based RR
If S is a polyhedron, RR requires the solution of LP’s:
[ℓ, u] = min /maxx
Ax ≤ b
x ∈ [L,U ]
“Poor man’s” L.P.-based RR: from every constraint ∑j aij xj ≤ bi in which aik > 0 we get
xk ≤ (1/aik) ( bi − ∑j≠k aij xj )
⇒
xk ≤ (1/aik) ( bi − ∑j≠k min{aij Lj, aij Uj} )
Introduction to Global Optimization – p. 70
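The poor man's bound costs one pass over the constraint coefficients. A sketch for a single linear constraint and a box [L, U]:

```python
def poor_mans_rr(a, b, L, U):
    """Tighten upper bounds using sum_j a[j]*x[j] <= b and the box [L, U]."""
    U_new = list(U)
    for k in range(len(a)):
        if a[k] > 0:
            # worst-case (smallest) contribution of the other variables
            rest = sum(min(a[j] * L[j], a[j] * U[j])
                       for j in range(len(a)) if j != k)
            U_new[k] = min(U[k], (b - rest) / a[k])
    return U_new

# constraint 2*x0 + x1 - x2 <= 4 with box [0, 10]^3
a, b = [2.0, 1.0, -1.0], 4.0
L, U = [0.0, 0.0, 0.0], [10.0, 10.0, 10.0]
print(poor_mans_rr(a, b, L, U))   # [7.0, 10.0, 10.0]
```

The symmetric formula with aik < 0 tightens lower bounds; iterating over all constraints until no bound moves is a simple constraint-propagation scheme, much cheaper (but weaker) than solving the 2n LPs.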
Optimality Based RR
Given an incumbent solution x̄ ∈ S, ranges are updated by solving the sequence:
ℓi = min{xi : f̃(x) ≤ f(x̄), x ∈ S}    ui = max{xi : f̃(x) ≤ f(x̄), x ∈ S}
where f̃(x) is a convex under-estimate of f on the current domain. RR can be applied iteratively (i.e., at the end of a complete RR sequence, we might start a new one using the new bounds)
Introduction to Global Optimization – p. 71
generalization
minx∈X f(x) (P)
g(x) ≤ 0
a (non convex) problem; let
minx∈X f̃(x) (R)
g̃(x) ≤ 0
be a convex relaxation of (P):
{x ∈ X : g(x) ≤ 0} ⊆ {x ∈ X : g̃(x) ≤ 0} and
x ∈ X, g̃(x) ≤ 0 ⇒f̃(x) ≤ f(x)
Introduction to Global Optimization – p. 72
R.H.S. perturbation
Let
φ(y) = minx∈X f̃(x) (Ry)
g̃(x) ≤ y
be a perturbation of (R). (R) convex ⇒(Ry) convex for any y. Let x̄ be an optimal solution of (R) and assume that the i–th constraint is active:
g̃i(x̄) = 0
Then, if xy is an optimal solution of (Ry), the constraint g̃i(x) ≤ yi is active at xy whenever yi ≤ 0
Introduction to Global Optimization – p. 73
Duality
Assume (R) has a finite optimum at x with value φ(0) andLagrange multipliers µ. Then the hyperplane
H(y) = φ(0) − µTy
is a supporting hyperplane of the graph of φ(y) at y = 0, i.e.
φ(y) ≥ φ(0) − µTy ∀ y ∈ Rm
Introduction to Global Optimization – p. 74
Main result
If (R) is convex with optimum value φ(0), constraint i is active at the optimum and the corresponding Lagrange multiplier is µi > 0, then, if U is an upper bound for the original problem (P), the constraint
g̃i(x) ≥ −(U − L)/µi
(where L = φ(0)) is valid for the original problem (P), i.e. it does not exclude any feasible solution with value better than U.
Introduction to Global Optimization – p. 75
proof
Problem (Ry) can be seen as a convex relaxation of the perturbed non convex problem
Φ(y) = minx∈X f(x)
g(x) ≤ y
and thus φ(y) ≤ Φ(y); under-estimating (Ry) produces an under-estimate of Φ(y). Let y := ei yi; from duality,
L − µT ei yi ≤ φ(ei yi) ≤ Φ(ei yi)
If yi < 0 then U is an upper bound also for Φ(ei yi), thus L − µi yi ≤ U. But if yi < 0 then constraint i is active: for any feasible x there exists yi < 0 such that g̃i(x) ≤ yi is active, so we may substitute yi with g̃i(x) and deduce L − µi g̃i(x) ≤ U
Introduction to Global Optimization – p. 76
Applications
Range reduction: let x ∈ [ℓ, u] in the convex relaxed problem. If variable xi is at its upper bound in the optimal solution, then we can deduce
xi ≥ max{ℓi, ui − (U − L)/λi}
where λi is the optimal multiplier associated with the i–th upper bound. Analogously, for active lower bounds:
xi ≤ min{ui, ℓi + (U − L)/λi}
Introduction to Global Optimization – p. 77
Let the constraint
aiTx ≤ bi
be active at an optimal solution of the convex relaxation (R). Then we can deduce the valid inequality
aiTx ≥ bi − (U − L)/µi
Introduction to Global Optimization – p. 78
Methods based on “merit functions”
Bayesian algorithm: the objective function is considered as a realization of a stochastic process
f(x) = F(x; ω)
A loss function is defined, e.g.:
L(x1, . . . , xn; ω) = mini=1,...,n F(xi; ω) − minx F(x; ω)
and the next point to sample is placed in order to minimize the expected loss (or risk):
xn+1 = arg min E ( L(x1, . . . , xn, xn+1) | x1, . . . , xn )
= arg min E ( min(F(xn+1; ω), mini F(xi; ω)) − minx F(x; ω) | x1, . . . , xn )
Introduction to Global Optimization – p. 79
Radial basis method
Given k observations (x1, f1), . . . , (xk, fk), an interpolant is built:
s(x) = ∑i=1,...,k λi Φ(‖x − xi‖) + p(x)
p: polynomial of a (prefixed) small degree m. Φ: radial function like, e.g.:
Φ(r) = r linear
Φ(r) = r3 cubic
Φ(r) = r2 log r thin plate spline
Φ(r) = e−γr2 gaussian
The polynomial p is necessary to guarantee the existence of a unique interpolant (i.e., when the matrix Φij = Φ(‖xi − xj‖) alone would be singular)
Introduction to Global Optimization – p. 80
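Fitting such an interpolant amounts to one linear solve. A 1-D sketch with the cubic basis Φ(r) = r³ and a linear polynomial tail; the data points are illustrative, and the extra rows impose the usual orthogonality conditions ∑i λi = 0, ∑i λi xi = 0 that pin down the polynomial:

```python
import numpy as np

def fit_rbf(x, f):
    """Cubic RBF interpolant s(t) = sum_i lam_i |t - x_i|^3 + c0 + c1*t."""
    n = len(x)
    Phi = np.abs(x[:, None] - x[None, :])**3
    P = np.column_stack([np.ones(n), x])        # linear tail [1, t]
    A = np.block([[Phi, P],
                  [P.T, np.zeros((2, 2))]])
    coef = np.linalg.solve(A, np.concatenate([f, np.zeros(2)]))
    lam, c = coef[:n], coef[n:]
    def s(t):
        return np.abs(t - x)**3 @ lam + c[0] + c[1] * t
    return s

xs = np.array([0.0, 1.0, 2.5, 4.0])
fs = np.sin(xs)
s = fit_rbf(xs, fs)
print(np.allclose([s(t) for t in xs], fs))      # interpolates the data
```

For distinct points the augmented system is nonsingular (the cubic kernel is conditionally positive definite of order 2), which is exactly why the polynomial tail is needed.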
“Bumpiness”
Let f⋆k be an estimate of the value of the global optimum after k observations. Let syk be the (unique) interpolant of the data points
(xi, fi), i = 1, . . . , k, and (y, f⋆k)
Idea: the most likely location of y is such that the resulting interpolant has minimum “bumpiness”. Bumpiness measure:
σ(syk) = (−1)m+1 ∑i λi syk(xi)
Introduction to Global Optimization – p. 81
TO BE DONE
Introduction to Global Optimization – p. 82
Stochastic methods
Pure Random Search - random uniform sampling over thefeasible region
Best start: like Pure Random Search, but a local search isstarted from the best observation
Multistart: Local searches started from randomly generatedstarting points
Introduction to Global Optimization – p. 83
[Figure: a multimodal function on [0, 5] with uniformly sampled points and their function values]
Introduction to Global Optimization – p. 84
[Figure: the same sample, with a local search started from the best observation]
Introduction to Global Optimization – p. 85
Clustering methods
Given a uniform sample, evaluate the objective function
Sample Transformation (or concentration): either a fractionof “worst” points are discarded, or a few steps of a gradientmethod are performed
Remaining points are clustered
from the best point in each cluster a single local search isstarted
Introduction to Global Optimization – p. 86
Uniform sample
[Figure: uniform sample over [0, 5]2 with contour lines of the objective (levels −1, −3, −5)]
Introduction to Global Optimization – p. 87
Sample concentration
[Figure: the sample after concentration, i.e. after discarding the worst points and performing a few gradient steps]
Introduction to Global Optimization – p. 88
Clustering
[Figure: the concentrated sample grouped into clusters; the best point of each cluster is marked]
Introduction to Global Optimization – p. 89
Local optimization
[Figure: a single local search started from the best point of each cluster]
Introduction to Global Optimization – p. 90
Clustering: MLSL
Sampling proceeds in batches of N points. Given sample points X1, . . . , Xk ∈ [0, 1]n, label Xj as “clustered” iff ∃Y ∈ {X1, . . . , Xk}:
‖Xj − Y‖ ≤ ∆k := π−1/2 ( σ Γ(1 + n/2) (log k)/k )^(1/n)
and
f(Y) ≤ f(Xj)
Introduction to Global Optimization – p. 91
Simple Linkage
A sequential sample is generated (batches consist of a single observation). A local search is started only from the last sampled point (i.e. there is no “recall”), unless there exists a sufficiently near sampled point with a better function value
Introduction to Global Optimization – p. 92
Smoothing methods
Given f : Rn → R, the Gaussian transform is defined as:
⟨f⟩λ(x) = (1/(πn/2 λn)) ∫Rn f(y) exp(−‖y − x‖2/λ2) dy
When λ is sufficiently large ⇒⟨f⟩λ is convex. Idea: starting with a large enough λ, minimize the smoothed function and slowly decrease λ towards 0.
Introduction to Global Optimization – p. 93
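For simple integrands the transform can be checked against a closed form. In one dimension the kernel has variance λ²/2, so for f(y) = y² the transform is ⟨f⟩λ(x) = x² + λ²/2; a sketch that approximates the integral on a grid:

```python
import math

def gauss_transform(f, x, lam, half_width=10.0, n=20001):
    """1-D Gaussian transform (1/(sqrt(pi)*lam)) * int f(y) exp(-(y-x)^2/lam^2) dy."""
    h = 2.0 * half_width / (n - 1)
    total = 0.0
    for i in range(n):
        y = x - half_width + i * h
        total += f(y) * math.exp(-((y - x) / lam)**2)
    return total * h / (math.sqrt(math.pi) * lam)

val = gauss_transform(lambda y: y * y, x=1.5, lam=2.0)
print(round(val, 4))    # closed form: 1.5^2 + 2^2/2 = 4.25
```

A quadratic stays quadratic under smoothing; for multimodal f the same integral averages away the narrow wells, which is what makes the continuation idea work.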
Smoothing methods
-10-5
05
10 -10
-5
0
5
10
0
0.5
1
1.5
2
2.5
3
Introduction to Global Optimization – p. 94
-10-5
05
10 -10
-5
0
5
10
0
0.5
1
1.5
2
2.5
3
Introduction to Global Optimization – p. 95
-10-5
05
10 -10
-5
0
5
10
0.60.8
11.21.41.61.8
22.22.4
Introduction to Global Optimization – p. 96
[Figure: surface plot of the Gaussian-smoothed function, larger λ]
Introduction to Global Optimization – p. 97
[Figure: surface plot of the Gaussian-smoothed function, largest λ]
2.2
Introduction to Global Optimization – p. 98
Transformed function landscape
Elementary idea: local optimization smooths out many "high-frequency" oscillations
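In one dimension the transformed landscape T(x) = f(local minimum reached from x) can be sketched as follows; the fixed-step descent is a crude illustrative stand-in for a real local optimizer:

```python
def transformed_landscape(f, x, step=1e-3, max_iters=100000):
    # T(x): run a descent from x and return the local minimum value;
    # T is piecewise constant on each basin of attraction
    fx = f(x)
    for _ in range(max_iters):
        if f(x + step) < fx:
            x += step
        elif f(x - step) < fx:
            x -= step
        else:
            break
        fx = f(x)
    return fx
```

Two starting points in the same basin of attraction map to the same transformed value, which is exactly the flattening of high-frequency oscillations described above.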
Introduction to Global Optimization – p. 99
[Figure: one-dimensional multimodal function illustrating the transformed landscape]
Introduction to Global Optimization – p. 100
[Figure: transformed function landscape, continued]
Introduction to Global Optimization – p. 101
[Figure: transformed function landscape, continued]
Introduction to Global Optimization – p. 102
Monotonic Basin-Hopping
k := 0; f⋆ := +∞
while k < MaxIter do
    Xk := random initial solution
    X⋆k := arg min f(x; Xk)   (local minimization started at Xk)
    fk := f(X⋆k)
    if fk < f⋆ then f⋆ := fk
    NoImprove := 0
    while NoImprove < MaxImprove do
        X := random perturbation of Xk
        Y := arg min f(x; X)
        if f(Y) < f⋆ then Xk := Y; NoImprove := 0; f⋆ := f(Y)
        else NoImprove := NoImprove + 1
    end while
    k := k + 1
end while
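A runnable sketch of this scheme; the Rastrigin test function, the crude coordinate-descent local solver, and all parameter values are illustrative assumptions:

```python
import math
import random

def rastrigin(p):
    # Classic multimodal test function; global minimum f = 0 at the origin
    return 20.0 + sum(x * x - 10.0 * math.cos(2.0 * math.pi * x) for x in p)

def local_min(f, p, step=1e-2, iters=10000):
    # Crude coordinate-descent local search, a stand-in for any local solver
    p = list(p)
    fp = f(p)
    for _ in range(iters):
        improved = False
        for i in range(len(p)):
            for d in (step, -step):
                q = list(p)
                q[i] += d
                fq = f(q)
                if fq < fp:
                    p, fp, improved = q, fq, True
                    break
        if not improved:
            break
    return p, fp

def monotonic_basin_hopping(f, dim=2, max_iter=3, max_no_improve=100,
                            pert=0.9, seed=0):
    rng = random.Random(seed)
    f_star, x_star = float("inf"), None
    for _ in range(max_iter):
        xk, fk = local_min(f, [rng.uniform(-5.0, 5.0) for _ in range(dim)])
        if fk < f_star:
            f_star, x_star = fk, xk
        no_improve = 0
        while no_improve < max_no_improve:
            x = [xi + rng.uniform(-pert, pert) for xi in xk]
            y, fy = local_min(f, x)
            if fy < f_star:  # monotonic rule: accept strict improvements only
                xk, x_star, f_star = y, y, fy
                no_improve = 0
            else:
                no_improve += 1
    return x_star, f_star
```

On 2D Rastrigin the monotonic acceptance rule lets the search walk downhill through neighboring local minima toward the global minimum at the origin.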
Introduction to Global Optimization – p. 103
[Figure: monotonic basin-hopping iterations on a one-dimensional multimodal function]
Introduction to Global Optimization – p. 104
[Figure: monotonic basin-hopping iterations, continued]
Introduction to Global Optimization – p. 105
[Figure: monotonic basin-hopping iterations, continued]
Introduction to Global Optimization – p. 106
[Figure: monotonic basin-hopping iterations, continued]
Introduction to Global Optimization – p. 107
[Figure: monotonic basin-hopping iterations, continued]
Introduction to Global Optimization – p. 108
References
In this year's course the global optimization part has been expanded, so some parts on nonlinear optimization may be skipped. Here is an essential reference list for the material covered during the course:
Mokhtar S. Bazaraa, John J. Jarvis and Hanif D. Sherali, Linear Programming and Network Flows, John Wiley & Sons, 1990.
Dimitri P. Bertsekas, Nonlinear Programming, Athena Scientific, 1999.
Jorge Nocedal and Stephen J. Wright, Numerical Optimization, Springer, 2006.
Mohit Tawarmalani and Nikolaos V. Sahinidis, A Polyhedral Branch-and-Cut Approach to Global Optimization, in: Mathematical Programming, volume 103, pages 225–249, 2005.
I.P. Androulakis, C.D. Maranas and C.A. Floudas, αBB: A Global Optimization Method for General Constrained Nonconvex Problems, in: Journal of Global Optimization, volume 7, number 4, pages 337–363, 1995.
A. Rikun, A Convex Envelope Formula for Multilinear Functions, in: Journal of Global Optimization, volume 10, pages 425–437, 1997.
Andrea Grosso, Marco Locatelli and Fabio Schoen, A Population Based Approach for Hard Global Optimization Problems Based on Dissimilarity Measures, in: Mathematical Programming, volume 110, number 2, pages 373–404, 2007.