Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe...

58
Convex Optimization Dani Yogatama School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA February 12, 2014 Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 1 / 26

Transcript of Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe...

Page 1: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Convex Optimization

Dani Yogatama

School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA

February 12, 2014

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 1 / 26

Page 2: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Key Concepts in Convex Analysis: Convex Sets

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 2 / 26

Page 3: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Key Concepts in Convex Analysis: Convex Functions

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 3 / 26

Page 4: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Key Concepts in Convex Analysis: Minimizers

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 4 / 26

Page 5: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Key Concepts in Convex Analysis: Strong ConvexityRecall the definition of convex function: ∀λ ∈ [0, 1],

f (λx + (1− λ)x ′) ≤ λf (x) + (1− λ)f (x ′)

A β−strongly convex function satisfies a stronger condition: ∀λ ∈ [0, 1]

f (λx + (1− λ)x ′) ≤ λf (x) + (1− λ)f (x ′)− β

2λ(1− λ)‖x − x ′‖22

convexity

strong convexity

Strong convexity⇒6⇐ strict convexity.

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 5 / 26

Page 6: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Key Concepts in Convex Analysis: Strong ConvexityRecall the definition of convex function: ∀λ ∈ [0, 1],

f (λx + (1− λ)x ′) ≤ λf (x) + (1− λ)f (x ′)

A β−strongly convex function satisfies a stronger condition: ∀λ ∈ [0, 1]

f (λx + (1− λ)x ′) ≤ λf (x) + (1− λ)f (x ′)− β

2λ(1− λ)‖x − x ′‖22

convexity strong convexity

Strong convexity⇒6⇐ strict convexity.

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 5 / 26

Page 7: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Key Concepts in Convex Analysis: Strong ConvexityRecall the definition of convex function: ∀λ ∈ [0, 1],

f (λx + (1− λ)x ′) ≤ λf (x) + (1− λ)f (x ′)

A β−strongly convex function satisfies a stronger condition: ∀λ ∈ [0, 1]

f (λx + (1− λ)x ′) ≤ λf (x) + (1− λ)f (x ′)− β

2λ(1− λ)‖x − x ′‖22

convexity strong convexity

Strong convexity⇒6⇐ strict convexity.

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 5 / 26

Page 8: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Key Concepts in Convex Analysis: SubgradientsConvexity ⇒ continuity; convexity 6⇒ differentiability (e.g., f (w) = ‖w‖1).

Subgradients generalize gradients for (maybe non-diff.) convex functions:

v is a subgradient of f at x if f (x′) ≥ f (x) + v>(x′ − x)

Subdifferential: ∂f (x) = {v : v is a subgradient of f at x}If f is differentiable, ∂f (x) = {∇f (x)}

linear lower bound non-differentiable case

Notation: ∇̃f (x) is a subgradient of f at x

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 6 / 26

Page 9: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Key Concepts in Convex Analysis: SubgradientsConvexity ⇒ continuity; convexity 6⇒ differentiability (e.g., f (w) = ‖w‖1).

Subgradients generalize gradients for (maybe non-diff.) convex functions:

v is a subgradient of f at x if f (x′) ≥ f (x) + v>(x′ − x)

Subdifferential: ∂f (x) = {v : v is a subgradient of f at x}

If f is differentiable, ∂f (x) = {∇f (x)}

linear lower bound

non-differentiable case

Notation: ∇̃f (x) is a subgradient of f at x

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 6 / 26

Page 10: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Key Concepts in Convex Analysis: SubgradientsConvexity ⇒ continuity; convexity 6⇒ differentiability (e.g., f (w) = ‖w‖1).

Subgradients generalize gradients for (maybe non-diff.) convex functions:

v is a subgradient of f at x if f (x′) ≥ f (x) + v>(x′ − x)

Subdifferential: ∂f (x) = {v : v is a subgradient of f at x}If f is differentiable, ∂f (x) = {∇f (x)}

linear lower bound

non-differentiable case

Notation: ∇̃f (x) is a subgradient of f at x

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 6 / 26

Page 11: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Key Concepts in Convex Analysis: SubgradientsConvexity ⇒ continuity; convexity 6⇒ differentiability (e.g., f (w) = ‖w‖1).

Subgradients generalize gradients for (maybe non-diff.) convex functions:

v is a subgradient of f at x if f (x′) ≥ f (x) + v>(x′ − x)

Subdifferential: ∂f (x) = {v : v is a subgradient of f at x}If f is differentiable, ∂f (x) = {∇f (x)}

linear lower bound non-differentiable case

Notation: ∇̃f (x) is a subgradient of f at x

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 6 / 26

Page 12: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Key Concepts in Convex Analysis: SubgradientsConvexity ⇒ continuity; convexity 6⇒ differentiability (e.g., f (w) = ‖w‖1).

Subgradients generalize gradients for (maybe non-diff.) convex functions:

v is a subgradient of f at x if f (x′) ≥ f (x) + v>(x′ − x)

Subdifferential: ∂f (x) = {v : v is a subgradient of f at x}If f is differentiable, ∂f (x) = {∇f (x)}

linear lower bound non-differentiable case

Notation: ∇̃f (x) is a subgradient of f at xDani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 6 / 26

Page 13: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Establishing convexity

How to check if f (x) is a convex function?

Verify definition of a convex function.

Check if ∂2f (x)∂2x

greater than or equal to 0 (for twice differentiablefunction).

Show that it is constructed from simple convex functions withoperations that preserver convexity.

nonnegative weighted sumcomposition with affine functionpointwise maximum and supremumcompositionminimizationperspective

Reference: Boyd and Vandenberghe (2004)

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 7 / 26

Page 14: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Establishing convexity

How to check if f (x) is a convex function?

Verify definition of a convex function.

Check if ∂2f (x)∂2x

greater than or equal to 0 (for twice differentiablefunction).

Show that it is constructed from simple convex functions withoperations that preserver convexity.

nonnegative weighted sumcomposition with affine functionpointwise maximum and supremumcompositionminimizationperspective

Reference: Boyd and Vandenberghe (2004)

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 7 / 26

Page 15: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Unconstrained Optimization

Algorithms:

First order methods (gradient descent, FISTA, etc.)

Higher order methods (Newton’s method, ellipsoid, etc.)

...

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 8 / 26

Page 16: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Gradient descent

Problem:min

xf (x)

Algorithm:

gt = ∂f (xt)∂x .

xt = xt−1 − ηgt .

Repeat until convergence.

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 9 / 26

Page 17: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Gradient descent

Problem:min

xf (x)

Algorithm:

gt = ∂f (xt)∂x .

xt = xt−1 − ηgt .

Repeat until convergence.

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 9 / 26

Page 18: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Newton’s method

Problem:min

xf (x)

Assume f is twice differentiable.

Algorithm:

gt = ∂f (xt)∂x .

st = H−1gt , where H is the Hessian.

xt = xt−1 − ηst .

Repeat until convergence.

Newton’s method is a special case of steepest descent using Hessian norm.

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 10 / 26

Page 19: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Newton’s method

Problem:min

xf (x)

Assume f is twice differentiable.

Algorithm:

gt = ∂f (xt)∂x .

st = H−1gt , where H is the Hessian.

xt = xt−1 − ηst .

Repeat until convergence.

Newton’s method is a special case of steepest descent using Hessian norm.

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 10 / 26

Page 20: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Newton’s method

Problem:min

xf (x)

Assume f is twice differentiable.

Algorithm:

gt = ∂f (xt)∂x .

st = H−1gt , where H is the Hessian.

xt = xt−1 − ηst .

Repeat until convergence.

Newton’s method is a special case of steepest descent using Hessian norm.

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 10 / 26

Page 21: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

DualityPrimal problem:

minx f (x)subject to gi (x) ≤ 0 i = 1, . . . ,m

hi (x) = 0 i = 1, . . . , p

for x ∈ X.

Lagrangian:

L(x , λ, ν) = f (x) +m∑

i=1

λigi (x) +

p∑i=1

νihi (x)

λi and νi are Lagrange multipliers.

Suppose x is feasible and λ ≥ 0, then we get the lower bound:

f (x) ≥ L(x , λ, ν)∀x ∈ X, λ ∈ Rm+

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 11 / 26

Page 22: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

DualityPrimal problem:

minx f (x)subject to gi (x) ≤ 0 i = 1, . . . ,m

hi (x) = 0 i = 1, . . . , p

for x ∈ X.

Lagrangian:

L(x , λ, ν) = f (x) +m∑

i=1

λigi (x) +

p∑i=1

νihi (x)

λi and νi are Lagrange multipliers.

Suppose x is feasible and λ ≥ 0, then we get the lower bound:

f (x) ≥ L(x , λ, ν)∀x ∈ X, λ ∈ Rm+

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 11 / 26

Page 23: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

DualityPrimal problem:

minx f (x)subject to gi (x) ≤ 0 i = 1, . . . ,m

hi (x) = 0 i = 1, . . . , p

for x ∈ X.

Lagrangian:

L(x , λ, ν) = f (x) +m∑

i=1

λigi (x) +

p∑i=1

νihi (x)

λi and νi are Lagrange multipliers.

Suppose x is feasible and λ ≥ 0, then we get the lower bound:

f (x) ≥ L(x , λ, ν)∀x ∈ X, λ ∈ Rm+

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 11 / 26

Page 24: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Duality

Primal optimal: p∗ = minx

maxλ≥0,ν

L(x , λ, ν)

Lagrange dual function: minx

L(x , λ, ν)

This is a concave function, regardless of whether f (x) convex or not. Canbe −∞ for some λ and ν.

Lagrange dual problem: maxλ,ν

L(x , λ, ν) subject to λ ≥ 0

Dual feasible: if λ ≥ 0 and λ, ν ∈ dom L(x , λ, ν).

Dual optimal: d∗ = maxλ≥0,ν

minx

L(x , λ, ν)

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 12 / 26

Page 25: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Duality

Primal optimal: p∗ = minx

maxλ≥0,ν

L(x , λ, ν)

Lagrange dual function: minx

L(x , λ, ν)

This is a concave function, regardless of whether f (x) convex or not. Canbe −∞ for some λ and ν.

Lagrange dual problem: maxλ,ν

L(x , λ, ν) subject to λ ≥ 0

Dual feasible: if λ ≥ 0 and λ, ν ∈ dom L(x , λ, ν).

Dual optimal: d∗ = maxλ≥0,ν

minx

L(x , λ, ν)

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 12 / 26

Page 26: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Duality

Primal optimal: p∗ = minx

maxλ≥0,ν

L(x , λ, ν)

Lagrange dual function: minx

L(x , λ, ν)

This is a concave function, regardless of whether f (x) convex or not. Canbe −∞ for some λ and ν.

Lagrange dual problem: maxλ,ν

L(x , λ, ν) subject to λ ≥ 0

Dual feasible: if λ ≥ 0 and λ, ν ∈ dom L(x , λ, ν).

Dual optimal: d∗ = maxλ≥0,ν

minx

L(x , λ, ν)

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 12 / 26

Page 27: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Duality

Primal optimal: p∗ = minx

maxλ≥0,ν

L(x , λ, ν)

Lagrange dual function: minx

L(x , λ, ν)

This is a concave function, regardless of whether f (x) convex or not. Canbe −∞ for some λ and ν.

Lagrange dual problem: maxλ,ν

L(x , λ, ν) subject to λ ≥ 0

Dual feasible: if λ ≥ 0 and λ, ν ∈ dom L(x , λ, ν).

Dual optimal: d∗ = maxλ≥0,ν

minx

L(x , λ, ν)

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 12 / 26

Page 28: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

DualityWeak duality p∗ ≥ d∗ always holds for convex and nonconvex problems

Strong duality p∗ = d∗ does not hold in general, but usually holds forconvex problems. Strong duality is guaranteed by Slater’s constraintqualification.

Strong duality holds if the problem is strictly feasible, i.e.

∃x ∈ intD s.t. gi (x) < 0, i = 1, . . . ,m, hi (x) = 0, i = 1, . . . , p

Assume strong duality holds and p∗ and d∗ are attained.

p∗ = f (x∗) = d∗ = minx

L(x∗, λ∗, ν∗) ≤ L(x∗, λ∗, ν∗) ≤ f (x∗) = p∗

We have:

x∗ ∈ arg minx L(x∗, λ∗, ν∗).

λ∗i gi (x∗) = 0 for i = 1, . . . ,m (complementary slackness).

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 13 / 26

Page 29: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

DualityWeak duality p∗ ≥ d∗ always holds for convex and nonconvex problems

Strong duality p∗ = d∗ does not hold in general, but usually holds forconvex problems. Strong duality is guaranteed by Slater’s constraintqualification.

Strong duality holds if the problem is strictly feasible, i.e.

∃x ∈ intD s.t. gi (x) < 0, i = 1, . . . ,m, hi (x) = 0, i = 1, . . . , p

Assume strong duality holds and p∗ and d∗ are attained.

p∗ = f (x∗) = d∗ = minx

L(x∗, λ∗, ν∗) ≤ L(x∗, λ∗, ν∗) ≤ f (x∗) = p∗

We have:

x∗ ∈ arg minx L(x∗, λ∗, ν∗).

λ∗i gi (x∗) = 0 for i = 1, . . . ,m (complementary slackness).

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 13 / 26

Page 30: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

DualityWeak duality p∗ ≥ d∗ always holds for convex and nonconvex problems

Strong duality p∗ = d∗ does not hold in general, but usually holds forconvex problems. Strong duality is guaranteed by Slater’s constraintqualification.

Strong duality holds if the problem is strictly feasible, i.e.

∃x ∈ intD s.t. gi (x) < 0, i = 1, . . . ,m, hi (x) = 0, i = 1, . . . , p

Assume strong duality holds and p∗ and d∗ are attained.

p∗ = f (x∗) = d∗ = minx

L(x∗, λ∗, ν∗) ≤ L(x∗, λ∗, ν∗) ≤ f (x∗) = p∗

We have:

x∗ ∈ arg minx L(x∗, λ∗, ν∗).

λ∗i gi (x∗) = 0 for i = 1, . . . ,m (complementary slackness).

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 13 / 26

Page 31: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

DualityWeak duality p∗ ≥ d∗ always holds for convex and nonconvex problems

Strong duality p∗ = d∗ does not hold in general, but usually holds forconvex problems. Strong duality is guaranteed by Slater’s constraintqualification.

Strong duality holds if the problem is strictly feasible, i.e.

∃x ∈ intD s.t. gi (x) < 0, i = 1, . . . ,m, hi (x) = 0, i = 1, . . . , p

Assume strong duality holds and p∗ and d∗ are attained.

p∗ = f (x∗) = d∗ = minx

L(x∗, λ∗, ν∗) ≤ L(x∗, λ∗, ν∗) ≤ f (x∗) = p∗

We have:

x∗ ∈ arg minx L(x∗, λ∗, ν∗).

λ∗i gi (x∗) = 0 for i = 1, . . . ,m (complementary slackness).

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 13 / 26

Page 32: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Karush-Kuhn-Tucker condition

For a differentiable g(x) and h(x), the KKT conditions are:

gi (x∗) ≤ 0, hi (x

∗) = 0, primal feasibility

λ∗i ≥ 0, dual feasibility

λ∗i gi (x∗) = 0, complementary slackness

∂L(x∗, λ∗, ν∗)

∂x|x=x∗ = 0 Lagrangian stationarity

If x̂ , λ̂, ν̂ satify the KKT for a convex problem, they are optimal.

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 14 / 26

Page 33: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Karush-Kuhn-Tucker condition

For a differentiable g(x) and h(x), the KKT conditions are:

gi (x∗) ≤ 0, hi (x

∗) = 0, primal feasibility

λ∗i ≥ 0, dual feasibility

λ∗i gi (x∗) = 0, complementary slackness

∂L(x∗, λ∗, ν∗)

∂x|x=x∗ = 0 Lagrangian stationarity

If x̂ , λ̂, ν̂ satify the KKT for a convex problem, they are optimal.

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 14 / 26

Page 34: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Support Vector MachinesPrimal problem (hard constraint):

minw1

2‖w‖22

subject to yi 〈xi ,w〉 ≥ 1, i = 1, . . . , n

Lagrangian:

L(w , λ) =1

2‖w‖22 −

n∑i=1

λi (yi 〈xi ,w〉 − 1)

Minimizing with respect to w, we have:

∂L(w ,λ)∂w = 0

w −∑n

i=1 λiyixi = 0

w =n∑

i=1

λiyixi

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 15 / 26

Page 35: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Support Vector MachinesPrimal problem (hard constraint):

minw1

2‖w‖22

subject to yi 〈xi ,w〉 ≥ 1, i = 1, . . . , n

Lagrangian:

L(w , λ) =1

2‖w‖22 −

n∑i=1

λi (yi 〈xi ,w〉 − 1)

Minimizing with respect to w, we have:

∂L(w ,λ)∂w = 0

w −∑n

i=1 λiyixi = 0

w =n∑

i=1

λiyixi

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 15 / 26

Page 36: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Support Vector MachinesPrimal problem (hard constraint):

minw1

2‖w‖22

subject to yi 〈xi ,w〉 ≥ 1, i = 1, . . . , n

Lagrangian:

L(w , λ) =1

2‖w‖22 −

n∑i=1

λi (yi 〈xi ,w〉 − 1)

Minimizing with respect to w, we have:

∂L(w ,λ)∂w = 0

w −∑n

i=1 λiyixi = 0

w =n∑

i=1

λiyixi

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 15 / 26

Page 37: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Support Vector MachinesPlug this back into the Lagrangian:

L(λ) =n∑

1=1

λi −1

2

n∑i=1

n∑j=1

yiyjλiλjx>i xj

Lagrange dual problem is:

maxλ

n∑1=1

λi −1

2

n∑i=1

n∑j=1

yiyjλiλjx>i xj

subject to λi ≥ 0, i = 1, . . . , nn∑

i=1

λiyi = 0

Since this problem only depends on x>i xj , we can use kernels and learn inhigh dimensional space without having to explicitly represent φ(x).

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 16 / 26

Page 38: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Support Vector MachinesPlug this back into the Lagrangian:

L(λ) =n∑

1=1

λi −1

2

n∑i=1

n∑j=1

yiyjλiλjx>i xj

Lagrange dual problem is:

maxλ

n∑1=1

λi −1

2

n∑i=1

n∑j=1

yiyjλiλjx>i xj

subject to λi ≥ 0, i = 1, . . . , nn∑

i=1

λiyi = 0

Since this problem only depends on x>i xj , we can use kernels and learn inhigh dimensional space without having to explicitly represent φ(x).

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 16 / 26

Page 39: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Support Vector MachinesPlug this back into the Lagrangian:

L(λ) =n∑

1=1

λi −1

2

n∑i=1

n∑j=1

yiyjλiλjx>i xj

Lagrange dual problem is:

maxλ

n∑1=1

λi −1

2

n∑i=1

n∑j=1

yiyjλiλjx>i xj

subject to λi ≥ 0, i = 1, . . . , nn∑

i=1

λiyi = 0

Since this problem only depends on x>i xj , we can use kernels and learn inhigh dimensional space without having to explicitly represent φ(x).

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 16 / 26

Page 40: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Support Vector Machines

Primal problem (soft constraint):

minw1

2‖w‖22 + C

n∑i=1

ξi

subject to yi 〈xi ,w〉 ≥ 1− ξi , i = 1, . . . , n

ξi ≥ 0, i = 1 . . . , n

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 17 / 26

Page 41: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Support Vector Machines

Lagrange dual problem for the soft constraint:

maxλ

n∑1=1

λi −1

2

n∑i=1

n∑j=1

yiyjλiλjx>i xj (1)

subject to 0 ≤ λi ≤ C , i = 1, . . . , n (2)n∑

i=1

λiyi = 0 (3)

KKT conditions, for all i :

λi = 0 → yi 〈xi ,w〉 ≥ 1 (4)

0 < λi < C → yi 〈xi ,w〉 = 1 (5)

λi = C → yi 〈xi ,w〉 ≤ 1 (6)

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 18 / 26

Page 42: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Support Vector Machines

Lagrange dual problem for the soft constraint:

maxλ

n∑1=1

λi −1

2

n∑i=1

n∑j=1

yiyjλiλjx>i xj (1)

subject to 0 ≤ λi ≤ C , i = 1, . . . , n (2)n∑

i=1

λiyi = 0 (3)

KKT conditions, for all i :

λi = 0 → yi 〈xi ,w〉 ≥ 1 (4)

0 < λi < C → yi 〈xi ,w〉 = 1 (5)

λi = C → yi 〈xi ,w〉 ≤ 1 (6)

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 18 / 26

Page 43: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Sequential Minimal Optimization (Platt, 1998)

An efficient way to solve SVM dual problem. Break a large QP programinto a series of smallest possible QP problems. Solve these smallsubproblems analytically.

In a nutshell

Choose two Lagrange multipliers λi and λj .

Optimize the dual problem with respect to these two Lagrangemultipliers, holding others fixed.

Repeat until convergence.

There are heuristics to choose Lagrange multipliers that maximizes thestep size towards the global maximum. The first one is chosen fromexamples that violate the KKT condition. The second one is chosen usingapproximation based on absolute difference in error values (see Platt(1998)).

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 19 / 26

Page 44: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Sequential Minimal Optimization (Platt, 1998)

An efficient way to solve SVM dual problem. Break a large QP programinto a series of smallest possible QP problems. Solve these smallsubproblems analytically.

In a nutshell

Choose two Lagrange multipliers λi and λj .

Optimize the dual problem with respect to these two Lagrangemultipliers, holding others fixed.

Repeat until convergence.

There are heuristics to choose Lagrange multipliers that maximizes thestep size towards the global maximum. The first one is chosen fromexamples that violate the KKT condition. The second one is chosen usingapproximation based on absolute difference in error values (see Platt(1998)).

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 19 / 26

Page 45: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Sequential Minimal Optimization (Platt, 1998)

An efficient way to solve SVM dual problem. Break a large QP programinto a series of smallest possible QP problems. Solve these smallsubproblems analytically.

In a nutshell

Choose two Lagrange multipliers λi and λj .

Optimize the dual problem with respect to these two Lagrangemultipliers, holding others fixed.

Repeat until convergence.

There are heuristics to choose Lagrange multipliers that maximizes thestep size towards the global maximum. The first one is chosen fromexamples that violate the KKT condition. The second one is chosen usingapproximation based on absolute difference in error values (see Platt(1998)).

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 19 / 26

Page 46: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Sequential Minimal Optimization (Platt, 1998)

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 20 / 26

Page 47: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Sequential Minimal Optimization (Platt, 1998)

For any two Lagrange multipliers, the constraints are::

0 < λi , λj < C (7)

yiλi + yjλj = −∑

k 6=i ,j ykλk = γ (8)

Express λi in terms of λj

λi =γ − λjyj

yi

Plug this back into our original function. We are then left with a verysimple quadratic problem with one variable λj

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 21 / 26

Page 48: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Sequential Minimal Optimization (Platt, 1998)

For any two Lagrange multipliers, the constraints are::

0 < λi , λj < C (7)

yiλi + yjλj = −∑

k 6=i ,j ykλk = γ (8)

Express λi in terms of λj

λi =γ − λjyj

yi

Plug this back into our original function. We are then left with a verysimple quadratic problem with one variable λj

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 21 / 26

Page 49: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Sequential Minimal Optimization (Platt, 1998)

For any two Lagrange multipliers, the constraints are::

0 < λi , λj < C (7)

yiλi + yjλj = −∑

k 6=i ,j ykλk = γ (8)

Express λi in terms of λj

λi =γ − λjyj

yi

Plug this back into our original function. We are then left with a verysimple quadratic problem with one variable λj

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 21 / 26

Page 50: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Sequential Minimal Optimization (Platt, 1998)Solve for the second Lagrange multiplier λj .

If yi 6= yj , the following bounds apply to λj :

L = max(0, λt−1j − λy−1

i ) (9)

H = min(C ,C + λt−1j − λy−1

i ) (10)

If yi = yj , the following bounds apply to λj :

L = max(0, λt−1j + λy−1

i − C ) (11)

H = min(C , λt−1j + λy−1

i ) (12)

The solution is:

λj =

H ifλj > Hλj ifL ≤ λj ≤ HL ifλj < L

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 22 / 26

Page 51: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Sequential Minimal Optimization (Platt, 1998)Solve for the second Lagrange multiplier λj .

If yi 6= yj , the following bounds apply to λj :

L = max(0, λt−1j − λy−1

i ) (9)

H = min(C ,C + λt−1j − λy−1

i ) (10)

If yi = yj , the following bounds apply to λj :

L = max(0, λt−1j + λy−1

i − C ) (11)

H = min(C , λt−1j + λy−1

i ) (12)

The solution is:

λj =

H ifλj > Hλj ifL ≤ λj ≤ HL ifλj < L

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 22 / 26

Page 52: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Sequential Minimal Optimization (Platt, 1998)Solve for the second Lagrange multiplier λj .

If yi 6= yj , the following bounds apply to λj :

L = max(0, λt−1j − λy−1

i ) (9)

H = min(C ,C + λt−1j − λy−1

i ) (10)

If yi = yj , the following bounds apply to λj :

L = max(0, λt−1j + λy−1

i − C ) (11)

H = min(C , λt−1j + λy−1

i ) (12)

The solution is:

λj =

H ifλj > Hλj ifL ≤ λj ≤ HL ifλj < L

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 22 / 26

Page 53: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Fenchel dualityIf a convex conjugate of f (x) is known, the dual function can be easilyderived. The convex conjugate of a function f is:

f ∗(y) = maxx〈y , x〉 − f (x) (13)

For a generic problem

minx

f (x)

subject to Ax ≤ b

Cx = d

The dual function is: −f ∗(−A>λ− C>ν)− b>λ− d>ν

There are many functions whose conjugate are easy to compute:

ExponentialLogisticQuadratic formLog determinant...

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 23 / 26

Page 54: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Fenchel dualityIf a convex conjugate of f (x) is known, the dual function can be easilyderived. The convex conjugate of a function f is:

f ∗(y) = maxx〈y , x〉 − f (x) (13)

For a generic problem

minx

f (x)

subject to Ax ≤ b

Cx = d

The dual function is: −f ∗(−A>λ− C>ν)− b>λ− d>ν

There are many functions whose conjugate are easy to compute:

ExponentialLogisticQuadratic formLog determinant...

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 23 / 26

Page 55: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Fenchel dualityIf a convex conjugate of f (x) is known, the dual function can be easilyderived. The convex conjugate of a function f is:

f ∗(y) = maxx〈y , x〉 − f (x) (13)

For a generic problem

minx

f (x)

subject to Ax ≤ b

Cx = d

The dual function is: −f ∗(−A>λ− C>ν)− b>λ− d>ν

There are many functions whose conjugate are easy to compute:

ExponentialLogisticQuadratic formLog determinant...

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 23 / 26

Page 56: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Parting notes

Dual formulation is useful.

Give new insights into our problem,

Allow us to develop better optimization methods and use kernel tricks.

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 24 / 26

Page 57: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

Thank you!

Questions?

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 25 / 26

Page 58: Convex Optimizationaarti/Class/10701_Spring14/r3_convex_opt.pdf · Reference: Boyd and Vandenberghe (2004) Dani Yogatama (Carnegie Mellon University) Convex Optimization February

References I

Some slides are from an upcoming EACL 2014 tutorial with Andre F. T.Martins, Noah A. Smith, and Mario F. T. Figueiredo

Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. CambridgeUniversity Press.

Platt, J. (1998). Fast training of support vector machines using sequentialminimal optimization. In Scholkopf, B., Burges, C., and Smola, A.,editors, Advances in Kernel Methods - Support Vector Learning. MITPress.

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 26 / 26