
Escaping Saddle Points in Constrained Optimization
Aryan Mokhtari, Asuman Ozdaglar, and Ali Jadbabaie

Laboratory for Information and Decision Systems (LIDS), Massachusetts Institute of Technology (MIT)

Introduction

I Recent revival of interest in nonconvex optimization
⇒ Practical success and advances in computational tools

I Consider the following general optimization program

  min_{x ∈ C} f (x)

I C ⊆ R^d is a compact convex set ⇒ This problem is hard

Convex Optimization: Optimality Condition

I Before jumping to nonconvex optimization ⇒ Let’s recap the convex case!

[Figure: 3D surface plot of a convex function]

I In the convex setting (f is convex)
⇒ First-order optimality condition implies global optimality
⇒ Finding an approximate first-order stationary point is sufficient (see the sketch below):

  Unconstrained: Find x∗ s.t. ‖∇f (x∗)‖ ≤ ε
  Constrained:   Find x∗ s.t. ∇f (x∗)T (x − x∗) ≥ −ε for all x ∈ C
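A minimal sketch (not from the poster) of how the constrained FOSP condition can be checked: assuming C is a box [lo, hi]^d, the linear minimization oracle min_{v ∈ C} ∇f(x)T v has a closed form, and it suffices to test the condition at its minimizer. The helper names (lmo_box, is_approx_fosp) are illustrative.

    import numpy as np

    def lmo_box(grad, lo, hi):
        # Corner of the box that minimizes grad^T v, chosen coordinate-wise.
        return np.where(grad > 0, lo, hi)

    def is_approx_fosp(grad, x, lo, hi, eps):
        # FOSP condition: grad^T (v - x) >= -eps for all v in C.
        # Checking the minimizing v returned by the oracle suffices.
        v = lmo_box(grad, lo, hi)
        return grad @ (v - x) >= -eps

    # Example: f(x) = ||x||^2 / 2 on the box [-1, 1]^2; x = 0 is a FOSP.
    x = np.zeros(2)
    print(is_approx_fosp(x, x, -np.ones(2), np.ones(2), eps=1e-6))  # True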

Nonconvex Optimization

[Figure: 3D surface plots of nonconvex functions exhibiting saddle points]

I 1st-order optimality is not enough ⇒ Saddle points exist!

I Check higher-order derivatives ⇒ To escape from saddle points ⇒ Search for a second-order stationary point (SOSP)

I Does convergence to an SOSP lead to global optimality? No!
I But, if all saddles are escapable (strict saddles) ⇒ SOSP ⇒ local minimum!

I In several cases, all saddle points are escapable and all local minima are global
⇒ Eigenvector problem [Absil et al., ’10]
⇒ Phase retrieval [Sun et al., ’16]
⇒ Dictionary learning [Sun et al., ’17]

Unconstrained Optimization

I Consider the unconstrained nonconvex setting (C = R^d)
I x∗ is an approximate (ε, γ)-second-order stationary point if

  ‖∇f (x∗)‖ ≤ ε  (first-order optimality condition)   and   ∇2f (x∗) ⪰ −γI  (second-order optimality condition)
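As a concrete illustration (not part of the poster), both conditions can be checked numerically given the gradient and Hessian at a candidate point; the helper name is_approx_sosp is made up here.

    import numpy as np

    def is_approx_sosp(grad, hess, eps, gamma):
        first_order = np.linalg.norm(grad) <= eps                 # ||grad f(x*)|| <= eps
        second_order = np.linalg.eigvalsh(hess).min() >= -gamma   # hess f(x*) >= -gamma*I
        return first_order and second_order

    # Example: x = 0 is a FOSP of f(x, y) = x^2 - y^2 but not an SOSP
    # for small gamma, since the Hessian has eigenvalue -2.
    print(is_approx_sosp(np.zeros(2), np.diag([2.0, -2.0]), eps=1e-6, gamma=0.1))  # False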

I Various attempts to design algorithms converging to an SOSP

I Perturbing iterates by injecting noise ⇒ [Ge et al., ’15], [Jin et al., ’17a,b], [Daneshmand et al., ’18]

I Using the eigenvector corresponding to the smallest eigenvalue of the Hessian ⇒ [Carmon et al., ’16], [Allen-Zhu, ’17], [Xu & Yang, ’17], [Royer & Wright, ’17], [Agarwal et al., ’17], [Reddi et al., ’18]

I Overall cost to find an (ε, γ)-SOSP ⇒ Polynomial in ε−1 and γ−1

I However, not applicable to the convex constrained setting!

I In the constrained case, can we find an SOSP in poly-time?

Constrained optimization: Second-order stationary point

I How should we define an SOSP for the constrained setting?
I x∗ ∈ C is an approximate (ε, γ)-second-order stationary point if

  ∇f (x∗)T (x − x∗) ≥ −ε for all x ∈ C

  (x − x∗)T ∇2f (x∗) (x − x∗) ≥ −γ for all x ∈ C s.t. ∇f (x∗)T (x − x∗) = 0

I The second condition should be satisfied only on the subspace in which the function can be increasing

I Setting ε = γ = 0 gives the necessary conditions for a local min

I We propose a framework that finds an (ε, γ)-SOSP in poly-time ⇒ If optimizing a quadratic loss over C up to a constant factor is tractable

Proposed algorithm to find an (ε, γ)-SOSP

[Figure: three surface-plot panels illustrating the proposed algorithm escaping a saddle point]

I Follow a first-order update to reach an ε-FOSP
⇒ The function value decreases at every such step; an ε-FOSP is reached within O(ε−2) iterations

I Escape from saddle points by solving a QP that depends on the objective function's curvature information
⇒ Each escape step decreases the function value, so at most O(γ−3) such steps are needed

I Once we escape from a saddle point, we won't revisit it again
⇒ The function value decreases after escaping from a saddle
⇒ It is guaranteed that the function value never increases

Stage I: First-order update (Finding a critical point)

I Goal: Find xt s.t. ∇f (xt)T (x − xt) ≥ −ε for all x ∈ C

I Follow Frank-Wolfe until reaching an ε-FOSP

  xt+1 = (1 − η) xt + η vt,  where vt = argmin_{v ∈ C} {∇f (xt)T v}

I Follow Projected Gradient Descent until reaching an ε-FOSP (see the sketch below)

  xt+1 = πC{xt − η ∇f (xt)},  where πC(·) is the Euclidean projection onto the convex set C

I While xt is not an ε-FOSP, each step decreases the function value by at least O(ε^2), so at most O(ε−2) first-order steps are needed
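A minimal sketch of the projected-gradient step above, under the illustrative assumption that C is a Euclidean ball of radius R (so the projection has a closed form); the stepsize eta and the helper names are placeholders.

    import numpy as np

    def project_ball(x, R):
        # Euclidean projection onto C = {x : ||x|| <= R}.
        norm = np.linalg.norm(x)
        return x if norm <= R else (R / norm) * x

    def projected_gradient_step(x, grad, eta, R):
        # x_{t+1} = pi_C(x_t - eta * grad f(x_t))
        return project_ball(x - eta * grad, R)

    # Example: one step on f(x) = ||x - c||^2 / 2 with c outside the unit ball.
    c = np.array([3.0, 0.0])
    x = np.zeros(2)
    print(projected_gradient_step(x, x - c, eta=0.5, R=1.0))  # [1. 0.]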

Stage II: Second-order update (Escaping from saddle points)

I Find ut, a ρ-approximate solution of the quadratic program

Minimize q(u) := (u − xt)T∇2f (xt)(u − xt)

subject to u ∈ C, ∇f (xt)T (u − xt) = 0

I q(u∗) ≤ q(ut) ≤ ρ q(u∗) for some ρ ∈ (0, 1]
⇒ If q(ut) < −ργ ⇒ Update xt+1 = (1 − σ) xt + σ ut
⇒ If q(ut) ≥ −ργ ⇒ q(u∗) ≥ −γ ⇒ xt is an (ε, γ)-SOSP

I Some classes of convex constraints satisfy this property ⇒ Quadratic constraints under some conditions

Theoretical Results

Theorem. If we set the stepsizes to η = O(ε) and σ = O(ργ), the proposed algorithm finds an (ε, γ)-SOSP after at most O(max{ε−2, ρ−3γ−3}) iterations.

I When can we solve the quadratic subproblem approximately?

Proposition. If C is defined by a quadratic constraint, then the algorithm finds an (ε, γ)-SOSP after O(max{τ ε−2, d^3 γ−3}) arithmetic operations.

Proposition. If the convex set C is defined as a set of m quadratic constraints (m > 1), and the objective function Hessian satisfies max_{x ∈ C} xT ∇2f (x) x ≤ O(γ), then the algorithm finds an (ε, γ)-SOSP after at most O(max{τ ε−2, d^3 m^7 γ−3}) arithmetic operations.

Proposed Algorithm

I for t = 1, 2, . . .
  I Compute vt = argmin_{v ∈ C} {∇f (xt)T v}
  I if ∇f (xt)T (vt − xt) < −ε
    I xt+1 = (1 − η) xt + η vt
  I else
    I Find ut: a ρ-approximate solution of the QP
    I if q(ut) < −ργ
      I xt+1 = (1 − σ) xt + σ ut
    I else
      I return xt and stop
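A Python sketch of the loop above (illustrative only). The linear minimization oracle lmo and the ρ-approximate QP solver approx_qp depend on the constraint set C and are passed in as black boxes; their names and signatures are assumptions, not part of the poster.

    import numpy as np

    def find_sosp(grad, hess, x, lmo, approx_qp,
                  eps, gamma, rho, eta, sigma, max_iter=1000):
        for _ in range(max_iter):
            g = grad(x)
            v = lmo(g)                           # v_t = argmin_{v in C} g^T v
            if g @ (v - x) < -eps:               # not yet an eps-FOSP
                x = (1 - eta) * x + eta * v      # Frank-Wolfe step
                continue
            H = hess(x)
            u = approx_qp(H, g, x)               # rho-approx. minimizer of q over C
            q = (u - x) @ H @ (u - x)
            if q < -rho * gamma:                 # enough negative curvature found
                x = (1 - sigma) * x + sigma * u  # escape step
            else:
                return x                         # x is an (eps, gamma)-SOSP
        return x

For a box or ball constraint the oracle lmo has a closed form; for quadratic constraints (as in the propositions above), approx_qp would be instantiated with a constant-factor approximation scheme.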

Stochastic Setting

I What about the stochastic setting?

  min_{x ∈ C} f (x) = min_{x ∈ C} EΘ[F (x, Θ)]

I where Θ is a random variable with probability distribution P
I Replace ∇f (xt) and ∇2f (xt) by their mini-batch stochastic approximations gt and Ht (a small sketch follows at the end of this section):

  gt = (1/bg) Σ_{i=1}^{bg} ∇F (xt, θi),   Ht = (1/bH) Σ_{i=1}^{bH} ∇2F (xt, θi)

I Relax some conditions to accommodate the approximation error
⇒ The constraint ∇f (xt)T (x − xt) = 0 becomes gtT (x − xt) ≤ r
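A small sketch of the mini-batch estimates gt and Ht (illustrative); the per-sample oracles grad_F and hess_F and the sampler draw_theta are assumptions made for the example.

    import numpy as np

    def minibatch_estimates(x, grad_F, hess_F, draw_theta, b_g, b_H):
        # g_t and H_t: averages of b_g stochastic gradients and b_H stochastic Hessians.
        g = np.mean([grad_F(x, draw_theta()) for _ in range(b_g)], axis=0)
        H = np.mean([hess_F(x, draw_theta()) for _ in range(b_H)], axis=0)
        return g, H

    # Example: F(x, theta) = ||x - theta||^2 / 2 with theta ~ N(0, I).
    rng = np.random.default_rng(0)
    d = 3
    g, H = minibatch_estimates(np.ones(d),
                               grad_F=lambda x, th: x - th,
                               hess_F=lambda x, th: np.eye(d),
                               draw_theta=lambda: rng.standard_normal(d),
                               b_g=64, b_H=8)
    print(g.shape, H.shape)  # (3,) (3, 3)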

Proposed Method for the Stochastic Setting

I for t = 1, 2, . . .
  I Compute vt = argmin_{v ∈ C} {gtT v}
  I if gtT (vt − xt) ≤ −ε/2
    I xt+1 = (1 − η) xt + η vt
  I else
    I Find ut: a ρ-approximate solution of
      min q(u) := (u − xt)T Ht (u − xt)  s.t. u ∈ C, gtT (u − xt) ≤ r
    I if q(ut) < −ργ/2
      I xt+1 = (1 − σ) xt + σ ut
    I else
      I return xt and stop

Theoretical Results for the Stochastic Setting

Theorem. If we set the stepsizes to η = O(ε) and σ = O(ργ), the batch sizes to bg = O(max{ρ−4γ−4, ε−2}) and bH = O(ρ−2γ−2), and choose r = O(ρ^2 γ^2), then
⇒ The outcome of Algorithm 2 is an (ε, γ)-SOSP w.h.p.
⇒ The total number of iterations is at most O(max{ε−2, ρ−3γ−3}) w.h.p.

Corollary. The algorithm finds an (ε, γ)-SOSP w.h.p. after computing
⇒ O(max{ε−2ρ−4γ−4, ε−4, ρ−7γ−7}) stochastic gradients
⇒ O(max{ε−2ρ−3γ−3, ρ−5γ−5}) stochastic Hessians

Conclusion

I Method for finding an SOSP in constrained settings
⇒ Use first-order information to reach an FOSP
⇒ Solve a QP up to a constant factor ρ < 1 to escape from saddles

I First finite-time complexity analysis for constrained problems
⇒ O(max{ε−2, ρ−3γ−3}) iterations; O(max{τ ε−2, d^3 m^7 γ−3}) arithmetic operations for quadratic constraints
⇒ Extended our results to the stochastic setting

The 32nd Annual Conference on Neural Information Processing Systems (NIPS) 2018 December 5th, 2018