Transcript of Understanding Non-convex Optimization

Page 1:

Understanding Non-convex Optimization

Sujay Sanghavi

UT Austin

Praneeth Netrapalli

Microsoft Research

Page 2:

Optimization

min_𝑥 𝑓(𝑥)   s.t. 𝑥 ∈ 𝒞

• In general too hard

• Convex optimization: 𝑓(⋅) is a convex function, 𝒞 is a convex set

• But “today’s problems”, and this tutorial, are non-convex

• Our focus: non-convex problems that arise in machine learning

(𝑥: variable in ℝ^𝑑; 𝑓: objective function; 𝒞: feasible set)

Page 3:

Outline of Tutorial

Part I (algorithms & background)

• Convex optimization (brief overview)

• Nonconvex optimization

Part II

• Example applications of nonconvex optimization

• Open directions

Page 4:

Convex Functions

Convex functions “lie below the line”: 𝑓(𝜆𝑥₁ + (1 − 𝜆)𝑥₂) ≤ 𝜆𝑓(𝑥₁) + (1 − 𝜆)𝑓(𝑥₂) for all 𝜆 ∈ [0, 1]

If ∇𝑓(𝑥) exists, a convex 𝑓 “lies above its local linear approximation”:

𝑓(𝑦) ≥ 𝑓(𝑥) + ∇𝑓(𝑥)ᵀ(𝑦 − 𝑥)   ∀𝑦

• GENERAL enough to capture/model many problems

• SPECIAL enough to have fast “off the shelf” algorithms

low-order polynomial complexity in dimension 𝑑

Page 5:

Convex Optimization

min_𝑥 𝑓(𝑥)   s.t. 𝑥 ∈ 𝒞

• Key property of convex optimization:

Local optimality ⟹ Global optimality

Proof sketch: if 𝑥 is a local (but not global) optimum and 𝑥∗ is global, then by convexity moving from 𝑥 toward 𝑥∗ strictly decreases 𝑓, which violates local optimality of 𝑥.

• All general-purpose methods search for locally optimal solutions

(𝑓: convex function; 𝒞: convex set)

Page 6:

“Black box / Oracle view” of Optimization

• 0th order oracle: 𝑥 → 𝑓(𝑥) (gradient-free methods)
• 1st order oracle: 𝑥 → ∇𝑓(𝑥) (gradient descent)
• 2nd order oracle: 𝑥 → ∇²𝑓(𝑥) (Newton's method)
• Stochastic 1st order oracle: 𝑥 → 𝑔 with E[𝑔] = ∇𝑓(𝑥) (stochastic gradient descent)
• Non-smooth 1st order oracle: 𝑥 → 𝑔 with E[𝑔] ∈ 𝜕𝑓(𝑥) ((stochastic) sub-gradient descent)

Page 7:

Gradients

For any smooth function (i.e., ∇𝑓(𝑥) exists everywhere):

• −∇𝑓(𝑥) is a local descent direction at 𝑥

• 𝑥∗ locally optimal ⟹ ∇𝑓(𝑥∗) = 0

For convex functions, it is therefore enough to search for zero-gradient points:

𝑥∗ globally optimal ⟺ ∇𝑓(𝑥∗) = 0

Page 8:

Gradient Descent

An iterative method to solve min_𝑥 𝑓(𝑥):

𝑥_{𝑡+1} = 𝑥_𝑡 − 𝜂_𝑡 ∇𝑓(𝑥_𝑡)

(𝑥_{𝑡+1}: next iterate; 𝑥_𝑡: current iterate; 𝜂_𝑡: step size; ∇𝑓(𝑥_𝑡): current gradient)
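As a concrete illustration, here is a minimal NumPy sketch of this update rule; the function name, the toy quadratic objective, and the fixed step size are illustrative assumptions, not part of the tutorial.

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, num_steps=100):
    """Iterate x_{t+1} = x_t - eta * grad_f(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = x - eta * grad_f(x)
    return x

# Toy example: minimize f(x) = 0.5 * ||x||^2, whose gradient is x.
x_min = gradient_descent(lambda x: x, x0=np.array([3.0, -2.0]))
print(x_min)  # close to the optimum [0, 0]
```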

Page 9:

Gradient Descent

• “Off-the-shelf” method that can be applied to “any” problem

• Only need to be able to calculate gradients (i.e., a 1st order oracle)

• “Only” parameters are the step sizes 𝜂𝑡

• but choosing them correctly is crucially important

• If too small, convergence too slow. If too big, divergence / oscillation

• The more we know about 𝑓, the better we can make this choice

Page 10:

Gradient Descent Convergence

𝑓(⋅) has 𝛽-Lipschitz gradients if

‖∇𝑓(𝑥) − ∇𝑓(𝑦)‖ ≤ 𝛽 ‖𝑥 − 𝑦‖   ∀𝑥, 𝑦

• In this case, a fixed step size 𝜂_𝑡 = 1/𝛽 gives

𝑓(𝑥_𝑇) − 𝑓(𝑥∗) ≤ 𝜖 once 𝑇 ∼ Θ(1/𝜖)

• This, however, is not tight: there is a different method that is faster but still uses only the first-order oracle …

𝛽 is a uniform bound

Page 11:

Gradient Descent Convergence

(Figure: a function that is very steep in places, so a big 𝜂 cannot be used, yet very flat near 𝑥∗, so a small 𝜂 makes progress slow.)

Page 12:

Gradient Descent Convergence

𝑓(⋅) is 𝛼-strongly convex if

‖∇𝑓(𝑥) − ∇𝑓(𝑦)‖ ≥ 𝛼 ‖𝑥 − 𝑦‖   ∀𝑥, 𝑦

• Now GD with fixed 𝜂_𝑡 = 1/𝛽 gives

𝑓(𝑥_𝑇) − 𝑓(𝑥∗) ≤ 𝜖 in 𝑇 ∼ 𝑂( log(1/𝜖) / (−log(1 − 𝛼/𝛽)) )

exponentially faster than the merely smooth case

“cannot be too flat”

Page 13:

Non-smooth functions

The sub-differential of a (convex) 𝑓 at 𝑥 is a set of “gradient-like” vectors:

{ 𝑔 ∈ ℝ^𝑑 : 𝑓(𝑦) ≥ 𝑓(𝑥) + 𝑔ᵀ(𝑦 − 𝑥)   ∀𝑦 }

E.g., for |𝑥| and relu(𝑥) = max(𝑥, 0):

𝜕𝑓(0) = {𝑢 : −1 ≤ 𝑢 ≤ 1} for |𝑥|, and 𝜕𝑓(0) = {𝑢 : 0 ≤ 𝑢 ≤ 1} for relu

Page 14:

Sub-gradient Descent

Non-smooth 1st order oracle: 𝑥 → 𝑔 with 𝑔 ∈ 𝜕𝑓(𝑥)
Stochastic non-smooth 1st order oracle: 𝑥 → 𝑔 with 𝐸[𝑔] ∈ 𝜕𝑓(𝑥)

Sub-gradient descent:

𝑥_{𝑡+1} = 𝑥_𝑡 − 𝜂_𝑡 𝑔_𝑡, where 𝑔_𝑡 ∈ 𝜕𝑓(𝑥_𝑡)

Page 15:

Sub-gradient Descent

• Now cannot use fixed step size in general

𝜂_𝑡 has to → 0, else oscillations are possible

Convergence is slower than gradient descent because the step size is forced to decay:

𝑓(𝑥_𝑇) − 𝑓(𝑥∗) ≤ 𝜖 in 𝑇 ∼ 𝑂(1/𝜖²)

E.g., for 𝑓(𝑥) = |𝑥|, the subgradient is 𝑔 = sign(𝑥) for 𝑥 ≠ 0, and the fixed-step update 𝑥⁺ = 𝑥 − 𝜂 sign(𝑥) will oscillate between 𝑎 and 𝑎 − 𝜂 for some 𝑎 that depends on the initial point.
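A tiny sketch of this oscillation; the starting point and step sizes are arbitrary choices for illustration.

```python
import numpy as np

def subgradient_descent_abs(x0, steps, fixed_eta=None):
    """Sub-gradient descent on f(x) = |x|, using the subgradient g = sign(x)."""
    x = float(x0)
    for t in range(1, steps + 1):
        eta = fixed_eta if fixed_eta is not None else 1.0 / np.sqrt(t)
        x = x - eta * np.sign(x)
    return x

print(subgradient_descent_abs(0.55, 1000, fixed_eta=0.2))  # fixed step: keeps oscillating, never settles at 0
print(subgradient_descent_abs(0.55, 1000))                 # decaying step ~ 1/sqrt(t): ends up close to 0
```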

Page 16:

Stochastic Gradient Descent (SGD)

But sometimes even calculating the full gradient ∇𝑓(𝑥) for gradient descent may be hard:

• Large number of terms: min_𝑥 (1/𝑛) Σ_{𝑖=1}^𝑛 𝑓_𝑖(𝑥), with 𝑛 large

• High dimensionality: min_𝑥 𝑓(𝑥₁, … , 𝑥_𝑑), with 𝑑 large

Page 17:

Stochastic Gradient Descent (SGD)

Lowering the complexity of gradient calculation via sub-sampling the sum of terms

Full gradient step: 𝑥⁺ = 𝑥 − 𝜂 (1/𝑛) Σ_{𝑖=1}^𝑛 ∇𝑓_𝑖(𝑥)
Sub-sampled step: 𝑥⁺ = 𝑥 − 𝜂 (1/|𝐵|) Σ_{𝑖∈𝐵} ∇𝑓_𝑖(𝑥)
where 𝐵 is a uniform random subset of [𝑛]. This is mini-batch SGD.

Full gradient step: 𝑥⁺ = 𝑥 − 𝜂 Σ_{𝑗=1}^𝑑 (∂𝑓/∂𝑥_𝑗) 𝒆_𝑗
Sub-sampled step: 𝑥⁺ = 𝑥 − 𝜂 (𝑑/|𝑆|) Σ_{𝑗∈𝑆} (∂𝑓/∂𝑥_𝑗) 𝒆_𝑗
where 𝑆 is a uniform random subset of [𝑑]. This is block coordinate descent.
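A minimal sketch of the mini-batch variant above; the least-squares data and all constants are placeholder assumptions, only the sub-sampling pattern matters.

```python
import numpy as np

def minibatch_sgd(grad_fi, n, x0, eta=0.1, batch_size=10, num_steps=1000, seed=0):
    """x+ = x - eta * (1/|B|) * sum_{i in B} grad f_i(x), with B uniform over [n]."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        batch = rng.choice(n, size=batch_size, replace=False)
        g = np.mean([grad_fi(i, x) for i in batch], axis=0)
        x = x - eta * g
    return x

# Toy least-squares example: f_i(x) = 0.5 * (a_i^T x - b_i)^2.
rng = np.random.default_rng(1)
A, x_true = rng.normal(size=(100, 5)), rng.normal(size=5)
b = A @ x_true
x_hat = minibatch_sgd(lambda i, x: (A[i] @ x - b[i]) * A[i], n=100, x0=np.zeros(5))
print(np.linalg.norm(x_hat - x_true))  # small, since the data here is noiseless
```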

Page 18:

Many ML problems can be reduced to optimization problems of the form

min_𝑥 (1/𝑛) Σ_{𝑖=1}^𝑛 𝑙(𝑠_𝑖; 𝑥)

(𝑙: loss function; 𝑠_𝑖: 𝑖-th sample; 𝑥: weights; each term 𝑙(𝑠_𝑖; 𝑥) plays the role of 𝑓_𝑖(𝑥))

Mini-batch SGD:

“Batch or full” gradient: ∇𝑓(𝑥) = (1/𝑛) Σ_{𝑖=1}^𝑛 ∇𝑙(𝑠_𝑖; 𝑥)
“Mini-batch” gradient: 𝑔 = (1/|𝐵|) Σ_{𝑖∈𝐵} ∇𝑙(𝑠_𝑖; 𝑥)

Page 19:

SGD and its Convergence Rate

𝑔 is a Noisy Unbiased (Sub)gradient (NUS) of 𝑓(⋅) at 𝑥 if 𝐸[𝑔 | 𝑥] ∈ 𝜕𝑓(𝑥)

SGD: 𝑥_{𝑡+1} = 𝑥_𝑡 − 𝜂_𝑡 𝑔_𝑡

Suppose Var(‖𝑔_𝑡‖ | 𝑥_𝑡) ≤ 𝐺_𝑡²

Theorem: E[𝑓(𝑥_{BEST(𝑇)})] − 𝑓∗ ≤ (𝑅² + Σ_{𝑡≤𝑇} 𝜂_𝑡² 𝐺_𝑡²) / (2 Σ_{𝑡≤𝑇} 𝜂_𝑡), where 𝑅 = ‖𝑥₀ − 𝑥∗‖

In all of the cases we have seen so far, 𝑔 is NUS.

Page 20:

SGD and its Convergence Rate

E[𝑓(𝑇)] − 𝑓∗ ≤ (𝑅² + Σ_{𝑡≤𝑇} 𝜂_𝑡² 𝐺_𝑡²) / (2 Σ_{𝑡≤𝑇} 𝜂_𝑡)

Idea 1: Fixed mini-batch size at every step: 𝑔_𝑡 = (1/|𝐵|) Σ_{𝑖∈𝐵} ∇𝑓_𝑖(𝑥_𝑡), where the batch size |𝐵| does not depend on 𝑡 ⟹ 𝐺_𝑡² = 𝐺²

1(a) Fixed step size 𝜂_𝑡 = 𝜂 gives E[𝑓(𝑇)] − 𝑓∗ ≤ 𝑅²/(𝜂𝑇) + 𝜂𝐺²

Convergence to a “ball” around the optimum at rate 1/𝑇

Page 21:

SGD and its Convergence Rate

• Convergence to the optimum ⟹ 𝜂_𝑡 has to decay with time

1(b) Best choice for trading off the two error terms: 𝜂_𝑡 ∼ 1/√𝑡

This gives an E[𝑓(𝑇)] − 𝑓∗ ≤ 𝑂((𝑅 + 𝐺)/√𝑇) convergence rate

- much slower than the 𝑂(1/𝑇) rate of (full) gradient descent
- dependence on 𝑛 is captured by 𝐺: it is 𝑂(𝑛/|𝐵|)

Page 22:

SGD and its Convergence Rate

Convergence rates:
- Full GD: 𝑂(1/𝑇) for Lipschitz 𝑓; 𝑂(𝑐^𝑇) for strongly convex 𝑓
- Fixed-size mini-batch SGD: 𝑂(1/√𝑇) for Lipschitz 𝑓; 𝑂(1/𝑇) for strongly convex 𝑓

Fixed mini-batch size means fixed variance, which forces decaying step size and hence slow convergence

Variance reduction: keep the step size 𝜂_𝑡 constant but decrease the variance 𝐺_𝑡 as 𝑡 increases
(a) by increasing the size of the mini-batch for some of the steps
(b) via memory

Page 23:

Variance reduction in SGD

min_𝑥 (1/𝑛) Σ_{𝑖=1}^𝑛 𝑓_𝑖(𝑥)

Full GD: 𝑥_{𝑡+1} = 𝑥_𝑡 − 𝜂_𝑡 (1/𝑛) Σ_𝑖 ∇𝑓_𝑖(𝑥_𝑡)   (𝑂(𝑛) computations per iteration)

SGD: 𝑥_{𝑡+1} = 𝑥_𝑡 − 𝜂_𝑡 ∇𝑓_{𝑖_𝑡}(𝑥_𝑡), with 𝑖_𝑡 a random index   (𝑂(1) computation per iteration, but many more iterations)

Page 24:

Stochastic Average Gradient (Schmidt, Le Roux, Bach)

For the problem min_𝑥 (1/𝑛) Σ_{𝑖=1}^𝑛 𝑓_𝑖(𝑥):

• Maintain 𝑔₁^{(𝑘)}, … , 𝑔_𝑛^{(𝑘)}, where 𝑔_𝑖^{(𝑘)} is the current estimate of ∇𝑓_𝑖(𝑥^{(𝑘)})

• Initialize with one pass over all samples: 𝑔_𝑖^{(0)} = ∇𝑓_𝑖(𝑥^{(0)})

• At step 𝑘, pick 𝑖_𝑘 randomly and update the 𝑔's lazily:
  𝑔_{𝑖_𝑘}^{(𝑘)} = ∇𝑓_{𝑖_𝑘}(𝑥^{(𝑘−1)}), and 𝑔_𝑗^{(𝑘)} = 𝑔_𝑗^{(𝑘−1)} for 𝑗 ≠ 𝑖_𝑘

• Update 𝑥^{(𝑘)} = 𝑥^{(𝑘−1)} − 𝜂_𝑘 (1/𝑛) Σ_{𝑖=1}^𝑛 𝑔_𝑖^{(𝑘)}
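A compact sketch of these updates; least-squares gradients stand in for the 𝑓_𝑖's, and the step size is an arbitrary illustrative choice. The running average is maintained incrementally, which is the memory-efficient implementation discussed on the next slide.

```python
import numpy as np

def sag(grad_fi, n, x0, eta=0.01, num_steps=5000, seed=0):
    """Stochastic Average Gradient: lazily refresh one g_i per step and
    move along the average of the stored g_i's."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    g = np.stack([grad_fi(i, x) for i in range(n)])  # one initial pass over all samples
    g_avg = g.mean(axis=0)
    for _ in range(num_steps):
        i = rng.integers(n)
        new_gi = grad_fi(i, x)
        g_avg += (new_gi - g[i]) / n   # running average, updated in O(d) time
        g[i] = new_gi
        x = x - eta * g_avg
    return x

# Toy usage with f_i(x) = 0.5 * (a_i^T x - b_i)^2:
rng = np.random.default_rng(1)
A, x_true = rng.normal(size=(200, 5)), rng.normal(size=5)
b = A @ x_true
x_hat = sag(lambda i, x: (A[i] @ x - b[i]) * A[i], n=200, x0=np.zeros(5))
print(np.linalg.norm(x_hat - x_true))  # close to 0
```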

Page 25:

Stochastic Average Gradient (SAG)

• Memory-efficient implementation:

(1/𝑛) Σ_{𝑖=1}^𝑛 𝑔_𝑖^{(𝑘)} = 𝑔_{𝑖_𝑘}^{(𝑘)}/𝑛 − 𝑔_{𝑖_𝑘}^{(𝑘−1)}/𝑛 + (1/𝑛) Σ_{𝑖=1}^𝑛 𝑔_𝑖^{(𝑘−1)}

so one can maintain the running average 𝑎^{(𝑘)} = 𝑔_{𝑖_𝑘}^{(𝑘)}/𝑛 − 𝑔_{𝑖_𝑘}^{(𝑘−1)}/𝑛 + 𝑎^{(𝑘−1)}

• 𝑎^{(0)} just accumulates the ∇𝑓_𝑖(𝑥^{(0)})'s

Page 26:

Stochastic Average Gradient (SAG)

NOTE: SAG is not an unbiased stochastic gradient method

• Bias = (1/𝑛) Σ_𝑖 ∇𝑓_𝑖(𝑥^{(𝑘)}) − (1/𝑛) Σ_𝑖 𝑔_𝑖^{(𝑘)}

• As 𝑘 → ∞, 𝑥^{(𝑘)} → 𝑥∗, so the Bias → 0 and the variance → 0 as well!

However, proving this is quite involved …

Theorem: If each ∇𝑓_𝑖(⋅) is 𝛽-Lipschitz, SAG with a fixed step size has

E[𝑓(𝑇)] − 𝑓∗ ≤ (𝑐𝑛/𝑇) [ 𝑓(0) − 𝑓∗ + 𝛽‖𝑥^{(0)} − 𝑥∗‖ ]

i.e., an 𝑂(1/𝑇) convergence rate!

Page 27:

Stochastic Average Gradient (SAG)

Theorem: If each 𝑓_𝑖 is also 𝛼-strongly convex,

E[𝑓(𝑇)] − 𝑓∗ ≤ (1 − 𝑐₀ 𝛼/𝛽)^𝑇 𝑐₁𝑅, where 𝑐₀ < 1 depends on 𝑛

Linear convergence for strongly convex functions …

SAG thus fixes the shortcomings of SGD wrt dependence of error on T …

… but it is biased and hence hard to prove extensions for variants like Proximal SGD …

Page 28:

Stochastic Variance Reduced Gradient

SVRG (Johnson and Zhang):

For epoch 𝑡 = 1, … , 𝑇:
    compute a full gradient at the epoch's reference point
    For iteration 𝑘 = 1, … , 𝑀:
        SAG-like updates using the variance-reduced stochastic gradient
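A minimal sketch of one way to implement this scheme, using the variance-reduced gradient ∇𝑓_𝑖(𝑥) − ∇𝑓_𝑖(𝑥_ref) + ∇𝑓(𝑥_ref) described later in the tutorial; the epoch length and step size are arbitrary choices.

```python
import numpy as np

def svrg(grad_fi, n, x0, eta=0.05, num_epochs=20, epoch_len=None, seed=0):
    """SVRG: once per epoch compute a full gradient at a reference point, then take
    cheap steps using grad_fi(i, x) - grad_fi(i, x_ref) + full_grad."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    m = epoch_len or 2 * n
    for _ in range(num_epochs):
        x_ref = x.copy()
        full_grad = np.mean([grad_fi(i, x_ref) for i in range(n)], axis=0)
        for _ in range(m):
            i = rng.integers(n)
            g = grad_fi(i, x) - grad_fi(i, x_ref) + full_grad  # unbiased, low variance near x_ref
            x = x - eta * g
    return x
```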

Page 29:

Experiments: SVRG vs SGD (Johnson and Zhang’2013)

(Figure: training loss 𝑓(𝑤_𝑡) vs. #gradient computations/𝑛 for multiclass logistic regression on MNIST, with fixed learning rates: SVRG can use a larger learning rate.)

Page 30:

Experiments: SVRG vs SGD (Johnson and Zhang’2013)

(Figure: training loss residual 𝑓(𝑤_𝑡) − 𝑓(𝑤∗) vs. #gradient computations/𝑛 for multiclass logistic regression on MNIST: SVRG is faster than SGD with best-tuned learning rate scheduling.)

Page 31:

Acceleration

Page 32:

Acceleration / Momentum

Gradient descent is one algorithm that uses the 1st order oracle (𝑥 → ∇𝑓(𝑥)), but it is possible to get faster rates with this oracle …

𝑥_{𝑡+1} = 𝑥_𝑡 + 𝛽(𝑥_𝑡 − 𝑥_{𝑡−1}) − 𝜂 ∇𝑓(𝑥_𝑡 + 𝛽(𝑥_𝑡 − 𝑥_{𝑡−1}))

(Figure: 𝑥_{𝑡+1} is obtained by extrapolating from 𝑥_{𝑡−1} through 𝑥_𝑡 to 𝑥_𝑡 + 𝛽(𝑥_𝑡 − 𝑥_{𝑡−1}) and then taking a gradient step.)

Can be viewed as a discretization of the ODE

ẍ + θ̃ ẋ + ∇𝑓(𝑥) = 0
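A sketch of this accelerated update; the momentum coefficient below is a placeholder, whereas in practice 𝛽 is set from the smoothness/strong-convexity constants or follows a schedule.

```python
import numpy as np

def nesterov_agd(grad_f, x0, eta=0.1, beta=0.9, num_steps=200):
    """x_{t+1} = x_t + beta*(x_t - x_{t-1}) - eta * grad_f(x_t + beta*(x_t - x_{t-1}))."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()
    for _ in range(num_steps):
        y = x + beta * (x - x_prev)          # extrapolated "look-ahead" point
        x_prev, x = x, y - eta * grad_f(y)   # gradient step taken at the look-ahead point
    return x

# Toy usage on an ill-conditioned quadratic with Hessian diag(1, 10):
H = np.diag([1.0, 10.0])
x_star = np.array([1.0, -2.0])
print(nesterov_agd(lambda x: H @ (x - x_star), x0=np.zeros(2)))  # close to [1, -2]
```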

Page 33:

Convergence rates of Accelerated GD

Number of iterations to reach 𝜖 accuracy:

- Gradient Descent: 𝑂(1/𝜖) for smooth 𝑓; 𝑂(𝜅 log(1/𝜖)) for smooth & strongly convex 𝑓
- Nesterov's Accelerated Gradient: 𝑂(1/√𝜖) for smooth 𝑓; 𝑂(√𝜅 log(1/𝜖)) for smooth & strongly convex 𝑓

Page 34:

Comparison of the convergence rates

Algorithm and convergence rate (iterations to reach 𝜖 accuracy):

- Subgradient: 𝑂(1/𝜖²)
- ISTA: 𝑂(1/𝜖)
- FISTA: 𝑂(1/√𝜖)

(Figure: 𝑓_𝑡 − 𝑓∗ vs. number of iterations, showing performance on a LASSO example. Figure taken from Tibshirani's lecture notes: http://www.stat.cmu.edu/~ryantibs/convexopt-S15/scribes/08-prox-grad-scribed.pdf)

Page 35:

Summary of convex background

• Convex optimization can be performed efficiently

• Gradient descent (GD) is an important workhorse

• Techniques such as momentum, stochastic updates etc. can be used to speed up GD

• Well studied complexity theory with matching lower/upper bounds

Page 36:

Nonconvex optimization

Problem: min_𝑥 𝑓(𝑥), where 𝑓(⋅) is a nonconvex function

Applications: Deep learning, compressed sensing, matrix completion, dictionary learning, nonnegative matrix factorization, …

Challenge: NP-hard in the worst case

Page 37:

Gradient descent (GD) [Cauchy 1847]

𝑥𝑡+1 = 𝑥𝑡 − 𝜂𝛻𝑓 𝑥𝑡

Question: How does it perform for non-convex functions?
Answer: It converges to first order stationary points.

Definition: 𝜖-first order stationary point (𝜖-FOSP): ‖𝛻𝑓(𝑥)‖ ≤ 𝜖

Concretely: an 𝜖-FOSP is reached in 𝑂(𝜖⁻²) iterations [Folklore]

Page 38:

GD for smooth functions

• Assumption: 𝑓(⋅) is 𝐿-smooth ≝ ∇𝑓(⋅) is 𝐿-Lipschitz:

‖∇𝑓(𝑥) − ∇𝑓(𝑦)‖ ≤ 𝐿 ‖𝑥 − 𝑦‖,   ∀𝑥, 𝑦.

This implies a quadratic upper bound on the 1st order Taylor expansion:

𝑓(𝑦) ≤ 𝑓(𝑥) + ⟨∇𝑓(𝑥), 𝑦 − 𝑥⟩ + (𝐿/2)‖𝑥 − 𝑦‖²,   ∀𝑥, 𝑦

Page 39:

Alternate view of GD

𝑥_{𝑡+1} = argmin_𝑥 { 𝑓(𝑥_𝑡) + ⟨∇𝑓(𝑥_𝑡), 𝑥 − 𝑥_𝑡⟩ + (1/2𝜂)‖𝑥 − 𝑥_𝑡‖² }

For 𝜂 ≤ 1/𝐿 the minimand is a quadratic upper bound on 𝑓, so

𝑓(𝑥_{𝑡+1}) ≤ min_𝑥 { 𝑓(𝑥_𝑡) + ⟨∇𝑓(𝑥_𝑡), 𝑥 − 𝑥_𝑡⟩ + (1/2𝜂)‖𝑥 − 𝑥_𝑡‖² } = 𝑓(𝑥_𝑡) − (𝜂/2)‖𝛻𝑓(𝑥_𝑡)‖²

Telescoping: 𝑓(𝑥_𝑇) ≤ 𝑓(𝑥₀) − (𝜂/2) Σ_{𝑡=0}^{𝑇−1} ‖𝛻𝑓(𝑥_𝑡)‖², so min_{𝑡<𝑇} ‖𝛻𝑓(𝑥_𝑡)‖² ≤ 2(𝑓(𝑥₀) − 𝑓(𝑥_𝑇))/(𝜂𝑇), which yields an 𝜖-FOSP in 𝑂(𝜖⁻²) iterations.

Page 40:

Stochastic gradient descent (SGD) [Robbins, Monro 1951]

𝑥_{𝑡+1} = 𝑥_𝑡 − 𝜂 ∇̂𝑓(𝑥_𝑡), where 𝔼[∇̂𝑓(𝑥_𝑡)] = 𝛻𝑓(𝑥_𝑡)

Question: How does it perform?
Answer: It converges to first order stationary points.

Definition: 𝜖-first order stationary point (𝜖-FOSP): ‖𝛻𝑓(𝑥)‖ ≤ 𝜖

Concretely: an 𝜖-FOSP is reached in 𝑂(𝜖⁻⁴) iterations, vs 𝑂(𝜖⁻²) for GD

Page 41:

Proof of convergence rate of SGD

• Assumption: 𝑓(⋅) is 𝐿-smooth ≝ ∇𝑓(⋅) is 𝐿-Lipschitz
• Assumption: 𝔼‖∇̂𝑓(𝑥) − ∇𝑓(𝑥)‖² ≤ 𝜎²

𝔼[𝑓(𝑥_{𝑡+1})] ≤ 𝔼[𝑓(𝑥_𝑡)] − (𝜂/2) 𝔼‖𝛻𝑓(𝑥_𝑡)‖² + (𝜂²𝐿/2) 𝜎²

With 𝜂 ∼ 1/√𝑇:   (1/𝑇) Σ_{𝑡=0}^{𝑇−1} 𝔼‖𝛻𝑓(𝑥_𝑡)‖² ≤ 𝑂(𝜎/√𝑇)

Page 42:

Finite sum problems

𝑓(𝑥) = (1/𝑛) Σ_{𝑖=1}^𝑛 𝑓_𝑖(𝑥)

• 𝑓_𝑖(⋅) = loss on the 𝑖th data point; 𝐿-smooth and nonconvex
  • E.g., 𝑓_𝑖(⋅) = ℓ(𝜙(𝑥, 𝑎_𝑖), 𝑏_𝑖); ℓ ― loss function; 𝜙 ― model.

• Usually computing ∇𝑓𝑖 ⋅ takes same amount of time for each 𝑖

Page 43:

GD vs SGD

Method | # gradient calls per iteration | # iterations
GD | 𝑛 | 𝜖⁻²
SGD | 1 | 𝜖⁻⁴

(𝜖 ― desired accuracy; 𝑛 ― # component functions)

Page 44:

Can we get best of both?

• Main observations:
  • Can compute exact gradients when required
  • Convergence of SGD depends on the noise variance 𝜎

• Main idea of SVRG: [Johnson, Zhang 2013]; [Reddi, Hefny, Sra, Poczos, Smola 2016]; [Allen-Zhu, Hazan 2016]
  • Compute a full gradient in the beginning; reference point 𝑥₀
  • At 𝑥_𝑡, use ∇̂𝑓(𝑥_𝑡) ≝ ∇𝑓_𝑖(𝑥_𝑡) − ∇𝑓_𝑖(𝑥₀) + ∇𝑓(𝑥₀)

Page 45:

Two potential functions

𝜎² ≝ 𝔼‖∇̂𝑓(𝑥_𝑡) − ∇𝑓(𝑥_𝑡)‖² ≤ 𝐿‖𝑥₀ − 𝑥_𝑡‖²

𝔼[𝑓(𝑥_{𝑡+1})] − 𝔼[𝑓(𝑥_𝑡)] ≤ −(𝜂/2) 𝔼‖𝛻𝑓(𝑥_𝑡)‖² + (𝜂²𝐿/2) 𝜎²

‖𝑥_{𝑡+1} − 𝑥_𝑡‖² ≤ 𝜂² 𝔼‖𝛻𝑓(𝑥_𝑡)‖² + 𝜂²𝐿‖𝑥₀ − 𝑥_𝑡‖²

• For 𝑡 = 0, ‖𝑥₀ − 𝑥_𝑡‖² = 0.
• ‖𝑥₀ − 𝑥_𝑡‖² can increase only if ‖𝛻𝑓(𝑥_𝑡)‖² is large.
• But if so, 𝔼[𝑓(𝑥_{𝑡+1})] will decrease significantly.
• Careful combination of these two potential functions.

Page 46:

Comparison with GD/SGD (𝜖 ― desired accuracy; 𝑛 ― # component functions)

Method | # gradient calls per iteration | # iterations
GD | 𝑛 | 𝜖⁻²
SGD | 1 | 𝜖⁻⁴
SVRG | 1 | 𝑛^{2/3} 𝜖⁻²
SNVRG (Zhou, Xu, Gu 2018), Spider (Fang, Li, Lin, Zhang 2018) | 1 | 𝑛^{1/2} 𝜖⁻² ∧ 𝜖⁻³

Other results: Ghadimi, Lan 2015; Allen-Zhu 2018

Page 47:

What do FOSPs look like?

• Hessian PSD, 𝛻²𝑓(𝑥) ≽ 0: second order stationary points (SOSP)
• Hessian NSD, 𝛻²𝑓(𝑥) ≼ 0
• Hessian indefinite: 𝜆_min(𝛻²𝑓(𝑥)) ≤ 0 and 𝜆_max(𝛻²𝑓(𝑥)) ≥ 0

Page 48:

FOSPs vs SOSPs in popular problems

• Very well studied

• Matrix sensing [Bhojanapalli, Neyshabur, Srebro 2016]

• Matrix completion [Ge, Lee, Ma 2016]

• Robust PCA [Ge, Jin, Zheng 2017]

• Tensor factorization [Ge, Huang, Jin, Yuan 2015]; [Ge, Ma 2017]

• Smooth semidefinite programs [Boumal, Voroninski, Bandeira 2016]

• Synchronization & community detection [Bandeira, Boumal, Voroninski 2016];

[Mei, Misiakiewicz, Montanari, Oliveira 2017]

• Phase retrieval [Chen, Chi, Fan, Ma 2018]

Page 49:

Upshot for these problems:
1. FOSPs are not good enough
2. Finding an SOSP is sufficient

Two major observations

• FOSPs: proliferation (exponential #) of saddle points
  • Recall FOSP ≜ 𝛻𝑓(𝑥) = 0
  • Gradient descent can get stuck near them

• SOSPs: not just local minima; as good as global minima
  • Recall SOSP ≜ 𝛻𝑓(𝑥) = 0 & 𝛻²𝑓(𝑥) ≽ 0

Page 50:

How to find SOSPs?

• Cubic regularization (CR) [Nesterov, Polyak 2006]

• Contrast with GD, which minimizes a quadratic upper bound; CR minimizes a cubic upper bound on the 2nd order Taylor expansion:

𝑥_{𝑡+1} = argmin_𝑥 { 𝑓(𝑥_𝑡) + ⟨∇𝑓(𝑥_𝑡), 𝑥 − 𝑥_𝑡⟩ + (1/2)⟨𝑥 − 𝑥_𝑡, ∇²𝑓(𝑥_𝑡)(𝑥 − 𝑥_𝑡)⟩ + (1/6𝜂)‖𝑥 − 𝑥_𝑡‖³ }

Page 51:

GD vs CR

Method | Guarantee | Per-iteration cost
CR | SOSP | Hessian computation
GD | FOSP | Gradient computation

• Hessian computation is not practical for large scale applications

Can we find SOSPs using first order (gradient) methods?

Page 52:

GD finds SOSPs with probability 1

• [Lee, Panageas, Piliouras, Simchowitz, Jordan, Recht 2017]

Lebesgue measure of {𝑥₀ : GD from 𝑥₀ converges to a non-SOSP} = 0

• However, the time taken for convergence could be Ω(𝑑) [Du, Jin, Lee, Jordan, Singh, Poczos 2017]

Image credit: Chi Jin & Mike Jordan

Page 53:

GD vs CR

Method | Guarantee | Per-iteration cost | # iterations
CR | SOSP | Hessian computation | 𝑂(𝜖^{−1.5})
GD | SOSP | Gradient computation | Ω(𝑑)

• Convergence rate of GD too slow

Can we speed up GD for finding SOSPs?

Page 54:

Perturbation to the rescue!

Perturbed gradient descent (PGD):
1. For 𝑡 = 0, 1, ⋯ do
2.     𝑥_{𝑡+1} ← 𝑥_𝑡 − 𝜂 𝛻𝑓(𝑥_𝑡) + 𝜉_𝑡, where 𝜉_𝑡 ∼ Unif(𝐵₀(𝜖))

Method | Guarantee | Per-iteration cost | # iterations
CR | SOSP | Hessian computation | 𝑂(𝜖^{−1.5})
GD | SOSP | Gradient computation | Ω(𝑑)
PGD | SOSP | Gradient computation | Õ(𝜖⁻²)

[Ge, Huang, Jin, Yuan 2015]; [Jin, Ge, Netrapalli, Kakade, Jordan 2017]
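A minimal sketch of a perturbed update; here noise is added at every step for simplicity, while the analyzed version in the cited works adds a perturbation only occasionally, so treat this as an illustration rather than the exact algorithm.

```python
import numpy as np

def perturbed_gd(grad_f, x0, eta=0.1, radius=1e-2, num_steps=1000, seed=0):
    """Gradient descent plus a small uniform-ball perturbation, which helps leave strict saddle points."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    d = x.size
    for _ in range(num_steps):
        xi = rng.normal(size=d)
        xi = xi / np.linalg.norm(xi) * radius * rng.uniform() ** (1.0 / d)  # uniform point in a ball
        x = x - eta * grad_f(x) + xi
    return x

# Saddle example f(x) = 0.5*(x1^2 - x2^2): plain GD started exactly at the saddle [0, 0] stays put;
# the perturbed iterates drift away along the negative-curvature direction.
H = np.diag([1.0, -1.0])
print(perturbed_gd(lambda x: H @ x, x0=np.zeros(2), num_steps=50))
```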

Page 55:

Main idea

• 𝑆 ≝ set of points around a saddle point from which gradient descent does not escape quickly

• Escape ≝ function value decreases significantly

• How much is Vol 𝑆 ?

• Vol 𝑆 small ⇒ perturbed GD escapes saddle points efficiently

Page 56:

Two dimensional quadratic case

• 𝑓(𝑥) = (1/2) 𝑥⊤ diag(1, −1) 𝑥

• 𝜆_min(𝐻) = −1 < 0, so (0,0) is a saddle point

• GD: 𝑥_{𝑡+1} = diag(1 − 𝜂, 1 + 𝜂) 𝑥_𝑡

• 𝑆 is a thin strip inside the ball 𝐵 around (0,0); Vol(𝑆) is small
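A tiny numerical check of this picture; the step size, horizon, and starting points are arbitrary.

```python
import numpy as np

# f(x) = 0.5 * (x1^2 - x2^2); GD multiplies the two components by (1 - eta) and (1 + eta).
eta, steps = 0.1, 200
M = np.diag([1 - eta, 1 + eta])

for x0 in [np.array([1.0, 1e-6]), np.array([1.0, 0.0])]:
    x = x0.copy()
    for _ in range(steps):
        x = M @ x
    print(x0, "->", x)
# A tiny component along the negative-curvature direction gets amplified and the iterate escapes;
# a point exactly on the x1-axis (the thin set S) never escapes.
```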

Page 57:

Three dimensional quadratic case

• 𝑓(𝑥) = (1/2) 𝑥⊤ diag(1, 1, −1) 𝑥

• (0,0,0) is a saddle point

• GD: 𝑥_{𝑡+1} = diag(1 − 𝜂, 1 − 𝜂, 1 + 𝜂) 𝑥_𝑡

• 𝑆 is a thin disc inside the ball 𝐵 around (0,0,0); Vol(𝑆) is small

Page 58:

General case

Key technical lemma: in the general case, 𝑆 is ∼ a thin deformed disc, so Vol(𝑆) is small

Page 59:

Can we accelerate nonconvex GD?

[Agarwal, Allen-Zhu, Bullins, Hazan, Ma 2016]

[Carmon, Duchi, Hinder, Sidford 2016, 2017]

[Jin, Netrapalli, Jordan 2018]

Method | Guarantee | Per-iteration cost | # iterations
CR | SOSP | Hessian computation | 𝑂(𝜖^{−1.5})
GD | SOSP | Gradient computation | Ω(𝑑)
PGD | SOSP | Gradient computation | Õ(𝜖⁻²)
PAGD | SOSP | Gradient computation | 𝑂(𝜖^{−1.75})

Page 60:

Differential equation view of Accelerated GD

• AGD is a discretization of the following ODE [Polyak 1965]; [Su, Candes, Boyd 2015]

ẍ + θ̃ ẋ + 𝛻𝑓(𝑥) = 0

• Multiplying by ẋ and integrating from 𝑡₁ to 𝑡₂ gives us

𝑓(𝑥_{𝑡₂}) + (1/2)‖ẋ_{𝑡₂}‖² = 𝑓(𝑥_{𝑡₁}) + (1/2)‖ẋ_{𝑡₁}‖² − θ̃ ∫_{𝑡₁}^{𝑡₂} ‖ẋ_𝑡‖² 𝑑𝑡

• The Hamiltonian 𝑓(𝑥_𝑡) + (1/2)‖ẋ_𝑡‖² decreases monotonically

Page 61:

After discretization

Iterate: 𝑥𝑡 and velocity: 𝑣𝑡 ≔ 𝑥𝑡 − 𝑥𝑡−1

• The Hamiltonian 𝑓(𝑥_𝑡) + (1/2𝜂)‖𝑣_𝑡‖² decreases monotonically if 𝑓(⋅) is “not too nonconvex” between 𝑥_𝑡 and 𝑥_𝑡 + 𝑣_𝑡
  • too nonconvex = negative curvature

• It can increase if 𝑓(⋅) is “too nonconvex”

• If the function is “too nonconvex”, reset the velocity or move along the nonconvex (negative-curvature) direction: negative curvature exploitation

Page 62:

Can we accelerate nonconvex GD?

[Agarwal, Allen-Zhu, Bullins, Hazan, Ma 2016]

[Carmon, Duchi, Hinder, Sidford 2016, 2017]

[Jin, Netrapalli, Jordan 2018]

Method | Guarantee | Per-iteration cost | # iterations
CR | SOSP | Hessian computation | 𝑂(𝜖^{−1.5})
GD | SOSP | Gradient computation | Ω(𝑑)
PGD | SOSP | Gradient computation | Õ(𝜖⁻²)
PAGD | SOSP | Gradient computation | 𝑂(𝜖^{−1.75})

Page 63:

Stochastic algorithms

Perturbed SGD (PSGD):
1. For 𝑡 = 0, 1, … do
2.     𝑥_{𝑡+1} ← 𝑥_𝑡 − 𝜂 ∇̂𝑓(𝑥_𝑡) + 𝜉_𝑡

• Convergence rate: 𝑂(𝜖^{−3.5}) [Fang, Lin, Zhang 2019]
• 𝑂(𝜖⁻³) with variance reduction [Zhou, Xu, Gu 2018]; [Fang, Li, Lin, Zhang 2018]
• Finite sum: 𝑂(𝜖⁻³ ∧ 𝑛𝜖⁻²)

Page 64:

Summary of nonconvex optimization

• Local optimality for nonconvex optimization

• For many problems, SOSPs are sufficient

• Ideas such as momentum, stochastic methods, variance reduction are useful in the nonconvex setting as well

• Complexity theory still not as mature as in convex optimization

Page 65:

Problems where SOSPs = global optima

• Matrix sensing [Bhojanapalli, Neyshabur, Srebro 2016]

• Matrix completion [Ge, Lee, Ma 2016]

• Robust PCA [Ge, Jin, Zheng 2017]

• Tensor factorization [Ge, Huang, Jin, Yuan 2015]; [Ge, Ma 2017]

• Smooth semidefinite programs [Boumal, Voroninski, Bandeira 2016]

• Synchronization & community detection [Bandeira, Boumal, Voroninski 2016]; [Mei, Misiakiewicz, Montanari, Oliveira 2017]

• Phase retrieval [Chen, Chi, Fan, Ma 2018]

Page 66:

Semi-definite programs (SDPs)

min_{𝑋∈ℝ^{𝑛×𝑛}} ⟨𝐶, 𝑋⟩   s.t. ⟨𝐴_𝑖, 𝑋⟩ = 𝑏_𝑖, 1 ≤ 𝑖 ≤ 𝑚;   𝑋 ≽ 0

• Several applications
  • Clustering (max-cut)
  • Control
  • Sum-of-squares
  • …

• Classical polynomial time solutions exist but can be slow
  • Interior-point methods
  • Ellipsoid method
  • Multiplicative weight update

Page 67:

Low rank solutions always exist!

• (Barvinok '95, Pataki '98): For any feasible SDP, at least one solution exists with rank 𝑘∗ ≤ √(2𝑚)

• In several applications 𝑚 ∼ 𝑛, so 𝑘∗ ≪ 𝑛.

Burer-Monteiro: Optimize in low rank space; iterations are fast!

Page 68:

Burer-Monteiro approach

Original SDP (an 𝑛² dimensional problem):
min_{𝑋∈ℝ^{𝑛×𝑛}} ⟨𝐶, 𝑋⟩   s.t. ⟨𝐴_𝑖, 𝑋⟩ = 𝑏_𝑖, 𝑖 = 1, ⋯ , 𝑚;   𝑋 ≽ 0

Burer-Monteiro factorized problem (an 𝑛𝑘 dimensional problem, 𝑘 ∼ √𝑚):
min_{𝑈∈ℝ^{𝑛×𝑘}} ⟨𝐶, 𝑈𝑈ᵀ⟩   s.t. ⟨𝐴_𝑖, 𝑈𝑈ᵀ⟩ = 𝑏_𝑖

Page 69:

Burer-Monteiro approach

Original SDP:
min_{𝑋∈ℝ^{𝑛×𝑛}} ⟨𝐶, 𝑋⟩   s.t. ⟨𝐴_𝑖, 𝑋⟩ = 𝑏_𝑖, 𝑖 = 1, ⋯ , 𝑚;   𝑋 ≽ 0

Burer-Monteiro (𝑘 ∼ √𝑚):
min_{𝑈∈ℝ^{𝑛×𝑘}} ⟨𝐶, 𝑈𝑈ᵀ⟩   s.t. ⟨𝐴_𝑖, 𝑈𝑈ᵀ⟩ = 𝑏_𝑖

Penalty version (𝜇: penalty parameter), a nonconvex problem!
min_{𝑈∈ℝ^{𝑛×𝑘}} 𝑓(𝑈) = ⟨𝐶, 𝑈𝑈ᵀ⟩ + 𝜇 Σ_𝑖 (⟨𝐴_𝑖, 𝑈𝑈ᵀ⟩ − 𝑏_𝑖)²
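A sketch of running plain gradient descent on this penalty objective; the symmetry assumptions and all constants are mine, and for symmetric 𝐶, 𝐴_𝑖 the gradient is 2(𝐶 + 2𝜇 Σ_𝑖 𝑟_𝑖 𝐴_𝑖)𝑈 with residuals 𝑟_𝑖 = ⟨𝐴_𝑖, 𝑈𝑈ᵀ⟩ − 𝑏_𝑖.

```python
import numpy as np

def bm_penalty_gd(C, A_list, b, k, mu=10.0, eta=1e-3, num_steps=5000, seed=0):
    """Gradient descent on f(U) = <C, UU^T> + mu * sum_i (<A_i, UU^T> - b_i)^2,
    assuming symmetric C and A_i."""
    rng = np.random.default_rng(seed)
    n = C.shape[0]
    U = rng.normal(scale=1.0 / np.sqrt(n), size=(n, k))
    for _ in range(num_steps):
        X = U @ U.T
        r = [np.vdot(A, X) - bi for A, bi in zip(A_list, b)]      # constraint residuals
        M = C + 2 * mu * sum(ri * A for ri, A in zip(r, A_list))  # same matrix as in the FOSP condition later
        U = U - eta * 2 * (M @ U)                                  # gradient step
    return U
```

The step size and penalty parameter are placeholders and generally need per-instance tuning.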

Page 70:

Are SOSPs = global minima?

For the penalty objective

min_{𝑈∈ℝ^{𝑛×𝑘}} 𝑓(𝑈) = ⟨𝐶, 𝑈𝑈ᵀ⟩ + 𝜇 Σ_𝑖 (⟨𝐴_𝑖, 𝑈𝑈ᵀ⟩ − 𝑏_𝑖)²

• In general, no [Bhojanapalli, Boumal, Jain, Netrapalli 2018]

Smoothed analysis: perturb the cost matrix,

min_{𝑈∈ℝ^{𝑛×𝑘}} 𝑓(𝑈) = ⟨𝐶 + 𝐺, 𝑈𝑈ᵀ⟩ + 𝜇 Σ_𝑖 (⟨𝐴_𝑖, 𝑈𝑈ᵀ⟩ − 𝑏_𝑖)²

• 𝐺: symmetric Gaussian matrix with 𝐺_{𝑖𝑗} ∼ 𝑁(0, 𝜎²)

• If 𝑘 = Ω(√𝑚 log(1/𝜖)), then with high probability every 𝜖-SOSP is an 𝜖-global optimum

[Boumal, Voroninski, Bandeira 2016]; [Bhojanapalli, Boumal, Jain, Netrapalli 2018]

Page 71:

Two key steps

For min_{𝑈∈ℝ^{𝑛×𝑘}} 𝑓(𝑈𝑈⊤) with 𝑓(⋅) convex:

1. An SOSP that is rank deficient is a global optimum [Burer-Monteiro 2003]:
   𝑈 SOSP and 𝜎_𝑘(𝑈) = 0 ⇒ 𝑈 is a global optimum
   (𝜎_𝑘(𝑈): 𝑘th largest singular value of 𝑈)

2. For perturbed SDPs, with probability 1, if 𝑘 ≥ √(2𝑚), then all FOSPs have 𝜎_𝑘(𝑈) = 0.

Page 72:

𝜖-FOSP ⇒ 𝜎_𝑘(𝑈) is small

min_{𝑈∈ℝ^{𝑛×𝑘}} 𝑓(𝑈) = ⟨𝐶 + 𝐺, 𝑈𝑈ᵀ⟩ + 𝜇 Σ_𝑖 (⟨𝐴_𝑖, 𝑈𝑈ᵀ⟩ − 𝑏_𝑖)²

• Approximate FOSP (𝑈 ∈ ℝ^{𝑛×𝑘}):

‖(𝐶 + 𝐺 + 2𝜇 Σ_𝑖 (⟨𝐴_𝑖, 𝑈𝑈⊤⟩ − 𝑏_𝑖) 𝐴_𝑖) 𝑈‖ ≤ 𝜖

Page 73:

Aside: a lower bound on the product of matrices (𝐻 ∈ ℝ^{𝑛×𝑛}, 𝑈 ∈ ℝ^{𝑛×𝑘}):

‖𝐻𝑈‖ ≥ 𝜎_{𝑛−𝑘+1}(𝐻) ⋅ 𝜎_𝑘(𝑈), i.e., 𝜎_𝑘(𝑈) ≤ ‖𝐻𝑈‖ / 𝜎_{𝑛−𝑘+1}(𝐻)

Page 74:

FOSP ⇒ 𝜎_𝑘(𝑈) is small

‖(𝐶 + 𝐺 + 2𝜇 Σ_𝑖 (⟨𝐴_𝑖, 𝑈𝑈⊤⟩ − 𝑏_𝑖) 𝐴_𝑖) 𝑈‖ ≤ 𝜖   (𝜖-FOSP)

𝜎_{𝑛−𝑘+1}(𝐶 + 𝐺 + 2𝜇 Σ_𝑖 (⟨𝐴_𝑖, 𝑈𝑈⊤⟩ − 𝑏_𝑖) 𝐴_𝑖) is large

⟹ 𝜎_𝑘(𝑈) is small

Page 75:

Smallest singular values of Gaussian matrices

• 𝜎_𝑖(𝐺) denotes the 𝑖th singular value of 𝐺.

ℙ(𝜎_𝑛(𝐺) = 0) = 0

• In general, 𝜎_{𝑛−𝑘}(𝐺) ∼ 𝑘/√𝑛.

• Can obtain large deviation bounds [Nguyen 2017]:

ℙ(𝜎_{𝑛−𝑘}(𝐺) < 𝑐 𝑘/√𝑛) < exp(−𝐶𝑘² + 𝑘 log 𝑛)

• Can extend the above to 𝐺 + 𝐴 for any fixed matrix 𝐴

• In this case, the matrix of interest is 𝐺 + 𝐶 + 2𝜇 Σ_𝑖 (⟨𝐴_𝑖, 𝑈𝑈⊤⟩ − 𝑏_𝑖) 𝐴_𝑖

Page 76:

Summary

Several problems for which SOSPs = global optima

Convert a large convex problem into a smaller nonconvex problem

• E.g., Burer-Monteiro approach for solving SDPs

• Empirically, much faster than ellipsoid/interior point methods

• Open problem: Identify other problems which have this property

Page 77:

Alternating Minimization

Page 78:

Alternating Minimization

Applicable to a special class of problems: those where the variables can be split into two sets, i.e., 𝑥 = (𝑢, 𝑣), such that

min_𝑢 𝑓(𝑢, 𝑣) is feasible / “easy” for fixed 𝑣
min_𝑣 𝑓(𝑢, 𝑣) is feasible / “easy” for fixed 𝑢

Alt-min: (initialize, then repeat)

𝑢⁺ = argmin_𝑢 𝑓(𝑢, 𝑣)
𝑣⁺ = argmin_𝑣 𝑓(𝑢⁺, 𝑣)

• “big” global steps, and can be much faster than “local” gradient descent

• No step size to be tuned / chosen

Page 79:

Matrix Completion

Find a low-rank matrix from a few (randomly sampled) elements

Page 80:

Matrix Completion

Find a low-rank matrix from a few (randomly sampled) elements

- Let Ω be the set of sampled elements

Alt min:

(1) Write as non-convex problem

(2) Alternately optimize U and V

Page 81:

AltMin for Matrix Completion

Naturally decouples into small least-squares problems:

(a) For all 𝑖: update row 𝑢_𝑖 by least squares over the observed entries in row 𝑖
(b) For all 𝑗: update row 𝑣_𝑗 by least squares over the observed entries in column 𝑗

Closed form, embarrassingly parallel
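A sketch of these row-wise least-squares updates on the standard objective min_{U,V} Σ_{(i,j)∈Ω} (M_ij − ⟨u_i, v_j⟩)²; the objective, the small ridge term, and the random initialization are my assumptions for a stable demo.

```python
import numpy as np

def altmin_matrix_completion(M_obs, mask, r, num_iters=30, ridge=1e-6, seed=0):
    """Alternating least squares for matrix completion.
    M_obs holds observed values (zeros elsewhere); mask is True where observed."""
    rng = np.random.default_rng(seed)
    m, n = M_obs.shape
    U = rng.normal(scale=1.0 / np.sqrt(r), size=(m, r))
    V = rng.normal(scale=1.0 / np.sqrt(r), size=(n, r))
    for _ in range(num_iters):
        for i in range(m):                      # (a) for all i: small least squares in u_i
            cols = np.where(mask[i])[0]
            Vc = V[cols]
            U[i] = np.linalg.solve(Vc.T @ Vc + ridge * np.eye(r), Vc.T @ M_obs[i, cols])
        for j in range(n):                      # (b) for all j: small least squares in v_j
            rows = np.where(mask[:, j])[0]
            Uc = U[rows]
            V[j] = np.linalg.solve(Uc.T @ Uc + ridge * np.eye(r), Uc.T @ M_obs[rows, j])
    return U, V

# Toy usage: rank-2 ground truth, about half of the entries observed.
rng = np.random.default_rng(1)
M = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 40))
mask = rng.random(M.shape) < 0.5
U, V = altmin_matrix_completion(M * mask, mask, r=2)
print(np.linalg.norm(U @ V.T - M) / np.linalg.norm(M))  # small relative error
```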

Page 82:

Matrix Completion

Empirically: AltMin needs fewer samples than convex methods like trace-norm minimization

- with memory as small as the input and output
- very fast, parallel

Theoretically: both take 𝑂(𝑛𝑟² log 𝑛) randomly chosen samples for exact recovery of incoherent matrices
- w/ spectral initialization
- close to the lower bound of Ω(𝑛𝑟 log 𝑛)

Page 83:

Mixed linear regression

Solve linear equations, except that each sample comes from one of two unknown parameter vectors: each observation satisfies either 𝑦_𝑖 = ⟨𝑥_𝑖, 𝛽₁⟩ or 𝑦_𝑖 = ⟨𝑥_𝑖, 𝛽₂⟩.

Find 𝛽₁, 𝛽₂ given the pairs (𝑥_𝑖, 𝑦_𝑖).

Natural for modeling with latent classes
- Evolutionary biology: separating phenotypes
- Quantitative finance: detecting regime change
- Healthcare: separating patient classes for differential treatment

Several specialized R packages (see [Grun, Leisch] for an overview)
- all implement variants / optimizations of EM / AltMin

Page 84:

Mixed Linear Regression

… my netflix problem …

Page 85:

Mixed Linear Regression

(a) Assign each sample to the 𝛽 with the lower current error

(b) Update each 𝛽 from its assigned samples

AltMin: alternate between the 𝑧's (assignments) and the 𝛽's

Both updates are simple closed forms!
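A sketch of this alternation for two components; the random initialization is only for illustration, since in practice one would start from the spectral initialization described on a later slide.

```python
import numpy as np

def altmin_mixed_linreg(X, y, num_iters=50, seed=0):
    """Alternate between (a) assigning each sample to the beta with lower error
    and (b) re-fitting each beta by least squares on its assigned samples."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    betas = rng.normal(size=(2, d))                # random init; a spectral init works better
    for _ in range(num_iters):
        errs = (y[:, None] - X @ betas.T) ** 2     # (a) per-sample error under each beta
        z = np.argmin(errs, axis=1)                #     assignments z_i
        for c in range(2):                         # (b) closed-form least squares per class
            Xc, yc = X[z == c], y[z == c]
            if len(yc) >= d:
                betas[c] = np.linalg.lstsq(Xc, yc, rcond=None)[0]
    return betas
```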

Page 86:

Mixed Linear Regression

(Figure: intuition, comparing the current iterate with the truth.)

Page 87:

Mixed Linear Regression

If the current iterate is not too far from the truth, then the majority of points will be correctly assigned. So running least-squares on these will yield a better next iterate.

(Figure: intuition, comparing the current iterate with the truth.)

Page 88:

Mixed Linear Regression

(Figure: intuition, comparing the current iterate with the truth.)

Virtuous cycle: better estimates → better labels → better estimates …

But only once the iterate is decent enough … so we need initialization …

Page 89:

Spectral Initialization

Matrix completion: take the top-𝑟 eigenvectors of the sparse, zero-filled matrix

Mixed linear regression: take the top 2 eigenvectors of

𝑀 = Σ_{𝑖=1}^𝑛 𝑦_𝑖² 𝑥_𝑖 𝑥_𝑖ᵀ

AltMin also successful for Phase retrieval, matrix sensing, robust PCA etc.

Open Problem: a more general theory of AltMin and its convergence …
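A short sketch of the mixed-regression spectral initialization above; how the two leading eigenvectors are then turned into initial 𝛽's varies by method, so here they are simply returned as candidate directions.

```python
import numpy as np

def spectral_init_mixed_linreg(X, y):
    """Top-2 eigenvectors of M = sum_i y_i^2 * x_i x_i^T, used as initial directions."""
    M = (X * (y ** 2)[:, None]).T @ X        # builds sum_i y_i^2 x_i x_i^T
    eigvals, eigvecs = np.linalg.eigh(M)     # eigenvalues in ascending order
    return eigvecs[:, -2:].T                 # the two leading eigenvectors
```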

Page 90:

Open problems

Page 91:

Quasi Newton methods

• First order methods that try to emulate second order methods (such as Newton's method) by estimating Hessians (using gradients)
  • E.g., BFGS, L-BFGS, based on the secant relation ∇𝑓(𝑥_{𝑘+1}) − ∇𝑓(𝑥_𝑘) ≈ 𝐻 (𝑥_{𝑘+1} − 𝑥_𝑘)

• Well understood theory; widely successful for convex optimization in practice

• Nonconvex optimization
  • Theory: initial results on convergence to FOSPs [Wang, Ma, Goldfarb, Liu 2016]
  • Practice: has not been effective so far
  • Need to design better algorithms?

Page 92:

Initialization

• Refers to the choice of 𝑥₀; affects both optimization speed and the quality of the local minimum

(Figure: a feedforward ReLU network with layer activations 𝑥₁, 𝑥₂, 𝑥₃, 𝑥₄ and Gaussian-initialized weights 𝑤 ∼ 𝒩(0, 𝜎_𝑤²), 𝑧 ∼ 𝒩(0, 𝜎_𝑧²).)

Page 93:

Initialization

• Highly influential empirical works: [Glorot, Bengio 2010]; [Sutskever, Martens, Dahl, Hinton 2013]; [He, Zhang, Ren, Sun 2015]

• Main viewpoint: Gradients and activations at initialization

(Figure: a single ReLU unit 𝑦 = ReLU(𝑤⊤𝑥) inside the network 𝑥₁ → 𝑥₂ → 𝑥₃ → 𝑥₄.)

Var(𝑦) ∝ 𝜎_𝑤² ⋅ ‖𝑥‖²

• For 𝑦 to stay on the same scale as the inputs 𝑥_𝑖, choose 𝜎_𝑤² ∝ 1/𝑑_in

• Training phase: not understood
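A sketch of this fan-in scaling rule; the factor 2/𝑑_in is the common choice for ReLU units and should be read as an assumption rather than the slide's prescription.

```python
import numpy as np

def init_layer(d_in, d_out, rng):
    """Gaussian weights with variance proportional to 1/d_in, so activations keep a stable scale."""
    return rng.normal(scale=np.sqrt(2.0 / d_in), size=(d_out, d_in))

rng = np.random.default_rng(0)
x = rng.normal(size=512)
for _ in range(3):
    x = np.maximum(init_layer(512, 512, rng) @ x, 0.0)  # one ReLU layer
    print(np.var(x))  # stays on roughly the same scale across layers
```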

Page 94:

Nonconvex nonconcave minimax optimization

min_𝑥 max_𝑦 𝑓(𝑥, 𝑦)

• The function to minimize, 𝑔(𝑥) ≝ max_𝑦 𝑓(𝑥, 𝑦), is only defined implicitly

• Basis of generative adversarial networks (GAN) and robust training

• Notions of local optimality not well understood in the general setting

• Gradient descent ascent widely used but its convergence properties not understood

Page 95:

Summary

• Nonconvex optimization is the primary computational workhorse in machine learning now

• Main ideas behind these algorithms come from convex optimization

• Research direction 1: Understanding the statistical + optimization landscape of important nonconvex problems
  • E.g., SOSPs = global optima in matrix factorization problems

• Research direction 2: Designing faster algorithms