
Stochastic Second-Order Optimization Methods
Part I: Convex

Fred Roosta

School of Mathematics and Physics, University of Queensland

Disclaimer

To preserve brevity and flow, many cited results are simplified, slightly modified, or generalized; please see the cited work for the details.

Unfortunately, due to time constraints, many relevant and interesting works are not mentioned in this presentation.

Problem Statement

Problem

$$\min_{x\in\mathcal{X}\subseteq\mathbb{R}^d}\; F(x) = \mathbb{E}_w\!\left[f(x;w)\right] = \overbrace{\int_\Omega \underbrace{f(x;w(\omega))}_{\text{e.g., loss/penalty + prediction function}}\, dP(w(\omega))}^{\text{Risk}}$$

$F$: (non-)convex / (non-)smooth

High-dimensional: $d \gg 1$

With the empirical (counting) measure over $\{w_i\}_{i=1}^{n}\subset\mathbb{R}^p$, i.e.,
$$\int_A dP(w) = \frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_{w_i\in A},\;\;\forall A\subseteq\Omega \;\Longrightarrow\; F(x) = \frac{1}{n}\sum_{i=1}^{n}\underbrace{f(x;w_i)}_{\text{finite-sum / empirical risk}}$$

"Big data": $n \gg 1$


Humongous Data / High Dimension

Classical algorithms =⇒ High per-iteration costs

Humongous Data / High Dimension

Modern variants:

1. Low per-iteration costs

2. Maintain original convergence rate

First Order Methods

Use only gradient information

E.g.: Gradient Descent

$$x^{(k+1)} = x^{(k)} - \alpha^{(k)}\,\nabla F(x^{(k)})$$

Smooth Convex $F$ ⇒ Sublinear, $\mathcal{O}(1/k)$

Smooth Strongly Convex $F$ ⇒ Linear, $\mathcal{O}(\rho^k)$, $\rho < 1$

However, iteration cost is high!

First Order Methods

Stochastic variants, e.g., (mini-batch) SGD

For some small $s$, choose $\{w_i\}_{i=1}^{s}$ at random and

$$x^{(k+1)} = x^{(k)} - \frac{\alpha^{(k)}}{s}\sum_{i=1}^{s}\nabla f(x^{(k)};w_i)$$

Cheap per-iteration costs!

However, slower to converge:

Smooth Convex $F$ ⇒ $\mathcal{O}(1/\sqrt{k})$

Smooth Strongly Convex $F$ ⇒ $\mathcal{O}(1/k)$
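For illustration only (not from the talk), a minimal Python sketch contrasting full gradient descent with mini-batch SGD on a generic finite-sum objective; the step sizes, batch size, and decay schedule are arbitrary placeholder choices.

```python
import numpy as np

def gradient_descent(grad_full, x0, alpha=0.1, max_iter=100):
    """Full gradient descent: x_{k+1} = x_k - alpha * grad F(x_k)."""
    x = x0.copy()
    for _ in range(max_iter):
        x = x - alpha * grad_full(x)          # one full pass over all n components per step
    return x

def minibatch_sgd(grad_i, x0, n, batch=32, alpha0=0.1, max_iter=1000, seed=0):
    """Mini-batch SGD with a 1/sqrt(k) decaying step size (one common choice)."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for k in range(1, max_iter + 1):
        idx = rng.integers(n, size=batch)     # sample s indices uniformly at random
        g = sum(grad_i(x, j) for j in idx) / batch
        x = x - (alpha0 / np.sqrt(k)) * g     # cheap step: touches only `batch` components
    return x
```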

Modifications...

Achieve the convergence rate of the full GD

Preserve the per-iteration cost of SGD

E.g.: SAG, SDCA, SVRG, Prox-SVRG, Acc-Prox-SVRG, Acc-Prox-SDCA, S2GD, mS2GD, MISO, SAGA, AMSVRG, ...

Second Order Methods

Use both gradient and Hessian information

E.g.: Classical Newton's method

$$x^{(k+1)} = x^{(k)} - \alpha^{(k)}\,\underbrace{\left[\nabla^2 F(x^{(k)})\right]^{-1}\nabla F(x^{(k)})}_{\text{Non-uniformly Scaled Gradient}}$$

Smooth Convex $F$ ⇒ Locally Superlinear

Smooth Strongly Convex $F$ ⇒ Locally Quadratic and Globally Linear

However, per-iteration cost is much higher!

Outline

Part I: Convex
  Smooth: Newton-CG
  Non-Smooth: Proximal Newton, Semi-smooth Newton

Part II: Non-Convex
  Line-Search Based Methods: L-BFGS, Gauss-Newton, Natural Gradient
  Trust-Region Based Methods: Trust-Region, Cubic Regularization

Part III: Discussion and Examples


Finite Sum / Empirical Risk Minimization

FSM/ERM

$$\min_{x\in\mathcal{X}\subseteq\mathbb{R}^d}\; F(x) = \frac{1}{n}\sum_{i=1}^{n} f(x;w_i) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$$

$f_i$: Convex and Smooth

$F$: Strongly Convex $\Longrightarrow$ Unique minimizer $x^\star$

$n \gg 1$ and/or $d \gg 1$

Iterative Scheme

$$y^{(k)} = \operatorname*{argmin}_{x\in\mathcal{X}}\left\{(x - x^{(k)})^T g^{(k)} + \frac{1}{2}(x - x^{(k)})^T H^{(k)}(x - x^{(k)})\right\}, \qquad x^{(k+1)} = x^{(k)} + \alpha_k\left(y^{(k)} - x^{(k)}\right)$$

Iterative Scheme

$$y^{(k)} = \operatorname*{argmin}_{x\in\mathcal{X}}\left\{(x - x^{(k)})^T g^{(k)} + \frac{1}{2}(x - x^{(k)})^T H^{(k)}(x - x^{(k)})\right\}, \qquad x^{(k+1)} = x^{(k)} + \alpha_k\left(y^{(k)} - x^{(k)}\right)$$

Newton: $g^{(k)} = \nabla F(x^{(k)})$ and $H^{(k)} = \nabla^2 F(x^{(k)})$

Gradient Descent: $g^{(k)} = \nabla F(x^{(k)})$ and $H^{(k)} = I$

Frank-Wolfe: $g^{(k)} = \nabla F(x^{(k)})$ and $H^{(k)} = 0$

(mini-batch) SGD: $\mathcal{S}_g \subset \{1,2,\ldots,n\} \Longrightarrow g^{(k)} = \frac{1}{|\mathcal{S}_g|}\sum_{j\in\mathcal{S}_g}\nabla f_j(x^{(k)})$ and $H^{(k)} = I$

Sub-Sampled Newton (Hessian Sub-Sampling): $g^{(k)} = \nabla F(x^{(k)})$ and $\mathcal{S}_H \subset \{1,2,\ldots,n\} \Longrightarrow H^{(k)} = \frac{1}{|\mathcal{S}_H|}\sum_{j\in\mathcal{S}_H}\nabla^2 f_j(x^{(k)})$

Sub-Sampled Newton (Gradient and Hessian Sub-Sampling): $\mathcal{S}_g \subset \{1,2,\ldots,n\} \Longrightarrow g^{(k)} = \frac{1}{|\mathcal{S}_g|}\sum_{j\in\mathcal{S}_g}\nabla f_j(x^{(k)})$ and $\mathcal{S}_H \subset \{1,2,\ldots,n\} \Longrightarrow H^{(k)} = \frac{1}{|\mathcal{S}_H|}\sum_{j\in\mathcal{S}_H}\nabla^2 f_j(x^{(k)})$

Sub-sampled Newton’s Method

Let $\mathcal{X} = \mathbb{R}^d$, i.e., unconstrained optimization

Iterative Scheme

$$p^{(k)} \approx \operatorname*{argmin}_{p\in\mathbb{R}^d}\left\{p^T g^{(k)} + \frac{1}{2} p^T H^{(k)} p\right\}, \qquad x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$$

Hessian Sub-Sampling:
$$g^{(k)} = \nabla F(x^{(k)}), \qquad H^{(k)} = \frac{1}{|\mathcal{S}_H|}\sum_{j\in\mathcal{S}_H}\nabla^2 f_j(x^{(k)})$$

Sub-sampled Newton’s Method

Global convergence, i.e., starting from any initial point

Sub-sampled Newton's Method

Iterative Scheme

Descent Direction: $H^{(k)} p^{(k)} \approx -g^{(k)}$

Step Size:
$$\alpha_k = \operatorname*{argmax}_{\alpha\le 1}\,\alpha \quad \text{s.t.} \quad F(x^{(k)} + \alpha p_k) \le F(x^{(k)}) + \alpha\beta\, p_k^T\nabla F(x^{(k)})$$

Update: $x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$

Sub-sampled Newton’s Method

Algorithm: Newton's Method with Hessian Sub-Sampling

1: Input: $x^{(0)}$
2: for $k = 0, 1, 2, \ldots$ until termination do
3:   $g^{(k)} = \nabla F(x^{(k)})$
4:   Choose $\mathcal{S}_H^{(k)} \subseteq \{1, 2, \ldots, n\}$
5:   $H^{(k)} = \frac{1}{|\mathcal{S}_H^{(k)}|}\sum_{j\in\mathcal{S}_H^{(k)}}\nabla^2 f_j(x^{(k)})$
6:   Solve $H^{(k)} p^{(k)} \approx -g^{(k)}$
7:   Find $\alpha^{(k)}$ that passes the Armijo line search
8:   Update $x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$
9: end for
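For concreteness, a minimal Python sketch of the algorithm above for an unconstrained finite-sum problem. This is an illustrative implementation, not code from the cited works; the function names, sample size, CG tolerance, and Armijo constants are placeholder choices.

```python
import numpy as np

def cg_solve(H, b, rel_tol):
    """Plain conjugate gradient: return p with ||H p - b|| <= rel_tol * ||b|| (H SPD)."""
    p, r = np.zeros_like(b), b.copy()
    d = r.copy()
    for _ in range(len(b)):
        Hd = H @ d
        step = (r @ r) / (d @ Hd)
        p = p + step * d
        r_new = r - step * Hd
        if np.linalg.norm(r_new) <= rel_tol * np.linalg.norm(b):
            break
        d = r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return p

def subsampled_newton(F, grad, hess_i, x0, n, s=128, theta=0.1,
                      beta=1e-4, tol=1e-6, max_iter=100, seed=0):
    """Hessian sub-sampling + Newton-CG + Armijo backtracking (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(max_iter):
        g = grad(x)                                   # exact (full) gradient
        if np.linalg.norm(g) <= tol:
            break
        S = rng.choice(n, size=min(s, n), replace=False)
        H = sum(hess_i(x, j) for j in S) / len(S)     # sub-sampled Hessian
        p = cg_solve(H, -g, theta)                    # ||H p + g|| <= theta * ||g||
        alpha, fx = 1.0, F(x)
        for _ in range(30):                           # Armijo backtracking, alpha <= 1
            if F(x + alpha * p) <= fx + alpha * beta * (p @ g):
                break
            alpha *= 0.5
        x = x + alpha * p
    return x
```

In practice one would work with Hessian-vector products instead of forming $H^{(k)}$ explicitly; the dense form here only keeps the sketch short.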


Sub-sampled Newton’s Method

Theorem ([Byrd et al., 2011])

If each $f_i$ is strongly convex, twice continuously differentiable with Lipschitz continuous gradient, then
$$\lim_{k\to\infty}\nabla F(x^{(k)}) = 0.$$

Sub-sampled Newton’s Method

Theorem ([Bollapragada et al., 2016])

If each $f_i$ is strongly convex, twice continuously differentiable with Lipschitz continuous gradient, then
$$\mathbb{E}\left(F(x^{(k)}) - F(x^\star)\right) \le \rho^k\left(F(x^{(0)}) - F(x^\star)\right),$$
where $\rho \le 1 - 1/\kappa^2$, with $\kappa$ being the "condition number" of the problem.

Sub-sampled Newton’s Method

What if $F$ is strongly convex, but each $f_i$ is only weakly convex? E.g.,

$$F(x) = \frac{1}{n}\sum_{i=1}^{n}\ell_i(a_i^T x)$$

$a_i\in\mathbb{R}^d$ and $\operatorname{Range}\left(\{a_i\}_{i=1}^{n}\right) = \mathbb{R}^d$

$\ell_i:\mathbb{R}\to\mathbb{R}$ and $\ell_i'' \ge \gamma > 0$

Each $\nabla^2 f_i(x) = \ell_i''(a_i^T x)\,a_i a_i^T$ is rank one!

But $\nabla^2 F(x) \succeq \frac{\gamma}{n}\,\underbrace{\lambda_{\min}\!\left(\sum_{i=1}^{n} a_i a_i^T\right)}_{>0}\, I \succ 0$

Sub-Sampling Hessian

$$H = \frac{1}{|\mathcal{S}|}\sum_{j\in\mathcal{S}}\nabla^2 f_j(x)$$

Lemma ([Roosta and Mahoney, 2016a])

Suppose $\nabla^2 F(x) \succeq \gamma I$ and $0 \preceq \nabla^2 f_i(x) \preceq L I$. Given any $0 < \varepsilon < 1$, $0 < \delta < 1$, if the Hessian is uniformly sub-sampled with
$$|\mathcal{S}| \ge \frac{2\kappa\log(d/\delta)}{\varepsilon^2},$$
then
$$\Pr\left(H \succeq (1-\varepsilon)\gamma I\right) \ge 1 - \delta,$$
where $\kappa = L/\gamma$.
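To get a feel for the sample size, here is a purely illustrative plug-in (the values of $\kappa$, $\varepsilon$, $\delta$, and $d$ are hypothetical, and the logarithm is taken as natural):

$$\kappa = 100,\quad \varepsilon = 0.1,\quad \delta = 0.01,\quad d = 10^4 \;\Longrightarrow\; |\mathcal{S}| \ge \frac{2\cdot 100\cdot\log(10^4/0.01)}{(0.1)^2} = 20000\,\log(10^6) \approx 2.8\times 10^5,$$

which is independent of $n$, so uniform sub-sampling pays off whenever $n$ is much larger than this.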


Sub-Sampling Hessian

Inexact Update

$H p \approx -g \;\Longrightarrow\; \|H p + g\| \le \theta\,\|g\|$, $\theta < 1$, using CG.

Why CG and not other solvers (at least theoretically)?

$H$ is SPD

$$p^{(t)} = \operatorname*{argmin}_{p\in\mathcal{K}_t}\; p^T g + \frac{1}{2} p^T H p \;\Longrightarrow\; \left\langle p^{(t)}, g\right\rangle \le -\frac{1}{2}\left\langle p^{(t)}, H p^{(t)}\right\rangle$$

Sub-Sampling Hessian

Theorem ([Roosta and Mahoney, 2016a])

If $\theta \le \sqrt{\dfrac{1-\varepsilon}{\kappa}}$, then, w.h.p.,
$$F(x^{(k+1)}) - F(x^\star) \le \rho\left(F(x^{(k)}) - F(x^\star)\right),$$
where $\rho = 1 - (1-\varepsilon)/\kappa^2$.

Sub-sampled Newton’s Method

Local convergence, i.e., in a neighborhood of $x^\star$, and with $\alpha^{(k)} = 1$:

$$\left\|x^{(k+1)} - x^\star\right\| \le \xi_1\left\|x^{(k)} - x^\star\right\| + \xi_2\left\|x^{(k)} - x^\star\right\|^2$$

Linear-Quadratic Error Recursion

Sub-sampled Newton’s Method

[Erdogdu and Montanari, 2015]

$$[U_{r+1}, \Lambda_{r+1}] = \operatorname{TSVD}(H, r+1)$$

$$H^{-1} = \lambda_{r+1}^{-1} I + U_r\left(\Lambda_r^{-1} - \frac{1}{\lambda_{r+1}} I\right)U_r^T$$

$$x^{(k+1)} = \operatorname*{argmin}_{x\in\mathcal{X}}\left\|x - \left(x^{(k)} - \alpha^{(k)} H^{-1}\nabla F(x^{(k)})\right)\right\|$$

Theorem ([Erdogdu and Montanari, 2015])

$$\left\|x^{(k+1)} - x^\star\right\| \le \xi_1^{(k)}\left\|x^{(k)} - x^\star\right\| + \xi_2^{(k)}\left\|x^{(k)} - x^\star\right\|^2,$$

$$\xi_1^{(k)} = 1 - \frac{\lambda_{\min}(H^{(k)})}{\lambda_{r+1}(H^{(k)})} + \frac{L_g}{\lambda_{r+1}(H^{(k)})}\sqrt{\frac{\log d}{|\mathcal{S}_H^{(k)}|}}, \quad \text{and} \quad \xi_2^{(k)} = \frac{L_H}{2\lambda_{r+1}(H^{(k)})}.$$

Note: For constrained optimization, the method is based on two-metric gradient projection ⟹ starting from an arbitrary point, the algorithm might not recognize, i.e., fail to stop at, a stationary point.
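Purely to illustrate the low-rank construction above (the function name and the use of a symmetric eigendecomposition in place of a TSVD are my own choices; see [Erdogdu and Montanari, 2015] for the actual NewSamp method):

```python
import numpy as np

def truncated_inverse(H_sub, r):
    """Rank-(r+1) approximation of the inverse of a sub-sampled Hessian.

    Keeps the top-r eigenpairs exactly and replaces the remaining spectrum by
    lambda_{r+1}, matching H^{-1} = lambda_{r+1}^{-1} I + U_r (Lambda_r^{-1} - lambda_{r+1}^{-1} I) U_r^T.
    """
    vals, vecs = np.linalg.eigh(H_sub)          # ascending eigenvalues of the symmetric H_sub
    vals, vecs = vals[::-1], vecs[:, ::-1]      # reorder to descending
    lam_r1 = vals[r]                            # lambda_{r+1} (0-based index r)
    U_r = vecs[:, :r]
    d = H_sub.shape[0]
    correction = np.diag(1.0 / vals[:r]) - (1.0 / lam_r1) * np.eye(r)
    return (1.0 / lam_r1) * np.eye(d) + U_r @ correction @ U_r.T

# usage sketch: p = -truncated_inverse(H_sub, r=20) @ grad; x_next = x + alpha * p
```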


Sub-sampling Hessian

$$H = \frac{1}{|\mathcal{S}|}\sum_{j\in\mathcal{S}}\nabla^2 f_j(x)$$

Lemma ([Roosta and Mahoney, 2016b])

Given any $0 < \varepsilon < 1$, $0 < \delta < 1$, if the Hessian is uniformly sub-sampled with
$$|\mathcal{S}| \ge \frac{2\kappa^2\log(d/\delta)}{\varepsilon^2},$$
then
$$\Pr\Big((1-\varepsilon)\nabla^2 F(x) \preceq H(x) \preceq (1+\varepsilon)\nabla^2 F(x)\Big) \ge 1 - \delta.$$

Sub-sampled Newton's Method [Roosta and Mahoney, 2016b]

Theorem ([Roosta and Mahoney, 2016b])

With high probability, we get
$$\left\|x^{(k+1)} - x^\star\right\| \le \xi_1\left\|x^{(k)} - x^\star\right\| + \xi_2\left\|x^{(k)} - x^\star\right\|^2,$$
where
$$\xi_1 = \frac{\varepsilon}{1-\varepsilon} + \left(\frac{\sqrt{\kappa}}{1-\varepsilon}\right)\theta, \quad \text{and} \quad \xi_2 = \frac{L_H}{2(1-\varepsilon)\gamma}.$$

If $\theta = \varepsilon/\sqrt{\kappa}$, then $\xi_1$ is problem-independent! ⇒ Can be made arbitrarily small!

Sub-sampled Newton’s Method

Theorem ([Roosta and Mahoney, 2016b])

Consider any $0 < \rho_0 < \rho < 1$ and $\varepsilon \le \rho_0/(1+\rho_0)$. If $\|x^{(0)} - x^\star\| \le (\rho - \rho_0)/\xi_2$ and
$$\theta \le \rho_0\sqrt{\frac{1-\varepsilon}{\kappa}},$$
we get locally Q-linear convergence
$$\|x^{(k)} - x^\star\| \le \rho\,\|x^{(k-1)} - x^\star\|, \quad k = 1, \ldots, k_0,$$
with probability $(1-\delta)^{k_0}$.

Problem-independent local convergence rate

By increasing the Hessian accuracy, a super-linear rate is possible

Putting it all together

Theorem ([Roosta and Mahoney, 2016b])

Under certain assumptions, starting at any $x^{(0)}$, we have:

linear convergence,

after a certain number of iterations, we get "problem-independent" linear convergence,

after a certain number of iterations, the step size $\alpha^{(k)} = 1$ passes the Armijo rule for all subsequent iterations.

"Any optimization algorithm for which the unit step length works has some wisdom. It is too much of a fluke if the unit step length [accidentally] works."

Prof. Jorge Nocedal, IPAM Summer School, 2012

Sub-sampled Newton’s Method

Theorem ([Bollapragada et al., 2016])

Suppose
$$\left\|\mathbb{E}_i\left[\left(\nabla^2 f_i(x) - \nabla^2 F(x)\right)^2\right]\right\| \le \sigma^2.$$
Then,
$$\mathbb{E}\left\|x^{(k+1)} - x^\star\right\| \le \xi_1\left\|x^{(k)} - x^\star\right\| + \xi_2\left\|x^{(k)} - x^\star\right\|^2,$$
where
$$\xi_1 = \frac{\sigma}{\gamma\sqrt{|\mathcal{S}_H^{(k)}|}} + \kappa\theta, \quad \text{and} \quad \xi_2 = \frac{L_H}{2\gamma}.$$

Exploiting the structure....

Example:

$$F(x) = \frac{1}{n}\sum_{i=1}^{n}\ell_i(a_i^T x) \;\Longrightarrow\; \nabla^2 F(x) = A^T D_x A,$$

where

$$A \triangleq \begin{bmatrix} a_1^T \\ \vdots \\ a_n^T \end{bmatrix}\in\mathbb{R}^{n\times d}, \qquad D_x \triangleq \frac{1}{n}\begin{bmatrix}\ell_1''(a_1^T x) & & \\ & \ddots & \\ & & \ell_n''(a_n^T x)\end{bmatrix}\in\mathbb{R}^{n\times n}.$$
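A small Python sketch of this structure; the logistic loss below is just an illustrative choice of $\ell_i$:

```python
import numpy as np

def glm_hessian(A, x, ddloss):
    """Hessian of F(x) = (1/n) * sum_i loss(a_i^T x), formed as A^T D_x A.

    A has rows a_i^T (shape n x d); ddloss is the vectorized second derivative of the loss.
    """
    n = A.shape[0]
    d2 = ddloss(A @ x) / n            # diagonal entries of D_x, shape (n,)
    return A.T @ (d2[:, None] * A)    # A^T D_x A without building the n x n diagonal matrix

# illustrative loss: logistic, l(t) = log(1 + exp(-t)), so l''(t) = s(t) * (1 - s(t))
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
logistic_ddloss = lambda t: sigmoid(t) * (1.0 - sigmoid(t))
```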


Sketching [Pilanci and Wainwright, 2017]

$$x^{(k+1)} = \operatorname*{argmin}_{x\in\mathcal{X}}\;(x - x^{(k)})^T\nabla F(x^{(k)}) + \frac{1}{2}(x - x^{(k)})^T\nabla^2 F(x^{(k)})(x - x^{(k)})$$

With $\nabla^2 F(x^{(k)}) = B^{(k)T} B^{(k)}$, $B^{(k)}\in\mathbb{R}^{n\times d}$:

$$x^{(k+1)} = \operatorname*{argmin}_{x\in\mathcal{X}}\;(x - x^{(k)})^T\nabla F(x^{(k)}) + \frac{1}{2}\big\|\underbrace{B^{(k)}}_{n\times d}(x - x^{(k)})\big\|^2$$

Sketch the Hessian: $\nabla^2 F(x^{(k)}) \approx B^{(k)T} S^{(k)T} S^{(k)} B^{(k)}$, with $S^{(k)}\in\mathbb{R}^{s\times n}$, $d \le s \ll n$:

$$x^{(k+1)} = \operatorname*{argmin}_{x\in\mathcal{X}}\;(x - x^{(k)})^T\nabla F(x^{(k)}) + \frac{1}{2}\big\|\underbrace{S^{(k)} B^{(k)}}_{s\times d}(x - x^{(k)})\big\|^2$$
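An illustrative sketch of the last display with a plain Gaussian sketching matrix (the paper's emphasis is on structured sketches such as randomized Hadamard transforms; the Gaussian choice here just keeps the example short):

```python
import numpy as np

def sketched_hessian(B, s, rng=None):
    """Approximate B^T B by B^T S^T S B with a Gaussian sketch S of size s x n.

    B is the n x d square-root factor of the Hessian; entries of S are i.i.d. N(0, 1/s),
    so that E[S^T S] = I and the sketched factor S @ B has only s rows.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = B.shape[0]
    S = rng.normal(scale=1.0 / np.sqrt(s), size=(s, n))
    SB = S @ B                        # s x d sketched square root
    return SB.T @ SB                  # d x d approximation of the Hessian
```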


Sketching [Pilanci and Wainwright, 2017]

Sub-Gaussian sketches

well-known concentration properties

involve dense/unstructured matrix operations

Randomized orthonormal systems, e.g., Hadamard or Fourier

sub-optimal sample sizes

fast matrix multiplication

Random row sampling

Uniform

Non-uniform (more on this later)

Sparse JL sketches


Non-uniform Sampling [Xu et al., 2016]

Non-uniformity among $\nabla^2 f_i$ $\Longrightarrow$ $\mathcal{O}(n)$ uniform samples!!!

Find $|\mathcal{S}_H|$ independent of $n$

Immune to non-uniformity

Non-uniform sampling schemes based on:

1. Row norms

2. Leverage scores

LiSSA [Agarwal et al., 2017]

Key idea: use a Taylor expansion (Neumann series) to construct an estimator of the Hessian inverse

$\|A\| \le 1,\; A \succ 0 \;\Longrightarrow\; A^{-1} = \sum_{i=0}^{\infty}(I - A)^i$

$A_j^{-1} = \sum_{i=0}^{j}(I - A)^i = I + (I - A)\,A_{j-1}^{-1}$

$\lim_{j\to\infty} A_j^{-1} = A^{-1}$

Uniformly sub-sampled Hessian + truncated Neumann series

Three nested loops involving Hessian-vector products (HVPs), i.e., $\nabla^2 f_{i,j}(x^{(k)})\,v_{i,j}^{(k)}$

Leverage fast multiplication by the Hessian of GLMs
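A minimal sketch of the truncated-Neumann recursion applied to the gradient, i.e., an estimator of $H^{-1}\nabla F$ built from sub-sampled Hessian-vector products (the depth, repetition count, batch size, and the assumed scaling $\|\nabla^2 f_i\| \le 1$ are illustrative placeholders, not the tuned choices of [Agarwal et al., 2017]):

```python
import numpy as np

def lissa_direction(hvp_i, grad, x, n, depth=50, reps=4, batch=1, seed=0):
    """Estimate H^{-1} grad via v_j = grad + (I - H_j) v_{j-1}, averaged over repetitions.

    hvp_i(x, j, v) returns the Hessian-vector product of the j-th component at x;
    the scaling is assumed such that ||nabla^2 f_i(x)|| <= 1.
    """
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(reps):
        v = grad.copy()                                   # v_0 = grad
        for _ in range(depth):
            idx = rng.integers(n, size=batch)
            Hv = sum(hvp_i(x, j, v) for j in idx) / batch
            v = grad + v - Hv                             # v_j = grad + (I - H_j) v_{j-1}
        estimates.append(v)
    return np.mean(estimates, axis=0)                     # approximate Newton direction (up to sign)
```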


name           complexity per iteration                                                        reference
Newton-CG      $\mathcal{O}\!\left(\mathrm{nnz}(A)\,\sqrt{\kappa}\right)$                       Folklore
SSN-LS         $\mathcal{O}\!\left(\mathrm{nnz}(A)\log n + p^2\,\kappa^{3/2}\right)$            [Xu et al., 2016]
SSN-RNS        $\mathcal{O}\!\left(\mathrm{nnz}(A) + \mathrm{sr}(A)\,p\,\kappa^{5/2}\right)$    [Xu et al., 2016]
SRHT           $\mathcal{O}\!\left(np(\log n)^4 + p^2(\log n)^4\,\kappa^{3/2}\right)$           [Pilanci et al., 2016]
SSN-Uniform    $\mathcal{O}\!\left(\mathrm{nnz}(A) + p\,\hat\kappa\,\kappa^{3/2}\right)$        [Roosta et al., 2016]
LiSSA          $\mathcal{O}\!\left(\mathrm{nnz}(A) + p\,\hat\kappa\,\bar\kappa^{2}\right)$      [Agarwal et al., 2017]

with

$$\kappa = \max_x\frac{\lambda_{\max}\!\left(\nabla^2 F(x)\right)}{\lambda_{\min}\!\left(\nabla^2 F(x)\right)}, \qquad \hat\kappa = \max_x\frac{\max_i\lambda_{\max}\!\left(\nabla^2 f_i(x)\right)}{\lambda_{\min}\!\left(\nabla^2 F(x)\right)}, \qquad \bar\kappa = \max_x\frac{\max_i\lambda_{\max}\!\left(\nabla^2 f_i(x)\right)}{\min_i\lambda_{\min}\!\left(\nabla^2 f_i(x)\right)} \;\Longrightarrow\; \kappa \le \hat\kappa \le \bar\kappa.$$

Sub-sampled Newton’s Method

Iterative Scheme

$$p^{(k)} \approx \operatorname*{argmin}_{p\in\mathbb{R}^d}\left\{p^T g^{(k)} + \frac{1}{2} p^T H^{(k)} p\right\}, \qquad x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$$

Gradient and Hessian Sub-Sampling:
$$g^{(k)} = \frac{1}{|\mathcal{S}_g|}\sum_{j\in\mathcal{S}_g}\nabla f_j(x^{(k)}), \qquad H^{(k)} = \frac{1}{|\mathcal{S}_H|}\sum_{j\in\mathcal{S}_H}\nabla^2 f_j(x^{(k)})$$

Sub-sampled Newton’s Method

Algorithm: Newton's Method with Hessian and Gradient Sub-Sampling

1: Input: $x^{(0)}$
2: for $k = 0, 1, 2, \ldots$ until termination do
3:   Choose $\mathcal{S}_g^{(k)} \subseteq \{1, 2, \ldots, n\}$ and set $g^{(k)} = \frac{1}{|\mathcal{S}_g^{(k)}|}\sum_{j\in\mathcal{S}_g^{(k)}}\nabla f_j(x^{(k)})$
4:   if $\|g^{(k)}\| \le \sigma\varepsilon_g$ then
5:     STOP
6:   end if
7:   Choose $\mathcal{S}_H^{(k)} \subseteq \{1, 2, \ldots, n\}$ and set $H^{(k)} = \frac{1}{|\mathcal{S}_H^{(k)}|}\sum_{j\in\mathcal{S}_H^{(k)}}\nabla^2 f_j(x^{(k)})$
8:   Solve $H^{(k)} p^{(k)} \approx -g^{(k)}$
9:   Find $\alpha^{(k)}$ that passes the Armijo line search
10:  Update $x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$
11: end for
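Extending the earlier sketch to this variant (again only an illustration under the same hypothetical interface): the gradient is now sub-sampled as well, and the loop stops once $\|g^{(k)}\| \le \sigma\varepsilon_g$.

```python
import numpy as np

def fully_subsampled_newton(F, grad_i, hess_i, x0, n, s_g=512, s_h=128,
                            sigma=1.0, eps_g=1e-4, beta=1e-4,
                            max_iter=100, seed=0):
    """Newton's method with gradient and Hessian sub-sampling plus the STOP test."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(max_iter):
        Sg = rng.choice(n, size=min(s_g, n), replace=False)
        g = sum(grad_i(x, j) for j in Sg) / len(Sg)    # sub-sampled gradient
        if np.linalg.norm(g) <= sigma * eps_g:
            break                                      # STOP
        Sh = rng.choice(n, size=min(s_h, n), replace=False)
        H = sum(hess_i(x, j) for j in Sh) / len(Sh)    # sub-sampled Hessian
        p = np.linalg.solve(H, -g)                     # or an inexact CG solve as before
        alpha, fx = 1.0, F(x)
        for _ in range(30):                            # Armijo backtracking
            if F(x + alpha * p) <= fx + alpha * beta * (p @ g):
                break
            alpha *= 0.5
        x = x + alpha * p
    return x
```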


Sub-sampled Newton’s Method

Global convergence, i.e., starting from any initial point

[Byrd et al., 2012, Roosta and Mahoney, 2016a, Bollapragada et al., 2016]

Sub-sampled Newton’s Method

We can write $\nabla F(x) = A B$ where

$$A := \begin{bmatrix} | & | & & | \\ \nabla f_1(x) & \nabla f_2(x) & \cdots & \nabla f_n(x) \\ | & | & & | \end{bmatrix} \quad \text{and} \quad B := \begin{bmatrix} 1/n \\ 1/n \\ \vdots \\ 1/n \end{bmatrix}$$

Lemma ([Roosta and Mahoney, 2016a])

Let $\|\nabla f_i(x)\| \le G(x) < \infty$. For any $0 < \varepsilon, \delta < 1$, if
$$|\mathcal{S}| \ge \frac{G(x)^2\left(1 + \sqrt{8\ln(1/\delta)}\right)^2}{\varepsilon^2},$$
then
$$\Pr\big(\|\nabla F(x) - g(x)\| \le \varepsilon\big) \ge 1 - \delta.$$

Sub-sampled Newton’s Method

Theorem ([Roosta and Mahoney, 2016a])

If
$$\theta \le \sqrt{\frac{1-\varepsilon_H}{\kappa}},$$
then, w.h.p.,
$$F(x^{(k+1)}) - F(x^\star) \le \rho\left(F(x^{(k)}) - F(x^\star)\right),$$
where $\rho = 1 - (1-\varepsilon_H)/\kappa^2$, and upon "STOP", we have $\|\nabla F(x^{(k)})\| < (1+\sigma)\,\varepsilon_g$.

Sub-sampled Newton’s Method

Local convergence, i.e., in a neighborhood of $x^\star$, and with $\alpha^{(k)} = 1$:

$$\left\|x^{(k+1)} - x^\star\right\| \le \xi_0 + \xi_1\left\|x^{(k)} - x^\star\right\| + \xi_2\left\|x^{(k)} - x^\star\right\|^2$$

Error Recursion

[Roosta and Mahoney, 2016b, Bollapragada et al., 2016]

Sub-sampled Newton’s Method

Theorem ([Roosta and Mahoney, 2016b])

With high probability, we get
$$\left\|x^{(k+1)} - x^\star\right\| \le \xi_0 + \xi_1\left\|x^{(k)} - x^\star\right\| + \xi_2\left\|x^{(k)} - x^\star\right\|^2,$$
where
$$\xi_0 = \frac{\varepsilon_g}{(1-\varepsilon_H)\gamma}, \qquad \xi_1 = \frac{\varepsilon_H}{1-\varepsilon_H} + \left(\frac{\sqrt{\kappa}}{1-\varepsilon_H}\right)\theta, \qquad \xi_2 = \frac{L_H}{2(1-\varepsilon_H)\gamma}.$$

Sub-sampled Newton’s Method

Theorem ([Roosta and Mahoney, 2016b])

Consider any $0 < \rho_0 + \rho_1 < \rho < 1$. If $\|x^{(0)} - x^\star\| \le c(\rho_0, \rho_1, \rho)$, $\varepsilon_g^{(k)} = \rho^k\varepsilon_g$, and
$$\theta \le \rho_0\sqrt{\frac{1-\varepsilon_H}{\kappa}},$$
we get locally R-linear convergence
$$\|x^{(k)} - x^\star\| \le c\,\rho^k,$$
with probability $(1-\delta)^{2k}$.

Problem-independent local convergence rate

Putting it all together

Theorem ([Roosta and Mahoney, 2016b])

Under certain assumptions, starting at any $x^{(0)}$, we have:

linear convergence,

after a certain number of iterations, we get "problem-independent" linear convergence,

after a certain number of iterations, the step size $\alpha^{(k)} = 1$ passes the Armijo rule for all subsequent iterations,

upon "STOP" at iteration $k$, we have $\|\nabla F(x^{(k)})\| < (1+\sigma)\sqrt{\rho^k}\,\varepsilon_g$.

Outline

Part I: Convex
  Smooth: Newton-CG
  Non-Smooth: Proximal Newton, Semi-smooth Newton

Part II: Non-Convex
  Line-Search Based Methods: L-BFGS, Gauss-Newton, Natural Gradient
  Trust-Region Based Methods: Trust-Region, Cubic Regularization

Part III: Discussion and Examples

Newton-type Methods for Non-Smooth Problems

Convex Composite Problem

$$\min_{x\in\mathcal{X}\subseteq\mathbb{R}^d} F(x) + R(x)$$

$F$: convex and smooth

$R$: convex and non-smooth

Non-Smooth Newton-type [Yu et al., 2010]

Let $\mathcal{X} = \mathbb{R}^d$, i.e., unconstrained optimization

Iterative Scheme (Non-Quadratic Sub-Problem)

$$p^{(k)} \approx \operatorname*{argmin}_{p\in\mathbb{R}^d}\left\{\sup_{g^{(k)}\in\partial(F+R)(x^{(k)})} p^T g^{(k)} + \frac{1}{2} p^T\underbrace{H^{(k)}}_{\text{e.g., QN}} p\right\}, \qquad x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$$

Proximal Newton-type [Lee et al., 2014, Byrd et al., 2016]

Iterative Scheme

$$p^{(k)} \approx \operatorname*{argmin}_{p\in\mathbb{R}^d}\left\{p^T\nabla F(x^{(k)}) + \frac{1}{2} p^T\underbrace{H^{(k)}}_{\approx\nabla^2 F(x^{(k)})} p + R(x^{(k)} + p)\right\}, \qquad x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$$

Notable examples of proximal Newton methods:

glmnet ([Friedman et al., 2010]): $\ell_1$-regularized GLMs; sub-problems are solved using coordinate descent

QUIC ([Hsieh et al., 2014]): graphical Lasso problem with factorization tricks; sub-problems are solved using coordinate descent
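For concreteness, a minimal sketch of one inexact proximal-Newton direction for $R(x) = \lambda\|x\|_1$; the inner solver here is a few proximal-gradient (ISTA) passes on the quadratic model, an illustrative stand-in for the coordinate-descent solvers used by glmnet and QUIC (all names and parameters, including grad_F and hess_F in the usage note, are placeholders).

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_newton_direction(g, H, x, lam, inner_iters=50):
    """Approximately minimize p^T g + 0.5 p^T H p + lam * ||x + p||_1 over p.

    Runs ISTA on the quadratic model: with y = x + p, each pass takes a gradient step
    on the smooth part followed by soft-thresholding.
    """
    L = np.linalg.norm(H, 2)              # step size 1/L from the model's curvature
    p = np.zeros_like(x)
    for _ in range(inner_iters):
        model_grad = g + H @ p            # gradient of the quadratic model at p
        y = soft_threshold(x + p - model_grad / L, lam / L)
        p = y - x
    return p

# outer loop (sketch): p = prox_newton_direction(grad_F(x), hess_F(x), x, lam), then
# backtrack on alpha with a suitable line search and set x += alpha * p.
```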


Proximal Newton-type [Lee et al., 2014, Byrd et al., 2016]

Inexactness of the sub-problem solver

$$p^{(k)} \approx \operatorname*{argmin}_{p\in\mathbb{R}^d}\left\{p^T\nabla F(x^{(k)}) + \frac{1}{2} p^T H^{(k)} p + R(x^{(k)} + p)\right\}$$

$p^\star$ is the minimizer of the above sub-problem iff $p^\star = \operatorname{Prox}_\lambda^{(k)}(p^\star)$, where

$$\operatorname{Prox}_\lambda^{(k)}(p) \triangleq \operatorname*{argmin}_{q\in\mathbb{R}^d}\left\{\left\langle\nabla F(x^{(k)}) + H^{(k)} p,\; q\right\rangle + R(x^{(k)} + q) + \frac{\lambda}{2}\|q - p\|^2\right\}.$$

Inexactness:
$$\left\|p - \operatorname{Prox}_\lambda^{(k)}(p)\right\| \le \theta\left\|\operatorname{Prox}_\lambda^{(k)}(0)\right\|,$$
plus some sufficient decrease condition on the sub-problem.

Proximal Newton-type [Lee et al., 2014, Byrd et al., 2016]

Step Size

$$x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$$

Line Search:
$$(F+R)(x^{(k)} + \alpha p^{(k)}) - (F+R)(x^{(k)}) \le \beta\left(\ell_k\left(x^{(k)} + \alpha p^{(k)}\right) - \ell_k\left(x^{(k)}\right)\right),$$
where
$$\ell_k(x) = F(x^{(k)}) + (x - x^{(k)})^T\nabla F(x^{(k)}) + R(x).$$

Proximal Newton-type [Lee et al., 2014, Byrd et al., 2016]

Global convergence: $x^{(k)}$ converges to an optimal solution starting at any $x^{(0)}$

Local convergence: for $x^{(0)}$ close enough to $x^\star$:

If $H^{(k)} = \nabla^2 F(x^{(k)})$:

Exact solve: quadratic convergence

Approximate solve with decaying forcing term: superlinear convergence

Approximate solve with fixed forcing term: linear convergence

If $H^{(k)}$ satisfies the Dennis-Moré condition: superlinear convergence

Sub-Sampled Proximal Newton-type [Liu et al., 2017]

FSM/ERM

$$\min_{x\in\mathcal{X}\subseteq\mathbb{R}^d}\; F(x) + R(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(a_i^T x) + R(x)$$

$f_i$: Strongly Convex, Smooth, and Self-Concordant

$R$: Convex and Non-Smooth

$n \gg 1$ and/or $d \gg 1$

Dennis-Moré condition:
$$\left|p^{(k)T}\left(H^{(k)} - \nabla^2 F(x^{(k)})\right)p^{(k)}\right| \le \eta_k\, p^{(k)T}\nabla^2 F(x^{(k)})\, p^{(k)}$$

Leverage score sampling to ensure the Dennis-Moré condition

Inexact sub-problem solver

Finite Sum / Empirical Risk Minimization

Self-Concordant

$$\left|v^T\left(\nabla^3 f(x)[v]\right)v\right| \le M\left(v^T\nabla^2 f(x)\, v\right)^{3/2}, \quad \forall x, v \in \mathbb{R}^d$$

$M = 2$ is called standard self-concordant.

Theorem ([Zhang and Lin, 2015])

Suppose there exist $\gamma > 0$ and $\eta \in [0, 1)$ such that $|f_i'''(t)| \le \gamma\left(f_i''(t)\right)^{1-\eta}$. Then

$$F(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(a_i^T x) + \frac{\lambda}{2}\|x\|^2$$

is self-concordant with

$$M = \frac{\gamma\,\max_i\|a_i\|^{1+2\eta}}{\lambda^{\eta+1/2}}.$$

Proximal Newton-type Methods

General treatment: [Fukushima and Mine, 1981, Lee et al., 2014, Byrd et al., 2016, Becker and Fadili, 2012, Schmidt et al., 2012, Schmidt et al., 2009, Shi and Liu, 2015]

Tailored to specific problems: [Friedman et al., 2010, Hsieh et al., 2014, Yuan et al., 2012, Oztoprak et al., 2012]

Self-Concordant: [Li et al., 2017, Kyrillidis et al., 2014, Tran-Dinh et al., 2013]

Outline

Part I: Convex
  Smooth: Newton-CG
  Non-Smooth: Proximal Newton, Semi-smooth Newton

Part II: Non-Convex
  Line-Search Based Methods: L-BFGS, Gauss-Newton, Natural Gradient
  Trust-Region Based Methods: Trust-Region, Cubic Regularization

Part III: Discussion and Examples

Semi-Smooth Newton-type Methods

Convex Composite Problem

$$\min_{x\in\mathcal{X}\subseteq\mathbb{R}^d} F(x) + R(x)$$

$F$: (non-)convex and (non-)smooth

$R$: convex and non-smooth

Semi-Smooth Newton-type Methods

Recall:

Proximal Newton-type Methods

$$p^{(k)} \approx \operatorname*{argmin}_{p\in\mathbb{R}^d}\left\{p^T\nabla F(x^{(k)}) + \frac{1}{2} p^T H^{(k)} p + R(x^{(k)} + p)\right\} \equiv \operatorname{prox}_R^{H^{(k)}}\left(x^{(k)} - \left(H^{(k)}\right)^{-1}\nabla F(x^{(k)})\right) - x^{(k)}$$

$$\operatorname{prox}_R^{H^{(k)}}(x) \triangleq \operatorname*{argmin}_{y\in\mathbb{R}^d}\left\{R(y) + \frac{1}{2}\|y - x\|^2_{H^{(k)}}\right\}, \qquad \|y - x\|^2_{H} = \langle y - x,\; H(y - x)\rangle$$

Semi-Smooth Newton-type Methods

Recall Newton for smooth root-finding problems:

$$r(x) = 0 \;\Longrightarrow\; J(x^{(k)})\,p^{(k)} = -r(x^{(k)}) \;\Longrightarrow\; x^{(k+1)} = x^{(k)} + p^{(k)}$$

Key Idea

For solving non-linear, non-smooth systems of equations, replace the Jacobian in the Newton iteration by an element of Clarke's generalized Jacobian, which may be nonempty (e.g., for locally Lipschitz continuous maps) even when the true Jacobian does not exist.

Example:
$$r(x) = \operatorname{prox}_R^{B}\left(x - B^{-1}\nabla F(x)\right) - x$$

Semi-Smooth Newton-type Methods

Clarke's generalized Jacobian

Let $r: \mathbb{R}^d \to \mathbb{R}^p$ be locally Lipschitz continuous at $x$ (by Rademacher's theorem, $r$ is then differentiable almost everywhere). Let $D_r$ be the set of points at which $r$ is differentiable. The Bouligand subdifferential of $r$ at $x$ is defined as
$$\partial_B r(x) \triangleq \left\{\lim_{k\to\infty} r'(x_k) \;\middle|\; x_k \in D_r,\; x_k \to x\right\}.$$
Clarke's generalized Jacobian is $\partial r(x) = \operatorname{conv}\big(\partial_B r(x)\big)$.

Semi-Smooth Newton-type Methods

Semi-Smooth Map

$r:\mathbb{R}^d\to\mathbb{R}^p$ is semi-smooth at $x$ if

$r$ is a locally Lipschitz continuous function,

$r$ is directionally differentiable at $x$, and

$\|r(x + v) - r(x) - Jv\| = o(\|v\|)$ as $v \to 0$, for $J \in \partial r(x + v)$.

Example: piecewise continuously differentiable functions.

A vector-valued function is (strongly) semi-smooth if and only if each of its component functions is (strongly) semi-smooth.

Semi-Smooth Newton-type Methods

First-order stationarity is written as a fixed-point-type non-linear system of equations, e.g.,
$$r_B(x) \triangleq x - \operatorname{prox}_R^{B}\left(x - B^{-1}\nabla F(x)\right) = 0, \qquad B \succ 0$$

A (stochastic) semi-smooth Newton method is used to (approximately) solve these systems of nonlinear equations, e.g., as in [Milzarek et al., 2018]:
$$r_B(x) \triangleq x - \operatorname{prox}_R^{B}\left(x - B^{-1} g(x)\right)$$
$$J^{(k)} p^{(k)} = -r_B(x^{(k)}), \qquad J^{(k)} \in \mathcal{J}_B(x^{(k)})$$
$$\mathcal{J}_B^{(k)} \triangleq \left\{J\in\mathbb{R}^{d\times d} \;\middle|\; J = (I - A) + A B^{-1} H^{(k)},\; A \in \partial\operatorname{prox}_R^{B}(u(x))\right\}, \qquad u(x) \triangleq x - B^{-1} g(x)$$
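To make the iteration concrete, a minimal sketch for $R(x) = \lambda\|x\|_1$ with the simple scaling $B = cI$: the prox is then soft-thresholding and an element $A$ of its generalized Jacobian is a diagonal 0/1 matrix (no globalization, full Newton steps; everything here is an illustrative choice rather than the method of [Milzarek et al., 2018]):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def semismooth_newton_l1(grad, hess, x0, lam, c=1.0, tol=1e-8, max_iter=50):
    """Semi-smooth Newton on r(x) = x - prox(x - grad F(x)/c) for R = lam * ||.||_1, B = c*I."""
    x = x0.copy()
    for _ in range(max_iter):
        u = x - grad(x) / c
        r = x - soft_threshold(u, lam / c)                 # residual; r = 0 at stationarity
        if np.linalg.norm(r) <= tol:
            break
        a = (np.abs(u) > lam / c).astype(float)            # diagonal of A in the generalized Jacobian
        J = np.diag(1.0 - a) + (a[:, None] * hess(x)) / c  # (I - A) + A B^{-1} H
        x = x + np.linalg.solve(J, -r)                     # full semi-smooth Newton step
    return x
```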


Semi-Smooth Newton-type Methods

Semi-smoothness of the proximal mapping does not hold in general.

In certain settings, these vector-valued functions are monotone and can be semi-smooth due to the properties of the proximal mappings.

In such cases, their generalized Jacobian matrices are positive semi-definite due to monotonicity.

Lemma

For a monotone and Lipschitz continuous mapping $r: \mathbb{R}^d\to\mathbb{R}^d$, and any $x\in\mathbb{R}^d$, each element of $\partial_B r(x)$ is positive semi-definite.

Semi-Smooth Newton-type Methods

General treatment: [Patrinos et al., 2014, Milzarek and Ulbrich, 2014, Xiao et al., 2016, Patrinos and Bemporad, 2013]

Tailored to a specific problem: [Yuan et al., 2018]

Stochastic: [Milzarek et al., 2018]

THANK YOU!

References

Agarwal, N., Bullins, B., and Hazan, E. (2017). Second-order stochastic optimization for machine learning in linear time. The Journal of Machine Learning Research, 18(1):4148–4187.

Becker, S. and Fadili, J. (2012). A quasi-Newton proximal splitting method. In Advances in Neural Information Processing Systems, pages 2618–2626.

Bollapragada, R., Byrd, R. H., and Nocedal, J. (2016). Exact and inexact subsampled Newton methods for optimization. IMA Journal of Numerical Analysis.

Byrd, R. H., Chin, G. M., Neveitt, W., and Nocedal, J. (2011). On the use of stochastic Hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977–995.

Byrd, R. H., Chin, G. M., Nocedal, J., and Wu, Y. (2012). Sample size selection in optimization methods for machine learning. Mathematical Programming, 134(1):127–155.

Byrd, R. H., Nocedal, J., and Oztoprak, F. (2016). An inexact successive quadratic approximation method for l-1 regularized optimization. Mathematical Programming, 157(2):375–396.

Erdogdu, M. A. and Montanari, A. (2015). Convergence rates of sub-sampled Newton methods. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Volume 2, pages 3052–3060. MIT Press.

Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1.

Fukushima, M. and Mine, H. (1981). A generalized proximal point algorithm for certain non-convex minimization problems. International Journal of Systems Science, 12(8):989–1000.

Hsieh, C.-J., Sustik, M. A., Dhillon, I. S., and Ravikumar, P. (2014). QUIC: quadratic approximation for sparse inverse covariance estimation. The Journal of Machine Learning Research, 15(1):2911–2947.

Kyrillidis, A., Mahabadi, R. K., Tran-Dinh, Q., and Cevher, V. (2014). Scalable sparse covariance estimation via self-concordance. In AAAI, pages 1946–1952.

Lee, J. D., Sun, Y., and Saunders, M. A. (2014). Proximal Newton-type methods for minimizing composite functions. SIAM Journal on Optimization, 24(3):1420–1443.

Li, J., Andersen, M. S., and Vandenberghe, L. (2017). Inexact proximal Newton methods for self-concordant functions. Mathematical Methods of Operations Research, 85(1):19–41.

Liu, X., Hsieh, C.-J., Lee, J. D., and Sun, Y. (2017). An inexact subsampled proximal Newton-type method for large-scale machine learning. arXiv preprint arXiv:1708.08552.

Milzarek, A. and Ulbrich, M. (2014). A semismooth Newton method with multidimensional filter globalization for l1-optimization. SIAM Journal on Optimization, 24(1):298–333.

Milzarek, A., Xiao, X., Cen, S., Wen, Z., and Ulbrich, M. (2018). A stochastic semismooth Newton method for nonsmooth nonconvex optimization. arXiv preprint arXiv:1803.03466.

Oztoprak, F., Nocedal, J., Rennie, S., and Olsen, P. A. (2012). Newton-like methods for sparse inverse covariance estimation. In Advances in Neural Information Processing Systems, pages 755–763.

Patrinos, P. and Bemporad, A. (2013). Proximal Newton methods for convex composite optimization. In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, pages 2358–2363. IEEE.

Patrinos, P., Stella, L., and Bemporad, A. (2014). Forward-backward truncated Newton methods for convex composite optimization. arXiv preprint arXiv:1402.6655.

Pilanci, M. and Wainwright, M. J. (2017). Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization, 27(1):205–245.

Roosta, F. and Mahoney, M. W. (2016a). Sub-sampled Newton methods I: Globally convergent algorithms. arXiv preprint arXiv:1601.04737.

Roosta, F. and Mahoney, M. W. (2016b). Sub-sampled Newton methods II: Local convergence rates. arXiv preprint arXiv:1601.04738.

Schmidt, M., Berg, E., Friedlander, M., and Murphy, K. (2009). Optimizing costly functions with simple constraints: A limited-memory projected quasi-Newton algorithm. In Artificial Intelligence and Statistics, pages 456–463.

Schmidt, M., Kim, D., and Sra, S. (2012). Projected Newton-type methods in machine learning. Optimization for Machine Learning, (1).

Shi, Z. and Liu, R. (2015). Large scale optimization with proximal stochastic Newton-type gradient descent. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 691–704. Springer.

Tran-Dinh, Q., Kyrillidis, A., and Cevher, V. (2013). Composite self-concordant minimization. arXiv preprint arXiv:1308.2867.

Xiao, X., Li, Y., Wen, Z., and Zhang, L. (2016). A regularized semi-smooth Newton method with projection steps for composite convex programs. Journal of Scientific Computing, pages 1–26.

Xu, P., Yang, J., Roosta, F., Re, C., and Mahoney, M. W. (2016). Sub-sampled Newton methods with non-uniform sampling. In Advances in Neural Information Processing Systems (NIPS), pages 3000–3008.

Yu, J., Vishwanathan, S., Gunter, S., and Schraudolph, N. N. (2010). A quasi-Newton approach to nonsmooth convex optimization problems in machine learning. Journal of Machine Learning Research, 11(Mar):1145–1200.

Yuan, G.-X., Ho, C.-H., and Lin, C.-J. (2012). An improved GLMNET for l1-regularized logistic regression. Journal of Machine Learning Research, 13(Jun):1999–2030.

Yuan, Y., Sun, D., and Toh, K.-C. (2018). An efficient semismooth Newton based algorithm for convex clustering. arXiv preprint arXiv:1802.07091.

Zhang, Y. and Lin, X. (2015). DiSCO: Distributed optimization for self-concordant empirical loss. In International Conference on Machine Learning, pages 362–370.