Stochastic Second-Order Optimization Methods — Part I: Convex

Fred Roosta
School of Mathematics and Physics, University of Queensland

Page 2:

Disclaimer

Disclaimer:

To preserve brevity and flow, many cited results are simplified, slightly modified, or generalized... please see the cited work for the details.

Unfortunately, due to time constraints, many relevant and interesting works are not mentioned in this presentation.

Page 3:

Problem Statement

Problem

min_{x ∈ X ⊆ R^d} F(x) = E_w[f(x; w)] = ∫_Ω f(x; w(ω)) dP(w(ω))    (the Risk; f is, e.g., a loss/penalty composed with a prediction function)

F: (non-)convex / (non-)smooth

High-dimensional: d ≫ 1

With the empirical (counting) measure over {w_i}_{i=1}^n ⊂ R^p, i.e.,

∫_A dP(w) = (1/n) ∑_{i=1}^n 1_{w_i ∈ A}, ∀ A ⊆ Ω  ⟹  F(x) = (1/n) ∑_{i=1}^n f(x; w_i)    (finite-sum / empirical risk)

“Big data”: n ≫ 1

Page 6:

Humongous Data / High Dimension

Classical algorithms =⇒ High per-iteration costs

Page 7:

Humongous Data / High Dimension

Modern variants:

1 Low per-iteration costs

2 Maintain original convergence rate

Page 8:

First Order Methods

Use only gradient information

E.g.: Gradient Descent

x(k+1) = x(k) − α(k)∇F (x(k))

Smooth Convex F ⇒ Sublinear, O(1/k)

Smooth Strongly Convex F ⇒ Linear, O(ρk), ρ < 1

However, iteration cost is high!
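To make the update rule concrete, here is a minimal Python sketch of gradient descent on a strongly convex quadratic; the names (grad_F, alpha) and constants are illustrative choices, not from the slides.

```python
import numpy as np

def gradient_descent(grad_F, x0, alpha=0.25, max_iter=1000, tol=1e-8):
    """Fixed-step gradient descent: x(k+1) = x(k) - alpha * grad F(x(k))."""
    x = x0.copy()
    for _ in range(max_iter):
        g = grad_F(x)
        if np.linalg.norm(g) <= tol:
            break
        x = x - alpha * g
    return x

# Example: strongly convex quadratic F(x) = 0.5 x^T A x - b^T x,
# whose unique minimizer solves A x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_star = gradient_descent(lambda x: A @ x - b, np.zeros(2))
```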

Page 9:

First Order Methods

Stochastic variants, e.g., (mini-batch) SGD

For some small s, choose {w_i}_{i=1}^s at random, and

x(k+1) = x(k) − (α(k)/s) ∑_{i=1}^s ∇f(x(k); w_i)

Cheap per-iteration costs!

However, slower to converge:

Smooth Convex F ⇒ O(1/√k)

Smooth Strongly Convex F ⇒ O(1/k)
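A matching sketch of mini-batch SGD; the callback grad_fi and the O(1/k) step-size schedule are illustrative assumptions (the schedule is one standard choice for the strongly convex rate quoted above).

```python
import numpy as np

def minibatch_sgd(grad_fi, n, x0, s=32, alpha0=1.0, max_iter=5000, seed=0):
    """Mini-batch SGD: average the gradient over a random batch of size s.

    grad_fi(x, i) returns the gradient of the i-th term f(x; w_i);
    the diminishing step alpha0/(k+1) matches the strongly convex O(1/k) rate.
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for k in range(max_iter):
        batch = rng.choice(n, size=s, replace=False)
        g = sum(grad_fi(x, i) for i in batch) / s
        x = x - (alpha0 / (k + 1)) * g
    return x
```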

Page 10:

Modifications...

Achieve the convergence rate of the full GD

Preserve the per-iteration cost of SGD

E.g.: SAG, SDCA, SVRG, Prox-SVRG, Acc-Prox-SVRG, Acc-Prox-SDCA, S2GD, mS2GD, MISO, SAGA, AMSVRG, ...

Page 11:

Second Order Methods

Use both gradient and Hessian information

E.g.: Classical Newton’s method

x(k+1) = x(k) − α(k) [∇²F(x(k))]⁻¹ ∇F(x(k))    (a non-uniformly scaled gradient)

Smooth Convex F ⇒ Locally Superlinear

Smooth Strongly Convex F ⇒ Locally Quadratic and Globally Linear

However, per-iteration cost is much higher!

Page 12: Part I: Convex Fred Roosta University of Queensland · Part I: Convex Fred Roosta School of Mathematics and Physics University of Queensland. Intro Smooth NonSmooth ... \Big data":

Intro Smooth NonSmooth

Outline

Part I: Convex
  Smooth: Newton-CG
  Non-Smooth: Proximal Newton; Semi-smooth Newton

Part II: Non-Convex
  Line-Search Based Methods: L-BFGS; Gauss-Newton; Natural Gradient
  Trust-Region Based Methods: Trust-Region; Cubic Regularization

Part III: Discussion and Examples


Page 14:

Finite Sum / Empirical Risk Minimization

FSM/ERM

min_{x ∈ X ⊆ R^d} F(x) = (1/n) ∑_{i=1}^n f(x; w_i) = (1/n) ∑_{i=1}^n f_i(x)

f_i: Convex and Smooth

F: Strongly Convex ⟹ Unique minimizer x⋆

n ≫ 1 and/or d ≫ 1

Page 15:

Iterative Scheme

y(k) = argmin_{x ∈ X} (x − x(k))ᵀ g(k) + ½ (x − x(k))ᵀ H(k) (x − x(k))

x(k+1) = x(k) + α(k) (y(k) − x(k))

Page 16:

Iterative Scheme

y(k) = argmin_{x ∈ X} (x − x(k))ᵀ g(k) + ½ (x − x(k))ᵀ H(k) (x − x(k)),    x(k+1) = x(k) + α(k) (y(k) − x(k))

Newton: g(k) = ∇F(x(k)) & H(k) = ∇²F(x(k))

Gradient Descent: g(k) = ∇F(x(k)) & H(k) = I

Frank-Wolfe: g(k) = ∇F(x(k)) & H(k) = 0

(mini-batch) SGD: S_g ⊂ {1, 2, …, n} ⟹ g(k) = (1/|S_g|) ∑_{j∈S_g} ∇f_j(x(k)) & H(k) = I

Sub-Sampled Newton (Hessian Sub-Sampling): g(k) = ∇F(x(k)); S_H ⊂ {1, 2, …, n} ⟹ H(k) = (1/|S_H|) ∑_{j∈S_H} ∇²f_j(x(k))

Sub-Sampled Newton (Gradient and Hessian Sub-Sampling): S_g ⊂ {1, 2, …, n} ⟹ g(k) = (1/|S_g|) ∑_{j∈S_g} ∇f_j(x(k)); S_H ⊂ {1, 2, …, n} ⟹ H(k) = (1/|S_H|) ∑_{j∈S_H} ∇²f_j(x(k))

Page 17:

Sub-sampled Newton’s Method

Let X = R^d, i.e., unconstrained optimization

Iterative Scheme

p(k) ≈ argmin_{p ∈ R^d} pᵀ g(k) + ½ pᵀ H(k) p,    x(k+1) = x(k) + α(k) p(k)

g(k) = ∇F(x(k))

H(k) = (1/|S_H|) ∑_{j∈S_H} ∇²f_j(x(k))    (Hessian Sub-Sampling)

Page 18:

Sub-sampled Newton’s Method

Global convergence, i.e., starting from any initial point

Page 19:

Sub-sampled Newton’s Method

Iterative Scheme

Descent Direction: H(k)p(k) ≈ −g(k),

Step Size:

α(k) = argmax_{α ≤ 1} α  s.t.  F(x(k) + α p(k)) ≤ F(x(k)) + α β p(k)ᵀ ∇F(x(k))

Update: x(k+1) = x(k) + α(k)p(k)

Page 20:

Sub-sampled Newton’s Method

Algorithm: Newton’s Method with Hessian Sub-Sampling
1: Input: x(0)
2: for k = 0, 1, 2, … until termination do
3:   g(k) = ∇F(x(k))
4:   Choose S_H(k) ⊆ {1, 2, …, n}
5:   H(k) = (1/|S_H(k)|) ∑_{j∈S_H(k)} ∇²f_j(x(k))
6:   Solve H(k) p(k) ≈ −g(k)
7:   Find α(k) that passes the Armijo line search
8:   Update x(k+1) = x(k) + α(k) p(k)
9: end for
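Below is a self-contained Python sketch of this algorithm: full gradient, sub-sampled Hessian, a hand-rolled CG inner solver, and Armijo backtracking from the unit step. All names (hess_fi, cg_solve) and constants (theta, beta) are illustrative assumptions, not taken from the cited papers.

```python
import numpy as np

def cg_solve(H, b, theta=0.5, max_cg=200):
    # Plain conjugate gradient for H p = b (H SPD), stopped once
    # ||H p - b|| <= theta * ||b||, i.e., the inexact-update condition.
    p = np.zeros_like(b)
    r = b.copy()                                  # residual b - H p
    d = r.copy()
    bnorm = np.linalg.norm(b)
    for _ in range(max_cg):
        if np.linalg.norm(r) <= theta * bnorm:
            break
        Hd = H @ d
        a = (r @ r) / (d @ Hd)
        p, r_new = p + a * d, r - a * Hd
        d = r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return p

def subsampled_newton(F, grad_F, hess_fi, n, x0, s, iters=50, beta=1e-4, seed=0):
    # Newton's method with Hessian sub-sampling: hess_fi(x, j) is an assumed
    # user-supplied callback returning the d x d Hessian of f_j at x.
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(iters):
        g = grad_F(x)                                  # step 3: full gradient
        if np.linalg.norm(g) <= 1e-8:
            break
        S = rng.choice(n, size=s, replace=False)       # step 4: sample S_H
        H = sum(hess_fi(x, j) for j in S) / s          # step 5: sub-sampled Hessian
        p = cg_solve(H, -g)                            # step 6: H p ~ -g
        alpha = 1.0                                    # step 7: Armijo backtracking
        while F(x + alpha * p) > F(x) + alpha * beta * (p @ g):
            alpha *= 0.5
        x = x + alpha * p                              # step 8
    return x
```

With each f_j convex and the sampled H positive definite, p is a descent direction, so the backtracking loop terminates.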

Page 21:

Sub-sampled Newton’s Method

Theorem ([Byrd et al., 2011])

If each f_i is strongly convex and twice continuously differentiable with Lipschitz continuous gradient, then

lim_{k→∞} ∇F(x(k)) = 0.

Page 22:

Sub-sampled Newton’s Method

Theorem ([Bollapragada et al., 2016])

If each f_i is strongly convex and twice continuously differentiable with Lipschitz continuous gradient, then

E[F(x(k)) − F(x⋆)] ≤ ρᵏ (F(x(0)) − F(x⋆)),

where ρ ≤ 1 − 1/κ², with κ being the “condition number” of the problem.

Page 23:

Sub-sampled Newton’s Method

What if F is strongly convex, but each f_i is only weakly convex? E.g.,

F(x) = (1/n) ∑_{i=1}^n ℓ_i(a_iᵀ x)

a_i ∈ R^d and span({a_i}_{i=1}^n) = R^d

ℓ_i: R → R and ℓ''_i ≥ γ > 0

Each ∇²f_i(x) = ℓ''_i(a_iᵀ x) a_i a_iᵀ is rank one!

But ∇²F(x) ⪰ (γ/n) λ_min(∑_{i=1}^n a_i a_iᵀ) I ≻ 0, since the a_i span R^d.

Page 24:

Sub-Sampling Hessian

H = (1/|S|) ∑_{j∈S} ∇²f_j(x)

Lemma ([Roosta and Mahoney, 2016a])

Suppose ∇²F(x) ⪰ γI and 0 ⪯ ∇²f_i(x) ⪯ LI. Given any 0 < ε < 1, 0 < δ < 1, if the Hessian is uniformly sub-sampled with

|S| ≥ 2κ log(d/δ) / ε²,

then

Pr(H ⪰ (1 − ε) γ I) ≥ 1 − δ,

where κ = L/γ.
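The lemma translates directly into a sample-size rule; a small helper, assuming the (simplified) constants exactly as stated above:

```python
import numpy as np

def hessian_sample_size(kappa, d, eps, delta):
    """|S| >= 2 * kappa * log(d / delta) / eps^2, from the lemma above
    (kappa = L / gamma); constants follow the slide, which simplifies
    those in the cited paper."""
    return int(np.ceil(2.0 * kappa * np.log(d / delta) / eps**2))

# e.g., kappa = 100, d = 10_000, eps = 0.5, delta = 0.01  ->  about 11,053
print(hessian_sample_size(100.0, 10_000, 0.5, 0.01))
```

Note the bound is independent of n: the sample size is driven by κ and log d only.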

Page 25:

Sub-Sampling Hessian

Inexact Update

Hp ≈ −g ⟹ ‖Hp + g‖ ≤ θ‖g‖, θ < 1, using CG.

Why CG and not other solvers (at least theoretically)?

H is SPD

p(t) = argmin_{p ∈ K_t} pᵀg + ½ pᵀHp  ⟹  ⟨p(t), g⟩ ≤ −½ ⟨p(t), H p(t)⟩

(Here K_t is the degree-t Krylov subspace generated by H and g; the inequality makes every CG iterate a descent direction.)

Page 26:

Sub-Sampling Hessian

Theorem ([Roosta and Mahoney, 2016a])

If θ ≤ √((1 − ε)/κ), then, w.h.p.,

F(x(k+1)) − F(x⋆) ≤ ρ (F(x(k)) − F(x⋆)),

where ρ = 1 − (1 − ε)/κ².

Page 27:

Sub-sampled Newton’s Method

Local convergence, i.e., in a neighborhood of x⋆, and with α(k) = 1:

‖x(k+1) − x⋆‖ ≤ ξ₁ ‖x(k) − x⋆‖ + ξ₂ ‖x(k) − x⋆‖²    (Linear-Quadratic Error Recursion)

Page 28:

Sub-sampled Newton’s Method

[Erdogdu and Montanari, 2015]

[U_{r+1}, Λ_{r+1}] = TSVD(H, r + 1)

H⁻¹ = (1/λ_{r+1}) I + U_r (Λ_r⁻¹ − (1/λ_{r+1}) I) U_rᵀ

x(k+1) = argmin_{x ∈ X} ‖x − (x(k) − α(k) H⁻¹ ∇F(x(k)))‖

Theorem ([Erdogdu and Montanari, 2015])

‖x(k+1) − x⋆‖ ≤ ξ₁(k) ‖x(k) − x⋆‖ + ξ₂(k) ‖x(k) − x⋆‖²,

where

ξ₁(k) = 1 − λ_min(H(k))/λ_{r+1}(H(k)) + (L_g/λ_{r+1}(H(k))) √(log d / |S_H(k)|), and ξ₂(k) = L_H / (2 λ_{r+1}(H(k))).

Note: For constrained optimization, the method is based on two-metric gradient projection ⟹ starting from an arbitrary point, the algorithm might not recognize, i.e., may fail to stop at, a stationary point.
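A sketch of the truncated-spectrum inverse above, assuming H is a symmetric positive semi-definite sub-sampled Hessian (the function name is illustrative):

```python
import numpy as np

def truncated_inverse(H, r):
    """Regularized inverse from the slide:
    H^{-1} = (1/lam_{r+1}) I + U_r (Lam_r^{-1} - (1/lam_{r+1}) I) U_r^T."""
    vals, vecs = np.linalg.eigh(H)            # eigenvalues in ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]    # re-sort: largest first
    lam_r1 = vals[r]                          # lambda_{r+1} (0-based index r)
    Ur = vecs[:, :r]                          # top-r eigenvectors
    d = H.shape[0]
    return (np.eye(d) / lam_r1
            + Ur @ (np.diag(1.0 / vals[:r]) - np.eye(r) / lam_r1) @ Ur.T)
```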

Page 29:

Sub-sampling Hessian

H = (1/|S|) ∑_{j∈S} ∇²f_j(x)

Lemma ([Roosta and Mahoney, 2016b])

Given any 0 < ε < 1, 0 < δ < 1, if the Hessian is uniformly sub-sampled with

|S| ≥ 2κ² log(d/δ) / ε²,

then

Pr((1 − ε) ∇²F(x) ⪯ H(x) ⪯ (1 + ε) ∇²F(x)) ≥ 1 − δ.

Page 30:

Sub-sampled Newton’s Method [Roosta and Mahoney, 2016b]

Theorem ([Roosta and Mahoney, 2016b])

With high probability,

‖x(k+1) − x⋆‖ ≤ ξ₁ ‖x(k) − x⋆‖ + ξ₂ ‖x(k) − x⋆‖²,

where

ξ₁ = ε/(1 − ε) + (√κ/(1 − ε)) θ, and ξ₂ = L_H / (2(1 − ε)γ).

If θ = ε/√κ, then ξ₁ is problem-independent! ⇒ Can be made arbitrarily small!

Page 31:

Sub-sampled Newton’s Method

Theorem ([Roosta and Mahoney, 2016b])

Consider any 0 < ρ₀ < ρ < 1 and ε ≤ ρ₀/(1 + ρ₀). If ‖x(0) − x⋆‖ ≤ (ρ − ρ₀)/ξ₂, and

θ ≤ ρ₀ √((1 − ε)/κ),

we get locally Q-linear convergence

‖x(k) − x⋆‖ ≤ ρ ‖x(k−1) − x⋆‖, k = 1, …, k₀,

with probability (1 − δ)^{k₀}.

Problem-independent local convergence rate

By increasing Hessian accuracy, a super-linear rate is possible

Page 32:

Putting it all together

Theorem ([Roosta and Mahoney, 2016b])

Under certain assumptions, starting at any x(0), we have

linear convergence,

after a certain number of iterations, “problem-independent” linear convergence,

after a certain number of iterations, the step size α(k) = 1 passes the Armijo rule for all subsequent iterations.

Page 33:

“Any optimization algorithm for which the unit step length works has some wisdom. It is too much of a fluke if the unit step length [accidentally] works.”

Prof. Jorge Nocedal
IPAM Summer School, 2012

Page 34:

Sub-sampled Newton’s Method

Theorem ([Bollapragada et al., 2016])

Suppose

‖E_i[(∇²f_i(x) − ∇²F(x))²]‖ ≤ σ².

Then,

E‖x(k+1) − x⋆‖ ≤ ξ₁ ‖x(k) − x⋆‖ + ξ₂ ‖x(k) − x⋆‖²,

where

ξ₁ = σ/(γ √|S_H(k)|) + κθ, and ξ₂ = L_H/(2γ).

Page 35:

Exploiting the structure....

Example:

F(x) = (1/n) ∑_{i=1}^n ℓ_i(a_iᵀ x) ⟹ ∇²F(x) = Aᵀ D_x A,

where

A ≜ [a₁ᵀ; …; a_nᵀ] ∈ R^{n×d},  D_x ≜ (1/n) diag(ℓ''₁(a₁ᵀ x), …, ℓ''_n(a_nᵀ x)) ∈ R^{n×n}.
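This structure gives Hessian-vector products in O(nnz(A)) time without ever forming the d × d matrix; a minimal sketch, where ell_pp (the second derivative of the loss) and the function name are illustrative:

```python
import numpy as np

def glm_hess_vec(A, x, ell_pp, v):
    """Matrix-free product (A^T D_x A) v for F(x) = (1/n) sum_i l_i(a_i^T x),
    with D_x = diag(l''_i(a_i^T x)) / n; the d x d Hessian is never formed."""
    n = A.shape[0]
    return A.T @ (ell_pp(A @ x) * (A @ v) / n)

# Logistic-regression curvature as a concrete l'':
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
ell_pp = lambda t: sigmoid(t) * (1.0 - sigmoid(t))
```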

Page 36:

Sketching [Pilanci and Wainwright, 2017]

x(k+1) = argmin_{x ∈ X} (x − x(k))ᵀ ∇F(x(k)) + ½ (x − x(k))ᵀ ∇²F(x(k)) (x − x(k))

∇²F(x(k)) = B(k)ᵀ B(k),  B(k) ∈ R^{n×d}

x(k+1) = argmin_{x ∈ X} (x − x(k))ᵀ ∇F(x(k)) + ½ ‖B(k) (x − x(k))‖²    (B(k): n × d)

∇²F(x(k)) ≈ B(k)ᵀ S(k)ᵀ S(k) B(k),  S(k) ∈ R^{s×n},  d ≤ s ≪ n

x(k+1) = argmin_{x ∈ X} (x − x(k))ᵀ ∇F(x(k)) + ½ ‖S(k) B(k) (x − x(k))‖²    (S(k)B(k): s × d)
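A sketch of the sketched Hessian with a Gaussian (sub-Gaussian) S; drawing entries N(0, 1/s) makes E[(SB)ᵀ(SB)] = BᵀB. Names are illustrative.

```python
import numpy as np

def gaussian_sketched_hessian(B, s, seed=0):
    """Newton sketch of H = B^T B: draw S in R^{s x n} with i.i.d. N(0, 1/s)
    entries, so (S B)^T (S B) is an unbiased estimate of B^T B and the
    s x d matrix S B is cheap to factor when s << n."""
    rng = np.random.default_rng(seed)
    S = rng.normal(0.0, 1.0 / np.sqrt(s), size=(s, B.shape[0]))
    SB = S @ B
    return SB.T @ SB
```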

Page 37:

Sketching [Pilanci and Wainwright, 2017]

Sub-Gaussian sketches

well-known concentration properties

involve dense/unstructured matrix operations

Randomized orthonormal systems, e.g., Hadamard or Fourier

sub-optimal sample sizes

fast matrix multiplication

Random row sampling

Uniform

Non-uniform (more on this later)

Sparse JL sketches

Page 38:

Non-uniform Sampling [Xu et al., 2016]

Non-uniformity among the ∇²f_i ⟹ uniform sampling can require O(n) samples!

Find |S_H| independent of n

Immune to non-uniformity

Non-uniform sampling schemes based on:

1 Row norms

2 Leverage scores

Page 39:

LiSSA [Agarwal et al., 2017]

Key idea: use a Taylor expansion (Neumann series) to construct an estimator of the Hessian inverse

‖A‖ ≤ 1, A ≻ 0 ⟹ A⁻¹ = ∑_{i=0}^∞ (I − A)^i

A_j⁻¹ ≜ ∑_{i=0}^j (I − A)^i = I + (I − A) A_{j−1}⁻¹

lim_{j→∞} A_j⁻¹ = A⁻¹

Uniformly sub-sampled Hessian + truncated Neumann series

Three nested loops involving Hessian-vector products (HVPs), i.e., ∇²f_{i,j}(x(k)) v_{i,j}(k)

Leverages fast multiplication by the Hessian of GLMs
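One inner LiSSA chain as a minimal sketch, assuming a bound scale ≥ ‖∇²f_i‖ so the Neumann series converges; the full algorithm averages several such chains across its nested loops, which is omitted here. Names are illustrative.

```python
import numpy as np

def lissa_direction(g, hess_vec_i, n, depth, scale, seed=0):
    """Neumann-series estimate of H^{-1} g from single-sample HVPs.

    With A = H / scale (so ||A|| <= 1), the recursion
    v_j = g + (I - A) v_{j-1} converges to A^{-1} g = scale * H^{-1} g;
    each step uses one sampled Hessian-vector product hess_vec_i(v, i).
    """
    rng = np.random.default_rng(seed)
    v = g.copy()
    for _ in range(depth):
        i = rng.integers(n)                       # fresh sample each step
        v = g + v - hess_vec_i(v, i) / scale
    return v / scale                              # estimate of H^{-1} g
```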

Page 40:

name        | complexity per iteration                  | reference
Newton-CG   | O(nnz(A) √κ)                              | folklore
SSN-LS      | O(nnz(A) log n + p² κ^{3/2})              | [Xu et al., 2016]
SSN-RNS     | O(nnz(A) + sr(A) p κ^{5/2})               | [Xu et al., 2016]
SRHT        | O(np (log n)⁴ + p² (log n)⁴ κ^{3/2})      | [Pilanci et al., 2016]
SSN-Uniform | O(nnz(A) + p κ̂ κ^{3/2})                   | [Roosta et al., 2016]
LiSSA       | O(nnz(A) + p κ̃ κ̂²)                        | [Agarwal et al., 2017]

κ = max_x λ_max(∇²F(x)) / λ_min(∇²F(x))

κ̂ = max_x max_i λ_max(∇²f_i(x)) / λ_min(∇²F(x))

κ̃ = max_x max_i λ_max(∇²f_i(x)) / min_i λ_min(∇²f_i(x))

⇒ κ ≤ κ̂ ≤ κ̃

Page 41:

Sub-sampled Newton’s Method

Iterative Scheme

p(k) ≈ argmin_{p ∈ R^d} pᵀ g(k) + ½ pᵀ H(k) p,    x(k+1) = x(k) + α(k) p(k)

g(k) = (1/|S_g|) ∑_{j∈S_g} ∇f_j(x(k))

H(k) = (1/|S_H|) ∑_{j∈S_H} ∇²f_j(x(k))    (Gradient and Hessian Sub-Sampling)

Page 42:

Sub-sampled Newton’s Method

Algorithm: Newton’s Method with Hessian and Gradient Sub-Sampling
1: Input: x(0)
2: for k = 0, 1, 2, … until termination do
3:   Choose S_g(k) ⊆ {1, 2, …, n} ⟹ g(k) = (1/|S_g(k)|) ∑_{j∈S_g(k)} ∇f_j(x(k))
4:   if ‖g(k)‖ ≤ σ ε_g then
5:     STOP
6:   end if
7:   Choose S_H(k) ⊆ {1, 2, …, n} ⟹ H(k) = (1/|S_H(k)|) ∑_{j∈S_H(k)} ∇²f_j(x(k))
8:   Solve H(k) p(k) ≈ −g(k)
9:   Find α(k) that passes the Armijo line search
10:  Update x(k+1) = x(k) + α(k) p(k)
11: end for

Page 43:

Sub-sampled Newton’s Method

Global convergence, i.e., starting from any initial point

[Byrd et al., 2012, Roosta and Mahoney, 2016a, Bollapragada et al., 2016]

Page 44:

Sub-sampled Newton’s Method

We can write ∇F(x) = AB, where

A := [∇f₁(x) | ∇f₂(x) | ⋯ | ∇f_n(x)] ∈ R^{d×n} and B := (1/n, 1/n, …, 1/n)ᵀ ∈ R^n.

Lemma ([Roosta and Mahoney, 2016a])

Let ‖∇f_i(x)‖ ≤ G(x) < ∞. For any 0 < ε, δ < 1, if

|S| ≥ G(x)² (1 + √(8 ln(1/δ)))² / ε²,

then

Pr(‖∇F(x) − g(x)‖ ≤ ε) ≥ 1 − δ.

Page 45:

Sub-sampled Newton’s Method

Theorem ([Roosta and Mahoney, 2016a])

If

θ ≤ √((1 − ε_H)/κ),

then, w.h.p.,

F(x(k+1)) − F(x⋆) ≤ ρ (F(x(k)) − F(x⋆)),

where ρ = 1 − (1 − ε_H)/κ², and upon “STOP”, we have ‖∇F(x(k))‖ < (1 + σ) ε_g.

Page 46:

Sub-sampled Newton’s Method

Local convergence, i.e., in a neighborhood of x⋆, and with α(k) = 1:

‖x(k+1) − x⋆‖ ≤ ξ₀ + ξ₁ ‖x(k) − x⋆‖ + ξ₂ ‖x(k) − x⋆‖²    (Error Recursion)

[Roosta and Mahoney, 2016b, Bollapragada et al., 2016]

Page 47:

Sub-sampled Newton’s Method

Theorem ([Roosta and Mahoney, 2016b])

With high probability,

‖x(k+1) − x⋆‖ ≤ ξ₀ + ξ₁ ‖x(k) − x⋆‖ + ξ₂ ‖x(k) − x⋆‖²,

where

ξ₀ = ε_g / ((1 − ε_H)γ),

ξ₁ = ε_H/(1 − ε_H) + (√κ/(1 − ε_H)) θ,

ξ₂ = L_H / (2(1 − ε_H)γ).

Page 48:

Sub-sampled Newton’s Method

Theorem ([Roosta and Mahoney, 2016b])

Consider any 0 < ρ₀ + ρ₁ < ρ < 1. If ‖x(0) − x⋆‖ ≤ c(ρ₀, ρ₁, ρ), ε_g(k) = ρᵏ ε_g, and

θ ≤ ρ₀ √((1 − ε_H)/κ),

we get locally R-linear convergence

‖x(k) − x⋆‖ ≤ c ρᵏ

with probability (1 − δ)^{2k}.

Problem-independent local convergence rate

Page 49:

Putting it all together

Theorem ([Roosta and Mahoney, 2016b])

Under certain assumptions, starting at any x(0), we have

linear convergence,

after a certain number of iterations, “problem-independent” linear convergence,

after a certain number of iterations, the step size α(k) = 1 passes the Armijo rule for all subsequent iterations,

upon “STOP” at iteration k, we have ‖∇F(x(k))‖ < (1 + σ) √(ρᵏ) ε_g.


Page 51:

Newton-type Methods for Non-Smooth Problems

Convex Composite Problem

min_{x ∈ X ⊆ R^d} F(x) + R(x)

F : convex and smooth

R: convex and non-smooth

Page 52:

Non-Smooth Newton-type [Yu et al., 2010]

Let X = R^d, i.e., unconstrained optimization

Iterative Scheme (Non-Quadratic Sub-Problem)

p(k) ≈ argmin_{p ∈ R^d} { sup_{g(k) ∈ ∂(F+R)(x(k))} pᵀ g(k) + ½ pᵀ H(k) p },    H(k) e.g. a quasi-Newton approximation,

x(k+1) = x(k) + α(k) p(k)

Page 53:

Proximal Newton-type [Lee et al., 2014, Byrd et al., 2016]

Iterative Scheme

p(k) ≈ argmin_{p ∈ R^d} pᵀ ∇F(x(k)) + ½ pᵀ H(k) p + R(x(k) + p),    H(k) ≈ ∇²F(x(k)),

x(k+1) = x(k) + α(k) p(k)

Notable examples of proximal Newton methods (a sketch of one sub-problem solve follows below):

glmnet ([Friedman et al., 2010]): ℓ₁-regularized GLMs; sub-problems are solved using coordinate descent

QUIC ([Hsieh et al., 2014]): graphical Lasso problem with factorization tricks; sub-problems are solved using coordinate descent
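A minimal sketch of one proximal-Newton sub-problem for R = λ‖·‖₁, solved inexactly by proximal gradient (ISTA); glmnet and QUIC use coordinate descent here instead, and all names below are illustrative.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_newton_step(x, grad, H, lam, inner_iters=100):
    """Inexactly minimize p^T grad + 0.5 p^T H p + lam * ||x + p||_1 over p
    by proximal-gradient steps of length 1/L, with L = ||H||_2."""
    L = np.linalg.norm(H, 2)
    p = np.zeros_like(x)
    for _ in range(inner_iters):
        q = p - (grad + H @ p) / L                 # gradient step on the quadratic
        p = soft_threshold(x + q, lam / L) - x     # prox of lam * ||x + .||_1
    return p                                       # then x <- x + alpha * p
```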

Page 54:

Proximal Newton-type [Lee et al., 2014, Byrd et al., 2016]

Inexactness of the sub-problem solver

p(k) ≈ argmin_{p ∈ R^d} pᵀ ∇F(x(k)) + ½ pᵀ H(k) p + R(x(k) + p)

p⋆ is the minimizer of the above sub-problem iff p⋆ = Prox_λ(k)(p⋆), where

Prox_λ(k)(p) ≜ argmin_{q ∈ R^d} ⟨∇F(x(k)) + H(k) p, q⟩ + R(x(k) + q) + (λ/2) ‖q − p‖².

Inexactness:

‖p − Prox_λ(k)(p)‖ ≤ θ ‖Prox_λ(k)(0)‖,

plus some sufficient-decrease condition on the sub-problem.

Page 55:

Proximal Newton-type [Lee et al., 2014, Byrd et al., 2016]

Step-size (Line-Search)

x(k+1) = x(k) + α(k) p(k)

(F + R)(x(k) + α p(k)) − (F + R)(x(k)) ≤ β (ℓ_k(x(k) + α p(k)) − ℓ_k(x(k)))

ℓ_k(x) = F(x(k)) + (x − x(k))ᵀ ∇F(x(k)) + R(x)

Page 56:

Proximal Newton-type [Lee et al., 2014, Byrd et al., 2016]

Global convergence: x(k) converges to an optimal solution starting at any x(0)

Local convergence: for x(0) close enough to x⋆

H(k) = ∇²F(x(k)):

Exact solve: quadratic convergence

Approximate solve with decaying forcing term: superlinear convergence

Approximate solve with fixed forcing term: linear convergence

H(k) satisfying the Dennis-Moré condition ⟹ superlinear convergence

Page 57:

Sub-Sampled Proximal Newton-type [Liu et al., 2017]

FSM/ERM

min_{x ∈ X ⊆ R^d} F(x) + R(x) = (1/n) ∑_{i=1}^n f_i(a_iᵀ x) + R(x)

f_i: Strongly Convex, Smooth, and Self-Concordant

R: Convex and Non-Smooth

n ≫ 1 and/or d ≫ 1

Dennis-Moré condition:

|p(k)ᵀ (H(k) − ∇²F(x(k))) p(k)| ≤ η_k p(k)ᵀ ∇²F(x(k)) p(k)

Leverage score sampling to ensure Dennis-Moré

Inexact sub-problem solver

Page 58:

Finite Sum / Empirical Risk Minimization

Self-Concordant

|vᵀ (∇³f(x)[v]) v| ≤ M (vᵀ ∇²f(x) v)^{3/2}, ∀ x, v ∈ R^d

M = 2 is called standard self-concordant.

Theorem ([Zhang and Lin, 2015])

Suppose there exist γ > 0 and η ∈ [0, 1) such that |f'''_i(t)| ≤ γ (f''_i(t))^{1−η}. Then

F(x) = (1/n) ∑_{i=1}^n f_i(a_iᵀ x) + (λ/2) ‖x‖²

is self-concordant with

M = max_i ‖a_i‖^{1+2η} γ / λ^{η+1/2}.

Page 59:

Proximal Newton-type Methods

General treatment: [Fukushima and Mine, 1981, Lee et al., 2014, Byrd et al., 2016, Becker and Fadili, 2012, Schmidt et al., 2012, Schmidt et al., 2009, Shi and Liu, 2015]

Tailored to specific problems: [Friedman et al., 2010, Hsieh et al., 2014, Yuan et al., 2012, Oztoprak et al., 2012]

Self-Concordant: [Li et al., 2017, Kyrillidis et al., 2014, Tran-Dinh et al., 2013]


Page 61:

Semi-Smooth Newton-type Methods

Convex Composite Problem

min_{x ∈ X ⊆ R^d} F(x) + R(x)

F : (non-)convex and (non-)smooth

R: convex and non-smooth

Page 62:

Semi-Smooth Newton-type Methods

Recall:

Proximal Newton-type Methods

p(k) ≈ argmin_{p ∈ R^d} pᵀ ∇F(x(k)) + ½ pᵀ H(k) p + R(x(k) + p)
     ≡ prox_R^{H(k)}(x(k) − [H(k)]⁻¹ ∇F(x(k))) − x(k)

prox_R^{H(k)}(x) ≜ argmin_{y ∈ R^d} R(y) + ½ ‖y − x‖²_{H(k)},

‖y − x‖²_H = ⟨y − x, H(y − x)⟩

Page 63:

Semi-Smooth Newton-type Methods

Recall Newton for smooth root-finding problems:

r(x) = 0 ⟹ J(x(k)) p(k) = −r(x(k)) ⟹ x(k+1) = x(k) + p(k)

Key Idea

For solving non-smooth systems of nonlinear equations, replace the Jacobian in the Newton iteration by an element of Clarke’s generalized Jacobian, which can be nonempty (e.g., for locally Lipschitz continuous maps) even where the true Jacobian does not exist.

Example:

r(x) = prox_R^B(x − B⁻¹ ∇F(x)) − x

Page 64:

Semi-Smooth Newton-type Methods

Clarke’s generalized Jacobian

Let r: R^d → R^p be locally Lipschitz continuous at x (by Rademacher’s theorem, r is then almost everywhere differentiable). Let D_r be the set of points at which r is differentiable. The Bouligand subdifferential of r at x is defined as

∂_B r(x) ≜ { lim_{k→∞} r'(x_k) | x_k ∈ D_r, x_k → x }.

Clarke’s generalized Jacobian is ∂r(x) = conv(∂_B r(x)).

Page 65:

Semi-Smooth Newton-type Methods

Semi-Smooth Map

r: R^d → R^p is semi-smooth at x if

r is a locally Lipschitz continuous function,

r is directionally differentiable at x, and

‖r(x + v) − r(x) − Jv‖ = o(‖v‖) as v → 0, for J ∈ ∂r(x + v).

Example: piecewise continuously differentiable functions.

A vector-valued function is (strongly) semi-smooth if and only if each of its component functions is (strongly) semi-smooth.

Page 66:

Semi-Smooth Newton-type Methods

First-order stationarity is written as a fixed-point-type non-linear system of equations, e.g.,

r_B(x) ≜ x − prox_R^B(x − B⁻¹ ∇F(x)) = 0,  B ≻ 0

(Stochastic) semi-smooth Newton’s method is used to (approximately) solve these systems of nonlinear equations, e.g., as in [Milzarek et al., 2018]:

r_B(x) ≜ x − prox_R^B(x − B⁻¹ g(x))

J(k) p(k) = −r_B(x(k)),  J(k) ∈ J_B(x(k))

J_B(k) ≜ { J ∈ R^{d×d} | J = (I − A) + A B⁻¹ H(k), A ∈ ∂prox_R^B(u(x)) }

u(x) ≜ x − B⁻¹ g(x)
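A sketch with B = I and R = λ‖·‖₁: the prox is the soft-threshold, and A = diag(1{|u_i| > λ}) is one element of its generalized Jacobian, giving J = (I − A) + A H as above. Names and the deterministic gradient/Hessian callbacks are illustrative, not from the cited work.

```python
import numpy as np

def semismooth_newton_l1(grad_F, hess_F, x0, lam, max_iter=50, tol=1e-10):
    """Solve r(x) = x - prox_{lam||.||_1}(x - grad F(x)) = 0 (B = I)
    with semi-smooth Newton steps J p = -r."""
    x = x0.copy()
    for _ in range(max_iter):
        u = x - grad_F(x)
        prox_u = np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)
        r = x - prox_u
        if np.linalg.norm(r) <= tol:
            break
        a = (np.abs(u) > lam).astype(float)             # A = diag(a) in d(prox)(u)
        J = np.diag(1.0 - a) + a[:, None] * hess_F(x)   # J = (I - A) + A H
        x = x + np.linalg.solve(J, -r)
    return x
```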

Page 67:

Semi-Smooth Newton-type Methods

Semi-smoothness of the proximal mapping does not hold in general.

In certain settings, these vector-valued functions are monotone and can be semi-smooth due to the properties of the proximal mappings.

In such cases, their generalized Jacobian matrices are positive semi-definite due to monotonicity.

Lemma

For a monotone and Lipschitz continuous mapping r: R^d → R^d and any x ∈ R^d, each element of ∂_B r(x) is positive semi-definite.

Page 68:

Semi-Smooth Newton-type Methods

General treatment: [Patrinos et al., 2014, Milzarek and Ulbrich, 2014, Xiao et al., 2016, Patrinos and Bemporad, 2013]

Tailored to a specific problem: [Yuan et al., 2018]

Stochastic: [Milzarek et al., 2018]

Page 69:

THANK YOU!

References

Agarwal, N., Bullins, B., and Hazan, E. (2017). Second-order stochastic optimization for machine learning in linear time. The Journal of Machine Learning Research, 18(1):4148–4187.

Becker, S. and Fadili, J. (2012). A quasi-Newton proximal splitting method. In Advances in Neural Information Processing Systems, pages 2618–2626.

Bollapragada, R., Byrd, R. H., and Nocedal, J. (2016). Exact and inexact subsampled Newton methods for optimization. IMA Journal of Numerical Analysis.

Byrd, R. H., Chin, G. M., Neveitt, W., and Nocedal, J. (2011). On the use of stochastic Hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977–995.


Byrd, R. H., Chin, G. M., Nocedal, J., and Wu, Y. (2012). Sample size selection in optimization methods for machine learning. Mathematical Programming, 134(1):127–155.

Byrd, R. H., Nocedal, J., and Oztoprak, F. (2016). An inexact successive quadratic approximation method for L-1 regularized optimization. Mathematical Programming, 157(2):375–396.

Erdogdu, M. A. and Montanari, A. (2015). Convergence rates of sub-sampled Newton methods. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Volume 2, pages 3052–3060. MIT Press.

Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1.


Fukushima, M. and Mine, H. (1981). A generalized proximal point algorithm for certain non-convex minimization problems. International Journal of Systems Science, 12(8):989–1000.

Hsieh, C.-J., Sustik, M. A., Dhillon, I. S., and Ravikumar, P. (2014). QUIC: quadratic approximation for sparse inverse covariance estimation. The Journal of Machine Learning Research, 15(1):2911–2947.

Kyrillidis, A., Mahabadi, R. K., Tran-Dinh, Q., and Cevher, V. (2014). Scalable sparse covariance estimation via self-concordance. In AAAI, pages 1946–1952.

Lee, J. D., Sun, Y., and Saunders, M. A. (2014). Proximal Newton-type methods for minimizing composite functions. SIAM Journal on Optimization, 24(3):1420–1443.


Li, J., Andersen, M. S., and Vandenberghe, L. (2017). Inexact proximal Newton methods for self-concordant functions. Mathematical Methods of Operations Research, 85(1):19–41.

Liu, X., Hsieh, C.-J., Lee, J. D., and Sun, Y. (2017). An inexact subsampled proximal Newton-type method for large-scale machine learning. arXiv preprint arXiv:1708.08552.

Milzarek, A. and Ulbrich, M. (2014). A semismooth Newton method with multidimensional filter globalization for l1-optimization. SIAM Journal on Optimization, 24(1):298–333.

Milzarek, A., Xiao, X., Cen, S., Wen, Z., and Ulbrich, M. (2018). A stochastic semismooth Newton method for nonsmooth nonconvex optimization. arXiv preprint arXiv:1803.03466.


Oztoprak, F., Nocedal, J., Rennie, S., and Olsen, P. A. (2012). Newton-like methods for sparse inverse covariance estimation. In Advances in Neural Information Processing Systems, pages 755–763.

Patrinos, P. and Bemporad, A. (2013). Proximal Newton methods for convex composite optimization. In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, pages 2358–2363. IEEE.

Patrinos, P., Stella, L., and Bemporad, A. (2014). Forward-backward truncated Newton methods for convex composite optimization. arXiv preprint arXiv:1402.6655.

Pilanci, M. and Wainwright, M. J. (2017). Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization, 27(1):205–245.


Roosta, F. and Mahoney, M. W. (2016a). Sub-sampled Newton methods I: Globally convergent algorithms. arXiv preprint arXiv:1601.04737.

Roosta, F. and Mahoney, M. W. (2016b). Sub-sampled Newton methods II: Local convergence rates. arXiv preprint arXiv:1601.04738.

Schmidt, M., Berg, E., Friedlander, M., and Murphy, K. (2009). Optimizing costly functions with simple constraints: A limited-memory projected quasi-Newton algorithm. In Artificial Intelligence and Statistics, pages 456–463.

Schmidt, M., Kim, D., and Sra, S. (2012). Projected Newton-type methods in machine learning. Optimization for Machine Learning, (1).

Shi, Z. and Liu, R. (2015). Large scale optimization with proximal stochastic Newton-type gradient descent. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 691–704. Springer.

Tran-Dinh, Q., Kyrillidis, A., and Cevher, V. (2013). Composite self-concordant minimization. arXiv preprint arXiv:1308.2867.

Xiao, X., Li, Y., Wen, Z., and Zhang, L. (2016). A regularized semi-smooth Newton method with projection steps for composite convex programs. Journal of Scientific Computing, pages 1–26.

Xu, P., Yang, J., Roosta, F., Re, C., and Mahoney, M. W. (2016). Sub-sampled Newton methods with non-uniform sampling. In Advances in Neural Information Processing Systems (NIPS), pages 3000–3008.

Yu, J., Vishwanathan, S., Gunter, S., and Schraudolph, N. N. (2010). A quasi-Newton approach to nonsmooth convex optimization problems in machine learning. Journal of Machine Learning Research, 11(Mar):1145–1200.

Yuan, G.-X., Ho, C.-H., and Lin, C.-J. (2012). An improved GLMNET for l1-regularized logistic regression. Journal of Machine Learning Research, 13(Jun):1999–2030.

Yuan, Y., Sun, D., and Toh, K.-C. (2018). An efficient semismooth Newton based algorithm for convex clustering. arXiv preprint arXiv:1802.07091.

Zhang, Y. and Lin, X. (2015). DiSCO: Distributed optimization for self-concordant empirical loss. In International Conference on Machine Learning, pages 362–370.