
Optimization in Machine Learning

Tong Zhang

Rutgers University


Topics

- Gradient Descent
- Proximal Projection Method
- Coordinate Descent
- Convex Duality and Dual Coordinate Descent
- LBFGS


Supervised Learning

- Training data: $(X_i, Y_i)$, $i = 1, \dots, n$
- Example: linear prediction function $w^T x$
- Training algorithm: SVM

$$w = \arg\min_w \left[ n^{-1} \sum_{i=1}^n (1 - w^T X_i Y_i)_+ + \lambda\, w^T w \right].$$

This is an optimization problem: how to find w?
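To make the objective concrete, here is a minimal NumPy sketch (function and variable names are illustrative, not from the slides) that evaluates the SVM training objective above for a candidate w:

```python
import numpy as np

def svm_objective(w, X, Y, lam):
    """Mean hinge loss plus lambda * w^T w, as in the SVM training objective."""
    margins = 1.0 - (X @ w) * Y           # 1 - w^T X_i Y_i for each example
    hinge = np.maximum(margins, 0.0)       # (.)_+
    return hinge.mean() + lam * (w @ w)

# Illustrative usage with random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = np.sign(rng.normal(size=100))
print(svm_objective(np.zeros(5), X, Y, lam=0.1))
```

A training algorithm then searches for the $w$ minimizing this value; the rest of the lecture is about how to carry out that search efficiently.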


Unconstrained Optimization

Consider a general unconstrained optimization problem:

$$w_* = \arg\min_w f(w).$$

How to find the optimal solution?
- Global solution: $w$ such that $f(w) \le f(w')$ for all $w'$.
- Local solution: $w$ such that $f(w) \le f(w')$ for all $w'$ close to $w$.
- A global solution is a local solution, but not necessarily vice versa.
- A local (and thus also a global) optimum of a differentiable $f$ satisfies $\nabla f(w) = 0$.
- For convex problems, local and global solutions are the same.


Gradient Descent

$$w_k = w_{k-1} - \eta_k \nabla f(w_{k-1}).$$

How fast does this method converge to the optimal solution?

- General result: converges to a local minimum under suitable conditions.
- What is the convergence rate? Answer: it depends on the conditions satisfied by $f(\cdot)$.
- This lecture focuses on convex problems.
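A minimal sketch of the update rule above, assuming a fixed step size and a user-supplied gradient function (names are illustrative):

```python
import numpy as np

def gradient_descent(grad_f, w0, eta=0.1, num_iters=100):
    """Iterate w_k = w_{k-1} - eta * grad f(w_{k-1}) for a fixed number of steps."""
    w = np.asarray(w0, dtype=float)
    for _ in range(num_iters):
        w = w - eta * grad_f(w)
    return w

# Example: minimize f(w) = 0.5 * ||w - 1||^2, whose gradient is (w - 1).
print(gradient_descent(lambda w: w - 1.0, w0=np.zeros(3)))  # approaches [1, 1, 1]
```

In practice the step size $\eta_k$ is kept constant, decayed, or chosen by line search, as discussed below.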


Convexity

A function $f$ is convex if, for all $\alpha \in [0,1]$,

$$f(\alpha x + (1-\alpha)x') \le \alpha f(x) + (1-\alpha) f(x').$$

A subgradient $\nabla f(x_0)$ at $x_0$ satisfies:
$$f(x) \ge f(x_0) + (x - x_0)^T \nabla f(x_0).$$
- Subgradients generalize the gradient to nondifferentiable convex functions.
- A subgradient $v_0$ is not necessarily unique: for $f(x) = |x|$ ($x \in \mathbb{R}$) at $x_0 = 0$, any $v_0 \in [-1, 1]$ satisfies the requirement (and is thus a subgradient).
- In the following we assume a subgradient always exists.


Common Conditions of Objective Function

Convexity:
$$f(x') - f(x) - \nabla f(x)^T (x' - x) \ge 0$$

- Nonsmooth: the first-order derivative may be discontinuous, e.g. hinge loss or L1 regularization.
- Smooth: the first-order derivative is Lipschitz (or the second-order derivative is bounded):

$$f(x') - f(x) - \nabla f(x)^T (x' - x) \le \frac{L}{2}\|x' - x\|_2^2$$

Strongly Convex:

$$f(x') - f(x) - \nabla f(x)^T (x' - x) \ge \frac{\mu}{2}\|x' - x\|_2^2$$

- Plain convexity corresponds to the above inequality with $\mu = 0$.
- A function can be strongly convex and nonsmooth: $f(x) = x^2 + |x|$ (with the gradient replaced by a subgradient).
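As a quick check of the last claim, the strong convexity inequality can be verified directly for $f(x) = x^2 + |x|$ with $\mu = 2$, using a subgradient $g(x) = 2x + v$, $v \in \partial|x|$ (a sketch):
$$f(x') - f(x) - g(x)(x' - x) = \underbrace{x'^2 - x^2 - 2x(x' - x)}_{=(x'-x)^2} + \underbrace{|x'| - |x| - v(x' - x)}_{\ge 0} \ \ge\ \frac{2}{2}(x' - x)^2,$$
where the second term is nonnegative by the subgradient inequality for $|x|$.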


Results

- Smooth and strongly convex: gradient descent with a sufficiently small constant $\eta_k$ has linear (or geometric) convergence:
$$f(w_k) - f(w_*) = O(\gamma^k)$$
for some $\gamma < 1$.
- Smooth but not strongly convex:
$$f(w_k) - f(w_*) = O(1/k),$$
with learning rate $\eta_k = O(1/k)$.
- Nonsmooth:
$$f(\bar{w}_k) - f(w_*) = O(1/\sqrt{k}),$$
for $\eta_k = O(1/\sqrt{k})$ and the averaged iterate $\bar{w}_k = k^{-1}\sum_{j=1}^{k} w_j$.

The learning rate can be tuned with line search.


Reformulation of Gradient Descent

Gradient descent can be derived from:

$$w_k = \arg\min_w Q_k(w),$$
$$Q_k(w) := f(w_{k-1}) + \nabla f(w_{k-1})^T (w - w_{k-1}) + \frac{1}{2\eta_k}\|w - w_{k-1}\|_2^2.$$

Key properties (assume smoothness for simplicity and $1/\eta_k \ge L$, the smoothness parameter of $f$):
- $Q_k(w_{k-1}) = f(w_{k-1})$
- $Q_k(w) \ge f(w)$
- $Q_k(w)$ is easy to optimize
- Consequence: minimizing $Q_k(w)$ reduces the objective value of $f$:
$$f(w_{k-1}) - f(w_k) \ge Q_k(w_{k-1}) - Q_k(w_k).$$

This idea can be generalized to other convex upper bounds of $f(w)$.
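Setting the gradient of $Q_k$ to zero makes the connection to the plain gradient step explicit:
$$\nabla Q_k(w) = \nabla f(w_{k-1}) + \frac{1}{\eta_k}(w - w_{k-1}) = 0 \quad\Longrightarrow\quad w_k = w_{k-1} - \eta_k \nabla f(w_{k-1}).$$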


Proximal Gradient Method

Assume
$$f(w) = \phi(w) + g(w),$$
then we may consider the following upper bound of $f(w)$:
$$Q_k(w) := \phi(w_{k-1}) + \nabla\phi(w_{k-1})^T (w - w_{k-1}) + \frac{1}{2\eta_k}\|w - w_{k-1}\|_2^2 + g(w),$$
with $1/\eta_k$ larger than the smoothness parameter of $\phi$. Then solve for
$$w_k = \arg\min_w Q_k(w).$$
We assume that this minimization problem is easy.
- This generalization of gradient descent is called proximal gradient descent.
- It is useful when $g(w)$ is a simple nonsmooth function such as the L1 regularizer $g(w) = \lambda\|w\|_1$.


Example: L1 regularization

$$f(w) = \sum_{i=1}^n (w^T x_i - y_i)^2 + \lambda\|w\|_1.$$
For example, $\phi(w) = \sum_{i=1}^n (w^T x_i - y_i)^2$ and $g(w) = \lambda\|w\|_1$. Then
$$Q_k(w) := \phi(w_{k-1}) + \nabla\phi(w_{k-1})^T (w - w_{k-1}) + \frac{1}{2\eta_k}\|w - w_{k-1}\|_2^2 + \lambda\|w\|_1.$$
The solution is
$$w_k = \mathrm{trunc}(w_{k-1} - \eta_k \nabla\phi(w_{k-1})),$$
where
$$\mathrm{trunc}([u_1, \dots, u_d]) = [\mathrm{trunc}(u_j)]_{j=1,\dots,d}, \qquad \mathrm{trunc}(u_j) = \mathrm{sign}(u_j)\,(|u_j| - \lambda\eta_k)_+.$$
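A minimal NumPy sketch of this proximal gradient (soft-thresholding) iteration for the L1-regularized least squares problem above; names are illustrative, and the step size $\eta$ should be at most the reciprocal of the smoothness constant of $\phi$ (here $2\lambda_{\max}(X^T X)$):

```python
import numpy as np

def soft_threshold(u, tau):
    """trunc(u) = sign(u) * (|u| - tau)_+, applied componentwise."""
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def proximal_gradient_l1(X, y, lam, eta, num_iters=200):
    """Iterate w_k = trunc(w_{k-1} - eta * grad phi(w_{k-1})), phi(w) = sum_i (w^T x_i - y_i)^2."""
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        grad_phi = 2.0 * X.T @ (X @ w - y)              # gradient of the squared loss
        w = soft_threshold(w - eta * grad_phi, lam * eta)
    return w
```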


Property of Proximal Gradient

- Smoothness requirement depends on $\phi$ rather than $f$ (can tolerate a nonsmooth $g$).
- Convergence is similar to gradient descent:
  - if $\phi$ is smooth and $f$ is strongly convex: convergence is linear
  - if $\phi$ is smooth but not strongly convex: convergence is $1/k$
  - if $\phi$ is not smooth: convergence is $1/\sqrt{k}$


Nesterov’s Accelerated Gradient (one version)

Procedure:
- Pick $\eta_1, \eta_2, \dots \ge 0$
- Pick $w_1 = y_1 = z_0$, then
- Define $\alpha_0 = 0$ and $\alpha_i^{-2} - \alpha_i^{-1} = \alpha_{i-1}^{-2}$ for $i \ge 1$ (may also set $\alpha_i = (1 + i/2)^{-1}$)
- Iterate for $i = 1, 2, \dots, T$:
$$z_i = \arg\min_z \left[ g(z) + \frac{1}{2\eta_i}\|z\|_2^2 - \big(\eta_i^{-1} z_{i-1} - \alpha_i^{-1}\nabla\phi(y_i)\big)^T z \right],$$
$$w_i = (1 - \alpha_{i-1})\, w_{i-1} + \alpha_{i-1}\, z_i,$$
$$y_{i+1} = (1 - \alpha_i)\, w_i + \alpha_i\, z_i.$$
- Advantage: faster convergence of $1/k^2$ for smooth $\phi$
- Disadvantage:
  - for smooth and strongly convex $f$: the algorithm has to be modified to achieve geometric convergence
  - the modification depends on the strong convexity parameter $\mu$
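For reference, here is a sketch of a closely related and widely used accelerated proximal gradient variant (FISTA, Beck and Teboulle), applied to the earlier L1-regularized least squares example; it is not the exact update sequence on this slide, and the names and step-size choice are illustrative:

```python
import numpy as np

def fista_l1(X, y, lam, L, num_iters=200):
    """FISTA for min_w ||Xw - y||^2 + lam * ||w||_1.
    L is (an upper bound on) the smoothness constant of the quadratic part, 2 * lambda_max(X^T X)."""
    def soft_threshold(u, tau):
        return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

    w = np.zeros(X.shape[1])
    v = w.copy()                     # extrapolated point
    t = 1.0
    for _ in range(num_iters):
        grad = 2.0 * X.T @ (X @ v - y)
        w_next = soft_threshold(v - grad / L, lam / L)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        v = w_next + ((t - 1.0) / t_next) * (w_next - w)   # momentum / extrapolation step
        w, t = w_next, t_next
    return w
```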


Beyond First-Order Methods: LBFGS (high-level view)

Recall gradient descent: successive minimization of

$$Q_k(w) = f(w_{k-1}) + \nabla f(w_{k-1})^T (w - w_{k-1}) + \frac{1}{2\eta_k}\|w - w_{k-1}\|_2^2,$$
an upper bound of $f(w)$.

Locally, a more accurate approximation of $f(w)$ uses the Hessian:
$$Q_k(w) = f(w_{k-1}) + \nabla f(w_{k-1})^T (w - w_{k-1}) + \frac{1}{2}(w - w_{k-1})^T H (w - w_{k-1}).$$
- BFGS: approximate $H$ using first-order gradients.
- LBFGS: use limited memory (store a few vectors) to approximate $H$.
- Very effective for optimization of smooth objective functions.
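In practice LBFGS is rarely implemented from scratch; for instance, SciPy exposes it through scipy.optimize.minimize. A sketch on a smooth toy objective (L2-regularized logistic regression with illustrative random data):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = np.sign(rng.normal(size=200))
lam = 0.1

def objective(w):
    z = -y * (X @ w)
    return np.logaddexp(0.0, z).sum() + 0.5 * lam * (w @ w)

def gradient(w):
    z = -y * (X @ w)
    p = 1.0 / (1.0 + np.exp(-z))          # sigmoid(z)
    return X.T @ (-y * p) + lam * w

result = minimize(objective, np.zeros(10), jac=gradient, method="L-BFGS-B")
print(result.x, result.fun)
```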


Coordinate Descent (CD)

Let $f(w) = f([w_1, \dots, w_d])$.
Algorithm:
- for $j = 1, \dots, d$:
  $w_j \leftarrow \arg\min_u f([w_1, \dots, w_{j-1}, u, w_{j+1}, \dots, w_d])$
- repeat until convergence

Idea: optimize one parameter at a time while fixing the others.
Assumptions:
- each one-dimensional problem can be solved easily
- each coordinate update for variable $j$ is inexpensive compared to a full gradient descent step


Linear Regularization Problem

Consider regularized logistic regression:

$$w = \arg\min_w \left[ \sum_{i=1}^n \ln(1 + \exp(-w^T x_i y_i)) + \lambda\|w\|_1 \right],$$
or more generally the following problem with scalar functions $f_i$ and $h_j$:
$$w = \arg\min_w \left[ \sum_{i=1}^n f_i(w^T x_i) + \sum_{j=1}^d h_j(w_j) \right].$$
Iteration complexity:
- maintain $z_i = w^T x_i$ for $i = 1, \dots, n$
- coordinate $j$: updating $w_j$ and $\{z_i\}$ requires scanning one feature column
- one pass over $j = 1, \dots, d$ costs about as much as one gradient descent step

A concrete sketch of such a coordinate descent pass is given below.
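A minimal sketch for the special case $f_i(z) = (z - y_i)^2$ and $h_j(w_j) = \lambda|w_j|$ (L1-regularized least squares), where each one-dimensional problem has a closed-form soft-thresholding solution and $z_i = w^T x_i$ is maintained incrementally; names are illustrative:

```python
import numpy as np

def coordinate_descent_l1(X, y, lam, num_epochs=50):
    """Coordinate descent for sum_i (w^T x_i - y_i)^2 + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    z = np.zeros(n)                          # z_i = w^T x_i, kept up to date
    col_sq = (X ** 2).sum(axis=0)            # ||X[:, j]||^2 for each feature column
    for _ in range(num_epochs):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual, excluding coordinate j itself.
            rho = X[:, j] @ (y - z + w[j] * X[:, j])
            # Closed-form minimizer of the one-dimensional subproblem (soft thresholding).
            w_new = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_sq[j]
            z += (w_new - w[j]) * X[:, j]    # update z_i = w^T x_i incrementally
            w[j] = w_new
    return w
```

One full pass over $j = 1, \dots, d$ touches each feature column once, so its cost is comparable to a single gradient computation, as stated above.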


Convergence

Practice:
- for suitable problems, coordinate descent works much better than gradient descent
- e.g., regularized logistic regression

Theory: incomplete
- current analyses either show no improvement, or show improvements only under very restricted scenarios
- Paul Tseng, Yurii Nesterov, ...

It is still an open question to develop a better theoretical understanding of when coordinate descent performs better.


Convex Duality

Given a convex function f (w), we can define its conjugate or dual

$$f^*(\alpha) = \sup_w \left[ w^T\alpha - f(w) \right].$$
The optimal $w$ in this sup satisfies $\alpha = \nabla f(w)$.
The dual of $f^*$ is $f$:
$$f(w) = \sup_\alpha \left[ w^T\alpha - f^*(\alpha) \right].$$
The optimal $\alpha$ satisfies $\nabla f^*(\alpha) = w$.
We have the following property: for all $w$ and $\alpha$,
$$f(w) + f^*(\alpha) \ge w^T\alpha,$$
with equality only at $w = \nabla f^*(\alpha)$, which is equivalent to $\alpha = \nabla f(w)$.
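A standard worked example: for $f(w) = \frac{1}{2}\|w\|_2^2$,
$$f^*(\alpha) = \sup_w \left[ w^T\alpha - \tfrac{1}{2}\|w\|_2^2 \right] = \tfrac{1}{2}\|\alpha\|_2^2,$$
with the sup attained at $w = \alpha$, consistent with $\alpha = \nabla f(w) = w$. This self-conjugate pair is exactly the $g$, $g^*$ used for the L2 regularizer in the SVM example below.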


Dual of Linear Regularization Method

Primal optimization problem:

$$w_* = \arg\min_w P(w), \qquad P(w) := \sum_{i=1}^n f_i(w^T x_i) + \lambda g(w).$$

Dual optimization problem:

$$\alpha_* = \arg\max_\alpha D(\alpha), \qquad D(\alpha) = \sum_{i=1}^n -f_i^*(-\alpha_i) - \lambda g^*\Big(\lambda^{-1}\sum_i \alpha_i x_i\Big).$$

Strong duality:
- $P(w) \ge D(\alpha)$ for all $w$ and $\alpha$
- $P(w_*) = D(\alpha_*)$, with the relationship
$$w_* = \nabla g^*\Big(\lambda^{-1}\sum_{i=1}^n \alpha_{*,i}\, x_i\Big), \qquad \alpha_{*,i} = -f_i'(w_*^T x_i).$$

Solve the dual instead of the primal problem.


Quick Justification of Strong Duality

$$P(w) - D(\alpha) = \left[\sum_{i=1}^n f_i(w^T x_i) + \lambda g(w)\right] - \left[\sum_{i=1}^n -f_i^*(-\alpha_i) - \lambda g^*\Big(\lambda^{-1}\sum_{i=1}^n \alpha_i x_i\Big)\right]$$
$$= \sum_{i=1}^n \left[f_i(w^T x_i) + f_i^*(-\alpha_i) + \alpha_i w^T x_i\right] + \lambda\left[g(w) + g^*\Big(\lambda^{-1}\sum_{i=1}^n \alpha_i x_i\Big) - w^T\Big(\lambda^{-1}\sum_{i=1}^n \alpha_i x_i\Big)\right] \ge 0,$$
where each bracketed term is nonnegative by the conjugacy inequality $f(w) + f^*(\alpha) \ge w^T\alpha$ from the previous slide.

Equality holds at $\alpha_i = -f_i'(w^T x_i)$ and $w = \nabla g^*\big(\lambda^{-1}\sum_i \alpha_i x_i\big)$.
One can check that this gives the first-order optimality conditions for $w_*$ and $\alpha_*$.


Example: Linear Support Vector Machine

Primal formulation:

$$P(w) = \sum_{i=1}^n (1 - w^T x_i y_i)_+ + 0.5\lambda\|w\|_2^2$$
- $f_i(u) = (1 - u y_i)_+$
- $g(w) = 0.5\|w\|_2^2$

Dual formulation:

$$D(\alpha) = \sum_{i=1}^n \alpha_i y_i - 0.5\,\lambda^{-1}\Big\|\sum_{i=1}^n \alpha_i x_i\Big\|_2^2, \qquad \alpha_i y_i \in [0,1].$$
- $-f_i^*(-\alpha_i) = \alpha_i y_i$ with constraint $\alpha_i y_i \in [0,1]$
- $g^*(w) = 0.5\|w\|_2^2$


Dual Coordinate Descent

Dual optimization problem:

$$\alpha_* = \arg\max_\alpha D(\alpha), \qquad D(\alpha) = \sum_{i=1}^n -f_i^*(-\alpha_i) - \lambda g^*\Big(\lambda^{-1}\sum_i \alpha_i x_i\Big).$$

Apply coordinate descent (ascent) on the dual:
- maintain $w = \lambda^{-1}\sum_i \alpha_i x_i$
- for $i = 1, \dots, n$, update $\alpha_i$ one at a time while fixing the others

Computation: the total cost of one pass over the data is comparable to one gradient descent step.
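A minimal sketch of this scheme for the linear SVM dual above (names are illustrative). Writing $\beta_i = \alpha_i y_i \in [0,1]$, maximizing $D$ over one coordinate with the others fixed is a one-dimensional quadratic problem with a clipped closed-form solution, and $w = \lambda^{-1}\sum_i \alpha_i x_i$ is maintained incrementally:

```python
import numpy as np

def dual_cd_svm(X, y, lam, num_epochs=20):
    """Dual coordinate ascent for the SVM dual, maintaining w = (1/lam) * sum_i beta_i * y_i * x_i."""
    n, d = X.shape
    beta = np.zeros(n)                          # beta_i = alpha_i * y_i in [0, 1]
    w = np.zeros(d)
    sq_norms = np.einsum("ij,ij->i", X, X)      # ||x_i||^2
    for _ in range(num_epochs):
        for i in range(n):
            if sq_norms[i] == 0.0:
                continue
            margin = y[i] * (w @ X[i])
            # Unconstrained maximizer of the dual in coordinate i, then clip to [0, 1].
            new_beta = np.clip(beta[i] + lam * (1.0 - margin) / sq_norms[i], 0.0, 1.0)
            w += ((new_beta - beta[i]) / lam) * y[i] * X[i]   # keep w consistent with beta
            beta[i] = new_beta
    return w, beta
```

Each coordinate update touches a single example $x_i$, so one pass over $i = 1, \dots, n$ costs about as much as one gradient evaluation, matching the remark above.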


Convergence

Previous analysis of the method only shows slow convergence.

Our new analysis (work in progress with Shai Shalev-Shwartz): to achieve accuracy $\epsilon$,
- for smooth loss (e.g. logistic), it requires
$$O\Big(\ln n + \frac{\ln(1/\epsilon)}{n}\Big) \ \text{passes over the data}$$
(gradient descent: $O(\ln(1/\epsilon))$)
- for nonsmooth loss (e.g., SVM), it requires
$$O\Big(\ln n + \frac{1}{n\epsilon}\Big) \ \text{passes over the data}$$
and convergence becomes geometric asymptotically
(gradient descent: $O(1/\epsilon)$)


References

- LBFGS: "On the limited memory BFGS method for large scale optimization", Dong C. Liu and Jorge Nocedal, Mathematical Programming, 1989.
- Stephen Boyd and Lieven Vandenberghe: Convex Optimization book (http://www.stanford.edu/~boyd/cvxbook/)
- Yurii Nesterov: proximal gradient and accelerated proximal gradient
  - Introductory Lectures on Convex Optimization: A Basic Course
  - Gradient methods for minimizing composite objective function
- Arkadi Nemirovski: optimization lecture notes (http://www2.isye.gatech.edu/~nemirovs/)
