
ECS289: Scalable Machine Learning

Cho-Jui Hsieh, UC Davis

Oct 1, 2015


Outline

Convex vs Nonconvex Functions

Coordinate Descent

Gradient Descent

Newton’s method

Stochastic Gradient Descent


Numerical Optimization

Numerical Optimization:

min_X f(X)

Can be applied to computer science, economics, control engineering, operations research, . . .

Machine Learning: find a model that minimizes the prediction error.


Properties of the Function

Smooth function: a function with a continuous derivative.

Example: ridge regression

argmin_w (1/2)‖Xw − y‖² + (λ/2)‖w‖²

Non-smooth functions: Lasso, primal SVM

Lasso: argmin_w (1/2)‖Xw − y‖² + λ‖w‖₁

SVM: argmin_w Σ_{i=1}^n max(0, 1 − y_i wᵀx_i) + (λ/2)‖w‖²
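To make these objectives concrete, here is a minimal NumPy sketch (not from the slides; the function names are illustrative) that evaluates each of the three objectives, assuming y ∈ {−1, +1} for the SVM:

```python
import numpy as np

def ridge_obj(w, X, y, lam):
    # (1/2)||Xw - y||^2 + (lam/2)||w||^2
    return 0.5 * np.sum((X @ w - y) ** 2) + 0.5 * lam * np.sum(w ** 2)

def lasso_obj(w, X, y, lam):
    # (1/2)||Xw - y||^2 + lam*||w||_1
    return 0.5 * np.sum((X @ w - y) ** 2) + lam * np.sum(np.abs(w))

def svm_obj(w, X, y, lam):
    # sum_i max(0, 1 - y_i w^T x_i) + (lam/2)||w||^2  (hinge loss)
    return np.sum(np.maximum(0.0, 1.0 - y * (X @ w))) + 0.5 * lam * np.sum(w ** 2)
```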


Convex Functions

A function is convex if:

∀x_1, x_2, ∀t ∈ [0, 1]:  f(t x_1 + (1 − t) x_2) ≤ t f(x_1) + (1 − t) f(x_2)

No local optimum that is not a global optimum (why?)

Figure from Wikipedia


Convex Functions

If f(x) is twice differentiable, then

f is convex if and only if ∇²f(x) ⪰ 0 for all x

Optimal solution may not be unique: f has a set of optimal solutions S

Gradient: captures the first-order change of f:

f(x + αd) = f(x) + α∇f(x)ᵀd + O(α²)

If f is differentiable, we have the following optimality condition:

x* ∈ S if and only if ∇f(x*) = 0


Strongly Convex Functions

f is strongly convex if there exists an m > 0 such that

f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (m/2)‖y − x‖₂²

A strongly convex function has a unique global optimum x∗ (why?)

If f is twice differentiable, then

f is strongly convex if and only if ∇²f(x) ⪰ mI ≻ 0 for all x

Gradient descent and coordinate descent will converge linearly (will see later)


Nonconvex Functions

If f is nonconvex, most algorithms can only converge to stationary points

x is a stationary point if and only if ∇f(x) = 0

Three types of stationary points:

(1) Global optimum  (2) Local optimum  (3) Saddle point

Examples: matrix completion, neural networks, . . .

Example: f(x, y) = (1/2)(xy − a)²


Coordinate Descent


Coordinate Descent

Update one variable at a time

Coordinate descent: repeatedly perform the following loop

Step 1: pick an index i

Step 2: compute a step size δ* by (approximately) minimizing

argmin_δ f(x + δe_i)

Step 3: x_i ← x_i + δ*
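As a concrete illustration (not from the slides), here is a minimal Python sketch of cyclic coordinate descent applied to the ridge-regression objective from the earlier slide; each coordinate subproblem is quadratic, so the step δ* can be computed exactly. The function name and the residual-caching detail are illustrative choices.

```python
import numpy as np

def coordinate_descent_ridge(X, y, lam, n_outer=100):
    """Cyclic coordinate descent for (1/2)||Xw - y||^2 + (lam/2)||w||^2."""
    n, d = X.shape
    w = np.zeros(d)
    r = X @ w - y                   # residual Xw - y, kept up to date
    col_sq = (X ** 2).sum(axis=0)   # precomputed ||X[:, j]||^2
    for _ in range(n_outer):
        for j in range(d):          # Step 1: pick index j (cyclic order)
            # Step 2: exact minimizer of f(w + delta * e_j) over delta
            grad_j = X[:, j] @ r + lam * w[j]
            delta = -grad_j / (col_sq[j] + lam)
            # Step 3: update the coordinate and the cached residual
            w[j] += delta
            r += delta * X[:, j]
    return w
```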


Coordinate Descent (update sequence)

Three types of updating order:

Cyclic: update sequence

x_1, x_2, . . . , x_n (1st outer iteration), x_1, x_2, . . . , x_n (2nd outer iteration), . . .

A more general setting: update each variable at least once within every T steps

Randomly permute the sequence for each outer iteration (faster convergence in practice)

Random: each time pick a random coordinate to update

Typical way: sample from the uniform distribution

Sample from the uniform distribution vs. sample from a biased distribution:

P. Zhao and T. Zhang, Stochastic Optimization with Importance Sampling for Regularized Loss Minimization. In ICML 2015

D. Csiba, Z. Qu and P. Richtarik, Stochastic Dual Coordinate Ascent with Adaptive Probabilities. In ICML 2015


Greedy Coordinate Descent

Greedy: choose the most “important” coordinate to update

How to measure the importance?

By the first derivative: |∇_i f(x)|

By the first and second derivatives: |∇_i f(x) / ∇²_ii f(x)|

By maximum reduction of the objective function:

i* = argmax_{i=1,...,n} ( f(x) − min_δ f(x + δe_i) )

Need to consider the time complexity of variable selection

Useful for kernel SVM (see lecture 6)
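As a hypothetical illustration of the first selection rule, the sketch below performs one greedy (Gauss–Southwell) coordinate step for ridge regression, picking the coordinate with the largest gradient magnitude; note that the selection itself requires the full gradient, which is exactly the per-step cost the slide warns about.

```python
import numpy as np

def greedy_coordinate_step(X, y, w, lam):
    """One greedy coordinate step for (1/2)||Xw - y||^2 + (lam/2)||w||^2."""
    r = X @ w - y
    grad = X.T @ r + lam * w                       # full gradient: O(nnz(X))
    i = int(np.argmax(np.abs(grad)))               # most "important" coordinate
    delta = -grad[i] / (X[:, i] @ X[:, i] + lam)   # exact 1-D minimizer
    w[i] += delta
    return w, i
```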


Extension: block coordinate descent

Variables are divided into blocks X_1, . . . , X_p, where each X_i is a subset of variables and

X_1 ∪ X_2 ∪ · · · ∪ X_p = {1, . . . , n},   X_i ∩ X_j = ∅  ∀i ≠ j

Each time, update one block X_i by (approximately) solving the subproblem within that block

Example: alternating minimization for matrix completion (2 blocks). (See lecture 7)


Coordinate Descent (convergence)

Converges to an optimum if f(·) is convex and smooth

Has a linear convergence rate if f(·) is strongly convex

Linear convergence: the error f(xᵗ) − f(x*) decays as

β, β², β³, . . .

for some β < 1.

Local linear convergence: an algorithm converges linearly once ‖x − x*‖ ≤ K, for some K > 0


Coordinate Descent (nonconvex)

Block coordinate descent with 2 blocks:

converges to stationary points

With > 2 blocks:

converges to stationary points if each subproblem has a unique minimizer.


Coordinate Descent: other names

Alternating minimization (matrix completion)

Iterative scaling (for log-linear models)

Decomposition method (for kernel SVM)

Gauss–Seidel (for linear systems when the matrix is positive definite)

. . .


Gradient Descent


Gradient Descent

Gradient descent algorithm: repeatedly conduct the following update:

xᵗ⁺¹ ← xᵗ − α∇f(xᵗ)

where α > 0 is the step size

It is a fixed-point iteration method:

x − α∇f(x) = x if and only if x is an optimal solution

Step size too large ⇒ diverge; too small ⇒ slow convergence
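A minimal Python sketch of this update rule (not from the slides; the function name and the toy example are illustrative). The caller supplies the gradient and a fixed step size α:

```python
import numpy as np

def gradient_descent(grad, x0, alpha, n_iters=1000):
    """Repeat x <- x - alpha * grad(x) with a fixed step size alpha."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - alpha * grad(x)
    return x

# Toy usage: minimize f(x) = (1/2)||x||^2, whose gradient is x itself.
x_min = gradient_descent(grad=lambda x: x, x0=np.ones(5), alpha=0.5)
```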


Gradient Descent: successive approximation

At each iteration, form an approximation of f (·):

f(xᵗ + d) ≈ f_{xᵗ}(d) := f(xᵗ) + ∇f(xᵗ)ᵀd + (1/2) dᵀ((1/α)I)d
                       = f(xᵗ) + ∇f(xᵗ)ᵀd + (1/(2α)) dᵀd

Update the solution by xᵗ⁺¹ ← xᵗ + argmin_d f_{xᵗ}(d)

d* = −α∇f(xᵗ) is the minimizer of f_{xᵗ}(d) (set ∇_d f_{xᵗ}(d) = ∇f(xᵗ) + d/α = 0)

d* may not decrease the original objective function f


Gradient Descent: successive approximation

However, the function value will decrease if

Condition 1: f_x(d) ≥ f(x + d) for all d

Condition 2: f_x(0) = f(x)

Why?

f(xᵗ + d*) ≤ f_{xᵗ}(d*) ≤ f_{xᵗ}(0) = f(xᵗ)

Condition 2 is satisfied by construction of f_{xᵗ}

Condition 1 is satisfied if (1/α)I ⪰ ∇²f(x) for all x (why?)


Gradient Descent: step size

A function has an L-Lipschitz continuous gradient if

‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖  ∀x, y

If f is twice differentiable, this implies

∇²f(x) ⪯ LI  ∀x

In this case, Condition 1 is satisfied if α ≤ 1/L

Theorem: gradient descent converges if α ≤ 1/L

Theorem: gradient descent converges linearly with α ≤ 1/L if f is strongly convex


Gradient Descent

In practice, we do not know L . . .

Step size α too large: the algorithm diverges

Step size α too small: the algorithm converges very slowly


Gradient Descent: line search

d* is a “descent direction” if and only if (d*)ᵀ∇f(x) < 0

Armijo-rule backtracking line search:

Try α = 1, 1/2, 1/4, . . . until it satisfies

f(x + αd*) ≤ f(x) + γα(d*)ᵀ∇f(x)

where 0 < γ < 1

Figure from http://ool.sourceforge.net/ool-ref.html
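A minimal Python sketch of this backtracking rule (not from the slides; the function name, the default γ, and the halving factor are illustrative):

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, d, gamma=0.1, shrink=0.5, alpha0=1.0):
    """Shrink alpha until f(x + alpha*d) <= f(x) + gamma*alpha*d^T grad_f(x)."""
    fx = f(x)
    slope = float(d @ grad_f(x))   # negative when d is a descent direction
    alpha = alpha0
    while f(x + alpha * d) > fx + gamma * alpha * slope:
        alpha *= shrink
    return alpha

# Usage on f(x) = (1/2)||x||^2 with the steepest-descent direction d = -grad f(x).
x = np.array([3.0, -4.0])
alpha = backtracking_line_search(lambda z: 0.5 * z @ z, lambda z: z, x, -x)
```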


Gradient Descent: line search

Gradient descent with line search:

Converges to an optimal solution if f is smooth

Converges linearly if f is strongly convex

However, each iteration requires evaluating f several times

Several other step-size selection approaches

(an ongoing research topic, especially for stochastic gradient descent)


Gradient Descent: applying to ridge regression

Input: X ∈ ℝ^{N×d}, y ∈ ℝ^N, initial w^(0)

Output: solution w* := argmin_w (1/2)‖Xw − y‖² + (λ/2)‖w‖²

1: t = 0
2: while not converged do
3:   Compute the gradient g = Xᵀ(Xw − y) + λw
4:   Choose step size α_t
5:   Update w ← w − α_t g
6:   t ← t + 1
7: end while

Time complexity: O(nnz(X )) per iteration
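A direct Python transcription of this pseudocode, as a sketch (the function name is illustrative, and a constant step size chosen by the caller stands in for the α_t-selection step):

```python
import numpy as np

def ridge_gradient_descent(X, y, lam, alpha, n_iters=500):
    """Gradient descent for (1/2)||Xw - y||^2 + (lam/2)||w||^2."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        g = X.T @ (X @ w - y) + lam * w   # line 3: gradient (O(nnz(X)) with a sparse X)
        w = w - alpha * g                 # lines 4-5 with a fixed step size alpha
    return w
```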


Proximal Gradient Descent

How can we apply gradient descent to solve the Lasso problem?

argmin_w (1/2)‖Xw − y‖² + λ‖w‖₁,  where the ‖w‖₁ term is non-differentiable

General composite function minimization:

argmin_x f(x) := g(x) + h(x)

where g is smooth and convex, and h is convex but may be non-differentiable

Usually assume h is simple (for computational efficiency)


Proximal Gradient Descent: successive approximation

At each iteration, form an approximation of f (·):

f(xᵗ + d) ≈ f_{xᵗ}(d) := g(xᵗ) + ∇g(xᵗ)ᵀd + (1/2) dᵀ((1/α)I)d + h(xᵗ + d)
                       = g(xᵗ) + ∇g(xᵗ)ᵀd + (1/(2α)) dᵀd + h(xᵗ + d)

Update the solution by xᵗ⁺¹ ← xᵗ + argmin_d f_{xᵗ}(d)

This is called “proximal” gradient descent

Sometimes d* = argmin_d f_{xᵗ}(d) has a closed-form solution


Proximal Gradient Descent: ℓ1-regularization (*)

The subproblem:

xᵗ⁺¹ = xᵗ + argmin_d ∇g(xᵗ)ᵀd + (1/(2α)) dᵀd + λ‖xᵗ + d‖₁

     = argmin_u (1/2)‖u − (xᵗ − α∇g(xᵗ))‖² + λα‖u‖₁

     = S(xᵗ − α∇g(xᵗ), αλ),

where S is the soft-thresholding operator defined by

S(a, z) = a − z   if a > z
        = a + z   if a < −z
        = 0       if a ∈ [−z, z]
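In NumPy, this operator has a one-line vectorized form (a sketch; the function name is illustrative):

```python
import numpy as np

def soft_threshold(a, z):
    """Elementwise soft-thresholding S(a, z): shrink a toward 0 by z, clipping at 0."""
    return np.sign(a) * np.maximum(np.abs(a) - z, 0.0)
```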


Proximal Gradient: soft-thresholding

Figure from http://jocelynchi.com/soft-thresholding-operator-and-the-lasso-solution/


Proximal Gradient Descent for Lasso

Input: X ∈ ℝ^{N×d}, y ∈ ℝ^N, initial w^(0)

Output: solution w* := argmin_w (1/2)‖Xw − y‖² + λ‖w‖₁

1: t = 0
2: while not converged do
3:   Compute the gradient g = Xᵀ(Xw − y)
4:   Choose step size α_t
5:   Update w ← S(w − α_t g, α_t λ)
6:   t ← t + 1
7: end while

Time complexity: O(nnz(X )) per iteration
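A Python sketch of this loop (an ISTA-style iteration; the function names and the fixed step size are illustrative):

```python
import numpy as np

def soft_threshold(a, z):
    return np.sign(a) * np.maximum(np.abs(a) - z, 0.0)

def lasso_proximal_gradient(X, y, lam, alpha, n_iters=500):
    """Proximal gradient descent for (1/2)||Xw - y||^2 + lam*||w||_1."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        g = X.T @ (X @ w - y)                           # gradient of the smooth part only
        w = soft_threshold(w - alpha * g, alpha * lam)  # line 5: proximal (shrinkage) step
    return w
```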


Newton’s Method


Newton’s Method

Iteratively conduct the following updates:

x ← x − α∇²f(x)⁻¹∇f(x)

where α is the step size

If α = 1: converges quadratically when xᵗ is close enough to x*:

‖xᵗ⁺¹ − x*‖ ≤ K‖xᵗ − x*‖²

for some constant K. This means the error f(xᵗ) − f(x*) decays quadratically:

β, β², β⁴, β⁸, β¹⁶, . . .

Only a few iterations are needed to converge inside this “quadratic convergence region”


Newton’s Method

However, Newton’s update rule is more expensive than gradient descent / coordinate descent


Newton’s Method

Need to compute ∇²f(x)⁻¹∇f(x)

Closed-form solution: O(d³) for solving a d-dimensional linear system

Usually solved by another iterative solver:

gradient descent

coordinate descent

conjugate gradient method

. . .

Useful for the cases where the quadratic subproblem can be solved more efficiently than the original problem

Examples: primal L2-SVM/logistic regression, ℓ1-regularized logistic regression, . . .
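As a hypothetical sketch of one such example, the code below runs Newton's method on ℓ2-regularized logistic regression (labels y_i ∈ {−1, +1}), taking full steps (α = 1) and solving the d × d Newton system directly; for large d, that direct solve is where an iterative inner solver such as conjugate gradient or coordinate descent would be substituted.

```python
import numpy as np

def newton_logistic(X, y, lam, n_iters=20):
    """Newton's method for sum_i log(1 + exp(-y_i w^T x_i)) + (lam/2)||w||^2."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-y * (X @ w)))       # sigma(y_i w^T x_i)
        grad = -X.T @ (y * (1.0 - p)) + lam * w
        D = p * (1.0 - p)                            # per-sample Hessian weights
        H = X.T @ (X * D[:, None]) + lam * np.eye(d)
        w = w - np.linalg.solve(H, grad)             # full Newton step (alpha = 1)
    return w
```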


Newton’s Method (*)

At each iteration, form an approximation of f (·):

f(xᵗ + d) ≈ f_{xᵗ}(d) := f(xᵗ) + ∇f(xᵗ)ᵀd + (1/(2α)) dᵀ∇²f(xᵗ)d

Update the solution by xᵗ⁺¹ ← xᵗ + argmin_d f_{xᵗ}(d)

When x is far away from x*, a line search is needed to guarantee convergence

Assume LI ⪰ ∇²f(x) ⪰ mI for all x; then α ≤ m/L guarantees that the objective function value decreases, because

(L/m)∇²f(x) ⪰ ∇²f(y)  ∀x, y

In practice, we often just use line search.


Proximal Newton (*)

What if f(x) = g(x) + h(x) and h(x) is non-smooth (e.g., h(x) = ‖x‖₁)?

At each iteration, form an approximation of f (·):

f(xᵗ + d) ≈ f_{xᵗ}(d) := g(xᵗ) + ∇g(xᵗ)ᵀd + (α/2) dᵀ∇²g(xᵗ)d + h(xᵗ + d)

Update the solution by xᵗ⁺¹ ← xᵗ + argmin_d f_{xᵗ}(d)

Need another iterative solver for solving the subproblem


Stochastic Gradient


Stochastic Gradient: Motivation

Widely used for machine learning problems (with a large number of samples)

Given training samples x_1, . . . , x_n, we usually want to solve the following empirical risk minimization (ERM) problem:

argmin_w Σ_{i=1}^n ℓ_i(w),

where ℓ_i(·) is the loss function on sample x_i

Minimize the sum of the individual losses over the samples


Stochastic Gradient

Assume the objective function can be written as

f(x) = (1/n) Σ_{i=1}^n f_i(x)

Stochastic gradient method:

Iteratively conduct the following updates:
1. Choose an index i uniformly at random
2. xᵗ⁺¹ ← xᵗ − η_t ∇f_i(xᵗ)

η_t > 0 is the step size

Why does SG work?

E_i[∇f_i(x)] = (1/n) Σ_{i=1}^n ∇f_i(x) = ∇f(x)

Is it a fixed-point method? No: for η > 0, x* − η∇f_i(x*) ≠ x* in general

Is it a descent method? No, because f(xᵗ⁺¹) < f(xᵗ) is not guaranteed
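A generic Python sketch of this method (not from the slides): grad_i(x, i) returns ∇f_i(x), and step(t) is a caller-supplied step-size schedule η_t.

```python
import numpy as np

def sgd(grad_i, n, x0, step, n_iters=10000, seed=0):
    """Stochastic gradient for f(x) = (1/n) * sum_i f_i(x)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for t in range(1, n_iters + 1):
        i = rng.integers(n)               # 1. pick an index uniformly at random
        x = x - step(t) * grad_i(x, i)    # 2. move along the sampled gradient
    return x
```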


Stochastic Gradient

The step size η_t has to decay to 0

(e.g., η_t = C t^{−a} for some constants a, C)

Many variants proposed recently (SVRG, SAGA, . . . )

Widely used in online setting


Stochastic Gradient: applying to ridge regression

Objective function:

argmin_w (1/n) Σ_{i=1}^n (wᵀx_i − y_i)² + λ‖w‖²

How to write this as argmin_w (1/n) Σ_{i=1}^n f_i(w)?

How to decompose into n components?


Stochastic Gradient: applying to ridge regression

Objective function:

argmin_w (1/n) Σ_{i=1}^n (wᵀx_i − y_i)² + λ‖w‖²

How to write this as argmin_w (1/n) Σ_{i=1}^n f_i(w)?

First approach: f_i(w) = (wᵀx_i − y_i)² + λ‖w‖²

Update rule:

wᵗ⁺¹ ← wᵗ − 2η_t(wᵀx_i − y_i)x_i − 2η_t λ w
     = (1 − 2η_t λ)w − 2η_t(wᵀx_i − y_i)x_i

Need O(d) complexity per iteration even if data is sparse

Solution: store w = sv where s is a scalar
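A hypothetical sketch of this trick for the first approach, writing w = s·v so that the shrinkage by (1 − 2η_t λ) costs O(1) and only the support of x_i needs a real update (dense NumPy arrays and a constant step size are used here for brevity):

```python
import numpy as np

def sgd_ridge_scaled(X, y, lam, eta, n_epochs=5, seed=0):
    """SGD with f_i(w) = (w^T x_i - y_i)^2 + lam*||w||^2, storing w = s * v."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    s, v = 1.0, np.zeros(d)              # the iterate is w = s * v
    for _ in range(n_epochs * n):
        i = rng.integers(n)
        xi = X[i]
        r = s * (v @ xi) - y[i]          # residual w^T x_i - y_i
        s *= 1.0 - 2.0 * eta * lam       # O(1) shrinkage (assumes 2*eta*lam < 1)
        v -= (2.0 * eta * r / s) * xi    # with a sparse x_i, only its nonzeros change
    return s * v
```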


Stochastic Gradient: applying to ridge regression

Objective function:

argmin_w (1/n) Σ_{i=1}^n (wᵀx_i − y_i)² + λ‖w‖²

Second approach:

define Ω_i = {j | X_ij ≠ 0} for i = 1, . . . , n

define n_j = |{i | X_ij ≠ 0}| for j = 1, . . . , d

define f_i(w) = (wᵀx_i − y_i)² + Σ_{j∈Ω_i} (λn/n_j) w_j²

Update rule when selecting index i :

w_jᵗ⁺¹ ← w_jᵗ − 2η_t(x_iᵀwᵗ − y_i)X_ij − (2η_t λ n / n_j) w_jᵗ,   ∀j ∈ Ω_i

Solution: update can be done in O(|Ωi |) operations
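A sketch of this second approach using a SciPy CSR matrix, so that each iteration really only touches Ω_i (the function name, constant step size, and epoch count are illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix

def sgd_ridge_sparse(X, y, lam, eta, n_epochs=5, seed=0):
    """SGD with f_i(w) = (w^T x_i - y_i)^2 + sum_{j in Omega_i} (lam*n/n_j) w_j^2."""
    X = csr_matrix(X)
    n, d = X.shape
    nj = np.asarray(X.getnnz(axis=0), dtype=float)    # n_j = #{i : X_ij != 0}
    w = np.zeros(d)
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs * n):
        i = rng.integers(n)
        lo, hi = X.indptr[i], X.indptr[i + 1]
        cols, vals = X.indices[lo:hi], X.data[lo:hi]  # Omega_i and the nonzero X_ij
        r = w[cols] @ vals - y[i]                     # residual x_i^T w - y_i
        w[cols] -= 2 * eta * r * vals + (2 * eta * lam * n / nj[cols]) * w[cols]
    return w
```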


Coming up

Next class: Parallel Optimization Methods

Questions?