Stochastic Gradient Descent Methods

Stochastic Optimization STUDY GROUP

Xuetong Wu & Viktoria Schram

Department of EEE, University of Melbourne

October 22, 2020


Overview

1 Introduction

2 Gradient Descent

3 Stochastic Gradient Descent

4 Stochastic Subgradient Methods
  Non-Smooth Optimization Problems
  Optimization Considering Additional Information
  Optimization in Case of Non-I.i.d. Data

5 Conclusion

Introduction


Parameter Estimation Problems

Communications

Tracking

Control theory

System identification

Machine learning

...

Kushner, 1997, Stochastic Approximation
Han-Fu Chen, 2003, Stochastic Approximation and Its Applications


Classification Problems

Consider a typical image classification problem:

Figure: training examples (dogs, cats) → model → predicted labels

We wish to learn a good model h that minimises the prediction error:

min_h (1/n) ∑_{i=1}^n 1{h(X_i) ≠ Y_i}


Regression Problems

Consider a simple regression problem: we wish to learn a good model that minimises the mean squared error.

Y = aX + b

Figure: training examples → fitted model → predicted labels

Mathematically,

min_{a,b} (1/n) ∑_{i=1}^n (Y_i − aX_i − b)²
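To make the objective concrete, here is a minimal NumPy sketch (not from the slides; the synthetic data and the helper name mse are assumptions for illustration) that evaluates the empirical mean squared error for a candidate (a, b):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = rng.uniform(-1.0, 1.0, size=n)
Y = 2.0 * X + 0.5 + 0.1 * rng.standard_normal(n)   # data generated from Y = aX + b + noise

def mse(a, b, X, Y):
    """Empirical objective (1/n) * sum_i (Y_i - a*X_i - b)^2."""
    residual = Y - a * X - b
    return np.mean(residual ** 2)

print(mse(2.0, 0.5, X, Y))   # near the noise variance (about 0.01)
print(mse(0.0, 0.0, X, Y))   # much larger for a poor model
```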


Optimization in Learning Problems

Many machine learning problems can be formulated as the following problem:

min_{w∈W} F(w) = (1/n) ∑_{i=1}^n f(w, Z_i)

Z_i: training sample / data pair (X_i, Y_i)

w: model parameters (e.g., a, b in the least-squares problem)

f: loss function

Gradient Descent

min_{w∈W} F(w) = (1/n) ∑_{i=1}^n f(w, Z_i)

If f is convex and differentiable w.r.t. w, then by the first-order Taylor approximation, for η > 0,

F(w + η Δw) ≈ F(w) + η Δw^T ∇_w F(w)

Best Δw that minimises the R.H.S.:

Δw = −∇_w F(w)

We choose an initial point w_0 and a step size η_t at each time t.

(Batch) Gradient descent:

w_{t+1} = w_t − η_t ∇_w F(w_t) = w_t − (η_t/n) ∑_{i=1}^n ∇_w f(w_t, Z_i)

Stops at a certain point such that

F(w_t) − F(w^*) ≤ ε
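A minimal sketch of batch gradient descent on a small least-squares problem (the synthetic data, fixed step size, and stopping tolerance are assumptions, not the presenters' code):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([rng.uniform(-1, 1, n), np.ones(n)])   # features [x, 1], so w = (a, b)
Y = X @ np.array([2.0, 0.5]) + 0.1 * rng.standard_normal(n)

def grad_F(w):
    """Full gradient of F(w) = (1/n) * sum_i (Y_i - X_i^T w)^2."""
    return (2.0 / n) * X.T @ (X @ w - Y)

w = np.zeros(2)                 # initial point w_0
eta = 0.1                       # fixed step size
for t in range(1000):
    w = w - eta * grad_F(w)     # w_{t+1} = w_t - eta * grad F(w_t), using all n samples per step
    if np.linalg.norm(grad_F(w)) < 1e-10:   # simple stopping rule
        break

print(w)    # close to the least-squares solution, roughly (2.0, 0.5)
```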


Figure: Visualization of Gradient Descent


Convergence Rate For GD

Assume f is convex and differentiable w.r.t. w, and further assume that the gradient ∇_w F(w) = (1/n) ∑_{i=1}^n ∇_w f(w, Z_i) is L-Lipschitz continuous (∇²F ⪯ LI). Then,

Theorem

Gradient descent with fixed step size η ≤ 1/L satisfies

F(w_t) − F(w^*) ≤ ‖w_0 − w^*‖² / (2ηt)

Convergence rate ∼ O(1/t), iteration complexity ∼ O(1/ε).

R. Tibshirani, Convex Optimization 10-725


Convergence Rate For GD with Strong Convexity

Furthermore, if F(w) is µ-strongly convex (∇²F ⪰ µI).

Theorem

Gradient descent with fixed step size η ≤ 2/(µ + L) or with backtracking line search satisfies

F(w_t) − F(w^*) ≤ c^t (L/2) ‖w_0 − w^*‖²

where 0 < c < 1.

Convergence rate ∼ O(c^t), iterations needed for error ε ∼ O(log(1/ε)).

R. Tibshirani, Convex Optimization 10-725


Problems

Two main drawbacks of gradient descent:

If n is relatively large, computing the full gradient is memory- and time-consuming.

If the loss function is nonconvex, the iterates can get stuck at a stationary point (e.g., a saddle point).

Stochastic Gradient Descent

A practical alternative is to simulate a data stream by picking Z_t uniformly at random from the training examples at each time t.

Namely, stochastic gradient descent:

w_{t+1} = w_t − η_t ∇_w f(w_t, Z_t)

Why does this work? By picking uniformly at random,

E_{Z_t}[∇_w f(w_t, Z_t)] = (1/n) ∑_{i=1}^n ∇_w f(w_t, Z_i)

Unbiased estimate but high variance; usually works well in large-scale problems.
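A minimal sketch of this update on a least-squares problem (the synthetic data and constant step size are assumptions); each step touches a single randomly drawn sample:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([rng.uniform(-1, 1, n), np.ones(n)])
Y = X @ np.array([2.0, 0.5]) + 0.1 * rng.standard_normal(n)

def grad_f(w, i):
    """Gradient of the single-sample loss f(w, Z_i) = (Y_i - X_i^T w)^2."""
    return 2.0 * (X[i] @ w - Y[i]) * X[i]

w = np.zeros(2)
eta = 0.01                           # constant step size
for t in range(20000):
    i = rng.integers(n)              # pick Z_t uniformly at random (unbiased gradient estimate)
    w = w - eta * grad_f(w, i)       # w_{t+1} = w_t - eta_t * grad f(w_t, Z_t)

print(w)   # bounces around a small neighbourhood of (2.0, 0.5)
```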


Stochastic GD vs. GD

Figure: Stochastic GD vs. GD


Remarks on SGD

Computational cost for n samples and p iterations.

GD ∼ O(np)

SGD ∼ O(p)

SGD does not always produce descent directions and the gradient estimate is very noisy!

With a constant step size, the SGD iterates bounce around the optimal value.

Convergence properties?


Convergence Rate Analysis

minimize_w F(w) := (1/n) ∑_{i=1}^n f(w, Z_i)

We wish to achieve ε-optimality,

E[F(w_t)] − F(w^*) ≤ ε

after t iterations.


Assumptions

F(w) is µ-strongly convex and the gradient is L-Lipschitz continuous, with µ/L ≤ 1.

∇f(w_t, Z_t) is an unbiased estimate of ∇F(w_t).

For all w, the variance of the gradient is bounded:

E_Z[‖∇f(w, Z)‖₂²] − ‖E_Z[∇f(w, Z)]‖₂² ≤ σ²


Constant Step Size

Theorem (Convergence with Fixed Stepsizes)

Under the assumptions, if η_t = η ≤ 1/L, then SGD achieves

E[F(w_t)] − F(w^*) ≤ ηLσ²/(2µ) + (1 − ηµ)^t (F(w_0) − F(w^*))

Linear convergence at the beginning.

When t → ∞,

E[F(w_t) − F(w^*)] ≤ ηLσ²/(2µ)

i.e., SGD converges to some neighborhood of w^*; variation in the gradient computation prevents further progress.

Theorem 4.6 in Bottou, 2018, Optimization Methods for Large-Scale Machine Learning


Diminishing Step Size

Theorem (Convergence with Diminishing Stepsizes)

Under the assumptions, if η_t = θ/(t+1) for some θ > 1/µ, then SGD achieves

E[F(w_t) − F(w^*)] ≤ 2ν/(t + 1)

where

ν := max{ Lσ²/(4(µ − 1)), F(w_0) − F(w^*) }

Convergence rate is O(1/t), iterations needed O(1/ε), with diminishing stepsize η_t ∝ 1/t.

Theorem 4.7 in Bottou, 2018, Optimization Methods for Large-Scale Machine Learning


Convergence Rate and Time Comparison

Risk Minimization Under Strong Convexity and L-smooth:

        iteration complexity    per-iteration cost    total comput. cost
GD      log(1/ε)                n                     n log(1/ε)
SGD     1/ε                     1                     1/ε


Advantages

Compared to gradient descent, SGD has the following advantages.

Less computational cost per iteration.

For larger datasets (large n and moderate ε), it can converge faster.

For non-convex problems, the gradient noise can sometimes help SGD escape saddle points.

Rong Ge et al., COLT 2015, Escaping From Saddle Points - Online Stochastic Gradient for Tensor Decomposition


Variance Reduction and Acceleration

To reduce the variance of the gradient estimate, we can use mini-batch SGD with k ≪ n:

w_{t+1} = w_t − (η_t/k) ∑_{i∈B_t} ∇f(w_t, Z_i),  where B_t is a random mini-batch of size k
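A minimal sketch of the mini-batch update (the synthetic data and batch size k = 32 are assumptions); averaging over the batch reduces the variance of each step:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 10000, 32
X = np.column_stack([rng.uniform(-1, 1, n), np.ones(n)])
Y = X @ np.array([2.0, 0.5]) + 0.1 * rng.standard_normal(n)

w = np.zeros(2)
eta = 0.05
for t in range(3000):
    batch = rng.choice(n, size=k, replace=False)     # random mini-batch B_t
    Xb, Yb = X[batch], Y[batch]
    grad = (2.0 / k) * Xb.T @ (Xb @ w - Yb)          # gradient averaged over the batch
    w = w - eta * grad                               # lower-variance step than single-sample SGD

print(w)   # close to (2.0, 0.5)
```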

Figure: SGD for logistic regression problems with n = 10000

R. Tibshirani, Convex Optimization 10-725


Variance Reduction and Acceleration

Can we have one gradient evaluation per iteration and only O(log(1/ε)) iterations? Yes! Stochastic Average Gradient (SAG):

w_t = w_{t−1} − (η_t/n) ∑_{i=1}^n ∇f_i^t,  where ∇f_i^t = ∇f(w_{t−1}, Z_i) if i = i(t), and ∇f_i^t = ∇f_i^{t−1} otherwise

SAG gradient estimates are no longer unbiased, but they have greatly reduced variance. With the fixed stepsize η_t = 1/(16L),

E[F(w_t)] − F(w^*) ≤ O((1 − min{µ/(16L), 1/(8n)})^t)

Iteration complexity ∼ O(log(1/ε))!

Other variants with similar convergence: SDCA, SVRG, SAGA

Roux et al., NIPS'12, A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets
Shalev-Shwartz & Zhang, JMLR'13, Stochastic dual coordinate ascent methods for regularized loss minimization
Johnson & Zhang, NIPS'13, Accelerating stochastic gradient descent using predictive variance reduction
Defazio & Bach, NIPS'14, SAGA
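A minimal sketch of the SAG update on a least-squares problem, keeping a table of the most recently computed per-sample gradients (the synthetic data, step size, and variable names are assumptions; see Roux et al. for the actual algorithm and its stepsize theory):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([rng.uniform(-1, 1, n), np.ones(n)])
Y = X @ np.array([2.0, 0.5]) + 0.1 * rng.standard_normal(n)

grad_table = np.zeros((n, 2))      # stored gradient for each sample i (nabla f_i^t)
grad_sum = np.zeros(2)             # running sum of the stored gradients
w = np.zeros(2)
eta = 0.01                         # fixed stepsize of the order 1/(16L)

for t in range(20000):
    i = rng.integers(n)                            # sample index i(t)
    g_new = 2.0 * (X[i] @ w - Y[i]) * X[i]         # fresh gradient at the current iterate
    grad_sum += g_new - grad_table[i]              # update the running sum in O(1)
    grad_table[i] = g_new                          # all other entries keep their old gradients
    w = w - (eta / n) * grad_sum                   # w_t = w_{t-1} - (eta/n) * sum_i nabla f_i^t

print(w)   # approximately (2.0, 0.5)
```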


More on SGD

If n is not very large,

(1/n) ∑_{i=1}^n (f(w_t, Z_i) − f(w^*, Z_i)) ≤ ε ?

Sometimes simply minimising the loss will cause overfitting.


Caveats: Data Overfitting for y = sin(2πx) + noise

Figure: fitted curves at (a) t = 2, (b) t = 7, (c) t = 100, (d) t = 5000 iterations

Often, Small Training Error ⇏ Small Testing Error!


Early Stopping

Actually, if the samples are regarded as random variables drawn from some distribution P, we may consider minimising the true risk,

F_true(w_t) = E_{Z∼P}[f(w_t, Z)]

Figure: Early Stopping with SGD
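A minimal sketch of early stopping with SGD: monitor a held-out validation loss as a proxy for the true risk and keep the best iterate (the data split, patience rule, and all variable names are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
X = np.column_stack([rng.uniform(-1, 1, n), np.ones(n)])
Y = X @ np.array([2.0, 0.5]) + 0.3 * rng.standard_normal(n)
Xtr, Ytr, Xva, Yva = X[:300], Y[:300], X[300:], Y[300:]     # train / validation split

def val_loss(w):
    return np.mean((Yva - Xva @ w) ** 2)      # proxy for the true risk E[f(w, Z)]

w = np.zeros(2)
best_w, best_loss, checks_since_best = w.copy(), np.inf, 0
for t in range(50000):
    i = rng.integers(len(Ytr))
    w = w - 0.01 * 2.0 * (Xtr[i] @ w - Ytr[i]) * Xtr[i]     # one SGD step on the training loss
    if t % 100 == 0:                                        # periodically evaluate on held-out data
        loss = val_loss(w)
        if loss < best_loss:
            best_w, best_loss, checks_since_best = w.copy(), loss, 0
        else:
            checks_since_best += 1
        if checks_since_best >= 20:     # stop once validation loss has not improved for a while
            break

print(best_w, best_loss)
```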


SGD Applications in supervised learning

SGD methods are widely applied in machine learning problems.

For different differentiable loss functions,

Adaline: ∑_i ½ (y_i − w^T Φ(x_i))²

Tikhonov: ∑_i (y_i − w^T x_i)² + λ‖w‖₂²

Logistic Regression: ∑_i y_i log(1/(1 + e^{−w^T x_i})) + (1 − y_i) log(1 − 1/(1 + e^{−w^T x_i}))

What if the loss is not differentiable?

SVM: ½‖w‖₂² + C ∑_i max(0, 1 − y_i (w^T x_i + b))

Lasso: ∑_i (y_i − w^T x_i)² + λ‖w‖₁

Perceptron: ∑_i max{0, −y_i w^T Φ(x_i)}

Neural Network: ∑_i f(w, Z_i)
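Before turning to non-differentiable losses, a minimal NumPy sketch of a few of the losses above, including one valid subgradient for the hinge loss (the example vectors and function names are my own, purely for illustration):

```python
import numpy as np

def squared_loss(w, x, y):            # Adaline / least-squares term for one sample
    return 0.5 * (y - w @ x) ** 2

def logistic_loss(w, x, y):           # y in {0, 1}: negative log-likelihood of one sample
    p = 1.0 / (1.0 + np.exp(-(w @ x)))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def hinge_loss(w, x, y):              # y in {-1, +1}: SVM data term, not differentiable at the kink
    return max(0.0, 1.0 - y * (w @ x))

def hinge_subgradient(w, x, y):
    """One valid subgradient of the hinge loss at w (zero is a valid choice on the flat part)."""
    return -y * x if y * (w @ x) < 1.0 else np.zeros_like(w)

w = np.array([0.5, -0.2])
x = np.array([1.0, 2.0])
print(squared_loss(w, x, 1.0), logistic_loss(w, x, 1.0))
print(hinge_loss(w, x, -1.0), hinge_subgradient(w, x, -1.0))
```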

Stochastic Subgradient Methods

Short note: to be consistent with the notation used in the first part of this presentation, the following slides were adapted, so the nomenclature differs from the one used in the recorded presentation.

Sorry for any inconvenience this might cause.


Optimization Problem - Optimal Scenario

Finding zeros (roots) of a smooth function ∇f(w): R^d → R^d, for w ∈ R^d.

⇒ Newton's method

w_{t+1} = w_t − [∇f′_w(w_t)]^{−1} ∇f(w_t)

t: iteration index
w: parameters of function f(·)
∇f(w): derivative of function f(·)
f′_w(·): derivative of ∇f(·) w.r.t. w

Kushner, 1997, Stochastic Approximation
Han-Fu Chen, 2003, Stochastic Approximation and Its Applications
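A minimal one-dimensional sketch of this iteration (the convex test function f(w) = w² + e^{−w} is an assumption chosen for illustration); Newton's method finds the zero of ∇f:

```python
import numpy as np

def grad_f(w):                 # derivative of f(w) = w**2 + exp(-w)
    return 2.0 * w - np.exp(-w)

def grad_f_prime(w):           # derivative of grad_f, i.e. f''(w) > 0
    return 2.0 + np.exp(-w)

w = 5.0                        # initial guess
for t in range(20):
    w = w - grad_f(w) / grad_f_prime(w)   # w_{t+1} = w_t - [grad_f'(w_t)]^{-1} grad_f(w_t)

print(w, grad_f(w))            # grad_f(w) is numerically zero at the minimiser of f
```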


Optimization Problem - Reality

Finding zeros (roots) of an unknown function ∇f(w): R^d → R^d, for w ∈ R^d, which can be observed, but the observation may be corrupted by (i.i.d.) errors (ε_t)_{t≥1}.

⇒ Stochastic Approximation (Robbins & Monro '51):

w_{t+1} = w_t − η_t [∇f(w_t) + ε_t]

η_t: stepsize / learning rate at iteration t
ε_t: zero-mean noise at iteration t
∇f(w): derivative of function f(·)

Kushner, 1997, Stochastic Approximation
Han-Fu Chen, 2003, Stochastic Approximation and Its Applications
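A minimal sketch of the Robbins & Monro iteration on the same root-finding problem, where each observation of ∇f is corrupted by zero-mean noise (the noise level and step schedule are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(w):
    """Observation of grad_f(w) = 2w - exp(-w), corrupted by zero-mean noise eps_t."""
    return 2.0 * w - np.exp(-w) + 0.5 * rng.standard_normal()

w = 5.0
for t in range(20000):
    eta = 1.0 / (10.0 + t)               # diminishing steps: sum eta = inf, sum eta^2 < inf
    w = w - eta * noisy_grad(w)          # w_{t+1} = w_t - eta_t * [grad_f(w_t) + eps_t]

print(w)    # close to the root of 2w = exp(-w), about 0.35
```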


Gradient Descent

w_{t+1} = w_t − η_t (1/n) ∑_{i=1}^n ∇f(w_t, Z_i)

Stochastic Gradient Descent

w_{t+1} = w_t − η_t ∇f(w_t, Z_{i_t})

f(w_t, Z_{i_t}): loss function at parameters w_t for sample Z_{i_t}
n: set (/batch) size
η_t: learning rate at iteration t
t: iteration, t = 1, 2, ...
i_t ∈ {1, ..., n}: index chosen uniformly at random at iteration t

R. Tibshirani, Convex Optimization 10-725


Random selection of f(·) yields an unbiased gradient estimate:

E[∇f(w_t, Z_{i_t})] = ∇F(w_t)


Question

The optimization problem may involve:

Non-smooth objectives

Additional information

Non-i.i.d. input samples

Goal:

min_w (1/n) ∑_{i=1}^n f(w, Z_i)

f(w, Z_i): loss function with parameters w for the i-th sample Z_i

H. Li et al., 2018, Visualizing the Loss Landscape of Neural Nets.


Goal

Non-differentiable (non-smooth)

⇒ ?

Additional Information

⇒ ?

Non-i.i.d. data

⇒ ?

Non-Smooth Optimization Problems

Subgradient of a Function

g is a subgradient of f at x_2 if

f(x) ≥ f(x_2) + g^T (x − x_2)  ∀ x

*∂f(x): subdifferential, the set of all subgradients, i.e. g_i ∈ ∂f(x)
(**x corresponds to w: slight abuse of notation compared to before)

S. Boyd, Stanford University, EE364b

Goal: min_x f(x); Use: Gradient descent method

⇒ A negative subgradient doesn't necessarily give a descent direction

D. S. Rosenberg, 2018, Foundations of Machine Learning, https://bloomberg.github.io/foml/#home

Subgradient Method

w_{t+1} = w_t − η_t g_t

⇒ keep track of the best iterate w^best_{t+1} among w_1, ..., w_{t+1}, i.e.,

f(w^best_{t+1}) = min_{j=1,...,t+1} f(w_j)

w_t: t-th parameter estimate
η_t: learning rate
g_t: subgradient
t: iteration, t = 1, 2, ...
(*w corresponds to x: switch back to the notation used in the beginning)

S. Boyd, Stanford University, EE364b
R. Tibshirani, Convex Optimization 10-725/36-725
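A minimal sketch of the subgradient method with best-iterate tracking on a simple non-smooth convex function (the objective |w − 3| + 0.1 w² and the 1/√t step schedule are assumptions):

```python
import numpy as np

def f(w):                                 # non-smooth convex objective
    return abs(w - 3.0) + 0.1 * w ** 2

def subgradient(w):
    """One valid subgradient of f at w (np.sign(0) = 0 is an allowed choice at the kink)."""
    return np.sign(w - 3.0) + 0.2 * w

w, w_best = -5.0, -5.0
for t in range(1, 5001):
    eta = 1.0 / np.sqrt(t)                # diminishing step size
    w = w - eta * subgradient(w)          # w_{t+1} = w_t - eta_t * g_t (not always a descent step)
    if f(w) < f(w_best):                  # keep track of the best iterate seen so far
        w_best = w

print(w_best, f(w_best))                  # near the minimiser w* = 3
```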


Stochastic Subgradient Method

Noisy subgradients: g̃ = g + v, where g ∈ ∂f(w), E[v] = 0

w_{t+1} = w_t − η_t g̃_t

⇒ Random choice of (sample) index i at iteration t (out of a set/batch of samples of size n)

w_{t+1} = w_t − η_t g_{i_t}

S. Boyd, Stanford University, EE364b
R. Tibshirani, Convex Optimization 10-725/36-725
J. Zhu, University of Melbourne, 2020, Discussion after IT lecture


Convergence Results

Fixed step size η

SGM:

lim_{t→∞} F(w^(best)_t) ≤ F^* + L²η/2

Stochastic SGM:

lim_{t→∞} F(w^(best)_t) ≤ F^* + 5n²L²η/2

S. Boyd et al., 2003, Subgradient Methods
R. Tibshirani, Convex Optimization 10-725/36-725, Subgradient Method
*Convergence rates for f(w) Lipschitz continuous with constant L > 0


Convergence Results

Diminishing stepsize (Robbins-Monro condition)

(Stochastic) SGM:

lim_{t→∞} F(w^(best)_t) ≤ F^*

provided

η_t > 0,  lim_{t→∞} η_t = 0,  ∑_{t=1}^∞ η_t = ∞

e.g.:

η_t > 0,  ∑_{t=1}^∞ η_t² < ∞,  ∑_{t=1}^∞ η_t = ∞

*More about how to choose η: S. Boyd et al., 2003, Subgradient Methods
R. Tibshirani, Convex Optimization 10-725/36-725, Subgradient Method
**Convergence rates for f(w) Lipschitz continuous with constant L > 0


Applications

Algorithms for non-differentiable convex optimization

Convex analysis

ML/DL

⇒ Methods based on stochastic subgradients are used to (approximately) optimize (nonconvex, nonsmooth) deep neural networks (DNNs)

⇒ E.g.: Adagrad, ADAM, NADAM, RMSProp, ...

Udell, Operations Research and Information Engineering, Cornell, 2017, Presentation
J. Duchi et al., 2011, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

Optimization Considering Additional Information

Adagrad

w_{t+1} = w_t − η_t (G_t)^{−1/2} g_{i_t}

⇒ Incorporates knowledge of the geometry of past iterations

w_t: t-th parameter estimate
η_t: learning rate
g_{i_t}: subgradient for a uniformly at random chosen sample Z_i
t: iteration, t = 1, 2, ...
G_t: outer product matrix of past gradients up to time step t: ∑_{τ=1}^t g_τ g_τ^T

Udell, Cornell, 2017, Operations Research and Information Engineering
J. Duchi et al., 2011, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
AdaGrad - Adaptive Subgradient Methods, https://ppasupat.github.io/a9online/1107.html


Adagrad

w_{t+1,j} = w_{t,j} − η_{t,j} (∑_{τ=1}^t g²_{τ,j})^{−1/2} g_{i_t,j}

Example: min_w f(w) = 100w₁² + w₂²

j: j-th feature/parameter of w
η_t: learning rate
g_{i_t}: subgradient for a uniformly at random chosen sample Z_i
t: iteration, t = 1, 2, ...
∑_{τ=1}^t g²_{τ,j}: sum of past squared gradients for coordinate j up to time step t

Udell, Cornell, 2017, Operations Research and Information Engineering
J. Duchi et al., 2011, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
AdaGrad - Adaptive Subgradient Methods, https://ppasupat.github.io/a9online/1107.html
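A minimal sketch of the diagonal (per-coordinate) AdaGrad update on the toy objective f(w) = 100 w₁² + w₂² (the base step size and the small ε added for numerical stability are assumptions):

```python
import numpy as np

def grad(w):                        # gradient of f(w) = 100*w1^2 + w2^2
    return np.array([200.0 * w[0], 2.0 * w[1]])

w = np.array([1.0, 1.0])
eta, eps = 0.5, 1e-8
G = np.zeros(2)                     # running sum of squared gradients, one entry per coordinate
for t in range(500):
    g = grad(w)
    G += g ** 2                                  # accumulate past squared gradients
    w = w - eta * g / (np.sqrt(G) + eps)         # per-coordinate step ~ eta / sqrt(sum of g_j^2)

print(w)   # both coordinates shrink toward 0 at the same rate despite the 100x curvature gap
```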


Adagrad

⇒ Variable metric projected subgradient method

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods


Variable metric projected subgradient method

Projection carried out in the H_t = (1/η_t)(G_t)^{1/2} metric:

w_{t+1} = w_t − η_t (G_t)^{−1/2} g_{i_t} = w_t − H_t^{−1} g_{i_t}

w_{t+1} = P^{H_t}_W(w_t − H_t^{−1} g_{i_t}) = P^{H_t}_W(y)

where

P^{H_t}_W(y) = argmin_{w∈W} ‖w − y‖²_{H_t}

P^{H_t}_W(y): projection of a vector y onto W according to the H_t metric
‖w − y‖_{H_t} = √((w − y)^T H_t^{−1} (w − y)): Mahalanobis norm, weighted l2-distance

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods
Y. Chen, Princeton University, 2019, ELE 522: Large-Scale Optimization for Data Science


But now, what if...

Parameter of interest lies on a non-Euclidean manifold

Probability vectors

G. Raskutti, The information geometry of mirror descent
Y. Chen, Princeton University, 2019, ELE 522: Large-Scale Optimization for Data Science


Convergence Analysis

Basic inequality: (Projected) subgradient method

F(w^(best)_t) − F^* ≤ (R² + L² ∑_{i=1}^t η_i²) / (2 ∑_{i=1}^t η_i)

With η_i = (R/L)/√t:

F(w^(best)_t) − F^* ≤ RL/√t

for L = max_{w∈W} ‖g_{i_t}‖₂ and R = max_{w,w^*∈W} ‖w − w^*‖₂

⇒ Analysis and convergence results depend on the l2 norm

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods


Subgradient Method Update Rule(s)

min_{w∈W} F(w)

w_{t+1} = w_t − η_t g_{i_t}
w_{t+1} = P_W(w_t − η_t g_{i_t})
w_{t+1} = argmin_{w∈W} ‖w − (w_t − η_t g_{i_t})‖₂²

Using some math:

w_{t+1} = argmin_{w∈W} { g_{i_t}^T w + (1/(2η_t)) ‖w − w_t‖₂² }

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods
J. Duchi et al., 2003, Proximal and First-Order Methods for Convex Optimization


Stochastic Mirror Descent

w_{t+1} = argmin_w { g_{i_t}^T w + (1/(2η_t)) B_φ(w, w_t) }

Bregman divergence:

B_φ(w, w_t) = φ(w) − φ(w_t) − ∇φ(w_t)^T (w − w_t)

∇φ(·): mirror map, invertible map
φ(·): potential function, strictly convex, differentiable

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods
S. Bubeck, 2015, Convex Optimization: Algorithms and Complexity
N. Azizan et al., Stochastic Interpretation of SMD: Risk-Sensitive Optimality
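A minimal sketch of stochastic mirror descent with the negative-entropy potential on the probability simplex, i.e. the exponentiated-gradient update (the toy linear losses, step size, and dimensions are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 5, 2000
losses = rng.uniform(0.0, 1.0, size=(T, d))   # Z_t: per-coordinate losses, f(w, Z_t) = w @ Z_t

w = np.ones(d) / d                             # start at the uniform distribution on the simplex
eta = 0.1
for t in range(T):
    g = losses[t]                              # gradient of the linear loss w @ Z_t is Z_t itself
    w = w * np.exp(-eta * g)                   # mirror step: grad phi(w+) = grad phi(w) - eta * g
    w = w / w.sum()                            # Bregman (KL) projection back onto the simplex

print(w)   # puts more mass on coordinates with smaller cumulative loss
```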


Generalization/Examples

Gradient descent for φ(w) = ½‖w‖₂²: B_φ = ½‖w − w_t‖₂², so mirror descent = projected subgradient method

Negative entropy for φ(w) = ∑_{i=1}^n w_i log w_i (for w ∈ unit simplex)

p-norm algorithm for φ(w) = ½‖w‖_p²

Exponential Gradient descent

Sparse Mirror Descent

...

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods
N. Azizan et al., 2019, SMD on Overparameterized Nonlinear Models: Conv., Implicit Regul., and General.


w_{t+1} = argmin_w { g_{i_t}^T w + (1/(2η_t)) B_φ(w, w_t) }

Using some math:

w_{t+1} = argmin_{w∈W∩D} B_φ(w, y_{t+1})

w_{t+1} = P^φ_W(y_{t+1})

w_{t+1} = P^φ_W((∇φ)^{−1}(∇φ(w_t) − η_t g_{i_t}))

Alternative update rule for stochastic mirror descent:

∇φ(w_{t+1}) = ∇φ(w_t) − η_t g_{i_t}

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods
S. Bubeck, 2015, Convex Optimization: Algorithms and Complexity


Stochastic Mirror Descent

∇φ(w_{t+1}) = ∇φ(w_t) − η_t g_{i_t}

∇φ(·): mirror map, invertible map
φ(·): potential function, strictly convex, differentiable

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods
M.S. Alkousa et al., 2019, On some SMD methods for constrained online optimization problems
Z. Zhou et al., 2020, On the convergence of MD beyond stochastic convex programming


Applications

Non-smooth and/or non-convex stochastic optimization problems

Highly overparameterized nonlinear learning problems

Large scale optimization problems

Online learning

Reinforcement learning

Z. Zhou et al., 2020, On the convergence of MD beyond stochastic convex programming
N. Azizan et al., 2019, SMD on Overparametrized Nonlinear Models: Conv., Implicit Regul., and General.
M. Raginsky et al., Sparse Q-learning with Mirror Descent
S. Mahadevan et al., Continuous-Time SMD on a Network: Variance Reduction, Consensus, Convergence

Optimization in Case of Non-I.i.d. Data

Ergodic Mirror Descent

Update rule:

w_{t+1} = argmin_w { g_{i_t}^T w + (1/(2η_t)) B_{φ_t}(w, w_t) }

⇒ Based on stochastic mirror descent

J. Duchi et al., 2012, Ergodic Mirror Descent


F(w) := E_Π[f(w; Z_i)],  w ∈ W

Stochastic process P_i

Stationary distribution Π such that P_i → Π

Training samples (Z_1, ..., Z_n) ∼ P
Loss function for w on sample Z_i is f(w, Z_i)

J. Duchi et al., 2012, Ergodic Mirror Descent


Convergence in expectation and with high probability shown for:

Distributed convex optimization

(Potentially nonlinear) ARMA processes

Learning ranking facts

Pseudo-random sanity

J. Duchi et al., 2012, Ergodic Mirror Descent
Microsoft Research, 2016, Learning and stochastic optimization with non-iid data, https://www.youtube.com/watch?v=_yRnHRQVMgw

Conclusion

Usually

⇒ Stochastic gradient descent

Non-differentiable (non-smooth)

⇒ Stochastic subgradient methods

Additional Information

⇒ Stochastic mirror descent

Non-i.i.d. input data

⇒ Ergodic mirror descent


Thank you
