Stochastic Gradient Descent Methods

Stochastic Optimization STUDY GROUP

Xuetong Wu & Viktoria Schram

Department of EEE, University of Melbourne

October 22, 2020


Overview

1 Introduction

2 Gradient Descent

3 Stochastic Gradient Descent

4 Stochastic Subgradient Methods
  Non-Smooth Optimization Problems
  Optimization Considering Additional Information
  Optimization in Case of Non-I.i.d. Data

5 Conclusion

Introduction


Parameter Estimation Problems

Communications

Tracking

Control theory

System identification

Machine learning

...

Kushner, 1997, Stochastic Approximation
Han-Fu Chen, 2003, Stochastic Approximation and Its Applications


Classification Problems

Consider a typical image classification problem:

Figure: training examples (dogs, cats) → model → predicted labels

We wish to learn a good model h that minimises the prediction error:

min_h (1/n) ∑_{i=1}^n 1{h(X_i) ≠ Y_i}


Regression Problems

Consider a simple regression problem: we wish to learn a good model that minimises the mean squared error.

Y = aX + b

Figure: training examples → fitted model → predicted labels

Mathematically,

min_{a,b} (1/n) ∑_{i=1}^n (Y_i − aX_i − b)²
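To make the objective concrete, here is a minimal NumPy sketch (not from the slides; the synthetic data and the helper name mse are assumptions for illustration) that evaluates the empirical mean squared error for a candidate (a, b):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = rng.uniform(-1.0, 1.0, size=n)
Y = 2.0 * X + 0.5 + 0.1 * rng.standard_normal(n)   # data generated from Y = aX + b + noise

def mse(a, b, X, Y):
    """Empirical objective (1/n) * sum_i (Y_i - a*X_i - b)^2."""
    residual = Y - a * X - b
    return np.mean(residual ** 2)

print(mse(2.0, 0.5, X, Y))   # near the noise variance (about 0.01)
print(mse(0.0, 0.0, X, Y))   # much larger for a poor model
```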


Optimization in Learning Problems

Many machine learning problems can be formulated as the following problem:

min_{w∈W} F(w) = (1/n) ∑_{i=1}^n f(w, Z_i)

Z_i: training sample / data pair (X_i, Y_i)

w: model parameters (e.g., a, b in the least-squares problem)

f: loss function

Gradient Descent

min_{w∈W} F(w) = (1/n) ∑_{i=1}^n f(w, Z_i)

If f is convex and differentiable w.r.t. w, then by the first-order Taylor approximation, for η > 0,

F(w + η Δw) ≈ F(w) + η Δw^T ∇_w F(w)

Best Δw that minimises the R.H.S.:

Δw = −∇_w F(w)

We choose an initial point w_0 and a step size η_t at each time t.

(Batch) Gradient descent:

w_{t+1} = w_t − η_t ∇_w F(w_t) = w_t − (η_t/n) ∑_{i=1}^n ∇_w f(w_t, Z_i)

Stops at a certain point such that

F(w_t) − F(w^*) ≤ ε
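A minimal sketch of batch gradient descent on a small least-squares problem (the synthetic data, fixed step size, and stopping tolerance are assumptions, not the presenters' code):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([rng.uniform(-1, 1, n), np.ones(n)])   # features [x, 1], so w = (a, b)
Y = X @ np.array([2.0, 0.5]) + 0.1 * rng.standard_normal(n)

def grad_F(w):
    """Full gradient of F(w) = (1/n) * sum_i (Y_i - X_i^T w)^2."""
    return (2.0 / n) * X.T @ (X @ w - Y)

w = np.zeros(2)                 # initial point w_0
eta = 0.1                       # fixed step size
for t in range(1000):
    w = w - eta * grad_F(w)     # w_{t+1} = w_t - eta * grad F(w_t), using all n samples per step
    if np.linalg.norm(grad_F(w)) < 1e-10:   # simple stopping rule
        break

print(w)    # close to the least-squares solution, roughly (2.0, 0.5)
```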


Figure: Visualization of Gradient Descent


Convergence Rate For GD

Assume f is convex and differentiable w.r.t. w, and further assume that the gradient ∇_w F(w) = (1/n) ∑_{i=1}^n ∇_w f(w, Z_i) is L-Lipschitz continuous (∇²F ⪯ LI). Then,

Theorem

Gradient descent with fixed step size η ≤ 1/L satisfies

F(w_t) − F(w^*) ≤ ‖w_0 − w^*‖² / (2ηt)

Convergence rate ∼ O(1/t), iteration complexity ∼ O(1/ε).

R. Tibshirani, Convex Optimization 10-725


Convergence Rate For GD with Strong Convexity

Furthermore, if F(w) is µ-strongly convex (∇²F ⪰ µI).

Theorem

Gradient descent with fixed step size η ≤ 2/(µ + L) or with backtracking line search satisfies

F(w_t) − F(w^*) ≤ c^t (L/2) ‖w_0 − w^*‖²

where 0 < c < 1.

Convergence rate ∼ O(c^t), iterations needed for error ε ∼ O(log(1/ε)).

R. Tibshirani, Convex Optimization 10-725


Problems

Two main drawbacks of gradient descent:

If n is relatively large, computing the full gradient is memory- and time-consuming.

If the loss function is nonconvex, the iterates can get stuck at a stationary point (e.g., a saddle point).

Stochastic Gradient Descent

A practical alternative is to simulate a data stream by picking Z_t uniformly at random from the training examples at each time t.

Namely, stochastic gradient descent:

w_{t+1} = w_t − η_t ∇_w f(w_t, Z_t)

Why does this work? By picking uniformly at random,

E_{Z_t}[∇_w f(w_t, Z_t)] = (1/n) ∑_{i=1}^n ∇_w f(w_t, Z_i)

Unbiased estimate but high variance; usually works well in large-scale problems.
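A minimal sketch of this update on a least-squares problem (the synthetic data and constant step size are assumptions); each step touches a single randomly drawn sample:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([rng.uniform(-1, 1, n), np.ones(n)])
Y = X @ np.array([2.0, 0.5]) + 0.1 * rng.standard_normal(n)

def grad_f(w, i):
    """Gradient of the single-sample loss f(w, Z_i) = (Y_i - X_i^T w)^2."""
    return 2.0 * (X[i] @ w - Y[i]) * X[i]

w = np.zeros(2)
eta = 0.01                           # constant step size
for t in range(20000):
    i = rng.integers(n)              # pick Z_t uniformly at random (unbiased gradient estimate)
    w = w - eta * grad_f(w, i)       # w_{t+1} = w_t - eta_t * grad f(w_t, Z_t)

print(w)   # bounces around a small neighbourhood of (2.0, 0.5)
```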


Stochastic GD vs. GD

Figure: Stochastic GD vs. GD


Remarks on SGD

Computational cost for n samples and p iterations.

GD ∼ O(np)

SGD ∼ O(p)

SGD does not always produce descent directions and the gradient estimate is very noisy!

With a constant step size, the SGD iterates bounce around the optimal value.

Convergence properties?


Convergence Rate Analysis

minimize_w F(w) := (1/n) ∑_{i=1}^n f(w, Z_i)

We wish to achieve ε-optimality,

E[F(w_t)] − F(w^*) ≤ ε

after t iterations.


Assumptions

F(w) is µ-strongly convex and the gradient is L-Lipschitz continuous, with µ/L ≤ 1.

∇f(w_t, Z_t) is an unbiased estimate of ∇F(w_t).

For all w, the variance of the gradient is bounded:

E_Z[‖∇f(w, Z)‖₂²] − ‖E_Z[∇f(w, Z)]‖₂² ≤ σ²


Constant Step Size

Theorem (Convergence with Fixed Stepsizes)

Under the assumptions, if η_t = η ≤ 1/L, then SGD achieves

E[F(w_t)] − F(w^*) ≤ ηLσ²/(2µ) + (1 − ηµ)^t (F(w_0) − F(w^*))

Linear convergence at the beginning.

When t → ∞,

E[F(w_t) − F(w^*)] ≤ ηLσ²/(2µ)

i.e., SGD converges to some neighborhood of w^*; variation in the gradient computation prevents further progress.

Theorem 4.6 in Bottou, 2018, Optimization Methods for Large-Scale Machine Learning


Diminishing Step Size

Theorem (Convergence with Diminishing Stepsizes)

Under the assumptions, if η_t = θ/(t+1) for some θ > 1/µ, then SGD achieves

E[F(w_t) − F(w^*)] ≤ 2ν/(t + 1)

where

ν := max{ Lσ²/(4(µ − 1)), F(w_0) − F(w^*) }

Convergence rate is O(1/t), iterations needed O(1/ε), with diminishing stepsize η_t ∝ 1/t.

Theorem 4.7 in Bottou, 2018, Optimization Methods for Large-Scale Machine Learning


Convergence Rate and Time Comparison

Risk Minimization Under Strong Convexity and L-smooth:

        iteration complexity    per-iteration cost    total comput. cost
GD      log(1/ε)                n                     n log(1/ε)
SGD     1/ε                     1                     1/ε


Advantages

Compared to gradient descent, SGD has the following advantages.

Less computational cost per iteration.

For larger datasets (large n and moderate ε), it can converge faster.

For non-convex problems, the gradient noise can sometimes help SGD escape saddle points.

Rong Ge et al., COLT 2015, Escaping From Saddle Points - Online Stochastic Gradient for Tensor Decomposition


Variance Reduction and Acceleration

To reduce the variance of the gradient estimate, we can use mini-batch SGD with k ≪ n:

w_{t+1} = w_t − (η_t/k) ∑_{i∈B_t} ∇f(w_t, Z_i),  where B_t is a random mini-batch of size k
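A minimal sketch of the mini-batch update (the synthetic data and batch size k = 32 are assumptions); averaging over the batch reduces the variance of each step:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 10000, 32
X = np.column_stack([rng.uniform(-1, 1, n), np.ones(n)])
Y = X @ np.array([2.0, 0.5]) + 0.1 * rng.standard_normal(n)

w = np.zeros(2)
eta = 0.05
for t in range(3000):
    batch = rng.choice(n, size=k, replace=False)     # random mini-batch B_t
    Xb, Yb = X[batch], Y[batch]
    grad = (2.0 / k) * Xb.T @ (Xb @ w - Yb)          # gradient averaged over the batch
    w = w - eta * grad                               # lower-variance step than single-sample SGD

print(w)   # close to (2.0, 0.5)
```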

Figure: SGD for logistic regression problems with n = 10000

R. Tibshirani, Convex Optimization 10-725


Variance Reduction and Acceleration

Can we have one gradient evaluation per iteration and only O(log(1/ε)) iterations? Yes! Stochastic Average Gradient (SAG):

w_t = w_{t−1} − (η_t/n) ∑_{i=1}^n ∇f_i^t,  where ∇f_i^t = ∇f(w_{t−1}, Z_i) if i = i(t), and ∇f_i^t = ∇f_i^{t−1} otherwise

SAG gradient estimates are no longer unbiased, but they have greatly reduced variance. With the fixed stepsize η_t = 1/(16L),

E[F(w_t)] − F(w^*) ≤ O((1 − min{µ/(16L), 1/(8n)})^t)

Iteration complexity ∼ O(log(1/ε))!

Other variants with similar convergence: SDCA, SVRG, SAGA

Roux et al., NIPS'12, A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets
Shalev-Shwartz & Zhang, JMLR'13, Stochastic dual coordinate ascent methods for regularized loss minimization
Johnson & Zhang, NIPS'13, Accelerating stochastic gradient descent using predictive variance reduction
Defazio & Bach, NIPS'14, SAGA
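A minimal sketch of the SAG update on a least-squares problem, keeping a table of the most recently computed per-sample gradients (the synthetic data, step size, and variable names are assumptions; see Roux et al. for the actual algorithm and its stepsize theory):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([rng.uniform(-1, 1, n), np.ones(n)])
Y = X @ np.array([2.0, 0.5]) + 0.1 * rng.standard_normal(n)

grad_table = np.zeros((n, 2))      # stored gradient for each sample i (nabla f_i^t)
grad_sum = np.zeros(2)             # running sum of the stored gradients
w = np.zeros(2)
eta = 0.01                         # fixed stepsize of the order 1/(16L)

for t in range(20000):
    i = rng.integers(n)                            # sample index i(t)
    g_new = 2.0 * (X[i] @ w - Y[i]) * X[i]         # fresh gradient at the current iterate
    grad_sum += g_new - grad_table[i]              # update the running sum in O(1)
    grad_table[i] = g_new                          # all other entries keep their old gradients
    w = w - (eta / n) * grad_sum                   # w_t = w_{t-1} - (eta/n) * sum_i nabla f_i^t

print(w)   # approximately (2.0, 0.5)
```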


More on SGD

If n is not very large,

(1/n) ∑_{i=1}^n (f(w_t, Z_i) − f(w^*, Z_i)) ≤ ε ?

Sometimes simply minimising the loss will cause overfitting.


Caveats: Data Overfitting for y = sin(2πx) + noise

Figure: fitted curves at (a) t = 2, (b) t = 7, (c) t = 100, (d) t = 5000 iterations

Often, Small Training Error ⇏ Small Testing Error!


Early Stopping

Actually, if the samples are regarded as random variables drawn from some distribution P, we may consider minimising the true risk,

F_true(w_t) = E_{Z∼P}[f(w_t, Z)]

Figure: Early Stopping with SGD
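A minimal sketch of early stopping with SGD: monitor a held-out validation loss as a proxy for the true risk and keep the best iterate (the data split, patience rule, and all variable names are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
X = np.column_stack([rng.uniform(-1, 1, n), np.ones(n)])
Y = X @ np.array([2.0, 0.5]) + 0.3 * rng.standard_normal(n)
Xtr, Ytr, Xva, Yva = X[:300], Y[:300], X[300:], Y[300:]     # train / validation split

def val_loss(w):
    return np.mean((Yva - Xva @ w) ** 2)      # proxy for the true risk E[f(w, Z)]

w = np.zeros(2)
best_w, best_loss, checks_since_best = w.copy(), np.inf, 0
for t in range(50000):
    i = rng.integers(len(Ytr))
    w = w - 0.01 * 2.0 * (Xtr[i] @ w - Ytr[i]) * Xtr[i]     # one SGD step on the training loss
    if t % 100 == 0:                                        # periodically evaluate on held-out data
        loss = val_loss(w)
        if loss < best_loss:
            best_w, best_loss, checks_since_best = w.copy(), loss, 0
        else:
            checks_since_best += 1
        if checks_since_best >= 20:     # stop once validation loss has not improved for a while
            break

print(best_w, best_loss)
```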


SGD Applications in supervised learning

SGD methods are widely applied in machine learning problems.

For different differentiable loss functions,

Adaline: ∑_i ½ (y_i − w^T Φ(x_i))²

Tikhonov: ∑_i (y_i − w^T x_i)² + λ‖w‖₂²

Logistic Regression: ∑_i y_i log(1/(1 + e^{−w^T x_i})) + (1 − y_i) log(1 − 1/(1 + e^{−w^T x_i}))

What if the loss is not differentiable?

SVM: ½‖w‖₂² + C ∑_i max(0, 1 − y_i (w^T x_i + b))

Lasso: ∑_i (y_i − w^T x_i)² + λ‖w‖₁

Perceptron: ∑_i max{0, −y_i w^T Φ(x_i)}

Neural Network: ∑_i f(w, Z_i)
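Before turning to non-differentiable losses, a minimal NumPy sketch of a few of the losses above, including one valid subgradient for the hinge loss (the example vectors and function names are my own, purely for illustration):

```python
import numpy as np

def squared_loss(w, x, y):            # Adaline / least-squares term for one sample
    return 0.5 * (y - w @ x) ** 2

def logistic_loss(w, x, y):           # y in {0, 1}: negative log-likelihood of one sample
    p = 1.0 / (1.0 + np.exp(-(w @ x)))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def hinge_loss(w, x, y):              # y in {-1, +1}: SVM data term, not differentiable at the kink
    return max(0.0, 1.0 - y * (w @ x))

def hinge_subgradient(w, x, y):
    """One valid subgradient of the hinge loss at w (zero is a valid choice on the flat part)."""
    return -y * x if y * (w @ x) < 1.0 else np.zeros_like(w)

w = np.array([0.5, -0.2])
x = np.array([1.0, 2.0])
print(squared_loss(w, x, 1.0), logistic_loss(w, x, 1.0))
print(hinge_loss(w, x, -1.0), hinge_subgradient(w, x, -1.0))
```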

Stochastic Subgradient Methods

Short note: to be consistent with the notation used in the first part of this presentation, the following slides were adapted, so the nomenclature differs from the one used in the recorded presentation.

Sorry for any inconvenience this might cause.


Optimization Problem - Optimal Scenario

Finding zeros (roots) of a smooth function ∇f(w): R^d → R^d, for w ∈ R^d.

⇒ Newton's method

w_{t+1} = w_t − [∇f′_w(w_t)]^{−1} ∇f(w_t)

t: iteration index
w: parameters of function f(·)
∇f(w): derivative of function f(·)
f′_w(·): derivative of ∇f(·) w.r.t. w

Kushner, 1997, Stochastic Approximation
Han-Fu Chen, 2003, Stochastic Approximation and Its Applications
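A minimal one-dimensional sketch of this iteration (the convex test function f(w) = w² + e^{−w} is an assumption chosen for illustration); Newton's method finds the zero of ∇f:

```python
import numpy as np

def grad_f(w):                 # derivative of f(w) = w**2 + exp(-w)
    return 2.0 * w - np.exp(-w)

def grad_f_prime(w):           # derivative of grad_f, i.e. f''(w) > 0
    return 2.0 + np.exp(-w)

w = 5.0                        # initial guess
for t in range(20):
    w = w - grad_f(w) / grad_f_prime(w)   # w_{t+1} = w_t - [grad_f'(w_t)]^{-1} grad_f(w_t)

print(w, grad_f(w))            # grad_f(w) is numerically zero at the minimiser of f
```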


Optimization Problem - Reality

Finding zeros (roots) of an unknown function ∇f(w): R^d → R^d, for w ∈ R^d, which can be observed, but the observation may be corrupted by (i.i.d.) errors (ε_t)_{t≥1}.

⇒ Stochastic Approximation (Robbins & Monro '51):

w_{t+1} = w_t − η_t [∇f(w_t) + ε_t]

η_t: stepsize / learning rate at iteration t
ε_t: zero-mean noise at iteration t
∇f(w): derivative of function f(·)

Kushner, 1997, Stochastic Approximation
Han-Fu Chen, 2003, Stochastic Approximation and Its Applications
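A minimal sketch of the Robbins & Monro iteration on the same root-finding problem, where each observation of ∇f is corrupted by zero-mean noise (the noise level and step schedule are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(w):
    """Observation of grad_f(w) = 2w - exp(-w), corrupted by zero-mean noise eps_t."""
    return 2.0 * w - np.exp(-w) + 0.5 * rng.standard_normal()

w = 5.0
for t in range(20000):
    eta = 1.0 / (10.0 + t)               # diminishing steps: sum eta = inf, sum eta^2 < inf
    w = w - eta * noisy_grad(w)          # w_{t+1} = w_t - eta_t * [grad_f(w_t) + eps_t]

print(w)    # close to the root of 2w = exp(-w), about 0.35
```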


Gradient Descent

w_{t+1} = w_t − η_t (1/n) ∑_{i=1}^n ∇f(w_t, Z_i)

Stochastic Gradient Descent

w_{t+1} = w_t − η_t ∇f(w_t, Z_{i_t})

f(w_t, Z_{i_t}): loss function at parameters w_t for sample Z_{i_t}
n: set (/batch) size
η_t: learning rate at iteration t
t: iteration, t = 1, 2, ...
i_t ∈ {1, ..., n}: index chosen uniformly at random at iteration t

R. Tibshirani, Convex Optimization 10-725


Random selection of f(·) yields an unbiased gradient estimate:

E[∇f(w_t, Z_{i_t})] = ∇F(w_t)


Question

The optimization problem may involve:

Non-smooth objectives

Additional information

Non-i.i.d. input samples

Goal:

min_w (1/n) ∑_{i=1}^n f(w, Z_i)

f(w, Z_i): loss function with parameters w for the i-th sample Z_i

H. Li et al., 2018, Visualizing the Loss Landscape of Neural Nets.


Goal

Non-differentiable (non-smooth)

⇒ ?

Additional Information

⇒ ?

Non-i.i.d. data

⇒ ?

Non-Smooth Optimization Problems

Subgradient of a Function

g is a subgradient of f at x_2 if

f(x) ≥ f(x_2) + g^T (x − x_2)  ∀ x

*∂f(x): subdifferential, the set of all subgradients, i.e. g_i ∈ ∂f(x)
(**x corresponds to w: slight abuse of notation compared to before)

S. Boyd, Stanford University, EE364b

Goal: min_x f(x); Use: Gradient descent method

⇒ A negative subgradient doesn't necessarily give a descent direction

D. S. Rosenberg, 2018, Foundations of Machine Learning, https://bloomberg.github.io/foml/#home

Subgradient Method

w_{t+1} = w_t − η_t g_t

⇒ keep track of the best iterate w^best_{t+1} among w_1, ..., w_{t+1}, i.e.,

f(w^best_{t+1}) = min_{j=1,...,t+1} f(w_j)

w_t: t-th parameter estimate
η_t: learning rate
g_t: subgradient
t: iteration, t = 1, 2, ...
(*w corresponds to x: switch back to the notation used in the beginning)

S. Boyd, Stanford University, EE364b
R. Tibshirani, Convex Optimization 10-725/36-725
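A minimal sketch of the subgradient method with best-iterate tracking on a simple non-smooth convex function (the objective |w − 3| + 0.1 w² and the 1/√t step schedule are assumptions):

```python
import numpy as np

def f(w):                                 # non-smooth convex objective
    return abs(w - 3.0) + 0.1 * w ** 2

def subgradient(w):
    """One valid subgradient of f at w (np.sign(0) = 0 is an allowed choice at the kink)."""
    return np.sign(w - 3.0) + 0.2 * w

w, w_best = -5.0, -5.0
for t in range(1, 5001):
    eta = 1.0 / np.sqrt(t)                # diminishing step size
    w = w - eta * subgradient(w)          # w_{t+1} = w_t - eta_t * g_t (not always a descent step)
    if f(w) < f(w_best):                  # keep track of the best iterate seen so far
        w_best = w

print(w_best, f(w_best))                  # near the minimiser w* = 3
```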


Stochastic Subgradient Method

Noisy subgradients: g̃ = g + v, where g ∈ ∂f(w), E[v] = 0

w_{t+1} = w_t − η_t g̃_t

⇒ Random choice of (sample) index i at iteration t (out of a set/batch of samples of size n)

w_{t+1} = w_t − η_t g_{i_t}

S. Boyd, Stanford University, EE364b
R. Tibshirani, Convex Optimization 10-725/36-725
J. Zhu, University of Melbourne, 2020, Discussion after IT lecture


Convergence Results

Fixed step size η

SGM:

lim_{t→∞} F(w^(best)_t) ≤ F^* + L²η/2

Stochastic SGM:

lim_{t→∞} F(w^(best)_t) ≤ F^* + 5n²L²η/2

S. Boyd et al., 2003, Subgradient Methods
R. Tibshirani, Convex Optimization 10-725/36-725, Subgradient Method
*Convergence rates for f(w) Lipschitz continuous with constant L > 0


Convergence Results

Diminishing stepsize (Robbins-Monro condition)

(Stochastic) SGM:

lim_{t→∞} F(w^(best)_t) ≤ F^*

provided

η_t > 0,  lim_{t→∞} η_t = 0,  ∑_{t=1}^∞ η_t = ∞

e.g.:

η_t > 0,  ∑_{t=1}^∞ η_t² < ∞,  ∑_{t=1}^∞ η_t = ∞

*More about how to choose η: S. Boyd et al., 2003, Subgradient Methods
R. Tibshirani, Convex Optimization 10-725/36-725, Subgradient Method
**Convergence rates for f(w) Lipschitz continuous with constant L > 0


Applications

Algorithms for non-differentiable convex optimization

Convex analysis

ML/DL

⇒ Methods based on stochastic subgradients are used to (approximately) optimize (nonconvex, nonsmooth) deep neural networks (DNNs)

⇒ E.g.: Adagrad, ADAM, NADAM, RMSProp, ...

Udell, Operations Research and Information Engineering, Cornell, 2017, Presentation
J. Duchi et al., 2011, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

Optimization Considering Additional Information

Adagrad

w_{t+1} = w_t − η_t (G_t)^{−1/2} g_{i_t}

⇒ Incorporates knowledge of the geometry of past iterations

w_t: t-th parameter estimate
η_t: learning rate
g_{i_t}: subgradient for a uniformly at random chosen sample Z_i
t: iteration, t = 1, 2, ...
G_t: outer product matrix of past gradients up to time step t: ∑_{τ=1}^t g_τ g_τ^T

Udell, Cornell, 2017, Operations Research and Information Engineering
J. Duchi et al., 2011, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
AdaGrad - Adaptive Subgradient Methods, https://ppasupat.github.io/a9online/1107.html


Adagrad

w_{t+1,j} = w_{t,j} − η_{t,j} (∑_{τ=1}^t g²_{τ,j})^{−1/2} g_{i_t,j}

Example: min_w f(w) = 100w₁² + w₂²

j: j-th feature/parameter of w
η_t: learning rate
g_{i_t}: subgradient for a uniformly at random chosen sample Z_i
t: iteration, t = 1, 2, ...
∑_{τ=1}^t g²_{τ,j}: sum of past squared gradients for coordinate j up to time step t

Udell, Cornell, 2017, Operations Research and Information Engineering
J. Duchi et al., 2011, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
AdaGrad - Adaptive Subgradient Methods, https://ppasupat.github.io/a9online/1107.html
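A minimal sketch of the diagonal (per-coordinate) AdaGrad update on the toy objective f(w) = 100 w₁² + w₂² (the base step size and the small ε added for numerical stability are assumptions):

```python
import numpy as np

def grad(w):                        # gradient of f(w) = 100*w1^2 + w2^2
    return np.array([200.0 * w[0], 2.0 * w[1]])

w = np.array([1.0, 1.0])
eta, eps = 0.5, 1e-8
G = np.zeros(2)                     # running sum of squared gradients, one entry per coordinate
for t in range(500):
    g = grad(w)
    G += g ** 2                                  # accumulate past squared gradients
    w = w - eta * g / (np.sqrt(G) + eps)         # per-coordinate step ~ eta / sqrt(sum of g_j^2)

print(w)   # both coordinates shrink toward 0 at the same rate despite the 100x curvature gap
```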


Adagrad

⇒ Variable metric projected subgradient method

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods


Variable metric projected subgradient method

Projection carried out in the H_t = (1/η_t)(G_t)^{1/2} metric:

w_{t+1} = w_t − η_t (G_t)^{−1/2} g_{i_t} = w_t − H_t^{−1} g_{i_t}

w_{t+1} = P^{H_t}_W(w_t − H_t^{−1} g_{i_t}) = P^{H_t}_W(y)

where

P^{H_t}_W(y) = argmin_{w∈W} ‖w − y‖²_{H_t}

P^{H_t}_W(y): projection of a vector y onto W according to the H_t metric
‖w − y‖_{H_t} = √((w − y)^T H_t^{−1} (w − y)): Mahalanobis norm, weighted l2-distance

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods
Y. Chen, Princeton University, 2019, ELE 522: Large-Scale Optimization for Data Science


But now, what if...

Parameter of interest lies on a non-Euclidean manifold

Probability vectors

G. Raskutti, The information geometry of mirror descent
Y. Chen, Princeton University, 2019, ELE 522: Large-Scale Optimization for Data Science


Convergence Analysis

Basic inequality: (Projected) subgradient method

F(w^(best)_t) − F^* ≤ (R² + L² ∑_{i=1}^t η_i²) / (2 ∑_{i=1}^t η_i)

With η_i = (R/L)/√t:

F(w^(best)_t) − F^* ≤ RL/√t

for L = max_{w∈W} ‖g_{i_t}‖₂ and R = max_{w,w^*∈W} ‖w − w^*‖₂

⇒ Analysis and convergence results depend on the l2 norm

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods


Subgradient Method Update Rule(s)

min_{w∈W} F(w)

w_{t+1} = w_t − η_t g_{i_t}
w_{t+1} = P_W(w_t − η_t g_{i_t})
w_{t+1} = argmin_{w∈W} ‖w − (w_t − η_t g_{i_t})‖₂²

Using some math:

w_{t+1} = argmin_{w∈W} { g_{i_t}^T w + (1/(2η_t)) ‖w − w_t‖₂² }

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods
J. Duchi et al., 2003, Proximal and First-Order Methods for Convex Optimization


Stochastic Mirror Descent

w_{t+1} = argmin_w { g_{i_t}^T w + (1/(2η_t)) B_φ(w, w_t) }

Bregman divergence:

B_φ(w, w_t) = φ(w) − φ(w_t) − ∇φ(w_t)^T (w − w_t)

∇φ(·): mirror map, invertible map
φ(·): potential function, strictly convex, differentiable

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods
S. Bubeck, 2015, Convex Optimization: Algorithms and Complexity
N. Azizan et al., Stochastic Interpretation of SMD: Risk-Sensitive Optimality
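A minimal sketch of stochastic mirror descent with the negative-entropy potential on the probability simplex, i.e. the exponentiated-gradient update (the toy linear losses, step size, and dimensions are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 5, 2000
losses = rng.uniform(0.0, 1.0, size=(T, d))   # Z_t: per-coordinate losses, f(w, Z_t) = w @ Z_t

w = np.ones(d) / d                             # start at the uniform distribution on the simplex
eta = 0.1
for t in range(T):
    g = losses[t]                              # gradient of the linear loss w @ Z_t is Z_t itself
    w = w * np.exp(-eta * g)                   # mirror step: grad phi(w+) = grad phi(w) - eta * g
    w = w / w.sum()                            # Bregman (KL) projection back onto the simplex

print(w)   # puts more mass on coordinates with smaller cumulative loss
```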


Generalization/Examples

Gradient descent for φ(w) = ½‖w‖₂²: B_φ = ½‖w − w_t‖₂², so mirror descent = projected subgradient method

Negative entropy for φ(w) = ∑_{i=1}^n w_i log w_i (for w ∈ unit simplex)

p-norm algorithm for φ(w) = ½‖w‖_p²

Exponential Gradient descent

Sparse Mirror Descent

...

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods
N. Azizan et al., 2019, SMD on Overparameterized Nonlinear Models: Conv., Implicit Regul., and General.


w_{t+1} = argmin_w { g_{i_t}^T w + (1/(2η_t)) B_φ(w, w_t) }

Using some math:

w_{t+1} = argmin_{w∈W∩D} B_φ(w, y_{t+1})

w_{t+1} = P^φ_W(y_{t+1})

w_{t+1} = P^φ_W((∇φ)^{−1}(∇φ(w_t) − η_t g_{i_t}))

Alternative update rule for stochastic mirror descent:

∇φ(w_{t+1}) = ∇φ(w_t) − η_t g_{i_t}

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods
S. Bubeck, 2015, Convex Optimization: Algorithms and Complexity


Stochastic Mirror Descent

∇φ(w_{t+1}) = ∇φ(w_t) − η_t g_{i_t}

∇φ(·): mirror map, invertible map
φ(·): potential function, strictly convex, differentiable

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods
M.S. Alkousa et al., 2019, On some SMD methods for constrained online optimization problems
Z. Zhou et al., 2020, On the convergence of MD beyond stochastic convex programming


Applications

Non-smooth and/or non-convex stochastic optimization problems

Highly overparameterized nonlinear learning problems

Large scale optimization problems

Online learning

Reinforcement learning

Z. Zhou et al., 2020, On the convergence of MD beyond stochastic convex programming
N. Azizan et al., 2019, SMD on Overparametrized Nonlinear Models: Conv., Implicit Regul., and General.
M. Raginsky et al., Sparse Q-learning with Mirror Descent
S. Mahadevan et al., Continuous-Time SMD on a Network: Variance Reduction, Consensus, Convergence

Optimization in Case of Non-I.i.d. Data

Ergodic Mirror Descent

Update rule:

w_{t+1} = argmin_w { g_{i_t}^T w + (1/(2η_t)) B_{φ_t}(w, w_t) }

⇒ Based on stochastic mirror descent

J. Duchi et al., 2012, Ergodic Mirror Descent


F(w) := E_Π[f(w; Z_i)],  w ∈ W

Stochastic process P_i

Stationary distribution Π such that P_i → Π

Training samples (Z_1, ..., Z_n) ∼ P
Loss function for w on sample Z_i is f(w, Z_i)

J. Duchi et al., 2012, Ergodic Mirror Descent


Convergence in expectation and with high probability shown for:

Distributed convex optimization

(Potentially nonlinear) ARMA processes

Learning ranking facts

Pseudo-random sanity

J. Duchi et al., 2012, Ergodic Mirror Descent
Microsoft Research, 2016, Learning and stochastic optimization with non-iid data, https://www.youtube.com/watch?v=_yRnHRQVMgw

Conclusion

Usually

⇒ Stochastic gradient descent

Non-differentiable (non-smooth)

⇒ Stochastic subgradient methods

Additional Information

⇒ Stochastic mirror descent

Non-i.i.d. input data

⇒ Ergodic mirror descent


Thank you
