From an ODE for Nesterov's Method to Accelerated Stochastic Gradient Descent
Adam Oberman, with Maxime Laborde, Math and Stats, McGill
Stochastic Gradient Descent definition: Math vs. ML
• Before 2010: SGD means Stochastic Differential Equations and Brownian Motion
"The data sizes have grown faster than the speed of processors … methods limited by the computing time rather than the sample size … unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems." [Large-Scale Machine Learning with Stochastic Gradient Descent, Léon Bottou]
• Since around 2010: SGD algorithm for training ML models
w: parameters of the model f. L: loss function measuring how well the model fits the labels y of the data x.
Optimization for Machine Learning: SGD
• Why is stochastic gradient important for machine learning?
Full batch (entire data set) and mini-batch. SGD corresponds to gradient descent with the objective approximated by an average over a random subsample of the data set.
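As a concrete sketch of this minibatch approximation (the synthetic least-squares data, batch size, and learning rate below are illustrative assumptions, not from the talk):

```python
import numpy as np

# Minimal minibatch-SGD sketch on a least-squares loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # data x
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=1000)   # labels y

def minibatch_grad(w, batch_size=32):
    """Gradient of the loss averaged over a random subsample of the data."""
    idx = rng.integers(0, X.shape[0], size=batch_size)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size

w = np.zeros(5)
h = 0.05                                       # learning rate
for k in range(2000):
    w -= h * minibatch_grad(w)

print(np.linalg.norm(w - w_true))              # small residual
```

Each step touches only 32 of the 1000 examples, which is the memory/compute trade-off the next slides discuss.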
Stochastic Approximation
• The minibatch stochastic gradient is harder to analyze
• Instead make the Stochastic Approximation assumption
SGD algorithms
• Why not Newton's method?
• Small-scale numerical methods: enough memory to solve an N×N linear system (e.g. Newton's method). Solve in a few iterations.
• SGD Trade-off: lots of iterations (millions) of a slow but memory-cheap algorithm.
SGD convergence rate
A number of influential variance reduction algorithms speed up SGD:
• SAG (Schmidt, Le Roux, Bach, 2013), *Lagrange Prize
• SAGA (Defazio, Bach, Lacoste-Julien, 2014)
• SVRG (Johnson, Zhang, 2013)
However, in many modern applications, variance reduction does not help.
SGD convergence: for a $c/k$ rate, decreasing the objective gap by a factor of 10 takes roughly 10X as many iterations, so improving the rate constant $c$ by 10X can reduce the iteration count by 10X.
Effective heuristic: “momentum”. Not theoretically understood.
Nesterov’s accelerated (full) gradient descent
https://distill.pub/2017/momentum/
Accelerated gradient descent: same budget, faster convergence.
Gradient descent:
$$x_{k+1} = x_k - h\nabla f(x_k)$$
Remark: there are two main A-GD algorithms, corresponding to the convex and strongly convex cases. We focus on the convex case to simplify the presentation; the strongly convex case is also covered.
Heuristic: momentum term remembers old gradients, overshoots instead of getting stuck.
Motivation: A-GD and SGD Algorithms
$$x_{k+1} = y_k - \frac{1}{L}\nabla f(y_k), \qquad y_{k+1} = x_{k+1} + \beta_k(x_{k+1} - x_k)$$
$$\beta_k = \frac{k}{k+1} \text{ (convex)}, \qquad \beta_k = \frac{\sqrt{C}-1}{\sqrt{C}+1} \text{ (strongly convex)}$$
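The two momentum choices above can be sketched as follows; the quadratic test problem with L = 10, µ = 1 (so condition number C = L/µ = 10) is an assumption for illustration:

```python
import numpy as np

# A-GD sketch on f(x) = 0.5 x^T A x, with L and mu the extreme eigenvalues of A.
A = np.diag([1.0, 10.0])
grad = lambda z: A @ z
L, mu = 10.0, 1.0
C = L / mu

def agd(beta_fn, steps=200):
    x = np.array([1.0, 1.0])
    y = x.copy()
    for k in range(steps):
        x_new = y - grad(y) / L                    # gradient step from y
        y = x_new + beta_fn(k) * (x_new - x)       # momentum extrapolation
        x = x_new
    return x

x_cvx = agd(lambda k: k / (k + 1))                         # convex choice
x_sc = agd(lambda k: (np.sqrt(C) - 1) / (np.sqrt(C) + 1))  # strongly convex choice
print(np.linalg.norm(x_cvx), np.linalg.norm(x_sc))
```

The strongly convex variant converges linearly, while the convex variant's increasing momentum gives the slower sublinear decay.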
A-GD convex, strongly convex case
$$x_{k+1} = x_k - \frac{h_k}{L}\, g(x_k), \qquad g = \nabla f + e$$
$$h_k = \frac{2\sqrt{C}}{k} \text{ (strongly convex)}, \qquad h_k = \frac{c}{k^{1/2}} \text{ (convex)}$$
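A minimal sketch of SGD with the noisy gradient g = ∇f + e and the two schedules above, on an assumed toy problem f(x) = ½x² (so L = µ = C = 1); the noise level σ and the constant c are illustrative choices:

```python
import numpy as np

# SGD with additive gradient noise and a decreasing learning rate schedule.
rng = np.random.default_rng(1)
L, C, sigma, c = 1.0, 1.0, 0.1, 0.5

def sgd(h_fn, steps=5000):
    x = 1.0
    for k in range(1, steps + 1):
        g = x + sigma * rng.normal()       # g = grad f(x) + e
        x -= (h_fn(k) / L) * g
    return x

x_sc = sgd(lambda k: 2 * np.sqrt(C) / k)   # strongly convex schedule, ~1/k
x_c = sgd(lambda k: c / np.sqrt(k))        # convex schedule, ~1/sqrt(k)
print(abs(x_sc), abs(x_c))                 # both near the minimizer 0
```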
SGD convex, strongly convex case
Heuristic A-SGD:
Q: Can we find parameters to achieve faster convergence? We need to understand A-GD in a way that generalizes to the stochastic case.
Heuristic: derive accelerated-GD algorithm
double variables and introduce new speed parameter
$$v_{k+1} = v_k - \frac{1}{wL}\nabla f(v_k)$$
(i) Couple equations
(ii) reduce to single gradient evaluation at convex combination
$$x_{k+1} = x_k - \frac{1}{L}\nabla f(x_k)$$
Second equation is unstable.
Derivation (step 2)
Case 2: modify derivation
Accelerated-SGD Algorithm
In the sequel, we generalize the Lyapunov analysis for A-GD to the stochastic case. The analysis will give the best constant and yield the following A-SGD algorithm.
with an improved convergence rate constant (TBD)
visualize algorithm in 2d (full gradients)
• after one iteration, x and v are in the shallow valley.
• (when properly tuned) the faster v variable pulls x along faster than GD.
comparison GD and A-GD: (same axes)
visualize algorithm in 2d (stochastic gradients)
Top: comparison of SGD and A-SGD (same axes). Bottom: after 20 iterations, the optimality gap is 100X smaller. After 20,000 iterations, SGD reaches where A-SGD was at k = 20.
visualize algorithm in 2d (stochastic gradients)
• Note: the v variable moves faster, which allows x to average
• the linear term pulls x towards v
Variable learning rate: Convergence for the last iterate
Define $G^2$, a bound on the stochastic gradient: $\mathbb{E}[\hat g^2] \le G^2$.
Remark: $G^2 = L^2D^2 + \sigma^2$, where $D$ is the diameter of the domain.
• Strongly convex case: learning rate $h_k = O\left(\frac{1}{k}\right)$.
- Previous results: Nemirovski, Juditsky, Lan, and Shapiro (2009), Shamir, Zhang (2013), Jain, Kakade, Netrapalli, Sidford (2018): optimal rate $O\left(\frac{1}{k}\right)$ with constants depending on $G^2$, $D$ and $\mu$.
- Our result: $O\left(\frac{1}{k}\right)$ rate with constants independent of the L-smoothness bound of the gradient (depends only on $\sigma^2$ and $\mu$).
• Convex case:
- Shamir, Zhang (2013): optimal rate order $\frac{\log(k)}{\sqrt{k}}$, with a rate constant that depends on $G^2$ and $D$. Learning rate: $h_k = O\left(\frac{1}{\sqrt{k}}\right)$.
- Jain, Nagaraj, Netrapalli (2019) remove the log factor, assuming the number of iterations is decided in advance.
- Our result: $O(\log(k)/\sqrt{k})$ rate for the last iterate, with a constant which depends on $\sigma^2$ but is independent of the L-smoothness. New learning rate schedule.
Maxime Laborde, Mokaplan, April 2020
Decreasing Learning Rate Results
Strongly Convex A-SGD Rate:
Comparison with previous results
Learning rate: $h_k = O\left(\frac{1}{k}\right)$
Convergence Results
convex case
strongly convex case
Interpreting the improved rate: for both algorithms, it is simply an improvement to the rate constant. This is a nice theoretical result and a nice example of a continuous-time method. For it to be of practical use, we desire:
- a practical algorithm, faster in practice (saw this for simple examples)
- one that remains faster when the theory no longer applies
- e.g. nonconvex examples, e.g. training DNNs
Convergence rate $\mathbb{E}[f(x_k)] - f^*$ after $k$ steps: $G^2$ is a bound on $\mathbb{E}[g(x)^2]$, and $\sigma^2$ is the variance. $E_0$ is the initial value of the Lyapunov function.
Improvement to the rate constant: independent of L.
$$G^2 = L^2 + \sigma^2$$
Comparison with previous results

Shamir, Zhang (2013): learning rate $h_k = \frac{c_2}{\sqrt{k}}$, rate $\left(\frac{D^2}{c_2} + c_2 G^2\right)\frac{2+\log(k)}{\sqrt{k}}$

Acc. SGD: learning rate $h_k = \frac{c}{k^{3/4}}$, rate $\left(\frac{E_0}{16c^2} + c^2\sigma^2(1+\log(k))\right)\frac{1}{(k^{1/4}-1)^2}$

Table: Convergence rate $\mathbb{E}[f(x_k)] - f^*$ after $k$ steps. $G^2$ is a bound on $\mathbb{E}[g^2]$; $E_0$ is the initial value of the Lyapunov function.
Interpreting the improved rate: remove the dependence on L-smoothness from the rate. This is a nice theoretical result and a nice example of a continuous-time method. For it to be of practical use, we desire:
• a practical algorithm, faster in practice (saw this for simple examples)
Convex A-SGD algorithm Rate:
$$G^2 = L^2 + \sigma^2$$
Numerical Results (preliminary):
• Quadratic with synthetic noise
• Logistic Regression with minibatch stochastic gradients.
• How much difference should the rate constant make in this case?
• Note: with exponential convergence, the rate constant is not important.
• But with a $c/k$ rate, the constant can be important (since, e.g., a factor of 10 can mean 10X more iterations).
Quadratic, σ = 0.05, C = 4
• Small condition number. Long time.
• Acc Convex is 10X better than SGD
• Acc-Strongly convex algorithm is 10X better than Acc Convex
Quadratic C=1000
• 40,000 iterations:
• A-SGD-C >> A-SGD-SC >> SGD.
• A-SGD Convex: very fast initial drop (40 iterations)
Logistic Regression 1: µ = 0.2
Other Algorithms:
• SGD: 1/k learning rate (tuned)
• heuristic Nesterov: constant β; the learning rate drops by a constant factor every few epochs, tuned.
20,000 iterations
Logistic Regression 2: µ = 0.05
• Statement of convergence rates
• Idea of proof
Strong convexity, L-smoothness
• We use definitions equivalent to standard ones of strong convexity and L-smoothness
• Example: in case where f is quadratic, the constants correspond to the smallest and largest eigenvalues of the Hessian of f.
Lyapunov approach to convergence rates
• From the previous analysis in the full gradient case, we have the following inequality with β = 0.
• Stochastic case: non-zero β (as in the SGD analysis).
• SGD: balance the terms to get a decreasing learning rate.
• SGD: β has a factor of L in it; for A-SGD it has just a factor of σ, giving a better rate.
ODE Lyapunov approach
Another ODE for Nesterov’s Method
Our starting point is a perturbation of (S-A-ODE):
$$\begin{cases} \dot x = \frac{2}{t}(v - x) - \frac{1}{\sqrt{L}}\nabla f(x) \\ \dot v = -\frac{t}{2}\nabla f(x) \end{cases} \qquad \text{(1st-ODE)}$$
The system (1st-ODE) is equivalent to the following ODE:
$$\ddot x + \frac{3}{t}\dot x + \nabla f(x) = -\frac{1}{\sqrt{L}}\left(D^2 f(x)\cdot \dot x + \frac{1}{t}\nabla f(x)\right) \qquad \text{(H-ODE)}$$
which has an additional Hessian damping term with coefficient $1/\sqrt{L}$.
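A forward-Euler sketch of (1st-ODE) on an assumed 1d quadratic, started at t₀ = 1 to avoid the 2/t singularity (step size and constants are illustrative):

```python
import numpy as np

# Integrate (1st-ODE) for f(x) = 0.5*L*x^2 in 1d by forward Euler.
L = 10.0
grad = lambda x: L * x
dt, t = 0.01, 1.0
x, v = 1.0, 1.0
for _ in range(5000):
    dx = (2.0 / t) * (v - x) - grad(x) / np.sqrt(L)
    dv = -(t / 2.0) * grad(x)
    x, v = x + dt * dx, v + dt * dv
    t += dt
print(abs(x))   # x(t) approaches the minimizer 0
```

The convex Lyapunov bound $t^2(f(x) - f^*) \le E_0$ predicts $|x|$ of order $1/t$ here, consistent with what this integration shows.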
• 2nd order ODEs with Hessian damping: Alvarez, Attouch, Bolte, Redont (2002), Attouch, Peypouquet, Redont (2016)
• Shi, Du, Jordan, Su (2018) introduced a family of high-resolution second-order ODEs which also lead to Nesterov's method; (H-ODE) is the special case with coefficient $1/\sqrt{L}$.
Strongly Convex Case
Stochastic gradient version
Learning rate: $h_k > 0$. Define
$$\begin{cases} x_{k+1} - x_k = \lambda_k(v_k - x_k) - \frac{h_k}{\sqrt{L}}(\nabla f(y_k) + e_k) \\ v_{k+1} - v_k = \lambda_k(x_k - v_k) - \frac{h_k}{\sqrt{\mu}}(\nabla f(y_k) + e_k) \\ y_k = (1-\lambda_k)x_k + \lambda_k v_k, \qquad \lambda_k = \frac{h_k\sqrt{\mu}}{1 + h_k\sqrt{\mu}} \end{cases} \qquad \text{(FE-SC)}$$
Same algorithm as before, but now:
• variable learning rate
• stochastic gradient
• to simplify: replace $\sqrt{\mu}$ by $\frac{\sqrt{\mu}}{1 + h_k\sqrt{\mu}}$
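One possible implementation sketch of (FE-SC); the quadratic test problem, noise model, stand-in value for $E_0^{ac,sc}$, and the cap $h_k \le 1/\sqrt{L}$ (added here for stability, not stated on the slide) are all assumptions:

```python
import numpy as np

# (FE-SC) with stochastic gradient grad f(y_k) + e_k and a decaying rate.
rng = np.random.default_rng(2)
A = np.diag([1.0, 10.0])          # f(x) = 0.5 x^T A x, so mu = 1, L = 10
L, mu, sigma = 10.0, 1.0, 0.1
grad = lambda z: A @ z

x = v = np.array([1.0, 1.0])
E0 = 6.5                          # rough stand-in for the initial Lyapunov value
for k in range(2000):
    h = min(1 / np.sqrt(L), 2 * np.sqrt(mu) / (mu * k + 4 * sigma**2 / E0))
    lam = h * np.sqrt(mu) / (1 + h * np.sqrt(mu))
    y = (1 - lam) * x + lam * v
    g = grad(y) + sigma * rng.normal(size=2)   # stochastic gradient
    x, v = x + lam * (v - x) - (h / np.sqrt(L)) * g, \
           v + lam * (x - v) - (h / np.sqrt(mu)) * g
print(np.linalg.norm(x))          # near the minimizer x* = 0
```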
Convergence result (strongly convex case)
Define the continuous time Lyapunov function
$$E^{ac,sc}(x, v) := f(x) - f^* + \frac{\mu}{2}|v - x^*|^2$$
Discrete time Lyapunov function: $E^{ac,sc}_k := E^{ac,sc}(x_k, v_k)$.
Proposition (L., Oberman, 2020)
Assume $E^{ac,sc}_0 \le \frac{1}{\sqrt{L}}$. If
$$h_k := \frac{2\sqrt{\mu}}{\mu k + 4\sigma^2 (E^{ac,sc}_0)^{-1}},$$
then
$$\mathbb{E}[f(x_k)] - f^* \le \frac{4\sigma^2}{\mu k + 4\sigma^2 (E^{ac,sc}_0)^{-1}}.$$
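A noise-free sanity check of the Lyapunov property behind this proposition: with $e_k = 0$ and a constant $h \le 1/\sqrt{L}$, $E^{ac,sc}_k$ should be non-increasing along (FE-SC). The quadratic test problem and constants are assumptions:

```python
import numpy as np

# Check that E(x,v) = f(x) - f* + (mu/2)|v - x*|^2 decreases along (FE-SC)
# when the gradient noise is zero and h <= 1/sqrt(L) is constant.
A = np.diag([1.0, 10.0])
L, mu = 10.0, 1.0
f = lambda z: 0.5 * z @ A @ z     # minimizer x* = 0, f* = 0
grad = lambda z: A @ z

h = 0.1                           # constant rate, below 1/sqrt(L)
lam = h * np.sqrt(mu) / (1 + h * np.sqrt(mu))
x = v = np.array([1.0, 1.0])
E_vals = []
for _ in range(300):
    E_vals.append(f(x) + 0.5 * mu * v @ v)
    y = (1 - lam) * x + lam * v
    g = grad(y)
    x, v = x + lam * (v - x) - (h / np.sqrt(L)) * g, \
           v + lam * (x - v) - (h / np.sqrt(mu)) * g
print(E_vals[0], E_vals[-1])
```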
Accelerated-SGD Algorithm
From (1st-ODE) to Nesterov
Define the learning rate $h_k > 0$ and a discretization of the total time $t_k$. The time discretization of (1st-ODE), with gradients evaluated at
$$y_k = \left(1 - \frac{2h_k}{t_k}\right)x_k + \frac{2h_k}{t_k}v_k,$$
is given by
$$\begin{cases} x_{k+1} - x_k = \frac{2h_k}{t_k}(v_k - x_k) - \frac{h_k}{\sqrt{L}}\nabla f(y_k) \\ v_{k+1} - v_k = -\frac{h_k t_k}{2}\nabla f(y_k) \end{cases} \qquad \text{(FE-C)}$$
Proposition
The discretization of (1st-ODE) given by (FE-C) with $h_k = h = \frac{1}{\sqrt{L}}$ and $t_k = h(k+2)$ is equivalent to the standard Nesterov's method (C-Nest).
Fix the learning rate, get Nesterov's algorithm.
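A numerical check of this equivalence (the quadratic test problem is an assumption): with $h = 1/\sqrt{L}$ and $t_k = h(k+2)$, eliminating $v_k$ from (FE-C) gives Nesterov's method in x/y form, where this time grid works out to momentum coefficient $k/(k+3)$; the iterates agree to machine precision.

```python
import numpy as np

# Compare (FE-C) iterates with Nesterov's method on a quadratic.
A = np.diag([1.0, 10.0])
grad = lambda z: A @ z
L = 10.0
h = 1.0 / np.sqrt(L)

# (FE-C) iteration with t_k = h*(k+2)
x, v = np.array([1.0, 1.0]), np.array([1.0, 1.0])
xs_fec = []
for k in range(50):
    tk = h * (k + 2)
    y = (1 - 2 * h / tk) * x + (2 * h / tk) * v
    x, v = x + (2 * h / tk) * (v - x) - (h / np.sqrt(L)) * grad(y), \
           v - (h * tk / 2) * grad(y)
    xs_fec.append(x.copy())

# Nesterov's method in x/y form with momentum k/(k+3)
x_old = np.array([1.0, 1.0])
y = x_old.copy()
xs_nest = []
for k in range(50):
    x_new = y - grad(y) / L
    y = x_new + (k / (k + 3)) * (x_new - x_old)
    x_old = x_new
    xs_nest.append(x_new.copy())

max_diff = max(np.linalg.norm(a - b) for a, b in zip(xs_fec, xs_nest))
print(max_diff)
```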
Accelerated-SGD Algorithm: Convergence result (convex case)
Define the continuous time Lyapunov function
$$E^{ac,c}(t, x, v) := t^2(f(x) - f^*) + 2|v - x^*|^2$$
Define the discrete time Lyapunov function $E^{ac,c}_k$ by
$$E^{ac,c}_k = E^{ac,c}(t_{k-1}, x_k, v_k)$$
Proposition (L., Oberman, 2020)
Assume $h_k := \frac{c}{k^{\alpha}}\frac{1}{\sqrt{L}}$ and $t_k = \sum_{i=1}^{k} h_i$. Then for $\alpha = \frac{3}{4}$,
$$\mathbb{E}[f(x_k)] - f^* \le \frac{\frac{1}{16c^2}E_0 + c^2\sigma^2(1+\log(k))}{(k^{1/4}-1)^2}$$
Thanks!