From an ODE for Nesterov's Method to Accelerated Stochastic Gradient Descent
Adam Oberman, with Maxime Laborde, Math and Stats, McGill
Stochastic Gradient Descent definition: Math vs. ML
• Before 2010: SGD means Stochastic Differential Equations and Brownian Motion
"The data sizes have grown faster than the speed of processors … methods limited by the computing time rather than the sample size … unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems." [Large-Scale Machine Learning with Stochastic Gradient Descent, Léon Bottou]
• Since around 2010: SGD algorithm for training ML models
w: parameters of the model f. L: loss function measuring how well the model fits the labels y of the data x.
Optimization for Machine Learning: SGD
• Why is stochastic gradient important for machine learning?
Full batch (entire data set) and mini-batch. SGD corresponds to gradient descent with the objective approximated by an average over a random subsample of the data set.
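As a concrete sketch of this minibatch approximation (the synthetic least-squares data, batch size, and learning rate below are illustrative assumptions, not from the talk):

```python
import numpy as np

# Minimal minibatch-SGD sketch on a least-squares loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # data x
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=1000)   # labels y

def minibatch_grad(w, batch_size=32):
    """Gradient of the loss averaged over a random subsample of the data."""
    idx = rng.integers(0, X.shape[0], size=batch_size)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size

w = np.zeros(5)
h = 0.05                                       # learning rate
for k in range(2000):
    w -= h * minibatch_grad(w)

print(np.linalg.norm(w - w_true))              # small residual
```

Each step touches only 32 of the 1000 examples, which is the memory/compute trade-off the next slides discuss.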
Stochastic Approximation
• The minibatch stochastic gradient is harder to analyze
• Instead make the Stochastic Approximation assumption
SGD algorithms
• Why not Newton's method?
• Small-scale numerical methods: enough memory to solve an N×N linear system (e.g. Newton's method). Solve in a few iterations.
• SGD Trade-off: lots of iterations (millions) of a slow but memory-cheap algorithm.
SGD convergence rate
A number of influential variance reduction algorithms speed up SGD:
• SAG (Schmidt, Le Roux, Bach, 2013), *Lagrange Prize
• SAGA (Defazio, Bach, Lacoste-Julien, 2014)
• SVRG (Johnson, Zhang, 2013)
However, in many modern applications, variance reduction does not help.
SGD convergence: for a $c/k$ rate, decreasing the objective gap by a factor of 10 takes roughly 10X as many iterations, so improving the rate constant $c$ by 10X can reduce the iteration count by 10X.
Effective heuristic: “momentum”. Not theoretically understood.
Nesterov’s accelerated (full) gradient descent
https://distill.pub/2017/momentum/
Accelerated gradient descent: same budget, faster convergence.
Gradient descent:
$$x_{k+1} = x_k - h\nabla f(x_k)$$
Remark: there are two main A-GD algorithms, corresponding to the convex and strongly convex cases. We focus on the convex case to simplify the presentation; the strongly convex case is also covered.
Heuristic: momentum term remembers old gradients, overshoots instead of getting stuck.
Motivation: A-GD and SGD Algorithms
$$x_{k+1} = y_k - \frac{1}{L}\nabla f(y_k), \qquad y_{k+1} = x_{k+1} + \beta_k(x_{k+1} - x_k)$$
$$\beta_k = \frac{k}{k+1} \text{ (convex)}, \qquad \beta_k = \frac{\sqrt{C}-1}{\sqrt{C}+1} \text{ (strongly convex)}$$
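The two momentum choices above can be sketched as follows; the quadratic test problem with L = 10, µ = 1 (so condition number C = L/µ = 10) is an assumption for illustration:

```python
import numpy as np

# A-GD sketch on f(x) = 0.5 x^T A x, with L and mu the extreme eigenvalues of A.
A = np.diag([1.0, 10.0])
grad = lambda z: A @ z
L, mu = 10.0, 1.0
C = L / mu

def agd(beta_fn, steps=200):
    x = np.array([1.0, 1.0])
    y = x.copy()
    for k in range(steps):
        x_new = y - grad(y) / L                    # gradient step from y
        y = x_new + beta_fn(k) * (x_new - x)       # momentum extrapolation
        x = x_new
    return x

x_cvx = agd(lambda k: k / (k + 1))                         # convex choice
x_sc = agd(lambda k: (np.sqrt(C) - 1) / (np.sqrt(C) + 1))  # strongly convex choice
print(np.linalg.norm(x_cvx), np.linalg.norm(x_sc))
```

The strongly convex variant converges linearly, while the convex variant's increasing momentum gives the slower sublinear decay.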
A-GD convex, strongly convex case
$$x_{k+1} = x_k - \frac{h_k}{L}\, g(x_k), \qquad g = \nabla f + e$$
$$h_k = \frac{2\sqrt{C}}{k} \text{ (strongly convex)}, \qquad h_k = \frac{c}{k^{1/2}} \text{ (convex)}$$
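A minimal sketch of SGD with the noisy gradient g = ∇f + e and the two schedules above, on an assumed toy problem f(x) = ½x² (so L = µ = C = 1); the noise level σ and the constant c are illustrative choices:

```python
import numpy as np

# SGD with additive gradient noise and a decreasing learning rate schedule.
rng = np.random.default_rng(1)
L, C, sigma, c = 1.0, 1.0, 0.1, 0.5

def sgd(h_fn, steps=5000):
    x = 1.0
    for k in range(1, steps + 1):
        g = x + sigma * rng.normal()       # g = grad f(x) + e
        x -= (h_fn(k) / L) * g
    return x

x_sc = sgd(lambda k: 2 * np.sqrt(C) / k)   # strongly convex schedule, ~1/k
x_c = sgd(lambda k: c / np.sqrt(k))        # convex schedule, ~1/sqrt(k)
print(abs(x_sc), abs(x_c))                 # both near the minimizer 0
```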
SGD convex, strongly convex case
Heuristic A-SGD:
Q: Can we find parameters to achieve faster convergence? We need to understand A-GD in a way that generalizes to the stochastic case.
Heuristic: derive accelerated-GD algorithm
double variables and introduce new speed parameter
$$v_{k+1} = v_k - \frac{1}{wL}\nabla f(v_k)$$
(i) Couple equations
(ii) reduce to single gradient evaluation at convex combination
$$x_{k+1} = x_k - \frac{1}{L}\nabla f(x_k)$$
Second equation is unstable.
Derivation (step 2)
Case 2: modify derivation
Accelerated-SGD Algorithm
In the sequel, we generalize the Lyapunov analysis for A-GD to the stochastic case. The analysis will give the best constant and yield the following A-SGD algorithm.
with an improved convergence rate constant (TBD)
visualize algorithm in 2d (full gradients)
• after one iteration, x and v are in the shallow valley.
• (when properly tuned) the faster v variable pulls x along faster than GD.
comparison GD and A-GD: (same axes)
visualize algorithm in 2d (stochastic gradients)
Top: comparison of SGD and A-SGD (same axes). Bottom: after 20 iterations, the optimality gap is 100X smaller. After 20,000 iterations, SGD reaches where A-SGD was at k = 20.
visualize algorithm in 2d (stochastic gradients)
• Note: the v variable moves faster, which allows x to average
• the linear term pulls x towards v
Variable learning rate: Convergence for the last iterate
Define $G^2$, a bound on the stochastic gradient: $\mathbb{E}[\hat g^2] \le G^2$.
Remark: $G^2 = L^2D^2 + \sigma^2$, where $D$ is the diameter of the domain.
• Strongly convex case: learning rate $h_k = O\left(\frac{1}{k}\right)$.
- Previous results: Nemirovski, Juditsky, Lan, and Shapiro (2009), Shamir, Zhang (2013), Jain, Kakade, Netrapalli, Sidford (2018): optimal rate $O\left(\frac{1}{k}\right)$ with constants depending on $G^2$, $D$ and $\mu$.
- Our result: $O\left(\frac{1}{k}\right)$ rate with constants independent of the L-smoothness bound of the gradient (depends only on $\sigma^2$ and $\mu$).
• Convex case:
- Shamir, Zhang (2013): optimal rate order $\frac{\log(k)}{\sqrt{k}}$, with a rate constant that depends on $G^2$ and $D$. Learning rate: $h_k = O\left(\frac{1}{\sqrt{k}}\right)$.
- Jain, Nagaraj, Netrapalli (2019) remove the log factor, assuming the number of iterations is decided in advance.
- Our result: $O(\log(k)/\sqrt{k})$ rate for the last iterate, with a constant which depends on $\sigma^2$ but is independent of the L-smoothness. New learning rate schedule.
Maxime Laborde, Mokaplan, April 2020
Decreasing Learning Rate Results
Strongly Convex A-SGD Rate:
Comparison with previous results
Learning rate: $h_k = O\left(\frac{1}{k}\right)$
Convergence Results
convex case
strongly convex case
Interpreting the improved rate: for both algorithms, it is simply an improvement to the rate constant. This is a nice theoretical result and a nice example of a continuous-time method. For it to be of practical use, we desire:
- a practical algorithm, faster in practice (saw this for simple examples)
- one that remains faster when the theory no longer applies
- e.g. nonconvex examples, e.g. training DNNs
Convergence rate $\mathbb{E}[f(x_k)] - f^*$ after $k$ steps: $G^2$ is a bound on $\mathbb{E}[g(x)^2]$, and $\sigma^2$ is the variance. $E_0$ is the initial value of the Lyapunov function.
Improvement to the rate constant: independent of L.
$$G^2 = L^2 + \sigma^2$$
Comparison with previous results

Shamir, Zhang (2013): learning rate $h_k = \frac{c_2}{\sqrt{k}}$, rate $\left(\frac{D^2}{c_2} + c_2 G^2\right)\frac{2+\log(k)}{\sqrt{k}}$

Acc. SGD: learning rate $h_k = \frac{c}{k^{3/4}}$, rate $\left(\frac{E_0}{16c^2} + c^2\sigma^2(1+\log(k))\right)\frac{1}{(k^{1/4}-1)^2}$

Table: Convergence rate $\mathbb{E}[f(x_k)] - f^*$ after $k$ steps. $G^2$ is a bound on $\mathbb{E}[g^2]$; $E_0$ is the initial value of the Lyapunov function.
Interpreting the improved rate: remove the dependence on L-smoothness from the rate. This is a nice theoretical result and a nice example of a continuous-time method. For it to be of practical use, we desire:
• a practical algorithm, faster in practice (saw this for simple examples)
Convex A-SGD algorithm Rate:
$$G^2 = L^2 + \sigma^2$$
Numerical Results (preliminary):
• Quadratic with synthetic noise
• Logistic Regression with minibatch stochastic gradients.
• How much difference should the rate constant make in this case?
• Note: with exponential convergence, the rate constant is not important.
• But with a $c/k$ rate, the constant can be important (since, e.g., a factor of 10 can mean 10X more iterations).
Quadratic, σ = 0.05, C = 4
• Small condition number. Long time.
• Acc Convex is 10X better than SGD
• Acc-Strongly convex algorithm is 10X better than Acc Convex
Quadratic C=1000
• 40,000 iterations:
• A-SGD-C >> A-SGD-SC >> SGD.
• A-SGD Convex: very fast initial drop (40 iterations)
Logistic Regression 1: µ = 0.2
Other Algorithms:
• SGD: 1/k learning rate (tuned)
• heuristic Nesterov: constant β; the learning rate drops by a constant factor every few epochs, tuned.
20,000 iterations
Logistic Regression 2: µ = 0.05
• Statement of convergence rates
• Idea of proof
Strong convexity, L-smoothness
• We use definitions equivalent to standard ones of strong convexity and L-smoothness
• Example: in case where f is quadratic, the constants correspond to the smallest and largest eigenvalues of the Hessian of f.
Lyapunov approach to convergence rates
• From the previous analysis in the full gradient case, we have the following inequality with β = 0.
• Stochastic case: non-zero β (as in the SGD analysis).
• SGD: balance the terms to get a decreasing learning rate.
• SGD: β has a factor of L in it; for A-SGD it has just a factor of σ, giving a better rate.
ODE Lyapunov approach
Another ODE for Nesterov’s Method
Our starting point is a perturbation of (S-A-ODE):
$$\begin{cases} \dot x = \frac{2}{t}(v - x) - \frac{1}{\sqrt{L}}\nabla f(x) \\ \dot v = -\frac{t}{2}\nabla f(x) \end{cases} \qquad \text{(1st-ODE)}$$
The system (1st-ODE) is equivalent to the following ODE:
$$\ddot x + \frac{3}{t}\dot x + \nabla f(x) = -\frac{1}{\sqrt{L}}\left(D^2 f(x)\cdot \dot x + \frac{1}{t}\nabla f(x)\right) \qquad \text{(H-ODE)}$$
which has an additional Hessian damping term with coefficient $1/\sqrt{L}$.
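A forward-Euler sketch of (1st-ODE) on an assumed 1d quadratic, started at t₀ = 1 to avoid the 2/t singularity (step size and constants are illustrative):

```python
import numpy as np

# Integrate (1st-ODE) for f(x) = 0.5*L*x^2 in 1d by forward Euler.
L = 10.0
grad = lambda x: L * x
dt, t = 0.01, 1.0
x, v = 1.0, 1.0
for _ in range(5000):
    dx = (2.0 / t) * (v - x) - grad(x) / np.sqrt(L)
    dv = -(t / 2.0) * grad(x)
    x, v = x + dt * dx, v + dt * dv
    t += dt
print(abs(x))   # x(t) approaches the minimizer 0
```

The convex Lyapunov bound $t^2(f(x) - f^*) \le E_0$ predicts $|x|$ of order $1/t$ here, consistent with what this integration shows.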
• 2nd order ODEs with Hessian damping: Alvarez, Attouch, Bolte, Redont (2002), Attouch, Peypouquet, Redont (2016)
• Shi, Du, Jordan, Su (2018) introduced a family of high-resolution second-order ODEs which also lead to Nesterov's method; (H-ODE) is the special case with coefficient $1/\sqrt{L}$.
Strongly Convex Case
Stochastic gradient version
Learning rate: $h_k > 0$. Define
$$\begin{cases} x_{k+1} - x_k = \lambda_k(v_k - x_k) - \frac{h_k}{\sqrt{L}}(\nabla f(y_k) + e_k) \\ v_{k+1} - v_k = \lambda_k(x_k - v_k) - \frac{h_k}{\sqrt{\mu}}(\nabla f(y_k) + e_k) \\ y_k = (1-\lambda_k)x_k + \lambda_k v_k, \qquad \lambda_k = \frac{h_k\sqrt{\mu}}{1 + h_k\sqrt{\mu}} \end{cases} \qquad \text{(FE-SC)}$$
Same algorithm as before, but now:
• variable learning rate
• stochastic gradient
• to simplify: replace $\sqrt{\mu}$ by $\frac{\sqrt{\mu}}{1 + h_k\sqrt{\mu}}$
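One possible implementation sketch of (FE-SC); the quadratic test problem, noise model, stand-in value for $E_0^{ac,sc}$, and the cap $h_k \le 1/\sqrt{L}$ (added here for stability, not stated on the slide) are all assumptions:

```python
import numpy as np

# (FE-SC) with stochastic gradient grad f(y_k) + e_k and a decaying rate.
rng = np.random.default_rng(2)
A = np.diag([1.0, 10.0])          # f(x) = 0.5 x^T A x, so mu = 1, L = 10
L, mu, sigma = 10.0, 1.0, 0.1
grad = lambda z: A @ z

x = v = np.array([1.0, 1.0])
E0 = 6.5                          # rough stand-in for the initial Lyapunov value
for k in range(2000):
    h = min(1 / np.sqrt(L), 2 * np.sqrt(mu) / (mu * k + 4 * sigma**2 / E0))
    lam = h * np.sqrt(mu) / (1 + h * np.sqrt(mu))
    y = (1 - lam) * x + lam * v
    g = grad(y) + sigma * rng.normal(size=2)   # stochastic gradient
    x, v = x + lam * (v - x) - (h / np.sqrt(L)) * g, \
           v + lam * (x - v) - (h / np.sqrt(mu)) * g
print(np.linalg.norm(x))          # near the minimizer x* = 0
```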
Convergence result (strongly convex case)
Define the continuous time Lyapunov function
$$E^{ac,sc}(x, v) := f(x) - f^* + \frac{\mu}{2}|v - x^*|^2$$
Discrete time Lyapunov function: $E^{ac,sc}_k := E^{ac,sc}(x_k, v_k)$.
Proposition (L., Oberman, 2020)
Assume $E^{ac,sc}_0 \le \frac{1}{\sqrt{L}}$. If
$$h_k := \frac{2\sqrt{\mu}}{\mu k + 4\sigma^2 (E^{ac,sc}_0)^{-1}},$$
then
$$\mathbb{E}[f(x_k)] - f^* \le \frac{4\sigma^2}{\mu k + 4\sigma^2 (E^{ac,sc}_0)^{-1}}.$$
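A noise-free sanity check of the Lyapunov property behind this proposition: with $e_k = 0$ and a constant $h \le 1/\sqrt{L}$, $E^{ac,sc}_k$ should be non-increasing along (FE-SC). The quadratic test problem and constants are assumptions:

```python
import numpy as np

# Check that E(x,v) = f(x) - f* + (mu/2)|v - x*|^2 decreases along (FE-SC)
# when the gradient noise is zero and h <= 1/sqrt(L) is constant.
A = np.diag([1.0, 10.0])
L, mu = 10.0, 1.0
f = lambda z: 0.5 * z @ A @ z     # minimizer x* = 0, f* = 0
grad = lambda z: A @ z

h = 0.1                           # constant rate, below 1/sqrt(L)
lam = h * np.sqrt(mu) / (1 + h * np.sqrt(mu))
x = v = np.array([1.0, 1.0])
E_vals = []
for _ in range(300):
    E_vals.append(f(x) + 0.5 * mu * v @ v)
    y = (1 - lam) * x + lam * v
    g = grad(y)
    x, v = x + lam * (v - x) - (h / np.sqrt(L)) * g, \
           v + lam * (x - v) - (h / np.sqrt(mu)) * g
print(E_vals[0], E_vals[-1])
```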
Accelerated-SGD Algorithm
From (1st-ODE) to Nesterov
Define the learning rate $h_k > 0$ and a discretization of the total time $t_k$. The time discretization of (1st-ODE), with gradients evaluated at
$$y_k = \left(1 - \frac{2h_k}{t_k}\right)x_k + \frac{2h_k}{t_k}v_k,$$
is given by
$$\begin{cases} x_{k+1} - x_k = \frac{2h_k}{t_k}(v_k - x_k) - \frac{h_k}{\sqrt{L}}\nabla f(y_k) \\ v_{k+1} - v_k = -\frac{h_k t_k}{2}\nabla f(y_k) \end{cases} \qquad \text{(FE-C)}$$
Proposition
The discretization of (1st-ODE) given by (FE-C) with $h_k = h = \frac{1}{\sqrt{L}}$ and $t_k = h(k+2)$ is equivalent to the standard Nesterov's method (C-Nest).
Fix the learning rate, get Nesterov's algorithm.
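A numerical check of this equivalence (the quadratic test problem is an assumption): with $h = 1/\sqrt{L}$ and $t_k = h(k+2)$, eliminating $v_k$ from (FE-C) gives Nesterov's method in x/y form, where this time grid works out to momentum coefficient $k/(k+3)$; the iterates agree to machine precision.

```python
import numpy as np

# Compare (FE-C) iterates with Nesterov's method on a quadratic.
A = np.diag([1.0, 10.0])
grad = lambda z: A @ z
L = 10.0
h = 1.0 / np.sqrt(L)

# (FE-C) iteration with t_k = h*(k+2)
x, v = np.array([1.0, 1.0]), np.array([1.0, 1.0])
xs_fec = []
for k in range(50):
    tk = h * (k + 2)
    y = (1 - 2 * h / tk) * x + (2 * h / tk) * v
    x, v = x + (2 * h / tk) * (v - x) - (h / np.sqrt(L)) * grad(y), \
           v - (h * tk / 2) * grad(y)
    xs_fec.append(x.copy())

# Nesterov's method in x/y form with momentum k/(k+3)
x_old = np.array([1.0, 1.0])
y = x_old.copy()
xs_nest = []
for k in range(50):
    x_new = y - grad(y) / L
    y = x_new + (k / (k + 3)) * (x_new - x_old)
    x_old = x_new
    xs_nest.append(x_new.copy())

max_diff = max(np.linalg.norm(a - b) for a, b in zip(xs_fec, xs_nest))
print(max_diff)
```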
Accelerated-SGD Algorithm: Convergence result (convex case)
Define the continuous time Lyapunov function
$$E^{ac,c}(t, x, v) := t^2(f(x) - f^*) + 2|v - x^*|^2$$
Define the discrete time Lyapunov function $E^{ac,c}_k$ by
$$E^{ac,c}_k = E^{ac,c}(t_{k-1}, x_k, v_k)$$
Proposition (L., Oberman, 2020)
Assume $h_k := \frac{c}{k^{\alpha}}\frac{1}{\sqrt{L}}$ and $t_k = \sum_{i=1}^{k} h_i$. Then for $\alpha = \frac{3}{4}$,
$$\mathbb{E}[f(x_k)] - f^* \le \frac{\frac{1}{16c^2}E_0 + c^2\sigma^2(1+\log(k))}{(k^{1/4}-1)^2}$$
Thanks!