Adaptive Methods and Non-convex Optimization - cs.cornell.edu
Adaptive Methods and Non-convex Optimization
CS6787 Lecture 7 — Fall 2021
Adaptive learning rates
• So far, we’ve looked at update steps that look like
• Here, the learning rate/step size is fixed a priori for each iteration.
• What if we use a step size that varies depending on the model?
• This is the idea of an adaptive learning rate.
    w_{t+1} = w_t - α_t ∇f_t(w_t)
Example: Polyak’s step length
• This is a simple step size scheme for gradient descent that works when the optimal value is known.
• Can also use this with an estimated optimal value.
    α_k = (f(w_k) - f*) / ‖∇f(w_k)‖²
Intuition: Polyak’s step length
• Approximate the objective with a linear approximation at the current iterate.
• Choose the step size that makes the approximation equal to the known optimal value.
    f̂(w) = f(w_k) + (w - w_k)^T ∇f(w_k)

    f* = f̂(w_{k+1}) = f̂(w_k - α ∇f(w_k)) = f(w_k) - α ‖∇f(w_k)‖²

    ⇒ α = (f(w_k) - f*) / ‖∇f(w_k)‖²
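To make the scheme concrete, here is a minimal NumPy sketch of gradient descent with Polyak's step length. The names `f`, `grad_f`, and `f_star` are illustrative stand-ins for the objective, its gradient, and the known (or estimated) optimal value.

```python
import numpy as np

def polyak_gd(f, grad_f, f_star, w0, iters=100):
    """Gradient descent with Polyak's step length (a sketch).

    f, grad_f: objective and its gradient; f_star: known optimal value.
    """
    w = np.asarray(w0, dtype=float)
    for _ in range(iters):
        g = grad_f(w)
        gnorm2 = float(np.dot(g, g))
        if gnorm2 == 0.0:  # at a stationary point: stop
            break
        # alpha_k = (f(w_k) - f*) / ||grad f(w_k)||^2
        alpha = (f(w) - f_star) / gnorm2
        w = w - alpha * g
    return w
```

On a simple quadratic with f* = 0, each Polyak step halves the iterate, so the scheme converges without any hand-tuned learning rate.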
Example: Line search
• Idea: just choose the step size that minimizes the objective.
• Only works well for gradient descent, not SGD.
• Why?
  • SGD moves in random directions that don't always improve the objective.
  • Doing line search on the full objective is expensive relative to an SGD update.
    α_k = argmin_{α > 0} f(w_k - α ∇f(w_k))
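As a concrete sketch, the following takes one full-gradient descent step with a crude line search over a geometric grid of candidate step sizes. This is a simplification: practical line searches typically use backtracking with an Armijo condition rather than a fixed grid, and the function names here are illustrative.

```python
import numpy as np

def line_search_step(f, grad_f, w):
    """One gradient descent step with a simple grid line search.

    Evaluates f along the negative gradient at a geometric grid of
    step sizes and keeps the best one. Note that every candidate costs
    a full objective evaluation, which is why line search is expensive
    relative to a single SGD update.
    """
    w = np.asarray(w, dtype=float)
    g = grad_f(w)
    candidates = [2.0 ** (-k) for k in range(20)]  # 1, 1/2, 1/4, ...
    alpha = min(candidates, key=lambda a: f(w - a * g))
    return w - alpha * g
```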
Adaptive methods for SGD
• Several methods exist
  • AdaGrad
  • AdaDelta
  • RMSProp
  • Adam
• You’ll see Adam in one of next Wednesday’s papers
AdaGrad
Adaptive gradient descent
Per-parameter adaptive learning rate schemes
• Main idea: set the learning rate per-parameter, dynamically at each iteration, based on observed statistics of the past gradients.
    (w_{t+1})_j = (w_t)_j - α_{j,t} (∇f(w_t; x_t))_j

• Where the step size now depends on the parameter index j
  • Corresponds to a multiplication of the gradient by a diagonal scaling matrix.
• There are many different schemes in this class
AdaGrad: One of the first adaptive methods
• AdaGrad: Adaptive subgradient methods for online learning and stochastic optimization
  • J. Duchi, E. Hazan, Y. Singer
  • Journal of Machine Learning Research, 2011
• High level approach: can use the history of sampled gradients to choose the step size for the next SGD step to be inversely proportional to the usual magnitude of gradient steps in that direction
  • On a per-parameter basis.
AdaGrad

Algorithm 1 AdaGrad
  input: learning rate factor η, initial parameters w_0 ∈ R^d, small number ε
  initialize t ← 0
  loop
    sample a stochastic gradient g_t ← ∇f(w_t; x_t)
    update model: for all j ∈ {1, ..., d}
      (w_{t+1})_j ← (w_t)_j - η / √(Σ_{k=0}^{t} (g_k)_j² + ε) · (g_t)_j
    t ← t + 1
  end loop

Can think of the accumulated sum as the (squared) norm of the past gradients in the j-th parameter.
Memory-efficient implementation of AdaGrad

Algorithm 1 AdaGrad
  input: learning rate factor η, initial parameters w_0 ∈ R^d, small number ε
  initialize t ← 0
  initialize r ← 0 ∈ R^d
  loop
    sample a stochastic gradient g_t ← ∇f(w_t; x_t)
    accumulate second moment estimate: r_j ← r_j + (g_t)_j² for all j ∈ {1, ..., d}
    update model: for all j ∈ {1, ..., d}
      (w_{t+1})_j ← (w_t)_j - η / √(r_j + ε) · (g_t)_j
    t ← t + 1
  end loop
Important thing to notice: the step size is monotonically decreasing!
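A minimal NumPy sketch of the memory-efficient version above; `grad_f` is a hypothetical callable returning a (possibly stochastic) gradient.

```python
import numpy as np

def adagrad(grad_f, w0, eta=0.5, eps=1e-8, iters=200):
    """Memory-efficient AdaGrad (a sketch).

    Keeps a running per-parameter sum r of squared gradients; the
    effective step size eta / sqrt(r_j + eps) only ever shrinks.
    """
    w = np.asarray(w0, dtype=float)
    r = np.zeros_like(w)  # accumulated second moment estimate
    for _ in range(iters):
        g = grad_f(w)
        r += g * g                          # r_j <- r_j + (g_t)_j^2
        w = w - eta / np.sqrt(r + eps) * g  # per-parameter scaled step
    return w
```

On a simple quadratic the per-parameter scaling lets coordinates with different gradient magnitudes share one learning rate factor η.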
Demo
AdaGrad for Non-convex Optimization
• What problems might arise when using AdaGrad for non-convex optimization?
  • Think about the step size always decreasing. Could this cause a problem?
• If you do think of a problem that might arise, how could you change AdaGrad to fix it?
RMSProp

Algorithm 1 RMSProp
  input: learning rate factor η, initial parameters w_0 ∈ R^d, decay rate ρ, small number ε
  initialize t ← 0
  initialize r ← 0 ∈ R^d
  loop
    sample a stochastic gradient g_t ← ∇f(w_t; x_t)
    accumulate second moment estimate: r_j ← ρ · r_j + (1 - ρ) (g_t)_j² for all j ∈ {1, ..., d}
    update model: for all j ∈ {1, ..., d}
      (w_{t+1})_j ← (w_t)_j - η / √(r_j + ε) · (g_t)_j
    t ← t + 1
  end loop
Just replaces the gradient accumulation of AdaGrad with an exponential moving average.
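The change from AdaGrad is one line, as the following sketch shows; because old squared gradients decay at rate ρ, the step size can grow again after a stretch of small gradients, unlike AdaGrad's monotonically shrinking step.

```python
import numpy as np

def rmsprop(grad_f, w0, eta=0.01, rho=0.9, eps=1e-8, iters=300):
    """RMSProp (a sketch): AdaGrad with an exponential moving average.

    grad_f is a hypothetical callable returning a stochastic gradient.
    """
    w = np.asarray(w0, dtype=float)
    r = np.zeros_like(w)
    for _ in range(iters):
        g = grad_f(w)
        r = rho * r + (1.0 - rho) * g * g   # EMA of squared gradients
        w = w - eta / np.sqrt(r + eps) * g  # same scaled step as AdaGrad
    return w
```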
A systems perspective
• What is the computational cost of AdaGrad and RMSProp?
  • How much additional memory is required compared to baseline SGD?
  • How much additional compute is used?
Adaptive methods, summed up
• Generally useful when we can expect there to be different scales for different parameters
  • But can even work well when that doesn't happen, as we saw in the demo.
• Very commonly used class of methods for training ML models.
• We'll see more of this when we study Adam on Wednesday
  • Adam is basically RMSProp + Momentum.
Algorithms other than SGD
Machine learning is not just SGD
• Once a model is trained, we need to use it to classify new examples
  • This inference task is not computed with SGD
• There are other algorithms for optimizing objectives besides SGD
  • Stochastic coordinate descent
  • Derivative-free optimization
• There are other common tasks, such as sampling from a distribution
  • Gibbs sampling and other Markov chain Monte Carlo methods
  • And we sometimes use this together with SGD → called contrastive divergence
Why understand these algorithms?
• They represent a significant fraction of machine learning computations
  • Inference in particular is huge
• You may want to use them instead of SGD
  • But you don't want to suddenly pay a computational penalty for doing so because you don't know how to make them fast
• Intuition from SGD can be used to make these algorithms faster too
  • And vice-versa
Review — We’ve covered many methods
• Stochastic gradient descent
• Mini-batching
• Momentum
• Variance reduction
• Nice convergence proofs that give us a rate
• But only for convex problems!
Non-Convex Problems
• Anything that’s not convex
What makes non-convex optimization hard?
• Potentially many local minima
• Saddle points
• Very flat regions
• Widely varying curvature
Source: https://commons.wikimedia.org/wiki/File:Saddle_point.svg
But is it actually that hard?
• Yes, non-convex optimization is at least NP-hard
  • Can encode most problems as non-convex optimization problems
• Example: subset sum problem
  • Given a set of integers, is there a non-empty subset whose sum is zero?
  • Known to be NP-complete.
• How do we encode this as an optimization problem?
Subset sum as non-convex optimization
• Let a1, a2, …, an be the input integers
• Let x1, x2, …, xn be 1 if ai is in the subset, and 0 otherwise
• Objective:

    minimize (a^T x)² + Σ_{i=1}^{n} x_i² (1 - x_i)²

• What is the optimum if subset sum returns true? What if it's false?
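A quick check of this encoding in NumPy: the objective is zero exactly when x is a 0/1 vector selecting a subset that sums to zero. (Note that the all-zeros x, the empty subset, also achieves zero, so in practice the non-emptiness requirement has to be handled separately.)

```python
import numpy as np

def subset_sum_objective(x, a):
    """Non-convex encoding of subset sum.

    (a^T x)^2 penalizes selections that don't sum to zero;
    sum_i x_i^2 (1 - x_i)^2 penalizes entries that aren't exactly 0 or 1.
    """
    x = np.asarray(x, dtype=float)
    a = np.asarray(a, dtype=float)
    return float(np.dot(a, x) ** 2 + np.sum(x ** 2 * (1.0 - x) ** 2))
```

For a = [3, -1, -2], the full set sums to zero, so x = [1, 1, 1] attains objective 0, while any non-solution or fractional x has a strictly positive objective.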
So non-convex optimization is pretty hard
• There can't be a general algorithm to solve it efficiently in all cases
• Downsides: theoretical guarantees are weak or nonexistent
  • Depending on the application
  • There's usually no theoretical recipe for setting hyperparameters
• Upside: an endless array of problems to try to solve better
  • And gain theoretical insight about
  • And improve the performance of implementations
Examples of non-convex problems
• Matrix completion, principal component analysis
• Low-rank models and tensor decomposition
• Maximum likelihood estimation with hidden variables
  • Usually non-convex
• The big one: deep neural networks
Why are neural networks non-convex?
• They're often made of convex parts!
  • A single layer (a weight matrix multiply followed by a convex nonlinear element) by itself would be convex.
• But the composition of convex functions is not necessarily convex
  • So deep neural networks also aren't convex

[Figure: weight matrix multiply (convex) → nonlinear element, repeated layer by layer]
Why do neural nets need to be non-convex?
• Neural networks are universal function approximators
  • With enough neurons, they can learn to approximate any function arbitrarily well
• To do this, they need to be able to approximate non-convex functions
  • Convex functions can't approximate non-convex ones well.
• Neural nets also have many symmetric configurations
  • For example, exchanging intermediate neurons
  • This symmetry means they can't be convex. Why?
How to solve non-convex problems?
• Can use many of the same techniques as before
  • Stochastic gradient descent
  • Mini-batching
  • SVRG
  • Momentum
• There are also specialized methods for solving non-convex problems
  • Alternating minimization methods
  • Branch-and-bound methods
  • These generally aren't very popular for machine learning problems
Varieties of theoretical convergence results
• Convergence to a stationary point
• Convergence to a local minimum
• Local convergence to the global minimum
• Global convergence to the global minimum
Non-convex Stochastic Gradient Descent
Stochastic Gradient Descent
• The update rule is the same for non-convex functions
• Same intuition of moving in a direction that lowers objective
• Doesn't necessarily go towards the optimum
  • Even in expectation

    w_{t+1} = w_t - α_t ∇f̃_t(w_t)
Non-convex SGD: A Systems Perspective
• It’s exactly the same as the convex case!
• The hardware doesn’t care whether our gradients are from a convex function or not
• This means that all our intuition about computational efficiency from the convex case directly applies to the non-convex case
• But does our intuition about statistical efficiency also apply?
When can we say SGD converges?
• First, we need to decide what type of convergence we want to show
  • Here I'll just show convergence to a stationary point, the weakest type
• Assumptions:
  • Twice-differentiable objective
  • Lipschitz-continuous gradients:  -LI ⪯ ∇²f(x) ⪯ LI
  • Noise has bounded variance:  E[‖∇f̃_t(x) - ∇f(x)‖²] ≤ σ²
  • But no convexity assumption!
Convergence of Non-Convex SGD
• Start with the update rule:

    w_{t+1} = w_t - α_t ∇f̃_t(w_t)

• At the next time step, by Taylor's theorem, the objective will be

    f(w_{t+1}) = f(w_t - α_t ∇f̃_t(w_t))
               = f(w_t) - α_t ∇f̃_t(w_t)^T ∇f(w_t) + (α_t²/2) ∇f̃_t(w_t)^T ∇²f(y_t) ∇f̃_t(w_t)
               ≤ f(w_t) - α_t ∇f̃_t(w_t)^T ∇f(w_t) + (α_t² L/2) ‖∇f̃_t(w_t)‖²
Convergence (continued)
• Taking the expected value
    E[f(w_{t+1}) | w_t] ≤ f(w_t) - α_t E[∇f̃_t(w_t)^T ∇f(w_t) | w_t] + (α_t² L/2) E[‖∇f̃_t(w_t)‖² | w_t]
                        = f(w_t) - α_t ‖∇f(w_t)‖² + (α_t² L/2) E[‖∇f̃_t(w_t)‖² | w_t]
                        = f(w_t) - α_t ‖∇f(w_t)‖² + (α_t² L/2) ‖∇f(w_t)‖²
                          + (α_t² L/2) E[‖∇f̃_t(w_t) - ∇f(w_t)‖² | w_t]
                        ≤ f(w_t) - (α_t - α_t² L/2) ‖∇f(w_t)‖² + α_t² σ² L / 2.
Convergence (continued)
• So now we know how the expected value of the objective evolves.
• If we set α_t small enough that 1 - α_t L/2 ≥ 1/2 (i.e. α_t ≤ 1/L), then
    E[f(w_{t+1}) | w_t] ≤ f(w_t) - (α_t - α_t² L/2) ‖∇f(w_t)‖² + α_t² σ² L / 2

    E[f(w_{t+1}) | w_t] ≤ f(w_t) - (α_t/2) ‖∇f(w_t)‖² + α_t² σ² L / 2
Convergence (continued)
• Now taking the full expectation,
• And summing up over an epoch of length T
    E[f(w_{t+1})] ≤ E[f(w_t)] - (α_t/2) E[‖∇f(w_t)‖²] + α_t² σ² L / 2

    E[f(w_T)] ≤ f(w_0) - Σ_{t=0}^{T-1} (α_t/2) E[‖∇f(w_t)‖²] + Σ_{t=0}^{T-1} α_t² σ² L / 2
Convergence (continued)
• Now we need to decide how to set the step size α_t
  • Let's just choose a constant step size α for simplicity
• So our bound on the objective becomes
    E[f(w_T)] ≤ f(w_0) - (α/2) Σ_{t=0}^{T-1} E[‖∇f(w_t)‖²] + α² σ² L T / 2
Convergence (continued)
• Rearranging the terms,
    (α/2) Σ_{t=0}^{T-1} E[‖∇f(w_t)‖²] ≤ f(w_0) - E[f(w_T)] + α² σ² L T / 2
                                       ≤ f(w_0) - min_w f(w) + α² σ² L T / 2
                                       = f(w_0) - f* + α² σ² L T / 2
Now, we’re kinda stuck
• How do we use the bound on this term to say something useful?
• Idea: rather than outputting w_T, instead output some randomly chosen w_t from the history.
  • You might recall this trick from the proof in the SVRG paper.
    (α/2) Σ_{t=0}^{T-1} E[‖∇f(w_t)‖²] ≤ f(w_0) - f* + α² σ² L T / 2

Let z_T = w_t with probability 1/T for all t ∈ {0, ..., T-1}.
Using our randomly chosen output
• So the expected value of the squared gradient norm at this point is

    E[‖∇f(z_T)‖²] = Σ_{t=0}^{T-1} E[‖∇f(w_t)‖²] · P(z_T = w_t)
                  = (1/T) Σ_{t=0}^{T-1} E[‖∇f(w_t)‖²]
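A sketch of this output rule: run SGD for T steps, then return an iterate chosen uniformly at random from the history. The helper name and interface are illustrative, not from any particular library.

```python
import numpy as np

def sgd_random_iterate(grad_f, w0, alpha, T, seed=0):
    """SGD that outputs a uniformly random iterate from its history.

    The expected squared gradient norm at the returned point z_T equals
    the average over all T iterates, which is the quantity the bound
    controls.
    """
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    history = []
    for _ in range(T):
        history.append(w.copy())
        w = w - alpha * grad_f(w)
    return history[rng.integers(T)]  # z_T = w_t with probability 1/T
```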
Convergence (continued)
• Substituting this back into our previous bound gives us

    (α T / 2) E[‖∇f(z_T)‖²] ≤ f(w_0) - f* + α² σ² L T / 2

• And this simplifies to

    E[‖∇f(z_T)‖²] ≤ 2(f(w_0) - f*) / (αT) + α σ² L
Convergence (continued)
• Now, if we know that we are going to run for T iterations, we can set the step size to minimize this expression
    E[‖∇f(z_T)‖²] ≤ 2(f(w_0) - f*) / (αT) + α σ² L

    α = c/√T  ⇒  E[‖∇f(z_T)‖²] ≤ (1/√T) · (2(f(w_0) - f*)/c + c σ² L)
Convergence Takeaways
• So even non-convex SGD converges!
  • In the sense of getting to points where the gradient is arbitrarily small
• But this doesn't mean it goes to a local minimum!
  • Doesn't rule out that it goes to a saddle point, or a local maximum.
  • Doesn't rule out that it goes to a region of very flat but nonzero gradients.
• Certainly doesn’t mean that it finds the global optimum
• And the theoretical rate here was a slow 1/√T
Strengthening these theoretical results
Convergence to a local minimum
• Under stronger conditions, can prove that SGD converges to a local minimum
  • For example using the strict saddle property (Ge et al. 2015)
• Using even stronger properties, can prove that SGD converges to a local minimum with an explicit convergence rate of 1/T
• But, it’s unclear whether common classes of non-convex problems, such as neural nets, actually satisfy these stronger conditions.
Strengthening these theoretical results
Local convergence to the global minimum
• Another type of result you'll see is the local convergence result
• Main idea: if we start close enough to the global optimum, we will converge there with high probability
• Results often give an explicit initialization scheme that is guaranteed to be close
  • But it's often expensive to run
  • And limited to specific problems
Strengthening these theoretical results
Global convergence to a global minimum
• The strongest result is convergence no matter where we initialize
  • Like in the convex case
• To prove this, we need a global understanding of the objective
  • So it can only apply to a limited class of problems
• For many problems, we know empirically that this doesn't happen
  • Deep neural networks are an example of this
Other types of results
• Bounds on generalization error
  • Roughly: we can't say it'll converge, but we can say that it won't overfit
• Ruling out "spurious local minima"
  • Minima that exist in the training loss, but not in the true/test loss.
• Results that use the Hessian to escape from saddle points
  • By using it to find a descent direction, but rarely enough that it doesn't damage the computational efficiency
Deep Learning as Non-Convex Optimization
Or, "what could go wrong with my non-convex learning algorithm?"
Lots of Interesting Problems are Non-Convex
• Including deep neural networks
• Because of this, we almost always can’t prove convergence or anything like that when we run backpropagation (SGD) on a deep net
• But can we use intuition from PCA and convex optimization to understand what could go wrong when we run non-convex optimization on these complicated problems?
What could go wrong?
We could converge to a bad local minimum
• Problem: we converge to a local minimum which is bad for our task
  • Often in a very steep potential well
• One way to debug: re-run the system with different initialization
  • Hopefully it will converge to some other local minimum which might be better
• Another way to debug: add extra noise to gradient updates
  • Sometimes called "stochastic gradient Langevin dynamics"
  • Intuition: extra noise pushes us out of the steep potential well
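A sketch of one such noisy update. The √(2·α·temperature) noise scale follows the usual Langevin dynamics form, and the function interface is illustrative rather than a fixed API.

```python
import numpy as np

def sgld_step(w, grad, alpha, temperature, rng):
    """One stochastic gradient Langevin dynamics step (a sketch).

    The usual SGD update plus Gaussian noise whose scale grows with the
    step size; the noise can push the iterate out of steep, narrow wells.
    """
    noise = rng.normal(size=w.shape) * np.sqrt(2.0 * alpha * temperature)
    return w - alpha * grad + noise
```

At temperature 0 the noise vanishes and the step reduces to plain SGD.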
What could go wrong?
We could converge to a saddle point
• Problem: we converge to a saddle point, which is not locally optimal
• Upside: usually doesn't happen with plain SGD
  • Because noisy gradients push us away from the saddle point
  • But can happen with more sophisticated SGD-like algorithms
• One way to debug: find the Hessian and compute a descent direction
What could go wrong?
We get stuck in a region of low gradient magnitude
• Problem: we converge to a region where the gradient's magnitude is small, and then stay there for a very long time
  • Might not affect asymptotic convergence, but very bad for real systems
• One way to debug: use specialized techniques like batchnorm
  • There are many methods for preventing this problem for neural nets
• Another way to debug: design your network so that it doesn't happen
  • Networks using a ReLU activation tend to avoid this problem
What could go wrong?
Due to high curvature, we take huge steps and diverge
• Problem: we go to a region where the gradient's magnitude is very large, and then we make a series of very large steps and diverge
  • Especially bad for real systems using floating point arithmetic
• One way to debug: use an adaptive step size
  • Like we did for PCA
  • Adam (which we'll discuss on Wednesday) does this sort of thing
• A simple way to debug: just limit the size of the gradient step
  • But this can lead to the low-gradient-magnitude issue
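Limiting the step size is usually done by norm clipping, as in this small sketch: if the gradient norm exceeds a threshold, rescale it so the direction is preserved but the step length is capped.

```python
import numpy as np

def clip_gradient(g, max_norm):
    """Rescale the gradient if its norm exceeds max_norm (norm clipping).

    A simple guard against divergence in high-curvature regions: the
    direction is preserved, only the magnitude is capped.
    """
    g = np.asarray(g, dtype=float)
    norm = np.linalg.norm(g)
    if norm > max_norm:
        g = g * (max_norm / norm)
    return g
```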
What could go wrong?
I don't know how to set my hyperparameters
• Problem: without theory, how on earth am I supposed to set my hyperparameters?
• We already have discussed the solution: hyperparameter optimization
  • All the techniques we discussed apply to the non-convex case.
• To avoid this: just use hyperparameters from folklore
Takeaway
• Non-convex optimization is hard to write theory about
• But it's just as easy to compute SGD on
  • This is why we're seeing a renaissance of empirical computing
• We can use the techniques we have discussed to get speedup here too
• We can apply intuition from the convex case and from simple problems like PCA to learn how these techniques work