Partial Differential Equations Approaches to Optimization and Regularization of Deep Neural Networks


Transcript of: Partial Differential Equations Approaches to Optimization and Regularization of Deep Neural Networks

Page 1:

Partial Differential Equations Approaches to Optimization and Regularization of Deep Neural Networks

Celebrating 75 Years of Mathematics of Computation, ICERM

Nov 2, 2018

Adam Oberman, McGill Dept of Math and Stats

supported by NSERC, Simons Fellowship, AFOSR FA9550-18-1-0167

Page 2:

Background AI

• Artificial Intelligence is loosely defined as intelligence exhibited by machines

• Operationally: R&D in CS academic sub-disciplines: Computer Vision, Natural Language Processing (NLP), Robotics, etc.

AlphaGo used DL to beat the world champion at Go.

Page 3:

• AI: specific tasks.

• AGI: general cognitive abilities.

• AGI is a small research area within AI: build machines that can successfully perform any task that a human might do

• So far, no progress on AGI.

Artificial General Intelligence (AGI)

Page 4:

Deep Learning vs. traditional Machine Learning

• Machine Learning (ML) has been around for some time.

• Deep Learning is a newer branch of ML which uses Deep Neural Networks.

• ML has theory: error estimates and convergence proofs.

• DL has less theory, but it can effectively solve substantially larger-scale problems.

Page 5:

• ImageNet: total number of classes: m = 21,841

• Total number of images: n = 14,197,122

• Color images: d = 3 × 256 × 256 = 196,608

Facebook used 256 GPUs, working in parallel, to train on ImageNet.

Still an academic dataset: the total number of images on Facebook is much larger.

What are DNNs (in Math Language)?

Page 6:

What is the function? Looking for a map from images to labels.

x in M = manifold of images

graph of list of word labels

f(x)

Page 7:

Doing the impossible?

In theory, due to the curse of dimensionality, it is impossible to accurately interpolate a high-dimensional function.

In practice, it is possible using a Deep Neural Network architecture, trained to fit the data with SGD. However, we don't know why it works.

Computers can be trained to caption images more accurately than human performance.

Page 8:

Loss Functions versus Error

Classification problem: map image to discrete label space {1,2,3,…,10}

In practice: map to a probability vector, then assign label of the arg max.

Classification is not differentiable, so instead, in order to train, we use a loss function as a surrogate.
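As a toy illustration (not from the talk), the following Python sketch contrasts a differentiable surrogate (cross-entropy on the probability vector) with the non-differentiable 0-1 classification error; the probability vector and label below are made up:

```python
import numpy as np

def cross_entropy(p, y):
    """Surrogate loss: negative log-probability assigned to the correct label."""
    return -np.log(p[y])

def classification_error(p, y):
    """Non-differentiable 0-1 error: 1 if the arg max is not the correct label."""
    return float(np.argmax(p) != y)

p = np.array([0.1, 0.7, 0.2])      # predicted probability vector over 3 labels
y = 1                              # correct label
print(cross_entropy(p, y))         # decreases smoothly as p[y] grows
print(classification_error(p, y))  # jumps discontinuously with p
```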

Page 9:

DNNs in Math Language: high dimensional function fitting

Data fitting problem: f is a parameterized map from images to probability vectors on labels; y is the correct label. Try to fit the data by minimizing the loss.

Training: minimize expected loss, by taking stochastic (approximate) gradients

Note: train on an empirical distribution sampled from the density rho.

$$\min_w \; \mathbb{E}_{x\sim\rho_n}\,\ell\bigl(f(x;w),\, y(x)\bigr) \;=\; \frac{1}{n}\sum_{i=1}^{n}\ell\bigl(f(x_i;w),\, y(x_i)\bigr)$$
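A minimal sketch of this training loop in PyTorch (the model, data, and hyperparameters below are illustrative placeholders, not the ones from the talk):

```python
import torch

# toy stand-ins for (image, label) pairs sampled from rho
X = torch.randn(1000, 32)           # n samples, d features
y = torch.randint(0, 10, (1000,))   # labels in {0, ..., 9}

model = torch.nn.Sequential(
    torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
loss_fn = torch.nn.CrossEntropyLoss()              # the surrogate loss ell
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(10):
    perm = torch.randperm(len(X))
    for i in range(0, len(X), 64):                 # mini-batches of size 64
        idx = perm[i:i + 64]
        loss = loss_fn(model(X[idx]), y[idx])      # empirical loss on the batch
        opt.zero_grad()
        loss.backward()                            # stochastic (approximate) gradient
        opt.step()                                 # w <- w - h * grad
```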

Page 10:

Generalization. Training set and test set

Goal: generalization. We hope that the training error is a good estimate of the generalization loss, which is the expected loss on unseen images drawn from the same distribution.

$$\mathbb{E}_{x\sim\rho}\,\ell\bigl(f(x;w),\, y(x)\bigr) \;=\; \int \ell\bigl(f(x;w),\, y(x)\bigr)\, d\rho(x)$$

Testing: reserve some data and approximate the generalization loss/error on the test data, which is a surrogate for the true expected error on the full density.

$$\mathbb{E}_{x\sim\rho_{\mathrm{test}}}\,\ell\bigl(f(x;w),\, y(x)\bigr) \;\approx\; \frac{1}{n_{\mathrm{test}}}\sum_{i\in\mathrm{test}}\ell\bigl(f(x_i;w),\, y(x_i)\bigr)$$

Orange curve: overtrained. Green curve: better generalization.
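A minimal sketch (hypothetical helper, reusing the toy PyTorch example above) of the test-set surrogate described here: compare the average loss on the training data with the average loss on reserved test data.

```python
import torch

@torch.no_grad()
def average_loss(model, loss_fn, X, y):
    """Empirical average loss on a data set, a surrogate for the expected loss."""
    return loss_fn(model(X), y).item()

# With held-out (test_X, test_y) drawn from the same distribution as the training data:
# gap = average_loss(model, loss_fn, test_X, test_y) - average_loss(model, loss_fn, train_X, train_y)
```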

Page 11:

Challenges for deep learning

“It is not clear that the existing AI paradigm is immediately amenable to any sort of software engineering validation and verification. This is a serious issue, and is a potential roadblock to DoD’s use of these modern AI systems, especially when considering the liability and accountability of using AI”

JASON report

Page 12:

Mary Shaw’s evolution of software engineering discipline

Better theory improves reliability, and the discipline evolves.

Page 13:

Entropy-SGD

Pratik Chaudhari, UCLA (now Amazon / U Penn)

Stefano Soatto, UCLA Comp Sci

Stanley Osher, UCLA Math

Guillaume Carlier, CEREMADE, U. Paris IX Dauphine

• 2017: UCLA PhD student (at time of research)
• 2018 (present): Amazon research
• Fall 2019: Faculty in ESE at U Penn

Deep Relaxation: partial differential equations for optimizing deep neural networks. Pratik Chaudhari, Adam M. Oberman, Stanley Osher, Stefano Soatto, Guillaume Carlier, 2017.

Page 14:

Entropy SGD results in Deep Neural Networks (Pratik)

Visualization of the improvement in training loss (left) and the improvement in validation error (right).

dimension = 1.67 million

Page 15:

Different Interpretation: Regularization using Viscous Hamilton-Jacobi PDE

Solution of the PDE in one dimension. Cartoon: the algorithm only solves the PDE for a time depending on Hf(x).

Page 16:

Expected Improvement Theorem in continuous time

Page 17:

Adaptive-SGD, joint with PhD student Mariana Prazeres

Page 18:

mini-batch: randomly choose a smaller set I of points from the data.

Full gradient: cost is N.

Mini-batch gradient: cost is |I|.

Model for mini-batch gradients: k = 1 means.

For k > 1 means, the same calculation applies if we restrict to the active indices. This leads to smaller active batch sizes and higher variance.
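A schematic sketch of the full versus mini-batch gradient and their costs (hypothetical names; `grad_i(w, i)` stands for the gradient of the loss at the i-th data point; this is not the k-means model from the slide, it only illustrates the cost and variance trade-off):

```python
import numpy as np

def full_gradient(grad_i, w, N):
    """Average of all N per-sample gradients: cost proportional to N."""
    return sum(grad_i(w, i) for i in range(N)) / N

def minibatch_gradient(grad_i, w, N, batch_size, rng):
    """Unbiased estimate from a random index set I: cost proportional to |I|,
    at the price of extra variance."""
    I = rng.choice(N, size=batch_size, replace=False)
    return sum(grad_i(w, i) for i in I) / batch_size

# Example usage (hypothetical grad_i and w):
# rng = np.random.default_rng(0)
# g_mb = minibatch_gradient(grad_i, w, N=1000, batch_size=64, rng=rng)
```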

Page 19:

Motivation: the quality of the mini-batch gradient gMB depends on x.


• Far from x*, mini-batch gradients point in a good direction.
• Near x*, we require more samples, or small steps (so that directions average in time).


Page 20:

Adapt in Space instead of Time


The ideal learning rate/batch size combination should depend on x (space) rather than k (time).

(Figure annotations: use MB = 10, use MB = 60.)

Page 21:

Adaptive SGD

• Adaptively, depending on x, decide on the MB size, or the learning rate.

• Use the following formula (derived later)

- f large: learning rate large (OK to use a small MB)
- g small: var(MB) restricts the learning rate (so use a large MB)

Page 22:

Benchmarks: Fix MB and adapt h

[Figure: two plots of f(x) - f* versus iteration on a log scale, comparing Scheduled, Scheduled 1/t, and Adaptive. Left: a typical run. Right: averaged over 40 runs.]

Left: it is not too clear what is happening with one path. Right: average over several runs to see the trend.

Page 23:

[Figure: paths of Scheduled and Adaptive SGD.]

The variance of the paths is clear from this figure

Page 24:

Proof of Convergence with Rate

The rate is the same order as SGD, but with a better constant.

Page 25:

Proof of Convergence and Generalization for Lipschitz Regularized DNNs, joint with Jeff Calder

Lipschitz regularized Deep Neural Networks converge and generalize O. and Jeff Calder; 2018

Page 26:

Background on the generalization/convergence result

Page 27:

Problem: traditional ML theory does not apply to Deep Learning

A new idea is needed to make Deep Learning more reliable.

“Understanding Deep Learning requires rethinking generalization” Zhang (2016)

Page 28:

Inspiration, an old idea: Total Variation Denoising [1992], Rudin-Osher-Fatemi.

Used in early, high-profile image reconstruction of video images.

Stanley Osher

• Minimize a variational functional: a combination of a fidelity (loss) term to the original noisy image and a regularization term; the classical functional is recalled below.

• Regularization is large on noise, small on images
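For reference, the classical Rudin-Osher-Fatemi functional has exactly this loss-plus-regularization structure (u is the reconstructed image, f the noisy one, λ the fidelity weight):

$$\min_u \ \int_\Omega |\nabla u|\,dx \;+\; \frac{\lambda}{2}\int_\Omega \bigl(u - f\bigr)^2\,dx$$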

Page 29:

Regularization: from images to maps

x in M = manifold of images

word labels

f(x)

The learned map is well-behaved on the data manifold, but can be very badly behaved off the manifold (without regularization).

Page 30:

What is new in our result?

• Bartlett proved generalization under the assumption of Lipschitz regularity.

• However, DNNs are not uniformly Lipschitz.
• By adding regularization to the objective function in training, we obtain uniform Lipschitz bounds.

Page 31:

Approaches to regularization

A. Machine Learning: learn data using an appropriate (smooth) parameterized class of functions

B. Algorithmic: use an algorithm which selects the best solution (e.g. Stochastic Gradient Descent as a regularizer, adversarial training)

C. Inverse problems: allow for a broad class of functions, but modify the loss to choose the right one

A. $$\min_w \ \mathbb{E}_{x\sim\rho}\,\ell\bigl(f(x;w),\, y(x)\bigr)$$

B. $$w_{k+1} = w_k - h_k\, \nabla^{\mathrm{mb}}_w \ell(\ldots; w_k)$$

C. $$\min_w \ \mathbb{E}_{x\sim\rho}\,\ell\bigl(f(x;w),\, y(x)\bigr) + \lambda\,\|\nabla_x f\|_{L^p(X,\rho)}$$

Page 32:

Comment for math audience

• Our result may not be surprising to math experts.
• However, it is a new approach to generalization theory.

• Speaking personally, the hard work was giving a math interpretation of the problem (1.5 years)

• Once the model was set up correctly, and we realized we could implement it in a DNN, the math was relatively easy.

• The paper and proof were done in about 6 weeks.

Page 33:

Convergence: two cases

Clean labels:
• relevant in benchmark data sets and applications
• simpler proof, since the clean label function is a minimizer
• regime of perfect data interpolation is possible with DNNs

Noisy labels:
• relevant in applications
• familiar setting for the calculus of variations

Page 34:

Statement and proof sketch of the generalization/convergence result

Page 35:

Lipschitz Regularization of DNNs

Data function: augment the expected loss function on the data with a Lipschitz regularization term

where L0 is the Lipschitz constant of the data, and n is the number of data points. We are interested in the limit as we sample more points, which gives a limiting functional.
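(The formulas on this slide were not captured in the transcript. A plausible form for the empirical and limiting functionals, reconstructed from the surrounding text and offered only as an assumption, is:)

$$J_n[f] = \frac{1}{n}\sum_{i=1}^{n}\ell\bigl(f(x_i), y_i\bigr) + \lambda\,\bigl(\mathrm{Lip}(f) - L_0\bigr)_+ ,
\qquad
J[f] = \mathbb{E}_{x\sim\rho}\,\ell\bigl(f(x), y(x)\bigr) + \lambda\,\bigl(\mathrm{Lip}(f) - L_0\bigr)_+$$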

Page 36:

Convergence theorem for Noisy Labels

Convergence on the data manifold; Lipschitz continuity off the manifold.

Page 37:

Convergence for clean labels (with a rate)

• Rate of convergence, on the data manifold, of the minimizers.
• The rate depends on n, the number of data points sampled, and m, the number of labels.
• Probabilistic bound: we obtain a given error with high probability.
• Uniform sampling vs. random sampling: the log term and the probability statement go away.

Page 38:

Proof

Page 39:

Generalization follows

Page 40:

Lipschitz Regularization improves Adversarial Robustness

Chris Finlay (current PhD student)

Improved robustness to adversarial examples using Lipschitz regularization of the loss

Chris Finlay, O., Bilal Abbasi; Oct 2018; arXiv

Bilal Abbasi (former PhD student, now working in AI)

Page 41:

Adversarial Attacks

Page 42:

Adversarial Attacks on the loss
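(The formulas on this slide were not captured. A standard way to write an adversarial attack on the loss, consistent with the title and with the norm-ball scale discussed on the next slide, is the inner maximization over a perturbation δ of size at most ε:)

$$\max_{\|\delta\| \le \varepsilon}\ \ell\bigl(f(x + \delta; w),\, y(x)\bigr)$$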

Page 43:

Scale measures visible attacks

DNNs are vulnerable to attacks which are invisible to the human eye. Undefended networks have a 100% error rate at 0.1 (in max norm).

Page 44:

Implementation of Lipschitz Regularization of the Loss

Page 45:

Robustness bounds from the Lipschitz constant of the Loss

So training the model to have a better Lipschitz constant will improve the adversarial robustness bounds.
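A minimal sketch (hypothetical PyTorch code, not the authors' implementation) of penalizing the input gradient of the loss, which controls its local Lipschitz constant:

```python
import torch

def loss_gradient_penalty(model, loss_fn, x, y):
    """Mean squared norm of the input gradient of the loss: a rough proxy for
    the (local) Lipschitz constant of the loss with respect to the input x."""
    x = x.clone().requires_grad_(True)
    loss = loss_fn(model(x), y)
    grad_x, = torch.autograd.grad(loss, x, create_graph=True)
    return grad_x.flatten(1).norm(dim=1).pow(2).mean()

# Hypothetical training objective combining the data loss with the penalty:
# total = loss_fn(model(x), y) + lam * loss_gradient_penalty(model, loss_fn, x, y)
```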

Page 46:

Arms race of attack methods and defences

We tested against a toolbox of attacks and plotted the error curve as a function of the adversarial perturbation size.

Strongest attacks: 1. iterative l2-projected gradient; 2. iterative Fast Gradient Sign Method (FGSM).
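For concreteness, a minimal sketch of the one-step FGSM attack (the iterative variants repeat this step with a projection back onto the norm ball); hypothetical PyTorch code:

```python
import torch

def fgsm_attack(model, loss_fn, x, y, eps):
    """One step of the Fast Gradient Sign Method: move x by eps in the sign of
    the loss gradient, i.e. a max-norm perturbation of size eps."""
    x = x.clone().requires_grad_(True)
    loss = loss_fn(model(x), y)
    grad_x, = torch.autograd.grad(loss, x)
    return (x + eps * grad_x.sign()).detach()
```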

Page 47:

Adversarial Training: interpretation as regularization

Page 48:

Adversarial Training augmented with Lipschitz Regularization

Page 49:

AT + Tulip Results (2-norm)

A significant improvement over state-of-the-art results comes from augmenting AT (adversarial training) with Lipschitz regularization.

Page 50:

AT + Tulip Results (2-norm vs max-norm)

2-Lip > AT-2 > AT-1 > baseline (for all noise levels on both datasets)

Page 51:

Other current areas of interest in AI with connections to mathematics

We are looking for collaborators: these are possible new projects

Page 52:

Reinforcement Learning

• Related to dynamic Programming

• Computationally intensive and unstable

Related math: dynamic programming, optimal control

Page 53:

Recurrent NN

Page 54:

Generative Networks (GANs)

Wasserstein GANs: optimal transportation (OT) mapping between random noise (Gaussians) and the target distribution of images

Related math: Optimal Transportation algorithms and convergence (Peyré-Cuturi)

Page 55:

Squeeze Nets

Inference (evaluating the data and assigning a label) is costly (typically 0.1 second on a power-hungry, high-memory GPU) in terms of:
• Memory (to store the weights)
• Computation (multiplying the matrices times the vectors)
• Power (the energy, in Joules, used by the chip)
• Time

Research effort to make lean NNs. How?
• Quantization: low-bit number representation and arithmetic (related math: non-smooth optimization, when the ReLUs are also quantized); a generic sketch follows below.
• Pruning: trim off the small weights, and retrain.
• Hyperparameter optimization: train over multiple architectures and params.

Mostly an engineering effort, but could be combined with more math on the training.
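As one concrete (and generic) example of the quantization idea mentioned above, a symmetric uniform low-bit quantizer for a weight tensor; this is a sketch only, not a scheme from the talk:

```python
import torch

def quantize_uniform(w, bits=8):
    """Symmetric uniform quantization of a weight tensor to the given bit width."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    return torch.clamp((w / scale).round(), -qmax, qmax) * scale

# Hypothetical usage: quantize every weight tensor of a trained model
# for p in model.parameters():
#     p.data = quantize_uniform(p.data, bits=8)
```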