Partial Differential Equations Approaches to Optimization and Regularization of Deep Neural Networks


Transcript of: Partial Differential Equations Approaches to Optimization and Regularization of Deep Neural Networks

Page 1:

Partial Differential Equations Approaches to Optimization and Regularization of Deep Neural Networks

Celebrating 75 Years of Mathematics of Computation, ICERM

Nov 2, 2018

Adam Oberman, McGill Dept of Math and Stats

supported by NSERC, Simons Fellowship, AFOSR FA9550-18-1-0167

Page 2:

Background AI

• Artificial Intelligence is loosely defined as intelligence exhibited by machines

• Operationally: R&D in CS academic sub-disciplines: Computer Vision, Natural Language Processing (NLP), Robotics, etc.

AlphaGo used DL to beat the world champion at Go.

Page 3:

• AI: specific tasks.

• AGI: general cognitive abilities.

• AGI is a small research area within AI: build machines that can successfully perform any task that a human might do

• So far, no progress on AGI.

Artificial General Intelligence (AGI)

Page 4:

Deep Learning vs. traditional Machine Learning

• Machine Learning (ML) has been around for some time.

• Deep Learning is a newer branch of ML which uses Deep Neural Networks.

• ML has theory: error estimates and convergence proofs.

• DL has less theory, but it can effectively solve substantially larger-scale problems.

Page 5:

• ImageNet: total number of classes: m = 21,841

• Total number of images: n = 14,197,122

• Color images: d = 3 × 256 × 256 = 196,608

Facebook used 256 GPUs, working in parallel, to train on ImageNet.

Still an academic dataset: the total number of images on Facebook is much larger.

What are DNNs (in Math Language)?

Page 6:

What is the function? Looking for a map from images to labels.

x in M = manifold of images

graph of list of word labels

f(x)

Page 7:

Doing the impossible?

In theory, due to the curse of dimensionality, it is impossible to accurately interpolate a high-dimensional function.

In practice, it is possible using a Deep Neural Network architecture, trained to fit the data with SGD. However, we don't know why it works.

Computers can be trained to caption images more accurately than human performance.

Page 8:

Loss Functions versus Error

Classification problem: map image to discrete label space {1,2,3,…,10}

In practice: map to a probability vector, then assign label of the arg max.

Classification is not differentiable, so instead, in order to train, we use a loss function as a surrogate.
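As a toy illustration (not from the talk), the following Python sketch contrasts a differentiable surrogate (cross-entropy on the probability vector) with the non-differentiable 0-1 classification error; the probability vector and label below are made up:

```python
import numpy as np

def cross_entropy(p, y):
    """Surrogate loss: negative log-probability assigned to the correct label."""
    return -np.log(p[y])

def classification_error(p, y):
    """Non-differentiable 0-1 error: 1 if the arg max is not the correct label."""
    return float(np.argmax(p) != y)

p = np.array([0.1, 0.7, 0.2])      # predicted probability vector over 3 labels
y = 1                              # correct label
print(cross_entropy(p, y))         # decreases smoothly as p[y] grows
print(classification_error(p, y))  # jumps discontinuously with p
```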

Page 9:

DNNs in Math Language: high dimensional function fitting

Data fitting problem: f is a parameterized map from images to probability vectors on labels; y is the correct label. Try to fit the data by minimizing the loss.

Training: minimize expected loss, by taking stochastic (approximate) gradients

Note: train on an empirical distribution sampled from the density rho.

$$\min_w \; \mathbb{E}_{x\sim\rho_n}\,\ell\bigl(f(x;w),\, y(x)\bigr) \;=\; \frac{1}{n}\sum_{i=1}^{n}\ell\bigl(f(x_i;w),\, y(x_i)\bigr)$$
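A minimal sketch of this training loop in PyTorch (the model, data, and hyperparameters below are illustrative placeholders, not the ones from the talk):

```python
import torch

# toy stand-ins for (image, label) pairs sampled from rho
X = torch.randn(1000, 32)           # n samples, d features
y = torch.randint(0, 10, (1000,))   # labels in {0, ..., 9}

model = torch.nn.Sequential(
    torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
loss_fn = torch.nn.CrossEntropyLoss()              # the surrogate loss ell
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(10):
    perm = torch.randperm(len(X))
    for i in range(0, len(X), 64):                 # mini-batches of size 64
        idx = perm[i:i + 64]
        loss = loss_fn(model(X[idx]), y[idx])      # empirical loss on the batch
        opt.zero_grad()
        loss.backward()                            # stochastic (approximate) gradient
        opt.step()                                 # w <- w - h * grad
```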

Page 10:

Generalization. Training set and test set

Goal: generalization. We hope that the training error is a good estimate of the generalization loss, which is the expected loss on unseen images drawn from the same distribution.

$$\mathbb{E}_{x\sim\rho}\,\ell\bigl(f(x;w),\, y(x)\bigr) \;=\; \int \ell\bigl(f(x;w),\, y(x)\bigr)\, d\rho(x)$$

Testing: reserve some data and approximate the generalization loss/error on the test data, which is a surrogate for the true expected error on the full density.

$$\mathbb{E}_{x\sim\rho_{\mathrm{test}}}\,\ell\bigl(f(x;w),\, y(x)\bigr) \;\approx\; \frac{1}{n_{\mathrm{test}}}\sum_{i\in\mathrm{test}}\ell\bigl(f(x_i;w),\, y(x_i)\bigr)$$

Orange curve: overtrained. Green curve: better generalization.
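A minimal sketch (hypothetical helper, reusing the toy PyTorch example above) of the test-set surrogate described here: compare the average loss on the training data with the average loss on reserved test data.

```python
import torch

@torch.no_grad()
def average_loss(model, loss_fn, X, y):
    """Empirical average loss on a data set, a surrogate for the expected loss."""
    return loss_fn(model(X), y).item()

# With held-out (test_X, test_y) drawn from the same distribution as the training data:
# gap = average_loss(model, loss_fn, test_X, test_y) - average_loss(model, loss_fn, train_X, train_y)
```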

Page 11:

Challenges for deep learning

“It is not clear that the existing AI paradigm is immediately amenable to any sort of software engineering validation and verification. This is a serious issue, and is a potential roadblock to DoD’s use of these modern AI systems, especially when considering the liability and accountability of using AI”

JASON report

Page 12:

Mary Shaw’s evolution of software engineering discipline

Better theory improves reliability, and the discipline evolves.

Page 13:

Entropy-SGD

Pratik Chaudhari, UCLA (now Amazon / U Penn)

Stefano Soatto, UCLA Comp Sci

Stanley Osher, UCLA Math

Guillaume Carlier, CEREMADE, U. Paris IX Dauphine

• 2017: UCLA PhD student (at time of research)
• 2018 (present): Amazon research
• Fall 2019: Faculty in ESE at U Penn

Deep Relaxation: partial differential equations for optimizing deep neural networks. Pratik Chaudhari, Adam M. Oberman, Stanley Osher, Stefano Soatto, Guillaume Carlier, 2017.

Page 14:

Entropy SGD results in Deep Neural Networks (Pratik)

Visualization of the improvement in training loss (left) and the improvement in validation error (right).

dimension = 1.67 million

Page 15:

Different Interpretation: Regularization using Viscous Hamilton-Jacobi PDE

Solution of the PDE in one dimension. Cartoon: the algorithm only solves the PDE for a time depending on Hf(x).

Page 16:

Expected Improvement Theorem in continuous time

Page 17:

Adaptive-SGD, joint with PhD student Mariana Prazeres

Page 18:

mini-batch: randomly choose a smaller set I of points from the data.

Full gradient: cost is N.

Mini-batch gradient: cost is |I|.

Model for mini-batch gradients: k = 1 means.

For k > 1 means, the same calculation applies if we restrict to the active indices. This leads to smaller active batch sizes and higher variance.
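A schematic sketch of the full versus mini-batch gradient and their costs (hypothetical names; `grad_i(w, i)` stands for the gradient of the loss at the i-th data point; this is not the k-means model from the slide, it only illustrates the cost and variance trade-off):

```python
import numpy as np

def full_gradient(grad_i, w, N):
    """Average of all N per-sample gradients: cost proportional to N."""
    return sum(grad_i(w, i) for i in range(N)) / N

def minibatch_gradient(grad_i, w, N, batch_size, rng):
    """Unbiased estimate from a random index set I: cost proportional to |I|,
    at the price of extra variance."""
    I = rng.choice(N, size=batch_size, replace=False)
    return sum(grad_i(w, i) for i in I) / batch_size

# Example usage (hypothetical grad_i and w):
# rng = np.random.default_rng(0)
# g_mb = minibatch_gradient(grad_i, w, N=1000, batch_size=64, rng=rng)
```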

Page 19:

Motivation: the quality of the mini-batch gradient gMB depends on x.


• Far from x*, mini-batch gradients point in a good direction.
• Near x*, we require more samples, or small steps (so that directions average in time).


Page 20:

Adapt in Space instead of Time


The ideal learning rate/batch size combination should depend on x (space) rather than k (time).

(Figure annotations: use MB = 10, use MB = 60.)

Page 21:

Adaptive SGD

• Adaptively, depending on x, decide on the MB size, or the learning rate.

• Use the following formula (derived later)

- f large: learning rate large (OK to use a small MB)
- g small: var(MB) restricts the learning rate (so use a large MB)

Page 22:

Benchmarks: Fix MB and adapt h

[Figure: two plots of f(x) - f* versus iteration on a log scale, comparing Scheduled, Scheduled 1/t, and Adaptive. Left: a typical run. Right: averaged over 40 runs.]

Left: it is not too clear what is happening with one path. Right: average over several runs to see the trend.

Page 23:

[Figure: paths of Scheduled and Adaptive SGD.]

The variance of the paths is clear from this figure

Page 24:

Proof of Convergence with Rate

The rate is the same order as SGD, but with a better constant.

Page 25:

Proof of Convergence and Generalization for Lipschitz Regularized DNNs, joint with Jeff Calder

Lipschitz regularized Deep Neural Networks converge and generalize O. and Jeff Calder; 2018

Page 26:

Background on the generalization/convergence result

Page 27:

Problem: traditional ML theory does not apply to Deep Learning

A new idea is needed to make Deep Learning more reliable.

“Understanding Deep Learning requires rethinking generalization” Zhang (2016)

Page 28:

Inspiration, an old idea: Total Variation Denoising [1992], Rudin-Osher-Fatemi.

Used in early, high-profile image reconstruction of video images.

Stanley Osher

• Minimize a variational functional: a combination of a fidelity (loss) term to the original noisy image and a regularization term; the classical functional is recalled below.

• Regularization is large on noise, small on images
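For reference, the classical Rudin-Osher-Fatemi functional has exactly this loss-plus-regularization structure (u is the reconstructed image, f the noisy one, λ the fidelity weight):

$$\min_u \ \int_\Omega |\nabla u|\,dx \;+\; \frac{\lambda}{2}\int_\Omega \bigl(u - f\bigr)^2\,dx$$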

Page 29:

Regularization: from images to maps

x in M = manifold of images

word labels

f(x)

The learned map is well-behaved on the data manifold, but can be very badly behaved off the manifold (without regularization).

Page 30:

What is new in our result?

• Bartlett proved generalization under the assumption of Lipschitz regularity.

• However, DNNs are not uniformly Lipschitz.
• By adding regularization to the objective function in training, we obtain uniform Lipschitz bounds.

Page 31:

Approaches to regularization

A. Machine Learning: learn data using an appropriate (smooth) parameterized class of functions

B. Algorithmic: use an algorithm which selects the best solution (e.g. Stochastic Gradient Descent as a regularizer, adversarial training)

C. Inverse problems: allow for a broad class of functions, but modify the loss to choose the right one

A. $$\min_w \ \mathbb{E}_{x\sim\rho}\,\ell\bigl(f(x;w),\, y(x)\bigr)$$

B. $$w_{k+1} = w_k - h_k\, \nabla^{\mathrm{mb}}_w \ell(\ldots; w_k)$$

C. $$\min_w \ \mathbb{E}_{x\sim\rho}\,\ell\bigl(f(x;w),\, y(x)\bigr) + \lambda\,\|\nabla_x f\|_{L^p(X,\rho)}$$

Page 32:

Comment for math audience

• Our result may not be surprising to math experts.
• However, it is a new approach to generalization theory.

• Speaking personally, the hard work was giving a math interpretation of the problem (1.5 years)

• Once the model was set up correctly, and we realized we could implement it in a DNN, the math was relatively easy.

• The paper and proof were done in about 6 weeks.

Page 33:

Convergence: two cases

Clean labels:
• relevant in benchmark data sets and applications
• simpler proof, since the clean label function is a minimizer
• regime of perfect data interpolation is possible with DNNs

Noisy labels:
• relevant in applications
• familiar setting for the calculus of variations

Page 34:

Statement and proof sketch of the generalization/convergence result

Page 35:

Lipschitz Regularization of DNNs

Data function: augment the expected loss function on the data with a Lipschitz regularization term

where L0 is the Lipschitz constant of the data, and n is the number of data points. We are interested in the limit as we sample more points, which gives a limiting functional.
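(The formulas on this slide were not captured in the transcript. A plausible form for the empirical and limiting functionals, reconstructed from the surrounding text and offered only as an assumption, is:)

$$J_n[f] = \frac{1}{n}\sum_{i=1}^{n}\ell\bigl(f(x_i), y_i\bigr) + \lambda\,\bigl(\mathrm{Lip}(f) - L_0\bigr)_+ ,
\qquad
J[f] = \mathbb{E}_{x\sim\rho}\,\ell\bigl(f(x), y(x)\bigr) + \lambda\,\bigl(\mathrm{Lip}(f) - L_0\bigr)_+$$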

Page 36:

Convergence theorem for Noisy Labels

Convergence on the data manifold; Lipschitz continuity off the manifold.

Page 37:

Convergence for clean labels (with a rate)

• Rate of convergence, on the data manifold, of the minimizers.
• The rate depends on n, the number of data points sampled, and m, the number of labels.
• Probabilistic bound: we obtain a given error with high probability.
• Uniform sampling vs. random sampling: the log term and the probability statement go away.

Page 38:

Proof

Page 39:

Generalization follows

Page 40:

Lipschitz Regularization improves Adversarial Robustness

Chris Finlay (current PhD student)

Improved robustness to adversarial examples using Lipschitz regularization of the loss

Chris Finlay, O., Bilal Abbasi; Oct 2018; arXiv

Bilal Abbasi (former PhD student, now working in AI)

Page 41:

Adversarial Attacks

Page 42:

Adversarial Attacks on the loss
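(The formulas on this slide were not captured. A standard way to write an adversarial attack on the loss, consistent with the title and with the norm-ball scale discussed on the next slide, is the inner maximization over a perturbation δ of size at most ε:)

$$\max_{\|\delta\| \le \varepsilon}\ \ell\bigl(f(x + \delta; w),\, y(x)\bigr)$$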

Page 43:

Scale measures visible attacks

DNNs are vulnerable to attacks which are invisible to the human eye. Undefended networks have a 100% error rate at 0.1 (in max norm).

Page 44:

Implementation of Lipschitz Regularization of the Loss

Page 45:

Robustness bounds from the Lipschitz constant of the Loss

So training the model to have a better Lipschitz constant will improve the adversarial robustness bounds.
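A minimal sketch (hypothetical PyTorch code, not the authors' implementation) of penalizing the input gradient of the loss, which controls its local Lipschitz constant:

```python
import torch

def loss_gradient_penalty(model, loss_fn, x, y):
    """Mean squared norm of the input gradient of the loss: a rough proxy for
    the (local) Lipschitz constant of the loss with respect to the input x."""
    x = x.clone().requires_grad_(True)
    loss = loss_fn(model(x), y)
    grad_x, = torch.autograd.grad(loss, x, create_graph=True)
    return grad_x.flatten(1).norm(dim=1).pow(2).mean()

# Hypothetical training objective combining the data loss with the penalty:
# total = loss_fn(model(x), y) + lam * loss_gradient_penalty(model, loss_fn, x, y)
```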

Page 46:

Arms race of attack methods and defences

We tested against a toolbox of attacks and plotted the error curve as a function of the adversarial perturbation size.

Strongest attacks: 1. iterative l2-projected gradient; 2. iterative Fast Gradient Sign Method (FGSM).
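For concreteness, a minimal sketch of the one-step FGSM attack (the iterative variants repeat this step with a projection back onto the norm ball); hypothetical PyTorch code:

```python
import torch

def fgsm_attack(model, loss_fn, x, y, eps):
    """One step of the Fast Gradient Sign Method: move x by eps in the sign of
    the loss gradient, i.e. a max-norm perturbation of size eps."""
    x = x.clone().requires_grad_(True)
    loss = loss_fn(model(x), y)
    grad_x, = torch.autograd.grad(loss, x)
    return (x + eps * grad_x.sign()).detach()
```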

Page 47:

Adversarial Training: interpretation as regularization

Page 48:

Adversarial Training augmented with Lipschitz Regularization

Page 49:

AT + Tulip Results (2-norm)

A significant improvement over state-of-the-art results comes from augmenting AT (adversarial training) with Lipschitz regularization.

Page 50:

AT + Tulip Results (2-norm vs max-norm)

2-Lip > AT-2 > AT-1 > baseline (for all noise levels on both datasets)

Page 51:

Other current areas of interest in AI with connections to mathematics

We are looking for collaborators: these are possible new projects

Page 52:

Reinforcement Learning

• Related to dynamic Programming

• Computationally intensive and unstable

Related math: dynamic programming, optimal control

Page 53:

Recurrent NN

Page 54:

Generative Networks (GANs)

Wasserstein GANs: optimal transportation (OT) mapping between random noise (Gaussians) and the target distribution of images

Related math: Optimal Transportation algorithms and convergence (Peyré-Cuturi)

Page 55:

Squeeze Nets

Inference (evaluating the data and assigning a label) is costly (typically 0.1 second on a power-hungry, high-memory GPU) in terms of:
• Memory (to store the weights)
• Computation (multiplying the matrices times the vectors)
• Power (the energy, in Joules, used by the chip)
• Time

Research effort to make lean NNs. How?
• Quantization: low-bit number representation and arithmetic (related math: non-smooth optimization, when the ReLUs are also quantized); a generic sketch follows below.
• Pruning: trim off the small weights, and retrain.
• Hyperparameter optimization: train over multiple architectures and params.

Mostly an engineering effort, but could be combined with more math on the training.
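As one concrete (and generic) example of the quantization idea mentioned above, a symmetric uniform low-bit quantizer for a weight tensor; this is a sketch only, not a scheme from the talk:

```python
import torch

def quantize_uniform(w, bits=8):
    """Symmetric uniform quantization of a weight tensor to the given bit width."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    return torch.clamp((w / scale).round(), -qmax, qmax) * scale

# Hypothetical usage: quantize every weight tensor of a trained model
# for p in model.parameters():
#     p.data = quantize_uniform(p.data, bits=8)
```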