Machine Learning and Dynamical Systems: Approximation, Optimization and Beyond
Qianxiao Li
Email: [email protected]
Website: https://blog.nus.edu.sg/qianxiaoli
April, 2021
1 Background and Motivation
2 Machine Learning via Dynamical Systems
3 Machine Learning for Dynamical Systems
4 Machine Learning of Dynamical Systems
Background and Motivation
Deep Learning: Theory vs Practice
Compositional structures in Modern Machine Learning
Models Data Algorithms
Composition is Dynamics
Compositional function: y = F_T ◦ F_{T−1} ◦ · · · ◦ F_0(x)

Dynamics:
y = x_{T+1}, x = x_0
x_{t+1} = F_t(x_t), t = 0, 1, . . . , T
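The identity above can be sketched in a few lines of Python; the layer maps F_t here are illustrative stand-ins for trained network layers:

```python
def compose_as_dynamics(layers, x0):
    """Iterate x_{t+1} = F_t(x_t) over t and return the final state y = x_{T+1}."""
    x = x0
    for F in layers:  # F_0, F_1, ..., F_T in order
        x = F(x)
    return x

# Three toy "layers" acting on a scalar state:
layers = [lambda x: 2 * x, lambda x: x + 1, lambda x: 3 * x]
y = compose_as_dynamics(layers, 1.0)  # F_2(F_1(F_0(1.0))) = 3 * ((2 * 1) + 1) = 9.0
```

Composing the functions and iterating the dynamics are literally the same loop.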
Three Categories of Interactions
Machine Learning via Dynamical Systems
- Dynamical formulation of deep learning
- Dynamical analysis of learning algorithms
- Numerical analysis and architecture design

Machine Learning for Dynamical Systems
- Time series forecasting
- Sequence classification
- Sequence-to-sequence models

Machine Learning of Dynamical Systems
- Learning dynamical models from trajectories
- Learning reduced-order models
Machine Learning via Dynamical Systems
Machine Learning via Dynamical Systems
Background on Supervised Learning
Supervised Learning
Supervised learning is about making predictions
Dataset: D = {(x_i, y_i)}_{i=1}^N with x_i ∈ X, y_i ∈ Y. Target function: F∗ : X → Y, with y_i = F∗(x_i)
Goal: Learn/Approximate F∗ using information from D
Hypothesis Space
To approximate F∗, we first define a hypothesis space H consisting of candidate functions.

Linear models for regression and classification:
H = { Σ_j a_j ϕ_j(x) : a_j ∈ R }

Shallow neural networks:
H = { Σ_j a_j σ(w_j⊤ x + b_j) : w_j ∈ R^d, a_j, b_j ∈ R }
where σ is the activation function, e.g. σ(z) = max(0, z) (ReLU) or σ(z) = tanh(z) (Tanh)

Deep neural networks:
H = { F_T ◦ F_{T−1} ◦ · · · ◦ F_0(x) : each F_t is a shallow network }
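As a concrete sketch, an element of the shallow-network hypothesis space can be evaluated directly; the weights below are illustrative, not tied to any trained model:

```python
import numpy as np

def shallow_net(x, W, a, b):
    """Evaluate sum_j a_j * sigma(w_j^T x + b_j) with sigma = ReLU.
    x: (d,), W: (m, d) with rows w_j, a: (m,), b: (m,)."""
    hidden = np.maximum(W @ x + b, 0.0)  # sigma(z) = max(0, z)
    return float(a @ hidden)

x = np.array([1.0, -2.0])
W = np.eye(2)                  # two hidden units: w_1 = e_1, w_2 = e_2
b = np.zeros(2)
a = np.ones(2)
out = shallow_net(x, W, a, b)  # relu([1, -2]) = [1, 0], so out = 1.0
```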
Empirical and Population Risk Minimization
Learning/approximation involves picking out the “best” F ∈ H so that it makes similar predictions to F∗.

Empirical risk minimization (ERM):
F̂ ← argmin_{F ∈ H} (1/N) Σ_{i=1}^N Φ(F(x_i), y_i), where Φ is a loss function and y_i = F∗(x_i)

Population risk minimization (PRM):
F̄ ← argmin_{F ∈ H} E_{(x,y)∼µ∗} Φ(F(x), y), where µ∗ is the input-output distribution

We want to solve PRM, but we often can only perform ERM. The gap between F̂ and F̄ is the problem of generalization.
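A minimal sketch of ERM on a toy 1-d regression problem, with H taken as the linear maps x ↦ wx and Φ the squared loss; the closed-form least-squares minimizer is used purely for illustration:

```python
import numpy as np

def empirical_risk(w, xs, ys):
    """(1/N) * sum_i Phi(F(x_i), y_i) with F(x) = w * x and Phi(u, v) = (u - v)^2."""
    return float(np.mean((w * xs - ys) ** 2))

# Data generated by the target F*(x) = 2x, so the empirical minimizer is w = 2.
xs = np.array([1.0, 2.0, 3.0])
ys = 2.0 * xs
w_hat = float(np.sum(xs * ys) / np.sum(xs * xs))  # closed-form ERM solution for this H
```

Here the data is noiseless and the target lies in H, so ERM recovers F∗ exactly; in general the empirical minimizer differs from the population one.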
![Page 17: Machine Learning and Dynamical Systems - Approximation ......inf 2 L 1 ([0; T];) J ( ) := E 2 6 4(x T; y) + Z T 0 R (x t; t) | {z } Regularizer dt 3 7 5 x _ t = f (t; t)0 t T 0; y](https://reader035.fdocuments.net/reader035/viewer/2022071513/6134c0addfd10f4dd73bee96/html5/thumbnails/17.jpg)
Three Paradigms of Supervised Learning
Goal: study new phenomena that arise in deep learning with respect to approximation, optimization and generalization
Machine Learning via Dynamical Systems
Approximation Theory for Continuous-time ResNets
The Problem of Approximation
Universal Approximation Property
Given a target F∗ ∈ C and ε > 0, does there exist an F ∈ H such that ‖F∗ − F‖ ≤ ε?
Key question: how to achieve approximation via composition?
The Continuum Idealization of Residual Networks
Example: Flow Maps as Function Approximators
Binary Classification Problem: not linearly separable!

Evolve with the ODE:
ẋ_{t,1} = −x_{t,2} sin(t),  x_{0,1} = z_1
ẋ_{t,2} = −(1/2)(1 − x_{t,1}²) x_{t,2} + x_{t,1} cos(t),  x_{0,2} = z_2

Classify using a linear classifier at the end:
g(x_T) = 1_{x_{T,1} > 0}
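The flow-map classifier can be sketched with a forward-Euler integration of the ODE above; the step size and the horizon T used here are illustrative choices:

```python
import math

def flow_classify(z1, z2, T=2.0, dt=1e-3):
    """Evolve the ODE from (z1, z2) up to time T, then apply g(x_T) = 1[x_{T,1} > 0]."""
    x1, x2, t = z1, z2, 0.0
    while t < T:
        dx1 = -x2 * math.sin(t)
        dx2 = -0.5 * (1.0 - x1 ** 2) * x2 + x1 * math.cos(t)
        x1, x2 = x1 + dt * dx1, x2 + dt * dx2  # one Euler step
        t += dt
    return 1 if x1 > 0 else 0

label = flow_classify(0.5, -0.5)  # a hard binary label in {0, 1}
```

The nonlinear flow deforms the inputs so that a simple linear readout at time T can separate the classes.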
Approximating Functions by Dynamics
What conditions on F, G are sufficient to induce the universal approximation property on H(F, G)?
Universal Approximation by Flows
Sufficient conditions for universal approximation by flows [L, Lin & Shen, 19]: in dimension ≥ 2, this is always possible under mild conditions:
1. G covers the range of F∗
2. F is restricted affine invariant
3. Conv(F) contains a well function

In dimension 1 with G = {id}, only increasing functions can be approximated.
Approximation rates can be characterized for d = 1; for d ≥ 2, a general characterization is open.
Connections with controllability [CLT, 19; TG, 20]
Q. Li, T. Lin, and Z. Shen, “Deep Learning via Dynamical Systems: An Approximation Perspective,” to appear in J. Eur. Math. Soc. (JEMS), 2019. arXiv: 1912.10382
C. Cuchiero, M. Larsson, and J. Teichmann, “Deep neural networks, generic universal interpolation, and controlled ODEs,” en, arXiv:1908.07838 [math],2019. arXiv: 1908.07838
P. Tabuada and B. Gharesifard, “Universal Approximation Power of Deep Residual Neural Networks via Nonlinear Control Theory,” arXiv:2007.06007,2020. arXiv: 2007.06007
Machine Learning via Dynamical Systems
Deep Learning and Optimal Control
Calculus of Variations and Optimal Control
Deep Learning is Mean-field Optimal Control
The PRM problem using H(F, G) can be formulated as

inf_{θ ∈ L∞([0,T], Θ)}  J(θ) := E_{µ∗} [ Φ(x_T, y) + ∫_0^T R(x_t, θ_t) dt ]   (R is a regularizer)

subject to  ẋ_t = f(x_t, θ_t),  0 ≤ t ≤ T,  (x_0, y) ∼ µ∗

This is a mean-field optimal control problem, because we need to select a θ that controls not one, but an entire distribution of inputs and outputs.
Key questions:
- Theoretical: necessary and sufficient conditions for optimality
- Practical: understanding and improving learning algorithms
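A minimal discrete-time sketch of this objective, with illustrative choices f(x, θ) = tanh(θx), Φ the squared loss, and R(x, θ) = λθ² (none of these specific forms come from the slides). The mean-field aspect is that a single parameter path θ must control a whole sample of inputs drawn from µ∗:

```python
import numpy as np

def J(theta_path, x0, y, dt=0.1, lam=1e-3):
    """Monte-Carlo estimate of E[Phi(x_T, y) + integral of R(x_t, theta_t) dt]."""
    x = x0.copy()
    reg = 0.0
    for th in theta_path:
        x = x + dt * np.tanh(th * x)  # forward-Euler step of x' = f(x, theta)
        reg += dt * lam * th ** 2     # accumulate the running regularizer
    return float(np.mean((x - y) ** 2) + reg)

rng = np.random.default_rng(0)
x0 = rng.normal(size=100)      # a sample standing in for the distribution mu*
y = 2.0 * x0                   # paired targets
cost = J(np.zeros(10), x0, y)  # with theta = 0, f vanishes and x stays at x0
```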
Mathematical Results and Numerical Algorithms
Necessary conditions:
- Mean-field Pontryagin’s maximum principle (PMP) [E, Han & L, 19]

Necessary and sufficient conditions:
- Mean-field Hamilton-Jacobi-Bellman (HJB) equations [E, Han & L, 19]

Algorithms:
- Training algorithms without gradient descent (based on the PMP; generalizes backpropagation) [L, Chen, Tai & E, 18]
- Training algorithms for quantized networks [L and Hao, 18]
W. E, J. Han, and Q. Li, “A mean-field optimal control formulation of deep learning,” Research in the Mathematical Sciences, vol. 6, no. 1, p. 10, 2019.
Q. Li, L. Chen, C. Tai, and W. E, “Maximum principle based algorithms for deep learning,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 5998–6026, 2017
Q. Li and S. Hao, “An Optimal Control Approach to Deep Learning and Applications to Discrete-Weight Neural Networks,” in Proceedings of the 35thInternational Conference on Machine Learning, vol. 80, 2018, pp. 2985–2994
Open Loop and Closed Loop Control
Controlled dynamics: ẋ_t = f(x_t, θ_t).

Open-loop optimal control: θ∗_t is a deterministic function of t
- Works for a fixed x_0
- Unstable to perturbations of x_0

Closed-loop optimal control: θ∗_t = ξ(x∗_t, t)
- Feedback based on the current state
- Stable to perturbations of x_0 (or any x_t)
Adversarial Examples
Adversarial Defense based on Closed Loop Optimal Control
We can use closed-loop control to stabilize the propagation of features through a deep network.
This leads to increased adversarial robustness without the need to retrain the model [Chen, L and Zhang, 21].
Z. Chen, Q. Li, and Z. Zhang, “Towards robust neural networks via close-loop control,” in International Conference on Learning Representations, 2021.[Online]. Available: https://openreview.net/forum?id=2AL06y9cDE-
Machine Learning via Dynamical Systems
Dynamical Analysis of Learning Algorithms
Learning Algorithms
Empirical risk:
J(θ; D) = (1/|D|) Σ_{(x,y)∈D} Φ(F(x; θ), y)

The simplest learning (optimization) algorithm is stochastic gradient descent (SGD):
θ_{k+1} = θ_k − η ∇_θ J(θ_k; D_k)
where η is the learning rate and D_k is a random subset of D
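A minimal sketch of minibatch SGD on a toy 1-d least-squares problem; the model, data, and step size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=200)
ys = 3.0 * xs  # noiseless targets from F*(x) = 3x, so the minimizer is theta = 3

def grad_J(theta, idx):
    """Gradient of the minibatch risk (1/|D_k|) * sum (theta * x - y)^2."""
    xb, yb = xs[idx], ys[idx]
    return float(np.mean(2.0 * (theta * xb - yb) * xb))

theta, eta = 0.0, 0.1
for k in range(500):
    batch = rng.choice(len(xs), size=16, replace=False)  # random subset D_k
    theta -= eta * grad_J(theta, batch)
# theta has now converged very close to the minimizer 3.0
```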
Convergence Bounds vs Modelling Dynamics
Convergence bounds:
E‖θ_k − θ∗‖ ≤ r(k), with r(k) → 0

Dynamical models:
max_k E‖θ_k − s_{kη}‖ ≤ r(η), with r(η) → 0
where {s_t} is a continuous-time dynamical system

Such techniques date back to the analysis of finite-difference methods using modified equations.
R. Warming and B. Hyett, “The modified equation approach to the stability and accuracy analysis of finite-difference methods,” Journal of Computational Physics, vol. 14, no. 2, pp. 159–179, 1974
Stochastic Modified Equations
SGD: θ_{k+1} = θ_k − η ∇_θ J(θ_k; D_k)

Stochastic modified equation:
ds_t = −∇_θ J(s_t; D) dt + √η Σ(s_t)^{1/2} dW_t   (W_t is a Wiener process)
where Σ(s) = Cov_{D_k}(∇J(s; D_k)) is the gradient covariance

Weak approximation of {θ_k} by {s_{kη}} [L, Tai & E, 17, 19]

Extensions:
- Higher-order approximations
- SGD variants (adaptive SGD, momentum SGD)
Q. Li, C. Tai, and W. E, “Stochastic modified equations and adaptive stochastic gradient algorithms,” in International Conference on Machine Learning,2017
Q. Li, C. Tai, and W. E, “Stochastic Modified Equations and Dynamics of Stochastic Gradient Algorithms I: Mathematical Foundations,” Journal of Machine Learning Research, vol. 20, no. 40, pp. 1–47, 2019
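The modified equation can be simulated directly with the Euler-Maruyama scheme. A minimal sketch for a 1-d quadratic objective J(s) = s²/2 with a constant scalar gradient covariance; both choices are illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_sme(s0, eta=0.1, sigma2=0.04, T=10.0, dt=0.01):
    """Euler-Maruyama for ds = -grad J(s) dt + sqrt(eta * Sigma) dW,
    with J(s) = s^2 / 2 and constant covariance Sigma = sigma2."""
    s = s0
    for _ in range(int(T / dt)):
        drift = -s  # -grad J(s)
        noise = np.sqrt(eta * sigma2 * dt) * rng.normal()
        s = s + dt * drift + noise
    return float(s)

s_final = simulate_sme(2.0)  # decays toward 0, fluctuating at scale O(sqrt(eta))
```

The drift reproduces gradient flow, while the O(√η) noise term models the minibatch fluctuations of SGD.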
Applications of Dynamical Modelling of Optimization
Many applications:
- Tuning learning rates can be formulated as a (stochastic) optimal control problem [L, Tai & E, 17]
- Quantitative analysis of batch normalization [Cai, L & Shen, 19]
- Analysis and improvement of adversarial training [Ye, L, Zhou & Zhu, 21]
Q. Li, C. Tai, and W. E, “Stochastic modified equations and adaptive stochastic gradient algorithms,” in International Conference on Machine Learning,2017
Y. Cai, Q. Li, and Z. Shen, “A quantitative analysis of the effect of batch normalization on gradient descent,” in International Conference on Machine Learning, PMLR, 2019, pp. 882–890
N. Ye, Q. Li, X.-Y. Zhou, and Z. Zhu, “Amata: An annealing mechanism for adversarial training acceleration,” AAAI 2021, [arXiv preprint arXiv:2012.08112],2021
Machine Learning for Dynamical Systems
Machine Learning for Dynamical Systems
Supervised Learning of Dynamical Relationships
Learning Dynamic Relationships
Often, supervised learning has to be performed in the dynamic setting:
- Price forecasting: given past prices, predict tomorrow’s price
- Machine translation: given context, translate a sentence
- Learning response systems: given a temporal forcing, predict the temporal response

Common features:
- Each output depends on a sequence of inputs
- Need to model a sequence of input-output relationships

Deep learning on temporal/sequential data
↓
Interactions between dynamical structures in model and data
Modelling Static vs Dynamic Relationships
Static setting:
(input) x ∈ X = R^d
(output) y ∈ Y = R^n
(target) y = F∗(x)

Dynamic setting:
(input) x = {x_k ∈ R^d} ∈ X
(output) y = {y_k ∈ R^n} ∈ Y
(target) y_k = H∗_k(x) for all k

Goal of supervised learning:
Static: learn/approximate F∗
Dynamic: learn/approximate {H∗_k}
Some Examples of Learning Temporal Relationships
Time series classification / regression:
y_k = H∗_k(x) = Σ_{j≤k} α_j x_j (Adding Problem)

Time series forecasting:
y_k = x_{k+1} = H∗_k({x_j : j ≤ k})

Sequence-to-sequence models:
y_k = H∗_k({x_j : j ∈ Z})
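The first family of targets, a weighted cumulative sum of the input history, can be sketched directly; the weights α below are illustrative:

```python
import numpy as np

def targets(xs, alphas):
    """Return y_k = sum_{j <= k} alpha_j * x_j for every k (a cumulative weighted sum)."""
    return np.cumsum(alphas * xs)

xs = np.array([1.0, 2.0, 3.0])
alphas = np.array([1.0, 0.5, 0.25])
ys = targets(xs, alphas)  # [1.0, 2.0, 2.75]
```

Each y_k is a functional of the whole input sequence up to time k, not of x_k alone; this is what distinguishes the dynamic setting from the static one.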
Machine Learning for Dynamical Systems
Approximation Theory of RNNs
The Recurrent Neural Network Hypothesis Space
The recurrent neural network (RNN) architecture
h_{k+1} = σ(W h_k + U x_k), k = 0, 1, . . . , K − 1
h_0 = 0
y_k = c⊤ h_k

The RNN parametrizes a sequence of functions for k = 1, 2, . . .:
H_k : {x_0, . . . , x_{k−1}} ↦ y_k

Training the RNN: adjust (c, W, U) so that
H_k ≈ H∗_k, k = 1, 2, . . . , K
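A minimal sketch of this architecture, with σ = tanh and illustrative random weights in place of trained ones:

```python
import numpy as np

def rnn_outputs(xs, W, U, c):
    """Run h_{k+1} = tanh(W h_k + U x_k) from h_0 = 0 and read out y_k = c^T h_k.
    xs: (K, d) input sequence; returns the list [y_1, ..., y_K]."""
    h = np.zeros(W.shape[0])
    ys = []
    for x in xs:
        h = np.tanh(W @ h + U @ x)  # hidden-state update
        ys.append(float(c @ h))     # linear readout
    return ys

d, m = 2, 3
rng = np.random.default_rng(0)
W, U, c = rng.normal(size=(m, m)), rng.normal(size=(m, d)), rng.normal(size=m)
ys = rnn_outputs(rng.normal(size=(4, d)), W, U, c)  # one output per time step
```

One fixed set of weights (c, W, U) produces the entire sequence of functions H_1, H_2, . . . at once.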
The Recurrent Neural Network Hypothesis Space
From the outset, the RNN approach parametrizes a sequence of high-dimensional functions together.
This is akin to a reverse of the Mori-Zwanzig formalism.
29 49
![Page 56: Machine Learning and Dynamical Systems - Approximation ......inf 2 L 1 ([0; T];) J ( ) := E 2 6 4(x T; y) + Z T 0 R (x t; t) | {z } Regularizer dt 3 7 5 x _ t = f (t; t)0 t T 0; y](https://reader035.fdocuments.net/reader035/viewer/2022071513/6134c0addfd10f4dd73bee96/html5/thumbnails/56.jpg)
The Recurrent Neural Network Hypothesis Space
From the outset, the RNN approach parametrizes a sequence of high-dimensional functions together.
This is akin to a reverse of the Mori-Zwanzig formalism.
C. Ma, J. Wang, and W. E, “Model reduction with memory and the machine learning of dynamical systems,” Communications in Computational Physics, vol. 25, no. 4, pp. 947–962, 2018, issn: 1991-7120. doi: https://doi.org/10.4208/cicp.OA-2018-0269
R. Zwanzig, Nonequilibrium statistical mechanics. Oxford University Press, 2001
Empirically, it is found that memory affects RNN performance. Can we rigorously investigate this phenomenon?
Approximation Theory of Linear Continuous-time RNNs
We take a continuum approach.

Continuous-time approximation (residual variant):

h_{k+1} = h_k + δ σ(W h_k + U x_k)  →  ḣ_t = σ(W h_t + U x_t)

Linear activations: σ(z) = z

This allows us to prove precise approximation results [Li, Han, E & L, 21]:
Efficient approximation is possible if and only if there is exponential memory decay in {H*_k}
If there is no exponential decay, one may require an exponentially large number of parameters for approximation: the curse of memory
Z. Li, J. Han, W. E, and Q. Li, “On the curse of memory in recurrent neural networks: Approximation and optimization analysis,” in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=8Sqhl-nF50
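The discrete-to-continuous correspondence can be checked numerically: with a linear activation and a constant input, the residual recurrence is exactly a forward-Euler step for the ODE, so halving δ roughly halves the distance to the ODE solution. A small sketch (all parameters invented for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 3
W = -np.eye(m) + 0.1 * rng.standard_normal((m, m))  # a stable-ish linear system
U = rng.standard_normal((m, 1))
x = np.ones(1)                                      # constant input signal
T = 1.0

def residual_rnn(delta):
    """h_{k+1} = h_k + delta * (W h_k + U x): forward Euler with step delta."""
    h = np.zeros(m)
    for _ in range(int(T / delta)):
        h = h + delta * (W @ h + U @ x)
    return h

# A very fine step serves as a proxy for the exact ODE solution at t = T.
h_fine = residual_rnn(1e-5)
errs = [np.linalg.norm(residual_rnn(d) - h_fine) for d in (0.01, 0.005)]
print(errs[0] / errs[1])  # ≈ 2 (first-order convergence of forward Euler)
```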
Extension to Convolution-based Methods
We can extend the previous approach to analyze CNN-based methods (e.g. WaveNet) for time series

Key question: how are RNNs and CNNs different?

RNN approximation depends on memory decay [Li, Han, E & L, 21]
CNN approximation depends on spectral regularity (sparsity) [Jiang, Li & L, 21]
Z. Li, J. Han, W. E, and Q. Li, “On the curse of memory in recurrent neural networks: Approximation and optimization analysis,” in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=8Sqhl-nF50
H. Jiang, Z. Li, and Q. Li, “Approximation theory of convolutional architectures for time series modelling,” Submitted, 2021
Machine Learning for Dynamical Systems
Optimization Dynamics of RNNs
The Issue of Memory for Optimization
RNNs are notoriously hard to train on long sequences

[Figure: training loss over 40,000 epochs for three targets: (a) exponential sum, (b) Airy function, (c) Lorenz dynamics]

When and why does the training plateau?
What is the source of the plateaus?
Building Memory Explicitly into the Target Functional
Let us consider a target functional that builds in the effect of memory explicitly:

H*_T(x) = ∫₀^∞ x_{T−t} ρ(t) dt   (Riesz representation),   ρ(t) = ρ̄(t) + ρ_{0,ω}(t)
We can prove that as memory increases, the training stalls exponentially, with escape time estimates [Li, Han, E & L, 21]
Z. Li, J. Han, W. E, and Q. Li, “On the curse of memory in recurrent neural networks: Approximation and optimization analysis,” in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=8Sqhl-nF50
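To make the role of the memory kernel ρ concrete, the linear functional H*_T(x) = ∫₀^∞ x_{T−t} ρ(t) dt can be discretized and compared for a fast (exponential) and a slow (polynomial) kernel; the slower ρ decays, the more the output depends on the distant past. A hedged numerical sketch, with both kernels chosen purely for illustration:

```python
import numpy as np

def apply_functional(x, rho, dt):
    """Riemann-sum approximation of H_T(x) = ∫ x_{T-t} rho(t) dt,
    with x sampled on a grid of spacing dt (input before time 0 is ignored)."""
    t = dt * np.arange(len(x))
    return dt * np.sum(x[::-1] * rho(t))   # latest input pairs with rho(0)

dt, T = 0.01, 5.0
x = np.ones(int(T / dt))                   # constant input on [0, T]

fast = lambda t: np.exp(-5.0 * t)          # exponentially decaying memory
slow = lambda t: 1.0 / (1.0 + t) ** 2      # polynomially decaying memory

def truncation_loss(rho):
    """Relative error in H_T if inputs older than 1 time unit are forgotten."""
    full = apply_functional(x, rho, dt)
    x_trunc = x.copy()
    x_trunc[:-int(1.0 / dt)] = 0.0         # zero out the distant past
    return abs(full - apply_functional(x_trunc, rho, dt)) / full

print(truncation_loss(fast) < truncation_loss(slow))  # True
```

With exponential decay, almost all of H*_T is determined by the recent past; with polynomial decay, a large fraction is lost when the past is truncated, which is the mechanism behind the plateaus above.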
Plateauing Time Scale
Predicted: log[Plateauing time] ∝ c/ω for ω small
[Figure: measured plateauing time and parameter separation time plotted against 1/ω on a logarithmic time axis, consistent with the predicted scaling]
Plateauing for General Cases
Although our analysis is in a simplified setting, plateauing occurs more generally

[Figure: training loss vs. epoch for full-matrix RNNs in three settings: (Linear, SGD), (Tanh, SGD), and (Tanh, Adam)]
Machine Learning of Dynamical Systems
Machine Learning of Dynamical Systems
Learning Stable and Interpretable Dynamics
Motivation
Rayleigh-Bénard convection (high-dimensional)

Lorenz ‘63 model (low-dimensional)

Data-driven methods to characterize high-dimensional dynamics?
Learning reduced coordinates
Learning dynamics on these coordinates
The Basic Setup
Consider a high-dimensional dynamical system

ẋ = F(x),  x(0) ∈ ℝ^d

from which we get observation data

D = {x^(i)(t) : t ∈ [0, T], i = 1, . . . , N}

Goals:
Learn a reduction function ϕ : ℝ^d → ℝ^m
Learn a function f : ℝ^m → ℝ^m such that the reduced dynamics

ḣ = f(h),  h(0) = ϕ(x(0))

satisfy h(t) ≈ ϕ(x(t)) for all t ∈ [0, T]
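A minimal concrete instantiation of this setup (not the method advocated later in the talk): take ϕ to be a PCA projection and fit a linear f by least squares on finite-difference estimates of ḣ. Everything below, including the synthetic data, is illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic 'high-dimensional' data: a 2-d rotation embedded in R^10.
A2 = np.array([[0.0, -1.0], [1.0, 0.0]])           # true reduced dynamics
Q, _ = np.linalg.qr(rng.standard_normal((10, 2)))  # embedding R^2 -> R^10
dt, steps = 0.01, 2000
h = np.array([1.0, 0.0])
traj = []
for _ in range(steps):
    traj.append(Q @ h)
    h = h + dt * (A2 @ h)                          # Euler integration
X = np.array(traj)                                 # observations, shape (steps, 10)

# 1. Reduction map phi: PCA onto the top m = 2 components.
mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
phi = lambda x: (x - mu) @ Vt[:2].T

# 2. Fit linear reduced dynamics h_dot = F h by least squares.
H = phi(X)
H_dot = np.gradient(H, dt, axis=0)                 # finite-difference h_dot
C, *_ = np.linalg.lstsq(H, H_dot, rcond=None)
F = C.T

# The learned F should recover the rotation spectrum (eigenvalues ≈ ±i).
print(np.sort(np.linalg.eigvals(F).imag))
```

This toy version lacks exactly the ingredients the next slides ask for (invertibility of ϕ, stability and interpretability of f), which motivates the structured approach.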
Structure
The reduction function ϕ and the vector field f should satisfy some structural constraints.

Reduction function:
m should be much smaller than d
ϕ should be approximately invertible

Vector field:
The dynamics ḣ = f(h) should be stable
We should be able to interpret the physical structure of the learned f
Learning dynamics
Two classes of approaches to learn dynamics from data:

Unstructured approaches: sparse identification, direct NN fitting, ...
▶ Accurate over short times
▶ Flexible
▶ Unstable

Structured approaches: Hamiltonian, gradient, ...
▶ Stable
▶ Interpretable
▶ Limited approximation power

Can we get the best of both worlds?
Onsager principle
A general dynamical model for near-equilibrium processes:

M ḣ = −∇V(h)

Second law (dissipative system):

u · M u ≥ 0  ∀ u ∈ ℝ^m  ⟹  dV/dt ≤ 0

since along trajectories dV/dt = ∇V(h) · ḣ = −ḣ · M ḣ ≤ 0.

Onsager’s reciprocal relations (Onsager, 1931):

M_ij = M_ji

But this is limited to the linearized regime.

L. Onsager, “Reciprocal relations in irreversible processes. I.,” Physical Review, vol. 37, no. 4, p. 405, 1931
L. Onsager, “Reciprocal relations in irreversible processes. II.,” Physical Review, vol. 38, no. 12, p. 2265, 1931
Generalized Onsager principle
We generalise the Onsager principle:

[M(h) + W(h)] ḣ = −∇V(h) + g(h)

M is symmetric positive semi-definite (dissipative, reciprocal relations)
W is anti-symmetric (conservative)
V is lower-bounded with sufficient growth (potential, free energy, negative entropy)
g is a simple function (external force, control)
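These constraints can be enforced by construction rather than by penalization: parametrize M = LLᵀ and W = B − Bᵀ from unconstrained matrices and solve the linear system for ḣ. A NumPy sketch with a quadratic V and constant M, W (a toy stand-in, not the OnsagerNet architecture itself):

```python
import numpy as np

rng = np.random.default_rng(3)
m = 4

# Unconstrained parameters, symmetrized into the required structure:
L = rng.standard_normal((m, m))
M = L @ L.T                       # symmetric positive semi-definite (dissipative)
B = rng.standard_normal((m, m))
W_anti = B - B.T                  # anti-symmetric (conservative)

grad_V = lambda h: h              # quadratic potential V(h) = |h|^2 / 2 (illustrative)
g = lambda h: np.zeros(m)         # no external forcing

def h_dot(h):
    """Solve [M + W] h_dot = -grad V(h) + g(h) for h_dot."""
    return np.linalg.solve(M + W_anti, -grad_V(h) + g(h))

# Structural guarantee: dV/dt = grad V . h_dot = -h_dot . M h_dot <= 0,
# since the anti-symmetric part contributes nothing to the quadratic form.
h = rng.standard_normal(m)
hd = h_dot(h)
print(grad_V(h) @ hd <= 1e-12)    # True: V is non-increasing along trajectories
```

The point of the construction is that stability holds for any parameter values, so training never has to trade accuracy against stability.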
Data-Driven Approach
There are two main challenges:
1. How to find a good set of generalized coordinates h = ϕ(x)
2. How to determine M, W, V, g

Data-driven approach [Yu, Tian, E & L, 20]:
1. Use PCA / a near-isometric auto-encoder to parametrize ϕ
2. Use suitably designed neural networks to parametrize M, W, V, g (OnsagerNet)

H. Yu, X. Tian, W. E, and Q. Li, “Onsagernet: Learning stable and interpretable dynamics using a generalized onsager principle,” arXiv preprint arXiv:2009.02327, 2020
Application: Rayleigh-Bénard convection (RBC)
The RBC equations for x = (u, θ):

du/dt + (u · ∇)u = ν∆u − ∇p + α₀gθ
∇ · u = 0
dθ/dt + u · ∇θ = κ∆θ + u_y Γ

u: velocity field
θ: temperature profile

The Lorenz ‘63 model is a reduction of the RBC equations obtained by keeping the 3 most unstable Fourier modes, but it is quantitatively inaccurate.
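For reference, the three retained mode amplitudes evolve under the classical Lorenz ‘63 equations; integrating them at the standard parameter values (σ = 10, ρ = 28, β = 8/3) takes a few lines. A sketch with a simple RK4 integrator:

```python
import numpy as np

def lorenz63(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """The three-mode Lorenz '63 reduction of Rayleigh-Bénard convection."""
    x, y, z = state
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

def rk4_step(f, s, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(s)
    k2 = f(s + dt / 2 * k1)
    k3 = f(s + dt / 2 * k2)
    k4 = f(s + dt * k3)
    return s + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

s = np.array([1.0, 1.0, 1.0])
dt, steps = 0.01, 5000
traj = np.empty((steps, 3))
for i in range(steps):
    traj[i] = s
    s = rk4_step(lorenz63, s, dt)

# Despite the chaos, the trajectory stays on a bounded attractor.
print(np.max(np.abs(traj)))  # bounded (roughly below 50 for these parameters)
```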
Trajectory-wise accuracy
Reproduction of phase portraits
Visualizing learned energy function
Machine Learning of Dynamical Systems
Machine Learning for Rare Events
![Page 79: Machine Learning and Dynamical Systems - Approximation ......inf 2 L 1 ([0; T];) J ( ) := E 2 6 4(x T; y) + Z T 0 R (x t; t) | {z } Regularizer dt 3 7 5 x _ t = f (t; t)0 t T 0; y](https://reader035.fdocuments.net/reader035/viewer/2022071513/6134c0addfd10f4dd73bee96/html5/thumbnails/79.jpg)
Rare Events
Many important applications:
Structure of macro-molecules
Chemical reactions
Nucleation in phase transitions

Quantities of interest:
Transition rates
Potential landscapes
Invariant distributions
Challenge: high dimensions!
Machine Learning for Rare Events
Machine learning can deal with high dimensions in a data-driven manner:
Computing committor functions (reaction rates) in high dimensions [L, Lin & Ren, 19]
Computing quasipotentials in high dimensions [Lin, L & Ren, 20]

Q. Li, B. Lin, and W. Ren, “Computing committor functions for the study of rare events using deep learning,” The Journal of Chemical Physics, vol. 151, no. 5, p. 054112, 2019
B. Lin, Q. Li, and W. Ren, “A data driven method for computing quasipotentials,” Mathematical and Scientific Machine Learning, MSML 2021 [arXiv preprint arXiv:2012.09111], 2020
Some Open Problems
ML via DS
▶ Theory for continuum limits
▶ HJB-based learning algorithms

ML for DS
▶ Non-linear extensions
▶ Analysis of encoder-decoder models
▶ Analysis of attention models

ML of DS
▶ Approximation theory for structured models
▶ Data-dependent generalization results

A more detailed account of ML via DS: lecture notes at https://blog.nus.edu.sg/qianxiaoli/teaching/
Thank you!