Transcript of "Mathematics for Machine Learning"
Mathematics for Machine Learning
Marc Deisenroth
Statistical Machine Learning Group, Department of Computing, Imperial College London
@[email protected]@prowler.io
Deep Learning Indaba, University of the Witwatersrand, Johannesburg, South Africa
September 10, 2017
Applications of Machine Learning
Mathematical Concepts in Machine Learning

- Linear algebra and matrix decomposition
- Differentiation
- Optimization
- Integration
- Probability theory and Bayesian inference
- Functional analysis
Outline

- Introduction
- Differentiation
- Integration
Feedforward Neural Network

[Diagram: inputs $x$, hidden units $z$, outputs $y$, connected by parameters $A, b$ and nonlinearity $\sigma$]

$$z = Ax + b, \qquad y = \sigma(z)$$

- Training a neural network means parameter optimization: typically via some form of gradient descent.
  Challenge 1: Differentiation. Compute gradients of a loss function with respect to the neural network parameters $A, b$.
- Computing statistics (e.g., means, variances) of predictions.
  Challenge 2: Integration. Propagate uncertainty through a neural network.
Background: Matrix Multiplication

- Matrix multiplication is, in general, not commutative, i.e., $AB \neq BA$.
- When multiplying matrices, the "neighboring" dimensions have to fit:
  $$\underbrace{A}_{n \times k}\,\underbrace{B}_{k \times m} = \underbrace{C}_{n \times m}$$

| Math | NumPy |
| --- | --- |
| $y = Ax$ | `y = A.dot(x)` |
| $y_i = \sum_j A_{ij} x_j$ | `y = np.einsum('ij,j', A, x)` |
| $C = AB$ | `C = A.dot(B)` |
| $C_{ij} = \sum_k A_{ik} B_{kj}$ | `C = np.einsum('ik,kj', A, B)` |
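These correspondences are directly executable; a minimal NumPy check (shapes chosen arbitrarily for illustration) confirms that the `dot` and `einsum` forms agree:

```python
import numpy as np

# Shapes chosen so that the "neighboring" dimensions fit: (n x k) times (k x m)
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))   # n=3, k=4
B = rng.standard_normal((4, 2))   # k=4, m=2
x = rng.standard_normal(4)

# Matrix-vector product y = Ax, written two equivalent ways
y_dot = A.dot(x)
y_einsum = np.einsum('ij,j', A, x)   # y_i = sum_j A_ij x_j

# Matrix-matrix product C = AB, written two equivalent ways
C_dot = A.dot(B)
C_einsum = np.einsum('ik,kj', A, B)  # C_ij = sum_k A_ik B_kj

print(np.allclose(y_dot, y_einsum))  # True
print(np.allclose(C_dot, C_einsum))  # True
print(C_dot.shape)                   # (3, 2): n x m
```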
Curve Fitting (Regression) in Machine Learning

[Figure: training data and the maximum likelihood fit of a degree-5 polynomial]

- Setting: given inputs $x$, predict outputs/targets $y$.
- Model $f$ that depends on parameters $\theta$. Examples:
  - Linear model: $f(x, \theta) = \theta^\top x$, with $x, \theta \in \mathbb{R}^D$
  - Neural network: $f(x, \theta) = \mathrm{NN}(x, \theta)$
- Training data: e.g., $N$ pairs $(x_i, y_i)$ of inputs $x_i$ and observations $y_i$.
- Training the model means finding parameters $\theta^*$ such that $f(x_i, \theta^*) \approx y_i$.
- Define a loss function, e.g., $\sum_{i=1}^N (y_i - f(x_i, \theta))^2$, which we want to optimize.
- Typically: optimization based on some form of gradient descent. Differentiation required.
Differentiation: Outline

1. Scalar differentiation: $f: \mathbb{R} \to \mathbb{R}$
2. Multivariate case: $f: \mathbb{R}^N \to \mathbb{R}$
3. Vector fields: $f: \mathbb{R}^N \to \mathbb{R}^M$
4. General derivatives: $f: \mathbb{R}^{M \times N} \to \mathbb{R}^{P \times Q}$
Scalar Differentiation $f: \mathbb{R} \to \mathbb{R}$

- Derivative defined as the limit of the difference quotient:
  $$f'(x) = \frac{df}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$
- Slope of the secant line through $f(x)$ and $f(x+h)$.
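In code, the difference quotient with a small but finite $h$ already approximates the derivative well. A minimal sketch (the step size $h = 10^{-6}$ and the test function are arbitrary choices):

```python
import numpy as np

def derivative_fd(f, x, h=1e-6):
    """Approximate f'(x) by the difference quotient with a small finite h."""
    return (f(x + h) - f(x)) / h

# The approximation approaches the true derivative cos(x) as h -> 0
x = 0.7
approx = derivative_fd(np.sin, x)
exact = np.cos(x)
print(approx, exact)  # agree to ~6 decimal places
```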
Examples

| $f(x)$ | $f'(x)$ |
| --- | --- |
| $x^n$ | $n x^{n-1}$ |
| $\sin(x)$ | $\cos(x)$ |
| $\tanh(x)$ | $1 - \tanh^2(x)$ |
| $\exp(x)$ | $\exp(x)$ |
| $\log(x)$ | $1/x$ |
Rules

- Sum Rule:
  $$\big(f(x) + g(x)\big)' = f'(x) + g'(x) = \frac{df}{dx} + \frac{dg}{dx}$$
- Product Rule:
  $$\big(f(x)\,g(x)\big)' = f'(x)g(x) + f(x)g'(x) = \frac{df}{dx}\,g(x) + f(x)\,\frac{dg}{dx}$$
- Chain Rule:
  $$(g \circ f)'(x) = \big(g(f(x))\big)' = g'(f(x))\,f'(x) = \frac{dg}{df}\,\frac{df}{dx}$$
Example: Chain Rule

$$(g \circ f)'(x) = \big(g(f(x))\big)' = g'(f(x))\,f'(x) = \frac{dg}{df}\,\frac{df}{dx}$$

With $g(z) = \tanh(z)$ and $z = f(x) = x^n$:

$$(g \circ f)'(x) = \underbrace{\big(1 - \tanh^2(x^n)\big)}_{dg/df}\,\underbrace{n x^{n-1}}_{df/dx}$$
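A quick numerical sanity check of this result, comparing the chain-rule expression against a finite-difference approximation ($n = 3$ and the evaluation point are arbitrary choices):

```python
import numpy as np

n = 3
g_of_f = lambda x: np.tanh(x**n)

# Chain rule: (g o f)'(x) = (1 - tanh^2(x^n)) * n x^(n-1)
def grad_analytic(x):
    return (1 - np.tanh(x**n)**2) * n * x**(n - 1)

# Finite-difference check of the analytic gradient
h = 1e-6
x = 0.9
fd = (g_of_f(x + h) - g_of_f(x)) / h
print(grad_analytic(x), fd)  # match to ~6 decimals
```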
$f: \mathbb{R}^N \to \mathbb{R}$

$$y = f(x), \qquad x = \begin{bmatrix} x_1 \\ \vdots \\ x_N \end{bmatrix} \in \mathbb{R}^N$$

- Partial derivative (change one coordinate at a time):
  $$\frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \ldots, x_{i-1}, x_i + h, x_{i+1}, \ldots, x_N) - f(x)}{h}$$
- The Jacobian vector (gradient) collects all partial derivatives:
  $$\frac{df}{dx} = \Big[\frac{\partial f}{\partial x_1} \;\cdots\; \frac{\partial f}{\partial x_N}\Big] \in \mathbb{R}^{1 \times N}$$
  Note: This is a row vector.
Example

$f: \mathbb{R}^2 \to \mathbb{R}$, with $f(x_1, x_2) = x_1^2 x_2 + x_1 x_2^3 \in \mathbb{R}$

- Partial derivatives:
  $$\frac{\partial f(x_1, x_2)}{\partial x_1} = 2 x_1 x_2 + x_2^3, \qquad \frac{\partial f(x_1, x_2)}{\partial x_2} = x_1^2 + 3 x_1 x_2^2$$
- Gradient:
  $$\frac{df}{dx} = \Big[\frac{\partial f(x_1, x_2)}{\partial x_1} \;\; \frac{\partial f(x_1, x_2)}{\partial x_2}\Big] = \big[\, 2 x_1 x_2 + x_2^3 \quad x_1^2 + 3 x_1 x_2^2 \,\big] \in \mathbb{R}^{1 \times 2}$$
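The same finite-difference idea verifies this gradient, perturbing one coordinate at a time exactly as in the definition of the partial derivative. A minimal sketch:

```python
import numpy as np

f = lambda x: x[0]**2 * x[1] + x[0] * x[1]**3

def grad_analytic(x):
    # [2 x1 x2 + x2^3, x1^2 + 3 x1 x2^2], the gradient as a vector
    return np.array([2*x[0]*x[1] + x[1]**3, x[0]**2 + 3*x[0]*x[1]**2])

def grad_fd(f, x, h=1e-6):
    # Perturb one coordinate at a time (partial derivatives)
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x)) / h
    return g

x = np.array([1.2, -0.5])
print(grad_analytic(x))
print(grad_fd(f, x))  # agree to ~5 decimals
```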
Rules

- Sum Rule:
  $$\frac{\partial}{\partial x}\big(f(x) + g(x)\big) = \frac{\partial f}{\partial x} + \frac{\partial g}{\partial x}$$
- Product Rule:
  $$\frac{\partial}{\partial x}\big(f(x)\,g(x)\big) = \frac{\partial f}{\partial x}\,g(x) + f(x)\,\frac{\partial g}{\partial x}$$
- Chain Rule:
  $$\frac{\partial}{\partial x}(g \circ f)(x) = \frac{\partial}{\partial x}\big(g(f(x))\big) = \frac{\partial g}{\partial f}\,\frac{\partial f}{\partial x}$$
Example: Chain Rule

- Consider the function
  $$L(e) = \tfrac{1}{2}\|e\|^2 = \tfrac{1}{2} e^\top e, \qquad e = y - Ax, \quad x \in \mathbb{R}^N,\; A \in \mathbb{R}^{M \times N},\; e, y \in \mathbb{R}^M$$
- Compute $dL/dx$. What is the dimension/size of $dL/dx$?
- $dL/dx \in \mathbb{R}^{1 \times N}$:
  $$\frac{dL}{dx} = \frac{dL}{de}\,\frac{de}{dx}$$
  $$\frac{dL}{de} = e^\top \in \mathbb{R}^{1 \times M} \tag{1}$$
  $$\frac{de}{dx} = -A \in \mathbb{R}^{M \times N} \tag{2}$$
  $$\Rightarrow\; \frac{dL}{dx} = e^\top(-A) = -(y - Ax)^\top A \in \mathbb{R}^{1 \times N}$$
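A numerical check of this result (the sizes $M, N$ and the random data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 4, 3
A = rng.standard_normal((M, N))
y = rng.standard_normal(M)
x = rng.standard_normal(N)

L = lambda x: 0.5 * np.sum((y - A @ x)**2)

# Analytic gradient from the chain rule: dL/dx = -(y - Ax)^T A, a 1 x N row vector
dLdx = -(y - A @ x) @ A

# Finite-difference check, one coordinate at a time
h = 1e-6
fd = np.array([(L(x + h * np.eye(N)[i]) - L(x)) / h for i in range(N)])
print(np.allclose(dLdx, fd, atol=1e-4))  # True
```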
$f: \mathbb{R}^N \to \mathbb{R}^M$

$$y = f(x) \in \mathbb{R}^M, \qquad x \in \mathbb{R}^N$$

$$\begin{bmatrix} y_1 \\ \vdots \\ y_M \end{bmatrix} = \begin{bmatrix} f_1(x) \\ \vdots \\ f_M(x) \end{bmatrix} = \begin{bmatrix} f_1(x_1, \ldots, x_N) \\ \vdots \\ f_M(x_1, \ldots, x_N) \end{bmatrix}$$

- Jacobian matrix (collection of all partial derivatives):
  $$\begin{bmatrix} \frac{dy_1}{dx} \\ \vdots \\ \frac{dy_M}{dx} \end{bmatrix} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_N} \\ \vdots & & \vdots \\ \frac{\partial f_M}{\partial x_1} & \cdots & \frac{\partial f_M}{\partial x_N} \end{bmatrix} \in \mathbb{R}^{M \times N}$$
Example

$$f(x) = Ax, \qquad f(x) \in \mathbb{R}^M,\; A \in \mathbb{R}^{M \times N},\; x \in \mathbb{R}^N$$

- Compute the gradient $df/dx$.
- Dimension of $df/dx$: since $f: \mathbb{R}^N \to \mathbb{R}^M$, it follows that $df/dx \in \mathbb{R}^{M \times N}$.
- Gradient:
  $$f_i = \sum_{j=1}^N A_{ij} x_j \;\Rightarrow\; \frac{\partial f_i}{\partial x_j} = A_{ij}$$
  $$\Rightarrow\; \frac{df}{dx} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_N} \\ \vdots & & \vdots \\ \frac{\partial f_M}{\partial x_1} & \cdots & \frac{\partial f_M}{\partial x_N} \end{bmatrix} = \begin{bmatrix} A_{11} & \cdots & A_{1N} \\ \vdots & & \vdots \\ A_{M1} & \cdots & A_{MN} \end{bmatrix} = A \tag{3}$$
Chain Rule

$$\frac{\partial}{\partial x}(g \circ f)(x) = \frac{\partial}{\partial x}\big(g(f(x))\big) = \frac{\partial g}{\partial f}\,\frac{\partial f}{\partial x}$$
Example

- Consider $f: \mathbb{R}^2 \to \mathbb{R}$ and $x: \mathbb{R} \to \mathbb{R}^2$ with
  $$f(x) = f(x_1, x_2) = x_1^2 + 2 x_2, \qquad x(t) = \begin{bmatrix} x_1(t) \\ x_2(t) \end{bmatrix} = \begin{bmatrix} \sin(t) \\ \cos(t) \end{bmatrix}$$
- The dimensions of $df/dx$ and $dx/dt$ are $1 \times 2$ and $2 \times 1$, respectively.
- Compute the gradient $df/dt$ using the chain rule:
  $$\frac{df}{dt} = \Big[\frac{\partial f}{\partial x_1} \;\; \frac{\partial f}{\partial x_2}\Big] \begin{bmatrix} \frac{\partial x_1}{\partial t} \\ \frac{\partial x_2}{\partial t} \end{bmatrix} = \big[\, 2 \sin t \quad 2 \,\big] \begin{bmatrix} \cos t \\ -\sin t \end{bmatrix} = 2 \sin t \cos t - 2 \sin t = 2 \sin t\,(\cos t - 1)$$
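Again a quick numerical check of the chain-rule result (the evaluation point $t$ is arbitrary):

```python
import numpy as np

# f(x1, x2) = x1^2 + 2 x2 composed with x(t) = [sin t, cos t]
f_of_t = lambda t: np.sin(t)**2 + 2 * np.cos(t)

# Chain-rule result: df/dt = 2 sin t (cos t - 1)
dfdt = lambda t: 2 * np.sin(t) * (np.cos(t) - 1)

t, h = 1.3, 1e-6
fd = (f_of_t(t + h) - f_of_t(t)) / h  # finite-difference approximation
print(dfdt(t), fd)  # agree to ~6 decimals
```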
BREAK
Derivatives with Matrices

- Recap: the gradient of a function $f: \mathbb{R}^D \to \mathbb{R}^E$ is an $E \times D$ matrix (# target dimensions × # parameters), with
  $$\frac{df}{dx} \in \mathbb{R}^{E \times D}, \qquad df[e, d] = \frac{\partial f_e}{\partial x_d}$$
- The generalization to cases where the parameters ($D$) or targets ($E$) are matrices applies immediately.
- Assume $f: \mathbb{R}^{M \times N} \to \mathbb{R}^{P \times Q}$; then the gradient is a $(P \times Q) \times (M \times N)$ object (a tensor), where
  $$df[p, q, m, n] = \frac{\partial f_{pq}}{\partial X_{mn}}$$
Derivatives with Matrices: Example (1)

$$f = Ax, \qquad f \in \mathbb{R}^M,\; A \in \mathbb{R}^{M \times N},\; x \in \mathbb{R}^N$$

$$\frac{df}{dA} \in \mathbb{R}^{M \times (M \times N)}, \qquad \frac{df}{dA} = \begin{bmatrix} \frac{\partial f_1}{\partial A} \\ \vdots \\ \frac{\partial f_M}{\partial A} \end{bmatrix}, \qquad \frac{\partial f_i}{\partial A} \in \mathbb{R}^{1 \times (M \times N)}$$
Derivatives with Matrices: Example (2)

$$f_i = \sum_{j=1}^N A_{ij} x_j, \quad i = 1, \ldots, M$$

$$\frac{\partial f_i}{\partial A_{iq}} = x_q \;\Rightarrow\; \frac{\partial f_i}{\partial A_{i,:}} = x^\top \in \mathbb{R}^{1 \times 1 \times N}, \qquad \frac{\partial f_i}{\partial A_{k \neq i,:}} = 0^\top \in \mathbb{R}^{1 \times 1 \times N}$$

$$\frac{\partial f_i}{\partial A} = \begin{bmatrix} 0^\top \\ \vdots \\ x^\top \\ 0^\top \\ \vdots \\ 0^\top \end{bmatrix} \in \mathbb{R}^{1 \times (M \times N)} \tag{4}$$

(with $x^\top$ sitting in the $i$-th row of the $M \times N$ block)
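Stacking (4) over $i = 1, \ldots, M$ gives the full $M \times (M \times N)$ tensor, e.g., $\partial z/\partial A$ for $z = Ax$. A minimal NumPy sketch of its construction (the array names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 4, 3
A = rng.standard_normal((M, N))
x = rng.standard_normal(N)

# dzdA[i, j, k] = d z_i / d A_jk = x_k if i == j else 0   (cf. Eq. (4))
dzdA = np.zeros((M, M, N))
for i in range(M):
    dzdA[i, i, :] = x

# Equivalent one-liner via einsum: identity in (i, j) times x in k
dzdA_einsum = np.einsum('ij,k->ijk', np.eye(M), x)
print(np.allclose(dzdA, dzdA_einsum))  # True
```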
Example: Higher-Order Tensors

- Consider a matrix $A \in \mathbb{R}^{4 \times 2}$ whose entries depend on a vector $x \in \mathbb{R}^3$.
- We can compute $dA(x)/dx \in \mathbb{R}^{4 \times 2 \times 3}$ in two equivalent ways: collect the $4 \times 2$ partial-derivative matrices $\partial A/\partial x_k$, $k = 1, 2, 3$, or collect the three-dimensional gradient $dA_{ij}/dx$ of each entry.
Gradients of a Single-Layer Neural Network (1)

[Diagram: $x \to z \to f$ with parameters $A, b$ and nonlinearity $\sigma$]

$$f = \tanh(\underbrace{Ax + b}_{=:z \in \mathbb{R}^M}) \in \mathbb{R}^M, \qquad x \in \mathbb{R}^N,\; A \in \mathbb{R}^{M \times N},\; b \in \mathbb{R}^M$$

$$\frac{\partial f}{\partial b} = \underbrace{\frac{\partial f}{\partial z}}_{M \times M}\,\underbrace{\frac{\partial z}{\partial b}}_{M \times M} \in \mathbb{R}^{M \times M}$$

$$\frac{\partial f}{\partial A} = \underbrace{\frac{\partial f}{\partial z}}_{M \times M}\,\underbrace{\frac{\partial z}{\partial A}}_{M \times (M \times N)} \in \mathbb{R}^{M \times (M \times N)}$$
Gradients of a Single-Layer Neural Network (2)

$$f = \tanh(\underbrace{Ax + b}_{=:z \in \mathbb{R}^M}) \in \mathbb{R}^M, \qquad x \in \mathbb{R}^N,\; A \in \mathbb{R}^{M \times N},\; b \in \mathbb{R}^M$$

$$\frac{\partial f}{\partial b} = \underbrace{\frac{\partial f}{\partial z}}_{M \times M}\,\underbrace{\frac{\partial z}{\partial b}}_{M \times M} \in \mathbb{R}^{M \times M}$$

$$\frac{\partial f}{\partial z} = \mathrm{diag}\big(1 - \tanh^2(z)\big) \in \mathbb{R}^{M \times M}, \qquad \frac{\partial z}{\partial b} = I \in \mathbb{R}^{M \times M} \tag{5}$$

$$\frac{\partial f}{\partial b}[i, j] = \sum_{l=1}^M \frac{\partial f}{\partial z}[i, l]\,\frac{\partial z}{\partial b}[l, j]$$

`dfdb = np.einsum('il,lj', dfdz, dzdb)`
Gradients of a Single-Layer Neural Network (3)

$$f = \tanh(\underbrace{Ax + b}_{=:z \in \mathbb{R}^M}) \in \mathbb{R}^M, \qquad x \in \mathbb{R}^N,\; A \in \mathbb{R}^{M \times N},\; b \in \mathbb{R}^M$$

$$\frac{\partial f}{\partial A} = \underbrace{\frac{\partial f}{\partial z}}_{M \times M}\,\underbrace{\frac{\partial z}{\partial A}}_{M \times (M \times N)} \in \mathbb{R}^{M \times (M \times N)}$$

$$\frac{\partial f}{\partial z} = \mathrm{diag}\big(1 - \tanh^2(z)\big) \in \mathbb{R}^{M \times M} \tag{6}$$

$\partial z/\partial A$: see (4).

$$\frac{\partial f}{\partial A}[i, j, k] = \sum_{l=1}^M \frac{\partial f}{\partial z}[i, l]\,\frac{\partial z}{\partial A}[l, j, k]$$

`dfdA = np.einsum('il,ljk', dfdz, dzdA)`
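Putting (4)-(6) together, the einsum contractions from these slides run as-is. A minimal sketch with concrete (arbitrary) sizes, using the slides' variable names `dfdz`, `dzdb`, `dzdA`:

```python
import numpy as np

rng = np.random.default_rng(3)
M, N = 4, 3
A = rng.standard_normal((M, N))
b = rng.standard_normal(M)
x = rng.standard_normal(N)

z = A @ x + b
f = np.tanh(z)

dfdz = np.diag(1 - np.tanh(z)**2)            # (6): M x M
dzdb = np.eye(M)                             # (5): M x M
dzdA = np.einsum('ij,k->ijk', np.eye(M), x)  # (4): M x (M x N)

dfdb = np.einsum('il,lj', dfdz, dzdb)        # M x M
dfdA = np.einsum('il,ljk', dfdz, dzdA)       # M x (M x N)
print(dfdb.shape, dfdA.shape)                # (4, 4) (4, 4, 3)
```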
Putting Things Together

- Inputs $x$, observed outputs $y = f(z, \theta) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \Sigma)$
- Train a single-layer neural network with
  $$f(z, \theta) = \tanh(z), \qquad z = Ax + b, \qquad \theta = \{A, b\}$$
- Find $A, b$ such that the squared loss
  $$L(\theta) = \tfrac{1}{2}\|e\|^2, \qquad e = y - f(z, \theta)$$
  is minimized.
- Partial derivatives:
  $$\frac{\partial L}{\partial A} = \frac{\partial L}{\partial e}\,\frac{\partial e}{\partial f}\,\frac{\partial f}{\partial z}\,\frac{\partial z}{\partial A}, \qquad \frac{\partial L}{\partial b} = \frac{\partial L}{\partial e}\,\frac{\partial e}{\partial f}\,\frac{\partial f}{\partial z}\,\frac{\partial z}{\partial b}$$
  where the factors are given by (1) for $\partial L/\partial e$, (2) and (3) for $\partial e/\partial f$, (6) for $\partial f/\partial z$, (4) for $\partial z/\partial A$, and (5) for $\partial z/\partial b$.
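Assembled into a gradient-descent loop, these factors train the single-layer network. A minimal sketch (the learning rate, step count, and toy data are assumptions for illustration, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(4)
M, N = 2, 3
A = rng.standard_normal((M, N))
b = np.zeros(M)
x = rng.standard_normal(N)
y = np.tanh(rng.standard_normal(M))  # target chosen inside tanh's range

lr = 0.1
for step in range(100):
    z = A @ x + b
    f = np.tanh(z)
    e = y - f
    dLde = e                                     # (1): dL/de = e^T, stored as a vector
    dedf = -np.eye(M)                            # (2): de/df = -I
    dfdz = np.diag(1 - np.tanh(z)**2)            # (6)
    dzdA = np.einsum('ij,k->ijk', np.eye(M), x)  # (4)
    dzdb = np.eye(M)                             # (5)

    dLdz = dLde @ dedf @ dfdz                    # row vector, 1 x M
    dLdA = np.einsum('l,ljk->jk', dLdz, dzdA)    # M x N
    dLdb = dLdz @ dzdb                           # M

    A -= lr * dLdA
    b -= lr * dLdb

print(0.5 * np.sum((y - np.tanh(A @ x + b))**2))  # loss is near zero after training
```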
Gradients of a Multi-Layer Neural Network

[Diagram: $x \to f_1 \to \cdots \to f_{L-1} \to f_L \to L$, with parameters $A_i, b_i$ between layers]

- Inputs $x$, observed outputs $y$
- Train a multi-layer neural network with
  $$f_0 = x, \qquad f_i = \sigma_i(A_{i-1} f_{i-1} + b_{i-1}), \quad i = 1, \ldots, L$$
- Find $A_j, b_j$ for $j = 0, \ldots, L-1$ such that the squared loss
  $$L(\theta) = \|y - f_L(\theta, x)\|^2$$
  is minimized, where $\theta = \{A_j, b_j\}$, $j = 0, \ldots, L-1$.
Gradients of a Multi-Layer Neural Network

$$\frac{\partial L}{\partial \theta_{L-1}} = \frac{\partial L}{\partial f_L}\,\frac{\partial f_L}{\partial \theta_{L-1}}$$

$$\frac{\partial L}{\partial \theta_{L-2}} = \frac{\partial L}{\partial f_L}\,\frac{\partial f_L}{\partial f_{L-1}}\,\frac{\partial f_{L-1}}{\partial \theta_{L-2}}$$

$$\frac{\partial L}{\partial \theta_{L-3}} = \frac{\partial L}{\partial f_L}\,\frac{\partial f_L}{\partial f_{L-1}}\,\frac{\partial f_{L-1}}{\partial f_{L-2}}\,\frac{\partial f_{L-2}}{\partial \theta_{L-3}}$$

$$\frac{\partial L}{\partial \theta_i} = \frac{\partial L}{\partial f_L}\,\frac{\partial f_L}{\partial f_{L-1}} \cdots \frac{\partial f_{i+2}}{\partial f_{i+1}}\,\frac{\partial f_{i+1}}{\partial \theta_i}$$

More details (including efficient implementation) later this week.
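A compact sketch of this backward recursion for a small tanh network with squared loss (the layer sizes and data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
sizes = [3, 4, 4, 2]  # layer widths: input, two hidden, output
As = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]

def forward(x):
    fs = [x]  # f_0 = x
    for A, b in zip(As, bs):
        fs.append(np.tanh(A @ fs[-1] + b))  # f_i = tanh(A_{i-1} f_{i-1} + b_{i-1})
    return fs

def backward(fs, y):
    grads = []
    delta = -2 * (y - fs[-1])  # dL/df_L for L = ||y - f_L||^2
    for i in reversed(range(len(As))):
        delta = delta * (1 - fs[i + 1]**2)             # through tanh: df_i/dz_i
        grads.append((np.outer(delta, fs[i]), delta))  # (dL/dA_i, dL/db_i)
        delta = As[i].T @ delta                        # through z_i: to the layer below
    return grads[::-1]

x, y = rng.standard_normal(3), rng.standard_normal(2)
grads = backward(forward(x), y)
print([g[0].shape for g in grads])  # [(4, 3), (4, 4), (2, 4)]
```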
Training Neural Networks as Maximum Likelihood Estimation

- Training a neural network in the above way corresponds to maximum likelihood estimation:
- If $y = \mathrm{NN}(x, \theta) + \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$, then the log-likelihood is (up to an additive constant)
  $$\log p(y \mid x, \theta) = -\tfrac{1}{2}\|y - \mathrm{NN}(x, \theta)\|^2$$
- Find $\theta^*$ by minimizing the negative log-likelihood:
  $$\theta^* = \arg\min_\theta\; -\log p(y \mid x, \theta) = \arg\min_\theta\; \tfrac{1}{2}\|y - \mathrm{NN}(x, \theta)\|^2 = \arg\min_\theta\; L(\theta)$$
- Maximum likelihood estimation can lead to overfitting (interpreting noise as signal).
Example: Linear Regression (1)
§ Linear regression with a polynomial of order M:
$$y = f(x, \theta) + \epsilon\,, \quad \epsilon \sim \mathcal{N}(0, \sigma_\epsilon^2)$$
$$f(x, \theta) = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_M x^M = \sum_{i=0}^{M} \theta_i x^i$$
§ Given inputs $x_i$ and corresponding (noisy) observations $y_i$, $i = 1, \dots, N$, find parameters $\theta = [\theta_0, \dots, \theta_M]^\top$ that minimize the squared loss (equivalently: maximize the likelihood)
$$L(\theta) = \sum_{i=1}^{N} \big(y_i - f(x_i, \theta)\big)^2$$
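A least-squares sketch of this setup; the data-generating function, N, M, and noise level are made-up values:

```python
import numpy as np

# Fit an order-M polynomial by minimizing L(theta) = sum_i (y_i - f(x_i, theta))^2.
rng = np.random.default_rng(1)
N, M, sigma_eps = 20, 4, 0.2
x = rng.uniform(-5, 5, size=N)
y = np.sin(x) + sigma_eps * rng.normal(size=N)    # noisy observations y_i

Phi = np.vander(x, M + 1, increasing=True)        # Phi[i, j] = x_i ** j
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # least-squares solution
y_fit = Phi @ theta                               # f(x_i, theta)
```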
Example: Linear Regression (2)
[Figure: data and maximum likelihood estimate for a polynomial of degree 16, x ∈ [−5, 5]]

§ Regularization, model selection etc. can address overfitting ⇒ Tutorials later this week
§ Alternative approach based on integration
Overview
Introduction
Differentiation
Integration
Integration: Outline
1. Motivation
2. Monte-Carlo estimation
3. Basic sampling algorithms
Bayesian Integration to Avoid Overfitting
§ Instead of fitting a single set of parameters $\theta^*$, we can average over all plausible parameters ⇒ Bayesian integration:
$$p(y \,|\, x) = \int p(y \,|\, x, \theta)\, p(\theta)\, \mathrm{d}\theta$$

[Figure: polynomial of degree 16 with data, maximum likelihood estimate, MAP estimate, and 95% predictive confidence bound]

§ More details on what $p(\theta)$ is ⇒ Tutorials later this week
§ For neural networks this integration is intractable ⇒ Approximations
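For neural networks this integral needs approximation, but for the linear-in-parameters polynomial model above it is tractable. A minimal sketch, assuming the conjugate choices $p(\theta) = \mathcal{N}(0, \alpha^2 I)$ and Gaussian noise with variance $\sigma^2$; the hyperparameters $\alpha$, $\sigma$ and the data are hand-picked illustrations:

```python
import numpy as np

# Bayesian linear regression: with p(theta) = N(0, alpha^2 I) and Gaussian
# noise N(0, sigma^2), p(y|x) = integral of p(y|x,theta) p(theta) dtheta
# is Gaussian and can be computed in closed form.
rng = np.random.default_rng(2)
alpha, sigma, M = 1.0, 0.2, 4                     # assumed hyperparameters
x = rng.uniform(-5, 5, size=20)
y = np.sin(x) + sigma * rng.normal(size=20)
Phi = np.vander(x, M + 1, increasing=True)

S = np.linalg.inv(Phi.T @ Phi / sigma**2 + np.eye(M + 1) / alpha**2)
m = S @ Phi.T @ y / sigma**2                      # posterior mean of theta

phi_star = np.vander([0.5], M + 1, increasing=True)[0]  # features at x* = 0.5
pred_mean = phi_star @ m                          # predictive mean at x*
pred_var = sigma**2 + phi_star @ S @ phi_star     # predictive variance at x*
```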
Computing Statistics of Random Variables
§ Computing means/(co)variances also requires solving integrals:
$$\mathbb{E}_x[x] = \int x\, p(x)\, \mathrm{d}x =: \mu_x$$
$$\mathbb{V}_x[x] = \int (x - \mu_x)(x - \mu_x)^\top p(x)\, \mathrm{d}x$$
$$\mathrm{Cov}[x, y] = \iint (x - \mu_x)(y - \mu_y)^\top p(x, y)\, \mathrm{d}x\, \mathrm{d}y$$
§ These integrals can often not be computed in closed form ⇒ Approximations
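A sampling-based sketch of the first two integrals (the approximation scheme is introduced formally below), for a distribution chosen arbitrarily so that sampling is easy but the integrals have no obvious closed form:

```python
import numpy as np

# Estimate E[x] and V[x] by sampling: push a correlated Gaussian through
# tanh and replace the integrals by sample averages.
rng = np.random.default_rng(3)
S = 100_000
z = rng.normal(size=(S, 2)) @ np.array([[1.0, 0.5], [0.0, 1.0]])
x = np.tanh(z)                           # samples x^(s) ~ p(x)

mu_x = x.mean(axis=0)                    # ~ integral of x p(x) dx
V_x = (x - mu_x).T @ (x - mu_x) / S      # ~ integral of (x-mu)(x-mu)^T p(x) dx
```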
Approximate Integration
§ Numerical integration (low-dimensional problems)
§ Bayesian quadrature, e.g., O'Hagan (1987, 1991); Rasmussen & Ghahramani (2003)
§ Variational Bayes, e.g., Jordan et al. (1999)
§ Expectation Propagation, Opper & Winther (2001); Minka (2001)
§ Monte-Carlo Methods, e.g., Gilks et al. (1996); Robert & Casella (2004); Bishop (2006)
Monte Carlo Methods—Motivation
§ Monte Carlo methods are computational techniques that make use of random numbers
§ Two typical problems:
1. Problem 1: Generate samples $\{x^{(s)}\}$ from a given probability distribution $p(x)$, e.g., for simulation or representations of data distributions
2. Problem 2: Compute expectations of functions under that distribution:
$$\mathbb{E}[f(x)] = \int f(x)\, p(x)\, \mathrm{d}x$$
Example: Means/variances of distributions, predictions
Complication: Integral cannot be evaluated analytically
Problem 2: Monte Carlo Estimation
§ Computing expectations via statistical sampling:
$$\mathbb{E}[f(x)] = \int f(x)\, p(x)\, \mathrm{d}x \approx \frac{1}{S} \sum_{s=1}^{S} f(x^{(s)})\,, \quad x^{(s)} \sim p(x)$$
§ Making predictions (e.g., Bayesian regression with inputs $x$ and targets $y$):
$$p(y \,|\, x) = \int p(y \,|\, \theta, x)\, \underbrace{p(\theta)}_{\text{parameter distribution}}\, \mathrm{d}\theta \approx \frac{1}{S} \sum_{s=1}^{S} p(y \,|\, \theta^{(s)}, x)\,, \quad \theta^{(s)} \sim p(\theta)$$
§ Key problem: Generating samples from $p(x)$ or $p(\theta)$ ⇒ Need to solve Problem 1
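A sketch of the second estimator with an illustrative model: likelihood $p(y \,|\, \theta, x) = \mathcal{N}(y \,|\, \theta x, \sigma^2)$ and prior $p(\theta) = \mathcal{N}(0, 1)$; all numbers are made up:

```python
import numpy as np

# Monte Carlo estimate of the predictive density p(y|x) by averaging the
# likelihood over prior samples theta^(s) ~ p(theta).
rng = np.random.default_rng(4)
S, x, y, sigma = 10_000, 2.0, 1.0, 0.5

theta = rng.normal(size=S)           # theta^(s) ~ N(0, 1)
lik = np.exp(-0.5 * ((y - theta * x) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
p_y = lik.mean()                     # ~ integral of p(y|theta, x) p(theta) dtheta
```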
Sampling Discrete Values
[Figure: the unit interval partitioned into segments of length p(a) = 0.3, p(b) = 0.2, p(c) = 0.5]

§ Draw $u \sim \mathcal{U}[0, 1]$, where $\mathcal{U}$ is the uniform distribution
§ $u = 0.55$ falls into the segment belonging to $c$ (its cumulative interval is $[0.5, 1]$) ⇒ $x = c$
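A sketch of this inverse-CDF procedure with the probabilities from the slide:

```python
import numpy as np

# Partition [0, 1] by the cumulative probabilities and see where u lands.
values = ["a", "b", "c"]
cdf = np.cumsum([0.3, 0.2, 0.5])                    # [0.3, 0.5, 1.0]

u = 0.55
x = values[np.searchsorted(cdf, u, side="right")]   # -> "c"

# Drawing many samples reproduces the probabilities:
rng = np.random.default_rng(5)
idx = np.searchsorted(cdf, rng.uniform(size=100_000), side="right")
print(np.bincount(idx) / 100_000)                   # ~ [0.3, 0.2, 0.5]
```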
Continuous Variables
[Figure: an unnormalized density $\tilde{p}(z)$]

§ More complicated than the discrete case
§ Geometrically, we wish to sample uniformly from the area under the curve
§ Two algorithms here:
  § Rejection sampling
  § Importance sampling
Rejection Sampling: Setting
[Figure: an unnormalized density $\tilde{p}(z)$]

§ Assume:
  § Sampling from $p(z)$ is difficult
  § Evaluating $\tilde{p}(z) = Z p(z)$ is easy (and $Z$ may be unknown)
§ Find a simpler distribution (proposal distribution) $q(z)$ from which we can easily draw samples (e.g., Gaussian, Laplace)
§ Find an upper bound $k\, q(z) \geq \tilde{p}(z)$
Rejection Sampling: Algorithm
[Figure: envelope $k q(z)$ over $\tilde{p}(z)$; a pair $(z_0, u_0)$ lies in the acceptance area if $u_0 \leq \tilde{p}(z_0)$, in the rejection area otherwise. Adapted from PRML (Bishop, 2006)]

1. Generate $z_0 \sim q(z)$
2. Generate $u_0 \sim \mathcal{U}[0, k\, q(z_0)]$
3. If $u_0 > \tilde{p}(z_0)$, reject the sample. Otherwise, retain $z_0$
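A sketch of these three steps for a made-up bimodal target $\tilde{p}(z)$, a Gaussian proposal $q$, and a hand-picked envelope constant $k$:

```python
import numpy as np

# Rejection sampling from an unnormalized bimodal target p_tilde.
rng = np.random.default_rng(6)

def p_tilde(z):                            # unnormalized target, Z unknown
    return np.exp(-0.5 * (z - 2) ** 2) + np.exp(-0.5 * (z + 2) ** 2)

def q_pdf(z):                              # proposal q = N(0, 3^2)
    return np.exp(-0.5 * (z / 3.0) ** 2) / (3.0 * np.sqrt(2 * np.pi))

k = 20.0                                   # satisfies k * q(z) >= p_tilde(z)

samples = []
while len(samples) < 1000:
    z0 = 3.0 * rng.normal()                # step 1: z0 ~ q(z)
    u0 = rng.uniform(0.0, k * q_pdf(z0))   # step 2: u0 ~ U[0, k q(z0)]
    if u0 <= p_tilde(z0):                  # step 3: retain z0, else reject
        samples.append(z0)
```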
Properties
§ Accepted pairs $(z, u)$ are uniformly distributed in the area under $\tilde{p}(z)$
§ The probability density of the $z$-coordinates of accepted points must therefore be proportional to $p(z)$
§ The accepted samples are independent samples from $p(z)$
Shortcomings
§ Finding the upper bound $k$ is tricky
§ In high dimensions the factor $k$ is probably huge
§ Low acceptance rate/high rejection rate of samples
Importance Sampling
Key idea: Do not throw away all rejected samples, but give them lower weight by rewriting the integral as an expectation under a simpler distribution $q$ (proposal distribution):

$$\mathbb{E}_p[f(x)] = \int f(x)\, p(x)\, \mathrm{d}x = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, \mathrm{d}x = \mathbb{E}_q\!\left[\frac{f(x)\, p(x)}{q(x)}\right]$$

If we choose $q$ in a way that we can easily sample from it, we can approximate this last expectation by Monte Carlo:

$$\mathbb{E}_q\!\left[\frac{f(x)\, p(x)}{q(x)}\right] \approx \frac{1}{S} \sum_{s=1}^{S} \frac{f(x^{(s)})\, p(x^{(s)})}{q(x^{(s)})} = \frac{1}{S} \sum_{s=1}^{S} w_s f(x^{(s)})\,, \quad x^{(s)} \sim q(x)$$

with importance weights $w_s = p(x^{(s)})/q(x^{(s)})$.
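A sketch with illustrative choices $f(x) = x^2$, target $p = \mathcal{N}(1, 1)$, and a broader proposal $q = \mathcal{N}(0, 3^2)$, so the true answer $\mathbb{E}_p[x^2] = 2$ is known:

```python
import numpy as np

# Importance sampling: estimate E_p[f(x)] with samples from q and
# weights w_s = p(x^(s)) / q(x^(s)).
rng = np.random.default_rng(7)
S = 100_000

def gauss_pdf(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

x = 3.0 * rng.normal(size=S)                         # x^(s) ~ q = N(0, 3^2)
w = gauss_pdf(x, 1.0, 1.0) / gauss_pdf(x, 0.0, 3.0)  # importance weights
estimate = np.mean(w * x ** 2)                       # ~ E_p[x^2] = 2
```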
Properties
§ Unbiased if $q > 0$ wherever $p > 0$ and if we can evaluate $p$
§ Breaks down if we do not have enough samples (puts nearly all weight on a single sample) ⇒ Degeneracy, see also Particle Filtering and SMC (Thrun et al., 2005; Doucet et al., 2000)
§ Many draws from the proposal density $q$ are required, especially in high dimensions
§ Requires that we can evaluate the true $p$. A generalization exists for $\tilde{p}$; it is biased (but consistent).
§ Does not scale to interesting (high-dimensional) problems

⇒ Different approach to sample from complicated (high-dimensional) distributions: Markov Chain Monte Carlo (e.g., Gilks et al., 1996)
Summary
§ Two mathematical challenges in machine learning:
  § Differentiation for optimizing parameters of machine learning models ⇒ Vector calculus and chain rule
  § Integration for computing statistics (e.g., means, variances) and as a principled way to address overfitting ⇒ Monte-Carlo integration to solve intractable integrals
Some Application Areas
§ Image/speech/text/language processing using deep neural networks (e.g., Krizhevsky et al., 2012, or overview in Goodfellow et al., 2016)
§ Data-efficient reinforcement learning and robot learning using Gaussian processes (e.g., Deisenroth & Rasmussen, 2011)
§ High-energy physics using deep neural networks or Gaussian processes (e.g., Sadowski et al., 2014; Bertone et al., 2016)
References I
[1] G. Bertone, M. P. Deisenroth, J. S. Kim, S. Liem, R. R. de Austri, and M. Welling. Accelerating the BSM Interpretation of LHC Data with Machine Learning. arXiv preprint arXiv:1611.02704, 2016.

[2] C. M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer-Verlag, 2006.

[3] M. P. Deisenroth, D. Fox, and C. E. Rasmussen. Gaussian Processes for Data-Efficient Learning in Robotics and Control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):408–423, Feb. 2015.

[4] M. P. Deisenroth and C. E. Rasmussen. PILCO: A Model-Based and Data-Efficient Approach to Policy Search. In Proceedings of the International Conference on Machine Learning, pages 465–472. ACM, June 2011.

[5] A. Doucet, S. J. Godsill, and C. Andrieu. On Sequential Monte Carlo Sampling Methods for Bayesian Filtering. Statistics and Computing, 10:197–208, 2000.

[6] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, editors. Markov Chain Monte Carlo in Practice: Interdisciplinary Statistics. Chapman & Hall, 1996.

[7] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.

[8] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An Introduction to Variational Methods for Graphical Models. Machine Learning, 37:183–233, 1999.

[9] S. Kamthe and M. P. Deisenroth. Data-Efficient Reinforcement Learning with Probabilistic Model Predictive Control. arXiv preprint arXiv:1706.06491, 2017.

[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[11] T. P. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, Jan. 2001.

[12] R. M. Neal. Bayesian Learning for Neural Networks. PhD thesis, Department of Computer Science, University of Toronto, 1996.

[13] A. O'Hagan. Monte Carlo is Fundamentally Unsound. The Statistician, 36(2/3):247–249, 1987.
References II
[14] A. O'Hagan. Bayes-Hermite Quadrature. Journal of Statistical Planning and Inference, 29:245–260, 1991.

[15] M. Opper and O. Winther. Adaptive and Self-averaging Thouless-Anderson-Palmer Mean-field Theory for Probabilistic Modeling. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, 64:056131, 2001.

[16] C. E. Rasmussen and Z. Ghahramani. Bayesian Monte Carlo. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 489–496. The MIT Press, Cambridge, MA, USA, 2003.

[17] C. P. Robert and G. Casella. Monte Carlo Methods. Wiley Online Library, 2004.

[18] P. Sadowski, J. Collado, D. Whiteson, and P. Baldi. Deep Learning, Dark Knowledge, and Dark Matter. In NIPS Workshop on High-Energy Physics and Machine Learning, pages 81–87, 2014.

[19] S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics. The MIT Press, Cambridge, MA, USA, 2005.