Training and Inference for Deep Gaussian Processes
Keyon Vafa
April 26, 2016
Keyon Vafa Training and Inference for Deep GPs April 26, 2016 1 / 50
Motivation
An ideal model for prediction is
accurate
computationally efficient
easy to tune without overfitting
able to provide certainty estimates
This thesis focuses on one particular class of prediction models: deep Gaussian processes for regression. They are a recent model, introduced by Damianou and Lawrence in 2013.
Exact inference is intractable. In this thesis, we introduce a new method to learn deep GPs, the Deep Gaussian Process Sampling (DGPS) algorithm.
The DGPS algorithm
is more straightforward than existing methods
can more easily adapt to using arbitrary kernels
relies on Monte Carlo sampling to circumvent the intractability hurdle
uses pseudo data to ease the computational burden
Gaussian Processes
Table of Contents
1 Gaussian Processes
2 Deep Gaussian Processes
3 Implementation
4 Experiments and Analysis
5 Conclusion
Definition of a Gaussian Process
A function $f$ is a Gaussian process (GP) if any finite set of values $f(x_1), \ldots, f(x_N)$ has a multivariate normal distribution.
The inputs $\{x_n\}_{n=1}^N$ can be vectors from an arbitrarily sized domain.
A GP is specified by a mean function $m(x)$ and a covariance function $k(x, x')$, where
$$m(x) = \mathbb{E}[f(x)], \qquad k(x, x') = \mathrm{Cov}(f(x), f(x')).$$
Covariance Function
The covariance function (or kernel) determines the smoothness and stationarity of functions drawn from a GP.
The squared exponential covariance function has the following form:
$$k(x, x') = \sigma_f^2 \exp\left(-\tfrac{1}{2}(x - x')^T M (x - x')\right)$$
When $M$ is a diagonal matrix with entries $l_i^{-2}$, the $l_i$ are known as the length-scales. $\sigma_f^2$ is known as the signal variance.
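As a concrete illustration, the squared exponential kernel above can be sketched in a few lines of NumPy. This is illustrative code rather than the thesis implementation; the function name and argument defaults are our own.

```python
import numpy as np

def squared_exponential(X1, X2, signal_var=1.0, length_scales=None):
    """k(x, x') = sigma_f^2 exp(-0.5 (x - x')^T M (x - x')),
    with M diagonal and M_ii = 1 / l_i^2."""
    X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
    if length_scales is None:
        length_scales = np.ones(X1.shape[1])
    # Rescale each input dimension by its length-scale, then compute
    # all pairwise squared distances at once.
    A, B = X1 / length_scales, X2 / length_scales
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T)
    return signal_var * np.exp(-0.5 * sq_dists)
```

On the diagonal, $k(x, x) = \sigma_f^2$; off-diagonal entries decay toward zero as inputs move apart relative to the length-scales.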
Sampling from a GP
Figure: Random samples from GP priors, with (signal variance, length-scale) settings (1.0, 0.2), (1.0, 1.0), (1.0, 5.0), (0.2, 1.0), (1.0, 1.0), and (5.0, 1.0). The length-scale controls the smoothness of our function, while the signal variance controls the deviation from the mean.
GPs for Regression
Setup: We are given a set of inputs $X \in \mathbb{R}^{N \times D}$ and corresponding outputs $y \in \mathbb{R}^N$, the function values from a GP evaluated at $X$. We assume a mean function $m(x)$ and a covariance function $k(x, x')$, which rely on parameters $\theta$.
We would like to learn the optimal $\theta$ and estimate the function values $y_*$ for a set of new inputs $X_*$.
To learn $\theta$, we optimize the marginal likelihood:
$$P(y|X, \theta) = \mathcal{N}(y \,|\, 0, K_{XX}).$$
We can then use the multivariate normal conditional distribution to evaluate the predictive distribution:
$$P(y_*|X_*, X, y, \theta) = \mathcal{N}\left(K_{X_*X} K_{XX}^{-1} y,\; K_{X_*X_*} - K_{X_*X} K_{XX}^{-1} K_{XX_*}\right).$$
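The predictive formulas above translate directly into code. A minimal sketch with our own helper names, where a Cholesky solve replaces the explicit inverse $K_{XX}^{-1}$:

```python
import numpy as np

def gp_predict(X, y, X_star, kernel, noise_var=1e-6):
    """Zero-mean GP predictive distribution:
    mean = K_*X Kxx^{-1} y,  cov = K_** - K_*X Kxx^{-1} K_X*."""
    K_xx = kernel(X, X) + noise_var * np.eye(len(X))
    K_sx = kernel(X_star, X)
    K_ss = kernel(X_star, X_star)
    # Cholesky solves instead of forming Kxx^{-1} explicitly.
    L = np.linalg.cholesky(K_xx)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_sx @ alpha
    v = np.linalg.solve(L, K_sx.T)
    cov = K_ss - v.T @ v
    return mean, cov
```

With near-zero noise, the predictive mean interpolates the training data, as expected from the conditioning formula.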
GPs for Regression
Note this is all true because we are assuming the outputs correspond to a Gaussian process. We therefore make the following assumption:
$$\begin{pmatrix} y \\ y_* \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} K_{XX} & K_{XX_*} \\ K_{X_*X} & K_{X_*X_*} \end{pmatrix}\right).$$
Computing $P(y|X)$ and $P(y_*|X_*, X, y)$ only requires matrix algebra on the above assumption.
Example of a GP for Regression
Figure: On the left, data from a sigmoidal curve with noise. On the right, samples from a GP trained on the data (represented by 'x'), using a squared exponential covariance function.
Deep Gaussian Processes
Definition of a Deep Gaussian Process
Formally, we define a deep Gaussian process as the composition of GPs:
$$f^{(1:L)}(x) = f^{(L)}(f^{(L-1)}(\cdots f^{(2)}(f^{(1)}(x)) \cdots))$$
where $f_d^{(l)} \sim \mathcal{GP}\left(0, k_d^{(l)}(x, x')\right)$ for $f_d^{(l)} \in f^{(l)}$.
Deep GP Notation
Each layer $l$ consists of $D^{(l)}$ GPs, where $D^{(l)}$ is the number of units at layer $l$.
For an $L$-layer deep GP, we have
one input layer $x_n \in \mathbb{R}^{D^{(0)}}$,
$L - 1$ hidden layers $\{h_n^{(l)}\}_{l=1}^{L-1}$,
one output layer $y_n$, which we assume to be 1-dimensional.
All layers are completely connected by GPs, each with its own kernel.
Example: Two-Layer Deep GP
[Diagram: $x_n \xrightarrow{f} h_n \xrightarrow{g} y_n$]
We have a one-dimensional input, $x_n$, a one-dimensional hidden unit, $h_n$, and a one-dimensional output, $y_n$. This two-layer network consists of two GPs, $f$ and $g$, where
$$h_n = f(x_n), \text{ where } f \sim \mathcal{GP}(0, k^{(1)}(x, x'))$$
and
$$y_n = g(h_n), \text{ where } g \sim \mathcal{GP}(0, k^{(2)}(h, h')).$$
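Sampling from this two-layer construction amounts to drawing $f$ at the inputs and then drawing $g$ at the resulting hidden values. A sketch with our own helper names (squared exponential kernels, jitter for stability):

```python
import numpy as np

def sample_gp(inputs, length_scale, rng):
    """One draw from GP(0, k) at the given inputs (SE kernel)."""
    d = inputs[:, None] - inputs[None, :]
    K = np.exp(-0.5 * (d / length_scale) ** 2) + 1e-6 * np.eye(len(inputs))
    return np.linalg.cholesky(K) @ rng.standard_normal(len(inputs))

rng = np.random.default_rng(1)
x = np.linspace(-6.0, 6.0, 200)
h = sample_gp(x, length_scale=0.5, rng=rng)  # h_n = f(x_n)
y = sample_gp(h, length_scale=1.0, rng=rng)  # y_n = g(h_n): g's inputs are f's outputs
```

The key point is the second call: the sampled hidden values $h$ serve as the input locations for $g$'s kernel.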
Example: More Complicated Model
[Diagram: inputs $x_1, x_2, x_3$ connect to hidden units $h^{(1)}_1, \ldots, h^{(1)}_4$, which connect to $h^{(2)}_1, \ldots, h^{(2)}_4$, which connect to the output $y$.]
Graphical representation of a more complicated deep GP architecture. Every edge corresponds to a GP between units, as the outputs of each layer are the inputs of the following layer. Our input data is 3-dimensional, while the two hidden layers in this model each have 4 hidden units.
Sampling From a Deep GP
Figure: Three panels of samples from a deep GP: 'Layer 1: Length-scale 0.5' plots $f(x)$ against $x$, 'Layer 2: Length-scale 1.0' plots $g(f(x))$ against $f(x)$, and 'Full Deep GP' plots $g(f(x))$ against $x$.
Samples from deep GPs. As opposed to single-layer GPs, a deep GP can model non-stationary functions (functions whose shape changes along the input space) without the use of a non-stationary kernel.
Comparison with Neural Networks
Similarities: deep architectures, completely connected, single-layer GPs correspond to two-layer neural networks with random weights and infinitely many hidden units.
Differences: a deep GP is nonparametric, has no activation functions, requires specifying kernels, and training is intractable.
Implementation
FITC Approximation for Single-Layer GP
The Fully Independent Training Conditional (FITC) approximation circumvents the $O(N^3)$ training time for a single-layer GP by introducing pseudo data: points that are not in the data set but can be chosen to approximate the function (Snelson and Ghahramani, 2005).
We introduce $M$ pseudo inputs $\bar{X} = \{\bar{x}_m\}_{m=1}^M$ and the corresponding pseudo outputs $\bar{y} = \{\bar{y}_m\}_{m=1}^M$, which correspond to the function values at the pseudo inputs.
Key assumption: conditioned on the pseudo data, the output values are independent.
FITC Approximation for Single-Layer GP
We assume a prior $P(\bar{y}|\bar{X}) = \mathcal{N}(0, K_{\bar{X}\bar{X}})$.
Training takes time $O(NM^2)$, and testing requires $O(M^2)$.
FITC Example
Figure: The predictive mean of a GP trained on sigmoidal data using the FITC approximation. On the left, we use 5 pseudo data points, while on the right, we use 10.
Learning Deep GPs is Intractable
[Diagram: $x_n \xrightarrow{f,\, \theta^{(1)}} h_n \xrightarrow{g,\, \theta^{(2)}} y_n$]
Example: two-layer model, with inputs $X$, outputs $y$, and hidden layer $H$ (which is $N \times 1$). Ideally, a Bayesian treatment would allow us to integrate out the hidden function values to evaluate
$$P(y|X, \theta) = \int P(y|H, \theta^{(2)})\, P(H|X, \theta^{(1)})\, dH = \int \mathcal{N}(y\,|\,0, K_{HH})\, \mathcal{N}(H\,|\,0, K_{XX})\, dH.$$
Evaluating these integrals of Gaussians with respect to kernel functions is intractable.
DGPS Algorithm Overview
The Deep Gaussian Process Sampling algorithm relies on two central ideas:
We sample predictive means and covariances to approximate the marginal likelihood, relying on automatic differentiation techniques to evaluate the gradients and optimize our objective.
We replace every GP with the FITC GP, so the time complexity for $L$ layers and $H$ hidden units per layer is $O(N^2 MLH)$ as opposed to $O(N^3 LH)$.
Related Work
Damianou and Lawrence (2013) also use the FITC approximation at every layer, but they perform inference with approximate variational marginalization. Subsequent methods (Hensman and Lawrence, 2014; Dai et al., 2015; Bui et al., 2016) also use variational approximations.
These methods are able to integrate out the pseudo outputs at each layer, but they rely on integral approximations that restrict the kernel. The DGPS, meanwhile, uses Monte Carlo sampling, which is easier to implement, more intuitive to understand, and extends easily to most kernels.
Keyon Vafa Training and Inference for Deep GPs April 26, 2016 25 / 50
Implementation
Sampling Hidden Values
For inputs X, we calculate the predictive mean and covariance for every unit in the first hidden layer. We then sample values from each predictive distribution.
For every hidden layer thereafter, we take the samples from the previous layer, calculate the predictive mean and covariance, and repeat the sampling until the final layer.
We use K different samples {(μ_k, Σ_k)}_{k=1}^{K} to approximate the marginal likelihood:

\[
P(\mathbf{y} \mid \mathbf{X}) \approx \frac{1}{K} \sum_{k=1}^{K} P(\mathbf{y} \mid \mu_k, \Sigma_k) = \frac{1}{K} \sum_{k=1}^{K} \mathcal{N}(\mathbf{y};\, \mu_k, \Sigma_k)
\]
Keyon Vafa Training and Inference for Deep GPs April 26, 2016 26 / 50
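The sampling scheme above can be sketched in a few lines of NumPy. This is a minimal illustration, not the thesis code: the squared exponential kernel, the jitter level, and the toy pseudo data are all assumptions made here for the example.

```python
import numpy as np

def rbf(A, B, ell=1.0):
    """Squared exponential kernel (assumed here for illustration)."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / ell**2)

def predictive(Xs, Xp, yp, jitter=1e-6):
    """GP predictive mean and diagonal variance at Xs given data (Xp, yp)."""
    K = rbf(Xp, Xp) + jitter * np.eye(len(Xp))
    Ks = rbf(Xs, Xp)
    mu = Ks @ np.linalg.solve(K, yp)
    var = np.diag(rbf(Xs, Xs)) - np.sum(Ks * np.linalg.solve(K, Ks.T).T, 1)
    return mu, np.maximum(var, 1e-12)

def sample_final_moments(X, layers, rng):
    """Propagate one sample layer by layer; return the final layer's moments."""
    H = X
    for Xp, yp in layers:
        mu, var = predictive(H, Xp, yp)
        H = (mu + np.sqrt(var) * rng.standard_normal(len(mu)))[:, None]
    return mu, var

rng = np.random.default_rng(0)
X = np.linspace(-2, 2, 20)[:, None]
y = np.sign(X[:, 0])
layers = [(X, rng.standard_normal(20)), (X, y)]   # toy pseudo data per layer

log_terms = []
for _ in range(10):                               # K = 10 samples
    mu, var = sample_final_moments(X, layers, rng)
    log_terms.append(-0.5 * np.sum((y - mu)**2 / var + np.log(2 * np.pi * var)))
# log of the Monte Carlo average of the K Gaussian likelihoods, computed stably
log_marginal = np.logaddexp.reduce(log_terms) - np.log(len(log_terms))
```

In the DGPS this estimate would be differentiated with automatic differentiation; here it only demonstrates the propagate-then-average structure of the approximation.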
Implementation
FITC for Deep GPs
To make fitting more scalable, we replace every GP in the model with a FITC GP.
For each GP, corresponding to hidden unit d in layer l, we introduce pseudo inputs X_d^{(l)} and corresponding pseudo outputs y_d^{(l)}.
With the addition of the pseudo data, we are required to learn the following set of parameters:

\[
\Theta = \left\{ \left\{ X_d^{(l)},\, y_d^{(l)},\, \theta_d^{(l)} \right\}_{d=1}^{D^{(l)}} \right\}_{l=1}^{L}.
\]
Keyon Vafa Training and Inference for Deep GPs April 26, 2016 27 / 50
Implementation
Example: DGPS Algorithm on 2 Layers
[Graphical model: X_n → f → H_n → g → y_n, with pseudo data and kernel parameters (X^{(1)}, y^{(1)}, θ^{(1)}) attached to f and (X^{(2)}, y^{(2)}, θ^{(2)}) attached to g.]
Our goal is to learn
{(X^{(l)}, y^{(l)})}_{l=1}^{2}, the pseudo data for each layer
θ^{(1)} and θ^{(2)}, the kernel parameters for f and g
Keyon Vafa Training and Inference for Deep GPs April 26, 2016 28 / 50
Implementation
Example: DGPS Algorithm on 2 Layers
[Graphical model: X_n → f → H_n → g → y_n, with pseudo data and kernel parameters (X^{(1)}, y^{(1)}, θ^{(1)}) attached to f and (X^{(2)}, y^{(2)}, θ^{(2)}) attached to g.]
To sample values H from the hidden layer, we use the FITC approximation and assume

\[
P\left(H \mid X, X^{(1)}, y^{(1)}\right) = \mathcal{N}\left(\mu^{(1)}, \Sigma^{(1)}\right)
\]

where

\[
\mu^{(1)} = K_{X X^{(1)}}\, K_{X^{(1)} X^{(1)}}^{-1}\, y^{(1)}, \qquad
\Sigma^{(1)} = \operatorname{diag}\left( K_{X X} - K_{X X^{(1)}}\, K_{X^{(1)} X^{(1)}}^{-1}\, K_{X^{(1)} X} \right).
\]

We obtain K samples {H_k}_{k=1}^{K} from the above distribution.
Keyon Vafa Training and Inference for Deep GPs April 26, 2016 29 / 50
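The predictive moments on this slide translate almost directly into code. A sketch, with a squared exponential kernel assumed and a small jitter added to the pseudo-input Gram matrix for numerical stability:

```python
import numpy as np

def rbf(A, B, ell=1.0):
    """Squared exponential kernel (an assumption for this sketch)."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / ell**2)

def fitc_moments(X, X_pseudo, y_pseudo, jitter=1e-8):
    """mu = K_{X Xp} K_{Xp Xp}^{-1} y_p;
    var = diag(K_{XX} - K_{X Xp} K_{Xp Xp}^{-1} K_{Xp X})."""
    Kuu = rbf(X_pseudo, X_pseudo) + jitter * np.eye(len(X_pseudo))
    Kfu = rbf(X, X_pseudo)
    W = np.linalg.solve(Kuu, Kfu.T)          # K_{Xp Xp}^{-1} K_{Xp X}
    mu = W.T @ y_pseudo
    var = np.diag(rbf(X, X)) - np.sum(Kfu * W.T, axis=1)
    return mu, np.maximum(var, 0.0)

# Sanity check: at the pseudo inputs themselves the mean should recover the
# pseudo outputs and the variance should collapse to (almost) zero.
Xp = np.linspace(-2, 2, 5)[:, None]
yp = np.sin(Xp[:, 0])
mu, var = fitc_moments(Xp, Xp, yp)
```

The diagonal covariance is what makes sampling the hidden values cheap: each of the N points can be drawn independently given its mean and variance.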
Implementation
Example: DGPS Algorithm on 2 Layers
[Graphical model: X_n → f → H_n → g → y_n, with pseudo data and kernel parameters (X^{(1)}, y^{(1)}, θ^{(1)}) attached to f and (X^{(2)}, y^{(2)}, θ^{(2)}) attached to g.]
For each sample H_k, we can approximate

\[
P\left(\mathbf{y} \mid H_k, X^{(2)}, y^{(2)}\right) \approx \mathcal{N}\left(\mu^{(2)}, \Sigma^{(2)}\right)
\]

where

\[
\mu^{(2)} = K_{H_k X^{(2)}}\, K_{X^{(2)} X^{(2)}}^{-1}\, y^{(2)}, \qquad
\Sigma^{(2)} = \operatorname{diag}\left( K_{H_k H_k} - K_{H_k X^{(2)}}\, K_{X^{(2)} X^{(2)}}^{-1}\, K_{X^{(2)} H_k} \right).
\]
Keyon Vafa Training and Inference for Deep GPs April 26, 2016 30 / 50
Implementation
Example: DGPS Algorithm on 2 Layers
Thus, we can approximate the marginal likelihood with our samples:

\[
P(\mathbf{y} \mid \mathbf{X}, \Theta) \approx \frac{1}{K} \sum_{k=1}^{K} P\left(\mathbf{y} \mid H_k, X^{(2)}, y^{(2)}\right).
\]

Incorporating the prior over the pseudo outputs into our objective, we have:

\[
\mathcal{L}(\mathbf{y} \mid \mathbf{X}, \Theta) = \log P(\mathbf{y} \mid \mathbf{X}, \Theta) + \sum_{l=1}^{L} \sum_{d=1}^{D^{(l)}} \log P\left( y_d^{(l)} \,\middle|\, X_d^{(l)} \right).
\]
Keyon Vafa Training and Inference for Deep GPs April 26, 2016 31 / 50
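One term of the prior sum in the objective could look like the following sketch. The zero mean and the squared exponential kernel are assumptions for the example; the slides do not spell out how the prior is parameterized.

```python
import numpy as np

def gp_log_prior(X_pseudo, y_pseudo, ell=1.0, jitter=1e-8):
    """log N(y_pseudo; 0, K(X_pseudo, X_pseudo)) -- one term of the double sum
    over layers l and hidden units d in the objective above."""
    n = len(X_pseudo)
    d2 = np.sum(X_pseudo**2, 1)[:, None] + np.sum(X_pseudo**2, 1)[None, :] \
        - 2 * X_pseudo @ X_pseudo.T
    K = np.exp(-0.5 * d2 / ell**2) + jitter * np.eye(n)
    _, logdet = np.linalg.slogdet(K)
    quad = y_pseudo @ np.linalg.solve(K, y_pseudo)
    return -0.5 * (quad + logdet + n * np.log(2 * np.pi))

# A single pseudo point at the origin with output 0 has prior log-density
# close to the standard normal constant, -0.5 * log(2 * pi).
lp = gp_log_prior(np.zeros((1, 1)), np.zeros(1))
```

Summing these terms over all layers and hidden units regularizes the learned pseudo outputs toward plausible draws from each layer's GP.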
Experiments and Analysis
Table of Contents
1 Gaussian Processes
2 Deep Gaussian Processes
3 Implementation
4 Experiments and Analysis
5 Conclusion
Keyon Vafa Training and Inference for Deep GPs April 26, 2016 32 / 50
Experiments and Analysis
Step Function
We test on a step function with noise: X ∈ [−2, 2], y_i = sign(x_i) + ε_i, where ε_i ∼ N(0, 0.01).
The non-stationarity of a step function is appealing from a deep GP perspective.
Keyon Vafa Training and Inference for Deep GPs April 26, 2016 33 / 50
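Generating this dataset is straightforward; note that ε has variance 0.01, i.e. standard deviation 0.1. A sketch (the uniform sampling of X over [−2, 2] is an assumption; the slides only give the interval):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100                                       # 50, 100, or 200 in the experiments
X = rng.uniform(-2.0, 2.0, N)                 # inputs in [-2, 2]
y = np.sign(X) + rng.normal(0.0, 0.1, N)      # eps ~ N(0, .01): std dev sqrt(.01)

# 80/20 train/test split, as used in the experiments that follow
idx = rng.permutation(N)
X_train, y_train = X[idx[:80]], y[idx[:80]]
X_test, y_test = X[idx[80:]], y[idx[80:]]
```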
Experiments and Analysis
Step Function
Figure: Functions sampled from a single-layer GP (x against f(x)). Evidently, the predictive draws do not fully capture the shape of the step function.
Keyon Vafa Training and Inference for Deep GPs April 26, 2016 34 / 50
Experiments and Analysis
Step Function
Figure: Predictive draws from a single-layer GP and a two-layer deep GP.
Keyon Vafa Training and Inference for Deep GPs April 26, 2016 35 / 50
Experiments and Analysis
Step Function
Figure: Predictive draws from a three-layer deep GP.
Keyon Vafa Training and Inference for Deep GPs April 26, 2016 36 / 50
Experiments and Analysis
Step Function
Figure: Impact of parameter initializations on predictive draws. A random initialization and a smart initialization are compared; each set of panels shows f(x) against x and the learned hidden values.
Keyon Vafa Training and Inference for Deep GPs April 26, 2016 37 / 50
Experiments and Analysis
Step Function
Figure: Experimental results measuring the test log-likelihood per data point and test mean squared error on the noisy step function. We vary the number of layers in the model (1, 2, or 3) along with the number of data points in the original step function (50, 100, or 200, each divided 80/20 into train/test), running 10 trials at each combination.
Keyon Vafa Training and Inference for Deep GPs April 26, 2016 38 / 50
Experiments and Analysis
Step Function
Occasionally, models with deeper architectures outperform those that are more shallow, yet they also possess the widest distributions and the trials with the worst results.
Figure: Test set log-likelihoods per data point and mean squared errors plotted against their training set counterparts for the step function experiment, for 1-, 2-, and 3-layer models. Overfitting does not appear to be a problem.
Keyon Vafa Training and Inference for Deep GPs April 26, 2016 39 / 50
Experiments and Analysis
Step Function
Overfitting does not appear to be a problem.
If we can successfully optimize our objective, deeper architectures are better suited to learning the noisy step function than shallower ones.
However, training and optimization become more difficult as the number of layers grows and the number of parameters increases.
Keyon Vafa Training and Inference for Deep GPs April 26, 2016 40 / 50
Experiments and Analysis
Step Function
Figure: Predictive draws from two identical three-layer models (random seeds 66 and 0), differing only in their random parameter initializations; each set of panels shows f(x) against x and the two layers of hidden values.
Keyon Vafa Training and Inference for Deep GPs April 26, 2016 41 / 50
Experiments and Analysis
Step Function
Ways to combat optimization challenges:
Using random restarts
Decreasing the number of model parameters
Trying different optimization methods
Experimenting with more diverse architectures, e.g. increasing the dimension of the hidden layers
Keyon Vafa Training and Inference for Deep GPs April 26, 2016 42 / 50
Experiments and Analysis
Toy Non-Stationary Data
We create toy non-stationary data to evaluate a deep GP's ability to learn a non-stationary function.
We divide the input space into three regions: X1 ∈ [−4, −3], X2 ∈ [−1, 1], and X3 ∈ [2, 4], each of which consists of 40 data points.
We sample from a GP with length-scale ℓ = 0.25 for regions X1 and X3, and with ℓ = 2 for region X2.
Keyon Vafa Training and Inference for Deep GPs April 26, 2016 43 / 50
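One way to generate such data is sketched below. The uniform placement of inputs within each region is an assumption, and the draw uses an eigendecomposition of the kernel matrix rather than a Cholesky factor, since the very smooth ℓ = 2 region makes the Gram matrix nearly singular.

```python
import numpy as np

def rbf(x, ell):
    """Squared exponential kernel on 1-D inputs."""
    return np.exp(-0.5 * (x[:, None] - x[None, :])**2 / ell**2)

def gp_draw(x, ell, rng):
    """Draw f ~ N(0, K) via the eigendecomposition of K (robust when K is
    nearly rank-deficient, as it is for large length-scales)."""
    w, V = np.linalg.eigh(rbf(x, ell))
    return V @ (np.sqrt(np.clip(w, 0.0, None)) * rng.standard_normal(len(x)))

rng = np.random.default_rng(1)
regions = [(-4, -3, 0.25), (-1, 1, 2.0), (2, 4, 0.25)]  # (lo, hi, length-scale)
X, y = [], []
for lo, hi, ell in regions:
    x = np.sort(rng.uniform(lo, hi, 40))      # 40 data points per region
    X.append(x)
    y.append(gp_draw(x, ell, rng))
X, y = np.concatenate(X), np.concatenate(y)
```

The outer regions wiggle rapidly (short length-scale) while the middle region is smooth, which no single stationary kernel can capture.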
Experiments and Analysis
Toy Non-Stationary Data
Figure: Predictive draws from the single-layer and 2-layer models for the toy non-stationary data with squared exponential kernels (panels: the data, the 1-layer GP, and the 2-layer deep GP with its hidden values).
Keyon Vafa Training and Inference for Deep GPs April 26, 2016 44 / 50
Experiments and Analysis
Toy Non-Stationary Data
Figure: Non-stationary data, 3-layer deep GP. The optimization for a 3-layer model can get stuck in a local optimum; although the predictive draws are non-stationary, the predictions are poor at the tails.
Keyon Vafa Training and Inference for Deep GPs April 26, 2016 45 / 50
Experiments and Analysis
Motorcycle Data
94 points, where the inputs are times in milliseconds since impact in a motorcycle accident and the outputs are the corresponding helmet accelerations.
The dataset is somewhat non-stationary: the accelerations are roughly constant early on but become more variable after a certain time.
Keyon Vafa Training and Inference for Deep GPs April 26, 2016 46 / 50
Experiments and Analysis
Motorcycle Data
Figure: Predictive draws from the single-layer and 2-layer models trained on the motorcycle data with squared exponential kernels (panels: the data, the 1-layer GP, and the 2-layer deep GP with its hidden values; axes are time against acceleration).
Keyon Vafa Training and Inference for Deep GPs April 26, 2016 47 / 50
Conclusion
Table of Contents
1 Gaussian Processes
2 Deep Gaussian Processes
3 Implementation
4 Experiments and Analysis
5 Conclusion
Keyon Vafa Training and Inference for Deep GPs April 26, 2016 48 / 50
Conclusion
Future Directions
Natural extensions include
Trying different optimization methods to avoid getting stuck in local optima
Introducing variational parameters so we do not have to learn pseudo outputs
Extending the model to classification
Exploring the properties of more complex architectures, and evaluating the model likelihood to choose the optimal configuration
Keyon Vafa Training and Inference for Deep GPs April 26, 2016 49 / 50
Conclusion
Acknowledgments
A huge thank-you to Sasha Rush, Finale Doshi-Velez, David Duvenaud, and Miguel Hernandez-Lobato. This thesis would not have been possible without your help and support!
Keyon Vafa Training and Inference for Deep GPs April 26, 2016 50 / 50