Scalable Gaussian Process Methods
Overview
Background: Gaussian processes
Sparse Gaussian Processes
Stochastic Variational Inference
Gaussian Likelihoods
Non-Gaussian likelihoods
Bonus: Deep GPs
GPs - analytical solution
f | y ∼ GP( k(x, X) K⁻¹ y,  k(x, x′) − k(x, X) K⁻¹ k(X, x′) )
[Figure: samples from the GP posterior, f plotted against t]
What about noise?
Gaussian noise is tractable - additional parameter σn
[Figure: GP posterior with Gaussian noise, f plotted against t]
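The analytical posterior above, with Gaussian noise, can be sketched in a few lines of numpy. The squared-exponential kernel and the toy data here are illustrative assumptions, not the talk's own experiment:

```python
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    # Squared-exponential covariance: k(a, b) = v * exp(-|a - b|^2 / (2 l^2))
    d2 = (A[:, None] - B[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior(X, y, Xstar, noise=0.1):
    # f* | y ~ N( k(x*, X) K^-1 y, k(x*, x*) - k(x*, X) K^-1 k(X, x*) )
    # with K = k(X, X) + noise^2 I
    K = rbf(X, X) + noise ** 2 * np.eye(len(X))
    Ks = rbf(Xstar, X)
    Kss = rbf(Xstar, Xstar)
    mean = Ks @ np.linalg.solve(K, y)
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mean, cov

X = np.array([0.0, 5.0, 10.0, 15.0])
y = np.sin(X)
mu, cov = gp_posterior(X, y, X, noise=1e-3)
# With near-zero noise, the posterior mean at the training inputs
# reproduces the observations.
```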
Overview
Background: Gaussian processes
Sparse Gaussian Processes
Stochastic Variational Inference
Gaussian Likelihoods
Non-Gaussian likelihoods
Bonus: Deep GPs
Motivation
Inference in a GP has the following demands:
Complexity: O(n³), Storage: O(n²)
Inference in a sparse GP has the following demands:
Complexity: O(nm²), Storage: O(nm)
where we get to pick m!
Still not good enough!
Big Data
- In parametric models, stochastic optimisation is used.
- This allows for application to Big Data.
This work
- Show how to use Stochastic Variational Inference in GPs
- Stochastic optimisation scheme: each step requires O(m³)
Overview
Background: Gaussian processes
Sparse Gaussian Processes
Stochastic Variational Inference
Gaussian Likelihoods
Non-Gaussian likelihoods
Bonus: Deep GPs
Incomplete bibliography
(With apologies for 27,000 omissions)
- Csató and Opper, 2002
- Seeger, 2003; Lawrence and Seeger, 2003
- Snelson and Ghahramani, 2006
- Quiñonero-Candela and Rasmussen, 2005
- Titsias, 2009
- Álvarez and Lawrence, 2011
Computational savings
Knn ≈ Qnn = Knm Kmm⁻¹ Kmn

Instead of inverting Knn, we make a low-rank (Nyström) approximation and invert Kmm instead.
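The quality of this low-rank approximation can be checked numerically. The kernel, lengthscale, and placement of the inducing inputs below are illustrative assumptions:

```python
import numpy as np

def rbf(A, B, l=2.0):
    # Squared-exponential covariance with unit variance
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / l ** 2)

n, m = 200, 20
X = np.linspace(0, 15, n)
Z = np.linspace(0, 15, m)            # inducing inputs: we get to pick m

Knn = rbf(X, X)
Knm = rbf(X, Z)
Kmm = rbf(Z, Z) + 1e-8 * np.eye(m)   # small jitter for numerical stability

# Rank-m Nystrom approximation: only Kmm (m x m) needs inverting
Qnn = Knm @ np.linalg.solve(Kmm, Knm.T)
err = np.max(np.abs(Knn - Qnn))
# For a smooth kernel, a modest m already gives a very close approximation.
```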
Information capture
Everything we want to do with a GP involves marginalising f:

- Predictions
- Marginal likelihood
- Estimating covariance parameters

The posterior of f is the central object. This means inverting Knn.
Introducing u
Take an extra M points on the function, u = f(Z).

p(y, f, u) = p(y | f) p(f | u) p(u)

p(y | f) = N(y | f, σ²I)
p(f | u) = N(f | Knm Kmm⁻¹ u, K̃),  where K̃ = Knn − Knm Kmm⁻¹ Kmn
p(u) = N(u | 0, Kmm)
[Figure: function values over the input space X, showing the prior p(f) = N(0, Knn) for f(x) ∼ GP, the posterior p(f | y, X), the inducing prior p(u) = N(0, Kmm), and the inducing posterior p(u | y, X)]
The alternative posterior
Instead of doing
p(f | y, X) = p(y | f) p(f | X) / ∫ p(y | f) p(f | X) df
We’ll do
p(u | y, Z) = p(y | u) p(u | Z) / ∫ p(y | u) p(u | Z) du
but p(y |u) involves inverting Knn
Variational marginalisation of f

p(y | u) = p(y | f) p(f | u) / p(f | y, u)

ln p(y | u) = ln p(y | f) + ln [ p(f | u) / p(f | y, u) ]

ln p(y | u) = E_{p(f | u)}[ ln p(y | f) ] + E_{p(f | u)}[ ln ( p(f | u) / p(f | y, u) ) ]

ln p(y | u) = ln p̃(y | u) + KL[ p(f | u) ‖ p(f | y, u) ],

where ln p̃(y | u) := E_{p(f | u)}[ ln p(y | f) ]; since the KL term is non-negative, ln p(y | u) ≥ ln p̃(y | u).

No inversion of Knn required.
An approximate likelihood
p̃(y | u) = ∏_{i=1}^{n} N( yᵢ | kmnᵀ Kmm⁻¹ u, σ² ) exp{ −(knn − kmnᵀ Kmm⁻¹ kmn) / (2σ²) }

A straightforward likelihood approximation, and a penalty term.
Now we can marginalise u
p(u | y, Z) = p(y | u) p(u | Z) / ∫ p(y | u) p(u | Z) du

- Computing the (approximate) posterior costs O(nm²)
- We also get a lower bound of the marginal likelihood
- This is the standard variational sparse GP (Titsias, 2009).
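For a Gaussian likelihood this posterior is available in closed form: with A = Kmm + σ⁻² Kmn Knm, the approximate posterior over u has mean σ⁻² Kmm A⁻¹ Kmn y and covariance Kmm A⁻¹ Kmm, and the dominant cost is forming Kmn Knm in O(nm²). A numpy sketch, with an RBF kernel and toy sine data as illustrative assumptions:

```python
import numpy as np

def rbf(A, B, l=2.0):
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / l ** 2)

def sparse_gp_posterior(X, y, Z, sigma2=0.1):
    # Closed-form optimal q(u) for a Gaussian likelihood:
    #   A    = Kmm + s^-2 Kmn Knm
    #   mean = s^-2 Kmm A^-1 Kmn y
    #   cov  = Kmm A^-1 Kmm
    m = len(Z)
    Kmm = rbf(Z, Z) + 1e-8 * np.eye(m)   # jitter for stability
    Kmn = rbf(Z, X)
    A = Kmm + Kmn @ Kmn.T / sigma2       # the O(n m^2) step
    mean = Kmm @ np.linalg.solve(A, Kmn @ y) / sigma2
    cov = Kmm @ np.linalg.solve(A, Kmm)
    return mean, cov

X = np.linspace(0, 15, 500)
y = np.sin(X)
Z = np.linspace(0, 15, 20)
mu_u, S_u = sparse_gp_posterior(X, y, Z, sigma2=0.01)
# mu_u approximates the function values at the inducing inputs, f(Z).
```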
Overview
Background: Gaussian processes
Sparse Gaussian Processes
Stochastic Variational Inference
Gaussian Likelihoods
Non-Gaussian likelihoods
Bonus: Deep GPs
Stochastic Variational Inference
- Combine the ideas of stochastic optimisation with variational inference
- Example: apply Latent Dirichlet Allocation to Project Gutenberg
- Can apply variational techniques to Big Data
- How could this work in GPs?
But GPs are not factorizing models?
- The variational marginalisation of f introduced factorisation across the datapoints (conditioned on u)
- Marginalising u re-introduced dependencies between the data
- Solution: a variational treatment of u
Variational Bayes

p(u | y) = p(y | u) p(u) / p(y)

p(y) = p(y | u) p(u) / p(u | y)

ln p(y) = ln [ p(y | u) p(u) / q(u) ] + ln [ q(u) / p(u | y) ]

ln p(y) = E_{q(u)}[ ln ( p(y | u) p(u) / q(u) ) ] + E_{q(u)}[ ln ( q(u) / p(u | y) ) ]

ln p(y) = L + KL( q(u) ‖ p(u | y) )
The objective
L = E_{q(f)}[ log p(y | f) ] − KL( q(u) ‖ p(u) )

- Tractable for Gaussian likelihoods
- Numerical integration (1D) for intractable likelihoods (classification)
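The KL( q(u) ‖ p(u) ) term in the objective is a divergence between two multivariate Gaussians, which has a closed form. A minimal sketch for q(u) = N(m, S) against the zero-mean prior p(u) = N(0, Kmm):

```python
import numpy as np

def kl_gauss(m, S, K):
    # KL( N(m, S) || N(0, K) )
    #   = 0.5 * ( tr(K^-1 S) + m^T K^-1 m - M + ln|K| - ln|S| )
    M = len(m)
    trace_term = np.trace(np.linalg.solve(K, S))
    mahal = m @ np.linalg.solve(K, m)
    _, logdet_K = np.linalg.slogdet(K)
    _, logdet_S = np.linalg.slogdet(S)
    return 0.5 * (trace_term + mahal - M + logdet_K - logdet_S)

# Sanity checks: zero when q matches p; grows with the mean offset.
kl_zero = kl_gauss(np.zeros(3), np.eye(3), np.eye(3))
kl_shift = kl_gauss(np.ones(3), np.eye(3), np.eye(3))
```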
Optimisation
The variational objective L is a function of

- the parameters of the covariance function
- the parameters of q(u)
- the inducing inputs, Z

Original strategy: set Z. Take the data in small minibatches; take stochastic gradient steps in the covariance function parameters, and stochastic natural gradient steps in the parameters of q(u).

New strategy: represent S as LLᵀ (unconstrained). Throw m, L, Z, θ at Adagrad.
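The point of the new strategy can be sketched in isolation: parameterise S through an unconstrained lower-triangular factor L so that S = LLᵀ stays positive semi-definite under gradient steps, and let Adagrad pick per-parameter learning rates. The quadratic toy objective below is an assumption, standing in for the minibatch ELBO estimate:

```python
import numpy as np

def adagrad_step(param, grad, accum, lr=0.1, eps=1e-8):
    # Adagrad: divide by the root of the accumulated squared gradients,
    # giving each parameter its own effective learning rate.
    accum += grad ** 2
    param -= lr * grad / (np.sqrt(accum) + eps)
    return param, accum

m = 5
L = np.eye(m)                  # unconstrained Cholesky factor: S = L L^T
accum = np.zeros_like(L)
target = 2.0 * np.eye(m)       # toy target covariance (assumption)

for _ in range(500):
    S = L @ L.T
    # gradient of 0.5 * ||S - target||_F^2 with respect to L
    grad = 2.0 * (S - target) @ L
    grad = np.tril(grad)       # keep L lower-triangular
    L, accum = adagrad_step(L, grad, accum)

S = L @ L.T
# S converges toward the target while remaining positive semi-definite
# by construction, with no explicit constraint handling.
```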
Overview
Background: Gaussian processes
Sparse Gaussian Processes
Stochastic Variational Inference
Gaussian Likelihoods
Non-Gaussian likelihoods
Bonus: Deep GPs
UK apartment prices
- Monthly price paid data for February to October 2012 (England and Wales)
- from http://data.gov.uk/dataset/land-registry-monthly-price-paid-data/
- 75,000 entries
- Cross-referenced against a postcode database to get latitude and longitude
- Regressed the normalised logarithm of the apartment prices
Airline data
- Flight delays for every commercial flight in the USA from January to April 2008.
- Average delay was 30 minutes.
- We randomly selected 800,000 datapoints (we have limited memory!)
- 700,000 train, 100,000 test
[Figure: learned inverse lengthscales per input: Month, DayOfMonth, DayOfWeek, DepTime, ArrTime, AirTime, Distance, PlaneAge]
[Figure: RMSE of GPs trained on subsets (N = 800, 1000, 1200) compared with RMSE of the SVI GP over iterations]
Overview
Background: Gaussian processes
Sparse Gaussian Processes
Stochastic Variational Inference
Gaussian Likelihoods
Non-Gaussian likelihoods
Bonus: Deep GPs
What about Classification?
- Some of the above integrals are intractable, because of the non-Gaussian likelihood
- Since the likelihood factorizes, we need to do N one-dimensional integrals
- Gauss-Hermite quadrature works well! For derivatives also
- Paper in AISTATS 2015
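Each of those one-dimensional integrals is an expectation of a log-likelihood under a Gaussian, which Gauss-Hermite quadrature handles after a change of variables. A sketch for a Bernoulli likelihood; the logistic link is an illustrative assumption:

```python
import numpy as np

def expected_log_lik(y, mu, var, n_points=20):
    # E_{N(f | mu, var)}[ log p(y | f) ] via Gauss-Hermite quadrature.
    # Change of variables f = mu + sqrt(2 var) x maps the Gaussian
    # expectation onto the weight function exp(-x^2).
    x, w = np.polynomial.hermite.hermgauss(n_points)
    f = mu + np.sqrt(2.0 * var) * x
    # log sigmoid(y f) for y in {-1, +1}, written stably
    log_lik = -np.log1p(np.exp(-y * f))
    return np.sum(w * log_lik) / np.sqrt(np.pi)

# As the variance shrinks, the expectation collapses to log sigmoid(y * mu).
val = expected_log_lik(y=1, mu=0.0, var=1e-12)
```

The same quadrature nodes can be reused for the derivatives of the expectation with respect to mu and var, which is what the gradient-based optimisation needs.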
Is it competitive?
[Figure: holdout negative log probability against time (seconds) for KLSp, MFSp, and EPFitc at M = 4, 50, 200]
Big data?
[Figure: error (%) and −log p(y) against time (seconds)]
Overview
Background: Gaussian processes
Sparse Gaussian Processes
Stochastic Variational Inference
Gaussian Likelihoods
Non-Gaussian likelihoods
Bonus: Deep GPs