Prediction with Gaussian Processes: Basic Ideas
Chris Williams
School of Informatics, University of Edinburgh, UK
Overview
• Bayesian Prediction
• Gaussian Process Priors over Functions
• GP regression
• GP classification
• Regularization, SVMs
Bayesian prediction
• Define a prior over functions
• Observe data, obtain a posterior distribution over functions
P(f|D) ∝ P(f) P(D|f)
posterior ∝ prior × likelihood
• Make predictions by averaging predictions over the posterior P (f |D)
• Averaging mitigates overfitting
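A toy numerical illustration (not from the slides) of the posterior ∝ prior × likelihood rule and of prediction by posterior averaging, using an assumed discrete set of candidate functions and assumed data:

```python
# Toy illustration (assumed example, not from the slides): Bayes' rule over a
# small discrete set of candidate functions, and prediction by posterior averaging.
import numpy as np

candidates = [lambda x: 0.0 * x, lambda x: x, lambda x: x ** 2]   # hypothetical functions
prior = np.array([1.0, 1.0, 1.0]) / 3.0                           # P(f): uniform prior

X = np.array([0.2, 0.5, 0.9])        # assumed observations with Gaussian noise, variance 0.1
y = np.array([0.25, 0.45, 0.95])
noise_var = 0.1

# P(D|f) up to a constant, then posterior ∝ prior × likelihood
log_lik = np.array([-0.5 * np.sum((y - f(X)) ** 2) / noise_var for f in candidates])
posterior = prior * np.exp(log_lik - log_lik.max())
posterior /= posterior.sum()

# Predict at a new input by averaging the candidates' predictions under the posterior
x_star = 0.7
print(posterior, sum(p * f(x_star) for p, f in zip(posterior, candidates)))
```

With a GP prior the set of candidate functions is continuous rather than discrete, but the same averaging idea applies.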
Bayesian Linear Regression
f(x) = ∑_i w_i φ_i(x),   w ∼ N(0, Σ)
Samples from the prior
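A minimal sketch of how such prior samples could be generated; the Gaussian-bump basis functions, their number, and the choice Σ = I are illustrative assumptions, not from the slides:

```python
# Samples from the Bayesian linear regression prior f(x) = Σ_i w_i φ_i(x), w ~ N(0, Σ).
# The Gaussian-bump basis functions and Σ = I are illustrative assumptions.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
centres = np.linspace(0, 1, 10)                     # basis function centres (assumed)

def features(x):
    # φ_i(x): Gaussian bumps of width 0.1 centred on `centres`
    return np.exp(-0.5 * ((x[:, None] - centres) / 0.1) ** 2)

x = np.linspace(0, 1, 200)
Phi = features(x)
for _ in range(3):                                  # three sample functions from the prior
    w = rng.standard_normal(len(centres))           # w ~ N(0, I)
    plt.plot(x, Phi @ w)
plt.xlabel("x"); plt.ylabel("f(x)")
plt.show()
```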
Gaussian Processes: Priors over functions
• For a stochastic process f(x), the mean function is
µ(x) = E[f(x)].
Assume µ(x) ≡ 0 ∀x
• Covariance function
k(x,x′) = E[f(x)f(x′)].
• Forget those weights! We should be thinking of defining priors over functions, not weights.
• Priors over function-space can be defined directly by choosing a covariance function, e.g.
k(x, x′) = exp(−w|x − x′|)
• Gaussian processes are stochastic processes defined by their mean and covariance functions.
Examples of GPs
• σ_0² + σ_1² x x′
• exp(−|x − x′|)
• exp(−(x − x′)²)
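A hedged sketch of drawing sample functions from zero-mean GP priors with the three covariance functions above, by sampling f ∼ N(0, K) on a grid; the grid and the choice σ_0 = σ_1 = 1 for the linear kernel are assumptions:

```python
# Draw sample functions from zero-mean GP priors with the covariance functions above,
# by sampling f ~ N(0, K) on a grid of inputs (grid and parameter values assumed).
import numpy as np

x = np.linspace(0, 1, 100)
d = np.abs(x[:, None] - x[None, :])                 # |x - x'| for every pair of grid points

kernels = {
    "linear":       1.0 + x[:, None] * x[None, :],  # σ_0² + σ_1² x x' with σ_0 = σ_1 = 1
    "exponential":  np.exp(-d),                     # exp(-|x - x'|)
    "squared exp.": np.exp(-d ** 2),                # exp(-(x - x')²)
}

rng = np.random.default_rng(1)
for name, K in kernels.items():
    K = K + 1e-6 * np.eye(len(x))                   # jitter for numerical stability
    f = rng.multivariate_normal(np.zeros(len(x)), K)
    print(name, f[:5])                              # one sample function on the grid
```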
Connection to feature space
A Gaussian process prior over functions can be thought of as a Gaussian prior on the coefficients w ∼ N(0, Λ), where
f(x) = ∑_{i=1}^{N_F} w_i φ_i(x) = w · Φ(x)
Φ(x) = (φ_1(x), φ_2(x), . . . , φ_{N_F}(x))^T
In many interesting cases, N_F = ∞
Choose Φ(·) as the eigenfunctions of the kernel k(x, x′) w.r.t. p(x) (Mercer):
∫ k(x, y) p(x) φ_i(x) dx = λ_i φ_i(y)
Gaussian process regression
Dataset D = (x_i, y_i)_{i=1}^n, Gaussian likelihood p(y_i|f_i) = N(f_i, σ²)
f(x) = ∑_{i=1}^n α_i k(x, x_i)
where
α = (K + σ²I)^{−1} y
var(x) = k(x, x) − k^T(x)(K + σ²I)^{−1} k(x)
in time O(n³), with k(x) = (k(x, x_1), . . . , k(x, x_n))^T
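A minimal sketch of these regression equations on toy 1-d data; the squared exponential kernel, its lengthscale, and the noise level are assumptions:

```python
# The GP regression equations above on toy 1-d data; the squared exponential
# kernel, its lengthscale, and the noise variance are assumptions.
import numpy as np

def k(a, b, ell=0.2):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

X = np.array([0.1, 0.4, 0.7, 0.9])                  # training inputs (assumed)
y = np.sin(2 * np.pi * X)                           # training targets (assumed)
sigma2 = 0.01                                       # noise variance σ² (assumed)

K = k(X, X)
A = K + sigma2 * np.eye(len(X))
alpha = np.linalg.solve(A, y)                       # α = (K + σ²I)⁻¹ y, an O(n³) solve

x_star = np.array([0.5])
k_star = k(X, x_star)[:, 0]                         # k(x*) = (k(x*, x_1), ..., k(x*, x_n))ᵀ
mean = k_star @ alpha                               # f(x*) = Σ_i α_i k(x*, x_i)
var = k(x_star, x_star)[0, 0] - k_star @ np.linalg.solve(A, k_star)
print(mean, var)
```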
After 1 observation: [plot of the GP posterior, Y(x) against x]
After 2 observations: [plot of the GP posterior, Y(x) against x]
Example: Inverse Dynamics of a Robot Arm
• 7 d.o.f. robot. Inverse dynamics maps 21-d vector (7 positions, 7 velocities, 7 accelerations) to 7 torques.
• 44,484 training examples
Method  SMSE   MSLL
LR      0.075  -1.29
RBD     0.104  -
LWPR    0.040  -
GPR     0.011  -2.25
• Approximation methods can reduce O(n³) to O(nm²) for m ≪ n
• Subset of datapoints, subset of regressors, projected process approximation, Bayesian Committee Machine
• GP regression is competitive with other kernel methods (e.g. SVMs)
• Can use non-Gaussian likelihoods (e.g. Student-t)
Linear smoothing
f(x∗) = k^T(x∗)(K + σ_n²I)^{−1} y
[Plots of the smoothing weights against x for σ_n² = 0.1 and σ_n² = 10]
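A sketch of how the smoothing weights k^T(x∗)(K + σ_n²I)^{−1} in the figure could be computed for the two noise levels; the training inputs and the kernel lengthscale are assumptions:

```python
# Compute the smoothing weights kᵀ(x*)(K + σ_n²I)⁻¹ for the two noise variances
# in the figure above; the training inputs and kernel lengthscale are assumptions.
import numpy as np

def k(a, b, ell=0.1):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

X = np.linspace(0, 1, 50)                           # training input locations (assumed)
x_star = np.array([0.5])

for noise_var in (0.1, 10.0):                       # the two values of σ_n² shown above
    weights = np.linalg.solve(k(X, X) + noise_var * np.eye(len(X)), k(X, x_star)[:, 0])
    print(noise_var, weights.max())                 # larger σ_n² gives smaller, broader weights
```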
Adapting kernel parameters
k(x_i, x_j) = v_0 exp(−(1/2) ∑_{l=1}^d w_l (x_il − x_jl)²)

[Sample functions for w_1 = 5.0, w_2 = 5.0 and for w_1 = 5.0, w_2 = 0.5]
• For GPs, the marginal likelihood (aka Bayesian evidence) log P(y|θ) can be optimized w.r.t. the kernel parameters θ = (v_0, w)
• For GP regression, log P(y|θ) can be computed exactly:
log P(y|θ) = −(1/2) log|K + σ²I| − (1/2) y^T(K + σ²I)^{−1} y − (n/2) log 2π
• Can also use LOO-CV
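A sketch of evaluating the exact log marginal likelihood above for the ARD kernel k(x_i, x_j) = v_0 exp(−(1/2) ∑_l w_l (x_il − x_jl)²); the toy data, the Cholesky-based implementation, and the particular θ values are assumptions:

```python
# Exact log marginal likelihood for the ARD kernel above; the toy data, the
# Cholesky-based implementation, and the parameter values are assumptions.
import numpy as np

def ard_kernel(X, v0, w):
    # k(x_i, x_j) = v_0 exp(-1/2 Σ_l w_l (x_il - x_jl)²)
    sq = (((X[:, None, :] - X[None, :, :]) ** 2) * w).sum(-1)
    return v0 * np.exp(-0.5 * sq)

def log_marginal_likelihood(X, y, v0, w, sigma2):
    n = len(y)
    Ky = ard_kernel(X, v0, w) + sigma2 * np.eye(n)
    L = np.linalg.cholesky(Ky)                      # log|K + σ²I| = 2 Σ_i log L_ii
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-np.sum(np.log(np.diag(L)))             # -1/2 log|K + σ²I|
            - 0.5 * y @ alpha                       # -1/2 yᵀ(K + σ²I)⁻¹ y
            - 0.5 * n * np.log(2 * np.pi))          # -n/2 log 2π

rng = np.random.default_rng(2)
X = rng.uniform(size=(20, 2))                       # 20 inputs in 2-d (assumed)
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(20)
print(log_marginal_likelihood(X, y, v0=1.0, w=np.array([5.0, 0.5]), sigma2=0.01))
```

This value can then be handed to an optimizer to adapt θ = (v_0, w), as described above.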
Previous work
• Wiener-Kolmogorov prediction theory (1940’s)
• Splines (Kimeldorf and Wahba, 1971; Wahba 1990)
• ARMA models for time-series
• Kriging in geostatistics (for 2-d or 3-d spaces)
• Regularization networks (Poggio and Girosi, 1989, 1990)
• Design and Analysis of Computer Experiments (Sacks et al, 1989)
• Infinite neural networks (Neal, 1995)
GP prediction for classification problems
Squash through the logistic (or erf) function: [figure mapping the latent function f to a probability π ∈ (0, 1)]
• Can also handle multi-class problems
• Likelihood
− log P(y_i|f_i) = log(1 + exp(−y_i f_i))
• Integrals can’t be done analytically
– Find maximum a posteriori value of P(f|y) (Williams and Barber, 1997)
– Expectation-Propagation (Minka, 2001; Opper and Winther, 2000)
– MCMC methods (Neal, 1997)
MAP Gaussian process classification
To obtain the MAP approximation to the GPC solution, we find the f that maximizes the concave function
Ψ(f) = −∑_{i=1}^n log(1 + exp(−y_i f_i)) − (1/2) f^T K^{−1} f + c
The optimization is carried out using the Newton-Raphson iteration
f^{new} = K(I + WK)^{−1}(Wf + (t − π))
where W = diag(π_1(1 − π_1), . . . , π_n(1 − π_n)) and π_i = σ(f_i). Basic complexity is O(n³)
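A hedged sketch of this Newton-Raphson iteration with the logistic likelihood; the toy kernel matrix and the 0/1 targets t are assumptions:

```python
# The Newton-Raphson iteration above for MAP GP classification with the logistic
# likelihood; the toy kernel matrix and 0/1 targets t are assumptions.
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def map_gpc(K, t, n_iter=20):
    """Iterate f_new = K(I + WK)^(-1)(Wf + (t - π)) with W = diag(π_i(1 - π_i))."""
    n = len(t)
    f = np.zeros(n)
    for _ in range(n_iter):
        pi = sigmoid(f)
        W = np.diag(pi * (1.0 - pi))
        f = K @ np.linalg.solve(np.eye(n) + W @ K, W @ f + (t - pi))   # O(n³) per iteration
    return f

# toy 1-d example (assumed): two clusters of inputs with 0/1 class labels
X = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
t = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / 0.2 ** 2)
print(sigmoid(map_gpc(K, t)))                       # fitted probabilities π_i at the training inputs
```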
For a test point x∗ we compute f(x∗) and the variance, and make the prediction as
P(class 1|x∗, D) = ∫ σ(f∗) p(f∗|y) df∗
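This is a one-dimensional integral; a sketch of estimating it by simple Monte Carlo, assuming a Gaussian approximation to p(f∗|y) with an illustrative mean and variance:

```python
# Estimate P(class 1 | x*, D) = ∫ σ(f*) p(f*|y) df* by simple Monte Carlo, assuming
# a Gaussian approximation to p(f*|y) with the given (illustrative) mean and variance.
import numpy as np

def predict_class1(mean, var, n_samples=100_000, rng=np.random.default_rng(3)):
    f_star = rng.normal(mean, np.sqrt(var), size=n_samples)
    return np.mean(1.0 / (1.0 + np.exp(-f_star)))   # average of σ(f*) over p(f*|y)

print(predict_class1(mean=1.2, var=0.5))            # probability of class 1 at this test point
```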
[Figure: two-class training data, points marked ° and •]
USPS Digits: 3 vs 5
[Three panels over log lengthscale log(l) and log magnitude log(σ_f): the log marginal likelihood, the information about the test targets in bits, and the number of test misclassifications]
[Figure: misclassification rate against rejection rate for EP, Laplace, SVM, P1NN, LSC, and linear probit]
Covariance functions
Covariance function    Equation
constant               σ_0²
linear                 ∑_{d=1}^D σ_d² x_d x′_d
polynomial             (x · x′ + σ_0²)^p
squared exponential    exp(−r²/(2ℓ²))
Matérn                 (1/(2^{ν−1} Γ(ν))) (√(2ν) r/ℓ)^ν K_ν(√(2ν) r/ℓ)
exponential            exp(−r/ℓ)
rational quadratic     (1 + r²/(2αℓ²))^{−α}
neural network         sin^{−1}( 2x^T Σ x′ / √((1 + 2x^T Σ x)(1 + 2x′^T Σ x′)) )
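A sketch of a few of these covariance functions written as functions of r = |x − x′| (the stationary cases); using scipy's Bessel function K_ν is a convenience choice, and the parameter defaults are assumptions:

```python
# A few of the covariance functions from the table, written as functions of
# r = |x - x'| (the stationary cases); using scipy's K_ν is a convenience choice.
import numpy as np
from scipy.special import gamma, kv                 # kv(ν, z) is the modified Bessel function K_ν

def squared_exponential(r, ell=1.0):
    return np.exp(-r ** 2 / (2 * ell ** 2))

def matern(r, nu=1.5, ell=1.0):
    r = np.maximum(r, 1e-12)                        # avoid the 0·∞ form at r = 0 (the limit is 1)
    z = np.sqrt(2 * nu) * r / ell
    return (2 ** (1 - nu) / gamma(nu)) * z ** nu * kv(nu, z)

def rational_quadratic(r, alpha=2.0, ell=1.0):
    return (1 + r ** 2 / (2 * alpha * ell ** 2)) ** (-alpha)

r = np.linspace(0.0, 3.0, 5)
print(squared_exponential(r), matern(r), rational_quadratic(r))
```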
Regularization
• f(x) is the (functional) minimizer of
J[f] = (1/(2σ²)) ∑_{i=1}^n (y_i − f(x_i))² + (1/2) ‖f‖²_H
(1st term = − log-likelihood, 2nd term = − log-prior)
• However, the regularization framework does not yield predictive variance or marginal likelihood
SVMs
1-norm soft margin classifier has the form
f(x) = ∑_{i=1}^n y_i α_i^∗ k(x, x_i) + w_0^∗
where yi ∈ {−1,1} and α∗ optimizes the quadratic form
Q(α) = ∑_{i=1}^n α_i − (1/2) ∑_{i,j=1}^n y_i y_j α_i α_j k(x_i, x_j)
subject to the constraints
∑_{i=1}^n y_i α_i = 0
0 ≤ α_i ≤ C,   i = 1, . . . , n
This is a quadratic programming problem. Can be solved in many ways, e.g.with interior point methods, or special purpose algorithms such as SMO.
Basic complexity is O(n3).
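A hedged sketch of solving this QP with the cvxopt package (one of many possible solvers, not the method named on the slide); the toy data, the kernel, and C are assumptions:

```python
# Solve the dual QP above with cvxopt; the toy data and C are assumptions.
# cvxopt minimizes 1/2 xᵀPx + qᵀx, so we negate Q(α).
import numpy as np
from cvxopt import matrix, solvers

X = np.array([0.1, 0.2, 0.8, 0.9])                  # toy 1-d inputs (assumed)
y = np.array([-1.0, -1.0, 1.0, 1.0])                # labels in {-1, 1}
C = 10.0
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / 0.2 ** 2)
n = len(y)

P = matrix(np.outer(y, y) * K)                      # y_i y_j k(x_i, x_j)
q = matrix(-np.ones(n))                             # maximizing Σα_i  ⇒  minimizing -Σα_i
G = matrix(np.vstack([-np.eye(n), np.eye(n)]))      # encodes 0 ≤ α_i ≤ C
h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
A = matrix(y.reshape(1, -1))                        # Σ y_i α_i = 0
b = matrix(0.0)

alpha = np.array(solvers.qp(P, q, G, h, A, b)["x"]).ravel()
print(alpha)                                        # the optimal α*
```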
• Define g_σ(z) = log(1 + exp(−z))
• SVM classifier is similar to GP classifier, but with g_σ replaced by g_SVM(z) = [1 − z]_+ (Wahba, 1999)
[Figure: the losses log(1 + exp(−z)) and max(1 − z, 0) plotted against z]
• Note that the MAP solution using g_σ is not sparse, but gives a probabilistic output
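A tiny sketch comparing the two losses on a few values of z, illustrating that the hinge loss is exactly zero for z ≥ 1 (the source of SVM sparsity) while the logistic loss never is:

```python
# Compare the two losses discussed above on a few values of z.
import numpy as np

z = np.linspace(-2.0, 4.0, 7)
g_sigma = np.log(1 + np.exp(-z))                    # g_σ(z) = log(1 + exp(-z))
g_svm = np.maximum(1 - z, 0)                        # g_SVM(z) = [1 - z]_+
print(np.column_stack([z, g_sigma, g_svm]))         # hinge loss is 0 for z ≥ 1; logistic loss is not
```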