Prediction with Gaussian Processes: Basic Ideas
Chris Williams
School of Informatics, University of Edinburgh, UK
Overview
• Bayesian Prediction
• Gaussian Process Priors over Functions
• GP regression
• GP classification
• Regularization, SVMs
Bayesian prediction
• Define a prior over functions
• Observe data, obtain a posterior distribution over functions
P(f|D) ∝ P(f) P(D|f)
posterior ∝ prior × likelihood
• Make predictions by averaging predictions over the posterior P (f |D)
• Averaging mitigates overfitting
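A toy numerical illustration (not from the slides) of the posterior ∝ prior × likelihood rule and of prediction by posterior averaging, using an assumed discrete set of candidate functions and assumed data:

```python
# Toy illustration (assumed example, not from the slides): Bayes' rule over a
# small discrete set of candidate functions, and prediction by posterior averaging.
import numpy as np

candidates = [lambda x: 0.0 * x, lambda x: x, lambda x: x ** 2]   # hypothetical functions
prior = np.array([1.0, 1.0, 1.0]) / 3.0                           # P(f): uniform prior

X = np.array([0.2, 0.5, 0.9])        # assumed observations with Gaussian noise, variance 0.1
y = np.array([0.25, 0.45, 0.95])
noise_var = 0.1

# P(D|f) up to a constant, then posterior ∝ prior × likelihood
log_lik = np.array([-0.5 * np.sum((y - f(X)) ** 2) / noise_var for f in candidates])
posterior = prior * np.exp(log_lik - log_lik.max())
posterior /= posterior.sum()

# Predict at a new input by averaging the candidates' predictions under the posterior
x_star = 0.7
print(posterior, sum(p * f(x_star) for p, f in zip(posterior, candidates)))
```

With a GP prior the set of candidate functions is continuous rather than discrete, but the same averaging idea applies.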
Bayesian Linear Regression
f(x) = ∑_i w_i φ_i(x),   w ∼ N(0, Σ)
Samples from the prior
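A minimal sketch of how such prior samples could be generated; the Gaussian-bump basis functions, their number, and the choice Σ = I are illustrative assumptions, not from the slides:

```python
# Samples from the Bayesian linear regression prior f(x) = Σ_i w_i φ_i(x), w ~ N(0, Σ).
# The Gaussian-bump basis functions and Σ = I are illustrative assumptions.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
centres = np.linspace(0, 1, 10)                     # basis function centres (assumed)

def features(x):
    # φ_i(x): Gaussian bumps of width 0.1 centred on `centres`
    return np.exp(-0.5 * ((x[:, None] - centres) / 0.1) ** 2)

x = np.linspace(0, 1, 200)
Phi = features(x)
for _ in range(3):                                  # three sample functions from the prior
    w = rng.standard_normal(len(centres))           # w ~ N(0, I)
    plt.plot(x, Phi @ w)
plt.xlabel("x"); plt.ylabel("f(x)")
plt.show()
```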
Gaussian Processes: Priors over functions
• For a stochastic process f(x), the mean function is
µ(x) = E[f(x)].
Assume µ(x) ≡ 0 ∀x
• Covariance function
k(x,x′) = E[f(x)f(x′)].
• Forget those weights! We should be thinking of defining priors over functions, not weights.
• Priors over function-space can be defined directly by choosing a covariance function, e.g.
k(x, x′) = exp(−w|x − x′|)
• Gaussian processes are stochastic processes defined by their mean and covariance functions.
Examples of GPs
• σ_0² + σ_1² x x′
• exp(−|x − x′|)
• exp(−(x − x′)²)
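A hedged sketch of drawing sample functions from zero-mean GP priors with the three covariance functions above, by sampling f ∼ N(0, K) on a grid; the grid and the choice σ_0 = σ_1 = 1 for the linear kernel are assumptions:

```python
# Draw sample functions from zero-mean GP priors with the covariance functions above,
# by sampling f ~ N(0, K) on a grid of inputs (grid and parameter values assumed).
import numpy as np

x = np.linspace(0, 1, 100)
d = np.abs(x[:, None] - x[None, :])                 # |x - x'| for every pair of grid points

kernels = {
    "linear":       1.0 + x[:, None] * x[None, :],  # σ_0² + σ_1² x x' with σ_0 = σ_1 = 1
    "exponential":  np.exp(-d),                     # exp(-|x - x'|)
    "squared exp.": np.exp(-d ** 2),                # exp(-(x - x')²)
}

rng = np.random.default_rng(1)
for name, K in kernels.items():
    K = K + 1e-6 * np.eye(len(x))                   # jitter for numerical stability
    f = rng.multivariate_normal(np.zeros(len(x)), K)
    print(name, f[:5])                              # one sample function on the grid
```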
Connection to feature space
A Gaussian process prior over functions can be thought of as a Gaussian prior on the coefficients w ∼ N(0, Λ), where
f(x) = ∑_{i=1}^{N_F} w_i φ_i(x) = w · Φ(x)
Φ(x) = (φ_1(x), φ_2(x), . . . , φ_{N_F}(x))^T
In many interesting cases, N_F = ∞
Choose Φ(·) as the eigenfunctions of the kernel k(x, x′) w.r.t. p(x) (Mercer):
∫ k(x, y) p(x) φ_i(x) dx = λ_i φ_i(y)
Gaussian process regression
Dataset D = (x_i, y_i)_{i=1}^n, Gaussian likelihood p(y_i|f_i) = N(f_i, σ²)
f(x) = ∑_{i=1}^n α_i k(x, x_i)
where
α = (K + σ²I)^{−1} y
var(x) = k(x, x) − k^T(x)(K + σ²I)^{−1} k(x)
in time O(n³), with k(x) = (k(x, x_1), . . . , k(x, x_n))^T
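A minimal sketch of these regression equations on toy 1-d data; the squared exponential kernel, its lengthscale, and the noise level are assumptions:

```python
# The GP regression equations above on toy 1-d data; the squared exponential
# kernel, its lengthscale, and the noise variance are assumptions.
import numpy as np

def k(a, b, ell=0.2):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

X = np.array([0.1, 0.4, 0.7, 0.9])                  # training inputs (assumed)
y = np.sin(2 * np.pi * X)                           # training targets (assumed)
sigma2 = 0.01                                       # noise variance σ² (assumed)

K = k(X, X)
A = K + sigma2 * np.eye(len(X))
alpha = np.linalg.solve(A, y)                       # α = (K + σ²I)⁻¹ y, an O(n³) solve

x_star = np.array([0.5])
k_star = k(X, x_star)[:, 0]                         # k(x*) = (k(x*, x_1), ..., k(x*, x_n))ᵀ
mean = k_star @ alpha                               # f(x*) = Σ_i α_i k(x*, x_i)
var = k(x_star, x_star)[0, 0] - k_star @ np.linalg.solve(A, k_star)
print(mean, var)
```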
After 1 observation: [plot of the GP posterior, Y(x) against x]
After 2 observations: [plot of the GP posterior, Y(x) against x]
Example: Inverse Dynamics of a Robot Arm
• 7 d.o.f. robot. Inverse dynamics maps 21-d vector (7 positions, 7 velocities, 7 accelerations) to 7 torques.
• 44,484 training examples
Method  SMSE   MSLL
LR      0.075  -1.29
RBD     0.104  -
LWPR    0.040  -
GPR     0.011  -2.25
• Approximation methods can reduce O(n³) to O(nm²) for m ≪ n
• Subset of datapoints, subset of regressors, projected process approximation, Bayesian Committee Machine
• GP regression is competitive with other kernel methods (e.g. SVMs)
• Can use non-Gaussian likelihoods (e.g. Student-t)
Linear smoothing
f(x∗) = k^T(x∗)(K + σ_n²I)^{−1} y
[Plots of the smoothing weights against x for σ_n² = 0.1 and σ_n² = 10]
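A sketch of how the smoothing weights k^T(x∗)(K + σ_n²I)^{−1} in the figure could be computed for the two noise levels; the training inputs and the kernel lengthscale are assumptions:

```python
# Compute the smoothing weights kᵀ(x*)(K + σ_n²I)⁻¹ for the two noise variances
# in the figure above; the training inputs and kernel lengthscale are assumptions.
import numpy as np

def k(a, b, ell=0.1):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

X = np.linspace(0, 1, 50)                           # training input locations (assumed)
x_star = np.array([0.5])

for noise_var in (0.1, 10.0):                       # the two values of σ_n² shown above
    weights = np.linalg.solve(k(X, X) + noise_var * np.eye(len(X)), k(X, x_star)[:, 0])
    print(noise_var, weights.max())                 # larger σ_n² gives smaller, broader weights
```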
Adapting kernel parameters
k(x_i, x_j) = v_0 exp(−(1/2) ∑_{l=1}^d w_l (x_il − x_jl)²)

[Sample functions for w_1 = 5.0, w_2 = 5.0 and for w_1 = 5.0, w_2 = 0.5]
• For GPs, the marginal likelihood (aka Bayesian evidence) log P(y|θ) can be optimized w.r.t. the kernel parameters θ = (v_0, w)
• For GP regression, log P(y|θ) can be computed exactly:
log P(y|θ) = −(1/2) log|K + σ²I| − (1/2) y^T(K + σ²I)^{−1} y − (n/2) log 2π
• Can also use LOO-CV
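A sketch of evaluating the exact log marginal likelihood above for the ARD kernel k(x_i, x_j) = v_0 exp(−(1/2) ∑_l w_l (x_il − x_jl)²); the toy data, the Cholesky-based implementation, and the particular θ values are assumptions:

```python
# Exact log marginal likelihood for the ARD kernel above; the toy data, the
# Cholesky-based implementation, and the parameter values are assumptions.
import numpy as np

def ard_kernel(X, v0, w):
    # k(x_i, x_j) = v_0 exp(-1/2 Σ_l w_l (x_il - x_jl)²)
    sq = (((X[:, None, :] - X[None, :, :]) ** 2) * w).sum(-1)
    return v0 * np.exp(-0.5 * sq)

def log_marginal_likelihood(X, y, v0, w, sigma2):
    n = len(y)
    Ky = ard_kernel(X, v0, w) + sigma2 * np.eye(n)
    L = np.linalg.cholesky(Ky)                      # log|K + σ²I| = 2 Σ_i log L_ii
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-np.sum(np.log(np.diag(L)))             # -1/2 log|K + σ²I|
            - 0.5 * y @ alpha                       # -1/2 yᵀ(K + σ²I)⁻¹ y
            - 0.5 * n * np.log(2 * np.pi))          # -n/2 log 2π

rng = np.random.default_rng(2)
X = rng.uniform(size=(20, 2))                       # 20 inputs in 2-d (assumed)
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(20)
print(log_marginal_likelihood(X, y, v0=1.0, w=np.array([5.0, 0.5]), sigma2=0.01))
```

This value can then be handed to an optimizer to adapt θ = (v_0, w), as described above.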
Previous work
• Wiener-Kolmogorov prediction theory (1940’s)
• Splines (Kimeldorf and Wahba, 1971; Wahba 1990)
• ARMA models for time-series
• Kriging in geostatistics (for 2-d or 3-d spaces)
• Regularization networks (Poggio and Girosi, 1989, 1990)
• Design and Analysis of Computer Experiments (Sacks et al, 1989)
• Infinite neural networks (Neal, 1995)
GP prediction for classification problems
Squash through the logistic (or erf) function: [figure mapping the latent function f to a probability π ∈ (0, 1)]
• Can also handle multi-class problems
• Likelihood
− log P(y_i|f_i) = log(1 + exp(−y_i f_i))
• Integrals can’t be done analytically
– Find maximum a posteriori value of P(f|y) (Williams and Barber, 1997)
– Expectation-Propagation (Minka, 2001; Opper and Winther, 2000)
– MCMC methods (Neal, 1997)
MAP Gaussian process classification
To obtain the MAP approximation to the GPC solution, we find the f that maximizes the concave function
Ψ(f) = −∑_{i=1}^n log(1 + exp(−y_i f_i)) − (1/2) f^T K^{−1} f + c
The optimization is carried out using the Newton-Raphson iteration
f^{new} = K(I + WK)^{−1}(Wf + (t − π))
where W = diag(π_1(1 − π_1), . . . , π_n(1 − π_n)) and π_i = σ(f_i). Basic complexity is O(n³)
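A hedged sketch of this Newton-Raphson iteration with the logistic likelihood; the toy kernel matrix and the 0/1 targets t are assumptions:

```python
# The Newton-Raphson iteration above for MAP GP classification with the logistic
# likelihood; the toy kernel matrix and 0/1 targets t are assumptions.
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def map_gpc(K, t, n_iter=20):
    """Iterate f_new = K(I + WK)^(-1)(Wf + (t - π)) with W = diag(π_i(1 - π_i))."""
    n = len(t)
    f = np.zeros(n)
    for _ in range(n_iter):
        pi = sigmoid(f)
        W = np.diag(pi * (1.0 - pi))
        f = K @ np.linalg.solve(np.eye(n) + W @ K, W @ f + (t - pi))   # O(n³) per iteration
    return f

# toy 1-d example (assumed): two clusters of inputs with 0/1 class labels
X = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
t = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / 0.2 ** 2)
print(sigmoid(map_gpc(K, t)))                       # fitted probabilities π_i at the training inputs
```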
For a test point x∗ we compute f(x∗) and the variance, and make the prediction as
P(class 1|x∗, D) = ∫ σ(f∗) p(f∗|y) df∗
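This is a one-dimensional integral; a sketch of estimating it by simple Monte Carlo, assuming a Gaussian approximation to p(f∗|y) with an illustrative mean and variance:

```python
# Estimate P(class 1 | x*, D) = ∫ σ(f*) p(f*|y) df* by simple Monte Carlo, assuming
# a Gaussian approximation to p(f*|y) with the given (illustrative) mean and variance.
import numpy as np

def predict_class1(mean, var, n_samples=100_000, rng=np.random.default_rng(3)):
    f_star = rng.normal(mean, np.sqrt(var), size=n_samples)
    return np.mean(1.0 / (1.0 + np.exp(-f_star)))   # average of σ(f*) over p(f*|y)

print(predict_class1(mean=1.2, var=0.5))            # probability of class 1 at this test point
```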
[Figure: two-class training data, points marked ° and •]
USPS Digits: 3 vs 5
[Three panels over log lengthscale log(l) and log magnitude log(σ_f): the log marginal likelihood, the information about the test targets in bits, and the number of test misclassifications]
[Figure: misclassification rate against rejection rate for EP, Laplace, SVM, P1NN, LSC, and linear probit]
Covariance functions
Covariance function    Equation
constant               σ_0²
linear                 ∑_{d=1}^D σ_d² x_d x′_d
polynomial             (x · x′ + σ_0²)^p
squared exponential    exp(−r²/(2ℓ²))
Matérn                 (1/(2^{ν−1} Γ(ν))) (√(2ν) r/ℓ)^ν K_ν(√(2ν) r/ℓ)
exponential            exp(−r/ℓ)
rational quadratic     (1 + r²/(2αℓ²))^{−α}
neural network         sin^{−1}( 2x^T Σ x′ / √((1 + 2x^T Σ x)(1 + 2x′^T Σ x′)) )
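A sketch of a few of these covariance functions written as functions of r = |x − x′| (the stationary cases); using scipy's Bessel function K_ν is a convenience choice, and the parameter defaults are assumptions:

```python
# A few of the covariance functions from the table, written as functions of
# r = |x - x'| (the stationary cases); using scipy's K_ν is a convenience choice.
import numpy as np
from scipy.special import gamma, kv                 # kv(ν, z) is the modified Bessel function K_ν

def squared_exponential(r, ell=1.0):
    return np.exp(-r ** 2 / (2 * ell ** 2))

def matern(r, nu=1.5, ell=1.0):
    r = np.maximum(r, 1e-12)                        # avoid the 0·∞ form at r = 0 (the limit is 1)
    z = np.sqrt(2 * nu) * r / ell
    return (2 ** (1 - nu) / gamma(nu)) * z ** nu * kv(nu, z)

def rational_quadratic(r, alpha=2.0, ell=1.0):
    return (1 + r ** 2 / (2 * alpha * ell ** 2)) ** (-alpha)

r = np.linspace(0.0, 3.0, 5)
print(squared_exponential(r), matern(r), rational_quadratic(r))
```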
Regularization
• f(x) is the (functional) minimizer of
J[f] = (1/(2σ²)) ∑_{i=1}^n (y_i − f(x_i))² + (1/2) ‖f‖²_H
(1st term = − log-likelihood, 2nd term = − log-prior)
• However, the regularization framework does not yield predictive variance or marginal likelihood
SVMs
1-norm soft margin classifier has the form
f(x) = ∑_{i=1}^n y_i α_i^∗ k(x, x_i) + w_0^∗
where yi ∈ {−1,1} and α∗ optimizes the quadratic form
Q(α) = ∑_{i=1}^n α_i − (1/2) ∑_{i,j=1}^n y_i y_j α_i α_j k(x_i, x_j)
subject to the constraints
∑_{i=1}^n y_i α_i = 0
0 ≤ α_i ≤ C,   i = 1, . . . , n
This is a quadratic programming problem. Can be solved in many ways, e.g.with interior point methods, or special purpose algorithms such as SMO.
Basic complexity is O(n3).
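A hedged sketch of solving this QP with the cvxopt package (one of many possible solvers, not the method named on the slide); the toy data, the kernel, and C are assumptions:

```python
# Solve the dual QP above with cvxopt; the toy data and C are assumptions.
# cvxopt minimizes 1/2 xᵀPx + qᵀx, so we negate Q(α).
import numpy as np
from cvxopt import matrix, solvers

X = np.array([0.1, 0.2, 0.8, 0.9])                  # toy 1-d inputs (assumed)
y = np.array([-1.0, -1.0, 1.0, 1.0])                # labels in {-1, 1}
C = 10.0
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / 0.2 ** 2)
n = len(y)

P = matrix(np.outer(y, y) * K)                      # y_i y_j k(x_i, x_j)
q = matrix(-np.ones(n))                             # maximizing Σα_i  ⇒  minimizing -Σα_i
G = matrix(np.vstack([-np.eye(n), np.eye(n)]))      # encodes 0 ≤ α_i ≤ C
h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
A = matrix(y.reshape(1, -1))                        # Σ y_i α_i = 0
b = matrix(0.0)

alpha = np.array(solvers.qp(P, q, G, h, A, b)["x"]).ravel()
print(alpha)                                        # the optimal α*
```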
• Define g_σ(z) = log(1 + exp(−z))
• SVM classifier is similar to GP classifier, but with g_σ replaced by g_SVM(z) = [1 − z]_+ (Wahba, 1999)
[Figure: the losses log(1 + exp(−z)) and max(1 − z, 0) plotted against z]
• Note that the MAP solution using g_σ is not sparse, but gives a probabilistic output
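A tiny sketch comparing the two losses on a few values of z, illustrating that the hinge loss is exactly zero for z ≥ 1 (the source of SVM sparsity) while the logistic loss never is:

```python
# Compare the two losses discussed above on a few values of z.
import numpy as np

z = np.linspace(-2.0, 4.0, 7)
g_sigma = np.log(1 + np.exp(-z))                    # g_σ(z) = log(1 + exp(-z))
g_svm = np.maximum(1 - z, 0)                        # g_SVM(z) = [1 - z]_+
print(np.column_stack([z, g_sigma, g_svm]))         # hinge loss is 0 for z ≥ 1; logistic loss is not
```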