
Gaussian Processes

Marc Deisenroth, Centre for Artificial Intelligence, Department of Computer Science, University College London

@[email protected]://deisenroth.cc

AIMS Rwanda and AIMS Ghana

March/April 2020


Reference

http://www.gaussianprocess.org/


Model Selection


Influence of Prior on Posterior

Generalization error measured by the log-predictive density (lpd),

lpd = log p(y_* | x_*, X, y, ℓ)

for different length-scales ℓ and different datasets.

Shorter length-scale → more flexible model → faster increase in uncertainty away from the data → bad generalization properties.

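As a concrete illustration, here is a minimal NumPy sketch of this computation (not the code behind the slides' figures; the dataset, squared-exponential kernel, and noise level are illustrative assumptions). It evaluates the log-predictive density at a held-out point for several length-scales ℓ:

```python
import numpy as np

def rbf(X1, X2, ell, sigma_f=1.0):
    # Squared-exponential kernel k(x, x') = sigma_f^2 exp(-(x - x')^2 / (2 ell^2))
    d = X1[:, None] - X2[None, :]
    return sigma_f**2 * np.exp(-0.5 * (d / ell)**2)

def log_pred_density(X, y, x_star, y_star, ell, sigma_n=0.1):
    # GP posterior predictive p(y* | x*, X, y, ell) = N(mu, var), then its log density
    K = rbf(X, X, ell) + sigma_n**2 * np.eye(len(X))
    k_star = rbf(X, np.array([x_star]), ell)[:, 0]
    mu = k_star @ np.linalg.solve(K, y)
    var = (rbf(np.array([x_star]), np.array([x_star]), ell)[0, 0]
           - k_star @ np.linalg.solve(K, k_star) + sigma_n**2)
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (y_star - mu)**2 / var

rng = np.random.default_rng(0)
X = rng.uniform(-4, 4, 20)                      # toy training inputs
y = np.sin(X) + 0.1 * rng.standard_normal(20)   # noisy observations
for ell in [0.1, 0.5, 2.0]:                     # short to long length-scales
    print(ell, log_pred_density(X, y, 3.5, np.sin(3.5), ell))
```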

Model Selection

Make predictions equipped with uncertainty.

The choice of prior (e.g., the length-scale) influences predictions.

Different tasks require different priors.

How do we select a good prior?

Model Selection in GPs
- Choose the hyper-parameters of the GP
- Choose a good mean function and kernel


Hyper-Parameters of a GP

The GP possesses a set of hyper-parameters:
- Parameters of the mean function
- Parameters of the covariance function (e.g., length-scales and signal variance)
- Likelihood parameters (e.g., noise variance σ_n²)

Train a GP to find a good set of hyper-parameters.

Higher-level model selection to find good mean and covariance functions (this can also be automated: Automatic Statistician (Lloyd et al., 2014)).

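To make this grouping concrete, a minimal sketch of how the hyper-parameters might be collected in code (the names and defaults are illustrative assumptions, not a fixed API):

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class GPHyperParams:
    mean_const: float = 0.0                    # mean-function parameter (constant mean)
    length_scales: np.ndarray = field(         # covariance: one length-scale per input
        default_factory=lambda: np.ones(2))    # dimension (ARD)
    signal_var: float = 1.0                    # covariance: signal variance
    noise_var: float = 0.1                     # likelihood: noise variance sigma_n^2

theta = GPHyperParams()                        # the full set theta that GP training tunes
print(theta)
```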

[Whiteboard notes: deriving the marginal likelihood p(y | X, θ) = ∫ p(y | f, X) p(f | X, θ) df via the sum rule and Bayes' theorem, with a prior on the level-2 hyper-parameters θ of the Gaussian process; maximizing this quantity over θ is Maximum Likelihood Type-II.]

Gaussian Process Training: Hyper-Parameters

GP Training: find good hyper-parameters θ (kernel/mean-function parameters, noise variance σ_n²).

[Graphical model: inputs x_i and latent function f generate noisy observations y_i with noise σ_n; plate over the N data points.]

Place a prior p(θ) on the hyper-parameters.

Posterior over hyper-parameters:

p(θ | X, y) = p(θ) p(y | X, θ) / p(y | X),  where  p(y | X, θ) = ∫ p(y | f, X) p(f | X, θ) df


Gaussian Process Training: Hyper-Parameters

Posterior over hyper-parameters:

p(θ | X, y) = p(θ) p(y | X, θ) / p(y | X),  where  p(y | X, θ) = ∫ p(y | f(X)) p(f(X) | θ) df

Choose hyper-parameters θ*, such that

θ* ∈ argmax_θ  log p(θ) + log p(y | X, θ)

Maximize the marginal likelihood if p(θ) = U (uniform prior).

[Whiteboard notes: the marginal likelihood is also called the evidence or marginalized likelihood; the latent function values f have been integrated out of the likelihood.]


Training via Marginal Likelihood Maximization

GP Training: maximize the evidence/marginal likelihood (the probability of the data given the hyper-parameters, where the unwieldy f has been integrated out). Also called Maximum Likelihood Type-II.

Marginal likelihood (with a prior mean function m(·) ≡ 0):

p(y | X, θ) = ∫ p(y | f(X)) p(f(X) | θ) df
            = ∫ N(y | f(X), σ_n² I) N(f(X) | 0, K) df
            = N(y | 0, K + σ_n² I)

Learning the GP hyper-parameters:

θ* ∈ argmax_θ  log p(y | X, θ)

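A minimal NumPy sketch of this computation (the squared-exponential kernel and toy data are illustrative assumptions), evaluating log N(y | 0, K + σ_n² I) stably via a Cholesky factorization:

```python
import numpy as np

def log_marginal_likelihood(X, y, ell, sigma_f, sigma_n):
    # log N(y | 0, K + sigma_n^2 I), computed via K_theta = L L^T
    N = len(X)
    d = X[:, None] - X[None, :]
    K = sigma_f**2 * np.exp(-0.5 * (d / ell)**2)        # squared-exponential kernel
    L = np.linalg.cholesky(K + sigma_n**2 * np.eye(N))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # alpha = K_theta^{-1} y
    # -1/2 y^T K_theta^{-1} y - 1/2 log|K_theta| - N/2 log(2 pi)
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * N * np.log(2 * np.pi))

rng = np.random.default_rng(1)
X = np.linspace(-3, 3, 25)
y = np.sin(X) + 0.1 * rng.standard_normal(25)
print(log_marginal_likelihood(X, y, ell=1.0, sigma_f=1.0, sigma_n=0.1))
```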

Gradient-based Optimization

Log-marginal likelihood:

log p(y | X, θ) = -½ yᵀ K_θ⁻¹ y - ½ log |K_θ| + const,  K_θ := K + σ_n² I

Gradient-based optimization to get hyper-parameters θ*:

∂ log p(y | X, θ) / ∂θ_i = ½ yᵀ K_θ⁻¹ (∂K_θ/∂θ_i) K_θ⁻¹ y - ½ tr(K_θ⁻¹ ∂K_θ/∂θ_i)
                         = ½ tr( (α αᵀ - K_θ⁻¹) ∂K_θ/∂θ_i ),  α := K_θ⁻¹ y

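A sketch of this optimization under the same illustrative assumptions as before (squared-exponential kernel; hyper-parameters optimized in log space so they stay positive). The gradient uses the trace identity from the slide, with the chain rule applied for the log-space parameterization:

```python
import numpy as np
from scipy.optimize import minimize

def neg_lml_and_grad(log_theta, X, y):
    # theta = (ell, sigma_f, sigma_n), parameterized in log space
    ell, sigma_f, sigma_n = np.exp(log_theta)
    N = len(X)
    d = X[:, None] - X[None, :]
    K = sigma_f**2 * np.exp(-0.5 * (d / ell)**2)
    K_theta = K + sigma_n**2 * np.eye(N)
    Kinv = np.linalg.inv(K_theta)
    alpha = Kinv @ y
    _, logdet = np.linalg.slogdet(K_theta)
    lml = -0.5 * y @ alpha - 0.5 * logdet - 0.5 * N * np.log(2 * np.pi)
    # dK_theta/d(log ell), d(log sigma_f), d(log sigma_n) via the chain rule
    dK = [K * (d / ell)**2, 2 * K, 2 * sigma_n**2 * np.eye(N)]
    A = np.outer(alpha, alpha) - Kinv            # alpha alpha^T - K_theta^{-1}
    grad = np.array([0.5 * np.trace(A @ dKi) for dKi in dK])
    return -lml, -grad

rng = np.random.default_rng(2)
X = np.linspace(-3, 3, 30)
y = np.sin(X) + 0.1 * rng.standard_normal(30)
res = minimize(neg_lml_and_grad, x0=np.log([1.0, 1.0, 0.5]),
               args=(X, y), jac=True, method="L-BFGS-B")
print(np.exp(res.x))   # learned (ell, sigma_f, sigma_n)
```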

Marginal Likelihood Illustration

"ELBO" in the figure refers to the log-marginal likelihood.

The data-fit term gets worse, but the marginal likelihood increases.

(Figure: thanks to Mark van der Wilk)

Inspecting the Marginal Likelihood

Log-marginal likelihood:

log p(y | X, θ) = -½ yᵀ K_θ⁻¹ y - ½ log |K_θ| + const,  K_θ := K + σ_n² I

The quadratic term measures whether the observation y is within the variation allowed by the prior.

The determinant is the product of the variances of the prior (the volume of the prior). Volume ≈ richness of the model class.

Marginal likelihood: an automatic trade-off between data fit and model complexity.

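The trade-off becomes visible if the two terms are evaluated separately while the length-scale varies; a small sketch (illustrative kernel and data):

```python
import numpy as np

def lml_terms(X, y, ell, sigma_f=1.0, sigma_n=0.1):
    # Split the log-marginal likelihood into data-fit and complexity terms
    d = X[:, None] - X[None, :]
    K_theta = sigma_f**2 * np.exp(-0.5 * (d / ell)**2) + sigma_n**2 * np.eye(len(X))
    L = np.linalg.cholesky(K_theta)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ alpha                 # -1/2 y^T K_theta^{-1} y
    complexity = -np.sum(np.log(np.diag(L)))    # -1/2 log |K_theta|
    return data_fit, complexity

rng = np.random.default_rng(3)
X = np.linspace(-3, 3, 25)
y = np.sin(X) + 0.1 * rng.standard_normal(25)
for ell in [0.05, 0.5, 5.0]:
    fit, cplx = lml_terms(X, y, ell)
    print(f"ell={ell:4.2f}  data-fit={fit:8.2f}  complexity={cplx:8.2f}")
```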

Marginal Likelihood Surface

Several plausible hyper-parameter settings (local optima).

What do you expect to happen in each local optimum?

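A sketch for computing such a surface on a grid over (ℓ, σ_n) (the dataset is an illustrative assumption); plotting it, e.g. with matplotlib's contourf on log axes, reveals the local optima:

```python
import numpy as np

def lml(X, y, ell, sigma_n, sigma_f=1.0):
    # log-marginal likelihood for a squared-exponential kernel
    d = X[:, None] - X[None, :]
    K_theta = sigma_f**2 * np.exp(-0.5 * (d / ell)**2) + sigma_n**2 * np.eye(len(X))
    L = np.linalg.cholesky(K_theta)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(X) * np.log(2 * np.pi)

rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, 8)                       # small dataset: local optima are common
y = np.sin(2 * X) + 0.3 * rng.standard_normal(8)
ells = np.logspace(-1, 1, 50)
noises = np.logspace(-2, 0, 50)
surface = np.array([[lml(X, y, l, s) for l in ells] for s in noises])
i, j = np.unravel_index(np.argmax(surface), surface.shape)
print(f"best grid point: ell={ells[j]:.3f}, sigma_n={noises[i]:.3f}")
```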


Demo

https://drafts.distill.pub/gp/


Marginal Likelihood and Parameter Learning

The marginal likelihood is non-convex.

Especially in the very-small-data regime, a GP can end up in three different situations when optimizing the hyper-parameters:
- Short length-scales, low noise (a highly nonlinear mean function with little noise)
- Long length-scales, high noise (everything is considered noise)
- Hybrid

Re-start hyper-parameter optimization from random initializations to mitigate the problem.

With increasing data-set size, the GP typically ends up in the "hybrid" mode; the other modes become unlikely.

Ideally, we would integrate the hyper-parameters out. There is no closed-form solution; use Markov chain Monte Carlo instead.

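A self-contained sketch of the random-restart strategy (squared-exponential kernel and toy data are illustrative assumptions): optimize from several random log-space initializations and keep the local optimum with the highest marginal likelihood.

```python
import numpy as np
from scipy.optimize import minimize

def neg_lml(log_theta, X, y):
    # Negative log-marginal likelihood; theta = (ell, sigma_f, sigma_n) in log space
    ell, sigma_f, sigma_n = np.exp(log_theta)
    d = X[:, None] - X[None, :]
    K_theta = (sigma_f**2 * np.exp(-0.5 * (d / ell)**2)
               + (sigma_n**2 + 1e-9) * np.eye(len(X)))   # tiny jitter for stability
    L = np.linalg.cholesky(K_theta)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(X) * np.log(2 * np.pi)

def train_with_restarts(X, y, n_restarts=10, seed=0):
    # Keep the best of several randomly initialized local optimizations
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        x0 = rng.uniform(np.log(0.05), np.log(5.0), size=3)
        res = minimize(neg_lml, x0, args=(X, y), method="L-BFGS-B")
        if best is None or res.fun < best.fun:
            best = res
    return np.exp(best.x), -best.fun   # hyper-parameters and their log-marginal likelihood

rng = np.random.default_rng(9)
X = rng.uniform(-3, 3, 8)                       # tiny dataset: local optima are likely
y = np.sin(2 * X) + 0.3 * rng.standard_normal(8)
print(train_with_restarts(X, y))
```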

Why Does the Marginal Likelihood Work?

Overall goal: good generalization performance on unseen test data.

Minimizing the training error is not a good idea (e.g., maximum likelihood): overfitting.

Just adding uncertainty does not help either if the model is wrong, but it makes predictions more cautious.

The marginal likelihood seems to find a good balance between fitting the data and finding a simple model (Occam's razor).

Why does the marginal likelihood lead to models that generalize well?


Marginal Likelihood: Incremental Prediction

"Probability of the training data" given the parameters.

General factorization (ignoring the inputs X):

p(y | θ) = p(y_1, …, y_N | θ)
         = p(y_1 | θ) p(y_2 | y_1, θ) p(y_3 | y_1, y_2, θ) ⋯ p(y_N | y_1, …, y_{N-1}, θ)
         = p(y_1 | θ) ∏_{n=2}^{N} p(y_n | y_1, …, y_{n-1}, θ)

[Whiteboard notes: the chain rule p(a, b) = p(a | b) p(b), applied repeatedly, yields the factorization of the marginal likelihood above.]


Marginal Likelihood: Incremental Prediction (2)

p(y | θ) = p(y_1 | θ) ∏_{n=2}^{N} p(y_n | y_1, …, y_{n-1}, θ)

If we think of this as a sequence model (where data arrive sequentially), the marginal likelihood predicts the n-th training observation given all "previous" observations.

Predict training data y_n that have not yet been accounted for (we only condition on y_1, …, y_{n-1}): treat the next data point as test data.

Intuition: if the model continuously predicted well on all N previous points, it will probably do well next time.

A proxy for the generalization error on unseen test data.

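This identity is easy to verify numerically: the sum of one-step-ahead log predictive densities equals the joint log density log N(y | 0, K_θ). A sketch (the kernel and data are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(5)
X = np.sort(rng.uniform(-3, 3, 10))
y = np.sin(X) + 0.1 * rng.standard_normal(10)
d = X[:, None] - X[None, :]
K = np.exp(-0.5 * d**2) + 0.1**2 * np.eye(len(X))   # K_theta with ell = sigma_f = 1

# Direct: log N(y | 0, K_theta)
direct = multivariate_normal(mean=np.zeros(len(X)), cov=K).logpdf(y)

# Incremental: sum of one-step-ahead log predictive densities
incremental = 0.0
for n in range(len(X)):
    if n == 0:
        mu, var = 0.0, K[0, 0]
    else:
        Kp = K[:n, :n]                # covariance of y_1 .. y_{n-1}
        k = K[:n, n]                  # cross-covariance with y_n
        w = np.linalg.solve(Kp, k)
        mu = w @ y[:n]                # Gaussian conditional mean (zero prior mean)
        var = K[n, n] - w @ k         # Gaussian conditional variance
    incremental += norm(mu, np.sqrt(var)).logpdf(y[n])

print(direct, incremental)            # identical up to numerical precision
```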

Marginal Likelihood: Illustration

p(y | θ) = p(y_1, …, y_N | θ) = p(y_1 | θ) ∏_{n=2}^{N} p(y_n | y_1, …, y_{n-1}, θ)

Marginal Likelihood Computation in Action

Short length-scale

Long length-scale

Optimal length-scale

(Figures: thanks to Mark van der Wilk)

Marginal Likelihood Evolution

Short length-scale: consistently overestimates the variance. No high density, even with observations inside the error bars.

Long length-scale: consistently underestimates the variance. Low density because observations are outside the error bars.

Optimal length-scale: trades off both behaviors reasonably well.


Model Selection—Mean Function and Kernel

Assume we have a finite set of models M_i, each one specifying a mean function m_i and a kernel k_i. How do we find the best one?

Some options:
- Cross validation
- Bayesian Information Criterion, Akaike Information Criterion
- Compare marginal likelihood values (assuming a uniform prior on the set of models)


Example

[Figure: GP fits to the same dataset under four different kernels; x ∈ [-4, 4], f(x) ∈ [-2, 3]]

Constant kernel: LML = -1.1073
Linear kernel: LML = -1.0065
Matérn kernel: LML = -0.8625
Gaussian kernel: LML = -0.69308

Four different kernels (mean function fixed to m ≡ 0).
MAP hyper-parameters for each kernel.
Log-marginal likelihood (LML) values for each (optimized) model.
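The same comparison can be run, for example, with scikit-learn, whose GaussianProcessRegressor optimizes the kernel hyper-parameters by maximizing the marginal likelihood and exposes the optimized value. The dataset below is synthetic, so the LML values will differ from the slide's:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (RBF, Matern, DotProduct,
                                              ConstantKernel, WhiteKernel)

rng = np.random.default_rng(6)
X = rng.uniform(-4, 4, (30, 1))                 # toy data, not the slide's dataset
y = np.sin(X[:, 0]) + 0.2 * rng.standard_normal(30)

kernels = {
    "constant": ConstantKernel() + WhiteKernel(),
    "linear":   DotProduct() + WhiteKernel(),
    "matern":   Matern(nu=1.5) + WhiteKernel(),
    "gaussian": RBF() + WhiteKernel(),           # WhiteKernel models the noise variance
}
for name, k in kernels.items():
    gpr = GaussianProcessRegressor(kernel=k).fit(X, y)   # maximizes the LML
    print(f"{name:8s}  LML = {gpr.log_marginal_likelihood_value_:.3f}")
```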

GP Training: Illustration

Prior: f(x) = θ_s f_smooth(x) + θ_p f_periodic(x), with smooth and periodic GP priors, respectively.

The amount of periodicity vs. smoothness is chosen automatically by selecting the hyper-parameters θ_s, θ_p.

The marginal likelihood learns how to generalize, not just how to fit the data.

(Figure: thanks to Mark van der Wilk)

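A sketch of such an additive smooth-plus-periodic prior, here expressed as a sum kernel in scikit-learn (the component variances play the role of θ_s, θ_p and are selected by marginal-likelihood optimization; the data are an illustrative assumption):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (RBF, ExpSineSquared,
                                              ConstantKernel, WhiteKernel)

# Sum of a smooth and a periodic component plus observation noise
kernel = (ConstantKernel(1.0) * RBF(length_scale=1.0)              # smooth part
          + ConstantKernel(1.0) * ExpSineSquared(periodicity=1.0)  # periodic part
          + WhiteKernel(0.1))                                      # noise

rng = np.random.default_rng(7)
X = np.linspace(0, 6, 40)[:, None]
y = 0.5 * X[:, 0] + np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(40)
gpr = GaussianProcessRegressor(kernel=kernel).fit(X, y)
print(gpr.kernel_)   # fitted weights show how much each component contributes
```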

Limitations and Guidelines

Limitations of Gaussian Processes

Computational and memory complexity (training set size N, input dimension D):
- Training scales in O(N³)
- Prediction (variances) scales in O(N²)
- Memory requirement: O(ND + N²)

Practical limit: N ≈ 10,000

Some solution approaches:
- Sparse GPs with inducing variables (e.g., Snelson & Ghahramani, 2006; Quiñonero-Candela & Rasmussen, 2005; Titsias, 2009; Hensman et al., 2013; Matthews et al., 2016)
- Combinations of local GP expert models (e.g., Tresp, 2000; Cao & Fleet, 2014; Deisenroth & Ng, 2015)
- Variational Fourier features (Hensman et al., 2018)


Tips and Tricks for Practitioners

To set initial hyper-parameters, use domain knowledge where available.

Standardize the input data and set the initial length-scales ℓ to ≈ 0.5.

Standardize the targets y and set the initial signal variance to σ_f ≈ 1.

Often useful: set the initial noise level relatively high (e.g., σ_n ≈ 0.5 × the σ_f amplitude), even if you think your data have low noise. The optimization surface for the other parameters will be easier to move in.

When optimizing hyper-parameters, random restarts or other tricks to avoid local optima are advised.

Mitigate numerical instability (in the Cholesky decomposition of K + σ_n² I) by penalizing high signal-to-noise ratios σ_f/σ_n.

https://drafts.distill.pub/gp

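A sketch putting these tips together (the data are an illustrative assumption; the jitter-based Cholesky stabilization shown last is a common complement to penalizing high σ_f/σ_n ratios, not something the slides prescribe):

```python
import numpy as np

def standardize(X, y):
    # Standardize inputs and targets so the initial values below are sensible
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    ys = (y - y.mean()) / y.std()
    return Xs, ys

X = np.random.default_rng(8).uniform(0, 100, (50, 2))   # raw data on an arbitrary scale
y = 0.03 * X[:, 0] + np.sin(X[:, 1] / 10)
Xs, ys = standardize(X, y)

ell0     = 0.5 * np.ones(X.shape[1])  # initial length-scales ~ 0.5 on standardized inputs
sigma_f0 = 1.0                        # initial signal std. dev. ~ 1 on standardized targets
sigma_n0 = 0.5 * sigma_f0             # deliberately high initial noise level

def safe_cholesky(K, jitter=1e-8):
    # Add a small jitter so the Cholesky factorization of K + sigma_n^2 I succeeds
    return np.linalg.cholesky(K + jitter * np.eye(K.shape[0]))
```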


Application Areas

[Figure: GP value function over a pendulum's state space; axes: angle in rad, angular velocity in rad/s]

Reinforcement learning and robotics: model value functions and/or dynamics with GPs.

Bayesian optimization (experimental design): model unknown utility functions with GPs.

Geostatistics: spatial modeling (e.g., landscapes, resources).

Sensor networks: time-series modeling and forecasting.

Summary

[Figure: GP posterior with mean and uncertainty bands; x ∈ [-5, 8], f(x) ∈ [-3, 3]]

Gaussian processes are the gold standard for regression.

Closely related to Bayesian linear regression.

Computations boil down to manipulating multivariate Gaussian distributions.

The marginal likelihood objective automatically trades off data fit and model complexity.

References I

[1] G. Bertone, M. P. Deisenroth, J. S. Kim, S. Liem, R. R. de Austri, and M. Welling. Accelerating the BSM Interpretation of LHC Data with Machine Learning. arXiv preprint arXiv:1611.02704, 2016.

[2] R. Calandra, J. Peters, C. E. Rasmussen, and M. P. Deisenroth. Manifold Gaussian Processes for Regression. In Proceedings of the International Joint Conference on Neural Networks, 2016.

[3] Y. Cao and D. J. Fleet. Generalized Product of Experts for Automatic and Principled Fusion of Gaussian Process Predictions. http://arxiv.org/abs/1410.7827, 2014.

[4] N. A. C. Cressie. Statistics for Spatial Data. Wiley-Interscience, 1993.

[5] M. Cutler and J. P. How. Efficient Reinforcement Learning for Robots using Informative Simulated Priors. In Proceedings of the International Conference on Robotics and Automation, 2015.

[6] M. P. Deisenroth and J. W. Ng. Distributed Gaussian Processes. In Proceedings of the International Conference on Machine Learning, 2015.

[7] M. P. Deisenroth, C. E. Rasmussen, and D. Fox. Learning to Control a Low-Cost Manipulator using Data-Efficient Reinforcement Learning. In Proceedings of Robotics: Science and Systems, 2011.

[8] M. P. Deisenroth, C. E. Rasmussen, and J. Peters. Gaussian Process Dynamic Programming. Neurocomputing, 72(7–9):1508–1524, Mar. 2009.

[9] M. P. Deisenroth, R. Turner, M. Huber, U. D. Hanebeck, and C. E. Rasmussen. Robust Filtering and Smoothing with Gaussian Processes. IEEE Transactions on Automatic Control, 57(7):1865–1871, 2012.

[10] R. Frigola, F. Lindsten, T. B. Schön, and C. E. Rasmussen. Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC. In Advances in Neural Information Processing Systems, 2013.

[11] N. HajiGhassemi and M. P. Deisenroth. Approximate Inference for Long-Term Forecasting with Periodic Gaussian Processes. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2014.

[12] J. Hensman, N. Durrande, and A. Solin. Variational Fourier Features for Gaussian Processes. Journal of Machine Learning Research, pages 1–52, 2018.

References II

[13] J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian Processes for Big Data. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2013.

[14] A. Krause, A. Singh, and C. Guestrin. Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies. Journal of Machine Learning Research, 9:235–284, Feb. 2008.

[15] M. C. H. Lee, H. Salimbeni, M. P. Deisenroth, and B. Glocker. Patch Kernels for Gaussian Processes in High-Dimensional Imaging Problems. In NIPS Workshop on Practical Bayesian Nonparametrics, 2016.

[16] J. R. Lloyd, D. Duvenaud, R. Grosse, J. B. Tenenbaum, and Z. Ghahramani. Automatic Construction and Natural-Language Description of Nonparametric Regression Models. In AAAI Conference on Artificial Intelligence, pages 1–11, 2014.

[17] D. J. C. MacKay. Introduction to Gaussian Processes. In C. M. Bishop, editor, Neural Networks and Machine Learning, volume 168, pages 133–165. Springer, 1998.

[18] A. G. d. G. Matthews, J. Hensman, R. Turner, and Z. Ghahramani. On Sparse Variational Methods and the Kullback-Leibler Divergence between Stochastic Processes. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2016.

[19] M. A. Osborne, S. J. Roberts, A. Rogers, S. D. Ramchurn, and N. R. Jennings. Towards Real-Time Information Processing of Sensor Network Data Using Computationally Efficient Multi-output Gaussian Processes. In Proceedings of the International Conference on Information Processing in Sensor Networks, pages 109–120. IEEE Computer Society, 2008.

[20] J. Quiñonero-Candela and C. E. Rasmussen. A Unifying View of Sparse Approximate Gaussian Process Regression. Journal of Machine Learning Research, 6(2):1939–1960, 2005.

[21] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.

[22] S. Roberts, M. A. Osborne, M. Ebden, S. Reece, N. Gibson, and S. Aigrain. Gaussian Processes for Time Series Modelling. Philosophical Transactions of the Royal Society (Part A), 371(1984), Feb. 2013.

References III

[23] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Adaptive Computation and Machine Learning. The MIT Press, Cambridge, MA, USA, 2002.

[24] E. Snelson and Z. Ghahramani. Sparse Gaussian Processes using Pseudo-inputs. In Y. Weiss, B. Schölkopf, and J. C. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1257–1264. The MIT Press, Cambridge, MA, USA, 2006.

[25] M. K. Titsias. Variational Learning of Inducing Variables in Sparse Gaussian Processes. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2009.

[26] V. Tresp. A Bayesian Committee Machine. Neural Computation, 12(11):2719–2741, 2000.

[27] A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing. Deep Kernel Learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2016.