Bayesian methods, priors and Gaussian processes

John Paul Gosling, Department of Probability and Statistics

Transcript of slides presented at "An Overview of State-of-the-Art Data Modelling", 24-25 January 2007.

Page 1:

Bayesian methods, priors and Gaussian processes

John Paul Gosling, Department of Probability and Statistics

Page 2:

Overview

• The Bayesian paradigm

• Bayesian data modelling

• Quantifying prior beliefs

• Data modelling with Gaussian processes

Page 3:

Bayesian methods

The beginning, the subjectivist philosophy, and an overview of Bayesian techniques.

Page 4:

Subjective probability

• Bayesian statistics involves a very different way of thinking about probability from classical inference.

• The probability of a proposition is defined to be a measure of a person's degree of belief.

• Wherever there is uncertainty, there is probability.

• This covers both aleatory uncertainty (inherent randomness) and epistemic uncertainty (lack of knowledge).

Page 5:

Differences with classical inference

To a frequentist, data are repeatable, parameters are not:

P(data|parameters)

To a Bayesian, the parameters are uncertain, the observed data are not:

P(parameters|data)

Page 6:

Bayes's theorem for distributions

• In early probability courses, we are taught Bayes's theorem for events:

  $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$

• This can be extended to continuous distributions:

  $f(\theta \mid x) = \frac{f(x \mid \theta)\,f(\theta)}{f(x)}$

• In Bayesian statistics, we use Bayes's theorem in a particular way: $\theta$ is the uncertain parameter, $x$ is the observed data, and the theorem takes us from prior to posterior beliefs about $\theta$.

Page 7:

Prior to posterior updating

Bayes's theorem is used to update our beliefs:

$\pi(\theta \mid x) \propto \pi(\theta) \times L(\theta; x)$

The posterior is proportional to the prior times the likelihood.

[Diagram: Prior + Data → Posterior]

Page 8:

Posterior distribution

• So, once we have our posterior, we have captured all of our beliefs about the parameter of interest.

• We can use this for informal inference, e.g. credible intervals and summary statistics.

• Formally, to make choices about the parameter, we must couple the posterior with decision theory to calculate the optimal decision.
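That last point can be made concrete (a standard formulation, added here): the optimal decision minimises posterior expected loss,

$$d^{*} = \arg\min_{d} \int L(d, \theta)\, \pi(\theta \mid x)\, \mathrm{d}\theta,$$

where $L(d, \theta)$ is the loss incurred by taking decision $d$ when the parameter turns out to be $\theta$.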

Page 9:

Sequential updating

Today's posterior is tomorrow's prior.

[Diagram: Prior beliefs + Data → Posterior beliefs; Posterior beliefs + More data → updated Posterior beliefs]

Page 10:

The triplot

• A triplot gives a graphical representation of prior to posterior updating.

[Plot: prior, likelihood and posterior densities drawn on the same axes]

Page 11:

Audience participation

Quantification of our prior beliefs

• What proportion of people in this room are left-handed? Call this parameter ψ.

• When I toss this coin, what's the probability of me getting a tail? Call this θ.

Page 12:

A simple example

• The archetypal example in probability theory is the outcome of tossing a coin.

• Each toss of a coin is a Bernoulli trial with the probability of tails given by θ.

• If we carry out 10 independent trials, we know the number of tails, $X$, will follow a binomial distribution: $X \mid \theta \sim \text{Bi}(10, \theta)$.

Page 13:

Our prior distribution

• A Beta(2,2) distribution may reflect our beliefs about θ.

Page 14:

Our posterior distribution

• If we observe X = 3, we get the following triplot:
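The update behind this triplot can be written out directly, using the conjugacy noted later on Page 28 (a worked step added for clarity):

$$\pi(\theta \mid x = 3) \;\propto\; \underbrace{\theta(1-\theta)}_{\text{Beta}(2,2)\text{ prior}} \times \underbrace{\theta^{3}(1-\theta)^{7}}_{\text{binomial likelihood}} \;=\; \theta^{4}(1-\theta)^{8},$$

so $\theta \mid x \sim \text{Beta}(5, 9)$.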

Page 15:

Our posterior distribution

• If we are more convinced, a priori, that θ = 0.5 and we observe X = 3, we get the following triplot:

Page 16:

Credible intervals

• If asked to provide an interval in which there is a 90% chance of θ lying, we can derive this directly from our posterior distribution.

• Such an interval is called a credible interval.

• In frequentist statistics, there are confidence intervals, which cannot be interpreted in the same way.

• In our example, using our first prior distribution, we can report a 95% posterior credible interval for θ of (0.14,0.62).
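A minimal check of that interval using scipy (the Beta(5,9) posterior is the one worked out earlier for this example):

```python
from scipy import stats

# Posterior for the coin example: Beta(2,2) prior, 3 tails in 10 tosses.
posterior = stats.beta(2 + 3, 2 + 7)   # Beta(5, 9)

# Equal-tailed 95% posterior credible interval for theta.
lo, hi = posterior.ppf([0.025, 0.975])
print(f"95% credible interval: ({lo:.2f}, {hi:.2f})")   # approximately (0.14, 0.62)
```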

Page 17:

Basic linear model

• Yesterday we saw a lot of this:

  $y = X\beta + \varepsilon$

• We have a least squares solution given by

  $\hat{\beta} = (X^{\mathsf T}X)^{-1}X^{\mathsf T}y$

• Instead of trying to find the optimal set of parameters, we express our beliefs about them.

Page 18:

Basic linear model

• By selecting appropriate priors for the two parameters, we can derive the posterior analytically.

• It is a normal inverse-gamma distribution.

• The mean of our posterior distribution is then

  $E[\beta \mid y] = \big(V_0^{-1} + X^{\mathsf T}X\big)^{-1}\big(V_0^{-1} m_0 + X^{\mathsf T}X\hat{\beta}\big),$

  where $m_0$ and $\sigma^2 V_0$ are the prior mean and covariance of $\beta$; this is a weighted average of the LSE $\hat{\beta}$ and the prior mean $m_0$.
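A small numerical sketch of this update (the data and prior settings are my own stand-ins; σ² is handled through the normal inverse-gamma scaling above, so it cancels from the posterior mean):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data from a straight line with noise.
n = 20
X = np.column_stack([np.ones(n), rng.uniform(0, 4, n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, 0.5, n)

# Least squares estimate.
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Prior: beta | sigma^2 ~ N(m0, sigma^2 V0), so sigma^2 cancels below.
m0 = np.zeros(2)
V0_inv = 0.1 * np.eye(2)   # weak prior precision

# Posterior mean: a weighted average of the prior mean and the LSE.
A = V0_inv + X.T @ X
post_mean = np.linalg.solve(A, V0_inv @ m0 + X.T @ X @ beta_ls)
print(beta_ls, post_mean)
```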

Page 19:

Bayesian model comparison

• Suppose we have two plausible models for a set of data, M and N say.

• We can calculate posterior odds in favour of M using

  $\frac{P(M \mid x)}{P(N \mid x)} = \frac{P(M)}{P(N)} \times \frac{P(x \mid M)}{P(x \mid N)},$

  i.e. posterior odds = prior odds × Bayes factor.

Page 20:

Bayesian model comparison

• The Bayes factor is calculated using

  $B = \frac{P(x \mid M)}{P(x \mid N)},$

  the ratio of the marginal likelihoods of the data under the two models.

• A Bayes factor greater than one means that your odds in favour of M increase.

• Bayes factors naturally help guard against too much model structure, because the marginal likelihood penalises unnecessary complexity. A toy calculation follows.
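As a toy illustration (mine, not from the slides), we can compare M, a fair coin with θ fixed at 0.5, against N, where θ has a Beta(2,2) prior, after seeing 3 tails in 10 tosses. Under N the marginal likelihood integrates the binomial likelihood against the prior, which for a beta prior has the closed beta-binomial form used below.

```python
import numpy as np
from scipy import stats
from scipy.special import betaln, comb

x, n = 3, 10

# Model M: theta fixed at 0.5, so P(x|M) is just a binomial probability.
ml_M = stats.binom.pmf(x, n, 0.5)

# Model N: theta ~ Beta(a, b); P(x|N) = C(n,x) B(a+x, b+n-x) / B(a,b).
a, b = 2.0, 2.0
ml_N = comb(n, x) * np.exp(betaln(a + x, b + n - x) - betaln(a, b))

print(f"Bayes factor in favour of M: {ml_M / ml_N:.2f}")
```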

Page 21:

Advantages/Disadvantages

• Bayesian methods are often more complex than frequentist methods.

• There is not much software to give scientists off-the-shelf analyses.

• Subjectivity: all the inferences are based on somebody’s beliefs.

Page 22:

Advantages/Disadvantages

• Bayesian statistics offers a framework to deal with all the uncertainty.

• Bayesians make use of more information, not just the data in their particular experiment.

• The Bayesian paradigm is very flexible and is able to tackle problems that frequentist techniques could not.

• In selecting priors and likelihoods, Bayesians are showing their hands: they can't get away with making arbitrary choices when it comes to inference.

• …

Page 23:

Summary

• The basic principles of Bayesian statistics have been covered.

• We have seen how we update our beliefs in the light of data.

• Hopefully, I’ve convinced you that the Bayesian way is the right way.

Page 24:

Priors

Advice on choosing suitable prior distributions and eliciting their parameters.

Page 25:

Importance of priors

• As we saw in the previous section, prior beliefs about uncertain parameters are a fundamental part of Bayesian statistics.

• When we have few data about the parameter of interest, our prior beliefs dominate inference about that parameter.

• In any application, effort should be made to model our prior beliefs accurately.

Page 26:

Weak prior information

• If we accept the subjective nature of Bayesian statistics but are not comfortable using subjective priors, then many have argued that we should try to specify prior distributions that represent no prior information.

• These prior distributions are called noninformative, reference, ignorance or weak priors.

• The idea is to have a completely flat prior distribution over all possible values of the parameter.

• Unfortunately, this can lead to improper distributions being used.

Page 27:

Weak prior information

• In our coin tossing example, Be(1,1) (the uniform distribution), Be(0.5,0.5) (the Jeffreys prior) and Be(0,0) have been recommended as noninformative priors; Be(0,0) is improper.

Page 28:

Conjugate priors

• When we move away from noninformative priors, we might use priors that are of a convenient form.

• That is, a form where combining the prior with the likelihood produces a posterior in the same family as the prior.

• In our example, the beta distribution is a conjugate prior for a binomial likelihood.
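The beta-binomial case can be stated in one line (a standard identity, added for completeness):

$$\theta \sim \text{Beta}(a, b),\quad x \mid \theta \sim \text{Bi}(n, \theta) \;\Longrightarrow\; \theta \mid x \sim \text{Beta}(a + x,\; b + n - x).$$

The Beta(2,2) to Beta(5,9) update in the coin example is an instance of this with $a = b = 2$, $n = 10$ and $x = 3$.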

Page 29:

Informative priors

• An informative prior is an accurate representation of our prior beliefs.

• We are not interested in the prior being part of some conjugate family.

• An informative prior is essential when we have few or no data for the parameter of interest.

• Elicitation, in this context, is the process of translating someone’s beliefs into a distribution.

Page 30:

Elicitation

• It is unrealistic to expect someone to be able to fully specify their beliefs in terms of a probability distribution.

• Often, they are only able to report a few summaries of the distribution.

• We usually work with medians, modes and percentiles.

• Sometimes they are able to report means and variances, but there are more doubts about the reliability of those judgements.

Page 31:

Elicitation

• Once we have some information about their beliefs, we fit some parametric distribution to them (see the sketch below).

• These distributions almost never fit the judgements precisely.

• There are nonparametric techniques that can bypass this.

• Feedback is essential in the elicitation process.
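A minimal sketch of the parametric step (the elicited judgements, the choice of the beta family and the least-squares fitting criterion are all my own stand-ins):

```python
import numpy as np
from scipy import optimize, stats

# Hypothetical elicited judgements about a proportion theta:
# lower quartile 0.20, median 0.30, upper quartile 0.45.
probs = np.array([0.25, 0.50, 0.75])
quantiles = np.array([0.20, 0.30, 0.45])

def loss(log_params):
    a, b = np.exp(log_params)  # keep the beta parameters positive
    return np.sum((stats.beta.ppf(probs, a, b) - quantiles) ** 2)

res = optimize.minimize(loss, x0=[0.0, 0.0])
a, b = np.exp(res.x)
print(f"Fitted Beta({a:.2f}, {b:.2f})")

# Feedback: show the expert what the fitted prior implies, e.g. its
# 5th and 95th percentiles, and revise if they disagree.
print(stats.beta.ppf([0.05, 0.95], a, b))
```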

Page 32:

Normal with unknown mean

Noninformative prior: with $x_1, \dots, x_n \mid \mu \sim N(\mu, \sigma^2)$ ($\sigma^2$ known) and the improper flat prior $p(\mu) \propto 1$, the posterior is $\mu \mid x \sim N(\bar{x}, \sigma^2/n)$.

Page 33:

Normal with unknown mean

Conjugate prior: taking $\mu \sim N(m, v)$ gives

$\mu \mid x \sim N\!\left(\frac{m/v + n\bar{x}/\sigma^2}{1/v + n/\sigma^2},\; \left(\frac{1}{v} + \frac{n}{\sigma^2}\right)^{-1}\right),$

a precision-weighted average of the prior mean and the sample mean.

Page 34:

Normal with unknown mean

Proper prior:

Page 35:

Structuring prior information

• It is possible to structure our prior beliefs in a hierarchical manner:

  Data model: $x \mid \theta \sim f(x \mid \theta)$

  First level of prior: $\theta \mid \phi \sim \pi_1(\theta \mid \phi)$

  Second level of prior: $\phi \sim \pi_2(\phi)$

• Here $\phi$ is referred to as the hyperparameter(s).

Page 36:

Structuring prior information

• An example of this type of hierarchical structure is a nonparametric regression model.

• We want to know about $\mu$, so the other parameters must be removed by integrating them out of the joint posterior. The other parameters are known as nuisance parameters.

Data model:

First level of prior:

Second level of prior:

Page 37:

Analytical tractability

• The more complexity that is built into your prior and likelihood, the less likely it is that you will be able to derive your posterior analytically.

• In the 1990s, computational techniques were devised to combat this.

• Markov chain Monte Carlo (MCMC) techniques allow us to access our posterior distributions even in complex models. A minimal sketch follows.
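A minimal random-walk Metropolis sketch (my own toy illustration): it targets the Beta(5,9) posterior from the coin example, where the answer is of course available analytically, so the output can be checked.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_post(theta):
    """Unnormalised log posterior for the coin example:
    Beta(2,2) prior and 3 tails in 10 tosses, i.e. Beta(5,9)."""
    if not 0.0 < theta < 1.0:
        return -np.inf
    return 4.0 * np.log(theta) + 8.0 * np.log(1.0 - theta)

theta, samples = 0.5, []
for _ in range(20000):
    prop = theta + 0.1 * rng.standard_normal()   # random-walk proposal
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop                              # accept; otherwise keep theta
    samples.append(theta)

print(np.mean(samples[2000:]))   # should be close to 5/14 ≈ 0.357
```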

Page 38:

Sensitivity analysis

• It is clear that the elicitation of prior distributions is far from being a precise science.

• A good Bayesian analysis will check that the conclusions are sufficiently robust to changes in the prior.

• If they aren’t, we need more data or more agreement on the prior structure.

Page 39:

Summary

• Prior distributions are an important part of Bayesian statistics.

• When modelled properly, they are far from being ad hoc, pick-the-easiest-to-use distributions.

• There are classes of noninformative priors that allow us to represent ignorance.

Page 40:

Gaussian processes

A Bayesian data modelling technique that fully accounts for uncertainty.

Page 41:

Data modelling: a fully probabilistic method

• Bayesian statistics offers a framework to account for uncertainty in data modelling.

• In this section, we'll concentrate on regression using Gaussian processes and the associated Bayesian techniques.

Page 42:

The basic idea

We have

$y_i = f(x_i) + \varepsilon_i$, or, in the error-free case, $y_i = f(x_i)$,

and $f(\cdot)$ and $\varepsilon$ are uncertain.

In order to proceed, we must elicit our beliefs about these two.

$\varepsilon$ can be dealt with as in the previous section.

Page 43:

Gaussian processes

• We assume that $f(\cdot)$ follows a Gaussian process a priori.

• That is: $f(\cdot) \sim GP(m(\cdot), c(\cdot, \cdot))$ for some mean function $m(\cdot)$ and covariance function $c(\cdot, \cdot)$.

• i.e. any finite sample of $f(x)$'s will follow a multivariate normal distribution.

A process is Gaussian if and only if every finite sample from the process is a vector-valued Gaussian random variable.
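In symbols (notation mine, matching the bullets above): for any finite set of inputs $x_1, \dots, x_n$,

$$\big(f(x_1), \dots, f(x_n)\big)^{\mathsf T} \sim N_n(\mathbf{m}, \Sigma), \qquad \mathbf{m}_i = m(x_i), \quad \Sigma_{ij} = c(x_i, x_j),$$

where $m(\cdot)$ and $c(\cdot, \cdot)$ are the GP's mean and covariance functions.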

Page 44:

Gaussian processes

We have prior beliefs about the form of the underlying model.

We observe/experiment to get data about the model, with which we train our GP.

We are left with our posterior beliefs about the model, which can have a 'nice' form.

Page 45:

A simple example

Warning: more audience participation coming up.

Page 46:

A simple example

• Imagine we have data about some one-dimensional phenomenon.

• Also, we'll assume that there is no observational error.

• We'll start with five data points between 0 and 4.

• A priori, we believe $f(\cdot)$ is roughly linear and differentiable everywhere.

Pages 47-49:

A simple example

[Plots: the GP posterior for $f(\cdot)$ based on the five data points, shown as the posterior mean with uncertainty bands]

Page 50:

A simple example with error

• Now, we'll start over and put some Gaussian error on the observations:

  $y_i = f(x_i) + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2_\varepsilon)$

• Note: in kriging, this is equivalent to adding a nugget effect.

Page 51:

A simple example with error

[Plot: the GP posterior refitted with observational error included]

Page 52:

The mean function

Recall that our prior mean for $f(\cdot)$ is given by

$E[f(x)] = h(x)^{\mathsf T}\beta,$

where $h(\cdot)$ is a vector of regression functions evaluated at $x$ and $\beta$ is a vector of unknown coefficients.

The form of the regression functions is dependent on the application.

Page 53:

The mean function

It is common practice to use:

• a constant (bias term), $h(x) = 1$

• linear functions, $h(x) = (1, x^{\mathsf T})^{\mathsf T}$

• Gaussian basis functions

• trigonometric basis functions

• …

It is important to capture your beliefs about $f(\cdot)$ in the mean function.

Page 54:

The correlation structure

The correlation function defines how we believe $f(\cdot)$ will deviate nonparametrically from the mean function.

In the examples here, I have used a stationary correlation function of the squared-exponential form:

$c(x, x') = \exp\{-(x - x')^{\mathsf T} B\, (x - x')\},$

where $B$ is a diagonal matrix of roughness parameters.
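Pulling the mean and correlation pieces together, here is a minimal numpy sketch of GP regression for the 1-D example (the five data values, the zero prior mean and the hyperparameter values are my own stand-ins; a fuller treatment would use the $h(x)^{\mathsf T}\beta$ mean and handle the hyperparameters as on the next slide):

```python
import numpy as np

def sq_exp_corr(x1, x2, b):
    """Stationary squared-exponential correlation c(x, x') = exp(-b (x - x')^2)."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-b * d**2)

# Hypothetical 1-D data on [0, 4], assumed observed without error.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 1.2, 1.8, 3.1, 3.9])

b, sigma2 = 1.0, 1.0                                         # assumed hyperparameters
K = sigma2 * sq_exp_corr(x, x, b) + 1e-10 * np.eye(len(x))   # jitter for stability

xs = np.linspace(0.0, 4.0, 101)   # prediction grid
Ks = sigma2 * sq_exp_corr(xs, x, b)
Kss = sigma2 * sq_exp_corr(xs, xs, b)

# Posterior mean and covariance of f at xs (zero prior mean for simplicity).
mean = Ks @ np.linalg.solve(K, y)
cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
sd = np.sqrt(np.clip(np.diag(cov), 0.0, None))
# 'mean' interpolates the data; 'sd' is ~0 at the observed x values.
```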

Page 55:

Dealing with the model parameters

We have the following hyperparameters: the coefficients $\beta$, the variance $\sigma^2$ and the roughness parameters $B$.

$\beta$ and $\sigma^2$ can be removed analytically using conjugate priors.

The roughness parameters $B$ are not so easily accounted for…

Page 56:

A 2-D example

Rock porosity somewhere in the US.

Page 57:

A 2-D example

Mean of our posterior beliefs about the underlying model, f(.).

Page 58:

A 2-D example

Mean of our posterior beliefs about the underlying model, f(.), in 3D!!!

Page 59:

A 2-D example

Our uncertainty about f(.): two standard deviations.

Page 60:

A 2-D example

Our uncertainty about f(.) looks much better in 3D.

Page 61:

A 2-D example: prediction

• The geologists held back two observations at: P1 = (0.60, 0.35), z1 = 10.0 and P2 = (0.20, 0.90), z2 = 20.8.

• Using our posterior distribution for f(.) and the error term, we get the following 90% credible intervals:

  z1 | rest of points in (8.7, 12.0) and

  z2 | rest of points in (21.1, 26.0)

Page 62:

Diagnostics

• Cross validation allows us to check the validity of our GP fit.

• Two variations are often used: leave-one-out and leave-final-20%-out.

• Leave-one-out: hyperparameter estimates use all of the data and are then fixed when prediction is carried out for each omitted point.

• Leave-final-20%-out (hold out): hyperparameters are estimated using the reduced data subset.

• Cross validation alone is not enough to justify a GP fit. A sketch of leave-one-out follows.
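A minimal sketch of the leave-one-out scheme above, reusing sq_exp_corr, x, y and the fixed hyperparameters from the earlier 1-D GP sketch (so, like everything in that sketch, the numbers are illustrative only):

```python
def loo_rmse(x, y, b, sigma2):
    """Leave-one-out RMSE with the hyperparameters held fixed."""
    errs = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i
        K = sigma2 * sq_exp_corr(x[keep], x[keep], b) + 1e-10 * np.eye(keep.sum())
        k = sigma2 * sq_exp_corr(x[i:i + 1], x[keep], b)
        pred = (k @ np.linalg.solve(K, y[keep]))[0]   # GP posterior mean at x[i]
        errs.append(pred - y[i])
    return np.sqrt(np.mean(np.square(errs)))

print(loo_rmse(x, y, b=1.0, sigma2=1.0))
```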

Page 63:

Cross validation for the 2-D example

• Applying leave-one-out cross validation gives an RMSE of:

  Constant: 2.1787
  Linear: 2.1185

  (Using a linear mean function reduces the RMSE by 2.8%.)

• Applying leave-last-20%-out cross validation gives:

  Constant: 6.8684
  Linear: 5.7466

  (A 16.3% difference.)

Page 64:

Benefits and limitations of GPs

• Gaussian processes offer a rich class of models which, when fitted properly, is extremely flexible.

• They also offer us a framework in which we can account for all of our uncertainty.

• If there are discontinuities in the underlying function, the method will struggle to provide a good fit.

• The computation time hinges on the inversion of an $n \times n$ matrix, where $n$ is the number of data points: an $O(n^3)$ operation.

Page 65:

Extensions

• Nonstationarity in the covariance can be modelled by adding extra levels to the variance term or by deforming the input space.

• Discontinuities can be handled by using piecewise Gaussian process models.

• The GP model can be applied in a classification setting.

• There is a lot more research on GPs, and there will probably be a way of using them in your applications.

Page 66:

Further details

I have set up a section on my website that has a comprehensive list of references for extended information on the topics covered in this presentation.

j-p-gosling.staff.shef.ac.uk