Bayesian methods, priors and Gaussian processes

John Paul Gosling, Department of Probability and Statistics

Transcript of slides presented at "An Overview of State-of-the-Art Data Modelling", 24-25 January 2007.

Page 1:

Bayesian methods, priors and Gaussian processes

John Paul Gosling, Department of Probability and Statistics

Page 2:

Overview

• The Bayesian paradigm

• Bayesian data modelling

• Quantifying prior beliefs

• Data modelling with Gaussian processes

Page 3:

Bayesian methods

The beginning, the subjectivist philosophy, and an overview of Bayesian techniques.

Page 4:

Subjective probability

• Bayesian statistics involves a very different way of thinking about probability from classical inference.

• The probability of a proposition is defined to be a measure of a person's degree of belief.

• Wherever there is uncertainty, there is probability.

• This covers both aleatory uncertainty (inherent randomness) and epistemic uncertainty (lack of knowledge).

Page 5:

Differences with classical inference

To a frequentist, data are repeatable, parameters are not:

P(data|parameters)

To a Bayesian, the parameters are uncertain, the observed data are not:

P(parameters|data)

Page 6:

Bayes's theorem for distributions

• In early probability courses, we are taught Bayes's theorem for events:

  $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$

• This can be extended to continuous distributions:

  $f(\theta \mid x) = \frac{f(x \mid \theta)\,f(\theta)}{f(x)}$

• In Bayesian statistics, we use Bayes's theorem in a particular way: $\theta$ is the uncertain parameter, $x$ is the observed data, and the theorem takes us from prior to posterior beliefs about $\theta$.

Page 7:

Prior to posterior updating

Bayes's theorem is used to update our beliefs:

$\pi(\theta \mid x) \propto \pi(\theta) \times L(\theta; x)$

The posterior is proportional to the prior times the likelihood.

[Diagram: Prior + Data → Posterior]

Page 8:

Posterior distribution

• So, once we have our posterior, we have captured all of our beliefs about the parameter of interest.

• We can use this for informal inference, e.g. credible intervals and summary statistics.

• Formally, to make choices about the parameter, we must couple the posterior with decision theory to calculate the optimal decision.
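That last point can be made concrete (a standard formulation, added here): the optimal decision minimises posterior expected loss,

$$d^{*} = \arg\min_{d} \int L(d, \theta)\, \pi(\theta \mid x)\, \mathrm{d}\theta,$$

where $L(d, \theta)$ is the loss incurred by taking decision $d$ when the parameter turns out to be $\theta$.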

Page 9:

Sequential updating

Today's posterior is tomorrow's prior.

[Diagram: Prior beliefs + Data → Posterior beliefs; Posterior beliefs + More data → updated Posterior beliefs]

Page 10:

The triplot

• A triplot gives a graphical representation of prior to posterior updating.

[Plot: prior, likelihood and posterior densities drawn on the same axes]

Page 11:

Audience participation

Quantification of our prior beliefs

• What proportion of people in this room are left-handed? Call this parameter ψ.

• When I toss this coin, what's the probability of me getting a tail? Call this θ.

Page 12:

A simple example

• The archetypal example in probability theory is the outcome of tossing a coin.

• Each toss of a coin is a Bernoulli trial with the probability of tails given by θ.

• If we carry out 10 independent trials, we know the number of tails, $X$, will follow a binomial distribution: $X \mid \theta \sim \text{Bi}(10, \theta)$.

Page 13:

Our prior distribution

• A Beta(2,2) distribution may reflect our beliefs about θ.

Page 14:

Our posterior distribution

• If we observe X = 3, we get the following triplot:
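The update behind this triplot can be written out directly, using the conjugacy noted later on Page 28 (a worked step added for clarity):

$$\pi(\theta \mid x = 3) \;\propto\; \underbrace{\theta(1-\theta)}_{\text{Beta}(2,2)\text{ prior}} \times \underbrace{\theta^{3}(1-\theta)^{7}}_{\text{binomial likelihood}} \;=\; \theta^{4}(1-\theta)^{8},$$

so $\theta \mid x \sim \text{Beta}(5, 9)$.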

Page 15:

Our posterior distribution

• If we are more convinced, a priori, that θ = 0.5 and we observe X = 3, we get the following triplot:

Page 16:

Credible intervals

• If asked to provide an interval in which there is a 90% chance of θ lying, we can derive this directly from our posterior distribution.

• Such an interval is called a credible interval.

• In frequentist statistics, there are confidence intervals, which cannot be interpreted in the same way.

• In our example, using our first prior distribution, we can report a 95% posterior credible interval for θ of (0.14,0.62).
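A minimal check of that interval using scipy (the Beta(5,9) posterior is the one worked out earlier for this example):

```python
from scipy import stats

# Posterior for the coin example: Beta(2,2) prior, 3 tails in 10 tosses.
posterior = stats.beta(2 + 3, 2 + 7)   # Beta(5, 9)

# Equal-tailed 95% posterior credible interval for theta.
lo, hi = posterior.ppf([0.025, 0.975])
print(f"95% credible interval: ({lo:.2f}, {hi:.2f})")   # approximately (0.14, 0.62)
```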

Page 17:

Basic linear model

• Yesterday we saw a lot of this:

  $y = X\beta + \varepsilon$

• We have a least squares solution given by

  $\hat{\beta} = (X^{\mathsf T}X)^{-1}X^{\mathsf T}y$

• Instead of trying to find the optimal set of parameters, we express our beliefs about them.

Page 18:

Basic linear model

• By selecting appropriate priors for the two parameters, we can derive the posterior analytically.

• It is a normal inverse-gamma distribution.

• The mean of our posterior distribution is then

  $E[\beta \mid y] = \big(V_0^{-1} + X^{\mathsf T}X\big)^{-1}\big(V_0^{-1} m_0 + X^{\mathsf T}X\hat{\beta}\big),$

  where $m_0$ and $\sigma^2 V_0$ are the prior mean and covariance of $\beta$; this is a weighted average of the LSE $\hat{\beta}$ and the prior mean $m_0$.
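A small numerical sketch of this update (the data and prior settings are my own stand-ins; σ² is handled through the normal inverse-gamma scaling above, so it cancels from the posterior mean):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data from a straight line with noise.
n = 20
X = np.column_stack([np.ones(n), rng.uniform(0, 4, n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, 0.5, n)

# Least squares estimate.
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Prior: beta | sigma^2 ~ N(m0, sigma^2 V0), so sigma^2 cancels below.
m0 = np.zeros(2)
V0_inv = 0.1 * np.eye(2)   # weak prior precision

# Posterior mean: a weighted average of the prior mean and the LSE.
A = V0_inv + X.T @ X
post_mean = np.linalg.solve(A, V0_inv @ m0 + X.T @ X @ beta_ls)
print(beta_ls, post_mean)
```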

Page 19:

Bayesian model comparison

• Suppose we have two plausible models for a set of data, M and N say.

• We can calculate posterior odds in favour of M using

  $\frac{P(M \mid x)}{P(N \mid x)} = \frac{P(M)}{P(N)} \times \frac{P(x \mid M)}{P(x \mid N)},$

  i.e. posterior odds = prior odds × Bayes factor.

Page 20:

Bayesian model comparison

• The Bayes factor is calculated using

  $B = \frac{P(x \mid M)}{P(x \mid N)},$

  the ratio of the marginal likelihoods of the data under the two models.

• A Bayes factor greater than one means that your odds in favour of M increase.

• Bayes factors naturally help guard against too much model structure, because the marginal likelihood penalises unnecessary complexity. A toy calculation follows.
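As a toy illustration (mine, not from the slides), we can compare M, a fair coin with θ fixed at 0.5, against N, where θ has a Beta(2,2) prior, after seeing 3 tails in 10 tosses. Under N the marginal likelihood integrates the binomial likelihood against the prior, which for a beta prior has the closed beta-binomial form used below.

```python
import numpy as np
from scipy import stats
from scipy.special import betaln, comb

x, n = 3, 10

# Model M: theta fixed at 0.5, so P(x|M) is just a binomial probability.
ml_M = stats.binom.pmf(x, n, 0.5)

# Model N: theta ~ Beta(a, b); P(x|N) = C(n,x) B(a+x, b+n-x) / B(a,b).
a, b = 2.0, 2.0
ml_N = comb(n, x) * np.exp(betaln(a + x, b + n - x) - betaln(a, b))

print(f"Bayes factor in favour of M: {ml_M / ml_N:.2f}")
```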

Page 21:

Advantages/Disadvantages

• Bayesian methods are often more complex than frequentist methods.

• There is not much software to give scientists off-the-shelf analyses.

• Subjectivity: all the inferences are based on somebody’s beliefs.

Page 22:

Advantages/Disadvantages

• Bayesian statistics offers a framework to deal with all the uncertainty.

• Bayesians make use of more information, not just the data in their particular experiment.

• The Bayesian paradigm is very flexible and is able to tackle problems that frequentist techniques could not.

• In selecting priors and likelihoods, Bayesians are showing their hands: they can't get away with making arbitrary choices when it comes to inference.

• …

Page 23:

Summary

• The basic principles of Bayesian statistics have been covered.

• We have seen how we update our beliefs in the light of data.

• Hopefully, I’ve convinced you that the Bayesian way is the right way.

Page 24:

Priors

Advice on choosing suitable prior distributions and eliciting their parameters.

Page 25:

Importance of priors

• As we saw in the previous section, prior beliefs about uncertain parameters are a fundamental part of Bayesian statistics.

• When we have few data about the parameter of interest, our prior beliefs dominate inference about that parameter.

• In any application, effort should be made to model our prior beliefs accurately.

Page 26:

Weak prior information

• If we accept the subjective nature of Bayesian statistics but are not comfortable using subjective priors, then many have argued that we should try to specify prior distributions that represent no prior information.

• These prior distributions are called noninformative, reference, ignorance or weak priors.

• The idea is to have a completely flat prior distribution over all possible values of the parameter.

• Unfortunately, this can lead to improper distributions being used.

Page 27:

Weak prior information

• In our coin tossing example, Be(1,1) (the uniform distribution), Be(0.5,0.5) (the Jeffreys prior) and Be(0,0) have been recommended as noninformative priors; Be(0,0) is improper.

Page 28:

Conjugate priors

• When we move away from noninformative priors, we might use priors that are of a convenient form.

• That is, a form where combining the prior with the likelihood produces a posterior in the same family as the prior.

• In our example, the beta distribution is a conjugate prior for a binomial likelihood.
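The beta-binomial case can be stated in one line (a standard identity, added for completeness):

$$\theta \sim \text{Beta}(a, b),\quad x \mid \theta \sim \text{Bi}(n, \theta) \;\Longrightarrow\; \theta \mid x \sim \text{Beta}(a + x,\; b + n - x).$$

The Beta(2,2) to Beta(5,9) update in the coin example is an instance of this with $a = b = 2$, $n = 10$ and $x = 3$.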

Page 29:

Informative priors

• An informative prior is an accurate representation of our prior beliefs.

• We are not interested in the prior being part of some conjugate family.

• An informative prior is essential when we have few or no data for the parameter of interest.

• Elicitation, in this context, is the process of translating someone’s beliefs into a distribution.

Page 30:

Elicitation

• It is unrealistic to expect someone to be able to fully specify their beliefs in terms of a probability distribution.

• Often, they are only able to report a few summaries of the distribution.

• We usually work with medians, modes and percentiles.

• Sometimes they are able to report means and variances, but there are more doubts about the reliability of those judgements.

Page 31:

Elicitation

• Once we have some information about their beliefs, we fit some parametric distribution to them (see the sketch below).

• These distributions almost never fit the judgements precisely.

• There are nonparametric techniques that can bypass this.

• Feedback is essential in the elicitation process.
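A minimal sketch of the parametric step (the elicited judgements, the choice of the beta family and the least-squares fitting criterion are all my own stand-ins):

```python
import numpy as np
from scipy import optimize, stats

# Hypothetical elicited judgements about a proportion theta:
# lower quartile 0.20, median 0.30, upper quartile 0.45.
probs = np.array([0.25, 0.50, 0.75])
quantiles = np.array([0.20, 0.30, 0.45])

def loss(log_params):
    a, b = np.exp(log_params)  # keep the beta parameters positive
    return np.sum((stats.beta.ppf(probs, a, b) - quantiles) ** 2)

res = optimize.minimize(loss, x0=[0.0, 0.0])
a, b = np.exp(res.x)
print(f"Fitted Beta({a:.2f}, {b:.2f})")

# Feedback: show the expert what the fitted prior implies, e.g. its
# 5th and 95th percentiles, and revise if they disagree.
print(stats.beta.ppf([0.05, 0.95], a, b))
```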

Page 32:

Normal with unknown mean

Noninformative prior: with $x_1, \dots, x_n \mid \mu \sim N(\mu, \sigma^2)$ ($\sigma^2$ known) and the improper flat prior $p(\mu) \propto 1$, the posterior is $\mu \mid x \sim N(\bar{x}, \sigma^2/n)$.

Page 33:

Normal with unknown mean

Conjugate prior: taking $\mu \sim N(m, v)$ gives

$\mu \mid x \sim N\!\left(\frac{m/v + n\bar{x}/\sigma^2}{1/v + n/\sigma^2},\; \left(\frac{1}{v} + \frac{n}{\sigma^2}\right)^{-1}\right),$

a precision-weighted average of the prior mean and the sample mean.

Page 34:

Normal with unknown mean

Proper prior:

Page 35:

Structuring prior information

• It is possible to structure our prior beliefs in a hierarchical manner:

  Data model: $x \mid \theta \sim f(x \mid \theta)$

  First level of prior: $\theta \mid \phi \sim \pi_1(\theta \mid \phi)$

  Second level of prior: $\phi \sim \pi_2(\phi)$

• Here $\phi$ is referred to as the hyperparameter(s).

Page 36:

Structuring prior information

• An example of this type of hierarchical structure is a nonparametric regression model.

• We want to know about $\mu$, so the other parameters must be removed by integrating them out of the joint posterior. The other parameters are known as nuisance parameters.

Data model:

First level of prior:

Second level of prior:

Page 37:

Analytical tractability

• The more complexity that is built into your prior and likelihood, the less likely it is that you will be able to derive your posterior analytically.

• In the 1990s, computational techniques were devised to combat this.

• Markov chain Monte Carlo (MCMC) techniques allow us to access our posterior distributions even in complex models. A minimal sketch follows.
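A minimal random-walk Metropolis sketch (my own toy illustration): it targets the Beta(5,9) posterior from the coin example, where the answer is of course available analytically, so the output can be checked.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_post(theta):
    """Unnormalised log posterior for the coin example:
    Beta(2,2) prior and 3 tails in 10 tosses, i.e. Beta(5,9)."""
    if not 0.0 < theta < 1.0:
        return -np.inf
    return 4.0 * np.log(theta) + 8.0 * np.log(1.0 - theta)

theta, samples = 0.5, []
for _ in range(20000):
    prop = theta + 0.1 * rng.standard_normal()   # random-walk proposal
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop                              # accept; otherwise keep theta
    samples.append(theta)

print(np.mean(samples[2000:]))   # should be close to 5/14 ≈ 0.357
```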

Page 38:

Sensitivity analysis

• It is clear that the elicitation of prior distributions is far from being a precise science.

• A good Bayesian analysis will check that the conclusions are sufficiently robust to changes in the prior.

• If they aren’t, we need more data or more agreement on the prior structure.

Page 39:

Summary

• Prior distributions are an important part of Bayesian statistics.

• When modelled properly, they are far from being ad hoc, pick-the-easiest-to-use distributions.

• There are classes of noninformative priors that allow us to represent ignorance.

Page 40:

Gaussian processes

A Bayesian data modelling technique that fully accounts for uncertainty.

Page 41:

Data modelling: a fully probabilistic method

• Bayesian statistics offers a framework to account for uncertainty in data modelling.

• In this section, we'll concentrate on regression using Gaussian processes and the associated Bayesian techniques.

Page 42:

The basic idea

We have

$y_i = f(x_i) + \varepsilon_i$, or, in the error-free case, $y_i = f(x_i)$,

and $f(\cdot)$ and $\varepsilon$ are uncertain.

In order to proceed, we must elicit our beliefs about these two.

$\varepsilon$ can be dealt with as in the previous section.

Page 43:

Gaussian processes

• We assume that $f(\cdot)$ follows a Gaussian process a priori.

• That is: $f(\cdot) \sim GP(m(\cdot), c(\cdot, \cdot))$ for some mean function $m(\cdot)$ and covariance function $c(\cdot, \cdot)$.

• i.e. any finite sample of $f(x)$'s will follow a multivariate normal distribution.

A process is Gaussian if and only if every finite sample from the process is a vector-valued Gaussian random variable.
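In symbols (notation mine, matching the bullets above): for any finite set of inputs $x_1, \dots, x_n$,

$$\big(f(x_1), \dots, f(x_n)\big)^{\mathsf T} \sim N_n(\mathbf{m}, \Sigma), \qquad \mathbf{m}_i = m(x_i), \quad \Sigma_{ij} = c(x_i, x_j),$$

where $m(\cdot)$ and $c(\cdot, \cdot)$ are the GP's mean and covariance functions.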

Page 44:

Gaussian processes

We have prior beliefs about the form of the underlying model.

We observe/experiment to get data about the model, with which we train our GP.

We are left with our posterior beliefs about the model, which can have a 'nice' form.

Page 45:

A simple example

Warning: more audience participation coming up.

Page 46:

A simple example

• Imagine we have data about some one-dimensional phenomenon.

• Also, we'll assume that there is no observational error.

• We'll start with five data points between 0 and 4.

• A priori, we believe $f(\cdot)$ is roughly linear and differentiable everywhere.

Pages 47-49:

A simple example

[Plots: the GP posterior for $f(\cdot)$ based on the five data points, shown as the posterior mean with uncertainty bands]

Page 50:

A simple example with error

• Now, we'll start over and put some Gaussian error on the observations:

  $y_i = f(x_i) + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2_\varepsilon)$

• Note: in kriging, this is equivalent to adding a nugget effect.

Page 51:

A simple example with error

[Plot: the GP posterior refitted with observational error included]

Page 52:

The mean function

Recall that our prior mean for $f(\cdot)$ is given by

$E[f(x)] = h(x)^{\mathsf T}\beta,$

where $h(\cdot)$ is a vector of regression functions evaluated at $x$ and $\beta$ is a vector of unknown coefficients.

The form of the regression functions is dependent on the application.

Page 53:

The mean function

It is common practice to use:

• a constant (bias term), $h(x) = 1$

• linear functions, $h(x) = (1, x^{\mathsf T})^{\mathsf T}$

• Gaussian basis functions

• trigonometric basis functions

• …

It is important to capture your beliefs about $f(\cdot)$ in the mean function.

Page 54:

The correlation structure

The correlation function defines how we believe $f(\cdot)$ will deviate nonparametrically from the mean function.

In the examples here, I have used a stationary correlation function of the squared-exponential form:

$c(x, x') = \exp\{-(x - x')^{\mathsf T} B\, (x - x')\},$

where $B$ is a diagonal matrix of roughness parameters.
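Pulling the mean and correlation pieces together, here is a minimal numpy sketch of GP regression for the 1-D example (the five data values, the zero prior mean and the hyperparameter values are my own stand-ins; a fuller treatment would use the $h(x)^{\mathsf T}\beta$ mean and handle the hyperparameters as on the next slide):

```python
import numpy as np

def sq_exp_corr(x1, x2, b):
    """Stationary squared-exponential correlation c(x, x') = exp(-b (x - x')^2)."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-b * d**2)

# Hypothetical 1-D data on [0, 4], assumed observed without error.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 1.2, 1.8, 3.1, 3.9])

b, sigma2 = 1.0, 1.0                                         # assumed hyperparameters
K = sigma2 * sq_exp_corr(x, x, b) + 1e-10 * np.eye(len(x))   # jitter for stability

xs = np.linspace(0.0, 4.0, 101)   # prediction grid
Ks = sigma2 * sq_exp_corr(xs, x, b)
Kss = sigma2 * sq_exp_corr(xs, xs, b)

# Posterior mean and covariance of f at xs (zero prior mean for simplicity).
mean = Ks @ np.linalg.solve(K, y)
cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
sd = np.sqrt(np.clip(np.diag(cov), 0.0, None))
# 'mean' interpolates the data; 'sd' is ~0 at the observed x values.
```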

Page 55:

Dealing with the model parameters

We have the following hyperparameters: the coefficients $\beta$, the variance $\sigma^2$ and the roughness parameters $B$.

$\beta$ and $\sigma^2$ can be removed analytically using conjugate priors.

The roughness parameters $B$ are not so easily accounted for…

Page 56:

A 2-D example

Rock porosity somewhere in the US.

Page 57:

A 2-D example

Mean of our posterior beliefs about the underlying model, f(.).

Page 58:

A 2-D example

Mean of our posterior beliefs about the underlying model, f(.), in 3D!!!

Page 59:

A 2-D example

Our uncertainty about f(.): two standard deviations.

Page 60:

A 2-D example

Our uncertainty about f(.) looks much better in 3D.

Page 61:

A 2-D example: prediction

• The geologists held back two observations at: P1 = (0.60, 0.35), z1 = 10.0 and P2 = (0.20, 0.90), z2 = 20.8.

• Using our posterior distribution for f(.) and the error term, we get the following 90% credible intervals:

  z1 | rest of points in (8.7, 12.0) and

  z2 | rest of points in (21.1, 26.0)

Page 62:

Diagnostics

• Cross validation allows us to check the validity of our GP fit.

• Two variations are often used: leave-one-out and leave-final-20%-out.

• Leave-one-out: hyperparameter estimates use all of the data and are then fixed when prediction is carried out for each omitted point.

• Leave-final-20%-out (hold out): hyperparameters are estimated using the reduced data subset.

• Cross validation alone is not enough to justify a GP fit. A sketch of leave-one-out follows.
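A minimal sketch of the leave-one-out scheme above, reusing sq_exp_corr, x, y and the fixed hyperparameters from the earlier 1-D GP sketch (so, like everything in that sketch, the numbers are illustrative only):

```python
def loo_rmse(x, y, b, sigma2):
    """Leave-one-out RMSE with the hyperparameters held fixed."""
    errs = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i
        K = sigma2 * sq_exp_corr(x[keep], x[keep], b) + 1e-10 * np.eye(keep.sum())
        k = sigma2 * sq_exp_corr(x[i:i + 1], x[keep], b)
        pred = (k @ np.linalg.solve(K, y[keep]))[0]   # GP posterior mean at x[i]
        errs.append(pred - y[i])
    return np.sqrt(np.mean(np.square(errs)))

print(loo_rmse(x, y, b=1.0, sigma2=1.0))
```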

Page 63:

Cross validation for the 2-D example

• Applying leave-one-out cross validation gives an RMSE of:

  Constant: 2.1787
  Linear: 2.1185

  (Using a linear mean function reduces the RMSE by 2.8%.)

• Applying leave-last-20%-out cross validation gives:

  Constant: 6.8684
  Linear: 5.7466

  (A 16.3% difference.)

Page 64:

Benefits and limitations of GPs

• Gaussian processes offer a rich class of models which, when fitted properly, is extremely flexible.

• They also offer us a framework in which we can account for all of our uncertainty.

• If there are discontinuities in the underlying function, the method will struggle to provide a good fit.

• The computation time hinges on the inversion of an $n \times n$ matrix, where $n$ is the number of data points: an $O(n^3)$ operation.

Page 65:

Extensions

• Nonstationarity in the covariance can be modelled by adding extra levels to the variance term or by deforming the input space.

• Discontinuities can be handled by using piecewise Gaussian process models.

• The GP model can be applied in a classification setting.

• There is a lot more research on GPs, and there will probably be a way of using them in your applications.

Page 66:

Further details

I have set up a section on my website that has a comprehensive list of references for extended information on the topics covered in this presentation.

j-p-gosling.staff.shef.ac.uk