Lecture 4: Introduction to Bayesian Statistics / MCMC methods

Page 1

Lecture 4: Introduction to Bayesian Statistics / MCMC methods

Bruce Walsh lecture notes, 2013 Synbreed course

version 2, July 2013

Page 2

Overview

• What is Bayesian Statistics?

– Priors and posteriors

• Why Bayesian?

• Details for priors

– Prior hyperparameters

– Informative vs. flat priors

– Conjugate priors

• Describing the posterior

– Credible intervals (highest density regions)

– “hypothesis testing” via Bayes factors

Page 3

Bayesian statistics

• Classic (“frequentist”) statistics

– typically report POINT estimates

– define probabilities as the frequency of events if the experiment is replicated over and over

• Parameters fixed, data random

• Bayesian statistics

– returns a FULL DISTRIBUTION of the possible parameter values θ given the data x, the posterior p(θ | x)

• Data fixed, parameters random

Page 4

Bayes’ theorem

Pr(θ | x) = Pr(x | θ) Pr(θ) / Pr(x)

Pr(θ | x) = posterior for θ given the data x.

Pr(x | θ) = likelihood of x given θ.

Pr(θ) = prior on θ before the experiment.

Typically write posterior = C * likelihood * prior, where C = 1/Pr(x) is a constant so that the posterior integrates to one.

Can think of Bayesian statistics as a natural extensionof likelihood.
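As a small numerical illustration of posterior = C * likelihood * prior (a hypothetical example, not from the notes), consider a binomial likelihood evaluated at a few candidate values of θ:

```python
# Hypothetical discrete example: posterior proportional to likelihood times prior.
import numpy as np
from scipy.stats import binom

theta = np.array([0.2, 0.5, 0.8])     # three candidate values of the parameter
prior = np.array([1/3, 1/3, 1/3])     # flat prior: all values equally likely
k, n = 7, 10                          # data: 7 successes in 10 trials

likelihood = binom.pmf(k, n, theta)   # Pr(x | theta) for each candidate value
posterior = likelihood * prior
posterior /= posterior.sum()          # C = 1/Pr(x) makes the posterior sum to one
print(posterior)                      # most mass near theta = 0.8
```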

Page 5

Prob( " | x) = posterior for " given the data x.

Prob(x | " ) = likelihood of x given ".

Prob( " ) = prior on " before the experiment.

= result when the data and prior information are jointly considered

= signal on the value of " given the observed data

= prior information (distribution) of possible " before the experiment

A flat prior means all possible values are equallylikely (no prior information)

Page 6

Example: Normal with unknown mean

Page 7

Page 8

Posterior = (signal from prior) combined with (signal from data)

When the prior variance σ₀² is small, the prior gives a strong signal and pulls the posterior towards the prior mean µ₀, shrinking (regressing) the sample mean back towards µ₀.

The strength of the signal from the data increases with the sample size n.
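For reference, a sketch of the standard conjugate result these annotations describe (assuming n observations with known variance σ², sample mean x̄, and a N(µ₀, σ₀²) prior on the mean µ):

```latex
\mu \mid \mathbf{x} \;\sim\; N\!\left(\mu^{*},\,\sigma^{*2}\right),
\qquad
\sigma^{*2} = \left(\frac{1}{\sigma_0^{2}} + \frac{n}{\sigma^{2}}\right)^{-1},
\qquad
\mu^{*} = \sigma^{*2}\left(\frac{\mu_0}{\sigma_0^{2}} + \frac{n\,\bar{x}}{\sigma^{2}}\right)
```

Small σ₀² weights the prior mean µ₀ heavily; large n weights the sample mean x̄ heavily, matching the shrinkage described above.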

Page 9

Bayesian = random effects

• Every parameter in a Bayesian analysis is a random effect

• Recall that random effects models allow us to deal with high-dimensional data sets

• Also gives EXACT maximum likelihood results for any sample size (given the chosen prior)

Page 10

Marginal posteriors

• Often have parameters of interest and additional nuisance parameters that are needed for the model but are of no interest

• A marginal posterior integrates the full posterior over the nuisance parameters to remove their effects (in symbols, see below)

• The resulting marginal posterior fully incorporates how any uncertainty in the nuisance parameters influences the distribution of the parameters of interest
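In symbols (with θ₁ the parameters of interest and θ₂ the nuisance parameters), the marginal posterior is

```latex
p(\theta_1 \mid \mathbf{x}) \;=\; \int p(\theta_1, \theta_2 \mid \mathbf{x}) \, d\theta_2
```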

Page 11

Why Bayesian?

• Historical: deep philosophical differences on the definition of probability

• Real world (today):

– Allows for the incorporation of any prior information

– Easily handles all of the uncertainty in high-dimensional data sets (as everything is a random effect)

– Marginal posteriors

– Computationally feasible with MCMC resampling methods (which easily allow us to generate draws from the posterior or any marginal posterior)

Page 12

Priors

• The hyperparameters are the values of the parameters in the prior (e.g., the mean and variance for a Gaussian prior)

• Empirical Bayes uses ML to estimate the hyperparameters given the data

• Flat (or uninformative) priors have very high variances, spreading out prior support around some mean value

• In the limit, the probability density function for the prior is a uniform. Care may be needed when using a uniform if the parameter range is infinite.

Page 13

Conjugate priors

• A conjugate prior returns a posterior of the same distributional family, but with different parameters

– Example: a Gaussian prior on the mean under a Gaussian likelihood function

• Specific parameters in many common likelihoods have known conjugate priors (see the sketch below)
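A minimal sketch of one such known conjugate pair (a hypothetical example, not from the notes): a Beta prior on a binomial success probability gives a Beta posterior.

```python
# Beta prior + binomial likelihood -> Beta posterior (conjugate update).
from scipy import stats

a, b = 2.0, 2.0                 # prior hyperparameters of Beta(a, b)
k, n = 7, 10                    # data: 7 successes in 10 trials

posterior = stats.beta(a + k, b + n - k)   # same family, updated parameters
print("posterior mean:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
```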

Page 14

Conjugation is very important in some of the MCMC methods

Motivation for otherwise “odd” priors that you may see in the literature.

Page 15

Page 16

Page 17

Page 18

Page 19

Page 20

Markov Chain Monte Carlo (MCMC)

Much more detail in the online notes

Page 21

Overview

• Simulating draws from complex distributions

• Markov chains

– Definition

– Irreducible, aperiodic chains

– Stationarity

• MCMC samplers

– Metropolis-Hastings algorithm

– Burning in the sampler

– Simulated annealing

– Gibbs sampling

Page 22

What is MCMC?

• Suppose we have an expression for some very complex distribution (such as a posterior)

• Ideally, it would be nice to have what amounts to a button on our calculator that generates a random draw from this distribution

• By generating a large number of such draws, we can fully characterize the distribution

• MCMC (Markov chain Monte Carlo) provides a way to do this

Page 23

Markov Processes

• A Markov process is a stochastic (random) process such that the probability of having value x at time t depends only on the value at time t-1, and not on any of the previous values

• Pr( X[t] = x | X[0] = x_0, X[1] = x_1, …, X[t-1] = x_{t-1} ) = Pr( X[t] = x | X[t-1] = x_{t-1} )

• Most stochastic processes we think of are Markovian

Page 24

Markov chain

• A Markov chain (MC) is simply the realization of a Markov process over time

• If the state space (the collection of possible values) is discrete, a simple matrix equation describes the evolution of the Markov process

• Let π_t be a column vector whose elements correspond to the state space, so that the ith element of π_t is the probability the chain is in state i at time t

• Let the element in row i, column j of the matrix P be the transition probability from state j to state i (so each column of P sums to one)

• Then π_1 = P π_0, and hence π_t = P^t π_0 (see the sketch below)
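A minimal sketch of this matrix iteration (a hypothetical three-state chain, not the example from the slides), using the column-stochastic convention above:

```python
# Iterate pi_t = P^t pi_0 for a small chain and watch it approach stationarity.
import numpy as np

# P[i, j] = Pr(move to state i | currently in state j); each column sums to one.
P = np.array([[0.90, 0.30, 0.10],
              [0.05, 0.60, 0.20],
              [0.05, 0.10, 0.70]])

pi = np.array([1.0, 0.0, 0.0])      # start the chain in state 1 with certainty
for t in range(100):
    pi = P @ pi                     # pi_{t+1} = P pi_t
print(pi)                           # essentially the stationary distribution
```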

Page 25

[Figure: transition matrix P for a three-state example with states R, S, and C]

The entry in row i, column j of P is the probability of jumping to state i given the current state is j.

Page 26

Example (cont)

Page 27

Stationarity

Page 28

Key for MCMC: one can use the mathematical form of any complex distribution to construct the transition probabilities of a Markov chain whose stationary distribution consists of draws from that target distribution.

This means we can start the chain anywhere (any initial value), and if we run it long enough (burn in the chain), we approach stationarity and hence draws from the target.

At stationarity, transitions from j -> k balance transitions from k -> j.
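The balance condition above is detailed balance: for stationary probabilities π and transition probabilities P,

```latex
\pi_j \, P(j \to k) \;=\; \pi_k \, P(k \to j) \quad \text{for all states } j, k .
```

Chains constructed to satisfy this condition have the target as their stationary distribution.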

Page 29

Metropolis-Hastings

The first MCMC approach was the Metropolis-Hastings (MH) algorithm.

Basic idea: generate a candidate value and either accept or reject that value based on a function that depends on the mathematical form of the distribution we are sampling from.

LW Appendix 2 shows that this generates a Markov chain whose stationary values correspond to draws from the target distribution.

MH always works, but it can be VERY slow, as most values might be rejected.
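A minimal random-walk Metropolis sketch (illustrative only; the target density and tuning values are hypothetical, and only a symmetric proposal is shown, so the Hastings correction drops out):

```python
# Random-walk Metropolis: propose a move, accept with probability
# min(1, target(proposal) / target(current)); otherwise stay put.
import numpy as np

def target(x):
    # Unnormalized density we want to sample from (example: a mixture of normals).
    return np.exp(-0.5 * (x - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (x + 2.0) ** 2)

rng = np.random.default_rng(1)
x = 0.0                                     # arbitrary starting value
draws = []
for _ in range(20000):
    proposal = x + rng.normal(scale=1.0)    # symmetric proposal around current value
    if rng.uniform() < target(proposal) / target(x):
        x = proposal                        # accept; otherwise keep the current value
    draws.append(x)
draws = np.array(draws[5000:])              # discard burn-in before summarizing
```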

Page 30

Page 31

Page 32

Page 33

Page 34

Page 35

Page 36

Convergence

• Two issues: convergence vs. sampling low-density regions of the posterior

– Convergence: worry that you have sampled the chain for a sufficiently long period that draws are indeed from the stationary distribution

– Adequate sampling of low-density regions: even if convergence has been reached, you need to run the sampler long enough to sample important (i.e., extreme), but low-density, regions of the posterior

• LW Appendix 2 details some of the diagnostic tools that can be used. However, none are perfect.

• Basic idea: either compare subsamples of the chain or different chains to assess consistency (see the sketch below)
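A rough sketch of the "compare different chains" idea (a simplified Gelman-Rubin style statistic; real diagnostic software is more careful):

```python
# Potential scale reduction across m chains: values near 1 suggest the chains agree.
import numpy as np

def rhat(chains):
    chains = np.asarray(chains)               # shape (m chains, n draws per chain)
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()      # average within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)    # between-chain variance (scaled by n)
    var_hat = (n - 1) / n * W + B / n          # pooled variance estimate
    return np.sqrt(var_hat / W)
```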

Page 37

MH and Gibbs sampler

• While MH works, as mentioned it can be very slow in that most proposal values may be rejected, so that (say) 500 iterations of the MH sampler may yield only one or two accepted values

• The Gibbs sampler is a version where ALL draws are accepted (see the sketch below)

• Key: all conditional distributions must have a nice form (conditionally conjugate likelihoods)
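A minimal Gibbs sketch (a hypothetical bivariate normal with correlation ρ, where both full conditionals have the required "nice" closed form):

```python
# Gibbs sampling: alternately draw each variable from its full conditional.
import numpy as np

rho = 0.8
rng = np.random.default_rng(2)
x, y = 0.0, 0.0
draws = []
for _ in range(10000):
    x = rng.normal(rho * y, np.sqrt(1.0 - rho**2))   # x | y is normal; every draw accepted
    y = rng.normal(rho * x, np.sqrt(1.0 - rho**2))   # y | x is normal
    draws.append((x, y))
draws = np.array(draws[1000:])   # after burn-in, the x column alone samples the marginal of x
```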

Page 38

Page 39

Page 40

Note that the samples of (say) the x_i are draws from the marginal posterior for x.

Page 41

ABC

• Approximate Bayesian Computation (ABC) is becoming more widespread, especially in population genetics

• The idea is that the likelihood may be sufficiently complex that it is easier to simulate from it (given a set of starting parameters)

• For example, consider a pop-gen model for genetic variation with unknown parameters for the mutation rate, the population size before a bottleneck, the size after the bottleneck, and the time in the past at which the bottleneck occurred

• The data are measures of molecular diversity (number of segregating sites, pairwise differences)

Page 42

ABC (cont)

• Under ABC, one first draws values for these parameters from some prior

• These are used to run a simulation, which returns simulated values of the observables

• If the simulation run matches the observed values, this vector of parameter values is saved. If there is no match, a new set is drawn

• The resulting collection of saved (not rejected) parameter values forms the posterior

Page 43

ABC (cont)

• One significant issue is how close is close: with discrete data, one usually wants exact matches; with continuous data, one wants the values to agree within some very small distance ε

• The tradeoff is much more CPU time for closer/exact matches vs. less precision if the match is not exact (see the sketch below)
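A minimal ABC rejection sketch (illustrative; simulate() is a hypothetical stand-in for a complex simulator such as the pop-gen model above, and the numbers are made up):

```python
# ABC rejection: draw parameters from the prior, simulate data, keep parameters
# whose simulated summary statistic lands within eps of the observed value.
import numpy as np

rng = np.random.default_rng(3)
observed = 4.2        # observed summary statistic (e.g., mean pairwise differences)
eps = 0.1             # tolerance: smaller eps = more CPU time, higher precision

def simulate(theta):
    # Placeholder simulator: returns a summary statistic given parameter theta.
    return rng.normal(theta, 1.0)

accepted = []
while len(accepted) < 1000:
    theta = rng.uniform(0.0, 10.0)                # draw from the prior
    if abs(simulate(theta) - observed) < eps:     # close enough to the data?
        accepted.append(theta)                    # kept values form the ABC posterior
print(np.mean(accepted), np.std(accepted))
```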