Lecture 4: Introduction to Bayesian Statistics /MCMC...
Lecture 4: Introduction to Bayesian Statistics / MCMC methods
Bruce Walsh lecture notes, 2013 Synbreed course
version 2, July 2013
Overview
• What is Bayesian statistics?
  – Priors and posteriors
• Why Bayesian?
• Details for priors
  – Prior hyperparameters
  – Informative vs. flat priors
  – Conjugate priors
• Describing the posterior
  – Credible intervals (highest density regions)
  – “Hypothesis testing” via Bayes factors
Bayesian statistics
• Classic (“frequentist”) statistics
  – typically reports POINT estimates
  – defines probabilities as the frequency of events if the experiment is replicated over and over
  – Parameters fixed, data random
• Bayesian statistics
  – returns a FULL DISTRIBUTION of the possible parameter values θ given the data x: the posterior p(θ | x)
  – Data fixed, parameters random
Bayes’ theorem
Pr(θ | x) = Pr(x | θ) Pr(θ) / Pr(x)
Pr(θ | x) = posterior for θ given the data x
Pr(x | θ) = likelihood of x given θ
Pr(θ) = prior on θ before the experiment
Typically we write posterior = C × likelihood × prior, where C = 1/Pr(x) is a constant chosen so that the posterior integrates to one.
Can think of Bayesian statistics as a natural extension of likelihood.
Pr(θ | x) = posterior for θ given the data x
  = result when the data and prior information are jointly considered
Pr(x | θ) = likelihood of x given θ
  = signal on the value of θ given the observed data
Pr(θ) = prior on θ before the experiment
  = prior information (distribution) on the possible values of θ before the experiment
A flat prior means all possible values are equally likely (no prior information).
Example: Normal with unknown mean
Posterior = signal from prior × signal from data
When the variance σ₀² of the prior is small, the prior sends a strong signal that pulls the posterior towards µ₀, shrinking (or regressing) the sample mean back towards the prior mean µ₀.
The strength of the signal from the data increases with the sample size n.
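The algebra behind this shrinkage is the standard result for a normal prior on the mean under a normal likelihood with known variance σ²; the equations below are reconstructed from that standard result (the original slides with the derivation did not survive in the transcript), and they match the behavior described above:

```latex
% Prior: \mu \sim N(\mu_0, \sigma_0^2);  data: x_1,\dots,x_n \sim N(\mu, \sigma^2), \sigma^2 known.
% The posterior for \mu is again normal:
\mu \mid \mathbf{x} \sim N\left(\mu^{*}, \sigma^{*2}\right),
\qquad
\sigma^{*2} = \left(\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}\right)^{-1},
\qquad
\mu^{*} = \sigma^{*2}\left(\frac{\mu_0}{\sigma_0^2} + \frac{n\,\bar{x}}{\sigma^2}\right).
% Equivalently, with shrinkage weight B = (\sigma^2/n)\,/\,(\sigma^2/n + \sigma_0^2):
\mu^{*} = B\,\mu_0 + (1 - B)\,\bar{x}.
% Small \sigma_0^2 gives B near 1 (posterior pulled towards \mu_0);
% large n gives B near 0 (the data dominate).
```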
Bayesian = random effects
• Every parameter in a Bayesian analysis is a random effect
• Recall that random-effects models allow us to deal with high-dimensional data sets
• Also gives EXACT results for any sample size (given the chosen prior), rather than the asymptotic results of maximum likelihood
Marginal posteriors
• Often we have parameters of interest plus additional nuisance parameters that are needed for the model but are of no interest
• A marginal posterior integrates the full posterior over the nuisance parameters to remove their effects
• The resulting marginal posterior fully incorporates how any uncertainty in the nuisance parameters influences the distribution of the parameters of interest
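Marginalization is especially simple when working with draws: keeping only the coordinate of interest from joint draws is equivalent to integrating the nuisance parameter out. A small sketch, with a joint distribution invented purely for illustration:

```python
import random

random.seed(1)

# Hypothetical joint distribution: nuisance parameter s ~ N(0, 1) and
# parameter of interest t | s ~ N(s, 1).  Discarding s from each joint
# draw leaves draws from the marginal of t, which here is N(0, 2).
draws_t = []
for _ in range(100_000):
    s = random.gauss(0.0, 1.0)   # draw the nuisance parameter
    t = random.gauss(s, 1.0)     # draw the parameter of interest given s
    draws_t.append(t)            # keep t only: this IS the marginalization

mean_t = sum(draws_t) / len(draws_t)
var_t = sum((d - mean_t) ** 2 for d in draws_t) / len(draws_t)
print(round(mean_t, 2), round(var_t, 2))
```

The sample variance of the kept draws is close to 2, not 1: the uncertainty in the nuisance parameter s has been propagated into the marginal of t, which is the point of the bullet above.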
Why Bayesian?
• Historical: deep philosophical differences over the definition of probability
• Real world (today):
  – Allows incorporation of any prior information
  – Easily handles all of the uncertainty in high-dimensional data sets (as everything is a random effect)
  – Marginal posteriors
  – Computationally feasible with MCMC resampling methods (which easily allow us to generate draws from the posterior or any marginal posterior)
Priors
• The hyperparameters are the values of the parameters in the prior (e.g., the mean and variance for a Gaussian prior)
• Empirical Bayes uses ML to estimate the hyperparameters given the data
• Flat (or uninformative) priors have very high variances, spreading out prior support around some mean value
• In the limit, the probability density function for the prior is uniform. Care may be needed when using a uniform prior if the parameter range is infinite.
Conjugate priors
• A conjugate prior returns a posterior in the same distributional family, but with different parameters
  – Example: a Gaussian prior on the mean under a Gaussian likelihood function
• Specific parameters in many common likelihoods have known conjugate priors
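A minimal sketch of conjugacy using the standard Beta-Binomial pair (a textbook conjugate family, not an example given on the slide): the prior is Beta(a, b), the likelihood is binomial, and the posterior is again Beta, so updating is just arithmetic on the hyperparameters.

```python
# Beta-Binomial conjugacy (standard result): a Beta(a, b) prior on a
# success probability p, combined with a binomial likelihood of k
# successes in n trials, gives a Beta(a + k, b + n - k) posterior --
# the same family, with updated parameters.
def beta_binomial_update(a, b, k, n):
    """Return posterior hyperparameters after observing k successes in n trials."""
    return a + k, b + (n - k)

# Flat prior Beta(1, 1), then observe 7 successes in 10 trials
a_post, b_post = beta_binomial_update(1, 1, 7, 10)
print(a_post, b_post)                   # posterior is Beta(8, 4)

post_mean = a_post / (a_post + b_post)  # posterior mean = 8/12
print(post_mean)
```

No integration is ever needed: this closed-form update is exactly what makes conjugate priors attractive, and (as the next slide notes) what makes certain samplers fast.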
Conjugacy is very important in some of the MCMC methods.
It is the motivation for otherwise “odd” priors that you may see in the literature.
Markov Chain Monte Carlo (MCMC)
Much more detail is given in the online notes.
Overview
• Simulating draws from complex distributions
• Markov chains
  – Definition
  – Irreducible, aperiodic chains
  – Stationarity
• MCMC samplers
  – Metropolis-Hastings algorithm
  – Burning in the sampler
  – Simulated annealing
  – Gibbs sampling
What is MCMC?
• Suppose we have an expression for some very complex distribution (such as a posterior)
• Ideally, it would be nice to have what amounts to a button on our calculator that generates a random draw from this distribution
• By generating a large number of such draws, we can fully characterize the distribution
• MCMC (Markov chain Monte Carlo) provides a way to do this
Markov processes
• A Markov process is a stochastic (random) process such that the probability of having value x at time t depends only on the value at time t-1, and not on any of the previous values
• Pr(X[t] = x | X[0] = x0, X[1] = x1, …, X[t-1] = x_{t-1}) = Pr(X[t] = x | X[t-1] = x_{t-1})
• Most stochastic processes we think of are Markovian
Markov chains
• A Markov chain is simply the realization of a Markov process over time
• If the state space (the collection of possible values) is discrete, a simple matrix equation describes the evolution of a Markov process
• Let π be a column vector whose elements correspond to the state space, so that the ith element of π_t is the probability the chain is in state i at time t
• Let the ijth element of the matrix P be the transition probability from state j to state i
• π_1 = P π_0; hence π_t = P^t π_0
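The iteration π_t = P^t π_0 can be sketched directly; the 3 × 3 transition probabilities below are invented for illustration, written in the column-vector convention P[i][j] = Pr(j → i) so that each column sums to one:

```python
# A small 3-state chain with hypothetical transition probabilities,
# P[i][j] = Pr(next state is i | current state is j).
P = [[0.5, 0.2, 0.1],
     [0.3, 0.6, 0.3],
     [0.2, 0.2, 0.6]]

def step(P, pi):
    """One step of pi_{t+1} = P pi_t (matrix-vector product)."""
    return [sum(P[i][j] * pi[j] for j in range(len(pi))) for i in range(len(P))]

pi = [1.0, 0.0, 0.0]      # start in state 0 with certainty
for _ in range(200):      # iterate: pi_t = P^t pi_0
    pi = step(P, pi)
print([round(x, 4) for x in pi])
```

After enough steps, π stops changing: step(P, pi) returns π itself, which is the stationary distribution of the chain regardless of the starting vector. This is the property MCMC exploits.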
(A 3 × 3 transition matrix over the states R, S, and C was shown here as a figure.) The ith entry in row j is the probability of jumping to state i given that the current state is j.
Example (cont)
Stationarity
Key for MCMC: one can use the mathematical form of any complex distribution to construct transition probabilities for a Markov chain whose stationary distribution yields draws from the target distribution.
This means we can start the chain anywhere (any initial value), and if we run it long enough (burn in the chain), we approach stationarity, and hence draws from the target
(i.e., transitions from j -> k balance transitions from k -> j).
Metropolis-Hastings
The first MCMC approach was the Metropolis-Hastings algorithm.
Basic idea: generate a number, and either accept or reject that number based on a function that depends on the mathematical form of the distribution we are sampling from.
LW Appendix 2 shows that this generates a Markov chain whose stationary values correspond to draws from the target distribution.
MH always works, but it can be VERY slow, as most values might be rejected.
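A minimal sketch of the accept/reject idea: a random-walk Metropolis sampler with a symmetric proposal (so the Hastings correction cancels), targeting a standard normal. The target, step size, and chain length are illustrative choices, not taken from the lecture.

```python
import math
import random

random.seed(2)

# Unnormalized target density: exp(-x^2/2), i.e. N(0, 1) up to a constant.
# Only the mathematical FORM is needed -- no normalizing constant.
def target(x):
    return math.exp(-0.5 * x * x)

def metropolis(n_draws, step=1.0, x0=0.0):
    x = x0
    draws = []
    for _ in range(n_draws):
        x_new = x + random.uniform(-step, step)        # symmetric proposal
        # Accept with probability min(1, target(x_new)/target(x))
        if random.random() < target(x_new) / target(x):
            x = x_new                                  # accept
        draws.append(x)                                # on reject, x repeats
    return draws

draws = metropolis(100_000)
burned = draws[10_000:]                                # discard the burn-in
mean = sum(burned) / len(burned)
var = sum((d - mean) ** 2 for d in burned) / len(burned)
print(round(mean, 2), round(var, 2))                   # should be near 0 and 1
```

Note that a rejected proposal still contributes a draw (the current value repeats); dropping rejected steps entirely would bias the sample.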
Convergence
• Two issues: convergence vs. sampling low-density regions of the posterior
  – Convergence: worry that you have sampled the chain for a sufficiently long period that draws are indeed from the stationary distribution
  – Adequate sampling of low-density regions: even if convergence has been reached, you need to run the sampler long enough to sample important (i.e., extreme) but low-density regions of the posterior
• WL Appendix 2 details some of the diagnostic tools that can be used; however, none is perfect
• Basic idea: either compare subsamples of the chain, or different chains, to assess consistency
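The subsample-comparison idea can be sketched as follows; here an AR(1) series stands in for real sampler output (MCMC draws are autocorrelated in the same way), and all settings are illustrative:

```python
import random

random.seed(3)

# Stand-in for MCMC output: an autocorrelated AR(1) chain.
def ar1_chain(n, rho=0.9, x0=0.0):
    x, out = x0, []
    for _ in range(n):
        x = rho * x + random.gauss(0.0, 1.0)
        out.append(x)
    return out

chain = ar1_chain(100_000)

# Compare summary statistics of the two halves of the chain: if the
# halves disagree badly, the chain has likely not converged (or has not
# been run long enough for stable estimates).
half = len(chain) // 2
m1 = sum(chain[:half]) / half
m2 = sum(chain[half:]) / half
print(round(m1, 2), round(m2, 2))
```

Agreement between subsamples is necessary but not sufficient evidence of convergence, which is why the notes stress that no diagnostic is perfect.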
MH and the Gibbs sampler
• While MH works, as mentioned it can be very slow, in that most proposed values can be rejected, so that (say) 500 iterations of the MH sampler may yield only one or two values
• The Gibbs sampler is a version where ALL draws are accepted
• Key: all conditional distributions must have a nice form (conditionally conjugate likelihoods)
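A minimal Gibbs sketch for a textbook case (a bivariate normal with correlation ρ, not a model from the lecture): both full conditionals are themselves normal, x | y ~ N(ρy, 1-ρ²) and y | x ~ N(ρx, 1-ρ²), so every draw is accepted and there is no rejection step at all.

```python
import random

random.seed(4)

rho = 0.8                       # illustrative correlation
sd = (1 - rho ** 2) ** 0.5      # sd of each full conditional

x, y = 0.0, 0.0
draws_x = []
for _ in range(100_000):
    x = random.gauss(rho * y, sd)   # draw x from its full conditional
    y = random.gauss(rho * x, sd)   # draw y from its full conditional
    draws_x.append(x)               # the kept x's sample the marginal of x

mean_x = sum(draws_x) / len(draws_x)
var_x = sum((d - mean_x) ** 2 for d in draws_x) / len(draws_x)
print(round(mean_x, 2), round(var_x, 2))   # marginal of x is N(0, 1)
```

This also illustrates the point about marginal posteriors below: collecting just the x-coordinate of each sweep gives draws from the marginal distribution of x.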
Note that the samples of (say) the x_i are draws from the marginal posterior for x.
ABC
• Approximate Bayesian Computation (ABC) is becoming more widespread, especially in population genetics
• The idea is that the likelihood may be sufficiently complex that it is easier to simulate from it (given a set of starting parameters)
• For example, consider a population-genetics model for genetic variation with unknown parameters for the mutation rate, the population size before a bottleneck, the size after the bottleneck, and the time in the past at which the bottleneck occurred
• The data are measures of molecular diversity (number of segregating sites, pairwise differences)
ABC (cont.)
• Under ABC, one first draws values for these parameters from some prior
• These are used to run a simulation, which returns simulated data
• If the simulation run matches the observed values, this vector of parameter values is saved; if there is no match, a new set is drawn
• The resulting collection of saved (not rejected) parameter values forms the posterior
ABC (cont.)
• One significant issue is how close is close: with discrete data, one usually wants exact matches; with continuous data, one wants the values to agree within some very small distance metric ε
• The tradeoff is much more CPU time for closer/exact matches vs. less precision if the match is not exact
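The rejection-ABC recipe above can be sketched on a toy problem; a normal model with an unknown mean stands in for the population-genetics example, and the observed value, prior range, and tolerance ε are all invented for illustration:

```python
import random

random.seed(5)

# Simulator: given a parameter mu, return a summary statistic
# (here the sample mean of 20 simulated observations).
def simulate_summary(mu, n=20):
    return sum(random.gauss(mu, 1.0) for _ in range(n)) / n

observed = 1.5    # hypothetical observed summary statistic
eps = 0.05        # tolerance: "close enough" for continuous data

accepted = []
for _ in range(50_000):
    mu = random.uniform(-5.0, 5.0)                 # draw mu from a flat prior
    if abs(simulate_summary(mu) - observed) < eps:
        accepted.append(mu)                        # save on a match
    # otherwise discard and draw a new parameter value

# The accepted values approximate the posterior for mu
post_mean = sum(accepted) / len(accepted)
print(len(accepted), round(post_mean, 2))
```

Note how the tradeoff plays out: only a small fraction of the 50,000 simulations is accepted, and shrinking ε would sharpen the approximation at the cost of accepting even fewer.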