
Page 1

Lecture 1: Bayesian Inference and Data Analysis

Department of Statistics, Rajshahi University, Rajshahi

- Anandamayee Majumdar
Visiting Scientist, University of North Texas School of Public Health, USA; Professor, University of Suzhou, China.

Page 2

Overview
• Applications
• Introduction
• Steps and Components
• Motivation
• Bayes Rule
• Probability as a Measure of Certainty
• Simulation from a distribution using the inverse CDF
• A one-parameter model example
• Binomial example approached by Bayes and Laplace

Page 3

Applications to Computer Science

• Bayesian inference has applications in artificial intelligence and expert systems. Bayesian inference techniques have been a fundamental part of computerized pattern recognition since the late 1950s.

• Bayesian inference has recently gained popularity in the phylogenetics community because a number of applications allow many demographic and evolutionary parameters to be estimated simultaneously. In population genetics and dynamical systems theory, approximate Bayesian computation (ABC) is also becoming increasingly popular.

• As applied to statistical classification, Bayesian inference has been used in recent years to develop algorithms for identifying e-mail spam. 

Page 4

Application to the Court Room

• Bayesian inference can be used by jurors to coherently accumulate the evidence for and against a defendant, and to see whether, in totality, it meets their personal threshold for 'beyond a reasonable doubt’. The benefit of a Bayesian approach is that it gives the juror an unbiased, rational mechanism for combining evidence.

Page 5

Other Applications

• Population genetics
• Ecology
• Archaeology
• Environmental Science
• Finance
• …and many more

Page 6

Introduction: Bayesian Inference

• Practical methods for learning from data
• Use of probability models
• Quantification of uncertainty

Page 7

Steps
1. Set up a full probability model (a joint distribution for all observable and unobservable quantities in the problem)
• Consistent with the underlying scientific problem
• Consistent with the data collection process

Page 8

Steps (continued)
2. Condition on the observed data: calculate and interpret the posterior distribution (the conditional probability distribution of the unobserved quantities given the observed data)

p(θ | Data)

Page 9

Steps (continued)

3. Evaluate the fit of the model and the implications of the resulting posterior distribution
• Does the model fit the data?
• Are the conclusions reasonable?
• How sensitive are the results to the modeling assumptions in step 1?

Page 10

Step 3 continued

• If necessary, one can alter or expand the model and repeat the three steps.

Page 11

Step 1 is a stumbling block

• How do we go about constructing the joint distribution, i.e. the full probability model?

• Improved techniques for carrying out the second step may help.

• Advances in carrying out the third step somewhat alleviate the issue of incorrect model specification in the first step.

Page 12

Primary motivation for Bayesian thinking

• Facilitates a common-sense interpretation of statistical conclusions.

• E.g., a Bayesian (probability) interval for an unknown quantity of interest can be directly regarded as having a high probability of containing the unknown quantity, in contrast to a frequentist (confidence) interval, which is justified retrospectively in terms of the sampling methodology.

Page 13

Primary motivation for Bayesian thinking (continued)

• Increased emphasis is now placed on interval estimation rather than hypothesis testing, which adds a strong impetus to the Bayesian viewpoint.

• We shall look at the extent to which Bayesian interpretations of common simple statistical procedures are justified.

Page 14

Real Life Example

• A clinical trial of cancer patients might be designed to compare the 5-year survival probability under a new drug with that under the standard treatment.
• Inference is based on a sample of patients.
• We cannot assign patients to both treatments.
• Causal inference: compare the observed outcome in a patient to the unobserved outcome had the patient been exposed to the other treatment.

Page 15

Two kinds of estimands

• Estimand = Unobserved quantity for which inference is needed

1. Potentially observable quantities (ỹ).

2. Quantities that are not directly observable (parameters, θ).

• The first kind helps us understand how the model fits real data.

Page 16

General notation

• θ → denotes unobservable vector quantities or population parameters of interest

• y → observed data, y = (y1, y2, …, yn)
• ỹ → potentially observable but unknown quantity (replication, future prediction, etc.)
• In general these are multivariate quantities

Page 17

General notation

• x → explanatory variable / covariate

• X → entire set of explanatory variables for all n units (of data)

Page 18

Fundamental Difference

Bayesian Approach
• Inference about θ → based on p(θ | y)
• Inference about ỹ → based on p(ỹ | y)
• Bayesian statistical conclusions are made using probability statements (‘highly unlikely’, ‘very likely’).

Frequentist Approach
• Inference about θ → based on p(y | θ)
• Inference about ỹ → based on an estimate of θ, which is based on p(y | θ)
• Frequentist statistical conclusions are based on p-values (‘not significant’, ‘the null hypothesis cannot be rejected’, etc.).

Page 19

Practical similarity? Difference?

• Despite the differences, in many simple analyses the two approaches yield superficially similar results (especially in asymptotic cases).
• Bayesian methods can be easily extended to more complex problems.
• Bayesian models usually work better with less data.
• Bayesian methods can include prior information in the analysis through the prior distribution.
• Easy sequential updating of inference is possible by taking the previous posterior distribution as the new prior distribution (Bayesian updating) as new data become available.

Page 20

A Fundamental Concept: The Prior Distribution

• θ → treated as random because it is unknown to us, though we may have some prior feeling about it.

• Prior distribution → “subjective” probability that quantifies whatever belief (however vague) we may have about θ before having looked at the data.

Page 21

Fundamental Result: Bayes Rule

• Due to Thomas Bayes (1702–1761)
• Joint distribution: p(θ, y) = p(y | θ) p(θ)
• p(θ | y) = p(θ, y)/p(y) = p(y | θ) p(θ)/p(y)

Page 22

Gist – Main point to remember

• p(θ | y) ∝ p(y | θ) p(θ), since p(y) is free of θ

• Any two datasets that yield the same likelihood yield the same inference.

• This encapsulates the technical core of Bayesian inference: the primary task is to develop the model p(θ, y) and perform the necessary computations to summarize p(θ | y) appropriately.

Page 23

Posterior Prediction

• After data y have been observed, an unknown observable ỹ can be predicted using similar conditional ideas.

• p(ỹ | y) = ∫ p(θ, ỹ | y) dθ = ∫ p(ỹ | θ, y) p(θ | y) dθ = ∫ p(ỹ | θ) p(θ | y) dθ

Page 24

Attractive property of Bayes Rule

• Posterior odds
= p(θ1 | y) / p(θ2 | y)
= {p(θ1) p(y | θ1)/p(y)} / {p(θ2) p(y | θ2)/p(y)}
= {p(θ1) / p(θ2)} {p(y | θ1) / p(y | θ2)}
= Prior odds × Likelihood ratio

Page 25

Example: Hemophilia Inheritance

• Father → XY, Mother → XX
• Hemophilia exhibits X-chromosome-linked recessive inheritance.
• If a son receives a bad chromosome from the mother, he will be affected.
• If a daughter receives one bad chromosome from the mother, she will not be affected, but will be a carrier.
• If both X chromosomes are affected in a woman, the condition is fatal (a rare occurrence).

Page 26

• A woman has an affected brother → her mother must be a carrier of hemophilia
• Mother → X_good X_bad
• Father not affected

Unknown quantity of interest:

θ = 1 if the woman is a carrier, 0 if she is not

Prior: P(θ=0) = P(θ=1) = 0.5

Page 27

Model and Likelihood
• Suppose the woman has two sons, neither of whom is affected.
• Let yi = 1 denote an affected son and yi = 0 an unaffected son.
• The statuses of the two sons are independent given θ (they are not identical twins).

Pr(y1 = 0, y2 = 0 | θ = 1) = (0.5)(0.5) = 0.25
Pr(y1 = 0, y2 = 0 | θ = 0) = (1)(1) = 1

Page 28

Posterior distribution
• Bayes Rule combines the information in the data with the prior probability.

y = (y1, y2) is the joint data.

Posterior probability of interest:

p(θ=1 | y) = p(y | θ=1) p(θ=1) / {p(y | θ=1) p(θ=1) + p(y | θ=0) p(θ=0)}
= (0.25)(0.5) / {(0.25)(0.5) + (1)(0.5)} = 0.2

Page 29

Conclusions

• It is clear that if the woman has unaffected children, it is less probable that she is a carrier.

• Bayes Rule provides a formal mechanism for this in terms of prior and posterior odds.

• Prior odds = 0.5/0.5 = 1
• Likelihood ratio = 0.25/1 = 0.25
• So posterior odds = (1)(0.25) = 0.25
• So P(θ=1 | y) = 0.25/(1 + 0.25) = 0.2

Page 30

Easy sequential analysis with Bayesian updating

• Suppose that the woman has a third son, also unaffected.

• We do not repeat the entire analysis.
• Use the previous posterior distribution as the new prior (a numerical check of this follows below).

P(θ=1 | y1, y2, y3)
= P(y3 | θ=1)(0.2) / {P(y3 | θ=1)(0.2) + P(y3 | θ=0)(0.8)}
= (0.5)(0.2) / {(0.5)(0.2) + (1)(0.8)}
= 0.111
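A tiny numerical check of this sequential updating, written in Python; the prior and likelihood values are exactly those given above, and the script itself is only an illustrative sketch:

import numpy as np

prior = np.array([0.5, 0.5])              # [P(theta=1), P(theta=0)]
like_unaffected = np.array([0.5, 1.0])    # P(son unaffected | carrier), P(son unaffected | not a carrier)

post = prior.copy()
for son in range(1, 4):                   # three unaffected sons, processed one at a time
    post = post * like_unaffected         # multiply the current prior by the likelihood of one more unaffected son
    post = post / post.sum()              # renormalize; this posterior is the prior for the next son
    print(f"after son {son}: P(carrier | data) = {post[0]:.3f}")
# prints 0.333, 0.200, 0.111 -- the last two matching the values obtained above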

Page 31

Probability as a measure of uncertainty

Questions that are legitimate to ask in Bayesian analysis:
• Pr(rain tomorrow)?
• Pr(victory of Bangladesh in a Twenty20 match)?
• Pr(heads if a coin is tossed)?
• Pr(average height of students lies within (4 ft, 5 ft))? — of interest after the data are acquired
• Pr(sample average of students lies within (4 ft, 5 ft))? — of interest before the data are acquired

Page 32

• Bayesian Analysis methods enable statements to be made about the partial knowledge available (based on data) concerning some situation (unobservable, or as yet unobserved) in a systematic way, using probability as the measure

• Guiding principle: State of knowledge about anything unknown is described by a probability distribution

Page 33

Usual Numerical Methods of Certainty

1. Symmetry/Exchangeability Argument
• Probability = # favourable cases / # possibilities
• (Coin tossing experiment)
• Involves assumptions on the physical conditions of the toss and the forces at work
• Dubious if we know the coin is either double-headed or double-tailed

Page 34

Usual Numerical Methods of Certainty

2. Frequency Argument

• Probability = relative frequency obtained in a very long sequence of experiments (assumed identically performed and physically independent of each other)

Page 35

Other arguments in consideration

• Physical randomness induces uncertainty (we speak of ‘likely’, ‘less likely’ events, etc.)
• Axiomatic approach: related to decision theory
• Coherence of bets: define probability through betting odds

A. Fundamental difficulties remain in defining odds.
B. The ultimate test is success in application!

Page 36

Summarizing inference using simulation

• Simulation forms a central part of Bayesian analysis, because of the relative ease with which samples can be drawn from even a complex probability distribution whose explicit form is unknown.

Page 37

• For example, to estimate the 95th percentile of the posterior distribution of θ | y, draw a random sample of size L (large) from p(θ | y) and use the 0.95L-th order statistic (see the sketch below).

• For most purposes, L = 1000 is adequate for such estimates.
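A minimal sketch in Python; the Beta distribution below is only a stand-in for whatever p(θ | y) arises in a given problem:

import numpy as np

rng = np.random.default_rng(0)
L = 1000
draws = rng.beta(5, 2, size=L)              # stand-in draws from p(theta | y)
q95 = np.sort(draws)[int(0.95 * L) - 1]     # the 0.95 L-th order statistic
print(q95, np.quantile(draws, 0.95))        # compare with a built-in quantile estimator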

Page 38

• Generating values from a probability distribution is often straightforward with modern computing techniques.

• This relies on (pseudo-)random number generators, which yield a deterministic sequence that appears to have the same properties as a sequence of independent random draws from the uniform distribution on [0, 1].

Page 39

Sampling using inverse cumulative distribution function

• Let F be the c.d.f. of a random variable.
• F⁻¹(U) = inf{x : F(x) ≥ U} follows the distribution defined by F(·), where U ~ Uniform(0, 1).
• If F is discrete, F⁻¹ can be tabulated.
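A short Python sketch of both cases; the exponential rate and the discrete table below are arbitrary choices for illustration:

import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(size=10_000)                 # U ~ Uniform(0, 1)

# Continuous case: Exponential(rate); F^{-1}(u) = -log(1 - u) / rate
rate = 2.0
x_cont = -np.log1p(-u) / rate

# Discrete case: tabulate F and take the smallest x with F(x) >= u
values = np.array([0, 1, 2, 3])
probs = np.array([0.1, 0.2, 0.3, 0.4])
cdf = np.cumsum(probs)
x_disc = values[np.searchsorted(cdf, u)]

print(x_cont.mean(), 1 / rate)               # should be close to the exponential mean
print(np.bincount(x_disc) / len(u))          # should be close to probs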

Page 40

Posterior samples as building blocks of posterior distribution

• One can arrange the L posterior draws in an array (rows are draws, columns are the components of θ) and use this array to represent the posterior distribution.

• One can use this array to find the posterior distribution of, say, θ1/θ2 or log(θ3), simply by adding columns computed from the existing ones – extremely straightforward (illustrated below)!
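A sketch of the idea in Python; the three gamma-distributed columns are placeholders standing in for posterior draws of (θ1, θ2, θ3):

import numpy as np

rng = np.random.default_rng(0)
L = 1000
# Rows are posterior draws, columns are the components theta1, theta2, theta3 (placeholders here)
theta = np.column_stack([rng.gamma(2.0, size=L),
                         rng.gamma(3.0, size=L),
                         rng.gamma(4.0, size=L)])

ratio = theta[:, 0] / theta[:, 1]            # added column: posterior draws of theta1/theta2
log_t3 = np.log(theta[:, 2])                 # added column: posterior draws of log(theta3)
print(np.quantile(ratio, [0.025, 0.5, 0.975]))
print(np.quantile(log_t3, [0.025, 0.5, 0.975]))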

Page 41

Single-parameter models

• Consider some fundamental and widely used one-dimensional models — the binomial, normal, Poisson, exponential, etc.

• We shall discuss important concepts and computational methods for Bayesian data analysis

Page 42

Estimating a probability from binomial data

• Sequence of Bernoulli trials; data y1, …, yn, each of which is either 0 or 1 (n fixed). Exchangeability implies that the likelihood depends only on the sum of the yi, denoted y.

• Provides a relatively simple but important example

• Parallels the very first published Bayesian analysis by Thomas Bayes in 1763

Page 43

Proportion of female births

• 200 years ago it was established that the proportion of female births in European populations was less than 0.5

• This century interest has focused on factors that may influence the gender ratio.

• The currently accepted value of the proportion of female births in very large European-race populations is 0.485.

Page 44

• Define the parameter θ to be the proportion of female births.

• An alternative way of reporting this parameter is as the ratio of male to female birth rates.

• For Bayesian inference in the binomial model, we must specify a prior distribution for θ.

• For simplicity, assume the prior to be Uniform(0, 1).

• Bayes rule then implies that

p(θ | y) ∝ θ^y (1−θ)^(n−y)

Page 45

• In single-parameter problems, this allows immediate graphical presentation of the posterior distribution.

Page 46

• Since p(θ | y) is a density and must integrate to 1, the normalizing constant can be worked out.
• The posterior distribution is recognizable as a beta distribution:

θ | y ~ Beta(y+1, n−y+1)

• In analyzing the binomial model, Pierre-Simon Laplace (1749–1827) also used the uniform prior distribution.
• His first serious application was to estimate the proportion of female births in a population.
• A total of 241,945 girls and 251,527 boys were born in Paris from 1745 to 1770.
• Laplace used a (normal) approximation and showed that

P(θ ≥ 0.5 | y = 241,945, n = 251,527 + 241,945) ≈ 1.15 × 10^(−42)

So he was ‘morally certain’ that θ < 0.5.
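A sketch reproducing this calculation in Python, using the exact Beta posterior rather than Laplace's normal approximation; the printed value should be on the order of 10^(−42):

from scipy import stats

girls, boys = 241_945, 251_527
n = girls + boys
# Under the Uniform(0, 1) prior: theta | y ~ Beta(y + 1, n - y + 1)
posterior = stats.beta(girls + 1, n - girls + 1)
print(posterior.sf(0.5))    # P(theta >= 0.5 | y); tiny, but within double-precision range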

Page 47

Lecture 2: Bayesian Inference and Data Analysis

Dept. of Statistics, Rajshahi University, Rajshahi

Anandamayee Majumdar
Visiting Scientist, University of North Texas School of Public Health, USA; Professor, University of Suzhou, China.

Page 48

Overview
• Prediction in the binomial example
• Justification of the uniform prior in the binomial case
• Prior distributions – more discussion and an example
• Hyperparameters, hyperpriors
• Hierarchical models
• Posterior distribution as a compromise between prior distribution and data
• Graphical and numerical summaries
• Posterior probability intervals (or credible intervals)
• Normal example with unknown mean and known variance
• Central Limit Theorem in the Bayesian context
• Large-sample properties and results

Page 49

Prediction in the Binomial Example

• In the binomial example with the uniform prior distribution, the prior predictive distribution (the marginal of y) can be evaluated explicitly.

• Marginal distribution of y:
p(y = i) = 1/(n+1) for i = 0, 1, …, n

• All values of y are equally likely, a priori.
• For posterior prediction, we might be more interested in the outcome of one new trial, rather than another set of n new trials.

Page 50

Prediction in the Binomial example

• Letting ỹ denote the result of a new trial, exchangeable with the first n, we obtain

Pr(ỹ = 1 | y) = (y + 1)/(n + 2)

• This result, based on the uniform prior distribution, is known as ‘Laplace’s law of succession.’

Page 51

Justification of Uniform prior in Binomial problem

• Bayes: the resulting marginal p(y) is uniform over {0, 1, …, n}. This justification is appealing in the sense that it refers only to the observables y and n.

• Laplace: insufficient information about θ is represented by a flat distribution. This argument is often followed in Bayesian model building.

Page 52

Interpretation: prior distributions

1. In the population interpretation, the prior distribution represents a population of possible parameter values, from which the θ of current interest has been drawn.

• Probability of failure in a new industrial process: there is no perfectly relevant population

2. In the more subjective state of knowledge interpretation, the guiding principle is that we must express our knowledge (and uncertainty) about θ as if its value could be thought of as a random realization from the prior distribution.

Page 53

Prior: Constraints and Flexibility

• Prior distribution should include all plausible values of θ

• The prior need not be realistically concentrated around the true value…

• …because often the information about θ contained in the data will far outweigh any reasonable prior specification.

Page 54

Posterior distribution as compromise between data and prior information

• The process of Bayesian inference involves passing from a prior distribution, p(θ), to a posterior distribution, p(θ | y), so it is natural to expect that some general relations might hold between these two distributions.

1. E(θ) = E(E(θ | y)):

the prior mean of θ is the average of all possible posterior means over the distribution of possible data

2. Var(θ) = E(Var(θ | y)) + Var(E(θ | y)):

the posterior variance is on average smaller than the prior variance, by an amount that depends on the variation in posterior means over the distribution of possible data.

Page 55

Posterior distribution → compromise between the prior and the data

• In the binomial example with the uniform prior distribution, the prior mean is 1/2.
• The posterior mean, (y+1)/(n+2), is a compromise between the prior mean 1/2 and the sample proportion y/n.
• Clearly the prior mean plays a smaller and smaller role as the size of the data sample increases.
• This is a very general feature of Bayesian inference.
• The posterior distribution is centered at a point that represents a compromise between the prior information and the data.
• The compromise is controlled to a greater extent by the data as the sample size increases.

Page 56

Displaying & summarizing posterior inference

• Graphical displays are useful, e.g. histograms and boxplots, and contour plots and scatterplots in multiparameter problems.
• Numerical summaries are also desirable.
• Summaries of location are the mean, median, and mode(s).
• Variation is commonly summarized by the standard deviation, the IQR, and other quantiles.

Page 57

Posterior Interval Summaries in Bayesian Inference

1. A 100(1−α)% central posterior interval: the range of values above and below which lies exactly 100(α/2)% of the posterior probability.

• For simple models (Binomial, Normal, Poisson, etc), posterior intervals can be computed directly from c.d.f. (use standard computer functions)

Page 58

Posterior Interval Summaries in Bayesian Inference

2. Highest posterior density (HPD) interval: the region of values that contains 100(1−α)% of the posterior probability and also has the property that the density within the region is never lower than the density outside it.
• The HPD region is identical to the central posterior interval if the posterior distribution is unimodal and symmetric.
• In general, we prefer the central posterior interval to the HPD region because the former has a direct interpretation as the posterior α/2 and 1−α/2 quantiles, is invariant to one-to-one transformations of the estimand, and is usually easier to compute.

Page 59

A special case: comparison of central probability interval and a HPD interval

Page 60

Prior – categorization (Andrew Gelman)

(1) Prior distributions giving numerical information that is crucial to estimation of the model. This would be a traditional informative prior, which might come from a literature review or explicitly from an earlier data analysis.

(2) Prior distributions that are not supplying any controversial information but are strong enough to pull the data away from inappropriate inferences that are consistent with the likelihood. This might be called a weakly informative prior.

(3) Prior distributions that are uniform, or nearly so, and basically allow the information from the likelihood to be interpreted probabilistically. These are noninformative priors, or maybe, in some cases, weakly informative.

Page 61

Noninformative priors

• "Non-informative prior distribution: A prior distribution which is non-commital about a parameter, for example, a uniform distribution." -Everitt (1998)

Page 62

Improper Priors

• A ‘prior distribution’ that integrates to infinity over the parameter space.
• E.g., assume a constant prior for the normal mean.
• Many authors (Lindley, 1973; De Groot, 1937; Kass and Wasserman, 1996) warn against the danger of over-interpreting such priors, since they are not probability densities.
• As long as it yields a proper posterior distribution, Bayesian methodology can be carried out.

• Improper priors have been proved to be limits of data adaptive proper priors –Akaike (JRSSB, 1980)

Page 63

Jeffreys prior
• Named after Harold Jeffreys; a non-informative (objective) prior distribution on the parameter space that is proportional to the square root of the determinant of the Fisher information:

p(θ) ∝ √det(I(θ))

• It has the key feature that it is invariant under reparameterization of the parameter vector: if φ = f(θ) then p(φ) ∝ √det(I(φ)).

• Sometimes the Jeffreys prior cannot be normalized, and thus one must use an improper prior. For example, the Jeffreys prior for the mean of a Gaussian distribution with known variance is uniform over the entire real line.

Page 64

Informative priors
Conjugacy: Binomial example

• The likelihood has the parametric form

p(y | θ) ∝ θ^a (1−θ)^b   (binomial family)

• Thus, if the prior has the same form, so will the posterior:

p(θ) ∝ θ^(α−1) (1−θ)^(β−1)   (beta family)

• If the prior and the posterior distribution follow the same parametric family/functional form, we have conjugacy.
• The beta and binomial families are said to be conjugate families.
• Other examples: Poisson and gamma, normal (mean) and normal, normal (variance) and inverse gamma, etc.

Page 65

Informative (Beta) prior (continued)

• p(θ | y) ∝ θ^y (1−θ)^(n−y) θ^(α−1) (1−θ)^(β−1) = θ^(y+α−1) (1−θ)^(n−y+β−1),

so θ | y ~ Beta(y+α, n−y+β)

• E(θ | y) = (y+α)/(n+α+β) → y/n as n → ∞

• Var(θ | y) = (y+α)(n−y+β) / {(n+α+β)²(n+α+β+1)} = O(1/n) as n → ∞

• As n becomes large, the effect of the prior diminishes and the posterior distribution also concentrates (its variance shrinks).

Page 66

Basic justification of conjugate priors

• Similar to the justification for using standard models (such as the binomial and normal) for the likelihood:

1. The results are easy to understand and can often be put in analytic form.
2. They are often a good approximation.
3. They simplify computations.

• They are also useful as building blocks for more complicated models, including models in many dimensions, where conjugacy is typically impossible.

• For these reasons, conjugate models can be good starting points; for example, mixtures of conjugate families can sometimes be useful when simple conjugate distributions are not reasonable.

Page 67

Hyperparameters

• Parameters of prior distributions are called hyperparameters, to distinguish them from the parameters of the model of the underlying data.

• E.g., we use a Beta(α, β) distribution to model the parameter p of a Binomial(n, p) distribution:
• p is a parameter of the underlying system;
• α and β are parameters of the prior (beta) distribution, hence hyperparameters.

Page 68

Hyperpriors

• A hyperprior is a prior distribution on a hyperparameter

• They arise particularly in the use of conjugate priors.

Page 69

Purpose of Hyperpriors

1. To express uncertainty about the hyperparameter. Assuming fixed hyperparameters is rigid; making them random allows the data to choose the hyperparameters and lets the ‘data speak’.

2.  By using a hyperprior, the prior distribution itself becomes a mixture distribution; a weighted average of the various prior distributions (over different hyperparameters), with the hyperprior being the weighting.

This adds additional possible distributions (beyond the parametric family one is using), because parametric families of distributions are generally not convex sets – as a mixture density is a convex combination of distributions, it will in general lie outside the family.

For instance, the mixture of two normal distributions is not a normal distribution: if one takes two different means (sufficiently distant) and mixes 50% of each, one obtains a bimodal distribution, which is thus not normal. In fact, the convex hull of the normal distributions is dense in the set of all distributions, so in some cases a given prior can be approximated arbitrarily closely by using a family with a suitable hyperprior.

Page 70

Purpose of Hyperpriors

3. Dynamical system

A hyperprior is a distribution on the space of possible hyperparameters. If one is using conjugate priors, then this space is preserved when moving to the posterior; thus, as data arrive, the distribution changes but remains on this space, evolving as a dynamical system (each point of hyperparameter space moving to the updated hyperparameters) and, over time, converging, just as the prior itself converges.

4. Ideal for hierarchical/multilevel models, where hierarchy arises as a natural phenomenon, and for sharing information across many sources of data.

Page 71

Hierarchical/Multilevel model
• A generalization of linear and generalized linear modeling in which the regression coefficients are themselves given a model, whose parameters are also estimated from data.

Page 72

Results using noninformative priors

• Many simple Bayesian analyses based on noninformative prior distributions give results similar to standard non-Bayesian approaches (for example, the posterior t interval for the normal mean with unknown variance).

• The extent to which a noninformative prior distribution can be justified as an objective assumption depends on the amount of information available in the data; as the sample size n increases, the influence of the prior distribution on posterior inferences decreases.

Page 73

Informative Nonconjugate prior distributions

• For more complex problems, conjugacy may not be possible.

• Although they can make the interpretation of posterior inferences less transparent and computation more difficult, nonconjugate prior distributions do not pose any new conceptual problems.

Page 74

Example: estimating the probability of a female birth given placenta previa

• Placenta previa is an abnormal condition of pregnancy in which the placenta is implanted low in the uterus, obstructing a normal delivery.
• An early study concerning the gender of placenta previa births in Germany found that, of a total of 980 births, 437 were female.

• How much evidence does this provide for the claim that the proportion of female births in the population of placenta previa births is less than 0.485, the proportion of female births in the general population?

Page 75

Posterior summary using Uniform prior

• The posterior distribution is Beta(438, 544).
• The posterior mean of θ is 0.446.
• The posterior standard deviation is 0.016.
• The posterior median is 0.446.
• The central 95% posterior interval is [0.415, 0.477].

• This 95% posterior interval matches, to three decimal places, the interval that would be obtained by using a normal approximation with the calculated posterior mean and standard deviation.

Page 76

Check same summary using simulations

• Simulate 1000 i.i.d. draws from the Beta(438, 544) posterior distribution.
• The 2.5th and 97.5th percentiles give a central 95% posterior interval of [0.415, 0.476].
• The median of the 1000 draws from the posterior distribution is 0.446.
• The sample mean and standard deviation of the 1000 draws are 0.445 and 0.016.
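A minimal Python sketch of this simulation check; the printed numbers will differ slightly from those above because of Monte Carlo error and the random seed:

import numpy as np

rng = np.random.default_rng(0)
draws = rng.beta(438, 544, size=1000)       # 1000 draws from the Beta(438, 544) posterior

print(np.quantile(draws, [0.025, 0.975]))   # central 95% posterior interval
print(np.median(draws))                     # posterior median
print(draws.mean(), draws.std())            # sample mean and standard deviation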

Page 77

Draws from the posterior distribution of (a) the probability of female birth, θ; (b) the logit transform, logit(θ); (c) the male-to-female gender ratio, (1−θ)/θ.

Page 78

Sensitivity to prior specification

α/(α+β)   α+β   E(θ|y)   95% posterior interval for θ
 0.500      2    0.446    [0.415, 0.477]
 0.485      2    0.446    [0.415, 0.477]
 0.485      5    0.446    [0.415, 0.477]
 0.485     10    0.446    [0.415, 0.477]
 0.485     20    0.447    [0.416, 0.478]
 0.485    100    0.450    [0.420, 0.479]
 0.485    200    0.453    [0.424, 0.481]

*Interpret α/(α+β) as the prior center and α+β as the prior number of observations (if α+β is large, the prior is concentrated).

Page 79

Discussion

• The first row corresponds to the uniform prior.
• The lower the row, the more concentrated the prior distribution is around 0.485.
• Only when α+β ≥ 100 (likened to the prior number of observations) does the posterior interval begin to change.
• Even then, the intervals exclude the prior mean.

Page 80

Alternative: instead of the conjugate prior, use a ‘flat’ nonconjugate prior

(weakly informative prior)

(a) Prior density for θ in nonconjugate analysis of birth ratio example; (b) histogram of 1000 draws from a discrete approximation to the posterior density. *Figures are plotted on different scales.

Page 81

Details of the nonconjugate flat prior (piecewise linear)

• The prior is centered around 0.485 but is flat far away from this value, to admit the possibility that the truth is far away.
• 40% of the prior probability mass lies outside the interval [0.385, 0.585].
• This prior distribution has mean 0.493 and standard deviation 0.21, similar to the standard deviation of a Beta distribution with α + β = 5.

Page 82

Evaluating the posterior distribution
• The unnormalized posterior distribution is obtained on a grid of θ values, (0.000, 0.001, …, 1.000), by multiplying the prior density and the binomial likelihood at each point.
• Samples from the posterior distribution can be obtained by normalizing the distribution on the discrete grid of θ values and sampling from it.
• Figure (b) is a histogram of 1000 draws from the discrete posterior distribution.
• The posterior median is 0.448; the central 95% posterior interval is [0.419, 0.480].
• Because the prior distribution is overwhelmed by the data, these results match those in the table based on beta distributions.
• In the grid approach, we avoid grids that are too coarse and would distort a significant portion of the posterior mass.
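A sketch of the grid approach in Python. The exact piecewise-linear prior used in the slides is not fully specified, so the prior below (peaked near 0.485, flat far from it) is only an assumed illustration:

import numpy as np

rng = np.random.default_rng(0)
y, n = 437, 980
theta = np.linspace(0.0, 1.0, 1001)                      # grid 0.000, 0.001, ..., 1.000

# Assumed piecewise-linear prior, centered near 0.485 and flat in the tails
prior = np.interp(theta, [0.0, 0.385, 0.485, 0.585, 1.0],
                         [1.0, 1.0, 4.0, 1.0, 1.0])

with np.errstate(divide="ignore"):
    log_like = y * np.log(theta) + (n - y) * np.log(1.0 - theta)
post = prior * np.exp(log_like - log_like.max())          # unnormalized posterior on the grid
post /= post.sum()                                        # normalize over the grid

draws = rng.choice(theta, size=1000, p=post)              # sample from the discrete approximation
print(np.median(draws), np.quantile(draws, [0.025, 0.975]))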

Page 83

Estimating the mean of a normal distribution with known variance

• The normal distribution is fundamental to most statistical modeling.

• CLT helps to justify using the normal likelihood in many statistical problems, as an approximation to a less analytically convenient actual likelihood.

• Also, even when the normal distribution does not itself provide a good model fit, it can be useful as a component of a more complicated model involving Student-t or finite mixture distributions.

• For now, we simply work through the Bayesian results assuming that the normal model is true.

Page 84

Normal model with multiple observations, variance known

• A sample of independent and identically distributed observations y = (y1, …, yn) is available, with yi | θ ~ N(θ, σ²) and σ² known; the prior is θ ~ N(μ0, τ0²).

• The posterior density is p(θ | y) ∝ p(θ) ∏i p(yi | θ).

Page 85

Posterior distribution also Normal:

θ | y ~ N(μn, τn²), where 1/τn² = 1/τ0² + n/σ² and μn = (μ0/τ0² + n ȳ/σ²) / (1/τ0² + n/σ²)

Page 86

Remarks about posterior results
• The posterior variance τn² converges to σ²/n as n → ∞ or as the prior variance τ0² → ∞.

• The posterior mean μn is a weighted average of the prior mean and the sample mean.

• Incidentally, the same result is obtained by adding the information in the data points y1, y2, …, yn one point at a time, using the posterior distribution at each step as the prior distribution for the next (see the sketch below).
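A small Python sketch of these formulas, with made-up prior parameters and data, checking that the one-observation-at-a-time update reproduces the all-at-once answer:

import numpy as np

def normal_update(mu0, tau0_sq, ybar, n, sigma_sq):
    # Posterior N(mu_n, tau_n^2) for a normal mean with known variance sigma^2
    precision = 1.0 / tau0_sq + n / sigma_sq
    mu_n = (mu0 / tau0_sq + n * ybar / sigma_sq) / precision
    return mu_n, 1.0 / precision

rng = np.random.default_rng(0)
sigma_sq = 4.0                          # known data variance (assumed)
mu0, tau0_sq = 0.0, 10.0                # illustrative prior
y = rng.normal(1.5, np.sqrt(sigma_sq), size=20)

mu_all, var_all = normal_update(mu0, tau0_sq, y.mean(), len(y), sigma_sq)

mu_seq, var_seq = mu0, tau0_sq          # add the data one point at a time
for yi in y:
    mu_seq, var_seq = normal_update(mu_seq, var_seq, yi, 1, sigma_sq)

print(mu_all, var_all)
print(mu_seq, var_seq)                  # agrees with the batch result up to rounding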

Page 87

CLT in Bayesian context

( (θ − E(θ | y)) / √Var(θ | y) ) | y → N(0, 1) as n → ∞

• This is often used to justify approximating the posterior distribution by a normal distribution.

• For the binomial parameter θ, the normal distribution is a more accurate approximation in practice if we transform θ to the logit scale…

• …that is, perform inference for log(θ/(1−θ)) instead of θ: the parameter space expands from [0, 1] to (−∞, ∞), which is more fitting for a normal approximation.

Page 88

Large sample results

• The large-sample results are not actually necessary for performing Bayesian data analysis… but are often useful as approximations and as tools for understanding.

Page 89

Normal approximations to the posterior distribution

• A Taylor series expansion of log p(θ | y) centered at the posterior mode θ̂ (where the mode can be a vector and is assumed to lie in the interior of the parameter space) gives

log p(θ | y) = log p(θ̂ | y) + (1/2)(θ − θ̂)ᵀ [d² log p(θ | y)/dθ²]_(θ = θ̂) (θ − θ̂) + …

Page 90

Posterior distribution converges to…

• the normal approximation p(θ | y) ≈ N(θ̂, [I(θ̂)]⁻¹), where I(θ̂) = −[d² log p(θ | y)/dθ²]_(θ = θ̂) is the curvature (observed information) at the mode.

Page 91

Remark:

• For a finite sample size n, the normal approximation is typically more accurate for conditional and marginal distributions of components of θ than for the full joint distribution.

Page 92

Posterior Consistency
• If the true data distribution is included in the parametric family — that is, if f(y) = p(y | θ0) for some θ0 — then, in addition to asymptotic normality, the property of consistency holds: the posterior distribution converges to a point mass at the true parameter value, θ0, as n → ∞.

• When the true distribution is not included in the parametric family, there is no longer a true value θ0, but its role in the theoretical result is replaced by the value θ0 that makes the model distribution, p(y | θ), closest to the true distribution, in a technical sense involving Kullback–Leibler information.

Page 93

Large sample correspondence between Bayesian and Frequentist methods

• As n → ∞, a 95% central posterior interval for θ will cover the true value 95% of the time under repeated sampling with any fixed true θ.

Page 94

When asymptotic results fail

• These are situations in which the prior distribution has an impact on the posterior inference even in the limit of infinite sample size.

• Usually this happens when the likelihood is flat, for example when the model is unidentifiable (there exist two distinct parameter values yielding the same likelihood):

e.g. f(x) = p g(x) + (1−p) h(x), where 0 < p < 1 and (p, g, h) are unknown.

• Other cases: the number of parameters increases with the data; the prior distribution excludes the point of convergence or yields an improper posterior distribution.

Page 95

Lecture 3: Bayesian Inference and Data Analysis

Dept. of Statistics, Rajshahi University, Rajshahi

Anandamayee Majumdar
Visiting Scientist, University of North Texas School of Public Health, USA; Professor, University of Suzhou, China.

Page 96

Overview

• Model checking and improvement
• Test quantities, p-values
• Starting the computation in Bayesian inference
• Simulation of potentially observable quantities
• Posterior simulation methods: the Gibbs sampler, rejection sampling, the Metropolis–Hastings algorithm
• Bivariate unit normal example with a bivariate jumping kernel
• Recommended strategies for simulation
• Advanced techniques for Monte Carlo simulation

Page 97

Model checking and improvement

• Checking the model is crucial to statistical analysis.
• Bayesian inferences assume the whole structure of a probability model and can yield misleading inferences when the model is poor.
• A good Bayesian analysis, therefore, should include some check of the adequacy of the fit of the model to the data and of the plausibility of the model for the purposes for which it will be used.
• This is sometimes discussed as a problem of sensitivity to the prior distribution,
• but in practice the likelihood model is typically just as suspect;
• throughout, we use ‘model’ to encompass:
1. the sampling distribution,
2. the prior distribution,
3. the hierarchical structure, and
4. issues such as which explanatory variables have been included in a regression.

Page 98

Judging model flaws by their practical implications

• Whether the model is TRUE or FALSE is not the question.
• The relevant question is: ‘Do the model’s deficiencies have a noticeable effect on the substantive inferences?’

Page 99

• Do the inferences from the model make sense?

If not, this suggests the potential for creating a more accurate probability model for the parameters and the data collection process.

• Is the model consistent with the data? → Posterior predictive checking

If the model fits, then replicated data generated under the model should look similar to observed data.

This is really a self-consistency check: an observed discrepancy can be due to model misfit or chance.

Page 100

Basic technique for checking fit

• Draw simulated values from the posterior predictive distribution of replicated data and compare these samples to the observed data.

• Any systematic differences between the simulations and the data indicate potential failings of the model

Page 101

Example: Newcomb’s speed of light measurements

• 66 measurements of the speed of light,
• modeled as N(μ, σ²), with a noninformative uniform prior distribution on (μ, log σ).
• However, the lowest of Newcomb’s measurements look like outliers.
• Question: could the extreme measurements reasonably have come from a normal distribution?

Page 102

Simulating replicated data using posterior sample

• y: the observed data
• θ: the vector of parameters
• yrep: replicated data that could have been observed (if x is the explanatory variable vector for y, then it is also the explanatory variable vector for yrep)

Page 103

The smallest observation of Newcomb’s speed of light data (the vertical line at the left of the graph), compared to the smallest observations from each of 20 posterior predictive simulated datasets.

Page 104

Test quantity, or discrepancy measure

• T(y, θ) is a scalar summary of parameters and data that is used as a standard when comparing data to predictive simulations.

• Test quantities play the role in Bayesian model checking that test statistics play in classical testing.

• Test quantity depends on both data and parameter.

Page 105

P-values or tail area probabilities

• Classical p-value: the probability that the test statistic computed from replicated data is at least as extreme as the observed value, under the sampling distribution with θ fixed:

p_C = Pr(T(yrep) ≥ T(y) | θ)

• Bayesian p-value: the corresponding tail probability for a test quantity T(y, θ), where the probability is taken over the joint posterior distribution, p(θ, yrep | y):

p_B = Pr(T(yrep, θ) ≥ T(y, θ) | y)

Page 106

Example

• Consider a sequence of binary outcomes, y1, …, yn,
• modeled as n i.i.d. Bernoulli(θ) trials,
• with a uniform prior distribution on θ.
• Suppose the observed data are, in order: 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0.
• The observed autocorrelation is evidence that the independence model may be flawed.

Page 107

• T = number of switches between 0’s and 1’s; for the observed data, T(y) = 3.
• To simulate yrep under the model, we first draw θ from its Beta(8, 14) posterior distribution,
• then draw yrep as 20 independent Bernoulli(θ) outcomes; repeating this gives 10,000 replicated datasets.

• The resulting p-value, Pr(T(yrep) ≤ 3 | y), is 0.028 (see the sketch below).
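A Python sketch of this posterior predictive check; the printed p-value should be close to 0.028, up to Monte Carlo error:

import numpy as np

rng = np.random.default_rng(0)
y = np.array([1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0])

def n_switches(seq):
    return int(np.sum(seq[1:] != seq[:-1]))

T_obs = n_switches(y)                       # 3 switches in the observed data
n_sims, extreme = 10_000, 0
for _ in range(n_sims):
    theta = rng.beta(8, 14)                 # draw theta from its Beta(8, 14) posterior
    y_rep = rng.binomial(1, theta, size=y.size)
    if n_switches(y_rep) <= T_obs:          # replication has as few switches as the data
        extreme += 1
print(extreme / n_sims)                     # Bayesian p-value, Pr(T(y_rep) <= 3 | y)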

Page 108

Other posterior predictive checks for model fit/model comparison

• Use part of the data to build the model.
• Use the rest of the data to make predictions (usually when each value of y is associated with covariate and/or coordinate information).
• Compare the prediction coverage of the competing models.

Page 109

Computation in Bayesian Inference

• The distribution to be simulated is called the target distribution; denote it p(θ | y).

• Assume the target density p(θ | y) can be easily computed for any value of θ, up to a proportionality constant involving only y.

• Starting point: crude estimation of the parameters, which is often reliable and easy to compute.

Page 110

Use posterior simulations to make inferences about

1. Predictive quantities: ŷ^k ~ p(ŷ | θ^k), or ŷ^k ~ p(ŷ | X, θ^k) in the regression model (k indexes the posterior draws θ^k)

2. Or replications: yrep,k ~ p(y | θ^k), or yrep,k ~ p(y | X, θ^k) in the regression model

Page 111

How many simulation draws are needed?

• In general, few simulations are needed to estimate posterior medians, probabilities near 0.5, and low-dimensional summaries

• More simulations needed for extreme quantiles, posterior means, probabilities of rare events, and higher-dimensional summaries.

• Typically 100-2000 simulation draws are enough.

Page 112

Direct simulation

• In simple nonhierarchical Bayesian models, it is often easy to draw from the posterior distribution directly, especially if conjugate prior distributions have been assumed.

• In complex problems, we sometimes simulate the hyperparameters from their marginal posterior distribution first, and then simulate the intermediate-level parameters conditionally on them.

Page 113

Rejection sampling

• Suppose we want to obtain a single random draw from a density p(θ|y). We require a positive function g(θ), defined for all θ with p(θ|y) > 0, that has the following properties:

1. We are able to draw random samples from the probability density proportional to g. It is not required that g(θ) integrate to 1, but g(θ) must have a finite integral.

2. There must be some known constant M for which p(θ|y)/g(θ) ≤ M for all θ.

• The rejection sampling algorithm proceeds in two steps:

1. Sample θ at random from the probability density proportional to g(θ).

2. With probability p(θ|y)/(M g(θ)), accept θ as a draw from p(θ|y). If the drawn θ is rejected, repeat step 1.

(A minimal sketch is given below.)
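To make the two steps concrete, here is a minimal sketch in Python (NumPy). It reuses the Beta(8, 14) posterior from the earlier binary example as the target, known only up to a constant, with a Uniform(0, 1) proposal playing the role of g; the function names and constants are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def q(theta):
    """Unnormalized target: the Beta(8, 14) posterior density up to a constant."""
    return theta**7 * (1.0 - theta)**13

# Proposal g(theta) = 1 on (0, 1); choose M >= max q(theta)/g(theta),
# attained at the mode theta = 7/20 = 0.35.
M = q(0.35)

def rejection_sample(n_draws):
    draws = []
    while len(draws) < n_draws:
        theta = rng.uniform(0.0, 1.0)        # step 1: sample from density proportional to g
        if rng.uniform() < q(theta) / M:     # step 2: accept with probability q/(M g)
            draws.append(theta)
    return np.array(draws)

samples = rejection_sample(5000)
print(samples.mean())   # should be close to the Beta(8, 14) mean 8/22 ≈ 0.364
```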

Page 114

Figure: Illustration of rejection sampling. The top curve is an approximation function, Mg(θ), and the bottom curve is the target density, p(θ|y). As required, Mg(θ) ≥ p(θ|y) for all θ. The vertical line indicates a single random draw θ from the density proportional to g. The probability that a sampled draw θ is accepted is the ratio of the height of the lower curve to the height of the upper curve at that value of θ.

Page 115

Posterior simulation

• The most commonly used Markov chain simulation methods:

1. Gibbs sampler

2. Metropolis-Hastings algorithm

Page 116

Markov Chain Monte Carlo (MCMC) simulation

• Definition: A Markov chain is a sequence of random variables θ1, θ2,…, for which, for any t, the distribution of θt given all previous θ’s depends only on the most recent value, θt−1.

• Key: Create a Markov process whose stationary distribution is the specified p(θ|y), and run the simulation long enough that the distribution of the current draws is close enough to this stationary distribution.

• For any specific p(θ|y), a variety of Markov chains with the desired property can be constructed

Page 117

The Gibbs Sampler

• Suppose the parameter vector θ has been divided into d components or subvectors, θ = (θ1, …, θd).

• In iteration t, for j = 1, …, d, we simulate

θj^t ~ p(θj | θ−j^(t−1), y),

where θ−j^(t−1) represents all the components of θ except θj at their current values (those already updated earlier in iteration t, and the rest at their iteration t−1 values).

Page 118

Bivariate Normal Example

• Suppose a single observation (y1, y2) ~ Bivariate Normal((θ1, θ2), Σ), with known covariance matrix Σ = [[1, ρ], [ρ, 1]] and a uniform prior on (θ1, θ2).

• The Gibbs sampler then alternates between the two conditional posteriors:

θ1 | θ2, y ~ N(y1 + ρ(θ2 − y2), 1 − ρ²)

θ2 | θ1, y ~ N(y2 + ρ(θ1 − y1), 1 − ρ²).

(A minimal sketch follows below.)
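Here is a minimal sketch of this two-component Gibbs sampler in Python (NumPy), with ρ = 0.8 and observed data y = (0, 0); the names, starting point, and iteration count are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

rho = 0.8
y1, y2 = 0.0, 0.0
n_iter = 1000

theta1, theta2 = 2.5, -2.5          # deliberately overdispersed starting point
sd = np.sqrt(1.0 - rho**2)          # conditional standard deviation
draws = np.empty((n_iter, 2))

for t in range(n_iter):
    # update theta1 given theta2
    theta1 = rng.normal(y1 + rho * (theta2 - y2), sd)
    # update theta2 given the new theta1
    theta2 = rng.normal(y2 + rho * (theta1 - y1), sd)
    draws[t] = theta1, theta2

# discard the first half as burn-in before summarizing
kept = draws[n_iter // 2:]
print(kept.mean(axis=0), np.corrcoef(kept.T)[0, 1])   # means near (0, 0), correlation near 0.8
```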

Page 119

Figure: Four independent sequences of the Gibbs sampler for a bivariate normal distribution with fixed correlation ρ = 0.8, with overdispersed starting points indicated by solid squares.

(a) First 10 iterations, showing the component-by-component updating of the Gibbs iterations.

(b) After 500 iterations, the sequences have reached approximate convergence.

(c) Iterates from the second halves of the sequences.

Page 120

The Metropolis algorithm

• The Metropolis algorithm is an adaptation of a random walk that uses an acceptance/rejection rule to converge to the specified target distribution.

1. Draw a starting point θ0, for which p(θ0|y) > 0, from a starting distribution p0(θ), or simply choose starting values dispersed around a crude approximate estimate.

2. For t = 1, 2, …

(a) Sample a proposal θ* from a jumping distribution at time t, Jt(θ*|θt−1). For the Metropolis algorithm, the jumping distribution must be symmetric: Jt(θa|θb) = Jt(θb|θa) for all θa, θb, t.

(b) Calculate the ratio of the densities,

r = p(θ*|y) / p(θt−1|y).

Page 121

The Metropolis algorithm (continued)

(c) Set

θt = θ* with probability min(r, 1),

θt = θt−1 otherwise.

Page 122

The Metropolis-Hastings algorithm

• Same as the Metropolis algorithm, except that the jumping rule does not have to be symmetric: Jt(θa|θb) ≠ Jt(θb|θa) is allowed for some θa, θb, t.

• To correct for the asymmetry in the jumping rule, the ratio r becomes

r = [p(θ*|y) / Jt(θ*|θt−1)] / [p(θt−1|y) / Jt(θt−1|θ*)].

• Allowing asymmetric jumping rules can be useful in increasing the speed of the random walk.

Page 123

Properties of a good jumping rule

• For any θ, it is easy to sample from J(θ*|θ).

• It is easy to compute the ratio r.

• Each jump goes a reasonable distance in the parameter space (otherwise the random walk moves too slowly).

• The jumps are not rejected too frequently (otherwise the random walk wastes too much time standing still).

Page 124

Difficulties of inference from iterative simulation

• If iterations have not proceeded long enough, the simulations may be grossly unrepresentative of the target distribution

• Even when the simulations have reached approximate convergence, the early iterations still are influenced by the starting approximation rather than the target distribution

• Within-sequence correlation; aside from any convergence issues, simulation inference from correlated draws is generally less precise than from the same number of independent draws

Page 125

Solutions

• Design the simulation runs to allow effective monitoring of convergence, in particular by simulating multiple sequences with starting points dispersed throughout parameter space.

• Monitor the convergence of all quantities of interest by comparing variation between and within simulated sequences until the 'within' variation roughly equals the 'between' variation.

• If the simulation efficiency is unacceptably low (in the sense of requiring too much real time on the computer to obtain approximate convergence of posterior inferences for quantities of interest), alter the algorithm.

• To diminish the effect of the starting distribution, discard the first half of each sequence and focus attention on the second half (the burn-in fraction depends on the problem at hand).

• Once approximate convergence has been reached, thin the sequences by keeping every kth simulation draw from each sequence and discarding the rest.

Page 126

Monitoring convergence of each scalar estimand

• Suppose we have simulated m parallel sequences, each of length n (after discarding the first half of the simulations).

• For each scalar estimand ψ, we label the simulation draws as ψij (i = 1, …, n; j = 1, …, m), and we compute B and W, the between- and within-sequence variances:

B = (n/(m−1)) Σj (ψ̄.j − ψ̄..)², where ψ̄.j = (1/n) Σi ψij and ψ̄.. = (1/m) Σj ψ̄.j,

W = (1/m) Σj sj², where sj² = (1/(n−1)) Σi (ψij − ψ̄.j)².

Page 127

Step 1

• Estimate var(ψ|y), the marginal posterior variance of the estimand, by a weighted average of W and B, namely

var⁺(ψ|y) = ((n−1)/n) W + (1/n) B.

• This quantity overestimates the marginal posterior variance assuming the starting distribution is appropriately overdispersed, but is unbiased under stationarity (that is, when the starting distribution equals the target distribution).

Page 128

Step 2

• The 'within' variance W should be an underestimate of var(ψ|y) because the individual sequences have not had time to range over all of the target distribution.

• In the limit as n → ∞, E(W) → var(ψ|y).

• The potential scale reduction is estimated by

R̂ = sqrt( var⁺(ψ|y) / W ),

which declines to 1 as n → ∞.

Page 129

Monitoring convergence for the entire distribution

• Once R̂ is near 1 for all scalar estimands of interest, just collect the mn simulations from the second halves of all the sequences together and treat them as a sample from the target distribution. (A small computational sketch of B, W, and R̂ follows below.)
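The following minimal Python (NumPy) sketch computes B, W, var⁺, and R̂ from an array of simulated sequences; the function name and the test data are ours.

```python
import numpy as np

def potential_scale_reduction(psi):
    """psi: array of shape (m, n) -- m parallel sequences of length n
    (after discarding the first halves of the simulations)."""
    m, n = psi.shape
    seq_means = psi.mean(axis=1)                                # psi-bar_.j for each sequence
    grand_mean = seq_means.mean()                               # psi-bar_..
    B = n / (m - 1) * np.sum((seq_means - grand_mean) ** 2)     # between-sequence variance
    W = psi.var(axis=1, ddof=1).mean()                          # within-sequence variance
    var_plus = (n - 1) / n * W + B / n                          # estimate of var(psi | y)
    return np.sqrt(var_plus / W)                                # R-hat

# quick check: independent draws from the same distribution give R-hat near 1
rng = np.random.default_rng(0)
print(potential_scale_reduction(rng.normal(size=(4, 500))))
```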

Page 130

A small experimental dataset

• Coagulation time in seconds for blood drawn from 24 animals randomly allocated to four different diets.

• Different treatments have different numbers of observations because the randomization was unrestricted.

Diet Measurements

A 62, 60, 63, 59

B 63, 67, 71, 64, 65, 66

C 68, 66, 71, 67, 68, 68

D 56, 62, 60, 61, 63, 64, 63, 59

Page 131

Hierarchical Model and Priors in this example

Under the hierarchical normal model

• Data yij, i = 1, …, nj, j = 1, …, J, are independently normally distributed within each of J groups, with means θj and common variance σ². The total number of observations is n.

• The group means θj are assumed to follow a normal distribution with unknown mean μ and variance τ².

• A uniform prior distribution is assumed for (μ, log σ, τ).

• If we were instead to assign a uniform prior distribution to log τ, the posterior distribution would be improper.

Page 132

Joint posterior density of all the parameters

Under this model and prior, the joint posterior density is

p(θ, μ, σ, τ | y) ∝ σ⁻¹ ∏j N(θj | μ, τ²) ∏j ∏i N(yij | θj, σ²),

where the factor σ⁻¹ comes from the uniform prior on (μ, log σ, τ).

Page 133

Starting points

• Choose over-dispersed starting points for each parameter θj by simply taking random values of the data yij from group j.

• The starting point for μ can be taken as the average of the starting θj values.

• No starting values are needed for τ or σ, as they can be drawn as the first steps of the Gibbs sampler.

Page 134

Gibbs sampler

1. The conditional posterior distributions for this model all have simple conjugate forms. Conditional on μ, σ, τ, and y, the group means are independent with

θj | μ, σ, τ, y ~ N(θ̂j, Vj), where θ̂j = (μ/τ² + nj ȳ.j/σ²) / (1/τ² + nj/σ²) and Vj = 1 / (1/τ² + nj/σ²),

and ȳ.j is the sample mean of group j.

Page 135

Gibbs sampler

2. Conditional on y and the other parameters in the model, μ has a normal distribution,

μ | θ, σ, τ, y ~ N(μ̂, τ²/J), with μ̂ = (1/J) Σj θj

(the flat prior on μ combined with θj ~ N(μ, τ²) gives this normal conditional).

3. Again, the conditional distribution has a standard form:

σ² | θ, μ, τ, y ~ Inv-χ²(n, σ̂²), where σ̂² = (1/n) Σj Σi (yij − θj)².

Page 136

Gibbs Sampler

4. Finally, the conditional posterior distribution of τ² is

τ² | θ, μ, σ, y ~ Inv-χ²(J − 1, τ̂²), where τ̂² = (1/(J − 1)) Σj (θj − μ)².

Page 137

Posterior quantiles and potential scale reduction for the coagulation example:

Estimand   2.5%   25%   median   75%   97.5%   Rhat

θ1 58.9 60.6 61.3 62.1 63.5 1.01

θ2 63.9 65.3 65.9 66.6 67.7 1.01

θ3 66.0 67.1 67.8 68.5 69.5 1.01

θ4 59.5 60.6 61.1 61.7 62.8 1.01

μ 56.9 62.2 63.9 65.5 73.4 1.04

σ 1.8 2.2 2.4 2.6 3.3 1.00
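Putting steps 1-4 together, here is a minimal Gibbs sampler sketch in Python (NumPy) for the coagulation data; the update ordering, variable names, iteration count, and starting values are ours, and the resulting posterior quantiles should be roughly comparable to the table above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Coagulation times (seconds) for the four diets
groups = [
    [62, 60, 63, 59],
    [63, 67, 71, 64, 65, 66],
    [68, 66, 71, 67, 68, 68],
    [56, 62, 60, 61, 63, 64, 63, 59],
]
y = [np.asarray(g, dtype=float) for g in groups]
J = len(y)
n_j = np.array([len(g) for g in y])
n = n_j.sum()
ybar_j = np.array([g.mean() for g in y])

def scaled_inv_chi2(df, scale):
    """Draw from Inv-chi^2(df, scale) as df * scale / chi^2_df."""
    return df * scale / rng.chisquare(df)

n_iter = 5000
theta = np.array([rng.choice(g) for g in y])   # overdispersed starts taken from the data
mu = theta.mean()
draws = {"theta": np.empty((n_iter, J)), "mu": np.empty(n_iter),
         "sigma": np.empty(n_iter), "tau": np.empty(n_iter)}

for t in range(n_iter):
    # tau^2 | theta, mu ~ Inv-chi^2(J-1, tau_hat^2)
    tau2 = scaled_inv_chi2(J - 1, np.sum((theta - mu) ** 2) / (J - 1))
    # sigma^2 | theta ~ Inv-chi^2(n, sigma_hat^2)
    sigma2 = scaled_inv_chi2(n, sum(np.sum((g - th) ** 2) for g, th in zip(y, theta)) / n)
    # mu | theta, tau ~ N(mean(theta), tau^2 / J)
    mu = rng.normal(theta.mean(), np.sqrt(tau2 / J))
    # theta_j | mu, sigma, tau ~ N(theta_hat_j, V_j)
    V = 1.0 / (1.0 / tau2 + n_j / sigma2)
    theta_hat = (mu / tau2 + n_j * ybar_j / sigma2) * V
    theta = rng.normal(theta_hat, np.sqrt(V))
    draws["theta"][t], draws["mu"][t] = theta, mu
    draws["sigma"][t], draws["tau"][t] = np.sqrt(sigma2), np.sqrt(tau2)

# discard the first half as burn-in and summarize
keep = slice(n_iter // 2, None)
print(np.percentile(draws["theta"][keep], [2.5, 50, 97.5], axis=0))
print(np.percentile(draws["mu"][keep], [2.5, 50, 97.5]))
```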

Page 138

Bivariate unit Normal Example

• Suppose (y1, y2) ~ Bivariate Normal((θ1, θ2), I), where I is the 2×2 identity matrix.

• Data: (y1, y2) = (0, 0).

• Target density: p(θ|y) = N(θ | 0, I).

• The jumping distribution is also bivariate normal, centered at the current iterate and scaled to 1/5 the size of the target:

Jt(θ*|θt−1) = N(θ* | θt−1, 0.2²I) (symmetric).

• Density ratio: r = N(θ*|0, I) / N(θt−1|0, I).

(A minimal sketch follows below.)
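A minimal Metropolis sketch for this example in Python (NumPy); the names, starting point, and number of iterations are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(theta):
    """Log of the bivariate unit normal target N(theta | 0, I), up to a constant."""
    return -0.5 * np.sum(theta**2)

n_iter = 5000
theta = np.array([2.5, 2.5])            # overdispersed starting point
draws = np.empty((n_iter, 2))
accepted = 0

for t in range(n_iter):
    proposal = rng.normal(theta, 0.2)           # J_t(theta* | theta) = N(theta, 0.2^2 I)
    log_r = log_target(proposal) - log_target(theta)
    if np.log(rng.uniform()) < log_r:           # accept with probability min(r, 1)
        theta = proposal
        accepted += 1
    draws[t] = theta

print("acceptance rate:", accepted / n_iter)
print("posterior mean estimate:", draws[n_iter // 2:].mean(axis=0))
```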

Page 139

Figure: Five independent sequences of a Markov chain simulation for the bivariate unit normal distribution, with overdispersed starting points indicated by solid squares.

(a) After 50 iterations, the sequences are still far from convergence (because of the inefficient jumping rule, deliberately chosen here to illustrate slow mixing).

(b) After 1000 iterations, the sequences are nearer to convergence.

(c) The iterates from the second halves of the sequences; the points have been jittered so that steps in which the random walks stood still are not hidden.

Page 140

Figure: Bivariate unit normal density with bivariate normal jumping kernel.

Page 141

Recommended strategy for posterior simulation

1. Start off with crude estimates and possibly a mode-based approximation to the posterior distribution.

2. If possible, simulate from the posterior distribution directly or sequentially, starting with hyperparameters and then moving to the main parameters

3. Most likely, the best approach is to set up a Markov chain simulation algorithm. The updating can be done one parameter at a time or with parameters in batches (as is often convenient in regressions and hierarchical models).

4. For parameters (or batches of parameters) whose conditional posterior distributions have standard forms, use the Gibbs sampler.

5. For parameters whose conditional distributions do not have standard forms, use Metropolis jumps. Tune the scale of each jumping distribution so that acceptance rates are near 20% (when altering a vector of parameters) or 40% (when altering one parameter at a time).

Page 142

6. Construct a transformation so that the parameters are approximately independent; this will speed the convergence of the Gibbs sampler. Alternatively, add auxiliary variables (data augmentation) to speed up the computation.

7. Start the Markov chain simulations with parameter values taken from the crude estimates or mode-based approximations, with noise added so they are over-dispersed with respect to the target distribution.

8. Run multiple Markov chains and monitor the mixing of the sequences. Run until approximate convergence appears to have been reached.

Page 143

9. If Rhat is near 1 for all scalar estimands of interest, summarize inference about the posterior distribution by treating the set of all iterates from the second half of the simulated sequences as an identically distributed sample from the target distribution. At this point, simulations from the different sequences can be mixed.

10. Compare the posterior inferences from the Markov chain simulation to the approximate distribution used to start the simulations. If they are not close with respect to locations and approximate distributional shape, check for errors before believing that the Markov chain simulation has produced a better answer.

Page 144

Advanced techniques for Markov Chain Simulation

Hybrid Monte Carlo methods: for moving rapidly through the target distribution

• Borrows ideas from physics to add auxiliary variables that suppress the local random walk behavior in the Metropolis algorithm, thus allowing it to move much more rapidly through the target distribution.

• For each component θj in the target space, hybrid Monte Carlo adds a 'momentum' variable φj.

• Both θ and φ are then updated together in a new Metropolis algorithm, in which the jumping distribution for θ is determined largely by φ.

• Roughly, the momentum gives the expected distance and direction of the jump in θ, so that successive jumps tend to be in the same direction, allowing the simulation to move rapidly where possible through the space of θ.

• The MH accept/reject rule stops the movement when it reaches areas of low probability, at which point the momentum changes until the jumping can continue.

• Hybrid Monte Carlo is also called Hamiltonian Monte Carlo because it is related to the model of Hamiltonian dynamics in physics. (A minimal sketch follows below.)
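As a rough illustration of the idea (not a full treatment of the tuning details), here is a minimal Hamiltonian Monte Carlo sketch in Python (NumPy) for a bivariate unit normal target; the step size, number of leapfrog steps, and names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p(theta):            # log target density, up to a constant: N(0, I)
    return -0.5 * np.sum(theta**2)

def grad_log_p(theta):       # gradient of the log target density
    return -theta

def hmc_step(theta, eps=0.1, n_leapfrog=20):
    phi = rng.normal(size=theta.shape)          # draw momentum phi ~ N(0, I)
    theta_new, phi_new = theta.copy(), phi.copy()
    # leapfrog integration of the Hamiltonian dynamics
    phi_new += 0.5 * eps * grad_log_p(theta_new)
    for _ in range(n_leapfrog - 1):
        theta_new += eps * phi_new
        phi_new += eps * grad_log_p(theta_new)
    theta_new += eps * phi_new
    phi_new += 0.5 * eps * grad_log_p(theta_new)
    # Metropolis accept/reject based on the change in total "energy"
    log_r = (log_p(theta_new) - 0.5 * np.sum(phi_new**2)) - (log_p(theta) - 0.5 * np.sum(phi**2))
    return theta_new if np.log(rng.uniform()) < log_r else theta

theta = np.array([2.5, -2.5])
draws = np.empty((2000, 2))
for t in range(2000):
    theta = hmc_step(theta)
    draws[t] = theta
print(draws[1000:].mean(axis=0))    # should be near (0, 0)
```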

Page 145

Advanced techniques for Markov Chain Simulation

Langevin methods

• The basic symmetric-jumping Metropolis algorithm is simple to apply but has the disadvantage of wasting many of its jumps by going into low-probability areas of the target distribution.

• For example, optimal Metropolis algorithms in high dimensions have acceptance rates below 25%, which means that, in the best case, over 3/4 of the jumps are wasted.

• A potential improvement is afforded by the Langevin algorithm, in which each jump is associated with a small shift in the direction of the gradient of the logarithm of the target density, thus moving the jumps toward higher-density regions of the distribution.

• This jumping rule is not symmetric, so the Metropolis-Hastings ratio must be used. (A minimal sketch follows below.)
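Here is a minimal Langevin-style (Metropolis-adjusted) sketch in Python (NumPy), again for a bivariate unit normal target; the step size and names are ours, and the asymmetric proposal is corrected with the Metropolis-Hastings ratio.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p(theta):                 # log target density, up to a constant: N(0, I)
    return -0.5 * np.sum(theta**2)

def grad_log_p(theta):
    return -theta

def log_jump(to, frm, eps):
    """Log density of the Langevin proposal J(to | frm), up to a constant."""
    mean = frm + 0.5 * eps**2 * grad_log_p(frm)   # small shift toward higher density
    return -0.5 * np.sum((to - mean) ** 2) / eps**2

eps = 0.5
theta = np.array([2.5, -2.5])
draws = np.empty((5000, 2))
for t in range(5000):
    proposal = theta + 0.5 * eps**2 * grad_log_p(theta) + eps * rng.normal(size=2)
    # Metropolis-Hastings ratio corrects for the asymmetric jumping rule
    log_r = (log_p(proposal) + log_jump(theta, proposal, eps)
             - log_p(theta) - log_jump(proposal, theta, eps))
    if np.log(rng.uniform()) < log_r:
        theta = proposal
    draws[t] = theta
print(draws[2500:].mean(axis=0))   # should be near (0, 0)
```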

Page 146

Thank you for your attention!