Motivation Basic g-prior Results with BMA Markov Chain Monte Carlo Model Composition: MC3
Bayesian Model Averaging and Model Search
Econ 690
Purdue University
February 19, 2010
Justin L. Tobias Bayesian Model Averaging
Outline
1 Motivation
2 Basic g-prior Results with BMA
3 Markov Chain Monte Carlo Model Composition: MC3
Motivation for Model Averaging
In frequentist econometrics, the dominant paradigm is to select a particular model.
The process used in formulating the model is not typically made transparent, and the resulting inference does not typically take into account the uncertainty associated with the selection of the model itself.
In other words, a variety of pretest-type procedures are employed to select a model and, conditioned on that selection, statistics are reported.
As an alternative to model selection, many Bayesians prefer to average across models rather than select a single model.
To fix ideas, consider a binary choice example: we could use a probit, a logit, a regression with Student-t errors, a mixture of Normals, or other methods such as the log-log link or skew Normal models.
Each of these specifications will generate a prediction regarding some parameter of interest.
Instead of selecting one of these, might it be more sensible in general to obtain a prediction from each model and then combine these predictions in some “optimal” way?
Procedures for combining model-specific estimates within the Bayesian framework are often termed exercises in Bayesian Model Averaging.
The motivation for Bayesian model averaging (BMA) follows immediately from the laws of probability.
That is, if we let Mr, for r = 1, …, R, denote the R different models under consideration and φ be a vector of parameters which has a common interpretation in all models, then the rules of probability suggest:
Alternatively, if g(φ) is a function of φ, the rules of conditional expectation imply that:
In words, the logic of Bayesian inference says that:
It can also be shown that averaging over all possible models in this fashion produces better predictive ability (under a logarithmic scoring rule) than the selection of any particular model [e.g., Raftery, Madigan and Hoeting (1997)].
OK, so what’s the big deal? We already talked about calculating expectations or marginal posteriors within a given model, as well as calculating marginal likelihoods for a range of possible models.
So, is there really anything more that is required to implement a BMA procedure?
It is important to recognize that, in many empirical examples, the number of models under consideration is simply too large for evaluation of, e.g., a marginal likelihood for every model.
For instance, in the regression model, one has K potential explanatory variables, leading to R = 2^K possible models!
For example, with K = 20, we have R = 1,048,576 possible specifications!
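The combinatorial explosion is easy to verify with a minimal sketch (the small K = 3 here is purely illustrative): each model corresponds to one inclusion vector over the K candidate covariates.

```python
from itertools import product

# Each candidate model is an inclusion vector gamma over K covariates,
# so the model space has 2**K elements (K = 3 is a toy choice).
K = 3
models = list(product([0, 1], repeat=K))
n_models = len(models)   # 2**3 = 8 models

# With K = 20 the count is already over a million.
large_count = 2 ** 20
```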
In this lecture, we discuss some possible computational strategies for carrying out this model comparison exercise.
To this end, we first introduce benchmark-type priors that are often used in these analyses and thereby discuss the g-prior of Zellner (1986).
We then discuss the technique of Markov Chain Monte Carlo Model Composition as a potentially useful device for model exploration when the number of possible models, R, is large.
Basic g-prior Results
Consider the regression model:
(here, we parameterize in terms of the error precision h = σ^{-2} rather than the variance σ^2) under the priors
In addition, select
with
and
This prior is termed the g-prior by Zellner (1986). Fernandez, Ley and Steel (2001) consider the use of this prior for BMA purposes and suggest the use of gr = 1/n, so that the information in the prior is roughly the same as the informational content of one observation.
Note, in the above, that r denotes a particular model, and βr denotes a particular configuration of possible β coefficients.
Along these lines, Xr denotes the stacked covariate matrix for model r. For example, if model r includes only β1 and β2, then Xr = [x1 x2].
Fernandez et al also recommend standardizing all explanatory variables to have mean zero, so that the intercept α has a common interpretation across models (as the unconditional mean of y).
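The mapping from an inclusion vector to the stacked covariate matrix Xr can be sketched as follows; the data here are synthetic placeholders, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 50, 4
X = rng.normal(size=(n, K))        # hypothetical full covariate matrix

# Demean every column, as Fernandez et al recommend, so the intercept
# alpha has a common interpretation (the unconditional mean of y).
X = X - X.mean(axis=0)

# gamma = 1 marks an included covariate; X_r stacks the included columns.
gamma = np.array([1, 0, 1, 0], dtype=bool)
X_r = X[:, gamma]
```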
With this construction, it follows that the marginal posterior distribution of βr is multivariate Student-t with mean
with
and covariance matrix:
where, in the above quantities, we define
\[
s_r^2 = \frac{\frac{1}{g_r+1}\, y' P_{X_r} y \;+\; \frac{g_r}{g_r+1}\,\left(y - \bar{y}\,\iota_N\right)'\left(y - \bar{y}\,\iota_N\right)}{\nu},
\]
where
\[
P_{X_r} = I_N - X_r \left(X_r' X_r\right)^{-1} X_r'
\]
and ν = n.
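A numerical sketch of this s_r^2 computation, using synthetic placeholder data (the dimensions and coefficients below are assumptions for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
X_r = rng.normal(size=(n, 2))
X_r = X_r - X_r.mean(axis=0)       # demeaned covariates
y = X_r @ np.array([1.0, -2.0]) + rng.normal(size=n)

g_r = 1.0 / n                      # Fernandez-Ley-Steel choice g_r = 1/n
iota = np.ones(n)
ybar = y.mean()

# P_{X_r} = I_N - X_r (X_r' X_r)^{-1} X_r'
P = np.eye(n) - X_r @ np.linalg.solve(X_r.T @ X_r, X_r.T)

nu = n                             # nu = n, as in the slides
s2_r = ((1.0 / (g_r + 1.0)) * (y @ P @ y)
        + (g_r / (g_r + 1.0)) * ((y - ybar * iota) @ (y - ybar * iota))) / nu
```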
Similarly, with a bit of algebra, and following steps similar to those given in our earlier lecture on marginal likelihoods and hypothesis testing in the linear model, we can calculate:
Posterior model probabilities can then be calculated using:
This constant can, when required, be obtained by summing over all the (unnormalized) marginal likelihoods from each candidate model.
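The normalization step is simple in code; the log marginal likelihood values below are hypothetical. Working on the log scale (log-sum-exp) avoids underflow when marginal likelihoods are tiny.

```python
import math

# Hypothetical unnormalized log marginal likelihoods for three models.
log_ml = [-10.2, -9.1, -12.7]

# Subtract the max before exponentiating so the normalization is
# numerically stable even for very negative log values.
m = max(log_ml)
weights = [math.exp(v - m) for v in log_ml]
total = sum(weights)
post_prob = [w / total for w in weights]   # posterior model probabilities
```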
The foregoing suggests how one could implement a BMA procedure in practice:
For the linear model, employ the Fernandez et al (2001) priors.
Consider every possible model, as defined by the exclusion/inclusion of each covariate.
For each model and a given object of interest, calculate its posterior mean or marginal posterior distribution using properties of the multivariate Student-t distribution.
Calculate the (unnormalized) marginal likelihood using the formula given above.
Normalize the marginal likelihoods into posterior model probabilities (under equal prior odds) and calculate model-averaged posterior means or marginal posteriors.
We now illustrate how BMA (in conjunction with a method we describe later) can be used in practice using a real data set.
The data set used is taken from Fernandez, Ley and Steel (2001a) and is available from the Journal of Applied Econometrics data archive (www.econ.queensu.ca/jae).
This data set covers N = 72 countries and contains K = 41 potential explanatory variables.
The dependent variable is average per capita GDP growth for the period 1960-1992.
In the following table, “BMA Post. Prob.” can be interpreted as the probability that the corresponding explanatory variable should be included.
That is, it is the sum of all the posterior model probabilities for those models that actually include the given covariate.
The other two columns of the table contain posterior means and standard deviations for each regression coefficient, averaged across models. Remember that models where a particular explanatory variable is excluded are interpreted as implying a zero value for its coefficient. Hence, the averages involve some terms where E[g(φ)|y, M(s)] is calculated and others where the value of zero is used.
Bayesian Model Averaging Results

Explanatory Variable                BMA Post. Prob.   Posterior Mean   Posterior St. Dev.
Primary School Enrolment            0.207             0.004            0.010
Life Expectancy                     0.935             0.001            3.4 × 10^-4
GDP Level in 1960                   0.998             -0.016           0.003
Fraction GDP in Mining              0.460             0.019            0.023
Degree of Capitalism                0.452             0.001            0.001
No. Years Open Economy              0.515             0.007            0.008
% of Pop. Speaking English          0.068             -4.3 × 10^-4     0.002
% of Pop. Speaking Foreign Lang.    0.067             2.9 × 10^-4      0.001
Exchange Rate Distortions           0.081             -4.0 × 10^-6     1.7 × 10^-5
Equipment Investment                0.927             0.161            0.068
Non-equipment Investment            0.427             0.024            0.032
St. Dev. of Black Market Premium    0.049             -6.3 × 10^-7     3.9 × 10^-6
Outward Orientation                 0.039             -7.1 × 10^-5     5.9 × 10^-4
Black Market Premium                0.181             -0.001           0.003
Area                                0.031             -5.0 × 10^-9     1.1 × 10^-7
Latin America                       0.207             -0.002           0.004
Sub-Saharan Africa                  0.736             -0.011           0.008
Higher Education Enrolment          0.043             -0.001           0.010
Public Education Share              0.032             0.001            0.025
Revolutions and Coups               0.030             -3.7 × 10^-6     0.001
War                                 0.076             -2.8 × 10^-4     0.001
Bayesian Model Averaging Results (continued)

Explanatory Variable                BMA Post. Prob.   Posterior Mean   Posterior St. Dev.
Political Rights                    0.094             -1.5 × 10^-4     0.001
Civil Liberties                     0.127             -2.9 × 10^-4     0.001
Latitude                            0.041             9.1 × 10^-7      3.1 × 10^-5
Age                                 0.083             -3.9 × 10^-6     1.6 × 10^-5
British Colony                      0.037             -6.6 × 10^-5     0.001
Fraction Buddhist                   0.201             0.003            0.006
Fraction Catholic                   0.126             -2.9 × 10^-4     0.003
Fraction Confucian                  0.989             0.056            0.014
Ethnolinguistic Fractionalization   0.056             3.2 × 10^-4      0.002
French Colony                       0.050             2.0 × 10^-4      0.001
Fraction Hindu                      0.120             -0.003           0.011
Fraction Jewish                     0.035             -2.3 × 10^-4     0.003
Fraction Muslim                     0.651             0.009            0.008
Primary Exports                     0.098             -9.6 × 10^-4     0.004
Fraction Protestant                 0.451             -0.006           0.007
Rule of Law                         0.489             0.007            0.008
Spanish Colony                      0.057             2.2 × 10^-4      1.5 × 10^-3
Population Growth                   0.036             0.005            0.046
Ratio Workers to Population         0.046             -3.0 × 10^-4     0.002
Size of Labor Force                 0.072             6.7 × 10^-9      3.7 × 10^-8
As discussed earlier, however, BMA has its limitations.
In particular, when K is large, calculation of something like a posterior mean and posterior model probability for every model becomes nearly infeasible.
Indeed, in the above table, we did not calculate the posterior probability for each model, since there are more than 40 covariates.
Instead, we employed an alternative procedure that may be able to more quickly determine those models receiving high posterior probability.
Markov Chain Monte Carlo Model Composition: MC3
(Should this be (MC)^3 instead of MC^3?) Consider the following algorithm, as described by Madigan and York (1995):
Let the model space be denoted as {Mr} for r = 1, …, R. This can be expressed in terms of a K × 1 vector γ = (γ1, …, γK)′ where all elements are either 0 or 1.
A value γj = 1 indicates that the jth explanatory variable enters the model (else γj = 0).
There are 2^K possible configurations of γ, and the space of this parameter is equivalent to the model space.
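Because γ is a vector of K bits, each model can be identified with an integer in [0, 2^K); a small sketch of that correspondence, which is handy for tallying visit counts when sampling over the model space:

```python
# Identify gamma with the integer whose binary digits are its entries.
def encode(gamma):
    return sum(bit << j for j, bit in enumerate(gamma))

def decode(code, K):
    return [(code >> j) & 1 for j in range(K)]
```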
Consider the following M-H type routine for sampling over the model space:
Suppose we are currently “at” M(s−1) in our sampler, with associated current set of parameter values γ(s−1).
Now, consider all re-configurations of M(s−1) that include both the model itself, as well as all other potential models that differ from the current model in one component ONLY.
That is, we define the set of all possible models that we can “move to” as the current model together with all models that add or delete a single explanatory variable from the current model under consideration.
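The proposal neighborhood just described (the current model plus every one-flip model) can be generated directly:

```python
# Candidate moves: the current model itself, plus every model obtained by
# flipping exactly one inclusion indicator (add or drop one covariate).
def neighbors(gamma):
    nbhd = [list(gamma)]
    for j in range(len(gamma)):
        flipped = list(gamma)
        flipped[j] = 1 - flipped[j]
        nbhd.append(flipped)
    return nbhd

nbhd = neighbors([1, 0, 1])   # K + 1 = 4 candidate models
```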
With the sampling defined in this way, we can apply the standard M-H formula governing the probability of movement from M(s−1) to M∗ in the above set:
(Why is this the correct acceptance probability?) Also note that p(y|γ(s−1)) and p(y|γ∗) are our marginal likelihoods, which can be calculated analytically with our g-prior.
In the common case where equal prior weight is allocated to each model, p(γ∗) = p(γ(s−1)), and these terms cancel out of the above ratio.
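One MC3 step can then be sketched as below: propose uniformly from the neighborhood and accept with probability min(1, p(y|γ∗)/p(y|γ(s−1))), since equal prior odds and the symmetric neighborhood make the other terms cancel. The log_ml below is a toy stand-in peaked at an assumed "true" model, purely to exercise the step; it is not the g-prior marginal likelihood.

```python
import math
import random

random.seed(0)
TARGET = (1, 0, 1)                      # assumed best model (toy choice)

def log_ml(gamma):
    # Stand-in log marginal likelihood: penalize disagreement with TARGET.
    return -5.0 * sum(a != b for a, b in zip(gamma, TARGET))

def neighbors(gamma):
    # Current model plus all one-flip models.
    nbhd = [tuple(gamma)]
    for j in range(len(gamma)):
        g = list(gamma)
        g[j] = 1 - g[j]
        nbhd.append(tuple(g))
    return nbhd

def mc3_step(gamma):
    cand = random.choice(neighbors(gamma))
    # Accept with probability min(1, exp(log-ratio of marginal likelihoods)).
    if random.random() < math.exp(min(0.0, log_ml(cand) - log_ml(gamma))):
        return cand
    return gamma

# Run the chain from an arbitrary start and tally visits.
gamma = (0, 0, 0)
visits = {}
for _ in range(500):
    gamma = mc3_step(gamma)
    visits[gamma] = visits.get(gamma, 0) + 1
```

With such a sharply peaked stand-in, the chain should spend the bulk of its iterations at the target model.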
Given the simulated output from this procedure, a variety of quantities of interest can be calculated, including:
The models most supported by the data, determined as those models that are visited most frequently when applying our simulator.
The posterior probability of a particular model, determined as the fraction of times that model is visited in the sampler.
Bayes factors comparing models M1 and M2, determined as the ratio of the number of times our sampler visits M1 relative to M2.
Model-averaged posterior distributions of a function of interest, g(φ), where:
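These visit-frequency calculations are straightforward; the sequence of draws below is hypothetical MC3 output for a three-covariate problem:

```python
from collections import Counter

# Hypothetical MC3 output: the sequence of models visited by the sampler.
draws = [(1, 0, 1)] * 70 + [(1, 1, 1)] * 20 + [(1, 0, 0)] * 10
S = len(draws)
counts = Counter(draws)

# Posterior model probability: the visit frequency of each model.
post_prob = {model: c / S for model, c in counts.items()}

# Bayes factor for one model versus another: the ratio of visit counts.
bf = counts[(1, 0, 1)] / counts[(1, 1, 1)]

# Inclusion probability of covariate j: share of draws with gamma_j = 1.
incl = [sum(m[j] for m in draws) / S for j in range(3)]
```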
In the foregoing example, we did not calculate the marginal likelihoods for each model, but instead employed the MC3 method.
Specifically, we obtained 1,100,000 draws and discarded the first 100,000 as burn-in replications.
The column “BMA Post. Prob.” was calculated as the proportion of models drawn by the MC3 algorithm containing the corresponding explanatory variable.
To illustrate the use of the MC3 method, we consider the following generated data experiment.
We first generate 10 potential explanatory variables as follows:
\[
x_i \stackrel{iid}{\sim} N\left(0_{10},\; 0.7\, I_{10} + 0.3\, \iota_{10}\iota_{10}'\right).
\]
We then generate y using
\[
y_i = 3 + 4.5 x_{2i} - 6.2 x_{4i} + x_{6i} - 5.4 x_{9i} + \varepsilon_i,
\]
where \(\varepsilon_i \stackrel{iid}{\sim} N(0, 1)\).
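This design is easy to reproduce; the sample size is not stated in the slides, so n = 200 below is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200                                              # assumed sample size

# Covariance .7 I_10 + .3 iota iota': equicorrelated standardized covariates.
cov = 0.7 * np.eye(10) + 0.3 * np.ones((10, 10))
X = rng.multivariate_normal(np.zeros(10), cov, size=n)

# y_i = 3 + 4.5 x_{2i} - 6.2 x_{4i} + x_{6i} - 5.4 x_{9i} + eps_i
# (columns are 0-indexed, so x_2 is X[:, 1], and so on)
eps = rng.normal(size=n)
y = 3 + 4.5 * X[:, 1] - 6.2 * X[:, 3] + X[:, 5] - 5.4 * X[:, 8] + eps
```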
In our MC3 method, we allow all variables (including an intercept) to be excluded/included in our model.
We start the sampler away from the “true” model of the data generating process by choosing:
γ = [0 1 1 1 1 0 0 1 1 1 1]′
The sampler is run for 10,000 iterations, discarding the first 1,000 as the burn-in period.
In terms of the “true” in/out classification of each variable, we could write
γtrue = [1 0 1 0 1 0 1 0 0 1 0]′
The posterior mean associated with γ is found to be:
E(γ|y) = [1.00 .074 1.00 .056 1.00 .051 1.00 .03 .04 1.00 .05].
So, when a variable truly belongs in the model, we clearly get it right.
However, at a particular iteration, we may choose to keep an “inappropriate” variable in the model. Then, we typically must wait for our next chance to exclude that variable from the specification.
Further Reading
Fernandez, C., E. Ley and M. Steel (2001). “Benchmark Priors for Bayesian Model Averaging,” Journal of Econometrics 100(2), 381-427.

Madigan, D. and J. York (1995). “Bayesian Graphical Models for Discrete Data,” International Statistical Review 63, 215-232.

Raftery, A.E. (1996). “Approximate Bayes Factors and Accounting for Model Uncertainty in Generalized Linear Models,” Biometrika 83, 251-266.

Raftery, A.E., D. Madigan and J.A. Hoeting (1997). “Bayesian Model Averaging for Linear Regression Models,” Journal of the American Statistical Association 92, 179-191.

Zellner, A. (1986). “On Assessing Prior Distributions and Bayesian Regression Analysis with g-Prior Distributions,” in Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti, North-Holland.