Lecture 4: Model Comparison
James P. LeSage
University of Toledo
Department of Economics
Toledo, OH 43606
March, 2004
We consider a class of spatial regression models introduced in
Ord (1975) and elaborated in Anselin (1988), shown in (1). The
sample observations in these models represent regions located in
space, for example counties, states, or countries.
y = ρWy + Xβ + u (1)
u = λV u + ε
where X denotes an n by k matrix of explanatory variables
as in ordinary least-squares regression, β is an associated k
by 1 parameter vector, and ε is an n by 1 disturbance vector,
which we assume takes the form ε ∼ N(0, σ2In). The
n by n matrices W and V specify the structure of spatial
dependence between observations (y) or disturbances (u), with a
common specification having elements Wij > 0 for observations
j = 1 . . . n sufficiently close (as measured by some distance
metric) to observation i. As noted above, observations reflect
geographical regions, so distances might be calculated on the
basis of the centroid of the regions/observations. The expressions
Wy and V u produce vectors that are often called spatial lags,
and ρ and λ denote scalar parameters to be estimated along with
β and σ2. Non-zero values for the scalars ρ and λ indicate that
the spatial lags exert an influence on y or u.
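As a concrete illustration of the data generating process in (1) with λ = 0, the sketch below simulates a SAR model by solving the reduced form y = (In − ρW)−1(Xβ + ε). The ring-contiguity weight matrix and all parameter values are hypothetical choices for illustration, not taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3

# Hypothetical ring-contiguity weight matrix: each region's neighbors
# are the previous and next region; rows normalized to sum to one.
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = 1.0
    W[i, (i + 1) % n] = 1.0
W /= W.sum(axis=1, keepdims=True)

# Illustrative parameter values (not from the lecture)
rho, beta, sigma = 0.5, np.array([1.0, -2.0, 0.5]), 1.0

X = rng.standard_normal((n, k))
eps = rng.normal(0.0, sigma, n)

# y = rho*W*y + X*beta + eps  =>  y = (I - rho*W)^{-1} (X*beta + eps)
y = np.linalg.solve(np.eye(n) - rho * W, X @ beta + eps)
print(y.shape)  # (100,)
```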
Comparing alternative models
We wish to compare:
1) model specifications, e.g., y = ρWy + Xβ + ε, vs.
y = Xβ + u, u = ρWu + ε
Selection of the appropriate model, SAR, SEM, SDM or
SAC has been the subject of numerous Monte Carlo studies
that examine alternative systematic or sequential specification
search approaches. Florax, Folmer and Rey (2003) provide
a review of this literature. All of these approaches have
in common maximum likelihood estimation methods in
conjunction with conventional specification tests such as the
Lagrange multiplier (LM) or likelihood ratio (LR) tests.
2) weight matrix specifications, e.g., W based on contiguity
vs. W based on nearest neighbors, distance, or a parameterized W
Trying to parameterize the weight matrix causes a problem for
maximum likelihood methods: the likelihood becomes ill-defined
when the spatial dependence parameter is zero. This poses no
problem for Bayesian methods.
3) explanatory variables, e.g., X1, X2, X3 vs. X1, X3.
Only Bayesian methods offer the potential for a comprehensive
solution here.
Current state of parameter estimation and inference in spatial econometrics
A lot of good methods/tools, each with their strengths and
weaknesses.
• Likelihood
– Strengths, inference regarding parameter dispersion is
theoretically sound, strong connection to economic theory
of production, utility, spillovers, imposes theoretical
restriction on spatial dependence parameter
– Weaknesses, slow, hard to code for large problems, not
robust to non-normal error distributions, inference regarding
dispersion is difficult in practice
• GMM
– Strengths, fast, easy to code, robust to error distribution,
theoretically sound, strong connection to economic theory
of production, utility, spillovers
– Weaknesses, doesn’t impose theoretical restriction
on spatial dependence parameter, inference regarding
dispersion is an unsettled issue
• Semi-parametric
– Strengths, robust with respect to many possible problems (e.g.
many error distributions, model specification problems),
good for prediction
– Weaknesses, throws away parsimonious structure (spatial
autoregressive), data requirements and tuning parameters
make it harder to implement in small samples, inference
difficult, weak connection to economic theory of production,
utility, spillovers
• Bayesian
– Strengths, inference regarding parameter dispersion (both
theoretical and applied), strong connection to economic
theory of production, utility, spillovers, imposes theoretical
restriction on spatial dependence parameter, works for
binary, truncated, missing, and multinomial dependent
variables, parameterized spatial weight matrices
– Weaknesses, slow, hard to code
Points of failure for non-Bayesian methods
• Likelihood, fails for parameterized weight matrices because the
likelihood ratio is ill-defined at ρ = 0.
• Likelihood, requires sequential testing for comparison of model
specifications, and reliance on a host of Monte Carlo evidence.
Boils down to parameter inference on a nested model structure.
(Changing the weight matrix or explanatory variables destroys
nesting)
• GMM, not well developed in this area. Boils down to parameter
inference on a nested model structure. (Changing the weight
matrix or explanatory variables destroys nesting)
• Semi-parametric, doesn’t wish to participate in this issue.
Doesn’t believe in a true data generating process, relying
instead on flexible functional forms. Throws away
parsimonious model structures/specifications that can be
derived from economic theory.
Current state of Bayesian model comparison in spatial econometrics
• Things that are currently available in my MATLAB spatial
econometrics toolbox, or will be available soon.
• Focus here only on spatial autoregressive/spatial error models
(ignoring other spatial estimation functions). Some of this
based on recent work with Olivier Parent.
y = ρWy + αι + Xβ + u
u = λDu + ε
ε ∼ N(0, σ2V )
V = diag(v1, v2, . . . , vn)
• Need to rely on priors (π) that are not too informative and
not too diffuse to avoid Lindley’s (1957) paradox.
Priors
• For β, Zellner’s g-prior,

πb(β|σ2) ∼ N[0, σ2 (gi XMi′XMi)^(−1)] (2)
Fernandez, Ley and Steel (2001a, 2001b) provide a theoretical
justification for use of the g-prior as well as Monte
Carlo evidence comparing nine alternative approaches to
setting the hyperparameter g. They recommend setting
gi = 1/max{n, k²Mi} for the case of least-squares based
MC3 methodology.
• For α, a diffuse prior
• For σ2, a Gamma prior, (or diffuse where ν = d = 0)
πs(σ2) ∼ G(ν, d) (3)
• For ρ, λ, either a uniform prior on the interval [−1, 1] or a
type of Beta(a, a) distribution centered on zero.

πr(ρ) = U[−1, 1] (4)

πr(ρ) = [1/Be(a, a)] (1 + ρ)^(a−1) (1 − ρ)^(a−1) / 2^(2a−1)
Prior for ρ, λ

[Figure: prior comparisons — Beta(a, a) prior densities for ρ over
[−1, 1], shown for a = 1.01, a = 1.1, and a = 2.0]
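The Beta(a, a) prior in (4) can be evaluated directly; the following sketch reproduces the densities compared in the figure for a = 1.01, 1.1 and 2.0 (the grid resolution is an arbitrary choice).

```python
import math
import numpy as np

def rho_prior(rho, a):
    """Beta(a, a) prior stretched to [-1, 1] and centered on zero:
    pi_r(rho) = (1/Be(a,a)) (1+rho)^(a-1) (1-rho)^(a-1) / 2^(2a-1)."""
    Be = math.gamma(a) ** 2 / math.gamma(2 * a)  # Beta function Be(a, a)
    return ((1 + rho) ** (a - 1) * (1 - rho) ** (a - 1)
            / (Be * 2 ** (2 * a - 1)))

grid = np.linspace(-0.99, 0.99, 199)
for a in (1.01, 1.1, 2.0):
    dens = rho_prior(grid, a)
    # each density should integrate to roughly 1 over (-1, 1)
    integral = np.sum((dens[1:] + dens[:-1]) / 2 * np.diff(grid))
    print(a, round(integral, 2))
```

With a near 1 the prior is essentially uniform; larger a concentrates mass around zero, matching the figure.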
Log marginal posteriors
• Univariate integration wrt ρ, λ for SAR and SEM models (this
problem solved)
SAR : y = ρWy + αι + Xβ + ε
SEM : y = αι + Xβ + u
u = λDu + ε
ε ∼ N(0, σ2In)
• Bivariate integration for general model (this problem NOT yet
solved)
y = ρWy + αι + Xβ + u
u = λDu + ε
ε ∼ N(0, σ2In)
• MCMC solution is needed for heteroscedastic models: (this
problem solved)
ε ∼ N(0, σ2V )
V = diag(v1, v2, . . . , vn)
Nature of the integration problem
• SAR model,
– we can analytically integrate out β, σ arriving at a log
marginal posterior:
p(ρ|y) = K2 [g/(1 + g)]^(k/2) ∫ |In − ρW| [νs2 + S(ρ) + Q(ρ)]^(−(n+ν)/2) πr(ρ) dρ

K2 = Γ(ν/2)^(−1) (νs2/2)^(ν/2) (2π)^(−n/2) 2^((n+ν)/2) Γ((n + ν)/2)
   = [Γ((n + ν)/2)/Γ(ν/2)] (νs2)^(ν/2) π^(−n/2)

Q(ρ) = [g/(g + 1)] β(ρ)′X′Xβ(ρ)
– |In − ρW |, S(ρ), Q(ρ) and πr(ρ) can be vectorized
over a grid of ρ values making integration easy.
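The vectorized grid evaluation can be sketched as follows. This is a simplified illustration with hypothetical data and a dense log-determinant; it drops the constant K2 (which does not vary with ρ, so it cancels in model comparison) and ignores the intercept treatment, so it shows the mechanics rather than the toolbox implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, nu, s2 = 50, 3, 4, 1.0
g = 1.0 / n                       # g-prior hyperparameter, g = 1/max{n, k^2}

# Ring-contiguity weight matrix, row-normalized (hypothetical)
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.5

X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
y = rng.standard_normal(n)
XtX = X.T @ X
proj = np.linalg.solve(XtX, X.T)  # maps a response to its OLS coefficients

grid = np.linspace(-0.99, 0.99, 199)
log_kernel = np.empty_like(grid)
for j, rho in enumerate(grid):
    _, logdet = np.linalg.slogdet(np.eye(n) - rho * W)  # log|I - rho*W|
    e = y - rho * (W @ y)
    b = proj @ e                          # beta(rho)
    resid = e - X @ b
    S = resid @ resid                     # S(rho): residual sum of squares
    Q = g / (g + 1) * (b @ (XtX @ b))     # Q(rho) from the g-prior
    log_kernel[j] = logdet - (n + nu) / 2 * np.log(nu * s2 + S + Q)

# Rescale before exponentiating, then integrate with the trapezoid rule
stab = np.exp(log_kernel - log_kernel.max())
marginal = np.sum((stab[1:] + stab[:-1]) / 2 * np.diff(grid))
print(marginal > 0)   # proportional to the marginal likelihood of the SAR model
```

Rescaling by the maximum log-kernel value before exponentiating is the "scaling reasons" mentioned later: without it the kernel underflows for realistic sample sizes.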
• SEM model, with y∗ = y − λDy, X∗ = X − λDX, and
πb(β|σ2) ∼ N[0, σ2C], C = (g X∗Mi′X∗Mi)^(−1),
– we can analytically integrate out β, σ arriving at a log marginal
posterior:

p(λ|y) = Γ((n − 1)/2) π^(−n/2) ∫ [|C|/|X∗′X∗ + C|]^(1/2) |In − λD| [S(λ) + Q]^(−(n−1)/2) πr(λ) dλ (5)

S(λ) = [g/(1 + g)] e∗(λ)′e∗(λ)
e∗(λ) = y∗ − X∗β − αι
Q = [1/(1 + g)] (y∗ − ȳ∗ι)′(y∗ − ȳ∗ι)
β = (X∗′X∗)^(−1) X∗′y∗
α = ȳ∗
ȳ∗ = (1/n) ι′y∗
– |In − λD|, S(λ), Q(λ) and πr(λ) can be vectorized over a
grid of λ values, but this is more computationally intensive.
Operational/implementation issues
For the case of a finite # of homoscedastic SAR, SEM models
• No need to estimate β, σ to do model comparison
• There is a need to store the vectorized log-marginal posterior
for scaling reasons. A vector of 2,000 values seems sufficient
for each model being compared.
• Using the log-determinant approximation methods of Pace and
Barry (1999) and vectorization, reasonably large samples pose
no problem.
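Pace and Barry (1999) use a Monte Carlo approximation to the log-determinant; as a simpler stand-in for illustration, a sparse LU factorization also yields log|In − ρW| over a grid of ρ values. The chain-contiguity W below is a hypothetical example, not one from the lecture.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

# Chain-contiguity weight matrix (hypothetical): neighbors are the
# adjacent regions, scaled by 1/2.
n = 500
off = np.ones(n - 1)
W = sp.diags([off, off], [-1, 1], format="csc") / 2.0
I = sp.identity(n, format="csc")

grid = np.linspace(-0.99, 0.99, 99)
logdets = np.empty_like(grid)
for j, rho in enumerate(grid):
    lu = splu((I - rho * W).tocsc())
    # |det A| = prod |diag(U)|; L has a unit diagonal and the row
    # permutation only changes the sign
    logdets[j] = np.sum(np.log(np.abs(lu.U.diagonal())))

print(logdets[len(grid) // 2])  # rho = 0 gives log|I| = 0
```

Because W is sparse, each factorization costs far less than a dense determinant, which is what makes the grid evaluation feasible for samples in the tens of thousands.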
• An example for SAR models y = ρWy + αι + Xβ + ε,
ε ∼ N(0, σ2In)
# of observations   time for log-det   time for log-marginal   total time
49                  0.0400             0.0250                  0.1000
506                 0.1205             0.0300                  0.1805
3,107               1.3965             0.0855                  1.4725
24,473              10.6905            0.6160                  12.2275
• An example for SEM models y = αι+Xβ+(In−λD)−1ε,
ε ∼ N(0, σ2In)
(grid of 0.001 over lambda)
# of observations   time for log-det   time for log-marginal   total time
49                  0.0400             0.1000                  0.1755
506                 0.1205             0.8915                  1.0365
3,107               1.3965             1.1570                  2.4535
24,473              10.6905            49.1955                 59.8110

(grid of 0.01 over lambda with spline interpolation)
# of observations   time for log-det   time for log-marginal   total time
49                  0.0400             0.0200                  0.0950
506                 0.1205             0.1050                  0.2550
3,107               1.3965             0.1255                  1.4870
24,473              10.6905            5.3070                  17.0190
Performance, n, signal/noise, spatial dependence
n = 49, r-squared = 0.9 (average over 30 trials)
Models\rho   -0.5   -0.25  -0.10  0     0.10  0.25  0.50
Pr(SAR)*     0.98   0.76   0.44   0.32  0.42  0.71  0.98
Pr(SEM)      0.02   0.24   0.56   0.68  0.58  0.29  0.02

Models\lam   -0.5   -0.25  -0.10  0     0.10  0.25  0.50
Pr(SAR)      0.14   0.20   0.31   0.29  0.29  0.38  0.17
Pr(SEM)*     0.86   0.80   0.69   0.71  0.71  0.62  0.83

n = 506, r-squared = 0.9
Models\rho   -0.5   -0.25  -0.10  0     0.10  0.25  0.50
Pr(SAR)*     1.00   0.99   0.17   0.33  0.67  1.00  1.00
Pr(SEM)      0.00   0.01   0.83   0.67  0.33  0.00  0.00

Models\lam   -0.5   -0.25  -0.10  0     0.10  0.25  0.50
Pr(SAR)      0.32   0.21   0.11   0.30  0.01  0.00  0.00
Pr(SEM)*     0.68   0.79   0.89   0.70  0.99  1.00  1.00

n = 3107, r-squared = 0.9
Models\rho   -0.5   -0.25  -0.10  0     0.10  0.25  0.50
Pr(SAR)*     1.00   1.00   1.00   0.28  1.00  1.00  1.00
Pr(SEM)      0.00   0.00   0.00   0.72  0.00  0.00  0.00

Models\lam   -0.5   -0.25  -0.10  0     0.10  0.25  0.50
Pr(SAR)      0.00   0.00   0.06   0.27  0.02  0.00  0.00
Pr(SEM)*     1.00   1.00   0.94   0.73  0.98  1.00  1.00
Weight matrix comparisons
n = 506, SAR model (averaged over 30 trials)
Models\rho   -0.5   -0.25  -0.10  0     0.10  0.25  0.50
Pr(W1)       0.00   0.00   0.01   0.11  0.00  0.00  0.00
Pr(W2)       0.00   0.00   0.07   0.12  0.08  0.00  0.00
Pr(W3)*      1.00   1.00   0.73   0.16  0.74  1.00  1.00
Pr(W4)       0.00   0.00   0.14   0.19  0.10  0.00  0.00
Pr(W5)       0.00   0.00   0.04   0.20  0.05  0.00  0.00
Pr(W6)       0.00   0.00   0.01   0.22  0.03  0.00  0.00
For the case of a finite # of homoscedastic general (SAC) models
• Bivariate integration would require 2,000 by 2,000 or 4,000,000
double precision numbers.
• A bivariate grid over ρ, λ is required for the log-determinant
terms |In − ρW| and |In − λD|.
• A smaller grid with possible spline interpolation may be
possible.
• I’m close to an MCMC solution: univariate integration over ρ
conditional on λ, and univariate integration over λ conditional
on ρ, then take an MCMC average of the log-marginal
posterior vectors.
For the case of a finite # of heteroscedastic SAR, SEM models
• An MCMC solution needs to be used.
• One advantage of this approach is that the log-marginal
posterior would come (almost) for free as a part of MCMC
estimation of these models.
• On every trip through the MCMC sampler, evaluate the log-
marginal posterior for current values of β, σ, ρ(λ) and V =
diag(v1, v2, . . . , vn). Take the MCMC average.
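Averaging per-draw marginal likelihood evaluations must happen on the levels scale, which overflows or underflows if done naively with log values this large; a log-sum-exp shift keeps the MCMC average stable. The draw values below are illustrative numbers, not from the lecture.

```python
import numpy as np

def mcmc_log_marginal_mean(log_marg_draws):
    """Stable log of the MCMC average: log( (1/m) sum_i exp(l_i) ),
    computed by shifting with the maximum before exponentiating."""
    x = np.asarray(log_marg_draws, dtype=float)
    m = x.max()
    return m + np.log(np.mean(np.exp(x - m)))

# Illustrative per-draw log-marginal values (hypothetical numbers)
draws = np.array([-1000.0, -1001.0, -999.5])
print(mcmc_log_marginal_mean(draws))
```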
• An example for heteroscedastic SAR models:
• 2,500 draws, first 500 excluded for burn-in
# of observations   time for log-det   time for log-marginal   total time
49                  0.0350             55.4300                 55.5750
506                 0.1850             89.4940                 89.7490
3,107               0.8460             270.2480                271.1895
24,473              10.8355            2118.9620               2130.4235 (35 minutes)
MC3 and BMA
For the case of an infinite # of homoscedastic SAR, SEM models
– A large literature on Bayesian model averaging over alternative
linear regression models containing differing explanatory
variables exists (Raftery, Madigan, Hoeting, 1997, Fernandez,
Ley, and Steel, 2001a).
– We introduce SAR and SEM model estimation when
uncertainty exists regarding the choice of regressors. The
Markov Chain Monte Carlo model composition (MC3)
approach introduced in Madigan and York (1995) is set forth
here for the SAR and SEM models.
– For a regression model with an intercept and k possible
explanatory variables, there are 2^k possible ways to select
regressors to be included or excluded from the model. For
k = 15, say, we have 32,768 possible models, making
computation of the log-marginal for all possible models
infeasible.
– The MC3 method of Madigan and York (1995) devises
a strategic stochastic process that can move through the
potentially large model space and samples regions of high
posterior support. This eliminates the need to consider all
models by constructing a sample from relevant parts of the
model space, while ignoring irrelevant models.
– Specifically, they construct a Markov chain M(i), i = 1, 2, . . .
with state space ℵ that has an equilibrium distribution
p(Mi|y), where p(Mi|y) denotes the posterior probability
of model Mi based on the data y.
– The Markov chain is based on a neighborhood, nbd(M) for
each M ∈ ℵ, which consists of the model M itself along with
models containing either one variable more, or one variable
less than M . The addition of an explanatory variable to the
model is often labelled a ‘birth process’ whereas deleting a
variable from the set of explanatory variables is called a ‘death
process’.
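The neighborhood nbd(M) is easy to enumerate when a model is encoded as a vector of 0/1 inclusion indicators; a minimal sketch (the encoding is an assumption for illustration):

```python
def nbd(model, k):
    """Neighborhood of model M: M itself plus every model with one
    variable more (a 'birth') or one variable less (a 'death').
    A model is a tuple of k 0/1 inclusion indicators."""
    out = [model]
    for j in range(k):
        flipped = list(model)
        flipped[j] = 1 - flipped[j]  # flip one indicator: birth or death
        out.append(tuple(flipped))
    return out

# With k = 3 candidate variables, the model including only x1:
print(nbd((1, 0, 0), 3))  # [(1, 0, 0), (0, 0, 0), (1, 1, 0), (1, 0, 1)]
```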
– A transition matrix, q, is defined by setting q(M → M′) = 0
for all M′ ∉ nbd(M) and q(M → M′) constant for all
M′ ∈ nbd(M). If the chain is currently in state M , we
proceed by drawing M ′ from q(M → M ′). This new model
is then accepted with probability:
min{1, p(M′|y)/p(M|y)} = min{1, OM′,M} (6)
– We note that the computational ease of constructing posterior
model probabilities, or Bayes factors for the case of equal prior
probabilities assigned to all candidate models, allows us to
easily construct a Metropolis-Hastings sampling scheme that
implements the MC3 method.
– A vector of the log-marginal values for the current model M
is stored during sampling along with a vector for the proposed
model M′. These are then scaled and integrated to produce
OM′,M, which is used in (6) to decide whether to accept the
new model or stay with the current model.
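A minimal sketch of the MC3 transition, with a toy log-marginal function standing in for the scaled and integrated log-marginal vectors described above (all names and numeric values here are hypothetical; the birth/death proposal is symmetric, so the acceptance ratio reduces to the posterior ratio in (6)):

```python
import numpy as np

rng = np.random.default_rng(2)

def mc3_step(model, log_marginal, k):
    """One MC3 transition: draw M' uniformly from the birth/death
    neighborhood of M, accept with probability min(1, p(M'|y)/p(M|y)),
    computed on the log scale to avoid underflow."""
    j = rng.integers(k)
    proposal = list(model)
    proposal[j] = 1 - proposal[j]        # flip one inclusion indicator
    proposal = tuple(proposal)
    log_ratio = log_marginal(proposal) - log_marginal(model)
    return proposal if np.log(rng.uniform()) < log_ratio else model

# Toy log-marginal: rewards including x1, x2 and penalizes x3, x4.
def toy_log_marginal(m):
    return 10.0 * (m[0] + m[1]) - 5.0 * (m[2] + m[3])

model, visits = (0, 0, 0, 0), {}
for _ in range(2000):
    model = mc3_step(model, toy_log_marginal, 4)
    visits[model] = visits.get(model, 0) + 1

print(max(visits, key=visits.get))  # the chain concentrates on (1, 1, 0, 0)
```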
An example for SAR models
– Generated SAR model: y = ρWy + Xβ + ε
– ε ∼ N(0, σ2In)
– with X = [X1, X2, X3]
# of unique models found = 141
# of models with probs > 0.001 = 26
# of MCMC draws = 10000

variables  x1  x2  x3  x1out  x2out  x3out  x4out  mprobs
model 1    1   1   1   1      0      1      0      0.0120
model 2    1   1   1   0      1      0      1      0.0124
model 3    1   1   1   1      0      0      1      0.0129
model 4    1   1   1   1      0      0      0      0.0150
model 5    1   1   1   0      0      1      0      0.0683
model 6    1   1   1   0      0      0      0      0.0722
model 7    1   1   1   0      1      0      0      0.0736
model 8    1   1   1   0      0      0      1      0.0773
model 9    1   1   1   1      0      0      0      0.0850
model 10   1   1   1   0      0      0      0      0.4839
freqs      26  26  26  11     11     11     11
vprobs     1.000 1.000 1.000 0.423 0.423 0.423 0.423
SAR model: y = ρWy + Xβ + ε, ε ∼ N(0, σ2In)

OLS BMA Model information (48 seconds, 20,000 draws)
# of unique models found = 61
# of models with prob > 0.001 = 21

variables  x1  x2  x3  x1out  x2out  x3out  mprobs
model 1    1   0   0   1      0      0      0.0127
model 2    1   0   0   0      1      0      0.0142
model 3    1   0   1   1      0      0      0.0180
model 4    1   0   1   0      0      1      0.0181
model 5    1   0   1   0      1      0      0.0237
model 6    1   1   0   0      0      0      0.0938
model 7*   1   1   1   0      0      0      0.1207
model 8    1   0   0   0      0      0      0.2557
model 9    1   0   1   0      0      0      0.3805
freqs      18  9   12  5      6      5
vprobs     0.857 0.429 0.571 0.238 0.286 0.238
SAR BMA Model information (300 seconds, 20,000 draws)
# of unique models found = 48
# of models with prob > 0.001 = 14

variables  x1  x2  x3  x1out  x2out  x3out  mprobs
model 1    1   1   0   0      0      0      0.0052
model 2    1   0   1   1      0      0      0.0123
model 3    1   0   1   0      0      1      0.0130
model 4    1   0   1   0      1      0      0.0178
model 5    1   1   1   1      0      0      0.0279
model 6    1   1   1   0      0      1      0.0324
model 7    1   1   1   0      1      0      0.0374
model 8    1   0   1   0      0      0      0.2562
model 9*   1   1   1   0      0      0      0.5836
freqs      13  9   12  4      4      4
vprobs     0.929 0.643 0.857 0.286 0.286 0.286
Model Averaging
– For the election dataset we compare the MC3 methodology to
posterior model probabilities for two SAR models, one based
on actual explanatory variables (a constant term, education,
homeownership, and household income) as the X matrix and
another based on this correct X matrix plus 3 random normal
vectors. Of course, these bogus explanatory variables should
not appear in the high posterior probability models identified
by the MC3 estimation methodology.
– We compared these two models first by producing posterior
model probabilities for each of these and the probability
associated with the true model without the bogus explanatory
variables was 1.0. An alternative would be to consider the
set of all 2^k possible models which arise from a set of
k = 7 explanatory variables. Since we have k = 7, there
are 2^7 = 128 possible models, so we could compute posterior
model probabilities for each of these in this small example. Our
next example based on the cross-country growth regressions
illustrates the difficulty of taking this approach in general, since
k = 16 gives 2^16 = 65,536 possible models.
– We follow Fernandez et al. (2001a) and use Zellner’s g-prior
with the value of g set to 1000·σ2, where σ2 represents an
estimate from the sar_g model with all variables included in
the X matrix. The posterior mean estimates from the sar_g
model with all variables included in the X matrix are shown
below along with the prior standard deviations for each of the
β coefficients. If one considers an interval of ±3σ around a
prior mean of zero for the β parameters used by the Zellner
g-prior, these prior standard deviations seem loose enough to
allow the sample data to determine the resulting estimates.
bhat      prior std
 0.6263   1.2724
 0.2196   0.4180
 0.4819   0.4559
-0.0989   0.4946
 0.0004   0.0670
-0.0009   0.0666
-0.0000   0.0668
Model averaging information
Model        const educ homeo income x1-bog x2-bog x3-bog probs
model 1      1     0    1     1      0      1      0      0.0108
model 2      1     0    1     1      1      0      0      0.0126
model 3      1     1    1     1      1      1      1      0.0130
model 4      1     0    1     1      0      0      1      0.0177
model 5      1     0    1     0      0      0      0      0.0247
model 6      1     1    1     1      1      1      0      0.0290
model 7      1     0    1     1      0      0      0      0.0417
model 8      1     1    1     1      0      1      1      0.0441
model 9      1     1    1     1      1      0      1      0.0498
model 10     1     1    1     1      0      1      0      0.0985
model 11     1     1    1     1      1      0      0      0.1114
model 12     1     1    1     1      0      0      1      0.1693
model 13     1     1    1     1      0      0      0      0.3774
# Occurrences 51   35   51    44     34     29     37     1.0000
Bayesian spatial autoregressive model
Dependent Variable  = voters
R-squared           = 0.4422
Rbar-squared        = 0.4417
mean of sige draws  = 0.0138
Nobs, Nvars         = 3107, 4
ndraws,nomit        = 2500, 500
total time in secs  = 18.3560
time for lndet      = 1.7220
time for sampling   = 16.4140
Pace and Barry, 1999 MC lndet approximation used
order for MC appr   = 50
iter for MC appr    = 30
numerical integration used for rho
min and max rho     = -1.0000, 1.0000
***************************************************************
Posterior Estimates
Variable      Coefficient  Std Deviation  p-level
const          0.626298     0.041466      0.000000
educ           0.220239     0.015633      0.000000
homeowners     0.481845     0.014338      0.000000
income        -0.098988     0.016475      0.000000
rho            0.588173     0.015693      0.000000
Bayesian spatial autoregressive model
Homoscedastic version
Dependent Variable  = voters
R-squared           = 0.4421
Rbar-squared        = 0.4410
mean of sige draws  = 0.0138
Nobs, Nvars         = 3107, 7
ndraws,nomit        = 2500, 500
total time in secs  = 19.0880
time for lndet      = 1.8530
time for sampling   = 17.1150
Pace and Barry, 1999 MC lndet approximation used
order for MC appr   = 50
iter for MC appr    = 30
numerical integration used for rho
min and max rho     = -1.0000, 1.0000
***************************************************************
Posterior Estimates
Variable      Coefficient  Std Deviation  p-level
const          0.626294     0.042451      0.000000
educ           0.219571     0.016212      0.000000
homeowners     0.481878     0.014427      0.000000
income        -0.098920     0.016697      0.000000
x1-bogus       0.000418     0.002156      0.412500
x2-bogus      -0.000900     0.002071      0.333000
x3-bogus      -0.000043     0.002145      0.493500
rho            0.589170     0.015343      0.000000
SAR Bayesian Model Averaging Estimates
Dependent Variable   = voters
R-squared            = 0.4338
sigma^2              = 0.0139
# unique models      = 72
Nobs, Nvars          = 3107, 7
ndraws for BMA       = 5000
ndraws for estimates = 1200
nomit for estimates  = 200
time for lndet       = 1.9130
time for BMA sampling = 75.0280
time for estimates   = 105.7220
Pace and Barry, 1999 MC lndet approximation used
order for MC appr    = 50
iter for MC appr     = 30
min and max rho      = -1.0000, 1.0000
***************************************************************
Variable      Prior Mean   Std Deviation
const          0.000000     1.272378
educ           0.000000     0.417961
homeowners     0.000000     0.455872
income         0.000000     0.494618
x1-bogus       0.000000     0.067004
x2-bogus       0.000000     0.066649
x3-bogus       0.000000     0.066815
***************************************************************
Posterior Estimates
Variable      Coefficient  Std Deviation  p-level
const          0.582201     0.018759      0.000000
educ           0.195921     0.006934      0.000000
homeowners     0.483636     0.006542      0.000000
income        -0.082365     0.007474      0.000000
x1-bogus       0.000074     0.000273      0.399000
x2-bogus      -0.000205     0.000243      0.192000
x3-bogus       0.000008     0.000407      0.508000
rho            0.600754     0.006940      0.000000
Conclusions
• Work that is done:
– For finite homoscedastic models, involving alternative W
matrices, SAR vs. SEM model specification
– For infinite homoscedastic models, involving SAR, SEM
models, MC3, BMA over alternative X′s
– For finite heteroscedastic models, involving alternative W
matrices, SAR vs. SEM model specification
• Work to be done:
– For infinite heteroscedastic models, involving SAR, SEM
models, MC3, BMA over alternative X′s
– MC3, BMA for the case of SAR, SEM, alternative W matrices
and alternative X’s
– More general models:
y = ρWy + αι + Xβ + u
u = λDu + ε
ε ∼ N(0, σ2In)
• I am trying to produce a paper/manual that describes the
theory and use of my toolbox functions for model comparison
purposes.