Lecture 4: Model Comparison
James P. LeSage
University of Toledo
Department of Economics
Toledo, OH 43606
March, 2004
We consider a class of spatial regression models introduced in
Ord (1975) and elaborated in Anselin (1988), shown in (1). The
sample observations in these models represent regions located in
space, for example counties, states, or countries.
y = ρWy + Xβ + u (1)
u = λV u + ε
where X denotes an n by k matrix of explanatory variables
as in ordinary least-squares regression, β is an associated k
by 1 parameter vector, and ε is an n by 1 disturbance vector,
which we assume takes the form ε ∼ N(0, σ2In). The
n by n matrices W and V specify the structure of spatial
dependence between observations (y) or disturbances (u), with a
common specification having elements Wij > 0 for observations
j = 1 . . . n sufficiently close (as measured by some distance
metric) to observation i. As noted above, observations reflect
geographical regions, so distances might be calculated on the
basis of the centroid of the regions/observations. The expressions
Wy and V u produce vectors that are often called spatial lags,
and ρ and λ denote scalar parameters to be estimated along with
β and σ2. Non-zero values for the scalars ρ and λ indicate that
the spatial lags exert an influence on y or u.
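As a concrete illustration of the data generating process in (1) with λ = 0, the sketch below simulates a SAR model by solving the reduced form y = (In − ρW)−1(Xβ + ε). The ring-contiguity weight matrix and all parameter values are hypothetical choices for illustration, not taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3

# Hypothetical ring-contiguity weight matrix: each region's neighbors
# are the previous and next region; rows normalized to sum to one.
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = 1.0
    W[i, (i + 1) % n] = 1.0
W /= W.sum(axis=1, keepdims=True)

# Illustrative parameter values (not from the lecture)
rho, beta, sigma = 0.5, np.array([1.0, -2.0, 0.5]), 1.0

X = rng.standard_normal((n, k))
eps = rng.normal(0.0, sigma, n)

# y = rho*W*y + X*beta + eps  =>  y = (I - rho*W)^{-1} (X*beta + eps)
y = np.linalg.solve(np.eye(n) - rho * W, X @ beta + eps)
print(y.shape)  # (100,)
```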
Comparing alternative models
We wish to compare:
1) model specifications, e.g., y = ρWy + Xβ + ε, vs.
y = Xβ + u, u = ρWu + ε
Selection of the appropriate model, SAR, SEM, SDM or
SAC has been the subject of numerous Monte Carlo studies
that examine alternative systematic or sequential specification
search approaches. Florax, Folmer and Rey (2003) provide
a review of this literature. All of these approaches have
in common maximum likelihood estimation methods in
conjunction with conventional specification tests such as the
Lagrange multiplier (LM) or likelihood ratio (LR) tests.
2) weight matrix specifications, e.g., W based on contiguity
vs. W based on nearest neighbors, distance, or a parameterized W
Trying to parameterize the weight matrix causes a problem for
maximum likelihood methods: the likelihood becomes ill-defined
when the spatial dependence parameter is zero. This poses no
problem for Bayesian methods.
3) explanatory variables, e.g., X1, X2, X3 vs. X1, X3.
Only Bayesian methods offer the potential for a comprehensive
solution here.
Current state of parameter estimation and inference in spatial econometrics
A lot of good methods/tools, each with their strengths and
weaknesses.
• Likelihood
– Strengths, inference regarding parameter dispersion is
theoretically sound, strong connection to economic theory
of production, utility, spillovers, imposes theoretical
restriction on spatial dependence parameter
– Weaknesses, slow, hard to code for large problems, not
robust to non-normal error distributions, inference regarding
dispersion is difficult in practice
• GMM
– Strengths, fast, easy to code, robust to error distribution,
theoretically sound, strong connection to economic theory
of production, utility, spillovers
– Weaknesses, doesn’t impose theoretical restriction
on spatial dependence parameter, inference regarding
dispersion is an unsettled issue
• Semi-parametric
– Strengths, robust with respect to many possible problems (e.g.
many error distributions, model specification problems),
good for prediction
– Weaknesses, throws away parsimonious structure (spatial
autoregressive), data requirements and tuning parameters
make it harder to implement in small samples, inference
difficult, weak connection to economic theory of production,
utility, spillovers
• Bayesian
– Strengths, inference regarding parameter dispersion (both
theoretical and applied), strong connection to economic
theory of production, utility, spillovers, imposes theoretical
restriction on spatial dependence parameter, works for
binary, truncated, missing, and multinomial dependent
variables, parameterized spatial weight matrices
– Weaknesses, slow, hard to code
Points of failure for non-Bayesian methods
• Likelihood, fails for parameterized weight matrices because the
likelihood ratio is ill-defined at ρ = 0.
• Likelihood, requires sequential testing for comparison of model
specifications, and reliance on a host of Monte Carlo evidence.
Boils down to parameter inference on a nested model structure.
(Changing the weight matrix or explanatory variables destroys
nesting)
• GMM, not well developed in this area. Boils down to parameter
inference on a nested model structure. (Changing the weight
matrix or explanatory variables destroys nesting)
• Semi-parametric, doesn’t wish to participate in this issue.
Doesn’t believe in a true data generating process, relying
instead on flexible functional forms. Throws away
parsimonious model structures/specifications that can be
derived from economic theory.
Current state of Bayesian model comparison in spatial econometrics
• Things that are currently available in my MATLAB spatial
econometrics toolbox, or will be available soon.
• Focus here only on spatial autoregressive/spatial error models
(ignoring other spatial estimation functions). Some of this
based on recent work with Olivier Parent.
y = ρWy + αι + Xβ + u
u = λDu + ε
ε ∼ N(0, σ2V )
V = diag(v1, v2, . . . , vn)
• Need to rely on priors (π) that are not too informative and
not too diffuse to avoid Lindley’s (1957) paradox.
Priors
• For β, Zellner’s g-prior,

πb(β|σ2) ∼ N[0, σ2 (gi XMi′XMi)^(−1)] (2)
Fernandez, Ley and Steel (2001a, 2001b) provide a theoretical
justification for use of the g-prior as well as Monte
Carlo evidence comparing nine alternative approaches to
setting the hyperparameter g. They recommend setting
gi = 1/max{n, k²Mi} for the case of least-squares based
MC3 methodology.
• For α, a diffuse prior
• For σ2, a Gamma prior, (or diffuse where ν = d = 0)
πs(σ2) ∼ G(ν, d) (3)
• For ρ, λ, either a uniform prior on the interval [−1, 1] or a
type of Beta(a, a) distribution centered on zero.

πr(ρ) = U[−1, 1] (4)

πr(ρ) = [1/Be(a, a)] (1 + ρ)^(a−1) (1 − ρ)^(a−1) / 2^(2a−1)
Prior for ρ, λ

[Figure: prior comparisons — Beta(a, a) prior densities for ρ over
[−1, 1], shown for a = 1.01, a = 1.1, and a = 2.0]
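The Beta(a, a) prior in (4) can be evaluated directly; the following sketch reproduces the densities compared in the figure for a = 1.01, 1.1 and 2.0 (the grid resolution is an arbitrary choice).

```python
import math
import numpy as np

def rho_prior(rho, a):
    """Beta(a, a) prior stretched to [-1, 1] and centered on zero:
    pi_r(rho) = (1/Be(a,a)) (1+rho)^(a-1) (1-rho)^(a-1) / 2^(2a-1)."""
    Be = math.gamma(a) ** 2 / math.gamma(2 * a)  # Beta function Be(a, a)
    return ((1 + rho) ** (a - 1) * (1 - rho) ** (a - 1)
            / (Be * 2 ** (2 * a - 1)))

grid = np.linspace(-0.99, 0.99, 199)
for a in (1.01, 1.1, 2.0):
    dens = rho_prior(grid, a)
    # each density should integrate to roughly 1 over (-1, 1)
    integral = np.sum((dens[1:] + dens[:-1]) / 2 * np.diff(grid))
    print(a, round(integral, 2))
```

With a near 1 the prior is essentially uniform; larger a concentrates mass around zero, matching the figure.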
Log marginal posteriors
• Univariate integration wrt ρ, λ for SAR and SEM models (this
problem solved)
SAR : y = ρWy + αι + Xβ + ε
SEM : y = αι + Xβ + u
u = λDu + ε
ε ∼ N(0, σ2In)
• Bivariate integration for general model (this problem NOT yet
solved)
y = ρWy + αι + Xβ + u
u = λDu + ε
ε ∼ N(0, σ2In)
• MCMC solution is needed for heteroscedastic models: (this
problem solved)
ε ∼ N(0, σ2V )
V = diag(v1, v2, . . . , vn)
Nature of the integration problem
• SAR model,
– we can analytically integrate out β, σ arriving at a log
marginal posterior:
p(ρ|y) = K2 [g/(1 + g)]^(k/2) ∫ |In − ρW| [νs2 + S(ρ) + Q(ρ)]^(−(n+ν)/2) πr(ρ) dρ

K2 = Γ(ν/2)^(−1) (νs2/2)^(ν/2) (2π)^(−n/2) 2^((n+ν)/2) Γ((n + ν)/2)
   = [Γ((n + ν)/2)/Γ(ν/2)] (νs2)^(ν/2) π^(−n/2)

Q(ρ) = [g/(g + 1)] β(ρ)′X′Xβ(ρ)
– |In − ρW |, S(ρ), Q(ρ) and πr(ρ) can be vectorized
over a grid of ρ values making integration easy.
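The vectorized grid evaluation can be sketched as follows. This is a simplified illustration with hypothetical data and a dense log-determinant; it drops the constant K2 (which does not vary with ρ, so it cancels in model comparison) and ignores the intercept treatment, so it shows the mechanics rather than the toolbox implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, nu, s2 = 50, 3, 4, 1.0
g = 1.0 / n                       # g-prior hyperparameter, g = 1/max{n, k^2}

# Ring-contiguity weight matrix, row-normalized (hypothetical)
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.5

X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
y = rng.standard_normal(n)
XtX = X.T @ X
proj = np.linalg.solve(XtX, X.T)  # maps a response to its OLS coefficients

grid = np.linspace(-0.99, 0.99, 199)
log_kernel = np.empty_like(grid)
for j, rho in enumerate(grid):
    _, logdet = np.linalg.slogdet(np.eye(n) - rho * W)  # log|I - rho*W|
    e = y - rho * (W @ y)
    b = proj @ e                          # beta(rho)
    resid = e - X @ b
    S = resid @ resid                     # S(rho): residual sum of squares
    Q = g / (g + 1) * (b @ (XtX @ b))     # Q(rho) from the g-prior
    log_kernel[j] = logdet - (n + nu) / 2 * np.log(nu * s2 + S + Q)

# Rescale before exponentiating, then integrate with the trapezoid rule
stab = np.exp(log_kernel - log_kernel.max())
marginal = np.sum((stab[1:] + stab[:-1]) / 2 * np.diff(grid))
print(marginal > 0)   # proportional to the marginal likelihood of the SAR model
```

Rescaling by the maximum log-kernel value before exponentiating is the "scaling reasons" mentioned later: without it the kernel underflows for realistic sample sizes.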
• SEM model, with y∗ = y − λDy, X∗ = X − λDX, and
πb(β|σ2) ∼ N[0, σ2C], C = (g X∗Mi′X∗Mi)^(−1),
– we can analytically integrate out β, σ arriving at a log marginal
posterior:

p(λ|y) = Γ((n − 1)/2) π^(−n/2) ∫ [|C|/|X∗′X∗ + C|]^(1/2) |In − λD| [S(λ) + Q]^(−(n−1)/2) πr(λ) dλ (5)

S(λ) = [g/(1 + g)] e∗(λ)′e∗(λ)
e∗(λ) = y∗ − X∗β − αι
Q = [1/(1 + g)] (y∗ − ȳ∗ι)′(y∗ − ȳ∗ι)
β = (X∗′X∗)^(−1) X∗′y∗
α = ȳ∗
ȳ∗ = (1/n) ι′y∗
– |In − λD|, S(λ), Q(λ) and πr(λ) can be vectorized over a
grid of λ values, but this is more computationally intensive.
Operational/implementation issues
For the case of a finite # of homoscedastic SAR, SEM models
• No need to estimate β, σ to do model comparison
• There is a need to store the vectorized log-marginal posterior
for scaling reasons. A vector of 2,000 values seems sufficient
for each model being compared.
• Using the log-determinant approximation methods of Pace and
Barry (1999) and vectorization, reasonably large samples pose
no problem.
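Pace and Barry (1999) use a Monte Carlo approximation to the log-determinant; as a simpler stand-in for illustration, a sparse LU factorization also yields log|In − ρW| over a grid of ρ values. The chain-contiguity W below is a hypothetical example, not one from the lecture.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

# Chain-contiguity weight matrix (hypothetical): neighbors are the
# adjacent regions, scaled by 1/2.
n = 500
off = np.ones(n - 1)
W = sp.diags([off, off], [-1, 1], format="csc") / 2.0
I = sp.identity(n, format="csc")

grid = np.linspace(-0.99, 0.99, 99)
logdets = np.empty_like(grid)
for j, rho in enumerate(grid):
    lu = splu((I - rho * W).tocsc())
    # |det A| = prod |diag(U)|; L has a unit diagonal and the row
    # permutation only changes the sign
    logdets[j] = np.sum(np.log(np.abs(lu.U.diagonal())))

print(logdets[len(grid) // 2])  # rho = 0 gives log|I| = 0
```

Because W is sparse, each factorization costs far less than a dense determinant, which is what makes the grid evaluation feasible for samples in the tens of thousands.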
• An example for SAR models y = ρWy + αι + Xβ + ε,
ε ∼ N(0, σ2In)
# of observations   time for log-det   time for log-marginal   total time
49                  0.0400             0.0250                  0.1000
506                 0.1205             0.0300                  0.1805
3,107               1.3965             0.0855                  1.4725
24,473              10.6905            0.6160                  12.2275
• An example for SEM models y = αι+Xβ+(In−λD)−1ε,
ε ∼ N(0, σ2In)
(grid of 0.001 over lambda)
# of observations   time for log-det   time for log-marginal   total time
49                  0.0400             0.1000                  0.1755
506                 0.1205             0.8915                  1.0365
3,107               1.3965             1.1570                  2.4535
24,473              10.6905            49.1955                 59.8110

(grid of 0.01 over lambda with spline interpolation)
# of observations   time for log-det   time for log-marginal   total time
49                  0.0400             0.0200                  0.0950
506                 0.1205             0.1050                  0.2550
3,107               1.3965             0.1255                  1.4870
24,473              10.6905            5.3070                  17.0190
Performance, n, signal/noise, spatial dependence
n = 49, r-squared = 0.9 (average over 30 trials)
Models\rho   -0.5   -0.25  -0.10  0     0.10  0.25  0.50
Pr(SAR)*     0.98   0.76   0.44   0.32  0.42  0.71  0.98
Pr(SEM)      0.02   0.24   0.56   0.68  0.58  0.29  0.02

Models\lam   -0.5   -0.25  -0.10  0     0.10  0.25  0.50
Pr(SAR)      0.14   0.20   0.31   0.29  0.29  0.38  0.17
Pr(SEM)*     0.86   0.80   0.69   0.71  0.71  0.62  0.83

n = 506, r-squared = 0.9
Models\rho   -0.5   -0.25  -0.10  0     0.10  0.25  0.50
Pr(SAR)*     1.00   0.99   0.17   0.33  0.67  1.00  1.00
Pr(SEM)      0.00   0.01   0.83   0.67  0.33  0.00  0.00

Models\lam   -0.5   -0.25  -0.10  0     0.10  0.25  0.50
Pr(SAR)      0.32   0.21   0.11   0.30  0.01  0.00  0.00
Pr(SEM)*     0.68   0.79   0.89   0.70  0.99  1.00  1.00

n = 3107, r-squared = 0.9
Models\rho   -0.5   -0.25  -0.10  0     0.10  0.25  0.50
Pr(SAR)*     1.00   1.00   1.00   0.28  1.00  1.00  1.00
Pr(SEM)      0.00   0.00   0.00   0.72  0.00  0.00  0.00

Models\lam   -0.5   -0.25  -0.10  0     0.10  0.25  0.50
Pr(SAR)      0.00   0.00   0.06   0.27  0.02  0.00  0.00
Pr(SEM)*     1.00   1.00   0.94   0.73  0.98  1.00  1.00
Weight matrix comparisons
n = 506, SAR model (averaged over 30 trials)
Models\rho   -0.5   -0.25  -0.10  0     0.10  0.25  0.50
Pr(W1)       0.00   0.00   0.01   0.11  0.00  0.00  0.00
Pr(W2)       0.00   0.00   0.07   0.12  0.08  0.00  0.00
Pr(W3)*      1.00   1.00   0.73   0.16  0.74  1.00  1.00
Pr(W4)       0.00   0.00   0.14   0.19  0.10  0.00  0.00
Pr(W5)       0.00   0.00   0.04   0.20  0.05  0.00  0.00
Pr(W6)       0.00   0.00   0.01   0.22  0.03  0.00  0.00
For the case of a finite # of homoscedastic general (SAC) models
• Bivariate integration would require 2,000 by 2,000 or 4,000,000
double precision numbers.
• A bivariate grid over ρ, λ is required for the log-determinant
terms |In − ρW| and |In − λD|.
• A smaller grid with possible spline interpolation may be
possible.
• I’m close to an MCMC solution: univariate integration over ρ
conditional on λ, and univariate integration over λ conditional
on ρ, then take an MCMC average of the log-marginal
posterior vectors.
For the case of a finite # of heteroscedastic SAR, SEM models
• An MCMC solution needs to be used.
• One advantage of this approach is that the log-marginal
posterior would come (almost) for free as a part of MCMC
estimation of these models.
• On every trip through the MCMC sampler, evaluate the log-
marginal posterior for current values of β, σ, ρ(λ) and V =
diag(v1, v2, . . . , vn). Take the MCMC average.
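Averaging per-draw marginal likelihood evaluations must happen on the levels scale, which overflows or underflows if done naively with log values this large; a log-sum-exp shift keeps the MCMC average stable. The draw values below are illustrative numbers, not from the lecture.

```python
import numpy as np

def mcmc_log_marginal_mean(log_marg_draws):
    """Stable log of the MCMC average: log( (1/m) sum_i exp(l_i) ),
    computed by shifting with the maximum before exponentiating."""
    x = np.asarray(log_marg_draws, dtype=float)
    m = x.max()
    return m + np.log(np.mean(np.exp(x - m)))

# Illustrative per-draw log-marginal values (hypothetical numbers)
draws = np.array([-1000.0, -1001.0, -999.5])
print(mcmc_log_marginal_mean(draws))
```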
• An example for heteroscedastic SAR models:
• 2,500 draws, first 500 excluded for burn-in
# of observations   time for log-det   time for log-marginal   total time
49                  0.0350             55.4300                 55.5750
506                 0.1850             89.4940                 89.7490
3,107               0.8460             270.2480                271.1895
24,473              10.8355            2118.9620               2130.4235 (35 minutes)
MC3 and BMA
For the case of an infinite # of homoscedastic SAR, SEM models
– A large literature on Bayesian model averaging over alternative
linear regression models containing differing explanatory
variables exists (Raftery, Madigan, Hoeting, 1997, Fernandez,
Ley, and Steel, 2001a).
– We introduce SAR and SEM model estimation when
uncertainty exists regarding the choice of regressors. The
Markov Chain Monte Carlo model composition (MC3)
approach introduced in Madigan and York (1995) is set forth
here for the SAR and SEM models.
– For a regression model with an intercept and k possible
explanatory variables, there are 2^k possible ways to select
regressors to be included or excluded from the model. For
k = 15, say, we have 32,768 possible models, making
computation of the log-marginal for all possible models
infeasible.
– The MC3 method of Madigan and York (1995) devises
a strategic stochastic process that can move through the
potentially large model space and samples regions of high
posterior support. This eliminates the need to consider all
models by constructing a sample from relevant parts of the
model space, while ignoring irrelevant models.
– Specifically, they construct a Markov chain M(i), i = 1, 2, . . .
with state space ℵ that has an equilibrium distribution
p(Mi|y), where p(Mi|y) denotes the posterior probability
of model Mi based on the data y.
– The Markov chain is based on a neighborhood, nbd(M) for
each M ∈ ℵ, which consists of the model M itself along with
models containing either one variable more, or one variable
less than M . The addition of an explanatory variable to the
model is often labelled a ‘birth process’ whereas deleting a
variable from the set of explanatory variables is called a ‘death
process’.
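The neighborhood nbd(M) is easy to enumerate when a model is encoded as a vector of 0/1 inclusion indicators; a minimal sketch (the encoding is an assumption for illustration):

```python
def nbd(model, k):
    """Neighborhood of model M: M itself plus every model with one
    variable more (a 'birth') or one variable less (a 'death').
    A model is a tuple of k 0/1 inclusion indicators."""
    out = [model]
    for j in range(k):
        flipped = list(model)
        flipped[j] = 1 - flipped[j]  # flip one indicator: birth or death
        out.append(tuple(flipped))
    return out

# With k = 3 candidate variables, the model including only x1:
print(nbd((1, 0, 0), 3))  # [(1, 0, 0), (0, 0, 0), (1, 1, 0), (1, 0, 1)]
```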
– A transition matrix, q, is defined by setting q(M → M′) = 0
for all M′ ∉ nbd(M) and q(M → M′) constant for all
M′ ∈ nbd(M). If the chain is currently in state M , we
proceed by drawing M ′ from q(M → M ′). This new model
is then accepted with probability:
min{1, p(M′|y)/p(M|y)} = min{1, OM′,M} (6)
– We note that the computational ease of constructing posterior
model probabilities, or Bayes factors for the case of equal prior
probabilities assigned to all candidate models, allows us to
easily construct a Metropolis-Hastings sampling scheme that
implements the MC3 method.
– A vector of the log-marginal values for the current model M
is stored during sampling along with a vector for the proposed
model M′. These are then scaled and integrated to produce
OM′,M, which is used in (6) to decide whether to accept the
new model or stay with the current model.
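A minimal sketch of the MC3 transition, with a toy log-marginal function standing in for the scaled and integrated log-marginal vectors described above (all names and numeric values here are hypothetical; the birth/death proposal is symmetric, so the acceptance ratio reduces to the posterior ratio in (6)):

```python
import numpy as np

rng = np.random.default_rng(2)

def mc3_step(model, log_marginal, k):
    """One MC3 transition: draw M' uniformly from the birth/death
    neighborhood of M, accept with probability min(1, p(M'|y)/p(M|y)),
    computed on the log scale to avoid underflow."""
    j = rng.integers(k)
    proposal = list(model)
    proposal[j] = 1 - proposal[j]        # flip one inclusion indicator
    proposal = tuple(proposal)
    log_ratio = log_marginal(proposal) - log_marginal(model)
    return proposal if np.log(rng.uniform()) < log_ratio else model

# Toy log-marginal: rewards including x1, x2 and penalizes x3, x4.
def toy_log_marginal(m):
    return 10.0 * (m[0] + m[1]) - 5.0 * (m[2] + m[3])

model, visits = (0, 0, 0, 0), {}
for _ in range(2000):
    model = mc3_step(model, toy_log_marginal, 4)
    visits[model] = visits.get(model, 0) + 1

print(max(visits, key=visits.get))  # the chain concentrates on (1, 1, 0, 0)
```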
An example for SAR models
– Generated SAR model: y = ρWy + Xβ + ε
– ε ∼ N(0, σ2In)
– with X = [X1, X2, X3]
# of unique models found = 141
# of models with probs > 0.001 = 26
# of MCMC draws = 10000

variables  x1  x2  x3  x1out  x2out  x3out  x4out  mprobs
model 1    1   1   1   1      0      1      0      0.0120
model 2    1   1   1   0      1      0      1      0.0124
model 3    1   1   1   1      0      0      1      0.0129
model 4    1   1   1   1      0      0      0      0.0150
model 5    1   1   1   0      0      1      0      0.0683
model 6    1   1   1   0      0      0      0      0.0722
model 7    1   1   1   0      1      0      0      0.0736
model 8    1   1   1   0      0      0      1      0.0773
model 9    1   1   1   1      0      0      0      0.0850
model 10   1   1   1   0      0      0      0      0.4839
freqs      26  26  26  11     11     11     11
vprobs     1.000 1.000 1.000 0.423 0.423 0.423 0.423
SAR model: y = ρWy + Xβ + ε, ε ∼ N(0, σ2In)

OLS BMA Model information (48 seconds, 20,000 draws)
# of unique models found = 61
# of models with prob > 0.001 = 21

variables  x1  x2  x3  x1out  x2out  x3out  mprobs
model 1    1   0   0   1      0      0      0.0127
model 2    1   0   0   0      1      0      0.0142
model 3    1   0   1   1      0      0      0.0180
model 4    1   0   1   0      0      1      0.0181
model 5    1   0   1   0      1      0      0.0237
model 6    1   1   0   0      0      0      0.0938
model 7*   1   1   1   0      0      0      0.1207
model 8    1   0   0   0      0      0      0.2557
model 9    1   0   1   0      0      0      0.3805
freqs      18  9   12  5      6      5
vprobs     0.857 0.429 0.571 0.238 0.286 0.238
SAR BMA Model information (300 seconds, 20,000 draws)
# of unique models found = 48
# of models with prob > 0.001 = 14

variables  x1  x2  x3  x1out  x2out  x3out  mprobs
model 1    1   1   0   0      0      0      0.0052
model 2    1   0   1   1      0      0      0.0123
model 3    1   0   1   0      0      1      0.0130
model 4    1   0   1   0      1      0      0.0178
model 5    1   1   1   1      0      0      0.0279
model 6    1   1   1   0      0      1      0.0324
model 7    1   1   1   0      1      0      0.0374
model 8    1   0   1   0      0      0      0.2562
model 9*   1   1   1   0      0      0      0.5836
freqs      13  9   12  4      4      4
vprobs     0.929 0.643 0.857 0.286 0.286 0.286
Model Averaging
– For the election dataset we compare the MC3 methodology to
posterior model probabilities for two SAR models, one based
on actual explanatory variables (a constant term, education,
homeownership, and household income) as the X matrix and
another based on this correct X matrix plus 3 random normal
vectors. Of course, these bogus explanatory variables should
not appear in the high posterior probability models identified
by the MC3 estimation methodology.
– We compared these two models first by producing posterior
model probabilities for each of these and the probability
associated with the true model without the bogus explanatory
variables was 1.0. An alternative would be to consider the
set of all 2^k possible models which arise from a set of
k = 7 explanatory variables. Since we have k = 7, there
are 2^7 = 128 possible models, so we could compute posterior
model probabilities for each of these in this small example. Our
next example based on the cross-country growth regressions
illustrates the difficulty of taking this approach in general, since
k = 16 gives 2^16 = 65,536 possible models.
– We follow Fernandez et al. (2001a) and use Zellner’s g-prior
with the value of g set to 1000·σ2, where σ2 represents an
estimate from the sar_g model with all variables included in
the X matrix. The posterior mean estimates from the sar_g
model with all variables included in the X matrix are shown
below along with the prior standard deviations for each of the
β coefficients. If one considers an interval of ±3σ around a
prior mean of zero for the β parameters used by the Zellner
g-prior, these prior standard deviations seem loose enough to
allow the sample data to determine the resulting estimates.
bhat      prior std
 0.6263   1.2724
 0.2196   0.4180
 0.4819   0.4559
-0.0989   0.4946
 0.0004   0.0670
-0.0009   0.0666
-0.0000   0.0668
Model averaging information
Model        const educ homeo income x1-bog x2-bog x3-bog probs
model 1      1     0    1     1      0      1      0      0.0108
model 2      1     0    1     1      1      0      0      0.0126
model 3      1     1    1     1      1      1      1      0.0130
model 4      1     0    1     1      0      0      1      0.0177
model 5      1     0    1     0      0      0      0      0.0247
model 6      1     1    1     1      1      1      0      0.0290
model 7      1     0    1     1      0      0      0      0.0417
model 8      1     1    1     1      0      1      1      0.0441
model 9      1     1    1     1      1      0      1      0.0498
model 10     1     1    1     1      0      1      0      0.0985
model 11     1     1    1     1      1      0      0      0.1114
model 12     1     1    1     1      0      0      1      0.1693
model 13     1     1    1     1      0      0      0      0.3774
# Occurrences 51   35   51    44     34     29     37     1.0000
Bayesian spatial autoregressive model
Dependent Variable  = voters
R-squared           = 0.4422
Rbar-squared        = 0.4417
mean of sige draws  = 0.0138
Nobs, Nvars         = 3107, 4
ndraws,nomit        = 2500, 500
total time in secs  = 18.3560
time for lndet      = 1.7220
time for sampling   = 16.4140
Pace and Barry, 1999 MC lndet approximation used
order for MC appr   = 50
iter for MC appr    = 30
numerical integration used for rho
min and max rho     = -1.0000, 1.0000
***************************************************************
Posterior Estimates
Variable      Coefficient  Std Deviation  p-level
const          0.626298     0.041466      0.000000
educ           0.220239     0.015633      0.000000
homeowners     0.481845     0.014338      0.000000
income        -0.098988     0.016475      0.000000
rho            0.588173     0.015693      0.000000
Bayesian spatial autoregressive model
Homoscedastic version
Dependent Variable  = voters
R-squared           = 0.4421
Rbar-squared        = 0.4410
mean of sige draws  = 0.0138
Nobs, Nvars         = 3107, 7
ndraws,nomit        = 2500, 500
total time in secs  = 19.0880
time for lndet      = 1.8530
time for sampling   = 17.1150
Pace and Barry, 1999 MC lndet approximation used
order for MC appr   = 50
iter for MC appr    = 30
numerical integration used for rho
min and max rho     = -1.0000, 1.0000
***************************************************************
Posterior Estimates
Variable      Coefficient  Std Deviation  p-level
const          0.626294     0.042451      0.000000
educ           0.219571     0.016212      0.000000
homeowners     0.481878     0.014427      0.000000
income        -0.098920     0.016697      0.000000
x1-bogus       0.000418     0.002156      0.412500
x2-bogus      -0.000900     0.002071      0.333000
x3-bogus      -0.000043     0.002145      0.493500
rho            0.589170     0.015343      0.000000
SAR Bayesian Model Averaging Estimates
Dependent Variable   = voters
R-squared            = 0.4338
sigma^2              = 0.0139
# unique models      = 72
Nobs, Nvars          = 3107, 7
ndraws for BMA       = 5000
ndraws for estimates = 1200
nomit for estimates  = 200
time for lndet       = 1.9130
time for BMA sampling = 75.0280
time for estimates   = 105.7220
Pace and Barry, 1999 MC lndet approximation used
order for MC appr    = 50
iter for MC appr     = 30
min and max rho      = -1.0000, 1.0000
***************************************************************
Variable      Prior Mean   Std Deviation
const          0.000000     1.272378
educ           0.000000     0.417961
homeowners     0.000000     0.455872
income         0.000000     0.494618
x1-bogus       0.000000     0.067004
x2-bogus       0.000000     0.066649
x3-bogus       0.000000     0.066815
***************************************************************
Posterior Estimates
Variable      Coefficient  Std Deviation  p-level
const          0.582201     0.018759      0.000000
educ           0.195921     0.006934      0.000000
homeowners     0.483636     0.006542      0.000000
income        -0.082365     0.007474      0.000000
x1-bogus       0.000074     0.000273      0.399000
x2-bogus      -0.000205     0.000243      0.192000
x3-bogus       0.000008     0.000407      0.508000
rho            0.600754     0.006940      0.000000
Conclusions
• Work that is done:
– For finite homoscedastic models, involving alternative W
matrices, SAR vs. SEM model specification
– For infinite homoscedastic models, involving SAR, SEM
models, MC3, BMA over alternative X′s
– For finite heteroscedastic models, involving alternative W
matrices, SAR vs. SEM model specification
• Work to be done:
– For infinite heteroscedastic models, involving SAR, SEM
models, MC3, BMA over alternative X′s
– MC3, BMA for the case of SAR, SEM, alternative W matrices
and alternative X’s
– More general models:
y = ρWy + αι + Xβ + u
u = λDu + ε
ε ∼ N(0, σ2In)
• I am trying to produce a paper/manual that describes the
theory and use of my toolbox functions for model comparison
purposes.