Partial maximum likelihood estimation of spatial probit models

Journal of Econometrics 172 (2013) 77–89


Honglin Wang a, Emma M. Iglesias b,∗, Jeffrey M. Wooldridge c

a Hong Kong Institute for Monetary Research, 55/F, Two International Finance Centre, 8 Finance Street, Central, Hong Kong

b Department of Applied Economics II, Facultad de Economía y Empresa, University of A Coruña, Campus de Elviña, 15071 A Coruña, Spain

c Department of Economics, Michigan State University, 101 Marshall-Adams Hall, East Lansing, MI 48824-1038, USA

Article info

Article history: Received 31 October 2009; received in revised form 17 February 2012; accepted 13 August 2012; available online 21 August 2012

JEL classification: C12, C13, C21, C24, C25

Keywords: Spatial statistics; Maximum likelihood; Probit model

Abstract

This paper analyzes spatial probit models for cross-sectionally dependent data in a binary choice context. Observations are divided into pairwise groups, and bivariate normal distributions are specified within each group. Partial maximum likelihood estimators are introduced and are shown to be consistent and asymptotically normal under some regularity conditions. Consistent covariance matrix estimators are also provided. Estimates of average partial effects can also be obtained once we characterize the conditional distribution of the latent error. Finally, a simulation study shows the advantages of our new estimation procedure in this setting. Our proposed partial maximum likelihood estimators are shown to be more efficient than their generalized method of moments counterparts.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

Most econometric techniques using cross-sectional data are based on the assumption of independence of the observations. When the data are outcomes measured at different geographical locations the assumption of independence is tenuous, especially as economic activities have become more and more correlated over space with the advent of modern communication and transportation improvements. Technological advances in geographic information systems (GIS) make collecting spatial data easier than ever before. Consequently, the possibility of spatial correlation among observations has received more and more attention in a wide range of fields, including regional, real estate, agricultural, environmental, and industrial organization economics (Lee, 2004).

We are grateful to two referees, the Associate Editor, the Co-Editor and participants at the 2009 Econometric Society European Meeting, the 2009 Simposio de Análisis Económico, the 2010 “Brunel Macroeconomic Research Centre”-QASS Conference on Macro and Financial Economics and at seminars at London City University, London School of Economics, Tinbergen Institute, UCL, University Carlos III, University of Essex and University of Exeter for very useful comments. Any remaining errors are our own.

∗ Corresponding author.

0304-4076/$ – see front matter © 2012 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2012.08.005

Econometricians have begun to pay more attention to spatial dependence problems in the last two decades, and there have been important advances, both theoretical and empirical.¹ The analysis of spatial data starts with an underlying spatial structure generating observed spatial correlations (Anselin and Florax, 1995). There are two popular ways of capturing spatial dependence. The first is in the domain of geostatistics, where the spatial index is continuous (Conley, 1999). The second is to assume that spatial sites form a countable lattice (Lee, 2004). Among lattice models, there are also two types of spatial dependence models that have received the bulk of the attention: the spatial autoregressive dependent variable model (SAR) and the spatial autoregressive error model (SAE). In most applications of spatial models, the dependent variables are continuous, work that has been aided by important theoretical results in Conley (1999), Lee (2004), and Kelejian and Prucha (1999, 2001). Nevertheless, there are a handful of applications that address spatial dependence with discrete choice dependent variables (Case, 1991; McMillen, 1995; Pinkse and Slade, 1998; Lesage, 2000; Beron and Vijverberg, 2003; Pinkse et al., 2006). The purpose of this paper is to advance the available estimation methods for spatially correlated binary outcomes.

¹ Anselin et al. (2004) provide a comprehensive review of econometrics for spatial models.

While the analysis can be made more general, we focus on the probit model with spatially correlated data. As is now well known, if we ignore the spatial correlation and construct a pseudo-likelihood function as if we had independent draws, the resulting pooled maximum likelihood estimator (MLE) is, under fairly general conditions, consistent and asymptotically normal, provided the marginal model is correctly specified. Poirier and Ruud (1988) established this result for time series data, and it carries over to spatial data under assumptions that restrict the amount of dependence. The main drawback to applying the pooled MLE when the observations are dependent is a loss of efficiency. Some authors, for example Robinson (1982), explicitly consider joint maximum likelihood estimation of a nonlinear model with time series data. Unfortunately, in the context of spatially correlated data, obtaining maximum likelihood estimators that account for the joint dependence in the data is computationally very demanding.

Rather than taking either extreme (ignoring the dependence in the data or trying to model full joint dependence), middle-ground approaches are possible. For example, Poirier and Ruud (1988) show how to estimate the probit model with dependence in time-series data using generalized conditional moment (GCM) estimators. These estimators are computationally attractive and relatively more efficient than ignoring serial dependence. Generally, nonlinear models with a time series dimension can be estimated by generalized method of moments (GMM). The GMM approach is (asymptotically) more efficient than just using a pooled MLE procedure. However, because time series dependence is ignored in forming moment conditions, GMM estimators still can be considerably less efficient than the joint MLE.

Similar considerations hold for spatially correlated data. Methods that only use information on the marginal distributions, such as Pinkse and Slade’s (1998) GMM estimator of the SAE probit model based on the pooled MLE first order conditions, potentially give up much in terms of efficiency compared with a full MLE approach. The motivation for the current paper is that joint MLE is often prohibitively difficult, while methods based only on marginal distributions will often be too imprecise. Therefore, we propose a middle ground between a pooled probit approach and full maximum likelihood. In particular, we choose to capture spatial dependence by assuming that sites form a countable lattice. Then, we divide the lattice into many small groups (clusters), where the clusters are formed from adjacent observations. The resulting structure is a large number of small clusters. If we can obtain the joint density of the responses within each cluster, we can improve upon methods that completely ignore the spatial dependence while arriving at estimation methods much less computationally demanding than joint MLE. We refer to our proposed method as “partial MLE” because we are only using partial joint distributions, not the entire joint distribution.

Because we model spatial correlation only within a cluster, we still need to account for spatial correlation across clusters. This feature is what distinguishes the current setting from a standard panel data setting, where independence across clusters is assumed. To obtain valid inference, we appeal to Conley (1999), who extends Newey and West (1987) to allow for data generated by a countable lattice. Conley (1999) uses metrics of economic distance to characterize dependence among agents, and shows that the GMM estimator is consistent and asymptotically normal under assumptions similar to those used for time-series data.

The rest of the paper is organized as follows. Section 2 provides a brief overview of popular spatial models with a binary response. Section 3 presents the bivariate spatial probit model. In Section 4, we prove consistency and asymptotic normality of the PML estimator (PMLE) under regularity assumptions, and discuss how to get consistent covariance matrix estimators. Section 5 presents a simulation study showing the advantages of our new estimation procedure in this setting. Finally, Section 6 concludes. The proofs are collected in Appendix A, while the results for the simulation study are provided in Appendix B.

2. Discrete choice models with spatial dependence

It is useful to begin with a brief discussion of general binary response models with spatial dependence. For a draw $i$, let $Y_i$ be a binary outcome and $X_i$ a $1 \times K$ vector of covariates. Assume that $Y_i$ is generated as

$$Y_i = 1[X_i\beta + \varepsilon_i > 0], \tag{1}$$

where $\varepsilon_i$ is an unobserved error and $\beta$ is a $K \times 1$ parameter vector to be estimated. Regardless of any dependence in the data across $i$, if $\varepsilon_i$ is independent of $X_i$, then the response probability $P(Y_i = 1|X_i)$ can be obtained if the distribution of $\varepsilon_i$ is known. In the case where $\varepsilon_i \sim \mathrm{Normal}(0, 1)$, it is well known that $P(Y_i = 1|X_i) = \Phi(X_i\beta)$, where $\Phi$ denotes the standard normal cumulative distribution function (cdf). The “marginal probability” can be used, under general assumptions, to consistently estimate $\beta$ using a pooled MLE procedure, even though the data may not be independent. This is effectively the insight of the Poirier and Ruud (1988) results for time series data.

Allowing explicitly for spatial correlation of the kind that is popular for linear models raises a couple of important issues, as recognized in Pinkse and Slade (1998). First, the variance of the error in such models typically depends on the distances among pairs of observations in the lattice, via the matrix that is used in a weighted least squares analysis. Let $W$ denote an $N \times N$ matrix of weights that are exogenous in the sense that

$$\varepsilon_i \,|\, X, W \sim \mathrm{Normal}(0, h_i(W, \lambda)), \tag{2}$$

where $h_i(W, \lambda) > 0$ is a variance function that depends on $\lambda$. The form of $h_i(\cdot)$ differs across spatial models and is not yet important. The exogeneity assumption is embodied in the requirement $E(\varepsilon_i|X, W) = 0$, which also imposes a strict exogeneity assumption on the covariates $X$.

If we maintain (2) along with (1), then $D(Y_i|X, W)$ follows a so-called heteroskedastic probit model with

$$P(Y_i = 1|X, W) = \Phi\left(\frac{X_i\beta}{\sqrt{h_i(W, \lambda)}}\right). \tag{3}$$

Under sufficient regularity conditions, mainly restricting the amount of spatial dependence, $\beta$ and $\lambda$ can be estimated consistently and $\sqrt{n}$-asymptotically normally by using a pooled heteroskedastic probit approach. The first-order conditions of this pooled problem supply the moment conditions used in the Pinkse and Slade (1998) GMM estimator.

Before we proceed further, the presence of $W$ in (3) raises a question about how we should summarize the partial effects of the elements of $X_i$ on the response probability. The notion of the average structural function (ASF), proposed by Blundell and Powell (2004) in a different context, seems useful. In the present application, the ASF is defined as

$$\mathrm{ASF}(x) = E_W\left[\Phi\left(\frac{x\beta}{\sqrt{h_i(W, \lambda)}}\right)\right]. \tag{4}$$

The average partial effects are obtained by taking changes or partial derivatives of $\mathrm{ASF}(x)$. Given consistent estimators $\hat\beta$ and $\hat\lambda$, $\mathrm{ASF}(x)$ can be (under regularity conditions) consistently estimated by

$$n^{-1}\sum_{i=1}^{n} \Phi\left(\frac{x\hat\beta}{\sqrt{h_i(W, \hat\lambda)}}\right). \tag{5}$$


See Wooldridge (2005) for further discussion of average partial effects in the context of heteroskedastic probit models.
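As a hedged illustration of the sample analogue in (5), the sketch below averages the heteroskedastic probit probabilities over the sample and approximates an average partial effect by a finite difference. The function names `asf_hat` and `ape_hat`, and the argument `h_hat` (the fitted variances, assumed to have been computed elsewhere), are introduced here for illustration and are not notation from the paper.

```python
import numpy as np
from scipy.stats import norm


def asf_hat(x, beta_hat, h_hat):
    """Sample analogue of the ASF: mean over i of Phi(x @ beta_hat / sqrt(h_i)).

    x        : (K,) covariate values at which the ASF is evaluated
    beta_hat : (K,) estimated coefficients
    h_hat    : (n,) fitted variances h_i(W, lambda_hat), assumed positive
    """
    index = float(np.dot(x, beta_hat))
    h_hat = np.asarray(h_hat, dtype=float)
    return float(np.mean(norm.cdf(index / np.sqrt(h_hat))))


def ape_hat(x, beta_hat, h_hat, k, step=1e-5):
    """Average partial effect of covariate k, via a central finite difference."""
    x_up = np.array(x, dtype=float)
    x_dn = np.array(x, dtype=float)
    x_up[k] += step
    x_dn[k] -= step
    return (asf_hat(x_up, beta_hat, h_hat) - asf_hat(x_dn, beta_hat, h_hat)) / (2 * step)
```

When every fitted variance equals one, `asf_hat` collapses to the ordinary probit probability, which is a convenient sanity check.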

We now turn to specific spatial models that have been proposed for both linear and binary responses. Written with a latent variable $Y_i^*$, with $Y_i = 1[Y_i^* > 0]$, the probit model with spatial error correlation (SAE) can be written as

$$Y_i^* = X_i\beta + \varepsilon_i, \tag{6}$$

$$\varepsilon_i = \lambda \sum_{j=1}^{n} W_{ij}\varepsilon_j + u_i. \tag{7}$$

Here the $W_{ij}$ are elements of the spatial weights matrix $W$ introduced above. The specification of $W_{ij}$ usually relies on some measure of spatial distance between observations $i$ and $j$, such as the Euclidean distance. (By convention, $W_{ii} = 0$ for all $i$.) The parameter $\lambda$ is the spatial autoregressive error coefficient, and the $u_i$ are assumed to be i.i.d. Normal(0, 1) random variables. We can write (6) and (7) in matrix form as

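$$Y^* = X\beta + \varepsilon, \tag{8}$$

$$\varepsilon = (I - \lambda W)^{-1} u, \tag{9}$$

so that the variance–covariance matrix for the $N \times 1$ vector $\varepsilon$ is

$$\Omega \equiv \mathrm{Var}(\varepsilon|X, W) = [(I - \lambda W)'(I - \lambda W)]^{-1}. \tag{10}$$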

If $Y^*$ is observable, Eqs. (8) and (9) define the linear SAE model, and its estimation and asymptotic properties using full MLE have been extensively studied by Lee (2004). Here, we only observe the binary responses. Estimating $\beta$ and $\lambda$ when $Y_i = 1[Y_i^* > 0]$ is considerably more complicated than the linear case. In fact, constructing the likelihood function requires $N$-dimensional integration of a multivariate normal distribution. We refer the reader to Lee (2004) for details.
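To make (10) concrete, the following minimal sketch computes the SAE error covariance for a small weights matrix. The function name `sae_covariance` and the three-site toy matrix are assumptions made for illustration; the sketch also assumes that λ is chosen so that I − λW is invertible.

```python
import numpy as np


def sae_covariance(W, lam):
    """Omega = [(I - lam W)'(I - lam W)]^{-1}, the error covariance in (10)."""
    W = np.asarray(W, dtype=float)
    A = np.eye(W.shape[0]) - lam * W
    return np.linalg.inv(A.T @ A)


# Toy weights for three sites on a line: neighbours get weight 1, W_ii = 0.
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
Omega = sae_covariance(W, lam=0.3)
```

Every site ends up correlated with every other site, including the two end sites that are not direct neighbours, which is why the full likelihood involves an N-dimensional integral.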

While the formulation in (10) is common, it is not the only possibility. We may prefer more of a moving average structure, such as

$$\varepsilon_i = u_i + \lambda \sum_{h \neq i} W_{ih} u_h, \tag{11}$$

where the $u_i$ are i.i.d. Normal(0, 1) random variables. This formulation is attractive because it is relatively easy to find variances and pairwise covariances (which we will use in the pseudo MLEs introduced in the next section). In particular,

$$\mathrm{Var}(\varepsilon_i|W) = 1 + \lambda^2 \sum_{h \neq i} W_{ih}^2 \tag{12}$$

and

$$\mathrm{Cov}(\varepsilon_i, \varepsilon_j|W) = \lambda W_{ij} + \lambda W_{ji} + \lambda^2 \sum_{h \neq i,\, h \neq j} W_{ih} W_{jh}. \tag{13}$$

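Notice that if we use only $\mathrm{Var}(\varepsilon_i|W)$ in a pooled analysis we would have to take $\lambda > 0$: the variance in (12) depends on $\lambda$ only through $\lambda^2$, so the sign of $\lambda$ cannot be recovered from the variances alone.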
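The element-wise expressions (12) and (13) can be checked against the matrix form of the moving average model, ε = (I + λW)u with a zero diagonal in W, whose covariance is (I + λW)(I + λW)'. The sketch below is only an illustration under that assumption; the function names are hypothetical.

```python
import numpy as np


def ma_variance(W, lam, i):
    """Var(eps_i | W) from (12): 1 + lam^2 * sum over h != i of W_ih^2."""
    row = np.delete(W[i], i)
    return 1.0 + lam**2 * np.sum(row**2)


def ma_covariance(W, lam, i, j):
    """Cov(eps_i, eps_j | W) from (13), for i != j."""
    keep = [h for h in range(W.shape[0]) if h not in (i, j)]
    cross = np.sum(W[i, keep] * W[j, keep])
    return lam * W[i, j] + lam * W[j, i] + lam**2 * cross


# Compare with the matrix form on a random weights matrix with zero diagonal.
rng = np.random.default_rng(0)
W = rng.random((5, 5))
np.fill_diagonal(W, 0.0)
lam = 0.4
full = (np.eye(5) + lam * W) @ (np.eye(5) + lam * W).T
```

The variance is unchanged when λ is replaced by −λ, whereas the covariance is not, which is the sense in which the covariances carry the information about the sign of λ.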

3. Using partial MLEs to estimate general spatial probit models

As mentioned earlier, estimating a probit spatial autocorrelation model by full MLE is a prodigious task. Available approaches include the EM algorithm (McMillen, 1992), the RIS simulator (Beron and Vijverberg, 2003), and the Bayesian Gibbs sampler (Lesage, 2000). But each of these approaches is computationally burdensome, making it very difficult to conduct simulation studies or to quickly estimate a range of models.

Fig. 1. 2n observations =⇒ n groups.

Below we discuss univariate and bivariate probit approaches to estimation. Both of these are computationally much simpler than joint MLE.

3.1. Univariate probit partial MLE

If we use only the information in the marginal distributions $P(Y_i = 1|X, W)$, the approach taken by Pinkse and Slade (1998), then we are led to a partial (or pooled) log likelihood function of the form

$$L = \sum_{i=1}^{n} \Big\{ Y_i \log[\Phi(X_i\beta/\sigma_i(\lambda))] + (1 - Y_i)\log[1 - \Phi(X_i\beta/\sigma_i(\lambda))] \Big\}, \tag{14}$$

where $\sigma_i(\lambda)$ is shorthand for $\sqrt{\mathrm{Var}(\varepsilon_i|W)}$. Assuming that $\beta$ and $\lambda$ are identified, and that the conditions below in Section 4 hold, the pooled heteroskedastic probit is generally consistent and $\sqrt{n}$-asymptotically normal. But, for reasons we discussed above, it is likely to be very inefficient relative to the full MLE. While Pinkse and Slade’s GMM estimator can help a little, estimators that use some information on the spatial correlation across observations seem more promising in terms of increasing precision.
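A minimal sketch of the pooled log likelihood (14) follows. It takes the standard deviations σ_i(λ) as an input vector, so the mapping from λ and W to σ_i is left to the chosen spatial model; the function name and argument layout are assumptions made for illustration. The identity 1 − Φ(z) = Φ(−z) is used so that the logarithm is evaluated stably.

```python
import numpy as np
from scipy.stats import norm


def pooled_het_probit_loglik(beta, sigma, y, X):
    """Pooled heteroskedastic probit log likelihood, as in (14).

    beta  : (K,) coefficient vector
    sigma : (n,) standard deviations sigma_i(lambda), assumed positive
    y     : (n,) binary outcomes
    X     : (n, K) covariates
    """
    z = (X @ beta) / sigma
    # log[1 - Phi(z)] is computed as log Phi(-z).
    return float(np.sum(y * norm.logcdf(z) + (1 - y) * norm.logcdf(-z)))
```

In practice the negative of this function could be handed to a numerical optimizer over (β, λ), with `sigma` recomputed from λ at each evaluation.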

3.2. Bivariate probit partial MLE

We now turn to using information on pairs of “nearby” observations to identify and estimate $\beta$ and $\lambda$. There is nothing special about using pairs; we could use, say, triplets, or even larger groups. But the bivariate case is easy to illustrate and is computationally quite feasible.

For illustration, assume a sample includes 2n observations, and we divide the 2n observations into n pairwise groups according, for example, to the spatial Euclidean distance between them (see Fig. 1). In other words, each group includes two observations, with the idea being that the internal correlation between the two observations is more important than external correlations with observations in other groups. Of course, the way that we group the observations will affect the asymptotic variance of our procedure. In practice, we recommend trying different groupings and checking whether the variance estimates fall significantly across them. One could, after obtaining estimators from several groupings, apply an efficient minimum distance procedure (for example Wooldridge, 2010, Chapter 14) to obtain a single estimator.
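One simple way to form pairs from coordinates is a greedy rule that repeatedly matches the two closest observations that are still unpaired. This is only one possible grouping, offered as a hedged sketch; the paper does not prescribe a specific algorithm, and the function name `greedy_pairs` is hypothetical. A greedy rule does not in general minimize the total within-pair distance.

```python
import numpy as np


def greedy_pairs(coords):
    """Pair observations greedily by increasing Euclidean distance.

    Returns a list of (i, j) index pairs with i < j. With an odd number of
    observations, one observation is left unpaired.
    """
    coords = np.asarray(coords, dtype=float)
    n_obs = coords.shape[0]
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    # Candidate pairs (i < j), sorted from closest to farthest.
    i_idx, j_idx = np.triu_indices(n_obs, k=1)
    order = np.argsort(dist[i_idx, j_idx], kind="stable")
    free = set(range(n_obs))
    pairs = []
    for pos in order:
        i, j = int(i_idx[pos]), int(j_idx[pos])
        if i in free and j in free:
            pairs.append((i, j))
            free -= {i, j}
    return pairs
```

Running the routine with different tie-breaking or a different distance would produce the alternative groupings that the text suggests comparing.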

Let $Y_g^* = (Y_{g1}^*, Y_{g2}^*)$ be the bivariate vector of latent outcomes for group $g$, and assume for notational simplicity that we have 2n observations. Write the linear equations for group $g$ as

$$Y_{g1}^* = X_{g1}\beta + \varepsilon_{g1}, \tag{15}$$

$$Y_{g2}^* = X_{g2}\beta + \varepsilon_{g2}, \tag{16}$$

where $X_{g1}$ and $X_{g2}$ are $1 \times K$ vectors of regressors and $\beta$ is a $K \times 1$ vector; $\varepsilon_{g1}$ and $\varepsilon_{g2}$ are scalars. These two equations look like a two-period panel data model, but we must recognize that the variances and covariance depend on the weighting matrix in the underlying spatial model. Not only are $\varepsilon_{g1}$ and $\varepsilon_{g2}$ correlated with each other, but they are also correlated with the errors in other groups. Therefore, the variances and covariance of $\varepsilon_{g1}$ and $\varepsilon_{g2}$ depend not only on the weights within the group, but also on the weights linking them to observations outside the group.

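By assumption, $E(\varepsilon_{g1}|X_{g1}, W) = E(\varepsilon_{g2}|X_{g2}, W) = 0$. Write the $2 \times 2$ variance matrix as

$$\mathrm{Var}(\varepsilon_g|X_g, W) \equiv \Omega_g(W, \lambda) = \begin{pmatrix} \Omega_{g11} & \Omega_{g12} \\ \Omega_{g21} & \Omega_{g22} \end{pmatrix}, \tag{17}$$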

where we often suppress the dependence of $\Omega_g(W, \lambda)$ on $W$ and $\lambda$ in what follows. The variance terms are the same as in the Pinkse and Slade (1998) approach. To implement our procedure, the covariance must also be computed; exploiting this correlation is the source of improving the precision of the estimates of $\beta$ and $\lambda$.

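Let $Y_{g1}$ and $Y_{g2}$ be the binary outcomes associated with group $g$. Under bivariate normality of the errors, the joint conditional probability of $Y_{g1}$ and $Y_{g2}$ given $X_g$ (and $W$) is

$$
\begin{aligned}
P(Y_{g1} = 1, Y_{g2} = 1|X_g) &= P(X_{g1}\beta + \varepsilon_{g1} > 0,\ X_{g2}\beta + \varepsilon_{g2} > 0\,|\,X_g) && \text{(18)} \\
&= P(\varepsilon_{g1} \le X_{g1}\beta,\ \varepsilon_{g2} \le X_{g2}\beta\,|\,X_g) \\
&= \Phi_2\left(\frac{X_{g1}\beta}{\sqrt{\Omega_{g11}}},\ \frac{X_{g2}\beta}{\sqrt{\Omega_{g22}}},\ \rho_g\right), && \text{(19)}
\end{aligned}
$$

$$\rho_g = \frac{\mathrm{Cov}(\varepsilon_{g1}, \varepsilon_{g2})}{\sqrt{\mathrm{Var}(\varepsilon_{g1})\mathrm{Var}(\varepsilon_{g2})}} = \frac{\Omega_{g12}}{\sqrt{\Omega_{g11}\Omega_{g22}}}, \tag{20}$$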

where $\Phi_2$ is the standard bivariate normal distribution function, $\phi_2$ is the standard bivariate normal density function, and $\rho_g$ is the correlation (the standardized covariance) between the two error terms. The second equality uses the symmetry of the mean-zero normal distribution. Estimation in this context is similar to “random effects” probit with two “time periods” and n observations. The difference is that we have system heteroskedasticity in $\mathrm{Var}(\varepsilon_g|X_g, W)$ and spatial correlation across $g$.
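The four joint probabilities for a pair can be sketched numerically from (19) and (20) using standard identities for the bivariate normal cdf. The expressions for the three remaining cells below are those standard identities and are not copied from Appendix A; the function name `cell_probabilities` and its arguments are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal


def cell_probabilities(xb1, xb2, omega):
    """Return {(y1, y2): P(Y_g1 = y1, Y_g2 = y2 | X_g)} for one pair.

    xb1, xb2 : the indices X_g1 beta and X_g2 beta
    omega    : 2 x 2 covariance matrix of (eps_g1, eps_g2)
    """
    omega = np.asarray(omega, dtype=float)
    a = xb1 / np.sqrt(omega[0, 0])
    b = xb2 / np.sqrt(omega[1, 1])
    rho = omega[0, 1] / np.sqrt(omega[0, 0] * omega[1, 1])
    # P(1, 1) is the standard bivariate normal cdf at (a, b) with correlation rho.
    p11 = float(multivariate_normal.cdf([a, b], mean=[0.0, 0.0],
                                        cov=[[1.0, rho], [rho, 1.0]]))
    p10 = float(norm.cdf(a)) - p11
    p01 = float(norm.cdf(b)) - p11
    p00 = 1.0 - p11 - p10 - p01
    return {(1, 1): p11, (1, 0): p10, (0, 1): p01, (0, 0): p00}
```

The bivariate cdf is evaluated numerically, so the results carry a small integration error.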

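Obtaining the joint probabilities within the group is not difficult, and is most easily done by finding marginal and conditional probabilities. Given that $(\varepsilon_{g1}, \varepsilon_{g2})$ has a joint normal distribution, we can write

$$\varepsilon_{g1} = \delta_{g1}\varepsilon_{g2} + e_{g1}, \tag{21}$$

where

$$\delta_{g1} = \frac{\mathrm{Cov}(\varepsilon_{g1}, \varepsilon_{g2})}{\mathrm{Var}(\varepsilon_{g2})}, \tag{22}$$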

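and $e_{g1}$ is independent of $X_g$ and $\varepsilon_{g2}$. Because of the joint normality of $(\varepsilon_{g1}, \varepsilon_{g2})$, $e_{g1}$ is also normally distributed with $E(e_{g1}) = 0$, and

$$\mathrm{Var}(e_{g1}) = \mathrm{Var}(\varepsilon_{g1}) - \delta_{g1}^2\,\mathrm{Var}(\varepsilon_{g2}). \tag{23}$$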

Thus, we can write

$$P(Y_{g1} = 1|X_g, \varepsilon_{g2}) = \Phi\left(\frac{X_{g1}\beta + \delta_{g1}\varepsilon_{g2}}{\sqrt{\mathrm{Var}(e_{g1})}}\right). \tag{24}$$

Once we obtain (24), we can retrieve explicit expressions for $P(Y_{g1} = 1, Y_{g2} = 1|X_g)$, $P(Y_{g1} = 0, Y_{g2} = 1|X_g)$, $P(Y_{g1} = 1, Y_{g2} = 0|X_g)$ and $P(Y_{g1} = 0, Y_{g2} = 0|X_g)$, as given in Appendix A. These properties form the basis of a partial MLE with two responses per group. We now turn to the asymptotic properties of this estimator.
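As a small worked example of (22) to (24), the sketch below computes the projection coefficient, the residual variance, and the conditional response probability from a given 2 × 2 covariance matrix. The function names and the numerical values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm


def conditional_pieces(omega):
    """delta_g1 from (22) and Var(e_g1) from (23) for a 2 x 2 covariance matrix."""
    omega = np.asarray(omega, dtype=float)
    delta = omega[0, 1] / omega[1, 1]
    var_e = omega[0, 0] - delta**2 * omega[1, 1]
    return delta, var_e


def prob_y1_given_eps2(xb1, eps2, omega):
    """P(Y_g1 = 1 | X_g, eps_g2) as in (24)."""
    delta, var_e = conditional_pieces(omega)
    return float(norm.cdf((xb1 + delta * eps2) / np.sqrt(var_e)))


# Example: Var(eps_g1) = 2, Var(eps_g2) = 1.5, Cov = 0.6.
delta, var_e = conditional_pieces([[2.0, 0.6], [0.6, 1.5]])
```

Here δ = 0.6/1.5 = 0.4 and Var(e) = 2 − 0.4² × 1.5 = 1.76, so conditioning on the neighbour's error reduces the relevant variance from 2 to 1.76.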

4. Asymptotic properties of the partial MLEs

In the context of panel data and also cluster samples, Wooldridge (2010, Chapters 13 and 20) discusses partial MLE methods. These PMLEs apply to pooled log likelihoods where some dependence (across time or within cluster) is ignored in estimation. The asymptotic theory in Wooldridge (2010) is straightforward because observations are assumed to be independent across groups (and the group sizes are fixed, as we assume here). In the present setting, we still have correlation across all clusters due to the spatial nature of the data. But the arguments for how partial MLE identifies parameters and generally has desirable asymptotic properties are essentially unchanged from the standard case. Nevertheless, the details of showing that the groups are sufficiently weakly dependent are not simple, and estimating the asymptotic variance matrix requires some care.

If we let $Y_g$ be the $2 \times 1$ vector of observed responses for group $g$, the partial log likelihood has the form

$$L = \sum_{g=1}^{n} \log f_g(Y_g|X_g, W, \theta), \tag{25}$$

where $f_g(y_g|X_g, W, \theta)$ is the density of $Y_g$ given $X_g$ (and we again assume there are 2n total observations). Because this conditional density is correctly specified, the Kullback–Leibler inequality applied for each $g$ implies that the partial log-likelihood function generally identifies $\theta_0$ (Wooldridge, 2010, Chapter 13). Of course, we would need to assume or otherwise show the uniqueness of $\theta_0$.

The general PMLE results apply to the spatial probit model if we have correctly specified the bivariate normal densities $\phi_{2g}(Y_{g1}, Y_{g2}|X_g, W, \theta)$. To ensure correct specification, we must properly obtain the $2 \times 2$ conditional variance–covariance matrix of $(\varepsilon_{g1}, \varepsilon_{g2})'$. This is where the underlying spatial model comes in. We must also take care in restricting the spatial dependence in the data so that the standardized sum in (25) satisfies the usual limit laws. To ensure weak dependence, we assume that the spatial process is strong mixing, which means the grouped observations form a strong mixing sequence, too. The asymptotic approximations we use are based on the thought experiment that the geographic area is increasing in size.

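To facilitate asymptotic analysis, write the partial log likelihood function as

$$
\begin{aligned}
L = \sum_{g=1}^{n} \Big\{ & Y_{g1}Y_{g2} \log P(Y_{g1} = 1, Y_{g2} = 1|X_g) \\
& + Y_{g1}(1 - Y_{g2}) \log P(Y_{g1} = 1, Y_{g2} = 0|X_g) \\
& + (1 - Y_{g1})Y_{g2} \log P(Y_{g1} = 0, Y_{g2} = 1|X_g) \\
& + (1 - Y_{g1})(1 - Y_{g2}) \log P(Y_{g1} = 0, Y_{g2} = 0|X_g) \Big\}
\end{aligned} \tag{26}
$$

and for the sake of brevity define

$$P_g(1, 1) \equiv \log P(Y_{g1} = 1, Y_{g2} = 1|X_g); \quad P_g(1, 0) \equiv \log P(Y_{g1} = 1, Y_{g2} = 0|X_g); \tag{27}$$

$$P_g(0, 1) \equiv \log P(Y_{g1} = 0, Y_{g2} = 1|X_g) \quad \text{and} \quad P_g(0, 0) \equiv \log P(Y_{g1} = 0, Y_{g2} = 0|X_g). \tag{28}$$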


Therefore, we can rewrite the partial log likelihood (PLL) as

$$
L = \sum_{g=1}^{n} \Big\{ Y_{g1}Y_{g2}P_g(1, 1) + Y_{g1}(1 - Y_{g2})P_g(1, 0) + (1 - Y_{g1})Y_{g2}P_g(0, 1) + (1 - Y_{g1})(1 - Y_{g2})P_g(0, 0) \Big\}. \tag{29}
$$
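The sum in (29) is straightforward to evaluate once the four cell probabilities are available for every group. The sketch below takes those probabilities as given arrays (computed by whatever bivariate probit routine is in use); the function name and argument layout are assumptions made for illustration, and all probabilities are assumed strictly positive, in the spirit of condition (v) of Theorem 1 below.

```python
import numpy as np


def partial_loglik(y1, y2, p11, p10, p01, p00):
    """Partial log likelihood (29) from per-group outcomes and cell probabilities.

    y1, y2             : (n,) binary outcomes of the two members of each group
    p11, p10, p01, p00 : (n,) joint probabilities of (1,1), (1,0), (0,1), (0,0),
                         assumed strictly positive
    """
    y1 = np.asarray(y1, dtype=float)
    y2 = np.asarray(y2, dtype=float)
    terms = (y1 * y2 * np.log(p11)
             + y1 * (1 - y2) * np.log(p10)
             + (1 - y1) * y2 * np.log(p01)
             + (1 - y1) * (1 - y2) * np.log(p00))
    return float(np.sum(terms))
```

If the two members of each pair were independent, the cell probabilities would factor and this function would reduce to the pooled log likelihood over all 2n observations.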

The PLL in (29) is the simplest way one might exploit spatial correlation in pairs of observations. One possibility is to expand the group sizes and shrink the number of groups, although expanding the group size makes the computational problem harder (because the dimension of $Y_g$ grows). Previously we mentioned the possibility of using several different pairings and using minimum distance estimation. A related possibility is to estimate $(\beta, \lambda)$ by pooling pseudo log likelihoods across multiple partitions of the data. Suppose we settle on $J$ partitions of the data into n groups of two observations. Let $i_{jg1}$ and $i_{jg2}$ denote the observation indices of the first and second members of group $g$ in partition $j = 1, \ldots, J$. Then we can form a PLL as

$$\sum_{j=1}^{J} \sum_{g=1}^{n} \log f_{i_{jg1}, i_{jg2}}(\beta, \lambda), \tag{30}$$

where $\log f_{i_{jg1}, i_{jg2}}(\beta, \lambda)$ has the same form as a group's contribution to the log likelihood given in (29). This estimator will be a bit more computationally demanding than the one that we propose explicitly in this paper, but it will be more efficient. In this paper our asymptotic analysis is restricted to the case of a single partition of the data.

4.1. Consistency of bivariate probit estimation

In this section, to make the asymptotic arguments formal, we distinguish between the true value, $\theta_0$, and a generic parameter value $\theta$. We establish conditions under which the PMLE introduced above is weakly consistent, that is, $\hat\theta \xrightarrow{p} \theta_0$ as $n \to \infty$.

The objective function for the bivariate probit PMLE, standardized by $n^{-1}$, is

$$
Q_n(\theta) \equiv n^{-1} \sum_{g=1}^{n} \Big\{ Y_{g1}Y_{g2}P_g(1, 1) + Y_{g1}(1 - Y_{g2})P_g(1, 0) + (1 - Y_{g1})Y_{g2}P_g(0, 1) + (1 - Y_{g1})(1 - Y_{g2})P_g(0, 0) \Big\}, \tag{31}
$$

and $\hat\theta$ maximizes $Q_n(\theta)$ over the parameter space $\Theta$. Remember that this objective function represents a partial log likelihood: we are only using information on the conditional distributions $D(Y_{g1}, Y_{g2}|X, W)$ and not $D(Y_1, Y_2, \ldots, Y_n|X, W)$, as in a full maximum likelihood setting.

The identification condition essentially requires that the limit of $E[Q_n(\theta)]$ is uniquely maximized at the true value $\theta_0$. From the argument described earlier, the only issue is whether $\theta_0$ is unique. Define the limiting function as

$$Q(\theta) \equiv \lim_{n \to \infty} E[Q_n(\theta)].$$

Then $\theta_0$ will uniquely maximize $Q(\theta)$ in well-specified models when there is not perfect collinearity among the regressors or some other degeneracy. It can require some care in parameterizing the spatial autocorrelation, but standard models of spatial autocorrelation cause no problems. As in Pinkse and Slade (1998), we assume uniqueness in all our analysis.

The following Theorem 1 states the main consistency result for a broad class of spatial probit models.

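Theorem 1. If (i) $\theta_0$ is in the interior of a compact set $\Theta$, which is the closure of a concave set, (ii) $Q$ attains a unique maximum over the compact set $\Theta$ at $\theta_0$, (iii) $Q$ is continuous on $\Theta$, (iv) the density of observations in any region whose area exceeds a fixed minimum is bounded, (v) as $n \to \infty$,

$$
\sup_{1 \le g \le n} \left[ \frac{1}{P(Y_{g1} = 1, Y_{g2} = 1|X_g)} + \frac{1}{P(Y_{g1} = 1, Y_{g2} = 0|X_g)} + \frac{1}{P(Y_{g1} = 0, Y_{g2} = 1|X_g)} + \frac{1}{P(Y_{g1} = 0, Y_{g2} = 0|X_g)} \right] < \infty,
$$

(vi) as $n \to \infty$, $\sup_g (\lVert X_g \rVert + \lVert Y_g \rVert) = O(1)$, (vii) $\sup_{n,g,j} |\mathrm{Cov}(Y_{gi}, Y_{ji})| \le \alpha(d_{gj})$, $i = 1, 2$, where $d_{gj}$ denotes the distance between groups $g$ and $j$, and $\alpha(d) \to 0$ as $d \to \infty$, (viii) $\lim_{n \to \infty} E[Q_n(\theta)]$ exists, (ix) $\sup_g \lVert W_g \rVert < \infty$, then $\hat\theta - \theta_0 = o_p(1)$.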

Proof. Given in Appendix A.

Condition (i) is a standard assumption for optimization estimators. Condition (ii) is the identification condition for MLE. Condition (iii) assumes that the function $Q$ is continuous in the metric space, which is a reasonable assumption and necessary for the proof that $Q_n(\theta)$ is stochastically equicontinuous. Condition (iv) simply rules out an infinite number of observations crowding into one bounded area. The minimum area restriction is imposed because an infinitesimal area around a single observation has infinite density. Condition (v) makes sure that each of the four outcome combinations will be present in a sufficiently large sample in our bivariate probit structure. Condition (vi) makes sure the regressors are deterministic and uniformly bounded, which is not a strong assumption in this literature. Condition (vii) is the key assumption for this theorem, and it requires that the dependence among groups decays sufficiently quickly as the distance between groups grows. This assumption borrows the concept of α-mixing to define the rate at which dependence decreases as distance increases. Condition (viii) assumes the limit of $E[Q_n(\theta)]$ exists as $n \to \infty$, which is not a strong assumption. Condition (ix) is actually implied by the rule for dividing groups, which just excludes two groups being in exactly the same location. An important remark is that the assumptions in Theorem 1 allow for general types of spatial dependence, such as those given in (7) and (11), and higher order spatial error lags. Moreover, for simplicity, we focus on the setting of a bivariate probit with a likelihood function as given in (31). However, the results of our theorems can be easily generalized beyond bivariate dependence, at the expense of more complex notation, provided that we extend assumptions such as (v) to allow for a finite number of observations inside each group.

4.2. Asymptotic normality

Our proof of asymptotic normality must recognize the spatial dependence in the scores of the partial log likelihood. To deal with general dependence problems, a common approach in the literature is to use the so-called “Bernstein sums”, which break up $S_n$ into blocks (partial sums). This is the approach we take here. Each block must be so large, relative to the rate at which the memory of the sequence decays, that the degree to which the next block can be predicted from current information is negligible. At the same time, the number of blocks must increase with n so that the CLT argument can be applied to this derived sequence (Davidson, 1994).

In this section, we show under what assumptions we are able to apply McLeish’s (1974) central limit theorem to spatial dependence cases to get asymptotic normality for the spatial probit estimator. This is presented in the following theorem. $A^T$ denotes the transpose of matrix $A$. Define the score of the objective function as

$$S_n(\theta) \equiv \frac{\partial Q_n}{\partial \theta}(\theta). \tag{32}$$

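Theorem 2. If the assumptions of Theorem 1 hold, and in addition: (i) as $d \to \infty$, $d^2\,\alpha(d d^*)/\alpha(d^*) = o(1)$ for all fixed $d^* > 0$, (ii) the sampling area grows uniformly at a rate of $\sqrt{n}$ in two non-opposing directions, (iii) $B(\theta_0) \equiv \lim_{n \to \infty} E[n S_n(\theta_0) S_n^T(\theta_0)]$ and $A(\theta_0) \equiv \lim_{n \to \infty} -E[H_n(\theta_0)]$ are positive definite matrices. Then

$$\sqrt{n}(\hat\theta - \theta_0) \xrightarrow{d} N[0, A(\theta_0)^{-1} B(\theta_0) A(\theta_0)^{-1}],$$

where $S_n(\theta_0) \equiv \frac{\partial Q_n}{\partial \theta}(\theta_0)$ and $H_n(\theta_0) = \frac{\partial^2 Q_n}{\partial \theta\, \partial \theta^T}(\theta_0)$.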

Proof. Given in Appendix A.

Condition (i) is stronger than condition (vii) in Theorem 1, and it is also stronger than the usual condition for time series data, because spatially dependent data are correlated along more dimensions than time series data. It describes how dependence decays as the distance between groups grows, and requires the decay to be fast enough. As stated in Pinkse and Slade (1998, p. 134), “an Euclidean-weighting scheme does not satisfy condition (i). (i) also implies that α(d) is positive. However, since α(d) is an upper bound, the possibility that covariances do not decline monotonically is not excluded”. Condition (ii) just repeats the assumption in Bernstein’s blocking method; requiring two non-opposing directions rules out a sampling area that grows only along two parallel directions, which would not make much sense in the spatially dependent case. The conditions in (iii) are natural conditions on the matrices, which are implied by the previous assumptions. The matrices can be merely semidefinite only in extreme situations such as $P(Y_{g1} = 1, Y_{g2} = 1|X_g) = 0$, which are excluded by the previous assumptions.
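Given estimates of the two matrices in Theorem 2, the implied variance estimate for the estimator has the familiar sandwich form. The sketch below is a minimal illustration; the function name and the assumption that estimates `A_hat` and `B_hat` are already available are introduced here.

```python
import numpy as np


def sandwich_variance(A_hat, B_hat, n):
    """Approximate Var(theta_hat) = A^{-1} B A^{-1} / n implied by Theorem 2.

    A_hat : (q, q) estimate of A(theta_0), assumed invertible
    B_hat : (q, q) estimate of B(theta_0)
    n     : number of groups
    """
    A_inv = np.linalg.inv(np.asarray(A_hat, dtype=float))
    return A_inv @ np.asarray(B_hat, dtype=float) @ A_inv / n


def standard_errors(A_hat, B_hat, n):
    """Square roots of the diagonal of the sandwich variance."""
    return np.sqrt(np.diag(sandwich_variance(A_hat, B_hat, n)))
```

Because a partial likelihood ignores dependence across groups, B generally differs from A, so the usual simplification to the inverse of A is not available.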

4.3. Estimation of variance–covariance matrices

Consistent estimation of the asymptotic covariance matrix is important for the construction of asymptotic confidence intervals and hypothesis tests. Estimation of $A(\theta_0)$ is relatively easy, as we can use the sample average of the negative Hessian and replace $\theta_0$ with $\hat\theta$. Or, we can use a version based on a conditional expectation; see Wooldridge (2010, Chapter 13). Estimation of $B(\theta_0)$ is substantially more difficult when there is dependence in the data, especially spatial dependence. Newey and West (1987) is the most commonly used approach for pure time series problems; Andrews (1991) established the consistency of kernel HAC (heteroskedasticity and autocorrelation consistent) estimators under fairly general conditions. But we need an approach that allows for two-dimensional correlation.

Pinkse and Slade (1998) showed that, under conditions similar to those that imply asymptotic normality, $B_n(\hat\theta) \to B(\theta_0)$ as $n \to \infty$, where $B_n(\theta_0) \equiv nE[S_n(\theta_0) S_n^T(\theta_0)]$ (see Lemma 8 in Appendix A). Unfortunately, Pinkse and Slade’s estimator is feasible only if we can get closed form expressions for $E[S_n(\theta_0) S_n^T(\theta_0)]$, something that is very difficult. Instead, we follow an approach proposed by Conley (1999).

A feasible way to obtain a consistent estimate of a variance–covariance matrix that allows for a wider range of dependence is to apply the approach of Conley (1999). To this end, let $\Xi_\Lambda$ be the σ-algebra generated by a given random field $\psi_{s_m}$, $s_m \in \Lambda$, with $\Lambda$ compact, and let $|\Lambda|$ be the number of $s_m \in \Lambda$. Let $\Upsilon(\Lambda_1, \Lambda_2)$ denote the minimum Euclidean distance from an element of $\Lambda_1$ to an element of $\Lambda_2$. There also exists a regular lattice index random field $W_s^*$ that is equal to one if location $s \in \mathbb{Z}^2$ is sampled and zero otherwise. $W_s^*$ is assumed to be independent of the underlying random field, to have a finite expectation, and to be stationary. The strong mixing coefficients are defined as

$$
\alpha_{k,l}(n) \equiv \sup \big\{ |P(A \cap B) - P(A)P(B)| : A \in \Xi_{\Lambda_1},\ B \in \Xi_{\Lambda_2},\ |\Lambda_1| \le k,\ |\Lambda_2| \le l,\ \Upsilon(\Lambda_1, \Lambda_2) \ge n \big\}.
$$

We also define a new process Rs (θ) such as

Rs (θ) =

S (θ) ifW ∗

s = 1,0 ifW ∗

s = 0.

We have the following theorem.

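Theorem 3. If (i) $\Lambda_\tau$ grows uniformly in two non-opposing directions as $\tau \to \infty$, (ii) $B(\theta_0) \equiv \lim_{n\to\infty} E[S_n(\theta_0)S_n^T(\theta_0)]$ and $A(\theta_0) \equiv \lim_{n\to\infty} -E[H(\theta_0)]$ are uniformly positive definite matrices, (iii) $Y_{gi}, Y_{ji}$ as defined in Theorem 1, $i = 1, 2$, and $W_s^*$ are strong mixing where $\alpha_{k,l}(n)$ converges to zero as $n \to \infty$; $S(\theta)$ is Borel measurable for all $\theta \in \Theta$, continuous on $\Theta$ and first moment continuous on $\Theta$, (iv) $\sum_{m=1}^{\infty} m\,\alpha_{k,l}(m) < \infty$ for $k + l \le 4$, (v) $\alpha_{1,\infty}(m) = o(m^{-2})$, (vi) for some $\delta > 0$, $E(\|S(\theta_0)\|)^{2+\delta} < \infty$ and $\sum_{m=1}^{\infty} m\,\alpha_{1,1}(m)^{\delta/(2+\delta)} < \infty$, (vii) $H(\theta)$ is Borel measurable for all $\theta \in \Theta$, continuous on $\Theta$ and second moment continuous, and $A(\theta_0)$ exists and is full rank, (viii) $\sum_{s \in \mathbb{Z}^2} \operatorname{cov}(R_0(\theta_0), R_s(\theta_0))$ is a non-singular matrix, (ix) the $K_{MP}(j,k)$ are uniformly bounded and $K_{MP}(j,k) \to 1$, $n_\tau \to \infty$ as $\tau \to \infty$ ($M, P \to \infty$), $L_M = o(M^{1/3})$ and $L_P = o(P^{1/3})$, (x) for some $\delta > 0$, $E(\|S(\theta_0)\|)^{4+\delta} < \infty$ and $Y_{gi}, Y_{ji}$ as defined in Theorem 1, $i = 1, 2$, and $W_s^*$ are strong mixing where $\alpha_{\infty,\infty}(m)^{\delta/(2+\delta)} = o(m^{-4})$, (xi) $E \sup_\Theta \|R_{m,p}(\theta)\|^2 < \infty$ and $E \sup_\Theta \|(\partial/\partial\theta) R_{m,p}(\theta)\|^2 < \infty$, and if

\[
\hat{B}_\tau = n_\tau^{-1} \sum_{j=0}^{L_M} \sum_{k=0}^{L_P} \sum_{m=j+1}^{M} \sum_{p=k+1}^{P} K_{MP}(j,k)
\left[ R_{m,p}(\hat{\theta})\, R_{m-j,p-k}(\hat{\theta})^T + R_{m-j,p-k}(\hat{\theta})\, R_{m,p}(\hat{\theta})^T \right]
- n_\tau^{-1} \sum_{m=1}^{M} \sum_{p=1}^{P} R_{m,p}(\hat{\theta})\, R_{m,p}(\hat{\theta})^T,
\]

then $\hat{B}_\tau - B(\theta_0) = o_p(1)$ as $\tau \to \infty$,

where we split $s = [m, p]$, and $\Lambda_\tau$ is a rectangle so that $m \in \{1, 2, \ldots, M\}$ and $p \in \{1, 2, \ldots, P\}$.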

To ensure positive semi-definite covariance matrix estimates, we need to choose an appropriate two-dimensional weights function that is a Bartlett window in each dimension:

\[
K_{MP}(j,k) =
\begin{cases}
\left(1 - \dfrac{|j|}{L_M}\right)\left(1 - \dfrac{|k|}{L_P}\right) & \text{for } |j| < L_M,\ |k| < L_P, \\
0 & \text{else.}
\end{cases}
\]

Proof. The result follows from Conley (1999, Proposition 3).
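As an illustration of the estimator in Theorem 3, the following is a minimal sketch in plain Python, not the code used in the paper. It assumes the scores have already been evaluated at the estimate and arranged on an M × P rectangle, with a zero vector at every unsampled site (so the grid plays the role of R_{m,p}). The function names `bartlett_weight`, `add_outer` and `conley_hac`, and the nested-list layout, are hypothetical choices made for this sketch.

```python
def bartlett_weight(j, k, L_M, L_P):
    """Product of one-dimensional Bartlett windows; zero outside the window."""
    if abs(j) < L_M and abs(k) < L_P:
        return (1.0 - abs(j) / L_M) * (1.0 - abs(k) / L_P)
    return 0.0


def add_outer(acc, a, b, scale):
    """In place: acc += scale * a b^T for two score vectors a and b."""
    for r in range(len(a)):
        for c in range(len(b)):
            acc[r][c] += scale * a[r] * b[c]


def conley_hac(R, n_tau, L_M, L_P):
    """Spatial HAC estimate of B on an M x P lattice.

    R[m][p] is the K-vector of scores at site (m, p), all zeros if the
    site is not sampled; n_tau is the number of sampled sites.
    """
    M, P, K = len(R), len(R[0]), len(R[0][0])
    B = [[0.0] * K for _ in range(K)]
    # Weighted cross products over non-negative lags (j, k), symmetrised.
    for j in range(L_M + 1):
        for k in range(L_P + 1):
            w = bartlett_weight(j, k, L_M, L_P)
            if w == 0.0:
                continue
            for m in range(j, M):        # 0-based version of m = j+1, ..., M
                for p in range(k, P):    # 0-based version of p = k+1, ..., P
                    a, b = R[m][p], R[m - j][p - k]
                    add_outer(B, a, b, w)
                    add_outer(B, b, a, w)
    # The lag (0, 0) term was counted twice above; remove one copy.
    for m in range(M):
        for p in range(P):
            add_outer(B, R[m][p], R[m][p], -1.0)
    return [[v / n_tau for v in row] for row in B]


# Toy 2 x 2 lattice with a scalar score equal to one at every site.
ones = [[[1.0], [1.0]], [[1.0], [1.0]]]
B_hat = conley_hac(ones, n_tau=4, L_M=2, L_P=2)
```

With L_M = L_P = 1 the window keeps only the lag (0, 0) term, so the sketch reduces to the average outer product of the scores, which is the estimate one would use under independence.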

5. Simulation study

In the previous section we demonstrated consistency and asymptotic normality of the PMLE based on the bivariate normal distribution. Unfortunately, it is difficult to show theoretically that the PMLE that uses groups of two is more efficient than univariate pooled estimators: both estimation approaches have neglected dynamics, making analytical comparisons of the asymptotic variances very difficult, if not impossible. Intuitively, it seems reasonable that using more information about the spatial structure should produce more precise estimators. In this section we use a small simulation study to verify this intuition.

5.1. Simulation design and results

Instead of comparing our PMLE to the GMM estimator of Pinkse and Slade (1998) directly, we choose to compare the bivariate PMLE to the univariate PMLE, which we refer to as the heteroskedastic probit estimator (HPE). We have two justifications for using the HPE rather than the GMM version. First, the HPE uses the same moment conditions as the GMM estimator because both use the first-order condition from the HPE. Thus, efficiency gains from using an optimal weighting matrix are unlikely to be important. Second, the STATA² source codes for bivariate probit estimation and heteroskedastic probit estimation are available online, and we can easily adapt the code for the kind of heteroskedasticity in the variances and covariances implied by common spatial dependence structures. We consider two simulation settings allowing for different types of spatial dependence.

5.1.1. Case 1

According to the theoretical framework given in previous sections, we could generate a dataset which allows a general correlation structure across groups as in Eqs. (6) and (7). This requires knowing the 2 × 2 matrices Ωg as functions of λ and W. Generally, it is quite difficult to derive the pairwise covariances for the bivariate probit because the exact formula for Ωg12 (and for Ωg11, Ωg22) is very complicated; they must be obtained from the inverse of the full 2n × 2n variance–covariance matrix. For the SAE model, this matrix is

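\[
[(I - \lambda W)'(I - \lambda W)]^{-1} =
\begin{pmatrix}
\Omega_{111} & \cdots & \cdots & \cdots & \cdots \\
\cdots & \cdots & \cdots & \cdots & \cdots \\
\cdots & \cdots & \Omega_{g11} & \Omega_{g12} & \cdots \\
\cdots & \cdots & \Omega_{g21} & \Omega_{g22} & \cdots \\
\cdots & \cdots & \cdots & \cdots & \Omega_{n22}
\end{pmatrix},
\tag{33}
\]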

and it is difficult to obtain the $\Omega_{ghi}$ in closed form. Instead, it seems reasonable to do the following. Let $R$ be a weighting matrix (which can be generated in STATA³) according to the distance between observations. Then define

\[
Y_i^* = X_{i1}\beta_1 + X_{i2}\beta_2 + X_{i3}\beta_3 + \varepsilon_i \tag{34}
\]

\[
\varepsilon = \lambda R u, \tag{35}
\]

where $u \sim \text{Normal}(0, I_{2n})$. The weighting matrix $R$ is standardized so that the diagonal elements are ones, and the elements of $R$ shrink as the distance between observations increases. Using this approach, it is relatively easy to determine $\operatorname{Var}(\varepsilon_i)$ and $\operatorname{Cov}(\varepsilon_i, \varepsilon_j)$, which facilitates both the HP and the bivariate probit estimation. We still allow general correlation across groups, and we are able to compare the efficiency gains from only using the marginal information (the HP approach) to using both diagonal and off-diagonal information (bivariate probit).
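Because u has identity covariance in (35), Var(ε) = λ²RR′, so Var(εi) = λ² Σk R²ik and Cov(εi, εj) = λ² Σk Rik Rjk. The sketch below illustrates this data generating process in plain Python; it is not the STATA code used for the simulations. The particular weighting rule (1/(1 + distance) within a cutoff, on a line of equally spaced points), the standard normal covariates, and the names `weight_matrix`, `error_covariance` and `simulate` are assumptions made only for this example.

```python
import random


def weight_matrix(coords, cutoff):
    """Hypothetical distance-based weights: ones on the diagonal,
    decaying with distance, and zero beyond the cutoff."""
    n = len(coords)
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d = abs(coords[i] - coords[j])
            if d <= cutoff:
                R[i][j] = 1.0 / (1.0 + d)
    return R


def error_covariance(R, lam):
    """Var(eps) = lam^2 R R' implied by eps = lam R u with Var(u) = I."""
    n = len(R)
    return [[lam * lam * sum(R[i][k] * R[j][k] for k in range(n))
             for j in range(n)] for i in range(n)]


def simulate(R, lam, beta, rng):
    """Draw one sample from Eqs. (34)-(35); Y = 1 if the latent Y* > 0."""
    n = len(R)
    u = [rng.gauss(0.0, 1.0) for _ in range(n)]
    eps = [lam * sum(R[i][k] * u[k] for k in range(n)) for i in range(n)]
    # Covariate distribution is an assumption; the text does not state it.
    X = [[rng.gauss(0.0, 1.0) for _ in beta] for _ in range(n)]
    y = [1 if sum(x * b for x, b in zip(X[i], beta)) + eps[i] > 0 else 0
         for i in range(n)]
    return X, y, eps


rng = random.Random(0)
coords = list(range(10))             # ten points on a line, for illustration
R = weight_matrix(coords, cutoff=3)
omega = error_covariance(R, lam=0.4)
X, y, eps = simulate(R, lam=0.4, beta=[1.0, 1.0, 1.0], rng=rng)
```

The diagonal entries `omega[i][i]` are the variances used by the heteroskedastic probit, and the off-diagonal entry for the two members of a group is the covariance used by the bivariate probit.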

² See http://www.stata.com/.

³ The STATA command is ‘‘Spatwmat’’. Since the time needed to invert a matrix grows quickly with the size of the matrix, and moreover the maximum matrix size in Stata is 800, we allow each observation to be spatially correlated with 99 nearby observations.

In generating the data according to Eqs. (34) and (35) we set the true parameter values for β1, β2 and β3 all equal to unity. We are particularly interested in estimation of the spatial parameter λ, and so we vary its value as follows: λ = 0.2, 0.4, 0.6 and 0.8. These values for λ are in the range of the estimated value in the empirical application of Pinkse and Slade (1998). We consider total sample sizes of N = 500 (so n = 250 groups), N = 1000, and N = 1500. We use 1000 replications in the simulations. The results are reported in Table 1 (for the spatial parameter λ) and Table 2 (for β1, β2 and β3) in Appendix B.
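The entries reported in the tables are summaries over replications: the Monte Carlo mean of the estimates, the bias (mean minus the true value) and the standard deviation. For instance, the PMLE entry for λ = 0.2 and N = 1500 in Table 1 has mean 0.511 and bias 0.511 − 0.2 = 0.311. A minimal sketch of this bookkeeping follows; the function name `summarize` is hypothetical, and the use of the sample (n − 1) standard deviation is an assumption, since the paper does not state which divisor was used.

```python
import statistics


def summarize(estimates, true_value):
    """Monte Carlo mean, bias and standard deviation of a list of estimates."""
    mean = statistics.mean(estimates)
    return {
        "mean": mean,
        "bias": mean - true_value,
        # Sample standard deviation (divisor n - 1): an assumption.
        "sd": statistics.stdev(estimates),
    }


# Two made-up replications, only to show the call.
summary = summarize([0.5, 0.7], true_value=0.4)
```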

We start with estimation of β. Table 2 shows that the PMLE has little bias with N = 1500 (except when λ = .2), whereas the HPE still has substantial bias. The poor behavior of the HPE for estimating β may be due to its inability to estimate λ. The PMLE does much better in terms of precision, too. Generally, as we expect, the Monte Carlo standard deviations shrink as the sample size increases.

Table 1 shows that the HPE struggles when trying to estimate λ. The PMLE is always much closer to the true parameter value (although it has a systematic upward bias for each N), with smaller standard deviations across all sample sizes and parameter values. The bias of the PMLE decreases when N increases but there is room for improvement. Possible bias adjustments to the estimator of λ are a good topic for future research.

In summary, from the simulation results of Tables 1 and 2, we see how the PMLE clearly outperforms the HPE, especially when estimating the spatial parameter λ. While simulation findings are necessarily special, the ones here provide strong support for the idea that using even a little information on the spatial correlation structure can go a long way in obtaining less biased, more precise estimators.

5.1.2. Case 2

We consider a second data generating process given as (34), where again we set the true parameter values for β1, β2 and β3 all equal to unity. In this case we assume (11) where the ui are i.i.d. Normal(0, 1) and Wih is the reciprocal of the Euclidean distance between i and h. We obtain the closed form expressions for the variance and covariance given in (12) and (13). Results for 1000 replications are provided in Tables 3 and 4 in Appendix B. Again, the PMLE provides substantial improvements over the HPE, especially when estimating λ. It makes sense that using information in the pairwise data helps to substantially improve the precision in estimating λ, which is fundamentally a spatial correlation parameter. Efficiency gains of the bivariate procedure in estimating β are smaller in Table 4 (possibly because we chose to group nearby observations) but still nontrivial. For this data generating process (DGP), both HPE and PMLE show little bias in estimating β, especially for the largest sample size.

6. Conclusions

The idea of this paper is simple and intuitive: rather than just using information contained in the marginal distributions, we divide observations into pairwise groups and use a partial MLE approach. Using the spatial correlation for pairs of outcomes, we prove that the bivariate PMLE is consistent and asymptotically normal under reasonable regularity conditions (although these could be relaxed in future research). We also discuss how to get consistent covariance matrix estimators under general spatial dependence by following the approach of Conley (1999), which is much more practical than the proposal of Pinkse and Slade (1998).

The simulation study in Section 5 demonstrates that using bivariate rather than univariate distributions not only improves efficiency, but can substantially decrease finite-sample bias, especially for estimating the spatial correlation parameters.



The fact that we can undertake a substantial simulation study demonstrates that our approach is computationally much more feasible than the full, joint MLE. Our conjecture is that an estimator that uses, say, trivariate distributions would perform even better. Of course that comes at the expense of computation. Nevertheless, computation for a single data set should not be difficult for even larger group sizes. We think the findings for group sizes of two make a strong case for the general PMLE approach.

A fixed and known spatial error structure is a limitation of our results. Ideally, one could accommodate endogenous location decisions. Unfortunately, endogenous location raises both conceptual and technical difficulties that need to be studied in future research. Extensions that are more immediate are models with spatial distributed lags in the covariates and other kinds of nonlinear models that can be estimated by PMLE, including Tobit, count and switching models.

Appendix A

A.1. Expressions of conditional bivariate distributions

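Since

\[
P(Y_{g1} = 1, Y_{g2} = 1 | X_g) = P(Y_{g1} = 1 | Y_{g2} = 1, X_g) \cdot P(Y_{g2} = 1 | X_g), \tag{36}
\]

it is easy to see that $P(Y_{g2} = 1 | X_g) = \Phi\left( X_{g2}\beta / \sqrt{\operatorname{Var}(\varepsilon_{g2})} \right)$, and thus it remains to get $P(Y_{g1} = 1 | Y_{g2} = 1, X_g)$.

First, since $Y_{g2} = 1$ if and only if $\varepsilon_{g2} > -X_{g2}\beta$, and $\varepsilon_{g2}$ follows a normal distribution and is independent of $X_g$, the density of $\varepsilon_{g2}$ given $\varepsilon_{g2} > -X_{g2}\beta$ is

\[
\frac{\phi\left( \dfrac{\varepsilon_{g2}}{\sqrt{\operatorname{Var}(\varepsilon_{g2})}} \right)}{P(\varepsilon_{g2} > -X_{g2}\beta)}
= \frac{\phi\left( \dfrac{\varepsilon_{g2}}{\sqrt{\operatorname{Var}(\varepsilon_{g2})}} \right)}{\Phi\left( \dfrac{X_{g2}\beta}{\sqrt{\operatorname{Var}(\varepsilon_{g2})}} \right)}. \tag{37}
\]

Therefore,

\[
P(Y_{g1} = 1 | Y_{g2} = 1, X_g) = E\left[ P(Y_{g1} = 1 | X_g, \varepsilon_{g2}) \,\middle|\, Y_{g2} = 1, X_g \right] \tag{38}
\]

\[
= E\left[ \Phi\left( \frac{X_{g1}\beta + \delta_{g1}\varepsilon_{g2}}{\sqrt{\operatorname{Var}(e_{g1})}} \right) \,\middle|\, Y_{g2} = 1, X_g \right] \tag{39}
\]

\[
= \frac{1}{\Phi\left( \dfrac{X_{g2}\beta}{\sqrt{\operatorname{Var}(\varepsilon_{g2})}} \right)}
\int_{-X_{g2}\beta}^{\infty} \Phi\left( \frac{X_{g1}\beta + \delta_{g1}\varepsilon_{g2}}{\sqrt{\operatorname{Var}(e_{g1})}} \right) \phi\left( \frac{\varepsilon_{g2}}{\sqrt{\operatorname{Var}(\varepsilon_{g2})}} \right) d\varepsilon_{g2}, \tag{40}
\]

and it is easy to see that $P(Y_{g1} = 0 | Y_{g2} = 1, X_g) = 1 - P(Y_{g1} = 1 | Y_{g2} = 1, X_g)$ because $Y_{g1}$ is a binary variable.

Similarly, because $Y_{g2} = 0$ if and only if $\varepsilon_{g2} \le -X_{g2}\beta$, we can get

\[
P(Y_{g1} = 1 | Y_{g2} = 0, X_g) = \frac{1}{1 - \Phi\left( \dfrac{X_{g2}\beta}{\sqrt{\operatorname{Var}(\varepsilon_{g2})}} \right)}
\int_{-\infty}^{-X_{g2}\beta} \Phi\left( \frac{X_{g1}\beta + \delta_{g1}\varepsilon_{g2}}{\sqrt{\operatorname{Var}(e_{g1})}} \right) \phi\left( \frac{\varepsilon_{g2}}{\sqrt{\operatorname{Var}(\varepsilon_{g2})}} \right) d\varepsilon_{g2} \tag{41}
\]

and $P(Y_{g1} = 0 | Y_{g2} = 0, X_g) = 1 - P(Y_{g1} = 1 | Y_{g2} = 0, X_g)$.

Now we are ready to get $P(Y_{g1} = 1, Y_{g2} = 1 | X_g)$ as follows:

\[
P(Y_{g1} = 1, Y_{g2} = 1 | X_g) = \frac{1}{\Phi\left( \dfrac{X_{g2}\beta}{\sqrt{\operatorname{Var}(\varepsilon_{g2})}} \right)}
\int_{-X_{g2}\beta}^{\infty} \Phi\left( \frac{X_{g1}\beta + \delta_{g1}\varepsilon_{g2}}{\sqrt{\operatorname{Var}(e_{g1})}} \right) \phi\left( \frac{\varepsilon_{g2}}{\sqrt{\operatorname{Var}(\varepsilon_{g2})}} \right) d\varepsilon_{g2}
\times \Phi\left( \frac{X_{g2}\beta}{\sqrt{\operatorname{Var}(\varepsilon_{g2})}} \right) \tag{42}
\]

\[
= \int_{-X_{g2}\beta}^{\infty} \Phi\left( \frac{X_{g1}\beta + \delta_{g1}\varepsilon_{g2}}{\sqrt{\operatorname{Var}(e_{g1})}} \right) \phi\left( \frac{\varepsilon_{g2}}{\sqrt{\operatorname{Var}(\varepsilon_{g2})}} \right) d\varepsilon_{g2}, \tag{43}
\]

and similarly we can finally obtain

\[
P(Y_{g1} = 0, Y_{g2} = 1 | X_g) = \Phi\left( \frac{X_{g2}\beta}{\sqrt{\operatorname{Var}(\varepsilon_{g2})}} \right)
- \int_{-X_{g2}\beta}^{\infty} \Phi\left( \frac{X_{g1}\beta + \delta_{g1}\varepsilon_{g2}}{\sqrt{\operatorname{Var}(e_{g1})}} \right) \phi\left( \frac{\varepsilon_{g2}}{\sqrt{\operatorname{Var}(\varepsilon_{g2})}} \right) d\varepsilon_{g2} \tag{44}
\]

\[
P(Y_{g1} = 1, Y_{g2} = 0 | X_g) = \int_{-\infty}^{-X_{g2}\beta} \Phi\left( \frac{X_{g1}\beta + \delta_{g1}\varepsilon_{g2}}{\sqrt{\operatorname{Var}(e_{g1})}} \right) \phi\left( \frac{\varepsilon_{g2}}{\sqrt{\operatorname{Var}(\varepsilon_{g2})}} \right) d\varepsilon_{g2} \tag{45}
\]

\[
P(Y_{g1} = 0, Y_{g2} = 0 | X_g) = 1 - \Phi\left( \frac{X_{g2}\beta}{\sqrt{\operatorname{Var}(\varepsilon_{g2})}} \right)
- \int_{-\infty}^{-X_{g2}\beta} \Phi\left( \frac{X_{g1}\beta + \delta_{g1}\varepsilon_{g2}}{\sqrt{\operatorname{Var}(e_{g1})}} \right) \phi\left( \frac{\varepsilon_{g2}}{\sqrt{\operatorname{Var}(\varepsilon_{g2})}} \right) d\varepsilon_{g2}. \tag{46}
\]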
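The four joint probabilities can be evaluated by one-dimensional numerical integration. The sketch below is an illustration in plain Python rather than the authors' implementation. It assumes that the pair (εg1, εg2) is bivariate normal with variances `s11`, `s22` and covariance `s12`, that δg1 = s12/s22 and Var(eg1) = s11 − s12²/s22 (the usual linear projection of εg1 on εg2), and that the normal density of εg2 carries its 1/sd scale factor so that it integrates to one. The function names, the Simpson rule and the truncation of the integral at ten standard deviations are choices made for this example.

```python
import math


def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def pair_probabilities(xb1, xb2, s11, s22, s12, n_steps=2000, tail=10.0):
    """P(Y1 = a, Y2 = b | X) for a, b in {0, 1}; xb1, xb2 are the indices X*beta."""
    sd2 = math.sqrt(s22)
    delta = s12 / s22                          # assumed projection coefficient
    sd_e = math.sqrt(s11 - s12 * s12 / s22)    # assumed sd of the projection error

    def integrand(e2):
        return Phi((xb1 + delta * e2) / sd_e) * phi(e2 / sd2) / sd2

    def simpson(a, b):
        # Composite Simpson rule; n_steps must be even.
        h = (b - a) / n_steps
        total = integrand(a) + integrand(b)
        for i in range(1, n_steps):
            total += (4 if i % 2 else 2) * integrand(a + i * h)
        return total * h / 3.0

    cut = -xb2                                 # Y2 = 1 iff eps2 > -X2*beta
    upper = max(cut, tail * sd2)
    lower = min(cut, -tail * sd2)
    p11 = simpson(cut, upper)                  # as in Eq. (43)
    p10 = simpson(lower, cut)                  # as in Eq. (45)
    p2 = Phi(xb2 / sd2)
    return {(1, 1): p11, (1, 0): p10,
            (0, 1): p2 - p11,                  # as in Eq. (44)
            (0, 0): 1.0 - p2 - p10}            # as in Eq. (46)


def pair_loglik(y1, y2, probs):
    """Log-likelihood contribution of one group: the log of the observed cell."""
    return math.log(probs[(y1, y2)])


probs = pair_probabilities(xb1=0.0, xb2=0.0, s11=1.0, s22=1.0, s12=0.5)
```

A simple check is the case of zero indices, unit variances and correlation 0.5, for which the orthant probability P(Yg1 = 1, Yg2 = 1) equals 1/4 + arcsin(0.5)/(2π) = 1/3. With zero covariance the joint probability should factor into the product of the two marginal probit probabilities.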

A.2. Proofs of theorems

Proof of Theorem 1. By Newey and McFadden (1994), for consistency it is sufficient to verify the following conditions:

(i) $Q$ has a unique maximum at $\theta_0$.

(ii) $Q_n(\theta) - Q(\theta) = o_p(1)$ at all $\theta \in \Theta$.

(iii) $Q_n(\theta)$ is stochastically equicontinuous and $Q$ is continuous on $\Theta$.

We have already assumed condition (i). The proof of condition (ii) is provided in Lemma 1, and the proof that $Q_n(\theta)$ is stochastically equicontinuous can be found in Lemma 2.

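Proof of Theorem 2. To establish the asymptotic normality of the partial MLE for the spatial bivariate probit model, we start from the mean value theorem. Since $\frac{\partial Q_n}{\partial\theta}(\hat{\theta}) = 0$, the mean value theorem gives

\[
\frac{\partial Q_n}{\partial\theta}(\hat{\theta}) = 0 = \frac{\partial Q_n}{\partial\theta}(\theta_0) + \frac{\partial^2 Q_n}{\partial\theta\,\partial\theta^T}(\theta^*)(\hat{\theta} - \theta_0) \tag{47}
\]

\[
\Rightarrow\ (\hat{\theta} - \theta_0) = -\left[ \frac{\partial^2 Q_n}{\partial\theta\,\partial\theta^T}(\theta^*) \right]^{-1} \frac{\partial Q_n}{\partial\theta}(\theta_0), \tag{48}
\]

where $\theta^*$ lies between $\hat{\theta}$ and $\theta_0$.

First, let us discuss the asymptotic properties of the term $\frac{\partial^2 Q_n}{\partial\theta\,\partial\theta^T}(\theta^*)$. Recall that

\[
Q_n(\theta) = \frac{1}{n} \sum_{g=1}^{n} \left\{ Y_{g1}Y_{g2}P_g(1,1) + Y_{g1}(1 - Y_{g2})P_g(1,0) + (1 - Y_{g1})Y_{g2}P_g(0,1) + (1 - Y_{g1})(1 - Y_{g2})P_g(0,0) \right\}, \tag{49}
\]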

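where $P_g(1,1) \equiv \log P(Y_{g1} = 1, Y_{g2} = 1 | X_g)$ and so on. Also

\[
\frac{\partial^2 Q_n}{\partial\theta\,\partial\theta^T}(\theta) = \frac{1}{n} \sum_{g=1}^{n} \left\{ Y_{g1}Y_{g2}\frac{\partial^2 P_g(1,1)}{\partial\theta\,\partial\theta^T} + Y_{g1}(1 - Y_{g2})\frac{\partial^2 P_g(1,0)}{\partial\theta\,\partial\theta^T} + (1 - Y_{g1})Y_{g2}\frac{\partial^2 P_g(0,1)}{\partial\theta\,\partial\theta^T} + (1 - Y_{g1})(1 - Y_{g2})\frac{\partial^2 P_g(0,0)}{\partial\theta\,\partial\theta^T} \right\}, \tag{50}
\]

where

\[
\frac{\partial^2 P_g(1,1)}{\partial\theta\,\partial\theta^T} = \frac{-1}{[P(Y_{g1} = 1, Y_{g2} = 1 | X_g)]^2} \left( \frac{\partial P(Y_{g1} = 1, Y_{g2} = 1 | X_g)}{\partial\theta} \right)^2 + \frac{1}{P(Y_{g1} = 1, Y_{g2} = 1 | X_g)} \times \frac{\partial^2 [P(Y_{g1} = 1, Y_{g2} = 1 | X_g)]}{\partial\theta\,\partial\theta^T}, \tag{51}
\]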

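and all other terms behave similarly.

As before, we only discuss one of these terms, and the same logic applies to the other terms. We know that

\[
\frac{1}{n} \sum_{g=1}^{n} Y_{g1}Y_{g2}\frac{\partial^2 P_g(1,1)}{\partial\theta\,\partial\theta^T}(\theta^*)
= \frac{1}{n} \sum_{g=1}^{n} Y_{g1}Y_{g2} \left\{ \frac{-1}{[P(Y_{g1} = 1, Y_{g2} = 1 | X_g)]^2} \left( \frac{\partial P(Y_{g1} = 1, Y_{g2} = 1 | X_g)}{\partial\theta}(\theta^*) \right)^2 + \frac{1}{P(Y_{g1} = 1, Y_{g2} = 1 | X_g)} \times \frac{\partial^2 [P(Y_{g1} = 1, Y_{g2} = 1 | X_g)]}{\partial\theta\,\partial\theta^T}(\theta^*) \right\}. \tag{52}
\]

Look at the first term of the above equation, given by

\[
\frac{1}{n} \sum_{g=1}^{n} Y_{g1}Y_{g2} \frac{-1}{[P(Y_{g1} = 1, Y_{g2} = 1 | X_g)]^2} \left( \frac{\partial P(Y_{g1} = 1, Y_{g2} = 1 | X_g)}{\partial\theta}(\theta^*) \right)^2. \tag{53}
\]

Since $\frac{1}{[P(Y_{g1} = 1, Y_{g2} = 1 | X_g)]^2} < \infty$, we can write this term as

\[
\frac{1}{n} \sum_{g=1}^{n} K_{g11} \left( \frac{\partial P(Y_{g1} = 1, Y_{g2} = 1 | X_g)}{\partial\theta}(\theta^*) \right)^2, \tag{54}
\]

where $K_{g11} \equiv Y_{g1}Y_{g2} \frac{-1}{[P(Y_{g1} = 1, Y_{g2} = 1 | X_g)]^2}$.

In order to prove

\[
\frac{1}{n} \sum_{g=1}^{n} K_{g11} \left( \frac{\partial P(Y_{g1} = 1, Y_{g2} = 1 | X_g)}{\partial\theta}(\theta^*) \right)^2 \xrightarrow{p} \frac{1}{n} \sum_{g=1}^{n} K_{g11} \left( \frac{\partial P(Y_{g1} = 1, Y_{g2} = 1 | X_g)}{\partial\theta}(\theta_0) \right)^2, \tag{55}
\]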

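we need to show that it holds for all $\|\varpi\| = 1$. Set $K_{g11} = \varpi^T K_g$ and then

\[
\varpi^T \left[ \frac{1}{n} \sum_{g=1}^{n} K_{g11} \left( \frac{\partial P(Y_{g1} = 1, Y_{g2} = 1 | X_g)}{\partial\theta}(\hat{\theta}) \right)^2 - \frac{1}{n} \sum_{g=1}^{n} K_{g11} \left( \frac{\partial P(Y_{g1} = 1, Y_{g2} = 1 | X_g)}{\partial\theta}(\theta_0) \right)^2 \right] \tag{56}
\]

\[
= \frac{1}{n} \sum_{g=1}^{n} K_{g11} \left[ \left( \frac{\partial P(Y_{g1} = 1, Y_{g2} = 1 | X_g)}{\partial\theta}(\hat{\theta}) \right)^2 - \left( \frac{\partial P(Y_{g1} = 1, Y_{g2} = 1 | X_g)}{\partial\theta}(\theta_0) \right)^2 \right] \tag{57}
\]

\[
= (\hat{\theta} - \theta_0) \frac{2}{n} \sum_{g=1}^{n} K_{g11} \frac{\partial P(Y_{g1} = 1, Y_{g2} = 1 | X_g)}{\partial\theta}(\theta^*) \times \frac{\partial^2 P(Y_{g1} = 1, Y_{g2} = 1 | X_g)}{\partial\theta\,\partial\theta^T}(\theta^*). \tag{58}
\]

From the proof of Theorem 1, we know that $\sup_g \left\| \frac{\partial P(Y_{g1} = 1, Y_{g2} = 1 | X_g)}{\partial\theta} \right\| < \infty$. From Lemma 3, $\sup_g \left\| \frac{\partial^2 P(Y_{g1} = 1, Y_{g2} = 1 | X_g)}{\partial\theta\,\partial\theta^T} \right\| < \infty$. From Theorem 1, we also know that $\hat{\theta} - \theta_0 = o_p(1)$ and hence

\[
(\hat{\theta} - \theta_0) \frac{2}{n} \sum_{g=1}^{n} K_{g11} \frac{\partial P(Y_{g1} = 1, Y_{g2} = 1 | X_g)}{\partial\theta}(\theta^*) \times \frac{\partial^2 P(Y_{g1} = 1, Y_{g2} = 1 | X_g)}{\partial\theta\,\partial\theta^T}(\theta^*) = o_p(1) \tag{59}
\]

\[
\Longrightarrow\ \varpi^T \left[ \frac{1}{n} \sum_{g=1}^{n} K_{g11} \left( \frac{\partial P(Y_{g1} = 1, Y_{g2} = 1 | X_g)}{\partial\theta}(\hat{\theta}) \right)^2 - \frac{1}{n} \sum_{g=1}^{n} K_{g11} \left( \frac{\partial P(Y_{g1} = 1, Y_{g2} = 1 | X_g)}{\partial\theta}(\theta_0) \right)^2 \right] = o_p(1) \tag{60}
\]

\[
\Longrightarrow\ \frac{1}{n} \sum_{g=1}^{n} K_{g11} \left( \frac{\partial P(Y_{g1} = 1, Y_{g2} = 1 | X_g)}{\partial\theta}(\theta^*) \right)^2 \xrightarrow{p} \frac{1}{n} \sum_{g=1}^{n} K_{g11} \left( \frac{\partial P(Y_{g1} = 1, Y_{g2} = 1 | X_g)}{\partial\theta}(\theta_0) \right)^2. \tag{61}
\]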

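By definition,

\[
\lim_{n\to\infty} \frac{1}{n} \sum_{g=1}^{n} K_{g11} \left( \frac{\partial P(Y_{g1} = 1, Y_{g2} = 1 | X_g)}{\partial\theta}(\theta_0) \right)^2 = E\left[ K_{g11} \left( \frac{\partial P(Y_{g1} = 1, Y_{g2} = 1 | X_g)}{\partial\theta}(\theta_0) \right)^2 \right], \tag{62}
\]

and therefore,

\[
\frac{1}{n} \sum_{g=1}^{n} Y_{g1}Y_{g2} \frac{-1}{[P(Y_{g1} = 1, Y_{g2} = 1 | X_g)]^2} \left( \frac{\partial P(Y_{g1} = 1, Y_{g2} = 1 | X_g)}{\partial\theta}(\theta^*) \right)^2 \xrightarrow{p} \tag{63}
\]

\[
E\left[ Y_{g1}Y_{g2} \frac{-1}{[P(Y_{g1} = 1, Y_{g2} = 1 | X_g)]^2} \left( \frac{\partial P(Y_{g1} = 1, Y_{g2} = 1 | X_g)}{\partial\theta}(\theta_0) \right)^2 \right]. \tag{64}
\]

Similarly, we can prove in relation to the second term in (52) that

\[
\frac{1}{n} \sum_{g=1}^{n} Y_{g1}Y_{g2} \frac{1}{P(Y_{g1} = 1, Y_{g2} = 1 | X_g)} \times \frac{\partial^2 [P(Y_{g1} = 1, Y_{g2} = 1 | X_g)]}{\partial\theta\,\partial\theta^T}(\theta^*) \tag{65}
\]

\[
\xrightarrow{p} E\left[ Y_{g1}Y_{g2} \frac{1}{P(Y_{g1} = 1, Y_{g2} = 1 | X_g)} \times \frac{\partial^2 [P(Y_{g1} = 1, Y_{g2} = 1 | X_g)]}{\partial\theta\,\partial\theta^T}(\theta_0) \right]. \tag{66}
\]

As usual, we apply the above arguments repeatedly to the other terms. Finally, we can get that

\[
\lim_{n\to\infty} \frac{\partial^2 Q_n}{\partial\theta\,\partial\theta^T}(\theta^*) \xrightarrow{p} E\left[ \frac{\partial^2 Q_n}{\partial\theta\,\partial\theta^T}(\theta_0) \right]. \tag{67}
\]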

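If we define

\[
H \equiv Y_{g1}Y_{g2}\frac{\partial^2 P_g(1,1)}{\partial\theta\,\partial\theta^T} + Y_{g1}(1 - Y_{g2})\frac{\partial^2 P_g(1,0)}{\partial\theta\,\partial\theta^T} + (1 - Y_{g1})Y_{g2}\frac{\partial^2 P_g(0,1)}{\partial\theta\,\partial\theta^T} + (1 - Y_{g1})(1 - Y_{g2})\frac{\partial^2 P_g(0,0)}{\partial\theta\,\partial\theta^T}, \tag{68}
\]

where $H$ denotes the Hessian, Eq. (67) can be rewritten as

\[
\lim_{n\to\infty} \frac{1}{n} \sum_{g=1}^{n} H(\theta^*) \xrightarrow{p} \lim_{n\to\infty} E[H(\theta_0)]. \tag{69}
\]

Therefore, it remains to show the asymptotic normality of the score term, $S_n(\theta_0)$. Now

\[
S_n(\theta_0) = \frac{1}{n} \sum_{g=1}^{n} \left\{ Y_{g1}Y_{g2}\frac{\partial P_g(1,1)}{\partial\theta}(\theta_0) + Y_{g1}(1 - Y_{g2})\frac{\partial P_g(1,0)}{\partial\theta}(\theta_0) + (1 - Y_{g1})Y_{g2}\frac{\partial P_g(0,1)}{\partial\theta}(\theta_0) + (1 - Y_{g1})(1 - Y_{g2})\frac{\partial P_g(0,0)}{\partial\theta}(\theta_0) \right\}. \tag{70}
\]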

We need to show that $B^{-1/2}(\theta_0)S_n(\theta_0) \to N(0, I_K)$, where $B(\theta) \equiv \lim_{n\to\infty} nE[S_n(\theta)S_n^T(\theta)]$. Note that the information matrix equality does not hold here, i.e. $-E[H_n(\theta_0)] \ne E[S_n(\theta)S_n^T(\theta)]$, because the score terms are correlated with each other over space. In this part, we follow Pinkse and Slade (1998) and use Bernstein’s blocking method and McLeish’s (1974) central limit theorem for dependent processes. First, define $T_{na_n} \equiv \prod_{j=1}^{a_n}(1 + i\gamma D_{n,j})$, where $i^2 = -1$, $D_{n,j}$ ($j = 1, 2, \ldots, a_n$) is an array of random variables on the probability triple $(\Omega, \mathcal{F}, P)$, and $\gamma$ is a real number. McLeish’s (1974) central limit theorem for dependent processes requires the following four conditions:

(i) $T_{na_n}$ is uniformly integrable,

(ii) $ET_{na_n} \to 1$,

(iii) $\sum_{j=1}^{a_n} D_{n,j}^2 \xrightarrow{p} 1$,

(iv) $\max_{j \le a_n} |D_{n,j}| \xrightarrow{p} 0$.

Now we need to define $D_{n,j}$ in our case. Let $Y_{0n} \equiv \varpi^T \frac{\sqrt{n}\,S_n(\theta_0)}{\sqrt{B(\theta_0)}} = n^{-1/2} \sum_{g=1}^{n} A_{ng}$, which implicitly defines $A_{ng}$. In order to prove $Y_{0n} \xrightarrow{d} N(0, 1)$, we need to establish that the property holds for all $\|\varpi\| = 1$ using the Cramér–Wold device. As in the proof of Lemma 1 in Theorem 1, we split the region in which observations are located into $a_n$ areas of size $\sqrt{b_n} \times \sqrt{b_n}$. We also know that $a_n$ increases faster than $\sqrt{n}$ and $b_n$ slower, where $a_n$ and $b_n$ are integers such that $a_n b_n = n$. Let $a_n$ and $b_n$ be constructed such that $\alpha(\sqrt{b_n})\,a_n \to 0$. Let $n^{\tau - 1/2} \times b_n < 1$, uniformly in $n$, for some fixed $0 < \tau < \frac{1}{2}$. Let $\Lambda_{nj}$ denote the set of indices corresponding to the observations in area $j$. By assumption a number $C > 0$ exists such that $\max_j(\#\Lambda_{nj}) < Cb_n$. Define $D_{n,j} \equiv n^{-1/2} \sum_{g \in \Lambda_{nj}} A_{ng}$, and hence we can write $Y_{0n} = \sum_{j=1}^{a_n} D_{n,j}$.

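Now we are ready to discuss the four conditions for McLeish’s (1974) central limit theorem. First, look at condition (iv), which requires that $\max_{j \le a_n} |D_{n,j}| = o_p(1)$:

\[
\max_{j \le a_n} |D_{n,j}| = \max_{j \le a_n} \left| n^{-1/2} \sum_{g \in \Lambda_{nj}} A_{ng} \right|. \tag{71}
\]

Since by assumption

\[
\max_j(\#\Lambda_{nj}) < Cb_n\ \Rightarrow\ \max_{j \le a_n} \left| n^{-1/2} \sum_{g \in \Lambda_{nj}} A_{ng} \right| \le Cb_n \times n^{-1/2} \sup \|A_{ng}\|, \tag{72}
\]

where $\#$ denotes the number of objects, by definition we have that

\[
\varpi^T \frac{\sqrt{n}\,S_n(\theta_0)}{\sqrt{B(\theta_0)}} = n^{-1/2} \sum_{g=1}^{n} A_{ng},
\]

\[
A_{ng} = \varpi^T \frac{1}{\sqrt{B(\theta_0)}} \left\{ Y_{g1}Y_{g2}\frac{\partial P_g(1,1)}{\partial\theta}(\theta_0) + Y_{g1}(1 - Y_{g2})\frac{\partial P_g(1,0)}{\partial\theta}(\theta_0) + (1 - Y_{g1})Y_{g2}\frac{\partial P_g(0,1)}{\partial\theta}(\theta_0) + (1 - Y_{g1})(1 - Y_{g2})\frac{\partial P_g(0,0)}{\partial\theta}(\theta_0) \right\}. \tag{73}
\]

Since $B(\theta_0)$ is positive definite, $B(\theta_0)^{-1/2}$ is bounded as $n \to \infty$, and we have that $\sup_g \|Y_g\| < \infty$ by assumption (vi) in Theorem 1. We have also proved that $\sup_g \left\| \frac{\partial P_g(1,1)}{\partial\theta} \right\| < \infty$ in Lemma 2. Therefore, we are able to prove that $\sup \|A_{ng}\| < \infty$. Then $Cb_n \times n^{-1/2} \sup \|A_{ng}\| = O_p(Cb_n \times n^{-1/2}) = o_p(1)$ by construction of $b_n$.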

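Hence we can get that $\max_{j \le a_n} |D_{n,j}| = o_p(1)$.

Second, let us discuss condition (i): $T_{na_n}$ is uniformly integrable. Following Davidson (1994), if a random variable is integrable, the contribution to the integral of extreme random variable values must be negligible. In other words, if $E|T_{na_n}| < \infty$, then $E(|T_{na_n}| 1_{\{|T_{na_n}| > K\}}) \to 0$ as $K \to \infty$, which is equivalent to saying that $P[\sup_{n > N} |T_{na_n}| > K] = 0$ for some $K > 0$ as $n \to \infty$. Here we follow the proof of Lemma 10 in Pinkse and Slade (1998). We have that

\[
P\left[ \sup_{n > N} |T_{na_n}| > K \right] = P\left[ \sup_{n > N} \left| \prod_{j=1}^{a_n}(1 + i\gamma D_{n,j}) \right| > K \right] \tag{74}
\]

\[
\le P\left[ \sup_{n > N} \prod_{j=1}^{a_n} \left( 1 + \gamma^2 D_{n,j}^2 \right) > K \right] \tag{75}
\]

\[
= P\left[ \sup_{n > N} \prod_{j=1}^{a_n} \left( 1 + \gamma^2 D_{n,j}^2 \right) > K \,\middle|\, \sup_{n > N, j} n^\tau |D_{n,j}| \le C \right] \times P\left[ \sup n^\tau |D_{n,j}| \le C \right]
+ P\left[ \sup_{n > N} \prod_{j=1}^{a_n} \left( 1 + \gamma^2 D_{n,j}^2 \right) > K \,\middle|\, \sup_{n > N, j} n^\tau |D_{n,j}| > C \right] \times P\left[ \sup n^\tau |D_{n,j}| > C \right] \tag{76}
\]

\[
\le P\left[ \sup_{n > N} \prod_{j=1}^{a_n} \left( 1 + \gamma^2 D_{n,j}^2 \right) > K \,\middle|\, \sup_{n > N, j} n^\tau |D_{n,j}| \le C \right] + P\left[ \sup n^\tau |D_{n,j}| > C \right], \tag{77}
\]

where $C$ is a uniform upper bound to $\sum_{g \in \Lambda_{nj}} A_{ng}$. Therefore,

\[
P\left[ \sup n^\tau |D_{n,j}| > C \right] = P\left[ \sup n^\tau \left| n^{-1/2} \sum_{g \in \Lambda_{nj}} A_{ng} \right| > C \right] \tag{78}
\]

\[
\le P\left[ \sup n^{\tau - 1/2} \sum_{g \in \Lambda_{nj}} |A_{ng}| > C \right] \le P\left[ \sup n^{\tau - 1/2}\, b_n \sum_{g \in \Lambda_{nj}} |A_{ng}| > C \right] = 0 \tag{79}
\]

since $n^{\tau - 1/2} b_n < 1$ and by construction of $b_n$. Then,

\[
P\left[ \sup_{n > N} \prod_{j=1}^{a_n} \left( 1 + \gamma^2 D_{n,j}^2 \right) > K \,\middle|\, \sup_{n > N, j} n^\tau |D_{n,j}| \le C \right] \le P\left[ \sup_{n > N} \left| (1 + \gamma^2 n^{-2\tau} C^2)^{a_n/2} \right| > K \right] = 0 \tag{80}
\]

provided we set $K$ sufficiently large. Therefore, we have proved that $P[\sup_{n > N} |T_{na_n}| > K] = 0$, which implies that $T_{na_n}$ is uniformly integrable.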

Third, condition (ii) requires that $ET_{na_n} \to 1$, which is equivalent to saying that $ET_{na_n} - 1 = o(1)$; see the proof in Lemma 4.

Fourth, in order to prove (iii), $\sum_{j=1}^{a_n} D_{n,j}^2 \xrightarrow{p} 1$, note that by Lemma 7, $\sum_{j=1}^{a_n} D_{n,j}^2 - 1 = \sum_{j=1}^{a_n} E(D_{n,j}^2) - 1 + o_p(1)$ and

\[
\sum_{j=1}^{a_n} E(D_{n,j}^2) - 1 + o_p(1) = E(Y_{0n}^2) - 1 - \sum_{i \ne j}^{a_n} E(D_{n,i}D_{n,j}) + o_p(1) = o_p(1), \tag{81}
\]

by construction of $Y_{0n}$, since $E(Y_{0n}^2) = 1$. It remains to show that $\sum_{i \ne j}^{a_n} E(D_{n,i}D_{n,j}) = o(1)$. This condition is proved in Lemmas 5–7.⁴

A.3. Technical lemmas

The proofs of Theorems 1 and 2 require the use of the following Lemmas 1–8. The proofs are in the technical appendix that is available upon request from the authors.

Lemma 1. Under the assumptions in Theorem 1, $Q_n(\theta) - Q(\theta) = o_p(1)$ for all $\theta \in \Theta$.

Lemma 2. Under the assumptions in Theorem 1, $Q_n(\theta) - Q(\theta)$ is stochastically equicontinuous.

Lemma 3. Under the assumptions in Theorem 2, $\sup_g \left\| \frac{\partial^2 P(Y_{g1} = 1, Y_{g2} = 1 | X_g)}{\partial\theta\,\partial\theta^T} \right\| < \infty$.

Lemma 4. Under the assumptions in Theorem 2, $ET_{na_n} - 1 = o(1)$, where $T_{na_n} \equiv \prod_{j=1}^{a_n}(1 + i\gamma D_{n,j})$.

Lemma 5. Under the assumptions in Theorem 2, $\max_{i \ne j} |E(D_{n,i}D_{n,j})| = o(n^{-1}b_n) = o(a_n^{-1})$.

Lemma 6. Under the assumptions in Theorem 2, $\max \sum_{l=2}^{\sqrt{a_n}} \sum_{j \in \Xi_{nil}} |E(D_{n,i}D_{n,j})| = o(a_n^{-1})$.

Lemma 7. Under the assumptions in Theorem 2, $\sum_{j=1}^{a_n} D_{n,j}^2 = \sum_{j=1}^{a_n} E(D_{n,j}^2) + o_p(1)$.

Finally, the following Lemma 8 generalizes the results of Pinkse and Slade (1998) as a way to obtain consistent estimates of the variance–covariance matrix. The proof is also contained in the technical appendix that is available upon request from the authors.

Lemma 8. If the assumptions in Theorem 2 hold, and $\sup_g \left\| \frac{\partial\Phi_4}{\partial\theta} + \frac{\partial\Phi_3}{\partial\theta} \right\| < \infty$, then $A_n(\hat{\theta}) - A(\theta_0) = o_p(1)$ and $B_n(\hat{\theta}) - B(\theta_0) = o_p(1)$, where $B_n(\theta) \equiv nE[S_n(\theta)S_n^T(\theta)]$ and $A(\theta) \equiv -E[H(\theta)]$.

Appendix B

See Tables 1–4.

Appendix C. Supplementary data

Supplementary material related to this article can be foundonline at http://dx.doi.org/10.1016/j.jeconom.2012.08.005.

⁴ Lemmas 5–8 are along the lines of those in Pinkse and Slade (1998), which are a simplified version of the proofs in Davidson (1994).



Table 1*
Case 1: Simulation results of different estimators of λ in the context of the bivariate spatial probit model.

| N | Statistic | λ = 0.2, HPE | λ = 0.2, PMLE | λ = 0.4, HPE | λ = 0.4, PMLE | λ = 0.6, HPE | λ = 0.6, PMLE | λ = 0.8, HPE | λ = 0.8, PMLE |
|---|---|---|---|---|---|---|---|---|---|
| 500 | Mean | 3.938 | 0.514 | 6.177 | 0.519 | 7.698 | 0.571 | 7.735 | 0.634 |
| 500 | Bias | 3.738 | 0.314 | 5.777 | 0.319 | 7.098 | −0.029 | 6.935 | −0.166 |
| 500 | (s.d.) | (12.158) | (0.120) | (15.776) | (0.205) | (16.929) | (0.151) | (16.202) | (0.289) |
| 1000 | Mean | 3.174 | 0.512 | 4.668 | 0.518 | 5.456 | 0.581 | 5.914 | 0.672 |
| 1000 | Bias | 2.974 | 0.312 | 4.268 | 0.118 | 4.856 | −0.019 | 5.114 | −0.128 |
| 1000 | (s.d.) | (8.844) | (0.107) | (9.100) | (0.133) | (9.631) | (0.149) | (10.173) | (0.276) |
| 1500 | Mean | 2.746 | 0.511 | 4.050 | 0.507 | 4.872 | 0.609 | 5.426 | 0.708 |
| 1500 | Bias | 2.546 | 0.311 | 3.650 | 0.107 | 4.272 | 0.009 | 4.626 | −0.092 |
| 1500 | (s.d.) | (6.423) | (0.099) | (7.414) | (0.124) | (8.598) | (0.149) | (8.514) | (0.253) |

* Results are presented for the bivariate Partial Maximum Likelihood Estimator (PMLE) and the Heteroskedastic Probit Estimator (HPE) of λ. Numbers in brackets are standard deviations (s.d.).

Table 2*
Case 1: Simulation results of different estimators of β1, β2 and β3 in the context of the bivariate spatial probit model.

| λ | N | Statistic | β1 = 1, HPE | β1 = 1, PMLE | β2 = 1, HPE | β2 = 1, PMLE | β3 = 1, HPE | β3 = 1, PMLE |
|---|---|---|---|---|---|---|---|---|
| 0.2 | 500 | Mean | 5.322 | 2.618 | 5.333 | 2.619 | 5.329 | 2.623 |
| 0.2 | 500 | (s.d.) | (8.844) | (0.839) | (8.872) | (0.855) | (8.863) | (0.870) |
| 0.2 | 1000 | Mean | 5.308 | 2.616 | 5.296 | 2.616 | 5.289 | 2.618 |
| 0.2 | 1000 | (s.d.) | (7.612) | (0.560) | (7.570) | (0.560) | (7.568) | (0.564) |
| 0.2 | 1500 | Mean | 5.247 | 2.604 | 5.239 | 2.602 | 5.235 | 2.604 |
| 0.2 | 1500 | (s.d.) | (6.624) | (0.540) | (6.606) | (0.536) | (6.613) | (0.543) |
| 0.4 | 500 | Mean | 3.610 | 1.329 | 3.614 | 1.329 | 3.608 | 1.328 |
| 0.4 | 500 | (s.d.) | (5.305) | (0.362) | (5.311) | (0.365) | (5.290) | (0.366) |
| 0.4 | 1000 | Mean | 3.600 | 1.318 | 3.593 | 1.316 | 3.588 | 1.315 |
| 0.4 | 1000 | (s.d.) | (4.192) | (0.355) | (4.177) | (0.355) | (4.178) | (0.353) |
| 0.4 | 1500 | Mean | 3.456 | 1.281 | 3.441 | 1.281 | 3.438 | 1.278 |
| 0.4 | 1500 | (s.d.) | (3.818) | (0.342) | (3.793) | (0.343) | (3.798) | (0.339) |
| 0.6 | 500 | Mean | 2.898 | 0.972 | 2.876 | 0.966 | 2.885 | 0.969 |
| 0.6 | 500 | (s.d.) | (3.761) | (0.271) | (3.723) | (0.268) | (3.735) | (0.271) |
| 0.6 | 1000 | Mean | 2.669 | 0.981 | 2.669 | 0.979 | 2.657 | 0.978 |
| 0.6 | 1000 | (s.d.) | (2.951) | (0.261) | (2.953) | (0.261) | (2.916) | (0.259) |
| 0.6 | 1500 | Mean | 2.508 | 1.016 | 2.499 | 1.015 | 2.501 | 1.016 |
| 0.6 | 1500 | (s.d.) | (2.726) | (0.250) | (2.706) | (0.250) | (2.708) | (0.253) |
| 0.8 | 500 | Mean | 2.246 | 0.805 | 2.237 | 0.801 | 2.249 | 0.802 |
| 0.8 | 500 | (s.d.) | (2.810) | (0.373) | (2.803) | (0.373) | (2.841) | (0.392) |
| 0.8 | 1000 | Mean | 2.098 | 0.843 | 2.096 | 0.843 | 2.082 | 0.843 |
| 0.8 | 1000 | (s.d.) | (2.281) | (0.349) | (2.279) | (0.349) | (2.246) | (0.340) |
| 0.8 | 1500 | Mean | 2.086 | 0.884 | 2.096 | 0.886 | 2.094 | 0.886 |
| 0.8 | 1500 | (s.d.) | (2.059) | (0.316) | (2.071) | (0.314) | (2.073) | (0.318) |

* Results are presented for our new Partial Maximum Likelihood Estimator (PMLE) and the Heteroskedastic Probit Estimator (HPE) of β1, β2 and β3. Numbers in brackets show standard deviations (s.d.).

Table 3*
Case 2: Simulation results of different estimators of λ in the context of the bivariate spatial probit model.

| N | Statistic | λ = 0.2, HPE | λ = 0.2, PMLE | λ = 0.4, HPE | λ = 0.4, PMLE | λ = 0.6, HPE | λ = 0.6, PMLE | λ = 0.8, HPE | λ = 0.8, PMLE |
|---|---|---|---|---|---|---|---|---|---|
| 500 | Mean | 2.151 | 0.381 | 2.575 | 0.667 | 2.491 | 0.970 | 2.876 | 1.202 |
| 500 | Bias | 1.951 | 0.181 | 1.175 | 0.267 | 1.891 | 0.370 | 2.076 | 0.402 |
| 500 | (s.d.) | (4.630) | (0.844) | (5.073) | (0.923) | (4.996) | (0.913) | (6.213) | (0.966) |
| 1000 | Mean | 1.013 | 0.356 | 1.089 | 0.606 | 1.307 | 0.863 | 1.660 | 1.160 |
| 1000 | Bias | 0.813 | 0.156 | 0.689 | 0.206 | 0.707 | 0.263 | 0.860 | 0.360 |
| 1000 | (s.d.) | (2.131) | (0.606) | (2.241) | (0.622) | (2.424) | (0.671) | (2.675) | (0.813) |
| 1500 | Mean | 0.684 | 0.324 | 0.792 | 0.592 | 0.906 | 0.860 | 1.305 | 1.156 |
| 1500 | Bias | 0.484 | 0.124 | 0.392 | 0.192 | 0.306 | 0.260 | 0.505 | 0.356 |
| 1500 | (s.d.) | (1.508) | (0.484) | (1.566) | (0.515) | (1.611) | (0.601) | (1.910) | (0.706) |

* Results are presented for the bivariate Partial Maximum Likelihood Estimator (PMLE) and the Heteroskedastic Probit Estimator (HPE) of λ. Numbers in brackets are standard deviations (s.d.).


Table 4*
Case 2: Simulation results of different estimators of β1, β2 and β3 in the context of the bivariate spatial probit model.

| λ | N | Statistic | β1 = 1, HPE | β1 = 1, PMLE | β2 = 1, HPE | β2 = 1, PMLE | β3 = 1, HPE | β3 = 1, PMLE |
|---|---|---|---|---|---|---|---|---|
| 0.2 | 500 | Mean | 1.043 | 1.021 | 1.042 | 1.020 | 1.042 | 1.020 |
| 0.2 | 500 | (s.d.) | (0.120) | (0.109) | (0.114) | (0.104) | (0.122) | (0.108) |
| 0.2 | 1000 | Mean | 1.017 | 1.006 | 1.022 | 1.011 | 1.016 | 1.005 |
| 0.2 | 1000 | (s.d.) | (0.079) | (0.071) | (0.078) | (0.071) | (0.079) | (0.070) |
| 0.2 | 1500 | Mean | 1.012 | 1.004 | 1.013 | 1.005 | 1.010 | 1.002 |
| 0.2 | 1500 | (s.d.) | (0.065) | (0.059) | (0.063) | (0.058) | (0.064) | (0.057) |
| 0.4 | 500 | Mean | 1.043 | 1.017 | 1.042 | 1.018 | 1.043 | 1.019 |
| 0.4 | 500 | (s.d.) | (0.125) | (0.112) | (0.112) | (0.105) | (0.119) | (0.110) |
| 0.4 | 1000 | Mean | 1.017 | 1.005 | 1.020 | 1.008 | 1.014 | 1.002 |
| 0.4 | 1000 | (s.d.) | (0.079) | (0.072) | (0.080) | (0.072) | (0.077) | (0.069) |
| 0.4 | 1500 | Mean | 1.014 | 1.005 | 1.013 | 1.004 | 1.009 | 1.000 |
| 0.4 | 1500 | (s.d.) | (0.065) | (0.058) | (0.061) | (0.056) | (0.062) | (0.056) |
| 0.6 | 500 | Mean | 1.046 | 1.022 | 1.045 | 1.022 | 1.043 | 1.020 |
| 0.6 | 500 | (s.d.) | (0.123) | (0.107) | (0.126) | (0.111) | (0.126) | (0.108) |
| 0.6 | 1000 | Mean | 1.019 | 1.006 | 1.020 | 1.007 | 1.015 | 1.003 |
| 0.6 | 1000 | (s.d.) | (0.083) | (0.075) | (0.084) | (0.075) | (0.079) | (0.071) |
| 0.6 | 1500 | Mean | 1.010 | 1.002 | 1.012 | 1.004 | 1.010 | 1.002 |
| 0.6 | 1500 | (s.d.) | (0.066) | (0.060) | (0.065) | (0.060) | (0.063) | (0.059) |
| 0.8 | 500 | Mean | 1.035 | 1.013 | 1.036 | 1.014 | 1.037 | 1.015 |
| 0.8 | 500 | (s.d.) | (0.125) | (0.115) | (0.125) | (0.111) | (0.125) | (0.115) |
| 0.8 | 1000 | Mean | 1.016 | 1.003 | 1.017 | 1.005 | 1.016 | 1.004 |
| 0.8 | 1000 | (s.d.) | (0.083) | (0.086) | (0.082) | (0.090) | (0.084) | (0.088) |
| 0.8 | 1500 | Mean | 1.011 | 1.000 | 1.012 | 1.001 | 1.012 | 1.001 |
| 0.8 | 1500 | (s.d.) | (0.065) | (0.058) | (0.064) | (0.057) | (0.064) | (0.056) |

* Results are presented for the bivariate Partial Maximum Likelihood Estimator (PMLE) and the Heteroskedastic Probit Estimator (HPE) of β1, β2 and β3. Numbers in brackets are standard deviations (s.d.).

References

Andrews, D.W.K., 1991. Heteroskedasticity and autocorrelation consistent covariance matrix estimation. Econometrica 59 (3), 817–858.

Anselin, L., Florax, R.J.G.M., 1995. New Directions in Spatial Econometrics. Springer-Verlag, Berlin, Germany.

Anselin, L., Florax, R.J.G.M., Rey, J.S., 2004. Econometrics for Spatial Models: Recent Advances. In: Advances in Spatial Econometrics, Springer-Verlag, Berlin, Germany, pp. 1–28.

Beron, K.J., Vijverberg, W.P., 2003. Probit in a Spatial Context: A Monte Carlo Approach. In: Advances in Spatial Econometrics, Springer-Verlag, Berlin, Germany, pp. 169–196.

Blundell, R., Powell, J.L., 2004. Endogeneity in semiparametric binary response models. Review of Economic Studies 71, 655–679.

Case, A.C., 1991. Spatial patterns in household demand. Econometrica 59, 953–965.

Conley, T.G., 1999. GMM estimation with cross sectional dependence. Journal of Econometrics 92, 1–45.

Davidson, J., 1994. Stochastic Limit Theory. Oxford University Press, Oxford.

Kelejian, H.H., Prucha, I.R., 1999. A generalized moments estimator for the autoregressive parameter in a spatial model. International Economic Review 40, 509–533.

Kelejian, H.H., Prucha, I.R., 2001. On the asymptotic distribution of the Moran I test statistic with applications. Journal of Econometrics 104, 219–257.

Lee, L.-F., 2004. Asymptotic distribution of quasi-maximum likelihood estimators for spatial autoregressive models. Econometrica 72 (6), 1899–1925.

Lesage, J.P., 2000. Bayesian estimation of limited dependent variable spatial autoregressive models. Geographical Analysis 32, 19–35.

McMillen, D.P., 1992. Probit with spatial autocorrelation. Journal of Regional Science 32, 335–348.

McMillen, D.P., 1995. Spatial Effects in Probit Models: A Monte Carlo Investigation. In: New Directions in Spatial Econometrics, Springer-Verlag, Berlin, Germany, pp. 189–228.

Newey, W.K., McFadden, D., 1994. Large sample estimation and hypothesis testing. In: Handbook of Econometrics, Vol 4. North-Holland, New York, Ch. 36.

Newey, W.K., West, K.D., 1987. A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica 55, 703–708.

Pinkse, J., Slade, M.E., 1998. Contracting in space: an application of spatial statistics to discrete-choice models. Journal of Econometrics 85, 125–154.

Pinkse, J., Slade, M.E., Shen, L., 2006. Dynamic spatial discrete choice using one-step GMM: an application to mine operating decisions. Spatial Economic Analysis 1 (1), 53–99.

Poirier, D., Ruud, P.A., 1988. Probit with dependent observations. Review of Economic Studies 55, 593–614.

Robinson, P.M., 1982. On the asymptotic properties of estimators of models containing limited dependent variables. Econometrica 50, 27–41.

Wooldridge, J.M., 2005. Unobserved heterogeneity and estimation of average partial effects. In: Andrews, D.W.K., Stock, J.H. (Eds.), Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg. Cambridge University Press, Cambridge, pp. 27–55.

Wooldridge, J.M., 2010. Econometric Analysis of Cross Section and Panel Data, second ed. MIT Press, Cambridge, Massachusetts.
