
Econometrics Class Notes

Manuel Arellano

CEMFI

March 2018


ECONOMETRICS Manuel Arellano

CEMFI, 2017-2018

Lectures: Mon 9:30-11:00, Wed 9:30-11:00.
Exercises: Wed 11:30-13:00 (even weeks), conducted by José Gutiérrez ([email protected]).
Computing workshop: Wed 11:30-13:00 (odd weeks), conducted by Siqi Wei ([email protected]).

Grades will be based on class exercises (15%), computer workshop (15%), and final exam (70%).

Course outline

Part I: Regression
1. Least squares
   1.1 The linear model
   1.2 Asymptotic properties of OLS
   1.3 Flexible nonlinear regression
   1.4 Beyond means: quantile regression
2. Heteroskedasticity and clustering
   2.1 Robust variance estimation
   2.2 Cluster standard errors
   2.3 Generalized least squares estimation
   2.4 Grouped data

Part II: Likelihood methods
3. Maximum likelihood and large sample testing
   3.1 Consistency and asymptotic normality
   3.2 Asymptotic testing
   3.3 M-estimators
   3.4 Bootstrap methods
4. Bayesian inference
   4.1 Bayesian analysis
   4.2 Specifying prior distributions
   4.3 Large-sample Bayesian inference
   4.4 Markov chain Monte Carlo methods


Part III: Time series
5. Stochastic processes
   5.1 Stationarity and ergodicity
   5.2 Autoregressive and moving average models
   5.3 Asymptotics with dependent observations
   5.4 Nonstationary processes
6. Time series regression
   6.1 Regression models for time series
   6.2 Robust inference
   6.3 Granger causality
   6.4 Cointegration

Part IV: Endogeneity
7. Instrumental variables
   7.1 Measurement error
   7.2 Simultaneity
   7.3 Two-stage least squares
   7.4 Generalized method of moments
8. Endogenous treatment effects
   8.1 Potential outcomes
   8.2 Matching methods
   8.3 Self-selected treatment
   8.4 Local average treatment effects

Textbooks
F. Hayashi, Econometrics, Princeton University Press, 2000 (main text).
J. Angrist and J.-S. Pischke, Mostly Harmless Econometrics: An Empiricist's Companion, Princeton University Press, 2009.
C. Gourieroux and A. Monfort, Statistics and Econometric Models, vol. 1 and 2, Cambridge University Press, 1995.
J. D. Hamilton, Time Series Analysis, Princeton University Press, 1994.
J. Stock and M. Watson, Introduction to Econometrics, 2nd edition, Pearson Education, 2007.
J. Wooldridge, Econometric Analysis of Cross Section and Panel Data, MIT Press, 2010.


Regression Class Notes

Manuel Arellano

January 20, 2017

1 Means and predictors

Given some data $y_1, \ldots, y_n$ we could calculate a mean $\bar{y} = (1/n)\sum_{i=1}^{n} y_i$ as a single quantity that summarizes the $n$ data points. $\bar{y}$ is an optimal predictor that minimizes mean squared error:

$$\bar{y} = \arg\min_a \sum_{i=1}^{n} (y_i - a)^2.$$

Now if we have data on two variables for the same units $\{y_i, x_i\}_{i=1}^{n}$, we can get a better predictor of $y$ using the additional information in $x$ by calculating the regression line $\hat{y}_i = \hat{a} + \hat{b} x_i$, where

$$(\hat{a}, \hat{b}) = \arg\min_{a,b} \sum_{i=1}^{n} (y_i - a - b x_i)^2.$$

More generally, if $x_i$ is a vector $x_i = (1, x_{2i}, \ldots, x_{ki})'$, we calculate the linear predictor $\hat{y}_i = x_i'\hat{\beta}$ where

$$\hat{\beta} = \arg\min_b \sum_{i=1}^{n} (y_i - x_i' b)^2. \quad (1)$$

The algebra of linear predictors The first order conditions of (1) are

$$\sum_{i=1}^{n} x_i (y_i - x_i'\hat{\beta}) = 0. \quad (2)$$

If $\sum_{i=1}^{n} x_i x_i'$ is full rank (which requires $n \geq k$) there is a unique solution:

$$\hat{\beta} = \left( \sum_{i=1}^{n} x_i x_i' \right)^{-1} \sum_{i=1}^{n} x_i y_i. \quad (3)$$

We may use the compact notation $X'X = \sum_{i=1}^{n} x_i x_i'$ and $X'y = \sum_{i=1}^{n} x_i y_i$, where $y = (y_1, \ldots, y_n)'$ and $X = (x_1, \ldots, x_n)'$.

Denoting the residuals as $\hat{u}_i = y_i - x_i'\hat{\beta}$, from the first order conditions (2) we can immediately say that, as long as a constant term is included in $x_i$:

$$\frac{1}{n}\sum_{i=1}^{n} \hat{u}_i = 0, \qquad \frac{1}{n}\sum_{i=1}^{n} x_{ji}\hat{u}_i = 0 \quad \text{for } j = 2, \ldots, k.$$

Therefore, the mean of the residuals is zero and the covariance between the residuals and each of the $x$ variables is also zero. Moreover, since $\hat{y}_i$ is a linear combination of $x_i$, the covariance between $\hat{u}_i$ and $\hat{y}_i$ is also zero. We conclude that a linear regression decomposes $y_i$ into two orthogonal components:

$$y_i = \hat{y}_i + \hat{u}_i,$$

so that $\widehat{Var}(y_i) = \widehat{Var}(\hat{y}_i) + \widehat{Var}(\hat{u}_i)$. The $R^2$ measures the fraction of the variance of $y_i$ that is accounted for by $\hat{y}_i$:

$$R^2 = \frac{\widehat{Var}(\hat{y}_i)}{\widehat{Var}(y_i)}.$$
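As a quick numerical check of these identities, here is a minimal sketch in Python; the simulated data, sample size, and coefficient values are illustrative assumptions, not part of the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # constant included
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

# OLS coefficients from equation (3): (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
u_hat = y - y_hat

print(u_hat.mean())                   # residuals have zero mean (~0)
print(X[:, 1:].T @ u_hat / n)         # and zero covariance with each x (~0)
print(np.var(y), np.var(y_hat) + np.var(u_hat))  # variance decomposition
print(np.var(y_hat) / np.var(y))      # R^2
```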


2 Consistency and asymptotic normality of linear predictors

If our data $\{y_i, x_i\}_{i=1}^{n}$ are a random sample from some population, we can study the properties of $\hat{\beta}$ as an estimator of the corresponding population quantity:

$$\beta = [E(x_i x_i')]^{-1} E(x_i y_i), \quad (4)$$

where we require that $E(x_i x_i')$ has full rank.

Letting the population linear predictor error be $u_i = y_i - x_i'\beta$, the estimation error is

$$\hat{\beta} - \beta = \left( \frac{1}{n}\sum_{i=1}^{n} x_i x_i' \right)^{-1} \frac{1}{n}\sum_{i=1}^{n} x_i u_i.$$

Clearly, $E(x_i u_i) = 0$, since $\beta$ solves the first-order conditions $E[x_i(y_i - x_i'\beta)] = 0$. By Slutsky's theorem and the law of large numbers:

$$\operatorname{plim}_{n\to\infty} (\hat{\beta} - \beta) = \left( \operatorname{plim}_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} x_i x_i' \right)^{-1} \operatorname{plim}_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} x_i u_i = [E(x_i x_i')]^{-1} E(x_i u_i) = 0. \quad (5)$$

Therefore, $\hat{\beta}$ is a consistent estimator of $\beta$. Moreover, because of the central limit theorem,

$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} x_i u_i \xrightarrow{d} N(0, V)$$

where $V = E(u_i^2 x_i x_i')$. In addition, using Cramér's theorem we can assert that

$$\sqrt{n}(\hat{\beta} - \beta) \xrightarrow{d} N(0, W) \quad (6)$$

where

$$W = [E(x_i x_i')]^{-1} E(u_i^2 x_i x_i') [E(x_i x_i')]^{-1}, \quad (7)$$

and also for individual coefficients:

$$\sqrt{n}(\hat{\beta}_j - \beta_j) \xrightarrow{d} N(0, w_{jj}) \quad (8)$$

where $w_{jj}$ is the $j$-th diagonal element of $W$.

Asymptotic standard errors and confidence intervals A consistent estimator of $W$ is:

$$\widehat{W} = \left( \frac{1}{n}\sum_{i=1}^{n} x_i x_i' \right)^{-1} \left( \frac{1}{n}\sum_{i=1}^{n} \hat{u}_i^2 x_i x_i' \right) \left( \frac{1}{n}\sum_{i=1}^{n} x_i x_i' \right)^{-1}. \quad (9)$$

The quantity $\sqrt{\hat{w}_{jj}/n}$ is called an asymptotic standard error of $\hat{\beta}_j$, or simply a standard error. It is an approximate standard deviation of $\hat{\beta}_j$ in a large sample, and it is used as a measure of the precision of an estimate.


Due to Cramér's theorem:

$$\frac{\hat{\beta}_j - \beta_j}{\sqrt{\hat{w}_{jj}/n}} \xrightarrow{d} N(0, 1). \quad (10)$$

The use of this statement is in calculating approximate confidence intervals. A 95% large-sample confidence interval is:

$$\left( \hat{\beta}_j - 1.96\sqrt{\hat{w}_{jj}/n},\ \hat{\beta}_j + 1.96\sqrt{\hat{w}_{jj}/n} \right). \quad (11)$$
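Continuing the simulated example above, a minimal sketch of (9)-(11): the sandwich estimator, the standard errors, and 95% intervals (variable names refer to the previous snippet):

```python
# Heteroskedasticity-robust variance, equation (9)
Sxx = X.T @ X / n                                # (1/n) sum x_i x_i'
meat = (X * (u_hat**2)[:, None]).T @ X / n       # (1/n) sum u_i^2 x_i x_i'
W_hat = np.linalg.solve(Sxx, np.linalg.solve(Sxx, meat).T)  # Sxx^-1 meat Sxx^-1
se = np.sqrt(np.diag(W_hat) / n)                 # standard errors
ci = np.column_stack([beta_hat - 1.96 * se, beta_hat + 1.96 * se])  # equation (11)
print(se, ci, sep="\n")
```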

3 Classical regression model

A linear predictor is the best linear approximation to the conditional mean of $y$ given $x$ in the sense:

$$\beta = \arg\min_b E\left\{ [E(y_i \mid x_i) - x_i'b]^2 \right\}. \quad (12)$$

That is, $x_i'\beta$ minimizes the mean squared approximation error, where the mean is taken with respect to the distribution of $x$. Therefore, changing the distribution of $x$ will change the linear predictor unless the conditional mean is linear, in which case $E(y_i \mid x_i) = x_i'\beta$.

If $E\{[E(y_i \mid x_i) - x_i'\beta]^2\}$ is not zero or close to zero, $x_i'\hat{\beta}$ will not be a very informative summary of the dependence in mean between $y$ and $x$. In general, the use of a linear predictor is hard to motivate if the conditional mean is markedly nonlinear.

The classical regression model is a linear model that makes the following two assumptions:

$$E(y \mid X) = X\beta \quad (A1)$$
$$Var(y \mid X) = \sigma^2 I_n. \quad (A2)$$

The first assumption (A1) asserts that $E(y_i \mid x_1, \ldots, x_n) = x_i'\beta$ for all $i$. This assumption contains two parts. The first one is that $E(y_i \mid x_1, \ldots, x_n) = E(y_i \mid x_i)$; this part of the assumption will always hold if $\{y_i, x_i\}_{i=1}^{n}$ is a random sample, and is sometimes called strict exogeneity. The second part is the linearity assumption $E(y_i \mid x_i) = x_i'\beta$. Under A1, $\hat{\beta}$ is an unbiased estimator:

$$E(\hat{\beta} \mid X) = (X'X)^{-1} X' E(y \mid X) = \beta \quad (13)$$

and therefore also $E(\hat{\beta}) = \beta$ by the law of iterated expectations.

The second assumption (A2) says that $Var(y_i \mid x_1, \ldots, x_n) = \sigma^2$ and $Cov(y_i, y_j \mid x_1, \ldots, x_n) = 0$ for all $i \neq j$. Under random sampling, $Var(y_i \mid x_1, \ldots, x_n) = Var(y_i \mid x_i)$ and $Cov(y_i, y_j \mid x_1, \ldots, x_n) = 0$ always hold. Assumption A2 also requires that $Var(y_i \mid x_i)$ is constant for all $x_i$; this situation is called homoskedasticity. The alternative situation, in which $Var(y_i \mid x_i)$ may vary with $x_i$, is called heteroskedasticity. When the data are time series, the zero covariance condition $Cov(y_i, y_j \mid x_1, \ldots, x_n) = 0$ is called lack of autocorrelation.


Under A2 the variance matrix of $\hat{\beta}$ given $X$ is

$$Var(\hat{\beta} \mid X) = \sigma^2 (X'X)^{-1}. \quad (14)$$

Moreover, under A2, since $E(u_i^2 x_i x_i') = \sigma^2 E(x_i x_i')$, the sandwich formula (7) becomes

$$W = \sigma^2 [E(x_i x_i')]^{-1}. \quad (15)$$

To obtain an unbiased estimator of $\sigma^2$, note that under A2, letting $M = I_n - X(X'X)^{-1}X'$, we have

$$E(\hat{u}'\hat{u}) = E[E(u'Mu \mid X)] = E(\operatorname{tr}[M\, E(uu' \mid X)]) = \sigma^2 \operatorname{tr}(M) = \sigma^2 (n - k), \quad (16)$$

so that an unbiased estimator of $\sigma^2$ is given by the degrees-of-freedom-corrected residual variance:

$$\hat{\sigma}^2 = \frac{\hat{u}'\hat{u}}{n - k}. \quad (17)$$

Sampling distributions under conditional normality Consider as a third assumption:

$$y \mid X \sim N(X\beta, \sigma^2 I_n). \quad (A3)$$

Under A3:

$$\hat{\beta} \mid X \sim N(\beta, \sigma^2 (X'X)^{-1}), \quad (18)$$

so that also

$$\hat{\beta}_j \mid X \sim N(\beta_j, \sigma^2 a_{jj}) \quad (19)$$

where $a_{jj}$ is the $j$-th diagonal element of $(X'X)^{-1}$. Moreover, conditionally and unconditionally we have

$$z_j \equiv \frac{\hat{\beta}_j - \beta_j}{\sqrt{\sigma^2 a_{jj}}} \sim N(0, 1). \quad (20)$$

This result, which holds exactly for the normal classical regression model, also holds under homoskedasticity as a large-sample approximation for linear predictors and non-normal populations, in light of (8), (15), and Cramér's theorem.

Heteroskedasticity-consistent standard errors Note that the validity of the large-sample results in (9), (10) and (11) does not require homoskedasticity. This is why the asymptotic standard errors $\sqrt{\hat{w}_{jj}/n}$ calculated from (9) are usually called heteroskedasticity-consistent or White standard errors, after the work of Halbert White.


Other distributional results The other key exact distributional results in this context are

$$\frac{\hat{u}'\hat{u}}{\sigma^2} \sim \chi^2_{n-k} \quad \text{independent of } z_j \quad (21)$$

and

$$\frac{\hat{\beta}_j - \beta_j}{\sqrt{\hat{\sigma}^2 a_{jj}}} \sim t_{n-k}. \quad (22)$$

In addition, letting now $\hat{\beta}_j$ denote a subset of $r$ coefficients and $A_{jj}$ the corresponding submatrix of $(X'X)^{-1}$, we have

$$\frac{(\hat{\beta}_j - \beta_j)' A_{jj}^{-1} (\hat{\beta}_j - \beta_j)}{\sigma^2} \sim \chi^2_r \quad (23)$$

and

$$\frac{(\hat{\beta}_j - \beta_j)' A_{jj}^{-1} (\hat{\beta}_j - \beta_j)/r}{\hat{\sigma}^2} \sim F_{r,\,n-k}. \quad (24)$$

4 Weighted least squares

The ordinary least squares (OLS) statistic $\hat{\beta}$ is a function of simple means of $x_i x_i'$ and $x_i y_i$. Under heteroskedasticity it may make sense to consider weighted means in which observations with a smaller variance receive a larger weight. Let us consider estimators of the form

$$\tilde{\beta} = \left( \sum_{i=1}^{n} w_i x_i x_i' \right)^{-1} \sum_{i=1}^{n} w_i x_i y_i \quad (25)$$

where the $w_i$ are some weights. OLS is the special case in which $w_i = 1$ for all $i$.

Under appropriate regularity conditions,

$$\operatorname{plim}(\tilde{\beta} - \beta) = [E(w_i x_i x_i')]^{-1} E(w_i x_i u_i). \quad (26)$$

Thus, in general, to ensure consistency of $\tilde{\beta}$ we need $E(w_i x_i u_i) = 0$. This will hold if $E(u_i \mid x_i) = 0$ and $w_i = w(x_i)$ is a function of $x_i$ only:

$$E(w_i x_i u_i) = E(w_i x_i E(u_i \mid x_i)) = 0,$$

but more generally $\tilde{\beta}$ is not a consistent estimator of the population linear projection coefficient $\beta$ when $E(y_i \mid x_i) \neq x_i'\beta$.¹

Subject to consistency, the asymptotic normality result is

$$\sqrt{n}(\tilde{\beta} - \beta) \xrightarrow{d} N\left( 0,\ [E(w_i x_i x_i')]^{-1} E(u_i^2 w_i^2 x_i x_i') [E(w_i x_i x_i')]^{-1} \right). \quad (27)$$

¹ Actually, if $x_i$ has density $f(x)$, $\tilde{\beta}$ is consistent for the optimal linear predictor under an alternative probability distribution of $x_i$ given by $g(x) \propto f(x) w(x)$.


Asymptotic efficiency When the weights are chosen to be proportional to the reciprocal of $\sigma_i^2 = E(u_i^2 \mid x_i)$, the asymptotic variance in (27) becomes

$$\left[ E\left( \frac{x_i x_i'}{\sigma_i^2} \right) \right]^{-1}. \quad (28)$$

Moreover, it can be shown that for any (conformable) vector $q$:

$$q' [E(w_i x_i x_i')]^{-1} E(\sigma_i^2 w_i^2 x_i x_i') [E(w_i x_i x_i')]^{-1} q \ \geq\ q' \left[ E\left( \frac{x_i x_i'}{\sigma_i^2} \right) \right]^{-1} q. \quad (29)$$

Statement (29) says that the asymptotic variance of any linear combination of weighted LS estimates $q'\tilde{\beta}$ is smallest when the weights are $w_i \propto 1/\sigma_i^2$. To prove (29), note that²

$$E\left( \frac{x_i x_i'}{\sigma_i^2} \right) - E(w_i x_i x_i') [E(\sigma_i^2 w_i^2 x_i x_i')]^{-1} E(w_i x_i x_i') = H' E(m_i m_i') H \quad (30)$$

where

$$H = \begin{pmatrix} I \\ -[E(\sigma_i^2 w_i^2 x_i x_i')]^{-1} E(w_i x_i x_i') \end{pmatrix}, \qquad m_i = \begin{pmatrix} x_i/\sigma_i \\ \sigma_i w_i x_i \end{pmatrix}.$$

Also note that for any $q$ we have $q' [H' E(m_i m_i') H] q \geq 0$.

Generalized least squares In view of (29) we can say that the estimator

$$\tilde{\beta}_{GLS} = \left( \sum_{i=1}^{n} \frac{x_i x_i'}{\sigma_i^2} \right)^{-1} \sum_{i=1}^{n} \frac{x_i y_i}{\sigma_i^2} \quad (31)$$

is asymptotically efficient in the sense of having the smallest asymptotic variance among the class of consistent weighted least squares estimators. $\tilde{\beta}_{GLS}$ is a generalized least squares (GLS) estimator. In matrix notation:

$$\tilde{\beta}_{GLS} = (X'\Omega^{-1}X)^{-1} X'\Omega^{-1} y \quad (32)$$

where $\Omega = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_n^2)$.

In a generalized classical regression model we have $E(y \mid X) = X\beta$ and $Var(y \mid X) = \Omega$. The asymptotic normality result is

$$\sqrt{n}(\tilde{\beta}_{GLS} - \beta) \xrightarrow{d} N\left( 0,\ \left[ E\left( \frac{x_i x_i'}{\sigma_i^2} \right) \right]^{-1} \right). \quad (33)$$

Usually $\tilde{\beta}_{GLS}$ is an infeasible estimator because $\sigma_i^2$ is an unknown function of $x_i$. In feasible GLS estimation, $\sigma_i^2$ is replaced by a (parametric or nonparametric) estimated quantity. The large-sample properties of the resulting estimator may or may not coincide with those of the infeasible GLS.

² We are using the fact that if $A$ and $B$ are positive definite matrices, then $A - B$ is positive definite if and only if $B^{-1} - A^{-1}$ is positive definite.
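As an illustration, a minimal feasible GLS sketch in Python, continuing the earlier simulated example; the exponential skedastic specification estimated in the first step is an assumption made here for illustration, not a recommendation of the notes:

```python
# Step 1: estimate sigma_i^2 with an assumed parametric skedastic function,
# modelling log squared OLS residuals as linear in x_i
gamma = np.linalg.solve(X.T @ X, X.T @ np.log(u_hat**2))
sigma2_i = np.exp(X @ gamma)                 # fitted conditional variances
w = 1.0 / sigma2_i                           # weights proportional to 1/sigma_i^2

# Step 2: weighted least squares, equation (25)
Xw = X * w[:, None]
beta_fgls = np.linalg.solve(Xw.T @ X, Xw.T @ y)
print(beta_fgls)
```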


5 Cluster-robust standard errors

Suppose the sample $\{y_i, x_i\}_{i=1}^{n}$ consists of $H$ groups or clusters of $M_h$ observations each ($n = M_1 + \cdots + M_H$), such that observations are independent across groups but dependent within groups, $H$ is large, and $M_h$ is small (fixed) for all $h$. For convenience, let us order observations by groups and use a double-index notation $(y_{hm}, x_{hm})$ for $h = 1, \ldots, H$ (group index) and $m = 1, \ldots, M_h$ (within-group index).

The compact notation for linear regression was $y = X\beta + u$. A similar notation for the observations in cluster $h$ is

$$y_h = X_h \beta + u_h \quad (34)$$

where $y_h = (y_{h1}, \ldots, y_{hM_h})'$, etc. Using this notation, the OLS estimator is

$$\hat{\beta} = (X'X)^{-1} X'y = \left( \sum_{h=1}^{H} X_h' X_h \right)^{-1} \sum_{h=1}^{H} X_h' y_h. \quad (35)$$

Note that in terms of individual observations we can write $X'y = \sum_{h=1}^{H} \sum_{m=1}^{M_h} x_{hm} y_{hm}$, etc.

The scaled estimation error is

$$\sqrt{H}(\hat{\beta} - \beta) = \left( \frac{X'X}{H} \right)^{-1} \frac{1}{\sqrt{H}} \sum_{h=1}^{H} X_h' u_h.$$

Applying the central limit theorem at the cluster level, a consistent estimate of the variance of $\sqrt{H}(\hat{\beta} - \beta)$ is given by

$$\left( \frac{X'X}{H} \right)^{-1} \frac{1}{H} \sum_{h=1}^{H} X_h' \hat{u}_h \hat{u}_h' X_h \left( \frac{X'X}{H} \right)^{-1}, \quad (36)$$

so that cluster-robust standard errors can be obtained as the square roots of the diagonal elements of the covariance matrix

$$\widehat{Var}(\hat{\beta}) = (X'X)^{-1} \left( \sum_{h=1}^{H} X_h' \hat{u}_h \hat{u}_h' X_h \right) (X'X)^{-1}. \quad (37)$$

This is the sandwich formula associated with clustering. Its rationale is as a large-$H$ approximation. There are many applications of this tool, both with actual cluster survey designs and with other data sets with potential group-level dependence.
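A minimal sketch of formula (37) in Python; the `cluster` vector of group labels is a hypothetical addition to the earlier simulated data:

```python
cluster = rng.integers(0, 50, size=n)        # hypothetical cluster labels
XtX_inv = np.linalg.inv(X.T @ X)
meat = np.zeros((k, k))
for h in np.unique(cluster):
    idx = cluster == h
    s_h = X[idx].T @ u_hat[idx]              # X_h' u_h, a k-vector
    meat += np.outer(s_h, s_h)               # X_h' u_h u_h' X_h
V_cluster = XtX_inv @ meat @ XtX_inv         # equation (37)
se_cluster = np.sqrt(np.diag(V_cluster))     # cluster-robust standard errors
print(se_cluster)
```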


Maximum Likelihood Class Notes

Manuel Arellano

Revised: January 17, 2018

1 Likelihood models

Given some data $w_1, \ldots, w_n$, a likelihood model is a family of density functions $f_n(w_1, \ldots, w_n; \theta)$, $\theta \in \Theta \subset \mathbb{R}^p$, that is specified as a probability model for the observed data. If the data are a random sample from some population with density function $g(w_i)$, then $f_n(w_1, \ldots, w_n; \theta)$ is a model for $\prod_{i=1}^{n} g(w_i)$. If the model is correctly specified, $f_n(w_1, \ldots, w_n; \theta) = \prod_{i=1}^{n} f(w_i, \theta)$ and $g(w_i) = f(w_i, \theta_0)$ for some value $\theta_0$ in $\Theta$. The function $f$ may be a conditional or an unconditional density. It may be flexible enough to be unrestricted, or it may place restrictions on the form of the population density.¹

In the likelihood function $L(\theta) = \prod_{i=1}^{n} f(w_i, \theta)$, the data are given and $\theta$ is the argument. Usually we work with the log likelihood function:

$$L(\theta) = \sum_{i=1}^{n} \ln f(w_i, \theta). \quad (1)$$

$L(\theta)$ measures the ability of different densities within the pre-specified class to generate the data. The maximum likelihood estimator (MLE) is the value of $\theta$ associated with the largest likelihood:

$$\hat{\theta} = \arg\max_{\theta \in \Theta} L(\theta). \quad (2)$$

Example 1: Linear regression under normality The model is $y \mid X \sim N(X\beta, \sigma^2 I_n)$, so that

$$f_n(y_1, \ldots, y_n \mid x_1, \ldots, x_n; \theta) = \prod_{i=1}^{n} f(y_i \mid x_i; \theta)$$

where $\theta = (\beta', \sigma^2)'$ and

$$f(y_i \mid x_i; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{1}{2\sigma^2} (y_i - x_i'\beta)^2 \right] \quad (3)$$

$$L(\theta) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - x_i'\beta)^2. \quad (4)$$

The MLE $\hat{\theta} = (\hat{\beta}', \hat{\sigma}^2)'$ consists of the OLS estimator $\hat{\beta}$ and the residual variance $\hat{\sigma}^2$ without degrees of freedom adjustment. Letting $\hat{u} = y - X\hat{\beta}$, we have:

$$\hat{\beta} = (X'X)^{-1} X'y, \qquad \hat{\sigma}^2 = \frac{\hat{u}'\hat{u}}{n}. \quad (5)$$

¹ This note follows the practice of using the term density both for continuous random variables and for the probability function of discrete random variables, as for example in David Cox, Principles of Statistical Inference, 2006.


Example 2: Logit regression There is a binary dependent variable $y_i$ that takes only two values, 0 and 1. Therefore, in this case $f(y_i \mid x_i) = p_i^{y_i} (1 - p_i)^{1 - y_i}$ where $p_i = \Pr(y_i = 1 \mid x_i)$. In the logit model the log odds ratio depends linearly on $x_i$:

$$\ln\left( \frac{p_i}{1 - p_i} \right) = x_i'\theta,$$

so that $p_i = \Lambda(x_i'\theta)$ where $\Lambda$ is the logistic cdf $\Lambda(r) = 1/(1 + \exp(-r))$. Assuming that $f_n(y_1, \ldots, y_n \mid x_1, \ldots, x_n; \theta) = \prod_{i=1}^{n} f(y_i \mid x_i; \theta)$, the log likelihood function is

$$L(\theta) = \sum_{i=1}^{n} \left\{ y_i \ln \Lambda(x_i'\theta) + (1 - y_i) \ln[1 - \Lambda(x_i'\theta)] \right\}.$$

The first and second partial derivatives of $L(\theta)$ are:

$$\frac{\partial L(\theta)}{\partial\theta} = \sum_{i=1}^{n} x_i [y_i - \Lambda(x_i'\theta)], \qquad \frac{\partial^2 L(\theta)}{\partial\theta \partial\theta'} = -\sum_{i=1}^{n} x_i x_i' \Lambda_i (1 - \Lambda_i).$$

Since the Hessian matrix is negative semidefinite, as long as it is nonsingular there exists a single maximum of the likelihood function.² The first order conditions $\partial L(\theta)/\partial\theta = 0$ are nonlinear, but we can find their root $\hat{\theta}$ using the Newton-Raphson method of successive approximations. Namely, we begin by finding the root $\theta_1$ of a linear approximation of $\partial L(\theta)/\partial\theta$ around some initial value $\theta_0$, and iterate the procedure until convergence:

$$\theta_{j+1} = \theta_j - \left( \frac{\partial^2 L(\theta_j)}{\partial\theta \partial\theta'} \right)^{-1} \frac{\partial L(\theta_j)}{\partial\theta} \qquad (j = 0, 1, 2, \ldots).$$
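A minimal sketch of this Newton-Raphson iteration for the logit likelihood, in Python; the simulated data, the starting value at zero, and the convergence tolerance are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
theta_true = np.array([0.2, 1.0, -0.5])
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-X @ theta_true))).astype(float)

theta = np.zeros(k)                          # starting value theta_0
for _ in range(25):
    Lam = 1 / (1 + np.exp(-X @ theta))       # Lambda(x_i' theta)
    score = X.T @ (y - Lam)                  # first derivative of L(theta)
    hess = -(X * (Lam * (1 - Lam))[:, None]).T @ X   # Hessian of L(theta)
    step = np.linalg.solve(hess, score)
    theta = theta - step                     # Newton-Raphson update
    if np.max(np.abs(step)) < 1e-10:
        break
print(theta)                                 # MLE, close to theta_true
```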

What does $\hat{\theta}$ estimate? The population counterpart of the sample calculation (2) is

$$\theta_0 = \arg\max_{\theta \in \Theta} E[\ln f(W, \theta)] \quad (6)$$

where $W$ is a random variable with density $g(w)$, so that $E[\ln f(W, \theta)] = \int \ln f(w, \theta)\, g(w)\, dw$.

If the population density $g(w)$ belongs to the $f(w, \theta)$ family, then $\theta_0$ as defined in (6) is the true parameter value. In effect, due to Jensen's inequality:

$$\int \ln f(w, \theta) f(w, \theta_0)\, dw - \int \ln f(w, \theta_0) f(w, \theta_0)\, dw = \int \ln\left( \frac{f(w, \theta)}{f(w, \theta_0)} \right) f(w, \theta_0)\, dw \leq \ln \int \left( \frac{f(w, \theta)}{f(w, \theta_0)} \right) f(w, \theta_0)\, dw = \ln \int f(w, \theta)\, dw = \ln 1 = 0.$$

Thus, when $g(w) = f(w, \theta_0)$ we have

$$E[\ln f(W, \theta)] - E[\ln f(W, \theta_0)] \leq 0 \quad \text{for all } \theta.$$

² To see that the Hessian is negative semidefinite, note that using the decomposition $\kappa_i'\kappa_i = \Lambda_i(1 - \Lambda_i)$ with $\kappa_i' = [(1 - \Lambda_i)\Lambda_i^{1/2},\ \Lambda_i(1 - \Lambda_i)^{1/2}]$, the Hessian can be written as $\partial^2 L(\theta)/\partial\theta\partial\theta' = -\sum_{i=1}^{n} x_i \kappa_i' \kappa_i x_i'$.


The value $\theta_0$ can be interpreted more generally as follows:³

$$\theta_0 = \arg\min_{\theta \in \Theta} E\left[ \ln \frac{g(W)}{f(W, \theta)} \right]. \quad (7)$$

The quantity $E[\ln \frac{g(W)}{f(W,\theta)}] \equiv \int \ln\left( \frac{g(w)}{f(w,\theta)} \right) g(w)\, dw$ is the Kullback-Leibler divergence (KLD) from $f(w, \theta)$ to $g(w)$. The KLD is the expected log difference between $g(w)$ and $f(w, \theta)$ when the expectation is taken using $g(w)$. Thus, $f(w, \theta_0)$ can be regarded as the best approximation to $g(w)$ in the class $f(w, \theta)$ when the approximation is understood in the KLD sense.

If $g(w) = f(w, \theta_0)$ then $\theta_0$ is called the "true value". If $g(w)$ does not belong to the $f(w, \theta)$ class and $f(w, \theta_0)$ is just the best approximation to $g(w)$ in the KLD sense, then $\theta_0$ is called a "pseudo-true value".

The extent to which a pseudo-true value remains an interesting quantity is model specific. For example, (3) is a restrictive model of the conditional distribution of $y_i$ given $x_i$: first because it assumes that the dependence of $y_i$ on $x_i$ occurs exclusively through the conditional mean $E(y_i \mid x_i)$, and secondly because this conditional mean is assumed to be a linear function of $x_i$. However, if $y_i$ depends on $x_i$ in other ways, for example through the conditional variance, the parameter values $\theta_0 = (\beta_0', \sigma_0^2)'$ remain interpretable quantities: $\beta_0$ as $\partial E(y_i \mid x_i)/\partial x_i$ and $\sigma_0^2$ as the unconditional variance of the errors $u_i = y_i - x_i'\beta_0$. If $E(y_i \mid x_i)$ is a nonlinear function, $\beta_0$ and $\sigma_0^2$ can only be characterized as the linear projection regression coefficient vector and the linear projection error variance, respectively.

Pseudo maximum likelihood estimation The statistic $\hat{\theta}$ is the maximum likelihood estimator under the assumption that $g(w)$ belongs to the $f(w, \theta)$ class. In the absence of this assumption, $\hat{\theta}$ is a pseudo-maximum likelihood (PML) estimator based on the $f(w, \theta)$ family of densities. Sometimes $\hat{\theta}$ is called a quasi-maximum likelihood estimator.

2 Consistency and asymptotic normality of PML estimators

Under regularity and identification conditions, a PML estimator $\hat{\theta}$ is a consistent estimator of the (pseudo-)true value $\theta_0$. Since $\hat{\theta}$ may not have a closed form expression, we need a method for establishing the consistency of an estimator that maximizes an objective function. The following theorem, taken from Newey and McFadden (1994), provides such a method. The requirements are boundedness of the parameter space, uniform convergence of the objective function to some nonstochastic continuous limit, and that the limiting objective function is uniquely maximized at the truth (identification).

Consistency Theorem Suppose that $\hat{\theta}$ maximizes the objective function $S_n(\theta)$ in the parameter space $\Theta$. Assume the following:

(a) $\Theta$ is a compact set.

(b) The function $S_n(\theta)$ converges uniformly in probability to $S_0(\theta)$.

(c) $S_0(\theta)$ is continuous.

(d) $S_0(\theta)$ is uniquely maximized at $\theta_0$.

Then $\hat{\theta} \xrightarrow{p} \theta_0$.

³ Since $E[\ln \frac{g(W)}{f(W,\theta)}] = E[\ln g(W)] - E[\ln f(W, \theta)]$ and $E[\ln g(W)]$ does not depend on $\theta$, the arg min in (7) is the same as the arg max in (6).

In the PML context $S_n(\theta) = (1/n)\sum_{i=1}^{n} \ln f(w_i, \theta)$ and $S_0(\theta) = E[\ln f(W, \theta)]$. In particular, in the regression example, by the law of large numbers:⁴

$$S_0(\theta) = -\frac{1}{2}\ln(2\pi) - \frac{1}{2}\ln\sigma^2 - \frac{1}{2\sigma^2} E[(y_i - x_i'\beta)^2] \quad (8)$$

and, noting that $y_i - x_i'\beta \equiv u_i - x_i'(\beta - \beta_0)$, also

$$S_0(\theta) = -\frac{1}{2}\ln(2\pi) - \frac{1}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\left[ \sigma_0^2 + (\beta - \beta_0)' E(x_i x_i') (\beta - \beta_0) \right]. \quad (9)$$

In this example, $S_0(\theta)$ is uniquely maximized at $\theta_0$ as long as $E(x_i x_i')$ has full rank.

Asymptotic normality To discuss asymptotic normality, in addition to the conditions required for consistency, we assume that $f(w_i, \theta)$ has first and second derivatives in a neighborhood of $\theta_0$, and that $\theta_0$ is an interior point of $\Theta$.⁵ For simplicity, use the notation $\ell_i(\theta) = \ln f(w_i, \theta)$ and $q_i(\theta) = \partial\ell_i(\theta)/\partial\theta$. Note that if the data are iid, the score $q_i(\theta_0)$ is also iid with zero mean vector and covariance matrix

$$V = E\left( \frac{\partial\ell_i(\theta_0)}{\partial\theta} \frac{\partial\ell_i(\theta_0)}{\partial\theta'} \right). \quad (10)$$

Next, because of the central limit theorem we have

$$\frac{1}{\sqrt{n}} \frac{\partial L(\theta_0)}{\partial\theta} \equiv \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{\partial\ell_i(\theta_0)}{\partial\theta} \xrightarrow{d} N(0, V). \quad (11)$$

As for the Hessian matrix, its convergence follows from the law of large numbers:

$$\frac{1}{n} \frac{\partial^2 L(\theta_0)}{\partial\theta \partial\theta'} \equiv \frac{1}{n} \sum_{i=1}^{n} \frac{\partial^2\ell_i(\theta_0)}{\partial\theta \partial\theta'} \equiv H_n(\theta_0) \xrightarrow{p} E\left[ \frac{\partial^2\ell_i(\theta_0)}{\partial\theta \partial\theta'} \right] \equiv H. \quad (12)$$

We assume that $H$ is a non-singular matrix and that $H_n(\bar{\theta}) \xrightarrow{p} H$ for any $\bar{\theta}$ such that $\bar{\theta} \xrightarrow{p} \theta_0$.

Now, using the mean value theorem:

$$0 = \frac{\partial L(\hat{\theta})}{\partial\theta_j} = \frac{\partial L(\theta_0)}{\partial\theta_j} + \sum_{\ell=1}^{p} \frac{\partial^2 L(\theta^{[j]})}{\partial\theta_j \partial\theta_\ell} (\hat{\theta}_\ell - \theta_{0\ell}) \qquad (j = 1, \ldots, p) \quad (13)$$

⁴ Given the equivalence in this case between pointwise and uniform convergence.
⁵ We can proceed as if $\hat{\theta}$ were an interior point of $\Theta$, since consistency of $\hat{\theta}$ for $\theta_0$ and the assumption that $\theta_0$ is interior to $\Theta$ imply that the probability that $\hat{\theta}$ is not interior goes to zero as $n \to \infty$.


where $\hat{\theta}_\ell$ is the $\ell$-th element of $\hat{\theta}$, and $\theta^{[j]}$ denotes a $p \times 1$ random vector such that $\|\theta^{[j]} - \theta_0\| \leq \|\hat{\theta} - \theta_0\|$.⁶ Note that $\hat{\theta} \xrightarrow{p} \theta_0$ implies $\theta^{[j]} \xrightarrow{p} \theta_0$ and also

$$\frac{1}{n} \frac{\partial^2 L(\theta^{[j]})}{\partial\theta_j \partial\theta_\ell} \xrightarrow{p} (j, \ell)\ \text{element of}\ H,$$

which leads to the asymptotic linear representation of the estimation error:⁷

$$\sqrt{n}(\hat{\theta} - \theta_0) = -H^{-1} \frac{1}{\sqrt{n}} \frac{\partial L(\theta_0)}{\partial\theta} + o_p(1). \quad (14)$$

⁶ The expansion has to be made element by element, since $\theta^{[j]}$ may be different for each $j$.
⁷ The notation $o_p(1)$ denotes a term that converges to zero in probability.

Finally, using (11) and Cramér's theorem we obtain:

$$\sqrt{n}(\hat{\theta} - \theta_0) \xrightarrow{d} N(0, W) \quad (15)$$

where $W = H^{-1} V H^{-1}$, or at length:

$$W = \left[ E\left( \frac{\partial^2\ell_i(\theta_0)}{\partial\theta \partial\theta'} \right) \right]^{-1} E\left( \frac{\partial\ell_i(\theta_0)}{\partial\theta} \frac{\partial\ell_i(\theta_0)}{\partial\theta'} \right) \left[ E\left( \frac{\partial^2\ell_i(\theta_0)}{\partial\theta \partial\theta'} \right) \right]^{-1}. \quad (16)$$

Asymptotic standard errors A consistent estimator of the asymptotic variance matrix $W$ is:

$$\widehat{W} = \left[ \frac{1}{n} \sum_{i=1}^{n} \frac{\partial^2\ell_i(\hat{\theta})}{\partial\theta \partial\theta'} \right]^{-1} \left[ \frac{1}{n} \sum_{i=1}^{n} \frac{\partial\ell_i(\hat{\theta})}{\partial\theta} \frac{\partial\ell_i(\hat{\theta})}{\partial\theta'} \right] \left[ \frac{1}{n} \sum_{i=1}^{n} \frac{\partial^2\ell_i(\hat{\theta})}{\partial\theta \partial\theta'} \right]^{-1}. \quad (17)$$

3 The information matrix identity

As long as $f(w, \theta)$ is a density function, it integrates to one:

$$\int f(w, \theta)\, dw = 1. \quad (18)$$

Taking partial derivatives in (18) with respect to $\theta$, we get the zero mean property of the score:

$$\int \frac{\partial \ln f(w, \theta)}{\partial\theta} f(w, \theta)\, dw = 0. \quad (19)$$

Next, taking partial derivatives again:

$$\int \frac{\partial^2 \ln f(w, \theta)}{\partial\theta \partial\theta'} f(w, \theta)\, dw + \int \frac{\partial \ln f(w, \theta)}{\partial\theta} \frac{\partial \ln f(w, \theta)}{\partial\theta'} f(w, \theta)\, dw = 0. \quad (20)$$

Therefore, if $g(w) = f(w, \theta_0)$ we have

$$E\left( \frac{\partial\ell_i(\theta_0)}{\partial\theta} \frac{\partial\ell_i(\theta_0)}{\partial\theta'} \right) = -E\left( \frac{\partial^2\ell_i(\theta_0)}{\partial\theta \partial\theta'} \right). \quad (21)$$



This result is known as the information matrix identity. It says that, when evaluated at $\theta_0$, the covariance matrix of the score coincides with minus the expected Hessian of the log likelihood for observation $i$. It is an identity in the sense of (20), but in general it need not hold if the expectations in (21) are taken with respect to $g(w)$ and $g(w) \neq f(w, \theta_0)$.

The implication is that under correct specification $V = -H$ in the sandwich formula (16), and therefore:

$$\sqrt{n}(\hat{\theta} - \theta_0) \xrightarrow{d} N\left( 0, [I(\theta_0)]^{-1} \right) \quad (22)$$

where

$$I(\theta_0) = -E\left( \frac{\partial^2\ell_i(\theta_0)}{\partial\theta \partial\theta'} \right) \equiv -\operatorname{plim}_{n\to\infty} \frac{1}{n} \frac{\partial^2 L(\theta_0)}{\partial\theta \partial\theta'}. \quad (23)$$

The matrix I (θ0) is known as the information matrix or the Fisher information, after the work of

Ronald Fisher. It is called information because it can be regarded as a measure of the amount of

information that the random variable W with density f (w, θ) contains about the unknown parameter

θ. Intuitively, the greater the expected curvature of the log likelihood at θ = θ0 the greater the

information and the smaller the asymptotic variance of the maximum likelihood estimator, which is

given by the inverse of the information matrix.

The asymptotic Cramér-Rao inequality Under suitable regularity conditions, the matrix

[I (θ0)]−1 is a lower bound for the asymptotic covariance matrix of any consistent estimator of θ0.

Furthermore, this lower bound is attained by the maximum likelihood estimator.

Estimating the information matrix There are a variety of consistent estimators of $I(\theta_0)$. One possibility is to use the observed Hessian evaluated at $\hat{\theta}$:

$$\hat{I} = -\frac{1}{n} \frac{\partial^2 L(\hat{\theta})}{\partial\theta \partial\theta'}. \quad (24)$$

Another possibility is the expected Hessian evaluated at $\hat{\theta}$, as long as its functional form is known:

$$\bar{I} = I(\hat{\theta}). \quad (25)$$

Yet another possibility in a conditional likelihood model $f(y \mid x; \theta)$ is to use a sample average of the expected Hessian conditioned on $x_i$ and evaluated at $\hat{\theta}$:

$$\tilde{I} = -\frac{1}{n} \sum_{i=1}^{n} E\left[ \frac{\partial^2 \ln f(y \mid x_i; \hat{\theta})}{\partial\theta \partial\theta'} \,\Big|\, x_i \right]. \quad (26)$$

Finally, one can use the variance-of-the-score form of the information matrix to obtain an estimate:

$$\check{I} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial\ell_i(\hat{\theta})}{\partial\theta} \frac{\partial\ell_i(\hat{\theta})}{\partial\theta'}. \quad (27)$$
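For the logit example of the previous note, a minimal sketch comparing the observed-Hessian estimate (24) with the outer-product-of-scores estimate (27), reusing the `X`, `y`, `theta` arrays from the Newton-Raphson snippet:

```python
Lam = 1 / (1 + np.exp(-X @ theta))
I_hessian = (X * (Lam * (1 - Lam))[:, None]).T @ X / n   # equation (24)
scores = X * (y - Lam)[:, None]                          # row i holds the score of obs i
I_score = scores.T @ scores / n                          # equation (27)
# Under correct specification the two estimates should be close (information
# matrix identity); either inverse yields asymptotic standard errors:
se_theta = np.sqrt(np.diag(np.linalg.inv(I_hessian)) / n)
print(I_hessian, I_score, se_theta, sep="\n")
```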


4 Example: Normal linear regression

Letting $u_i = y_i - x_i'\beta_0$, the score at the true value for the log likelihood function in (4) is

$$\frac{\partial\ell_i(\theta_0)}{\partial\theta} = \begin{pmatrix} \frac{1}{\sigma_0^2} x_i u_i \\[4pt] \frac{1}{2\sigma_0^4} (u_i^2 - \sigma_0^2) \end{pmatrix}.$$

The covariance matrix of the score is

$$V = E\begin{pmatrix} \frac{1}{\sigma_0^4} u_i^2 x_i x_i' & \frac{1}{2\sigma_0^6} x_i u_i (u_i^2 - \sigma_0^2) \\[4pt] \frac{1}{2\sigma_0^6} x_i' u_i (u_i^2 - \sigma_0^2) & \frac{1}{4\sigma_0^8} (u_i^2 - \sigma_0^2)^2 \end{pmatrix} = \begin{pmatrix} \frac{1}{\sigma_0^4} E(u_i^2 x_i x_i') & \frac{1}{2\sigma_0^6} E(u_i^3 x_i) \\[4pt] \frac{1}{2\sigma_0^6} E(u_i^3 x_i') & \frac{1}{4\sigma_0^8} [E(u_i^4) - \sigma_0^4] \end{pmatrix}.$$

The expected Hessian is

$$H = E\begin{pmatrix} -\frac{1}{\sigma_0^2} x_i x_i' & -\frac{1}{\sigma_0^4} x_i u_i \\[4pt] -\frac{1}{\sigma_0^4} x_i' u_i & -\frac{1}{2\sigma_0^4} - \frac{1}{\sigma_0^6}(u_i^2 - \sigma_0^2) \end{pmatrix} = -\begin{pmatrix} \frac{1}{\sigma_0^2} E(x_i x_i') & 0 \\[4pt] 0' & \frac{1}{2\sigma_0^4} \end{pmatrix}.$$

The sandwich formula in (16) is:

$$W = \begin{pmatrix} \sigma_0^2 [E(x_i x_i')]^{-1} & 0 \\ 0' & 2\sigma_0^4 \end{pmatrix} \begin{pmatrix} \frac{1}{\sigma_0^4} E(u_i^2 x_i x_i') & \frac{1}{2\sigma_0^6} E(u_i^3 x_i) \\[4pt] \frac{1}{2\sigma_0^6} E(u_i^3 x_i') & \frac{1}{4\sigma_0^8} [E(u_i^4) - \sigma_0^4] \end{pmatrix} \begin{pmatrix} \sigma_0^2 [E(x_i x_i')]^{-1} & 0 \\ 0' & 2\sigma_0^4 \end{pmatrix}.$$

Note that under misspecification the information matrix identity does not hold, since $V \neq -H$. The first block-diagonal components of $V$ and $-H$ will coincide under conditional homoskedasticity, that is, if $E(u_i^2 \mid x_i) = \sigma_0^2$. The off-diagonal block of $V$ is zero under conditional symmetry, that is, if $E(u_i^3 \mid x_i) = 0$. Lastly, the second block-diagonal terms of $V$ and $-H$ will coincide under the normal kurtosis condition $E(u_i^4) = 3\sigma_0^4$. These conditions are satisfied when model (4) is correctly specified, but not in general.

Under correct specification:

$$\sqrt{n}(\hat{\theta} - \theta_0) \xrightarrow{d} N\left[ 0, \begin{pmatrix} \sigma_0^2 [E(x_i x_i')]^{-1} & 0 \\ 0' & 2\sigma_0^4 \end{pmatrix} \right]. \quad (28)$$

5 Estimation subject to constraints

We may wish to estimate parameters subject to constraints. Sometimes one seeks to ensure internal

consistency in a model that is required to answer a question of interest such as a welfare calculation, for

example enforcing symmetry of cross-price elasticities in a demand system. Another common situation

is an interest in the value or adequacy of economic restrictions, such as constant returns to scale in a

production function. Finally, one may be simply willing to consider a restricted version of a model as

a way of producing a simpler or tighter summary of data.


Constraints on parameters may be expressed as equation restrictions

$$h(\theta_0) = 0 \quad (29)$$

where $h(\theta)$ is a vector of $r$ restrictions: $h(\theta) = (h_1(\theta), \ldots, h_r(\theta))'$. Alternatively, constraints may be expressed in parametric form

$$\theta_0 = \theta(\alpha_0) \quad (30)$$

whereby $\theta$ is functionally related to a free parameter vector $\alpha$ of smaller dimension than $\theta$, such that the number of restrictions is $r = \dim(\theta) - \dim(\alpha)$. Depending on the problem, it may be more convenient to express restrictions in one way or the other (or in mixed form).

One way of obtaining a constrained estimator of $\theta_0$ is to maximize the log likelihood function $L(\theta)$ subject to the constraints $h(\theta) = 0$:

$$(\tilde{\theta}, \tilde{\lambda}) = \arg\max_{\theta, \lambda} \left[ \frac{1}{n} L(\theta) - \lambda' h(\theta) \right] \quad (31)$$

where $\lambda$ is an $r \times 1$ vector of Lagrange multipliers.

Alternatively, if the restrictions have been parameterized as in (30), the log likelihood function can be maximized with respect to $\alpha$ as in an unrestricted problem:

$$\hat{\alpha} = \arg\max_{\alpha} L[\theta(\alpha)]. \quad (32)$$

Restricted estimates of $\theta_0$ are then given by $\tilde{\theta} = \theta(\hat{\alpha})$.

Asymptotic normality of constrained estimators The asymptotic variance of $\sqrt{n}(\hat{\alpha} - \alpha_0)$ can be obtained as an application of the result in (15)-(17) to the log likelihood $L^*(\alpha) = L[\theta(\alpha)]$. Letting $G = G(\alpha_0)$, where $G(\alpha) = \partial\theta(\alpha)/\partial\alpha'$, using the chain rule we have

$$\frac{\partial L^*(\alpha_0)}{\partial\alpha} = G' \frac{\partial L(\theta_0)}{\partial\theta}, \quad (33)$$

and⁸

$$E\left( \frac{\partial^2 L^*(\alpha_0)}{\partial\alpha \partial\alpha'} \right) = G' H G. \quad (34)$$

Moreover,

$$\frac{1}{\sqrt{n}} \frac{\partial L^*(\alpha_0)}{\partial\alpha} \xrightarrow{d} N(0, G'VG). \quad (35)$$

Therefore,

$$\sqrt{n}(\hat{\alpha} - \alpha_0) \xrightarrow{d} N\left( 0, (G'HG)^{-1} G'VG (G'HG)^{-1} \right). \quad (36)$$

⁸ The matrix $\partial^2 L^*(\alpha_0)/\partial\alpha\partial\alpha'$ contains an additional term, which is equal to zero in expectation.


The asymptotic distribution of $\tilde{\theta}$ then follows from the delta method:

$$\sqrt{n}(\tilde{\theta} - \theta_0) \xrightarrow{d} N(0, W_R) \quad (37)$$

where $W_R$ is a matrix of reduced rank given by

$$W_R = G (G'HG)^{-1} G'VG (G'HG)^{-1} G'. \quad (38)$$

Under correct specification $V = -H$, so that the previous results become

$$\sqrt{n}(\hat{\alpha} - \alpha_0) \xrightarrow{d} N\left( 0, (G' I(\theta_0) G)^{-1} \right) \quad (39)$$

and

$$\sqrt{n}(\tilde{\theta} - \theta_0) \xrightarrow{d} N\left( 0, G (G' I(\theta_0) G)^{-1} G' \right). \quad (40)$$

Under correct specification, $\tilde{\theta}$ is asymptotically more efficient than $\hat{\theta}$, since the difference between their asymptotic variance matrices is positive semidefinite:

$$[I(\theta_0)]^{-1} - G (G' I(\theta_0) G)^{-1} G' = A^{-1} \left[ I - G^* (G^{*\prime} G^*)^{-1} G^{*\prime} \right] A^{-1\prime} \geq 0 \quad (41)$$

where $I(\theta_0) = A'A$ and $G^* = AG$. However, under misspecification, the sandwich matrix $W = H^{-1}VH^{-1}$ and $W_R$ in (38) cannot be ordered.

The likelihood ratio statistic By construction, an unrestricted maximum is greater than a restricted one: $L(\hat{\theta}) \geq L(\tilde{\theta})$. It seems therefore natural to look at the log likelihood change $L(\hat{\theta}) - L(\tilde{\theta})$ as a measure of the cost of imposing the restrictions. It can be shown that if the model is correctly specified (or if the information matrix identity holds) and the $r$ restrictions on $\theta_0$ also hold, then

$$LR = 2\left[ L(\hat{\theta}) - L(\tilde{\theta}) \right] \xrightarrow{d} \chi^2_r. \quad (42)$$

This result is used to construct a large-sample test of the restrictions: the rule is to reject the restrictions if $LR > \kappa$, where $\kappa$ is the $(1-\alpha)$-quantile of the $\chi^2_r$ distribution and $\alpha$ is the chosen size of the test.

The Wald statistic We can also learn about the adequacy of the restrictions by examining how far the unrestricted estimates are from satisfying the constraints. We are thus led to look at unrestricted estimates of the constraints, given by $h(\hat{\theta})$.

Letting $D = D(\theta_0)$ and $\widehat{D} = D(\hat{\theta})$, where $D(\theta) = \partial h(\theta)/\partial\theta'$, under the assumption that $h(\theta_0) = 0$, from the delta method we have

$$\sqrt{n}\, h(\hat{\theta}) \xrightarrow{d} N(0, DWD') \quad (43)$$


and

$$W_R = n\, h(\hat{\theta})' \left( \widehat{D}\, \widehat{W}\, \widehat{D}' \right)^{-1} h(\hat{\theta}) \xrightarrow{d} \chi^2_r. \quad (44)$$

The quantity $W_R$ is a Wald statistic. Like the LR statistic, it can be used to construct a large-sample test of the restrictions with a similar rejection region. However, while the calculation of $LR$ requires both $\hat{\theta}$ and $\tilde{\theta}$, the calculation of $W_R$ only requires the unrestricted estimate $\hat{\theta}$.

Another difference is that, contrary to $LR$, $W_R$ still has a large-sample chi-square distribution under misspecification if it relies on a robust estimate of the variance of $\hat{\theta}$, as in (44). A Wald statistic that is directly comparable to the $LR$ statistic would be a non-robust version of the form:

$$W = n\, h(\hat{\theta})' \left( \widehat{D}\, [I(\hat{\theta})]^{-1}\, \widehat{D}' \right)^{-1} h(\hat{\theta}). \quad (45)$$

The Lagrange Multiplier statistic Another angle on the cost of imposing the restrictions is to examine how far the estimated Lagrange multiplier vector $\tilde{\lambda}$ is from zero. The first-order conditions of the optimization problem in (31) are:

$$\frac{1}{n} \frac{\partial L(\tilde{\theta})}{\partial\theta} = D(\tilde{\theta})' \tilde{\lambda} \quad (46)$$
$$h(\tilde{\theta}) = 0, \quad (47)$$

which lead to the asymptotic linear representation of $\tilde{\lambda}$:⁹

$$\sqrt{n}\, \tilde{\lambda} = (DH^{-1}D')^{-1} DH^{-1} \frac{1}{\sqrt{n}} \frac{\partial L(\theta_0)}{\partial\theta} + o_p(1) \quad (48)$$

and

$$\sqrt{n}\, \tilde{\lambda} \xrightarrow{d} N\left[ 0, (DH^{-1}D')^{-1} DH^{-1}VH^{-1}D' (DH^{-1}D')^{-1} \right], \quad (49)$$

and also

$$LM_R = n\, \tilde{\lambda}' \widetilde{D}\widetilde{H}^{-1}\widetilde{D}' \left( \widetilde{D}\widetilde{H}^{-1}\widetilde{V}\widetilde{H}^{-1}\widetilde{D}' \right)^{-1} \widetilde{D}\widetilde{H}^{-1}\widetilde{D}' \tilde{\lambda} \xrightarrow{d} \chi^2_r \quad (50)$$

⁹ Using the mean value theorem for each component of the first-order conditions,

$$D(\tilde{\theta})' \tilde{\lambda} = \frac{1}{n} \frac{\partial L(\theta_0)}{\partial\theta} + \frac{1}{n} \frac{\partial^2 L(\theta^*)}{\partial\theta \partial\theta'} (\tilde{\theta} - \theta_0)$$
$$0 = h(\tilde{\theta}) = h(\theta_0) + D(\theta^{**}) (\tilde{\theta} - \theta_0),$$

and combining the two expressions using that $h(\theta_0) = 0$ we get:

$$D(\theta^{**}) \left( \frac{1}{n} \frac{\partial^2 L(\theta^*)}{\partial\theta \partial\theta'} \right)^{-1} D(\tilde{\theta})' \tilde{\lambda} = D(\theta^{**}) \left( \frac{1}{n} \frac{\partial^2 L(\theta^*)}{\partial\theta \partial\theta'} \right)^{-1} \frac{1}{n} \frac{\partial L(\theta_0)}{\partial\theta}.$$


where

$$\widetilde{D} = D(\tilde{\theta}), \qquad \widetilde{H} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial^2\ell_i(\tilde{\theta})}{\partial\theta \partial\theta'}, \qquad \widetilde{V} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial\ell_i(\tilde{\theta})}{\partial\theta} \frac{\partial\ell_i(\tilde{\theta})}{\partial\theta'}.$$

The quantity $LM_R$ is a Lagrange Multiplier statistic. Like the Wald statistic in (44), it can be used to construct a large-sample test of the restrictions which remains valid if the objective function is only a pseudo-likelihood that does not satisfy the information matrix identity.

In view of (46), we can replace $\widetilde{D}'\tilde{\lambda}$ in (50) with $n^{-1}\partial L(\tilde{\theta})/\partial\theta$, which produces the score form of the statistic. The score function is exactly equal to zero when evaluated at $\hat{\theta}$, but not when evaluated at $\tilde{\theta}$. If the constraints are true, we would expect both $n^{-1}\partial L(\tilde{\theta})/\partial\theta$ and $\tilde{\lambda}$ to be small quantities, so that the rejection region of the null hypothesis $h(\theta_0) = 0$ is associated with large values of $LM_R$.

Under correct specification ($V = -H$), we get

$$\sqrt{n}\, \tilde{\lambda} \xrightarrow{d} N\left[ 0, \left( D [I(\theta_0)]^{-1} D' \right)^{-1} \right] \quad (51)$$

and

$$LM = n\, \tilde{\lambda}' \widetilde{D} [I(\tilde{\theta})]^{-1} \widetilde{D}' \tilde{\lambda} \equiv \frac{1}{n} \frac{\partial L(\tilde{\theta})}{\partial\theta'} [I(\tilde{\theta})]^{-1} \frac{\partial L(\tilde{\theta})}{\partial\theta} \xrightarrow{d} \chi^2_r. \quad (52)$$

The statistic $LM$ is a non-robust version of the Lagrange Multiplier statistic that is directly comparable to the $LR$ statistic.

6 Example: LR, Wald and LM in the normal linear regression model

The model is the same as in (4). We consider the partitions $X = (X_1, X_2)$ and $\beta = (\beta_1', \beta_2')'$, where $X_1$ is of order $n \times r$, $X_2$ is $n \times (k - r)$, $\beta_1$ is $r \times 1$ and $\beta_2$ is $(k - r) \times 1$, and the $r$ restrictions

$$\beta_1 = 0. \quad (53)$$

The unrestricted estimates are $\hat{\theta} = (\hat{\beta}', \hat{\sigma}^2)'$ as in (5), whereas the restricted estimates are

$$\tilde{\beta}_1 = 0, \qquad \tilde{\beta}_2 = (X_2'X_2)^{-1} X_2' y, \qquad \tilde{\sigma}^2 = \frac{\tilde{u}'\tilde{u}}{n} \quad (54)$$

where $\tilde{u} = y - X_2\tilde{\beta}_2$.

The LR statistic is given by

$$LR = n \ln\left( \frac{\tilde{u}'\tilde{u}}{\hat{u}'\hat{u}} \right). \quad (55)$$

If we modify the example to assume that $\sigma^2$ is known, $\hat{\beta}$ and $\tilde{\beta}$ remain unchanged, but in this case

$$LR_\sigma = \frac{\tilde{u}'\tilde{u} - \hat{u}'\hat{u}}{\sigma^2}, \quad (56)$$


which is exactly distributed as $\chi^2_r$ under normality.

Turning to the Wald statistic, recall that

$$\sqrt{n}(\hat{\beta} - \beta_0) \xrightarrow{d} N\left( 0,\ \sigma^2 \operatorname{plim}(X'X/n)^{-1} \right)$$

and introduce the partition

$$(X'X)^{-1} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}. \quad (57)$$

Thus, if the restrictions hold,

$$\sqrt{n}\, \hat{\beta}_1 \xrightarrow{d} N\left[ 0,\ \sigma^2 \operatorname{plim}(nA_{11}) \right] \quad (58)$$

and therefore the (non-robust) Wald statistic is given by:

$$W = \frac{\hat{\beta}_1' A_{11}^{-1} \hat{\beta}_1}{\hat{\sigma}^2}. \quad (59)$$

It can be shown that $\hat{\beta}_1' A_{11}^{-1} \hat{\beta}_1 = \tilde{u}'\tilde{u} - \hat{u}'\hat{u}$,¹⁰ so that also

$$W = \frac{\tilde{u}'\tilde{u} - \hat{u}'\hat{u}}{\hat{\sigma}^2}. \quad (60)$$

Moreover, if $\sigma^2$ is known,

$$W_\sigma = \frac{\tilde{u}'\tilde{u} - \hat{u}'\hat{u}}{\sigma^2},$$

so that $W_\sigma = LR_\sigma$. With unknown $\sigma^2$ it can be shown that $W \geq LR$, even though both statistics have the same asymptotic distribution.

Finally, turning to the LM statistic, the components of the score are:

$$\frac{\partial L(\theta)}{\partial\beta_1} = \frac{1}{\sigma^2} X_1'(y - X\beta), \qquad \frac{\partial L(\theta)}{\partial\beta_2} = \frac{1}{\sigma^2} X_2'(y - X\beta), \qquad \frac{\partial L(\theta)}{\partial\sigma^2} = \frac{n}{2\sigma^4}\left[ \frac{1}{n}(y - X\beta)'(y - X\beta) - \sigma^2 \right], \quad (61)$$

which, once evaluated at the restricted estimates, are

$$\frac{\partial L(\tilde{\theta})}{\partial\beta_1} = \frac{1}{\tilde{\sigma}^2} X_1'\tilde{u} \neq 0, \qquad \frac{\partial L(\tilde{\theta})}{\partial\beta_2} = \frac{1}{\tilde{\sigma}^2} X_2'\tilde{u} = 0, \qquad \frac{\partial L(\tilde{\theta})}{\partial\sigma^2} = \frac{n}{2\tilde{\sigma}^4}\left( \frac{1}{n}\tilde{u}'\tilde{u} - \tilde{\sigma}^2 \right) = 0. \quad (62)$$

¹⁰ Using the partitioned inverse result, $A_{11}^{-1} = X_1'(I - M_2)X_1$, and $MM_2 = M_2$, where $M_2 = X_2(X_2'X_2)^{-1}X_2'$ and $M = X(X'X)^{-1}X'$. Premultiplying $y = X\hat{\beta} + \hat{u}$ by $(I - M_2)$ and taking squares, we get $y'(I - M_2)y = \hat{\beta}_1' X_1'(I - M_2)X_1 \hat{\beta}_1 + \hat{u}'\hat{u}$, which equals $\tilde{u}'\tilde{u} = \hat{\beta}_1' A_{11}^{-1} \hat{\beta}_1 + \hat{u}'\hat{u}$.


Moreover, the information matrix in this case is

$$I(\theta_0) = \begin{pmatrix} \frac{1}{\sigma_0^2} \operatorname{plim}\left( \frac{X'X}{n} \right) & 0 \\ 0' & \frac{1}{2\sigma_0^4} \end{pmatrix}, \quad (63)$$

so that, using

$$I(\tilde{\theta}) = \begin{pmatrix} \frac{1}{\tilde{\sigma}^2} \left( \frac{X'X}{n} \right) & 0 \\ 0' & \frac{1}{2\tilde{\sigma}^4} \end{pmatrix}, \quad (64)$$

the LM statistic becomes

$$LM = \begin{pmatrix} \frac{\tilde{u}'X_1}{\tilde{\sigma}^2} & 0' & 0 \end{pmatrix} \begin{pmatrix} \tilde{\sigma}^2 \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} & 0 \\ 0' & \frac{2\tilde{\sigma}^4}{n} \end{pmatrix} \begin{pmatrix} \frac{X_1'\tilde{u}}{\tilde{\sigma}^2} \\ 0 \\ 0 \end{pmatrix} \quad (65)$$

and

$$LM = \frac{\tilde{u}'X_1 A_{11} X_1'\tilde{u}}{\tilde{\sigma}^2}. \quad (66)$$

It can be shown that

$$\tilde{u}'X_1 A_{11} X_1'\tilde{u} = \hat{\beta}_1' A_{11}^{-1} \hat{\beta}_1 = \tilde{u}'\tilde{u} - \hat{u}'\hat{u}, \quad (67)$$

so that also

$$LM = \frac{\tilde{u}'\tilde{u} - \hat{u}'\hat{u}}{\tilde{\sigma}^2}. \quad (68)$$

Moreover, if $\sigma^2$ is known,

$$LM_\sigma = \frac{\tilde{u}'\tilde{u} - \hat{u}'\hat{u}}{\sigma^2},$$

so that $LM_\sigma = W_\sigma = LR_\sigma$. With unknown $\sigma^2$ it is easy to show that $W \geq LR \geq LM$.
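The ordering $W \geq LR \geq LM$ can be verified numerically. A minimal sketch under the setup of this section (simulated data in which the $r$ restrictions $\beta_1 = 0$ are true):

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 200, 2
X1 = rng.normal(size=(n, r))                     # restricted-out regressors
X2 = np.column_stack([np.ones(n), rng.normal(size=n)])
X = np.column_stack([X1, X2])
y = X2 @ np.array([1.0, 0.5]) + rng.normal(size=n)   # beta_1 = 0 in the DGP

u_hat = y - X @ np.linalg.solve(X.T @ X, X.T @ y)     # unrestricted residuals
u_til = y - X2 @ np.linalg.solve(X2.T @ X2, X2.T @ y) # restricted residuals
rss_hat, rss_til = u_hat @ u_hat, u_til @ u_til

LR = n * np.log(rss_til / rss_hat)               # equation (55)
W = (rss_til - rss_hat) / (rss_hat / n)          # equation (60)
LM = (rss_til - rss_hat) / (rss_til / n)         # equation (68)
print(W, LR, LM)                                 # W >= LR >= LM
```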


Bootstrap Methods Class Notes

Manuel Arellano

January 2016

Introduction

• The bootstrap is an alternative method of assessing sampling variability. It is a mechanical procedure that can be applied in a wide variety of situations.

• It works in the same way regardless of whether something straightforward is being estimated or something more complex.

• The bootstrap was invented and given its name by Brad Efron in a paper published in 1979 in the Annals of Statistics.

• The bootstrap is probably the most widely used methodological innovation in statistics since Ronald Fisher's development of the analysis of variance in the 1920s (Erich Lehmann, 2008).


The idea of the bootstrap

• Let $Y_1, \ldots, Y_N$ be a random sample from some distribution $F$ and let $\hat{q}_N = h(Y_1, \ldots, Y_N)$ be some statistic of interest. We want to estimate the distribution of $\hat{q}_N$:

$$\Pr_F[\hat{q}_N \leq r] = \Pr_F[h(Y_1, \ldots, Y_N) \leq r]$$

where the subscript $F$ indicates the distribution of the $Y$'s.

• A simple estimator of $\Pr_F[\hat{q}_N \leq r]$ is the plug-in estimator. It replaces $F$ by the empirical cdf $\hat{F}_N$:

$$\hat{F}_N(s) = \frac{1}{N} \sum_{i=1}^{N} 1(Y_i \leq s),$$

which assigns probability $1/N$ to each of the observed values $y_1, \ldots, y_N$ of $Y_1, \ldots, Y_N$.

• The resulting estimator is then

$$\Pr_{\hat{F}_N}[h(Y_1^*, \ldots, Y_N^*) \leq r] \quad (1)$$

where $Y_1^*, \ldots, Y_N^*$ denotes a random sample from $\hat{F}_N$.

• The formula (1) for the estimator of the cdf of $\hat{q}_N$ is easy to write down, but it is prohibitive to calculate except for small $N$.

• To see this, note that each of the $Y_i^*$ is capable of taking on the $N$ values $y_1, \ldots, y_N$, so that the total number of values of $h(Y_1^*, \ldots, Y_N^*)$ that has to be considered is $N^N$. To calculate (1), one would have to count how many of these $N^N$ values are $\leq r$.


The idea of the bootstrap (continued)

• For example, suppose that $Y_i$ is binary, $F$ is given by $\Pr(Y = 1) = p$, $\hat{q}_N$ is the sample mean, and $N = 3$. There are eight possible samples:

(y1, y2, y3)    Pr(y1, y2, y3)    q̂N
(0, 0, 0)       (1−p)³            0
(1, 0, 0)       p(1−p)²           1/3
(0, 1, 0)       p(1−p)²           1/3
(0, 0, 1)       p(1−p)²           1/3
(1, 1, 0)       p²(1−p)           2/3
(1, 0, 1)       p²(1−p)           2/3
(0, 1, 1)       p²(1−p)           2/3
(1, 1, 1)       p³                1

• So that $\Pr_F[\hat{q}_N \leq r]$ is determined by

r      Pr_F[q̂N = r]
0      (1−p)³
1/3    3p(1−p)²
2/3    3p²(1−p)
1      p³

The idea of the bootstrap (continued)

• Suppose that the observed values $(y_1, y_2, y_3)$ are $(0, 1, 1)$, so that the observed value of $\hat{q}_N$ is 2/3. Therefore, our estimate of $\Pr[\hat{q}_N = r]$ is given by

r      Pr_F̂N[q̂N = r]
0      (1 − 2/3)³ = 1/27 = .037
1/3    3(2/3)(1 − 2/3)² = 6/27 = .222
2/3    3(2/3)²(1 − 2/3) = 12/27 = .444
1      (2/3)³ = 8/27 = .296

• The previous example is so simple that the calculation of $\Pr_{\hat{F}_N}[\hat{q}_N \leq r]$ can be done analytically, but in general this type of calculation is beyond reach.


The idea of the bootstrap (continued)

Estimation by simulation

• A standard device for (approximately) evaluating probabilities that are too difficult to calculate exactly is simulation.

• To calculate the probability of an event, one generates a sample from the underlying distribution and notes the frequency with which the event occurs in the generated sample.

• If the sample is sufficiently large, this frequency will provide an approximation to the original probability with a negligible error.

• Such an approximation to the probability (1) constitutes the second step of the bootstrap method.

• A number $M$ of samples $Y_1^*, \ldots, Y_N^*$ (the "bootstrap" samples) are drawn from $\hat{F}_N$, and the frequency with which

$$h(Y_1^*, \ldots, Y_N^*) \leq r$$

provides the desired approximation to the estimator (1) (Lehmann, 2008).

Numerical illustration

• To illustrate the method, I have generated $M = 1000$ bootstrap samples of size $N = 3$ with $p = 2/3$, which is the value of $p$ that corresponds to the sample distribution $(0, 1, 1)$. The result is:

samples                            r      # samples   bootstrap pdf   Pr_F̂N[q̂N = r]
(0, 0, 0)                          0      37          .037            .037
(1, 0, 0), (0, 1, 0), (0, 0, 1)    1/3    222         .222            .222
(1, 1, 0), (1, 0, 1), (0, 1, 1)    2/3    453         .453            .444
(1, 1, 1)                          1      288         .288            .296

• The discrepancy between the last two columns can be made arbitrarily small by increasing $M$.

• The method we have described, consisting of drawing random samples with replacement treating the observed sample as the population, is called the "nonparametric bootstrap".
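A minimal sketch replicating this simulation in Python (counts will vary slightly with the random seed):

```python
import numpy as np

rng = np.random.default_rng(3)
sample = np.array([0.0, 1.0, 1.0])           # observed sample (0, 1, 1)
M, N = 1000, 3

# Draw M bootstrap samples with replacement; compute the mean of each
q_star = np.array([rng.choice(sample, size=N, replace=True).mean()
                   for _ in range(M)])
for r in (0, 1/3, 2/3, 1):
    print(r, np.mean(np.isclose(q_star, r)))  # bootstrap pdf at each r
```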


Bootstrap standard errors

• The bootstrap procedure is very flexible and applicable to many different tasks, such as estimating the bias and variance of an estimator, calculating confidence intervals, etc.

• As a result of resampling, we have available $M$ estimates from the artificial samples: $\hat{q}_N^{(1)}, \ldots, \hat{q}_N^{(M)}$. A bootstrap standard error is then obtained as

$$\left[ \frac{1}{M - 1} \sum_{m=1}^{M} \left( \hat{q}_N^{(m)} - \bar{q}_N \right)^2 \right]^{1/2} \quad (2)$$

where $\bar{q}_N = \sum_{m=1}^{M} \hat{q}_N^{(m)} / M$.

• In the previous example, the bootstrap mean is $\bar{q}_N = 0.664$, the bootstrap standard error is 0.271, calculated as in (2) with $M = 1000$, and the analytical standard error is

$$\left[ \frac{\hat{q}_N (1 - \hat{q}_N)}{n} \right]^{1/2} = 0.272$$

where $\hat{q}_N = 2/3$ and $n = 3$.
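Continuing the sketch above, the bootstrap standard error (2) next to its analytical counterpart:

```python
q_bar = q_star.mean()                                       # bootstrap mean
se_boot = np.sqrt(np.sum((q_star - q_bar)**2) / (M - 1))    # equation (2)
q_hat = sample.mean()                                       # 2/3
se_analytic = np.sqrt(q_hat * (1 - q_hat) / N)              # = 0.272
print(q_bar, se_boot, se_analytic)
```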

Asymptotic properties of bootstrap methods

• Using the bootstrap standard error to construct test statistics cannot be shown toimprove on the approximation provided by the usual asymptotic theory, but the goodnews is that under general regularity conditions it does have the same asymptoticjustification as conventional asymptotic procedures.

• This is good news because bootstrap standard errors are often much easier to obtainthan analytical standard errors.

Refinements for large-sample pivotal statistics• Even better news is the fact that in many cases the bootstrap does improve theapproximation of the distribution of test statistics, in the sense that the bootstrap canprovide an asymptotic refinement compared with the usual asymptotic theory.

• The key aspect for achieving such refinements (consisting in having an asymptoticapproximation with errors of a smaller order of magnitude in powers of the samplesize) is that the statistic being bootstraped is asymptotically pivotal.

• An asymptotically pivotal statistic is one whose limiting distribution does not dependon unknown parameters (like standard normal or chi-square distributions).

• This is the case with t-ratios and Wald test statistics, for example.• Note that for a t-ratio to be asymptotically pivotal in a regression with heteros-kedasticity, the robust White form of the t-ratio needs to be used.

8


Asymptotic properties of bootstrap methods (continued)

• The upshot of the previous discussion is that replicating the quantity of interest (mean, median, etc.) is not always the best way to use the bootstrap if improvements on asymptotic approximations are sought.

• In particular, when we wish to calculate a confidence interval, it is better not to bootstrap the estimate itself but rather to bootstrap the distribution of the t-value.

• This is feasible when we have a large-sample estimate of the standard error, but one is skeptical about the accuracy of the normal probability approximation.

• Such a procedure will provide more accurate estimates of confidence intervals than either the simple bootstrap or the asymptotic standard errors.

• However, often we are interested in bootstrap methods because an analytic standard error is not available or is hard to calculate.

• In those cases the motivation for the bootstrap is not necessarily improving on the asymptotic approximation, but rather obtaining a simpler approximation with the same justification as the conventional asymptotic approximation.

An example: the sample mean

• To illustrate how the bootstrap works, let us consider the estimation of a sample mean:

$$y_i = b + u_i,$$

where the $u_i$ are iid with zero mean.

• The OLS estimate on the original sample is:

$$\hat{b} = \frac{1}{N} \sum_{i=1}^{N} y_i = \bar{y}.$$

• Then, sampling from the original sample is equivalent to selecting $N$ indices with probability $1/N$ each. Let $W$ be a generic draw from that distribution. We have:

$$E(W \mid y_1, \ldots, y_N) = \frac{1}{N} \sum_{i=1}^{N} y_i, \qquad Var(W \mid y_1, \ldots, y_N) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \bar{y})^2.$$

• Let $\tilde{b}$ be the sample mean of a bootstrap sample. It follows from the sample mean theorem that:

$$E(\tilde{b} \mid y_1, \ldots, y_N) = E(W \mid y_1, \ldots, y_N) = \frac{1}{N} \sum_{i=1}^{N} y_i,$$

and

$$Var(\tilde{b} \mid y_1, \ldots, y_N) = \frac{Var(W \mid y_1, \ldots, y_N)}{N} = \frac{1}{N^2} \sum_{i=1}^{N} (y_i - \bar{y})^2.$$


An example: the sample mean (continued)

• Therefore the distribution of $\tilde{b}$, conditional on the original sample, is centered around the original estimate and consistently estimates the variance of $\hat{b}$.

• This illustrates the bootstrap principle, according to which the relation between $\tilde{b}$ and $\hat{b}$ is the same as the relation between $\hat{b}$ and the true $b$.

Bootstrapping dependent samples

• We have discussed Efron's nonparametric bootstrap.

• There are other forms of bootstrap in the literature (parametric bootstrap, residual bootstrap, subsampling).

Time series models

• Dealing with time-series models requires taking serial dependence into account.

• One way of doing so is to sample by blocks.

Stratified clustered samples

• The bootstrap can be applied to a stratified clustered sample.

• All we have to do is to treat the strata separately and resample, not the basic underlying units (the households), but rather the primary sample units (the clusters).
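A minimal sketch of the cluster resampling step (the `X`, `y`, `cluster` arrays are those of the cluster-robust snippet in the regression notes; everything else in the bootstrap is unchanged):

```python
# One cluster-bootstrap replication: resample whole clusters with replacement
groups = np.unique(cluster)
draw = rng.choice(groups, size=len(groups), replace=True)
idx = np.concatenate([np.flatnonzero(cluster == g) for g in draw])
Xb, yb = X[idx], y[idx]
beta_b = np.linalg.solve(Xb.T @ Xb, Xb.T @ yb)   # estimate on the bootstrap sample
# Repeating this M times and taking the standard deviation of beta_b across
# replications gives cluster-bootstrap standard errors.
```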


Using replicate weights

• Taking stratification and clustering sampling features into account, either analytically or by bootstrap, requires the availability of stratum and cluster indicators.

• Generally, statistical offices or survey providers do not make them available for confidentiality reasons.

• To enable the estimation of the sampling distribution of estimators and test statistics without disclosing stratum or cluster information, an alternative is to provide replicate weights.

• For example, the EFF provides 999 replicate weights. Specifically, the EFF provides replicate cross-section weights, replicate longitudinal weights, and multiplicity factors (Bover, 2004).

• Multiplicity factors indicate the number of times a given household appears in a particular bootstrap sample.

• The provision of replicate weights is an important development because it facilitates the use of replication methods, which are simple and of general applicability, while also safeguarding confidentiality.


Bayesian Analysis Class Notes

Manuel Arellano

March 8, 2016

1 Introduction

Bayesian methods have traditionally had limited influence in empirical economics, but they have

become increasingly important with the popularization of computer-intensive stochastic simulation

algorithms in the 1990s. This is particularly so in macroeconomics, where applications of Bayesian

inference include vector autoregressions (VARs) and dynamic stochastic general equilibrium (DSGE)

models. Bayesian approaches are also attractive in models with many parameters, such as panel

data models with individual heterogeneity and flexible nonlinear regression models. Examples include

discrete choice models of consumer demand in the fields of industrial organization and marketing.

An empirical study uses data to learn about quantities of interest (parameters). A likelihood

function or some of its features specify the information in the data about those quantities. Such

specification typically involves the use of a priori information in the form of parametric or functional

restrictions. In the Bayesian approach to inference, one not only assigns a probability measure to the

sample space but also to the parameter space. Specifying a probability distribution over potential

parameter values is the conventional way of modelling uncertainty in decision-making, and offers a

systematic way of incorporating uncertain prior information into statistical procedures.

Outline The following section introduces the Bayesian way of combining a prior distribution with

the likelihood of the data to generate point and interval estimates. This is followed by some comments

on the specification of prior distributions. Next we turn to discuss asymptotic approximations; the

main result is that in regular cases there is a large-sample equivalence between Bayesian probability

statements and frequentist confidence statements. As a result, frequentist and Bayesian inferences

are often very similar and can be reinterpreted in each other’s terms. Finally, we review Markov

chain Monte Carlo methods (MCMC). The development of these methods has greatly reduced the

computational difficulties that held back Bayesian applications in the past.

Bayesian methods are now not only generally feasible, but sometimes also a better practical alternative to frequentist methods. The upshot is an emerging Bayesian/frequentist synthesis around increasing agreement on what works for different kinds of problems. The shifting focus from philosophical debate to methodological considerations is a healthy state of affairs, because both frequentist and Bayesian approaches have features that are appealing to most scientists.


2 Bayesian inference

Let us consider a data set $y = (y_1, \ldots, y_n)$ and a probability density (or mass) function of $y$ conditional on an unknown parameter $\theta$:

$$f(y_1, \ldots, y_n \mid \theta).$$

If $y$ is an iid sample, then $f(y_1, \ldots, y_n \mid \theta) = \prod_{i=1}^{n} f(y_i \mid \theta)$, where $f(y_i \mid \theta)$ denotes the pdf of $y_i$. In survey sampling, $f(y_i \mid \theta = \theta_0)$ is the pdf of the population, $(y_1, \ldots, y_n)$ are $n$ independent draws from that population, and $\theta_0$ denotes the true value of $\theta$ in the pdf that generated the data.

In general, for shortness we just write f (y | θ) = f (y1, ..., yn | θ). As a function of the parameterthis is called the likelihood function, also denoted L (θ). We are interested in inference about the

unknown parameter given the data. Any uncertain prior information about the value of θ is specified

in a prior probability distribution for the parameter, p (θ). Both the likelihood and the prior are

chosen by the researcher. We then combine the prior distribution and the sample information, using

Bayes’ theorem, to obtain the conditional distribution of the parameter given the data, also known as

the posterior distribution:

p (θ | y) = f (y, θ) / f (y) = [f (y | θ) p (θ)] / [∫ f (y | θ*) p (θ*) dθ*].

Note that as a function of θ, the posterior density is proportional to

p (θ | y) ∝ f (y | θ) p (θ) = L (θ) p (θ) .

Once we calculate this product, all we have to do is to find the constant that makes this expression

integrate to one as a function of the parameter. The posterior density describes how likely it is that

a value of θ has generated the observed data.

Point estimation We can use the posterior density to form optimal point estimates. The notion

of optimality is minimizing mean posterior loss for some loss function ℓ (r):

min_c ∫_Θ ℓ (c − θ) p (θ | y) dθ.

The posterior mean

θ̄ = ∫_Θ θ p (θ | y) dθ

is the point estimate that minimizes mean squared loss ℓ (r) = r². The posterior median minimizes
mean absolute loss ℓ (r) = |r|. The posterior mode θ̃ is the maximizer of the posterior density and
minimizes mean Dirac loss. When the prior density is flat, the posterior mode coincides with the

maximum likelihood estimator.


Interval estimation The posterior quantiles characterize the posterior uncertainty about the

parameter, and they can be used to obtain interval estimates. Any interval (θ_ℓ, θ_u) such that

∫_{θ_ℓ}^{θ_u} p (θ | y) dθ = 1 − α

is called a credible interval with coverage probability 1 − α. If the posterior density is unimodal,

a common choice is the shortest connected credible interval or the highest posterior density (HPD)

interval. In practice, often an equal-tail-probability interval is favored because of its computational

simplicity. In that case, θ_ℓ and θ_u are just the α/2 and 1 − α/2 posterior quantiles, respectively. Equal-tail-probability intervals tend to be longer than the others, except in the case of a symmetric posterior

density. If the posterior is multi-modal then the HPD interval may consist of disjoint segments.1

Frequentist confidence intervals and Bayesian credible intervals are the two main interval estimation

methods in statistics. In a confidence interval the coverage probability is calculated from a sampling

density, whereas in a credible interval the coverage probability is calculated from a posterior density.

As discussed in the next section, despite the differences in the two methods, they often provide similar

interval estimates in large samples.

Bernoulli example Let us consider a random sample (y1, ..., yn) of Bernoulli random variables.

The likelihood of the sample is given by

L (θ) = θ^m (1 − θ)^{n−m},

where m = ∑_{i=1}^n yi. The maximum likelihood estimator is

θ̂ = m/n.

In general, given some prior p (θ), the posterior mode solves

θ̃ = arg max_θ [ln L (θ) + ln p (θ)].

Since θ is a probability value, a suitable parameter space over which to specify a prior probability

distribution is the (0, 1) interval. A flexible and convenient choice is the Beta distribution:

p (θ; α, β) = [1/B(α, β)] θ^{α−1} (1 − θ)^{β−1},

where B(α, β) is the beta function, which is constant with respect to θ:

B(α, β) = ∫_0^1 s^{α−1} (1 − s)^{β−1} ds.

The quantities (α, β) are parameters of the prior, to be set according to our a priori information about

θ. These parameters are called prior hyperparameters.

1The minimum density of any point within the HPD interval exceeds the density of any point outside that interval.


The Beta distribution is a convenient prior because the posterior is also a Beta distribution:

p (θ | y) ∝ L (θ) p (θ) ∝ θ^{m+α−1} (1 − θ)^{n−m+β−1}.

That is, if θ ∼ Beta (α, β) then θ | y ∼ Beta (α + m, β + n − m). This situation is described by saying

that the Beta distribution is the conjugate prior to the Bernoulli.

The posterior mode is given by

θ̃ = arg max_θ [θ^{m+α−1} (1 − θ)^{n−m+β−1}] = (m + α − 1) / (n + β − 1 + α − 1). (1)

This result illustrates some interesting properties of the posterior mode in this example. The posterior

mode is equivalent to the ML estimate of a data set with α−1 additional ones and β−1 additional zeros.

Such data augmentation interpretation provides guidance on how to choose α and β in describing a

priori knowledge about the probability of success in Bernoulli trials. It also illustrates the vanishing

effect of the prior in a large sample. Note for now that if n is large, θ̃ ≈ θ̂. However, maximum likelihood may not be a satisfactory estimator in a small sample that only contains zeros if the probability of

success is known a priori to be greater than zero.
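To make the conjugate update concrete, here is a minimal sketch in Python (assuming NumPy and SciPy are available; the sample and hyperparameter values are hypothetical) that computes the posterior mean, the posterior mode in equation (1), and an equal-tail credible interval:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.3, size=50)   # hypothetical Bernoulli sample
n, m = y.size, int(y.sum())

alpha, beta = 2.0, 2.0              # illustrative Beta prior hyperparameters

# Conjugate update: theta | y ~ Beta(alpha + m, beta + n - m)
post = stats.beta(alpha + m, beta + n - m)

theta_ml = m / n                                        # MLE
theta_mean = post.mean()                                # posterior mean
theta_mode = (m + alpha - 1) / (n + alpha + beta - 2)   # posterior mode, equation (1)
ci = post.ppf([0.025, 0.975])                           # 95% equal-tail credible interval

print(f"MLE={theta_ml:.3f} mean={theta_mean:.3f} mode={theta_mode:.3f} "
      f"CI=[{ci[0]:.3f}, {ci[1]:.3f}]")
```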

3 Specification of prior distribution

There is a diversity of considerations involved in the specification of a prior, not altogether different

from those involved in the specification of a likelihood model.

Conjugate priors One consideration in selecting the form of both prior and likelihood is mathe-

matical convenience. Conjugate prior distributions, such as the Beta density in the previous example,

have traditionally played a central role in Bayesian inference for analytical and computational reasons.

A prior is conjugate for a family of distributions if the prior and the posterior are of the same family.

When a likelihood model is used together with its conjugate prior, the posterior is not only known to

be from the same family of densities as the prior, but explicit formulas for the posterior hyperparame-

ters are also available. In general, distributions in the exponential family have conjugate priors. Some

likelihood models together with their conjugate priors are the following:

• Bernoulli – Beta
• Binomial – Beta
• Poisson – Gamma
• Normal with known variance – Normal
• Exponential – Gamma
• Uniform – Pareto
• Geometric – Beta


Conjugate priors not only have advantages in tractability but also in interpretation, since the prior

can be interpreted in terms of a prior sample size or additional pseudo-data (as illustrated in the

Bernoulli example).

Informative priors The argument for using a probability distribution to specify uncertain a

priori information is more compelling when prior knowledge can be associated with past experience, or

to a process of elicitation of consensus expert views. Other times, a parameter is a random realization

drawn from some population, for example, in a model with individual effects for longitudinal survey

data; a situation in which there exists an actual population prior distribution. In those cases one

would like the prior to accurately express the information available about the parameters. However,

often little is known a priori and one would like a prior density to just express lack of information, an

issue that we consider next.

Flat priors For a scalar θ taking values on the entire real line, a uniform (flat) prior distribution

is typically employed as an uninformative prior, that is, one that sets p (θ) = 1. A flat prior is non-

informative in the sense of having little impact on the posterior, which is simply a renormalization

of the likelihood into a density for θ.2 A flat prior is therefore appealing from the point of view of

seeking to summarize the likelihood.

Note that a flat prior is improper in the sense that ∫_Θ p (θ) dθ = ∞.3 If an improper prior is

combined with a likelihood that cannot be renormalized (due to lacking a finite integral with respect

to θ), the result is an improper posterior that cannot be used for inference. Flat priors are often

approximated by a proper prior with a large variance.

If p (θ) is uniform, then the prior of a transformation of θ is not uniform. If θ is a positive number,

a standard reference prior is to assume a flat prior on ln θ, p (ln θ) = 1, which implies

p (θ) = 1/θ. (2)

Similarly, if θ lies in the (0, 1) interval, a flat prior on the logit transformation of θ, ln[θ/(1 − θ)], implies

p (θ) = 1 / [θ (1 − θ)]. (3)

These priors are improper because ∫_0^∞ (1/θ) dθ and ∫_0^1 1/[θ (1 − θ)] dθ both diverge. They are easily dominated

by the data, but (2) assigns most of the weight to values of θ that are either very large or very close

to zero, and (3) puts most of the weight on values very near 0 and 1.

For example, if (y1, ..., yn) is a random sample from a normal population N(µ, 1/τ), the standard

improper reference prior for (µ, τ) is to specify independent flat priors on µ and ln τ , so that

p (µ, τ) = p (µ) p (τ) = 1/τ.

2Though arguably a flat prior places a large weight on extreme parameter values.
3Any prior distribution with infinite mass is called improper.


Jeffreys prior This is a rule for choosing a non-informative prior that is invariant to transformation:

p (θ) ∝ [det I (θ)]^{1/2}

where I (θ) is the information matrix. If γ = h (θ) is one-to-one, applying Jeffreys’ rule directly to γ

we get the same prior as applying the rule to θ and then transforming to obtain p [h (θ)].

Bernoulli example continued Let us illustrate three standard candidates for non-informative

prior in the Bernoulli example. The first possibility is to use a flat prior in the log-odds scale, leading

to (3); this is the Beta (0, 0) distribution since it can be regarded as the limit of the numerator of the

beta distribution as α, β → 0. The second is Jeffreys’ prior, which in this case is proportional to

p (θ) = 1 / √[θ (1 − θ)],

and corresponds to the Beta(0.5, 0.5) distribution. Finally, the third candidate is the uniform prior

p (θ) = 1, which corresponds to the Beta(1, 1) distribution.

All three priors are data augmentation priors. The Beta (0, 0) prior adds no prior observations,

Jeffreys’ prior adds one observation with half a success and half a failure, and the uniform prior adds

two observations with one success and one failure. The ML estimator coincides with the posterior

mode for the Beta(1, 1) prior, and with the posterior mean for the Beta(0, 0) prior.
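A short sketch (hypothetical counts; SciPy assumed) shows how close the three reference priors are even in a modest sample:

```python
from scipy import stats

n, m = 20, 7   # hypothetical sample: 7 successes in 20 Bernoulli trials

priors = {"Beta(0,0), flat log-odds": (0.0, 0.0),
          "Beta(0.5,0.5), Jeffreys": (0.5, 0.5),
          "Beta(1,1), uniform": (1.0, 1.0)}

for name, (a, b) in priors.items():
    post = stats.beta(a + m, b + n - m)   # conjugate posterior
    print(f"{name:26s} posterior mean = {post.mean():.4f}")

print(f"MLE = {m / n:.4f} (equals the Beta(0,0) posterior mean)")
```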

4 Large-sample Bayesian inference

To evaluate the performance of a point estimator we typically resort to large-sample approximations.

The basic tools are consistency (convergence in probability of the estimation error to zero) and asymp-

totic normality (limiting sampling distribution of the scaled estimation error). Similarly, to obtain a

(frequentist) confidence interval we usually rely on an asymptotic approximation. Here we wish to

consider (i) asymptotic approximations to the posterior distribution, and (ii) the sampling properties

of Bayesian estimators in large samples.

The main result of large-sample Bayesian inference is that as the sample size increases the pos-

terior distribution of the parameter vector approaches a multivariate normal distribution, which is

independent of the prior distribution. The convergence is in probability, where the probability is mea-

sured with respect to the true distribution of y. Posterior asymptotic results formalize the notion that

the importance of the prior diminishes as n increases. Only when n is small is the prior choice an

important part of the specification of the model.

These results hold under suitable conditions on the prior distribution, the likelihood, and the

parameter space. Conditions include a prior that assigns positive probability to a neighborhood about

θ0; a posterior distribution that is not improper; identification; a likelihood that is a continuous

function of θ, and a true value θ0 that is not on the boundary of Θ.


4.1 Consistency of the posterior distribution

If the population distribution of a random sample y = (y1, ..., yn) is included in the parametric like-

lihood family, so that it equals f (yi | θ0) for some θ0, the posterior is consistent in the sense that it

converges to a point mass at the true parameter value θ0 as n → ∞. When the true distribution is not included in the parametric family, there is no longer a true value θ0, except in the sense of the

value θ0 that makes the model distribution f (yi | θ) closest to the true distribution g (yi) according

to the Kullback-Leibler divergence:

KL (θ) = ∫ ln [g (yi) / f (yi | θ)] g (yi) dyi,

so that4

θ0 = arg min_{θ∈Θ} KL (θ).

Here is a consistency theorem for the posterior distribution in the case of a discrete parameter space. The

result is valid if g (yi) is not included in the f (yi | θ) family, in which case we may refer to ∏_{i=1}^n f (yi | θ)

as a pseudo-likelihood and to p (θ | y) as a pseudo-posterior. The theorem and its proof are taken from

Gelman et al (2014, p. 586).

Theorem (finite parameter space) If the parameter space Θ is finite and Pr (θ = θ0) > 0,

then Pr (θ = θ0 | y) → 1 as n → ∞, where θ0 is the value of θ that minimizes the Kullback-Leibler

divergence.

Proof: For any θ ≠ θ0 let us consider the log posterior odds relative to θ0:

ln [p (θ | y) / p (θ0 | y)] = ln [p (θ) / p (θ0)] + ∑_{i=1}^n ln [f (yi | θ) / f (yi | θ0)] (4)

For fixed values of θ and θ0, if the yi’s are iid draws from g (yi), the second term on the right is a sum

of n iid random variables with a mean given by

E [ ln (f (yi | θ) / f (yi | θ0)) ] = KL (θ0) − KL (θ) ≤ 0.

Thus, as long as θ0 is the unique minimizer of KL (θ), for θ ≠ θ0 the second term on the right of (4)

is the sum of n iid random variables with negative mean. By the LLN, the sum approaches −∞ as

n → ∞. As long as the first term on the right is finite (provided p (θ0) > 0), the whole expression

approaches −∞ in the limit. Then p (θ | y) / p (θ0 | y) → 0, and so p (θ | y) → 0. Moreover, since all probabilities

add up to 1, p (θ0 | y)→ 1.

If θ has a continuous distribution, p (θ0 | y) is always zero for any finite sample, and so the previous

argument does not apply, but it can still be shown that p (θ | y) becomes more and more concentrated

about θ0 as n increases. A statement of the theorem for the continuous case in Gelman et al is as

follows.

4Equivalently, θ0 = arg max_{θ∈Θ} E [ln f (y | θ)].


Theorem (continuous parameter space) If θ is defined on a compact set and A is a neigh-

borhood of θ0 with nonzero prior probability, then Pr (θ ∈ A | y)→ 1 as n→∞, where θ0 is the value

of θ that minimizes KL (θ).

Bernoulli example Recall that the posterior distribution in this case is:

p (θ | y) ∝ θ^{m+α−1} (1 − θ)^{n−m+β−1} ∼ Beta (m + α, n − m + β)

with mean and variance given by5

E (θ | y) = (m + α) / (n + α + β) = m/n + O(1/n)

V ar (θ | y) = (m + α)(n − m + β) / [(n + α + β)² (n + α + β + 1)] = O(1/n).

As n increases the posterior distribution becomes concentrated at a value that does not depend on

the prior distribution.

By the strong LLN, for each ε > 0,6

Pr ( lim_{n→∞} |m/n − θ0| < ε | θ0 ) = 1.

Therefore, with probability 1, the sequence of posterior probability densities

lim_{n→∞} p (θ | y) = lim_{n→∞} Beta (nθ0 + α, n (1 − θ0) + β)

has a limit distribution with mean θ0 and variance 0, independent of α and β. Thus, under each

conjugate Beta prior, with probability 1, the posterior probability for θ converges to the Dirac Delta

distribution concentrated on the true parameter value.

4.2 Asymptotic normality of the posterior distribution

We have seen that as n → ∞ the posterior distribution converges to a degenerate measure at the

true value θ0 (posterior consistency). To obtain a non-degenerate limit, we consider the sequence of

posterior distributions of γ = √n (θ − θ̂), whose densities are given by7

p* (γ | y) = (1/√n) p (θ̂ + γ/√n | y).

5The mean and variance of X ∼ Beta (α, β) are E (X) = α/(α + β) and V ar (X) = αβ / [(α + β)² (α + β + 1)].

6That is, ∫ 1( lim_{n→∞} |m/n − θ0| < ε ) θ0^m (1 − θ0)^{n−m} dm = 1, for any θ0.
7We use the ML estimator θ̂ as the centering quantity, but the limiting result is unaffected if the posterior mode θ̃ is used instead, or if γ = √n (θ − Tn) with Tn = θ0 + [n I (θ0)]^{−1} [∂ ln L (θ0) / ∂θ].


The basic result of large-sample Bayesian inference is that as more and more data arrive, the

posterior distribution approaches a normal distribution. This result is known as the Bernstein-von

Mises Theorem. See, for example, Lehmann and Casella (1998, Theorem 8.2, p. 489), van der Vaart

(1998, Theorem 10.1, p. 141), or Chernozhukov and Hong (2003, Theorem 1, p. 305).

A formal statement for iid data and a scalar parameter, under the standard regularity conditions of

MLE asymptotics, the condition that the prior p (θ) is continuous and positive in an open neighborhood

of θ0, and some additional technical conditions, is as follows:

∫ | p* (γ | y) − (2πσ²_θ)^{−1/2} exp ( −γ² / (2σ²_θ) ) | dγ →p 0,

where σ²_θ = 1/I (θ0). That is, the L1 distance between the posterior, scaled and centered at the random quantity θ̂, and a N(0, σ²_θ) density goes to zero in probability. Thus, for large n, p (θ | y) is approximately a random normal density with random mean parameter θ̂ and a constant variance parameter I (θ0)^{−1}/n:

p (θ | y) ≈ N ( θ̂, (1/n) I (θ0)^{−1} ).

The result can be extended to a multidimensional parameter. To gain intuition for this result let us

consider a Taylor expansion of ln p (θ | y) about the posterior mode θ̃:

ln p (θ | y) ≈ ln p (θ̃ | y) + [∂ ln p (θ̃ | y) / ∂θ′] (θ − θ̃) + (1/2) (θ − θ̃)′ [∂² ln p (θ̃ | y) / ∂θ∂θ′] (θ − θ̃)

= c − (1/2) √n (θ − θ̃)′ [ −(1/n) ∂² ln p (θ̃ | y) / ∂θ∂θ′ ] √n (θ − θ̃)

Note that ∂ ln p (θ̃ | y) / ∂θ′ = 0. Moreover,

(1/n) ∂² ln p (θ̃ | y) / ∂θ∂θ′ = (1/n) ∂² ln p (θ̃) / ∂θ∂θ′ + (1/n) ∑_{i=1}^n ∂² ln f (yi | θ̃) / ∂θ∂θ′

= (1/n) ∑_{i=1}^n ∂² ln f (yi | θ̃) / ∂θ∂θ′ + O(1/n) ≈ −I (θ̃)

Thus, for large n the curvature of the log posterior can be approximated by the Fisher information:

ln p (θ | y) ≈ c − (1/2) √n (θ − θ̃)′ I (θ̃) √n (θ − θ̃).

Dropping terms that do not include θ we get the approximation

p (θ | y) ∝ exp [ −(1/2) (θ − θ̃)′ n I (θ̃) (θ − θ̃) ],

which corresponds to the kernel of a multivariate normal density N ( θ̃, (1/n) I (θ̃)^{−1} ).
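The quality of this approximation can be checked numerically in the Bernoulli example, where the exact posterior is known. The sketch below (hypothetical sample, uniform prior) compares posterior quantiles with those of the normal approximation:

```python
import numpy as np
from scipy import stats

n, m = 200, 70                     # hypothetical Bernoulli sample
theta_hat = m / n                  # MLE; with a uniform prior this is also the posterior mode

# Exact posterior under a Beta(1, 1) prior
post = stats.beta(1 + m, 1 + n - m)

# Normal approximation N(theta_hat, 1/(n I(theta_hat))),
# using the Bernoulli information I(theta) = 1/[theta (1 - theta)]
approx = stats.norm(theta_hat, np.sqrt(theta_hat * (1 - theta_hat) / n))

for q in (0.05, 0.50, 0.95):
    print(f"q={q:.2f}: exact {post.ppf(q):.4f} vs normal {approx.ppf(q):.4f}")
```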


Often, convergence to normality of the posterior distribution for a parameter θ can be improved

by transformation. If φ is a continuous transformation of θ, then both p (φ | y) and p (θ | y) approach

normal distributions, but the accuracy of the approximation for finite n can vary substantially with

the transformation chosen.

A Bernstein-von Mises Theorem states that under adequate conditions the posterior distribution

is asymptotically normal, centered at the MLE with a variance equal to the asymptotic frequentist

variance of the MLE. From a frequentist point of view, this implies that Bayesian methods can be used

to obtain statistically efficient estimators and consistent confidence intervals. The limiting distribution

does not depend on the Bayesian prior.

4.3 Asymptotic behavior of the posterior in pseudo-likelihood models

If g (yi) ≠ f (yi | θ) for all θ ∈ Θ, then the fitted model f (yi | θ) is misspecified. In that case the large-n sampling distribution of the pseudo-ML estimator is

√n (θ̂ − θ0) →d N (0, Σ_S),

where Σ_S is the sandwich covariance matrix:

Σ_S = Σ_M V Σ_M,

with Σ_M = [−E (Hi)]^{−1} ≡ [I (θ0)]^{−1}, V = E (qi qi′), and

qi = ∂ ln f (yi | θ0) / ∂θ,  Hi = ∂² ln f (yi | θ0) / ∂θ∂θ′.

In a correctly specified model the information identity V = Σ_M^{−1} holds, but in general V ≠ Σ_M^{−1}.

The large-sample shape of a posterior distribution obtained from ∏_{i=1}^n f (yi | θ) becomes close to

θ | y ∼ N ( θ̂, (1/n) Σ_M ).

Thus, misspecification produces a discrepancy between the sampling distribution of θ̂ and the shape of the (pseudo-)likelihood. That is, the pseudo-likelihood does not correctly reflect the sample information about θ contained in θ̂. So, for the purpose of Bayesian inference about θ0 (in the knowledge that θ0 is only a pseudo-true value) it makes sense to start from the correct large-sample approximation to the likelihood of θ̂ instead of the (incorrect) approximate likelihood of (y1, ..., yn). That is, to consider

a posterior distribution of the form:

p (θ | θ̂) ∝ exp [ −(n/2) (θ − θ̂)′ Σ_S^{−1} (θ − θ̂) ] × p (θ) (5)

This approach is proposed in Müller (2013) who shows that Bayesian inference about θ0 is of lower

asymptotic frequentist risk when the standard pseudo-posterior

p (θ | y) ∝ exp [ ∑_{i=1}^n ln f (yi | θ) ] × p (θ) (6)


is substituted by the pseudo-posterior (5) that relies on the asymptotic likelihood of θ̂ (an "artificial"

normal posterior centered at the MLE with sandwich covariance matrix).

4.4 Asymptotic frequentist properties of Bayesian inferences

The posterior mode is consistent in repeated sampling with fixed θ as n → ∞. Moreover, the posterior mode is also asymptotically normal in repeated samples. So the large-sample Bayesian statement

holds

[I (θ̂)]^{1/2} (θ − θ̂) | y ∼ N (0, I)

alongside the large-sample frequentist statement

[I (θ)]^{1/2} (θ̂ − θ) | θ ∼ N (0, I).

See for example Lehmann and Casella (1998, Theorem 8.3, p. 490).

These results imply that in regular estimation problems the posterior distribution is asymptotically

the same as the repeated sample distribution. So, for example, a 95% central posterior interval for θ

will cover the true value 95% of the time under repeated sampling with any fixed true θ. The frequentist

statement speaks of probabilities of θ̂ (y) whereas the Bayesian statement speaks of probabilities of θ.

Specifically,

Pr (θ ≤ r | y) = ∫ 1 (θ ≤ r) p (θ | y) dθ ∝ ∫ 1 (θ ≤ r) f (y | θ) p (θ) dθ

Pr [θ̂ (y) ≤ r | θ0] = ∫ 1 (θ̂ (y) ≤ r) f (y | θ0) dy.

These results require that the true data distribution is included in the parametric likelihood family.

Bernoulli example continued The posterior mode corresponding to the beta prior with para-

meters (α, β) in (1) and the maximum likelihood estimator θ̂ = m/n satisfy

√n (θ̃ − θ) = √n (θ̂ − θ) + Rn

where

Rn = [√n / (n + k)] (α − 1 − k m/n)

and k = α + β − 2. Since Rn →p 0, it follows that √n (θ̃ − θ) has the same asymptotic distribution as √n (θ̂ − θ), namely N [0, θ (1 − θ)]. Therefore, the normalized posterior mode has an asymptotic normal distribution, which is independent of the prior parameters and has the same asymptotic variance as that of the MLE, so that the posterior mode is asymptotically efficient.


Robustness to statistical principle and its failures The dual frequentist/Bayesian inter-

pretation of many textbook estimation procedures suggests that it is possible to aim for robustness to

statistical philosophies in statistical methodology, at least in regular estimation problems.

Even for small samples, many statistical methods can be considered as approximations to Bayesian

inferences based on particular prior distributions. As a way of understanding a statistical procedure,

it is often useful to determine the implicit underlying prior distribution (Gelman et al 2014, p. 92).

In the case of unit roots the symmetry of Bayesian probability statements and classical confidence

statements breaks down. With normal errors and a flat prior the Bayesian posterior is normal even

if the true data generating process is a random walk (Sims and Uhlig 1991). Kim (1998) studied

conditions for asymptotic posterior normality, which cover much more general situations than the

normal random walk with flat priors.

5 Markov chain Monte Carlo methods

A Markov Chain Monte Carlo method simulates a series of parameter draws such that the marginal

distribution of the series is (approximately) the posterior distribution of the parameters.

The posterior density is proportional to

p (θ | y) ∝ f (y | θ) p (θ) .

Usually f (y | θ) p (θ) is easy to compute. However, computation of point estimates and credible

intervals typically requires the evaluation of integrals of the form

∫_Θ h (θ) f (y | θ) p (θ) dθ / ∫_Θ f (y | θ) p (θ) dθ

for various functions h (.). For problems for which no analytic solution exists, MCMC methods provide

powerful tools for evaluating these integrals, especially when θ is high dimensional.

MCMC is a collection of computational methods that produce an ergodic Markov chain with

the stationary distribution p (θ | y). A continuous-state Markov chain is a sequence θ(1), θ(2), ..., that

satisfies the Markov property:

Pr (θ(j+1) | θ(j), ..., θ(1)) = Pr (θ(j+1) | θ(j)).

The probability Pr (θ′ | θ) of transitioning from state θ to state θ′ is called the transition kernel and we denote it K (θ′ | θ). Our interest will be in the steady-state probability distribution of the process.

Given a starting value θ(0), a chain (θ(1), θ(2), ..., θ(M)) is generated using a transition kernel with stationary distribution p (θ | y), which ensures the convergence of the marginal distribution of θ(M) to p (θ | y). For sufficiently large M, the MCMC methods produce a dependent sample


(θ(1), θ(2), ..., θ(M)) whose empirical distribution approaches p (θ | y). The ergodicity and construction of the chains usually imply that as M → ∞,

(1/M) ∑_{j=1}^M h (θ(j)) →p ∫_Θ h (θ) p (θ | y) dθ.

Analogously, a 90% interval estimate is constructed simply by taking the 0.05 and 0.95 quantiles of the sequence (h (θ(1)), ..., h (θ(M))).

In the theory of Markov chains one looks for conditions under which there exists an invariant

distribution, and conditions under which iterations of the transition kernel K (θ′ | θ) converge to the invariant distribution. In the context of MCMC methods the situation is the reverse: the invariant distribution is known, and in order to generate samples from it the methods look for a transition kernel whose iterations converge to the invariant distribution. The problem is to find a suitable K (θ′ | θ) that satisfies the invariance property:

p (θ′ | y) = ∫ K (θ′ | θ) p (θ | y) dθ. (7)

Under the invariance property, if θ(j) is a draw from p (θ | y) then θ(j+1) is also a draw from p (θ | y).

A useful fact is that the steady-state distribution p (θ | y) satisfies the detailed balance condition:

K (θ′ | θ) p (θ | y) = K (θ | θ′) p (θ′ | y) for all θ, θ′. (8)

The interpretation of equation (8) is that the amount of mass transitioning from θ′ to θ is the same

as the amount of mass that transitions back from θ to θ′.
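Both properties are easy to verify numerically when the state space is finite. The sketch below (a hypothetical four-state target with a symmetric uniform proposal) builds a Metropolis-type kernel of the kind introduced in the next subsection and checks (7) and (8):

```python
import numpy as np

p = np.array([0.1, 0.2, 0.3, 0.4])   # hypothetical target distribution on 4 states
S = p.size
q = np.full((S, S), 1.0 / S)         # symmetric uniform proposal q(s' | s)

# Metropolis transition kernel: propose s', accept with prob min(1, p(s')/p(s))
K = np.zeros((S, S))
for s in range(S):
    for t in range(S):
        if t != s:
            K[t, s] = q[t, s] * min(1.0, p[t] / p[s])
    K[s, s] = 1.0 - K[:, s].sum()    # rejected mass stays at s

flux = K * p[np.newaxis, :]          # flux[t, s] = K(t | s) p(s)
print("detailed balance (8):", np.allclose(flux, flux.T))
print("invariance (7):      ", np.allclose(K @ p, p))
```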

The invariance property is not enough to guarantee that an average of draws h (θ(j)) from K (θ′ | θ) converges to the posterior mean. It has to be proved that K (θ′ | θ) has a unique invariant distribution, that repeatedly drawing from K (θ′ | θ) leads to convergence to the unique invariant distribution regardless of the initial condition, and that the dependence of the draws θ(j) decays sufficiently fast

such that Monte Carlo sample averages converge to population means. Robert and Casella (2004)

provide a textbook treatment of the convergence theory for MCMC algorithms.

Two general methods of constructing transition kernels are the Metropolis-Hastings algorithm and

the Gibbs sampler, which we discuss in turn.

5.1 Metropolis-Hastings method

The Metropolis-Hastings algorithm proceeds by generating candidates that are either accepted or

rejected according to some probability, which is driven by a ratio of posterior evaluations. A description

of the algorithm is as follows.

Given the posterior density f (y | θ) p (θ), known up to a constant, and a prespecified conditional density q (θ′ | θ) called the "proposal distribution", generate (θ(1), θ(2), ..., θ(M)) in the following way:


1. Choose a starting value θ(0).

2. Draw a proposal θ* from q (θ* | θ(j)).

3. Update θ(j+1) from θ(j) for j = 1, 2, ..., using

θ(j+1) = θ* with probability ρ (θ* | θ(j)), and θ(j+1) = θ(j) with probability 1 − ρ (θ* | θ(j)),

where

ρ (θ* | θ(j)) = min { 1, [f (y | θ*) p (θ*) q (θ(j) | θ*)] / [f (y | θ(j)) p (θ(j)) q (θ* | θ(j))] }.

Some intuition for how the algorithm deals with a candidate transition from θ to θ′ is as follows

(Letham and Rudin 2012). If p (θ′ | y) > p (θ | y), then for every accepted draw of θ, we should have at least as many accepted draws of θ′, and so we always accept the transition θ → θ′. If p (θ′ | y) < p (θ | y), then for every accepted draw of θ, we should have on average p (θ′ | y) / p (θ | y) accepted draws of θ′. We thus accept the transition with probability p (θ′ | y) / p (θ | y). Thus, for any proposed transition, we accept it with probability min [1, p (θ′ | y) / p (θ | y)], which corresponds to ρ (θ′ | θ) when the proposal distribution is symmetric: q (θ′ | θ) = q (θ | θ′), as is the case in the original Metropolis algorithm.

The chain of draws so produced spends a relatively high proportion of time in the higher den-

sity regions and a lower proportion in the lower density regions. Because such proportions of times

are balanced in the right way, the generated sequence of parameter draws has the desired marginal

distribution in the limit. A key practical aspect of this calculation is that the posterior constant of

integration is not needed since ρ (θ′ | θ) only depends on a posterior ratio.

Choosing a proposal distribution To guarantee the existence of a stationary distribution,

the proposal distribution q (θ′ | θ) should be such that there is a positive density of reaching any state from any other state. A popular implementation of the M-H algorithm is to use the random walk proposal distribution:

q (θ′ | θ) = N (θ, σ²)

for some variance σ². In practice, one will try several proposal distributions to find out which is most

suitable in terms of rejection rates and coverage of the parameter space.8

Other practical considerations include discarding a certain number of the first draws to reduce the

dependence on the starting point (burn-in), and only retaining every d-th iteration of the chain to reduce the dependence between draws (thinning).

8See Letham and Rudin (2012) for examples of the practice of MCMC simulation using the OpenBUGS software

package.
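For concreteness, here is a minimal random-walk Metropolis sketch in Python (the target is the Beta posterior of the Bernoulli example, so the exact answer is available for comparison; all numbers are hypothetical), including burn-in and thinning:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 7                     # hypothetical Bernoulli data
alpha, beta = 1.0, 1.0           # uniform prior

def log_post(theta):
    """log of L(theta) p(theta) up to a constant; -inf outside (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -np.inf
    return (m + alpha - 1) * np.log(theta) + (n - m + beta - 1) * np.log(1 - theta)

M, sigma = 50_000, 0.1           # chain length, random-walk step size
theta, draws = 0.5, np.empty(M)
for j in range(M):
    proposal = theta + sigma * rng.standard_normal()    # symmetric proposal
    # accept with probability min(1, posterior ratio); the constant cancels
    if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
        theta = proposal
    draws[j] = theta

kept = draws[5_000::10]          # burn-in of 5,000 draws, keep every 10th
print("posterior mean: MCMC", kept.mean(), "exact", (m + alpha) / (n + alpha + beta))
print("90% credible interval:", np.quantile(kept, [0.05, 0.95]))
```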


Transition kernel and convergence of the M-H algorithm The M-H algorithm describes

how to generate a parameter draw θ(j+1) conditional on a parameter draw θ(j). Since the proposal

distribution q (θ′ | θ) and the acceptance probability ρ (θ′ | θ) depend only on the current state, the sequence of draws forms a Markov chain. The M-H transition kernel can be written as

K (θ′ | θ) = q (θ′ | θ) ρ (θ′ | θ) + r (θ) δ_θ (θ′). (9)

The first term q (θ′ | θ) ρ (θ′ | θ) is the density that θ′ is proposed given θ, times the probability that it is accepted. To this we add the term r (θ) δ_θ (θ′), which gives the probability r (θ) that conditional on θ the proposal is rejected, times the Dirac delta function δ_θ (θ′), equal to one if θ′ = θ and zero otherwise. Here

r (θ) = 1 − ∫ q (θ′ | θ) ρ (θ′ | θ) dθ′.

If the proposal is rejected, then the algorithm sets θ(j+1) = θ(j), which means that conditional on the

rejection, the transition density contains a point mass at θ = θ′, which is captured by the Dirac delta

function.

For the M-H algorithm to generate a sequence of draws from p (θ | y) a necessary condition is that

the posterior distribution is an invariant distribution under the transition kernel (9), namely that it

satisfies condition (7). See Lancaster (2004, p. 213) or Herbst and Schorfheide (2015) for proofs that

K (θ′ | θ) satisfies the invariance property.

5.2 Gibbs sampling

The Gibbs sampler is a fast sampling method that can be used in situations when we have access to

conditional distributions.

The idea behind the Gibbs sampler is to partition the parameter vector into two components

θ = (θ1, θ2). Instead of sampling θ(j+1) directly from K (θ | θ(j)), one first samples θ1(j+1) from p (θ1 | θ2(j)) and then samples θ2(j+1) from p (θ2 | θ1(j+1)). Clearly, if (θ1(j), θ2(j)) is a draw from the posterior distribution, so is (θ1(j+1), θ2(j+1)) generated as above, so that the Gibbs sampler kernel

satisfies the invariance property; that is, it has p (θ1, θ2 | y) as its stationary distribution (see Lancaster

2004, p. 209).

The Gibbs sampler kernel is

K (θ1, θ2 | θ1′, θ2′) = p (θ1 | θ2′) p (θ2 | θ1).

It can be regarded as a special case of Metropolis-Hastings where the proposal distribution is taken

to be the conditional posterior distribution.

The Gibbs sampler is related to data augmentation. A probit model nicely illustrates this aspect

(Lancaster 2004, Example 4.17, p. 211).
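As a minimal sketch (a hypothetical target: a bivariate normal posterior with correlation ρ, whose full conditionals are the standard normal conditionals), the two-block Gibbs sampler looks as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
rho, M = 0.8, 20_000             # hypothetical posterior correlation, chain length
draws = np.empty((M, 2))
t1 = t2 = 0.0                    # starting values

# Full conditionals of a standard bivariate normal with correlation rho:
# theta1 | theta2 ~ N(rho * theta2, 1 - rho^2), and symmetrically for theta2
sd = np.sqrt(1.0 - rho**2)
for j in range(M):
    t1 = rho * t2 + sd * rng.standard_normal()
    t2 = rho * t1 + sd * rng.standard_normal()
    draws[j] = (t1, t2)

kept = draws[1_000:]             # discard burn-in
print("means:", kept.mean(axis=0))                   # close to (0, 0)
print("correlation:", np.corrcoef(kept.T)[0, 1])     # close to rho
```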


Bibliographical note

• A good textbook source on applied Bayesian methods is Gelman, Carlin, Stern, Dunson, Vehtari,

and Rubin (2014). Textbook treatments of Bayesian econometrics include Koop (2003), Lancaster

(2004), Geweke (2005), and Greenberg (2012).

• Rothenberg (1973)’s Cowles Foundation Monograph 23 provides a classic discussion of the use of a priori information in frequentist and Bayesian approaches to econometrics.

• Herbst and Schorfheide (2015) provide an up-to-date account of applications of Bayesian inference to DSGE macro models.

• Arellano and Bonhomme (2011) review nonlinear panel data models, drawing on the link between random-effects approaches and Bayesian computation.

• Ruppert, Wand, and Carroll (2003) and Rossi (2014) discuss likelihood-based inference of flexible nonlinear models.

• Fiorentini, Sentana, and Shephard (2004) develop simulation-based Bayesian estimation methods of time-series latent variable models of financial volatility.

• The initial work on Bayesian asymptotics is due to Laplace. Further early work was done by

Bernstein (1917) and von Mises (1931). Textbook sources are Lehmann and Casella (1998) and van

der Vaart (1998). Chernozhukov and Hong (2003) provide a review of the literature.

• The Metropolis-Hastings (M-H) algorithm was developed by Metropolis et al. in 1953 and generalized by Hastings in 1970, but was unknown to statisticians until the early 1990s. Tierney (1994) and Chib and Greenberg (1995) created awareness about the algorithm and stimulated its use in statistics.

• The name Gibbs sampler was introduced by Geman and Geman (1984) after the statistical physicist Willard Gibbs.

• Chib (2001) and Robert and Casella (1999) provide excellent treatments of MCMC methods.

• In models with moment restrictions, Chernozhukov and Hong (2003) propose using a GMM-like criterion function in place of the unknown likelihood to calculate quasi-posterior distributions by

MCMC methods.


References

[1] Arellano, Manuel, and Stéphane Bonhomme (2011): “Nonlinear Panel Data Analysis”, Annual

Review of Economics, 3, 395—424.

[2] Bernstein, S. (1917): Theory of Probability, 4th Edition 1946. Gostekhizdat, Moscow—Leningrad

(in Russian).

[3] Chernozhukov, Victor, and Han Hong (2003): "An MCMC approach to classical estimation",

Journal of Econometrics, 115, 293—346.

[4] Chib, Siddhartha (2001): "Markov chain Monte Carlo methods: computation and inference". In:

Heckman, J.J., Leamer, E. (Eds.), Handbook of Econometrics, Vol. 5. North-Holland, Amsterdam,

3564—3634 (Chapter 5).

[5] Chib, Siddhartha and Edward Greenberg (1995): “Understanding the Metropolis—Hastings Algo-

rithm”, The American Statistician, 49, 327—335.

[6] Fiorentini, Gabriele, Enrique Sentana, and Neil Shephard (2004): "Likelihood-Based Estimation

of Latent Generalized ARCH Structures", Econometrica, 72(5), 1481—1517.

[7] Gelman, Andrew, John Carlin, Hal Stern, David Dunson, Aki Vehtari, and Donald Rubin (2014):

Bayesian Data Analysis, Third Edition, CRC Press.

[8] Geman, Stuart and Donald Geman (1984): “Stochastic Relaxation, Gibbs Distributions, and the

Bayesian Restoration of Images”, IEEE Transactions on Pattern Analysis and Machine Intelli-

gence, 6(6), 721-741.

[9] Geweke, John (2005): Contemporary Bayesian Econometrics and Statistics, John Wiley & Sons.

[10] Greenberg, Edward (2012): Introduction to Bayesian econometrics, Cambridge University Press.

[11] Hastings, W. K. (1970): "Monte Carlo Sampling Methods Using Markov Chains and Their Ap-

plications", Biometrika, 57(1), 97—109.

[12] Herbst, Edward, and Frank Schorfheide (2015): Bayesian Estimation of DSGE Models, Princeton.

[13] Kim, Jae-Young (1998): "Large Sample Properties of Posterior Densities, Bayesian Information

Criterion and the Likelihood Principle in Nonstationary Time Series Models", Econometrica, 66,

359—380.

[14] Koop, Gary (2003): Bayesian Econometrics, John Wiley & Sons.

[15] Lancaster, Tony (2004): An Introduction to Modern Bayesian Econometrics, Blackwell.


[16] Lehmann, E. L. and George Casella (1998): Theory of Point Estimation, Second Edition, Springer.

[17] Letham, Ben and Cynthia Rudin (2012): “Probabilistic Modeling and Bayesian Analysis”, Pre-

diction, Machine Learning, and Statistics Lecture Notes, Sloan School of Management, MIT.

[18] Metropolis, N., A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller (1953): "Equations of

State Calculations by Fast Computing Machines", Journal of Chemical Physics, 21, 1087—1092.

[19] Müller, Ulrich (2013): "Risk of Bayesian Inference in Misspecified Models, and the Sandwich

Covariance Matrix", Econometrica, 81(5), 1805—1849.

[20] Robert, C.P. and George Casella (1999): Monte Carlo Statistical Methods. Springer, Berlin.

[21] Rossi, Peter E. (2014): Bayesian Non- and Semi-parametric Methods and Applications, Princeton

University Press.

[22] Rothenberg, Thomas (1973): Efficient Estimation with A Priori Information, Cowles Foundation

Monograph 23, Yale University Press.

[23] Ruppert, David, M. P. Wand, and R. J. Carroll (2003): Semiparametric Regression, Cambridge

University Press.

[24] Sims, Christopher A. and Harald Uhlig (1991): "Understanding Unit Rooters: A Helicopter Tour",

Econometrica, 59(6), 1591—1599.

[25] Tierney, Luke (1994): "Markov Chains for Exploring Posterior Distributions" (with discussion),

Annals of Statistics, 22, 1701—1762.

[26] van der Vaart, A. W. (1998): Asymptotic Statistics, Cambridge University Press.

[27] von Mises, Richard (1931): Wahrscheinlichkeitsrechnung. Springer, Berlin (Probability Theory, in

German).


Time series

Class Notes

Manuel Arellano

Revised: February 12, 2018

1 Time series as stochastic outcomes

A time series is a sequence of data points {wt}_{t=1}^T observed over time, typically at equally spaced intervals; for example, the quarterly GDP per capita or the daily number of tweets that mention

a specific product. We wish to discuss probabilistic models that regard the observed time series as

a realization of a probability distribution function f (w1, ..., wT ). In a random sample observations

are independent and identically distributed, so that f (w1, ..., wT ) = ∏_{t=1}^T f (wt). In a time series,

observations near in time tend to be more similar, in which case the independence assumption is not

appropriate. Moreover, the level or other features of the series often change over time, and in that

case the assumption of identically distributed observations is not appropriate either. Thus, the joint

distribution of the data may differ from the product of marginal distributions of each data point

f (w1, ..., wT ) ≠ f1 (w1) × f2 (w2) × ... × fT (wT )

and the form of those marginal distributions may change over time. Due to the natural temporal

ordering of the data, a factorization that will often be useful is

f (w1, ..., wT ) = f1 (w1) ∏_{t=2}^T ft (wt | wt−1, ..., w1).

If the distributions f1 (.), f2 (.), ... changed arbitrarily there would be no regularities on which

to base their statistical analysis, since we only have one observation on each distribution. Similarly,

if the joint distributions of consecutive pairs of observations changed arbitrarily, there would be no

regularities on which to base the analysis of their dependence. Thus, modeling the dependence among

observations and their evolving pattern are central to time series analysis. The basic building block

to facilitate statistical analysis of time series is some stationary form of dependence that preserves the

assumption of identically distributed observations. First we will introduce the concept of stationary

dependence and later on we will discuss ways of introducing nonstationarities.

Stochastic process A stochastic process is a collection of random variables that are indexed

with respect to the elements in a set of indices. The set may be finite or infinite and contain integer

or real numbers. Integer numbers may be equidistant or irregularly spaced. In the case of our time

series the set is {1, 2, 3, ..., T}, but usually it is convenient to consider a set of indices t covering all integers from −∞ to +∞. In that case we are dealing with a doubly infinite sequence of the form

{wt}_{t=−∞}^∞ = {..., w−1, w0, w1, w2, ..., wT , wT+1, ...},


which is regarded as a single realization of the stochastic process, and the observed time series is a

portion of this realization. Hypothetical repeated realizations of the process could be indexed as:

{ w_t^(1), w_t^(2), ..., w_t^(n) }_{t=−∞}^∞ .

2 Stationarity

A process {wt} is (strictly) stationary if the joint distribution of wt1 , wt2 , ..., wtk for a given subset of indices t1, t2, ..., tk is equal to the joint distribution of wt1+j , wt2+j , ..., wtk+j for any j > 0. That is,

the distribution of a collection of time points only depends on how far apart they are, and not where

they start:

f (wt1 , wt2 , ..., wtk) = f (wt1+j , wt2+j , ..., wtk+j) .

This implies that marginal distributions are all equal, that the joint distribution of any pair of variables

only depends on the time interval between them, and so on:

fwt (.) = fws (.) for all t, s

fwt,ws (., .) = fwt+j ,ws+j (., .) for all t, s, j.

In terms of moments, the implication is that the unconditional mean and variance µt, σ²_t of the distribution fwt (.) are constant:

E (wt) ≡ µt = µ, V ar (wt) ≡ σ²_t = σ².

Moreover, the covariance γt,s between wt and ws only depends on |t− s|:

Cov (wt, ws) ≡ γt,s = γ|t−s|.

Thus, using the notation γ0 = σ2, the covariance matrix of w = (w1, ..., wT ) takes the form

V ar (w) =
[ γ0        γ1        . . .   γ_{T−1}
  γ1        γ0        . . .   γ_{T−2}
  ...       ...       . . .   ...
  γ_{T−1}   γ_{T−2}   . . .   γ0 ].

Similarly, the correlation ρt,s between wt and ws only depends on |t− s|:

ρt,s ≡ γt,s / (σt σs) = ρ|t−s|.

The quantity ρj is called the autocorrelation of order j and when seen as a function of j it is called

the autocorrelation function.
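The sample counterpart is straightforward to compute; the sketch below (hypothetical data from an AR(1) process with coefficient 0.7, anticipating Section 4) estimates the autocorrelation function:

```python
import numpy as np

rng = np.random.default_rng(0)
T, rho = 500, 0.7
w = np.zeros(T)
for t in range(1, T):            # simulate a stationary AR(1) series
    w[t] = rho * w[t - 1] + rng.standard_normal()

def sample_acf(x, nlags):
    """Sample autocorrelations rho_hat_j for j = 1, ..., nlags."""
    x = x - x.mean()
    gamma0 = np.mean(x**2)       # sample variance, gamma_0
    return np.array([np.mean(x[j:] * x[:-j]) / gamma0 for j in range(1, nlags + 1)])

print("sample ACF  :", sample_acf(w, 5).round(3))
print("theory rho^j:", (rho ** np.arange(1, 6)).round(3))
```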


A stationary process is also called strictly stationary in contrast with weaker forms of stationarity.

For example, we talk of stationarity in mean if µt = µ or of covariance stationarity (or weak station-

arity) if the process is stationary in mean, variance and covariances. In a normal process covariance

stationarity is equivalent to strict stationarity.

Processes that are uncorrelated or independent A sequence of serially uncorrelated random

variables with zero mean and constant finite variance is called a “white noise” process; that is, white noise is a covariance stationary process {wt} such that

E (wt) = 0

V ar (wt) = γ0 < ∞

Cov (wt, wt−j) = 0 for all j ≠ 0.

In this process observations are uncorrelated but not necessarily independent. In an independent white

noise process wt is also statistically independent of past observations:

f (wt | wt−1, wt−2, ..., w1) = f (wt).

Another possibility is a mean independent white noise process that satisfies

E (wt | wt−1, wt−2, ..., w1) = 0.

In this case wt is called a martingale difference sequence. A martingale difference is a stronger form of

independence than uncorrelatedness but weaker than statistical independence. For example, a martin-

gale difference does not rule out the possibility that E (w²_t | wt−1, ..., w1) depends on past observations.

Prediction Consider the problem of selecting a predictor of wt given a set of past values

wt−1, ..., wt−j. The conditional mean E (wt | wt−1, ..., wt−j) is the best predictor when the loss function is quadratic. Similarly, the linear projection E∗ (wt | wt−1, ..., wt−j) is the best linear predictor under quadratic loss. For example, for a stationary process {wt}

E∗ (wt | wt−1) = α+ βwt−1

with β = γ1/γ0 and α = (1− β)µ. We can also write

wt = α+ βwt−1 + νt

where νt is the prediction error, which by construction is orthogonal to wt−1.

For convenience, predictors based on all past history are often considered:

Et−1 (wt) = E (wt | wt−1, wt−2, ...)


or

E∗t−1 (wt) = E∗ (wt | wt−1, wt−2, ...) ,

which are defined as the corresponding quadratic-mean limits of predictors given wt−1, ..., wt−j as j → ∞.

Linear predictor k-period-ahead Let wt be a stationary time series with zero mean and let

ut denote the innovation in wt so that

wt = E∗t−1 (wt) + ut.

ut is a one-step-ahead forecast error that is orthogonal to all past values of the series. Similarly,

wt+1 = E∗t (wt+1) + ut+1.

Moreover, since the spaces spanned by (wt, wt−1, wt−2, ...) and (ut, wt−1, wt−2, ...) are the same, and

ut is orthogonal to (wt−1, wt−2, ...) we have:

E∗t (wt+1) = E∗ (wt+1 | ut, wt−1, wt−2, ...) = E∗t−1 (wt+1) + E∗ (wt+1 | ut) .

Thus, E∗ (wt+1 | ut)+ut+1 is the two-step-ahead forecast error in wt+1. In a similar way we can obtain

incremental errors for E∗t−1 (wt+2) , ..., E∗t−1 (wt+k).

Wold decomposition Letting E∗ (wt+1 | ut) = ψ1ut, we can write

wt+1 = ut+1 + ψ1ut + E∗t−1 (wt+1)

and repeating the argument we obtain the following representation of the process:

wt = (ut + ψ1ut−1 + ψ2ut−2 + ...) + κt

where ut ≡ wt −E∗t−1 (wt), ut−1 ≡ wt−1 −E∗t−2 (wt−1), etc. and κt denotes the linear prediction of wt

at the beginning of the process. This representation is called the Wold decomposition, after the work

of Herman Wold. It exists for any covariance stationary process with zero mean. The one-step-ahead

forecast errors ut are white noise and it can be shown that ∑_{j=0}^∞ ψ²_j < ∞ (with ψ0 = 1).1

The term κt is called the linearly deterministic part of wt because it is perfectly predictable based

on past observations of wt. The other part, consisting of ∑_{j=0}^∞ ψj ut−j , is the linearly indeterministic

part of the process. The indeterministic part is the linear projection of wt onto the current and past

linear forecast errors, and the deterministic part is the corresponding projection error. If κt = 0, wt is

a purely non-deterministic process, also called a linearly regular process.

1See T. Sargent, Macroeconomic Theory, 1979.


Ergodicity A stochastic process is ergodic if it has the same behavior averaged over time as

averaged over the sample space. Specifically, a covariance stationary process is ergodic in mean if the

time series mean converges in probability to the same limit as a (hypothetical) cross-sectional mean

(known as the ensemble average), that is, to E (wt) = µ:

w̄T = (1/T) ∑_{t=1}^T wt →p µ.

Ergodicity requires that the autocovariances γj tend to zero sufficiently fast as j increases. In the next section we check that {wt} is ergodic in mean if the following absolute summability condition is satisfied:

∑_{j=0}^∞ |γj| < ∞. (1)

Similarly, a covariance stationary process is ergodic in covariance if

[1/(T − j)] ∑_{t=j+1}^T (wt − µ)(wt−j − µ) →p γj .

In the special case in which {wt} is a normal stationary process, condition (1) guarantees ergodicity for all moments.2

Example of stationary non-ergodic process Suppose that

wt = η + εt

where η ∼ iid (0, σ²_η) and εt ∼ iid (0, σ²_ε) independent of η. We have

µ = E (wt) = E (η) + E (εt) = 0

γ0 = V ar (wt) = V ar (η) + V ar (εt) = σ²_η + σ²_ε

γj = Cov (wt, wt−j) = σ²_η for j ≠ 0.

Note that condition (1) is not satisfied in this example.

Let the index i denote a realization of the process in the probability space. The process is stationary

and yet

w̄_T^(i) = (1/T) ∑_{t=1}^T w_t^(i) →p η^(i)

instead of converging to µ = 0. Moreover,

[1/(T − j)] ∑_{t=j+1}^T (w_t^(i) − η^(i)) (w_{t−j}^(i) − η^(i)) →p 0

(or (T − j)^{−1} ∑_{t=j+1}^T w_t^(i) w_{t−j}^(i) →p (η^(i))²) instead of converging to γj = σ²_η.

2In general, a stationary process is ergodic in distribution if T^{−1} ∑_{t=1}^T 1 (wt ≤ r) →p Pr (wt ≤ r) for any r, where 1 (wt ≤ r) = 1 if wt ≤ r and 1 (wt ≤ r) = 0 if wt > r.
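A short simulation (hypothetical variances) makes the failure of ergodicity visible: each realization's time average settles on its own η^(i) rather than on µ = 0.

```python
import numpy as np

rng = np.random.default_rng(0)
T, sigma_eta, sigma_eps = 10_000, 1.0, 1.0   # hypothetical parameter values

for i in range(3):                            # three realizations of the process
    eta = sigma_eta * rng.standard_normal()          # eta drawn once per realization
    w = eta + sigma_eps * rng.standard_normal(T)     # w_t = eta + eps_t
    print(f"realization {i}: time average = {w.mean():+.3f}, eta = {eta:+.3f}")
# The time averages track each eta^(i), not the ensemble mean mu = 0.
```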


3 Asymptotic theory with dependent observations

Here we consider a law of large numbers and a central limit theorem for covariance stationary processes.

Law of Large Numbers Let {wt} be a covariance stationary stochastic process with E (wt) = µ and Cov (wt, wt−j) = γj such that ∑_{j=0}^∞ |γj| < ∞. Let the sample mean be w̄T = (1/T) ∑_{t=1}^T wt. Then (i) w̄T →p µ, and (ii) V ar (√T w̄T) → ∑_{j=−∞}^∞ γj .

A sufficient condition for w̄T →p µ is that E (w̄T) → µ and V ar (w̄T) → 0. For any T we have E (w̄T) = µ. Next,

V ar (w̄T) = E [(w̄T − µ)²] = (1/T²) ∑_{t=1}^T ∑_{s=1}^T E [(wt − µ)(ws − µ)]

= (1/T²) [T γ0 + 2 (T − 1) γ1 + 2 (T − 2) γ2 + ... + 2 γT−1].

To show that V ar (w̄T) → 0, we show that T V ar (w̄T) is bounded under the assumption ∑_{j=0}^∞ |γj| < ∞:

T V ar (w̄T) = | γ0 + 2 [(T − 1)/T] γ1 + 2 [(T − 2)/T] γ2 + ... + 2 (1/T) γT−1 | (2)

≤ |γ0| + 2 [(T − 1)/T] |γ1| + 2 [(T − 2)/T] |γ2| + ... + 2 (1/T) |γT−1|

≤ |γ0| + 2 |γ1| + 2 |γ2| + ... + 2 |γT−1| + ....

To check that V ar (√T w̄T) → ∑_{j=−∞}^∞ γj , see J. Hamilton, Time Series Analysis, 1994, pp. 187–188.

Consistent estimation of second-order moments Let us consider the sample autocovariance

γ̂j = [1/(T − j)] ∑_{t=j+1}^T (wt − w̄0)(wt−j − w̄−j) = [1/(T − j)] ∑_{t=j+1}^T wt wt−j − w̄0 w̄−j

where w̄0 = (T − j)^{−1} ∑_{t=j+1}^T wt and w̄−j = (T − j)^{−1} ∑_{t=j+1}^T wt−j . Let us define zt = wt wt−j . The

previous LLN can be applied to the process zt to state conditions under which

[1/(T − j)] ∑_{t=j+1}^T wt wt−j →p E (wt wt−j).

Note that if wt is strictly stationary so is zt.

Central Limit Theorem A central limit theorem provides conditions under which

(w̄T − µ) / √V ar (w̄T) →d N (0, 1).

Since in our context T V ar (w̄T) → ∑_{j=−∞}^∞ γj , an asymptotically equivalent statement is

√T (w̄T − µ) / √( ∑_{j=−∞}^∞ γj ) →d N (0, 1),


or

√T (w̄T − µ) →d N ( 0, ∑_{j=−∞}^∞ γj ). (3)

A condition under which this result holds is:

wt = µ + ∑_{j=0}^∞ ψj vt−j (4)

where {vt} is an i.i.d. sequence with E (v²_t) < ∞ and ∑_{j=0}^∞ |ψj| < ∞ (T.W. Anderson, The Statistical Analysis of Time Series, 1971, p. 429). Result (3) also holds when the innovation process {vt} in (4) is a martingale difference sequence satisfying certain conditions.3

A multivariate version of (3) for the case in which {wt} is a vector-valued process is as follows:

√T (w̄T − µ) →d N ( 0, ∑_{j=−∞}^∞ Γj ), (5)

where

Γ0 = E [(wt − µ)(wt − µ)′]

and for j ≠ 0:

Γj = E [(wt − µ)(wt−j − µ)′].

Note that the autocovariance matrix Γj is not symmetric. We have Γ−j = Γ′j .

As in the scalar case, a condition under which result (5) holds is:

wt = µ + ∑_{j=0}^∞ Ψj vt−j

where {vt} is an i.i.d. vector sequence with E (vt) = 0, E (vt vt′) = Ω a symmetric positive definite matrix, and the sequence of matrices {Ψj}_{j=0}^∞ is absolutely summable.4

Consistent estimation of the asymptotic variance To be able to use the previous central

limit theory for the construction of interval estimations and test statistics we need consistent estimators

of V = ∑_{j=−∞}^∞ γj . One possibility is to parameterize γj ; for example, assuming that the γj satisfy the

restrictions imposed by an ARMA model of the type discussed in the next section. Another

possibility is to obtain a flexible estimator of V of the type considered by Hansen (1982), and Newey

and West (1987), among others.5

The Newey-West estimator is a sample counterpart of expression (2) truncated after m lags:

V̂ = γ̂0 + 2 ∑_{j=1}^m [1 − j/(m + 1)] γ̂j .

3Theorem 3.15 in P.C.B. Phillips and V. Solo (1992) “Asymptotics for Linear Processes,” The Annals of Statistics 20.
4The matrix sequence {Ψj}_{j=0}^∞ is absolutely summable if each of its elements forms an absolutely summable sequence.
5Hansen (1982) “Large Sample Properties of GMM Estimators”, Econometrica 50. Newey and West (1987) “A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix”, Econometrica 55.


This estimator can be shown to be consistent for V if the truncation parameter m goes to infinity

with T more slowly than T^{1/4} or than T^{1/2}, depending on the type of process. Nevertheless, the

specification of an appropriate growth rate for m gives little practical guidance on the choice of m.

Similarly, in the vector case the Newey-West estimator of V = ∑_{j=−∞}^∞ Γj is given by

V̂ = Γ̂0 + ∑_{j=1}^m [1 − j/(m + 1)] (Γ̂j + Γ̂j′) (6)

where Γ̂j = (T − j)^{−1} ∑_{t=j+1}^T (wt − w̄0)(wt−j − w̄−j)′. A nice property of the Newey-West estimator (6) is that it is guaranteed to be a positive semi-definite matrix by construction.6

6The estimator V̂ = Γ̂0 + ∑_{j=1}^m (Γ̂j + Γ̂j′) has the same large-sample justification as (6) but is not necessarily positive semi-definite.
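A direct sample counterpart is easy to code. The sketch below (scalar case; the AR(1) data and the choice m = 8 are hypothetical) estimates the long-run variance V = ∑_j γj, whose true value for an AR(1) with unit innovation variance is 1/(1 − ρ)²:

```python
import numpy as np

def newey_west(w, m):
    """Newey-West estimate of V = sum_j gamma_j using Bartlett weights, m lags."""
    v = np.asarray(w, dtype=float)
    v = v - v.mean()
    V = np.mean(v**2)                          # gamma_0
    for j in range(1, m + 1):
        gamma_j = np.mean(v[j:] * v[:-j])      # sample autocovariance at lag j
        V += 2.0 * (1.0 - j / (m + 1)) * gamma_j
    return V

rng = np.random.default_rng(0)
T, rho = 5_000, 0.5
w = np.zeros(T)
for t in range(1, T):                          # hypothetical AR(1) data, sigma^2 = 1
    w[t] = rho * w[t - 1] + rng.standard_normal()

print("Newey-West estimate:", newey_west(w, m=8))
print("theoretical value  :", 1.0 / (1.0 - rho)**2)
```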

4 Autoregressive and moving average models

4.1 Autoregressive models

A first-order autoregressive process (or Markov process) assumes that wt is independent of wt−2, wt−3, ... conditionally on wt−1:

ft (wt | wt−1, ..., w1) = ft (wt | wt−1)

and, therefore, also

E (wt | wt−1, ..., w1) = E (wt | wt−1) .

Moreover, the standard linear AR(1) model also assumes

E (wt | wt−1) = α+ ρwt−1

V ar (wt | wt−1) = σ2.

Moment properties  These assumptions have the following implications for marginal moments:

$$E(w_t) = E[E(w_t \mid w_{t-1})] = \alpha + \rho E(w_{t-1}) \quad (7)$$
$$Var(w_t) = Var[E(w_t \mid w_{t-1})] + E[Var(w_t \mid w_{t-1})] = \rho^2 Var(w_{t-1}) + \sigma^2$$
$$Cov(w_t, w_{t-1}) = Cov(E(w_t \mid w_{t-1}), w_{t-1}) = Cov(\alpha + \rho w_{t-1}, w_{t-1}) = \rho\, Var(w_{t-1}).$$

Moreover,

$$E(w_t \mid w_{t-2}) = \alpha + \rho E(w_{t-1} \mid w_{t-2}) = \alpha(1 + \rho) + \rho^2 w_{t-2}.$$

6 The estimator $\widehat{V} = \widehat{\Gamma}_0 + \sum_{j=1}^{m}\left(\widehat{\Gamma}_j + \widehat{\Gamma}_j'\right)$ has the same large sample justification as (6) but is not necessarily positive semi-definite.


In general

$$E(w_t \mid w_{t-j}) = \alpha(1 + \rho + ... + \rho^{j-1}) + \rho^j w_{t-j}$$

and

$$Cov(w_t, w_{t-j}) = Cov(E(w_t \mid w_{t-j}), w_{t-j}) = \rho^j Var(w_{t-j}).$$

In view of the recursion (7) we have

$$E(w_t) = \alpha(1 + \rho + ... + \rho^{t-1}) + \rho^t E(w_0).$$

Stability and stationarity  For the process to be stationary it is required that $|\rho| < 1$. In itself, $|\rho| < 1$ is a condition of stability under which as $t \to \infty$ we obtain

$$E(w_t) \to \mu = \frac{\alpha}{1-\rho}$$
$$Var(w_t) \to \gamma_0 = \frac{\sigma^2}{1-\rho^2}$$
$$Cov(w_t, w_{t-j}) \to \gamma_j = \rho^j\frac{\sigma^2}{1-\rho^2}.$$

These quantities are known as the steady state mean, variance and autocovariances. Thus, regardless of the starting point, under the stability condition the AR(1) process is asymptotically covariance stationary.

If the AR(1) process is stationary (due to being stable and having started in the distant past or having started at $t = 1$ with the steady state distribution) then $E(w_t) = \mu = \alpha/(1-\rho)$, $Var(w_t) = \gamma_0 = \sigma^2/(1-\rho^2)$ and $Cov(w_t, w_{t-j}) = \gamma_j = \rho^j\sigma^2/(1-\rho^2)$.

The autocorrelation function of a stationary AR(1) process decreases exponentially and is given by $\rho^j$.

Letting $u_t = w_t - \alpha - \rho w_{t-1}$, the Wold representation of a stationary AR(1) process can be obtained by repeated substitutions and is given by:

$$w_t = \mu + u_t + \rho u_{t-1} + \rho^2 u_{t-2} + \rho^3 u_{t-3} + ...$$

The parameter $\rho$ measures the persistence in the process. The closer $\rho$ is to one, the more persistent the deviations of the process from its mean will be.

Normality assumptions  The standard additional assumption to fully specify the distribution of $w_t \mid w_{t-1}$ is conditional normality:

$$w_t \mid w_{t-1}, ..., w_1 \sim N(\alpha + \rho w_{t-1}, \sigma^2). \quad (8)$$


In itself this assumption does not imply unconditional normality. However, if we assume that the initial observation is normally distributed with the steady state mean and variance:

$$w_1 \sim N\left(\frac{\alpha}{1-\rho}, \frac{\sigma^2}{1-\rho^2}\right), \quad (9)$$

then the process is fully stationary and $(w_1, ..., w_T)$ is jointly normally distributed as follows:

$$\begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_T \end{pmatrix} \sim N\left( \frac{\alpha}{1-\rho}\begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix},\ \frac{\sigma^2}{1-\rho^2}\begin{pmatrix} 1 & \rho & \cdots & \rho^{T-1} \\ \rho & 1 & \cdots & \rho^{T-2} \\ \vdots & \vdots & \ddots & \vdots \\ \rho^{T-1} & \rho^{T-2} & \cdots & 1 \end{pmatrix} \right).$$

Normal likelihood functions  Under assumption (8), the log likelihood function of the time series $w_1, ..., w_T$ conditioned on the first observation is given by (up to an additive constant):

$$L(\alpha, \rho, \sigma^2) = -\frac{(T-1)}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{t=2}^{T}(w_t - \alpha - \rho w_{t-1})^2.$$

The corresponding maximum likelihood estimates are:

$$\widehat{\rho} = \frac{\sum_{t=2}^{T}(w_t - \bar{w}_0)(w_{t-1} - \bar{w}_{-1})}{\sum_{t=2}^{T}(w_{t-1} - \bar{w}_{-1})^2} \quad (10)$$
$$\widehat{\alpha} = \bar{w}_0 - \widehat{\rho}\,\bar{w}_{-1}$$
$$\widehat{\sigma}^2 = \frac{1}{(T-1)}\sum_{t=2}^{T}(w_t - \widehat{\alpha} - \widehat{\rho}w_{t-1})^2$$

where $\bar{w}_0 = (T-1)^{-1}\sum_{t=2}^{T}w_t$ and $\bar{w}_{-1} = (T-1)^{-1}\sum_{t=2}^{T}w_{t-1}$.
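A quick numerical check of these closed-form expressions on simulated AR(1) data (seed and parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
T, alpha, rho, sigma = 2000, 0.5, 0.8, 1.0

# Simulate an AR(1) started from the steady state distribution
w = np.empty(T)
w[0] = alpha / (1 - rho) + sigma / np.sqrt(1 - rho**2) * rng.standard_normal()
for t in range(1, T):
    w[t] = alpha + rho * w[t - 1] + sigma * rng.standard_normal()

# Conditional MLE = OLS of w_t on (1, w_{t-1}), as in (10)
y, ylag = w[1:], w[:-1]
rho_hat = np.sum((y - y.mean()) * (ylag - ylag.mean())) / np.sum((ylag - ylag.mean())**2)
alpha_hat = y.mean() - rho_hat * ylag.mean()
sigma2_hat = np.mean((y - alpha_hat - rho_hat * ylag)**2)
print(rho_hat, alpha_hat, sigma2_hat)  # close to 0.8, 0.5, 1.0
```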

Under the steady state assumption (9) the log likelihood of the first observation is given by:

$$\ell_1(\alpha, \rho, \sigma^2) = -\frac{1}{2}\ln\sigma^2 + \frac{1}{2}\ln(1-\rho^2) - \frac{(1-\rho^2)}{2\sigma^2}\left(w_1 - \frac{\alpha}{1-\rho}\right)^2.$$

Thus, the full log likelihood function under assumptions (8) and (9) becomes:

$$L^*(\alpha, \rho, \sigma^2) = L(\alpha, \rho, \sigma^2) + \ell_1(\alpha, \rho, \sigma^2).$$

The estimators that maximize $L^*(\alpha, \rho, \sigma^2)$ lack a closed form expression.

Asymptotic properties of OLS estimates in the AR(1) model  Let us focus on the OLS estimate of the autoregressive parameter (10) when $|\rho| < 1$. Since $\widehat{\rho} = \widehat{\gamma}_1/\widehat{\gamma}_0$, consistency of $\widehat{\rho}$ for $\rho = \gamma_1/\gamma_0$ follows under conditions ensuring the consistency of sample autocovariances. Next, consider the scaled estimation error and its large-sample approximation:

$$\sqrt{T}(\widehat{\rho} - \rho) = \left[\frac{1}{T}\sum_{t=2}^{T}(w_{t-1} - \bar{w}_{-1})^2\right]^{-1}\frac{1}{\sqrt{T}}\sum_{t=2}^{T}(w_{t-1} - \bar{w}_{-1})u_t \approx \left(\frac{\sigma^2}{1-\rho^2}\right)^{-1}\frac{1}{\sqrt{T}}\sum_{t=2}^{T}(w_{t-1} - \mu)u_t.$$

Under conditions ensuring the asymptotic normality result

$$T^{-1/2}\sum_{t=2}^{T}(w_{t-1} - \mu)u_t \stackrel{d}{\to} N(0, \omega)$$

with $\omega = E[u_t^2(w_{t-1} - \mu)^2] = \sigma^4/(1-\rho^2)$, we obtain

$$\sqrt{T}(\widehat{\rho} - \rho) \stackrel{d}{\to} N(0, 1-\rho^2).$$

When ρ ≥ 1 this result does not hold and the OLS properties are non-standard.

Forecasting with a stable AR(1) process  A one-period-ahead forecast is

$$E_t(w_{t+1}) = \alpha + \rho w_t,$$

a two-period-ahead forecast is

$$E_t(w_{t+2}) = \alpha(1 + \rho) + \rho^2 w_t,$$

and k periods ahead

$$E_t(w_{t+k}) = \alpha(1 + \rho + ... + \rho^{k-1}) + \rho^k w_t = \left(\frac{\alpha}{1-\rho}\right)(1-\rho^k) + \rho^k w_t.$$

Thus, a k-period-ahead forecast is a convex combination of the steady state mean and the most recent value of the process available. As $k \to \infty$ the optimal forecast tends to the steady state mean $\alpha/(1-\rho)$.
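In code, the k-period-ahead forecast formula is one line (the values of $\alpha$, $\rho$ and $w_t$ below are hypothetical):

```python
def ar1_forecast(w_t, alpha, rho, k):
    """k-period-ahead forecast: a convex combination of the
    steady state mean and the last observed value."""
    mu = alpha / (1 - rho)
    return mu * (1 - rho**k) + rho**k * w_t

# As k grows the forecast shrinks toward mu = 0.5 / (1 - 0.8) = 2.5
for k in (1, 2, 5, 20):
    print(k, ar1_forecast(w_t=5.0, alpha=0.5, rho=0.8, k=k))
```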

AR(p) process  A generalization of the AR(1) process is to an AR(p) process that specifies linear dependence on the first p lags:

$$w_t = \alpha + \rho_1 w_{t-1} + ... + \rho_p w_{t-p} + u_t.$$

Second-order or higher-order autoregressive processes can capture richer patterns of behavior in time series, including stochastic cycles.

4.2 Moving average models

To motivate the moving average model, consider the stationary linear process with iid shocks in (4):

$$w_t = \mu + u_t + \psi_1 u_{t-1} + \psi_2 u_{t-2} + ...$$

The independent white noise process is the special case when $\psi_j = 0$ for all $j \geq 1$ and $\mu = 0$. A first-order moving average process relaxes the independence assumption by allowing $\psi_1$ to be non-zero while setting $\psi_j = 0$ for $j > 1$. Thus, the form of an MA(1) process is

$$w_t = \mu + u_t - \theta u_{t-1}$$

where $u_t \sim iid(0, \sigma^2)$.


Moment properties  In this case

$$E(w_t) = \mu$$
$$Var(w_t) = \gamma_0 = (1 + \theta^2)\sigma^2$$
$$Cov(w_t, w_{t-1}) = \gamma_1 = -\theta\sigma^2$$
$$Cov(w_t, w_{t-j}) = \gamma_j = 0 \ \text{for } j > 1.$$

Note that the MA(1) process is stationary for all values of $\theta$.

The first-order autocorrelation is

$$\rho_1 = \frac{\gamma_1}{\gamma_0} = -\frac{\theta}{1+\theta^2},$$

which means that $-0.5 \leq \rho_1 \leq 0.5$.

Indeterminacy and invertibility  The moving average parameter $\theta$ solves:

$$\rho_1\theta^2 + \theta + \rho_1 = 0. \quad (11)$$

The product of the roots of this equation is unity,7 so that if $\theta$ is a solution, then $1/\theta$ is also a solution. Moreover, if one solution is less than unity in absolute value, the other one will be greater than unity.

If $|\theta| < 1$ then it can be seen that the MA(1) model can be written as a convergent series of past values of the process:

$$(w_t - \mu) + \sum_{j=1}^{\infty}\theta^j(w_{t-j} - \mu) = u_t.$$

If on the contrary $|\theta| > 1$, the MA(1) model can also be written as a convergent series but one involving the future values of the process:

$$(w_t - \mu) + \sum_{j=1}^{\infty}\frac{1}{\theta^j}(w_{t+j} - \mu) = \upsilon_t$$

where $\upsilon_t = -\theta u_{t-1}$. Given a preference for associating present values of the process with past values, the indeterminacy about the value of $\theta$ is avoided by requiring that $|\theta| < 1$, a condition called "invertibility" by Box and Jenkins (Time Series Analysis: Forecasting and Control, 1976).8

Normal likelihood function  Under joint normality $w = (w_1, ..., w_T)' \sim N[\mu\iota, \sigma^2\Omega(\theta)]$ with $\iota$ denoting a $T \times 1$ vector of ones and

$$\Omega(\theta) = \begin{pmatrix} 1+\theta^2 & -\theta & \cdots & 0 \\ -\theta & 1+\theta^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1+\theta^2 \end{pmatrix},$$

7 The roots are $\left(-1 + \sqrt{1-4\rho_1^2}\right)/(2\rho_1)$ and $\left(-1 - \sqrt{1-4\rho_1^2}\right)/(2\rho_1)$ provided $\rho_1 \neq 0$.
8 If $|\rho_1| = 0.5$ there is a unique non-invertible solution to (11) such that $|\theta| = 1$.


the log likelihood function of the time series is given by

$$L(\mu, \theta, \sigma^2) = -\frac{T}{2}\ln\sigma^2 - \frac{1}{2}\ln|\Omega(\theta)| - \frac{1}{2\sigma^2}(w - \mu\iota)'[\Omega(\theta)]^{-1}(w - \mu\iota).$$

There is no closed form expression for the maximum likelihood estimator. The natural estimator of $\mu$ is the sample mean $\bar{w}_T$. A simple consistent estimator $\widehat{\theta}$ is the invertible solution to the equation:

$$\widehat{\rho}_1\theta^2 + \theta + \widehat{\rho}_1 = 0$$

where $\widehat{\rho}_1 = \widehat{\gamma}_1/\widehat{\gamma}_0$. The corresponding estimator of $\sigma^2$ is

$$\widehat{\sigma}^2 = \frac{\widehat{\gamma}_0}{1 + \widehat{\theta}^2}.$$
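A sketch of this moment-based estimator on simulated MA(1) data; the helper name is illustrative, and the root selection follows the invertibility discussion above:

```python
import numpy as np

def ma1_moment_estimates(w):
    """Invertible-root estimator of an MA(1): theta solves
    r1*theta^2 + theta + r1 = 0, then sigma2 = gamma0/(1+theta^2)."""
    w = np.asarray(w, dtype=float)
    wc = w - w.mean()
    g0 = np.mean(wc**2)
    g1 = np.mean(wc[1:] * wc[:-1])
    r1 = g1 / g0
    # The two roots multiply to one; (-1 + disc)/(2 r1) is the one
    # with |theta| < 1. Requires 0 < |r1| < 0.5 (see footnote 8).
    disc = np.sqrt(1 - 4 * r1**2)
    theta = (-1 + disc) / (2 * r1)
    sigma2 = g0 / (1 + theta**2)
    return w.mean(), theta, sigma2

rng = np.random.default_rng(2)
T, mu, theta, sigma = 5000, 1.0, 0.6, 1.0
u = sigma * rng.standard_normal(T + 1)
w = mu + u[1:] - theta * u[:-1]
print(ma1_moment_estimates(w))  # close to (1.0, 0.6, 1.0)
```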

Forecasting with an invertible MA(1) process  The autoregressive representation of the process is

$$w_t = \left(\frac{1-\theta^t}{1-\theta}\right)\mu - \theta w_{t-1} - ... - \theta^{t-1}w_1 - \theta^t u_0 + u_t$$

and a similar expression one period ahead

$$w_{t+1} = \left(\frac{1-\theta^{t+1}}{1-\theta}\right)\mu - \theta w_t - ... - \theta^t w_1 - \theta^{t+1}u_0 + u_{t+1}.$$

Thus, an infeasible forecast based on all past history is:

$$E_t(w_{t+1}) = \left(\frac{1-\theta^{t+1}}{1-\theta}\right)\mu - \theta w_t - ... - \theta^t w_1 - \theta^{t+1}E_t(u_0).$$

An approximate feasible forecast ignores the last term that contains $E_t(u_0)$. Alternatively, we may calculate the best linear predictor of $w_{t+1}$ given $w_1, ..., w_t$ taking into account that $E(w) = \mu\iota$ and $Var(w) = \sigma^2\Omega(\theta)$. For example,

$$E^*(w_{T+1} \mid w_T, ..., w_1) = \delta + w'\varphi$$

where $\varphi = [\Omega(\theta)]^{-1}q(\theta)$, $q(\theta) = (-\theta, 0, ..., 0)'$ and $\delta = \mu(1 - \iota'\varphi)$.

MA(q) process  A generalization of the MA(1) process is to an MA(q) process that specifies linear dependence on the first q shocks:

$$w_t = \mu + u_t - \theta_1 u_{t-1} - ... - \theta_q u_{t-q}.$$

ARMA(p, q) process  A further generalization is a process that combines an autoregressive component and a moving average component. For example, the ARMA(1,1) process takes the form:

$$w_t = \alpha + \rho w_{t-1} + u_t - \theta u_{t-1}.$$

An ARMA process may be able to approximate a linear process to a given accuracy employing fewer parameters than a pure autoregressive or a pure moving average process.


5 Nonstationary processes

5.1 Time trends

In a stationary process $E(w_t) = \mu$. A less restrictive assumption that allows for nonstationarity in mean is to specify the mean as a function of time. For example, a linear trend:

$$E(w_t) = \alpha + \beta t.$$

If $w_t$ represents the logarithm of some variable, $\beta$ is a growth rate, which in this model is assumed to be constant.

We could assume that

$$w_t = \alpha + \beta t + u_t$$

where $u_t$ is a stationary stochastic process. In that case, $E(w_t) = \alpha + \beta t$ but $Var(w_t)$ is constant.

In the same vein, the specification of the mean of the process could incorporate cyclical or seasonal

components.

Regression with trend  In a regression with a linear trend, OLS estimation errors converge to zero at a faster rate than the usual root-T consistency. The reason is that the second moment of a conventional regressor is bounded whereas the second moment of a linear trend is not. To examine this situation let us consider a simple linear trend model with an iid normal error and no intercept:

$$y_t = \beta t + u_t \qquad u_t \sim iid\ N(0, \sigma^2). \quad (12)$$

The OLS estimation error and the OLS mean and variance are given by

$$\widehat{\beta} - \beta = \frac{\sum_{t=1}^{T}t\,u_t}{\sum_{t=1}^{T}t^2}, \qquad E(\widehat{\beta}) = \beta, \qquad Var(\widehat{\beta}) = \frac{\sigma^2}{\sum_{t=1}^{T}t^2} = \frac{\sigma^2}{T(T+1)(2T+1)/6}.$$

This is a classical regression model with $x_t = t$, no intercept, and normal errors, except that in the standard model $\sum_{t=1}^{T}x_t^2 = O(T)$ whereas here $\sum_{t=1}^{T}t^2 = O(T^3)$.

In this case the following exact distributional result holds:

$$\frac{\left(\sum_{t=1}^{T}t^2\right)^{1/2}(\widehat{\beta} - \beta)}{\sigma} \sim N(0,1)$$

and therefore also as $T \to \infty$:

$$\frac{\left(\sum_{t=1}^{T}t^2\right)^{1/2}(\widehat{\beta} - \beta)}{\sigma} \stackrel{d}{\to} N(0,1)$$

or

$$T^{3/2}\left[\frac{1}{6}\left(1 + \frac{1}{T}\right)\left(2 + \frac{1}{T}\right)\right]^{1/2}\frac{(\widehat{\beta} - \beta)}{\sigma} \stackrel{d}{\to} N(0,1)$$

and

$$T^{3/2}\left(\frac{1}{6}\times 1\times 2\right)^{1/2}\frac{(\widehat{\beta} - \beta)}{\sigma} \stackrel{d}{\to} N(0,1),$$

and also

$$T^{3/2}(\widehat{\beta} - \beta) \stackrel{d}{\to} N(0, 3\sigma^2). \quad (13)$$

It can be shown that result (13) still holds if $u_t \sim iid(0, \sigma^2)$ but non-normal.9 Thus, (13) justifies the large-sample approximation

$$\widehat{\beta} \approx N\left(\beta, \frac{3\sigma^2}{T^3}\right).$$

This situation is described as $T^{3/2}$-consistency (in contrast with $T^{1/2}$-consistency) and $\widehat{\beta}$ is said to be "hyper-consistent" or "super-consistent" (although the last term sometimes is reserved for T-consistent estimators).
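A small Monte Carlo along these lines, under assumed values $\beta = 1$ and $\sigma = 1$, confirms that the variance of $T^{3/2}(\widehat{\beta} - \beta)$ is close to $3\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(3)
beta, sigma, T, R = 1.0, 1.0, 200, 5000
t = np.arange(1, T + 1)
draws = np.empty(R)
for r in range(R):
    y = beta * t + sigma * rng.standard_normal(T)
    b_hat = (t @ y) / (t @ t)          # OLS with x_t = t, no intercept
    draws[r] = T**1.5 * (b_hat - beta)
# The variance should be close to 3 * sigma^2 = 3, as in (13)
print(draws.var())
```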

5.2 Random walk

A random walk is a process such that

$$E(w_t \mid w_{t-1}, w_{t-2}, ...) = w_{t-1},$$

so that the best forecast is the previous value and there is no mean reversion.

A random walk with iid shocks is an AR(1) model with $\rho = 1$:

$$w_t = w_{t-1} + u_t \qquad u_t \sim iid(0, \sigma^2) \quad (14)$$

or equivalently

$$w_t = u_t + u_{t-1} + ... + u_1 + w_0,$$

and the first-difference $\Delta w_t = (w_t - w_{t-1})$ is an independent white noise process.

The random walk is a nonstationary process. Letting $\omega_0 = Var(w_0)$, we have

$$Var(w_t) = \omega_0 + t\sigma^2$$
$$Cov(w_t, w_{t-j}) = \omega_0 + (t-j)\sigma^2 \ \text{for } j \geq 1.$$

9 See, for example, T.W. Anderson, The Statistical Analysis of Time Series, 1971, Theorem 2.6.1, p. 23.


Thus, the variance tends to infinity as t increases and the autocorrelation function decays slowly as j increases.

More generally, we could consider processes such as (14) in which $u_t$ is a stationary process, but not necessarily a white noise process. A time series such that its first difference is a stationary process is called a first-order integrated process or an I(1) process:

$$w_t \sim I(1).$$

If $u_t$ is an ARMA(p, q) process then $w_t$ is called an autoregressive integrated moving average or ARIMA(p, 1, q) process. The argument can be generalized to second and higher-order integrated processes. For example, an I(2) process is such that it is stationary after taking differences twice.

Random walk with drift  This is a process of the form

$$w_t = w_{t-1} + \delta + u_t \qquad u_t \sim iid(0, \sigma^2).$$

In this case:

$$w_t = \delta t + (u_t + u_{t-1} + ... + u_1 + w_0).$$

We have a linear trend, but contrary to (12) the stochastic component is I(1) instead of I(0).

Distinguishing between unit root and stationary processes  There is a literature concerned with large sample methods to test the null hypothesis of a unit root in $w_t$ against the alternative hypothesis of stationarity (e.g. the Dickey-Fuller test and its variants). However, a realization from an I(1) process may be difficult to distinguish from an I(0) process. Compare, for example, the following integrated moving average process

$$w_t - w_{t-1} = u_t - 0.99u_{t-1}$$

with the white noise process

$$w_t = u_t;$$

or the random walk process

$$w_t - w_{t-1} = u_t$$

with the stationary autoregressive process

$$w_t - 0.99w_{t-1} = u_t.$$

The differences between those I(1) and I(0) processes are in their long run behavior. Sometimes the choice between an I(1) model and an I(0) model is made on the basis of the long-run properties that are judged a priori to make sense for a time series to have.
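The difficulty can be seen by simulation. The sketch below (illustrative seed and span) feeds the same shocks to a random walk and to a stationary AR(1) with $\rho = 0.99$; over a moderate sample the two paths look very similar, and both first-order sample autocorrelations are near one.

```python
import numpy as np

rng = np.random.default_rng(4)
T = 250
u = rng.standard_normal(T)

rw = np.cumsum(u)            # random walk: w_t = w_{t-1} + u_t
ar = np.zeros(T)             # stationary AR(1) with rho = 0.99
for t in range(1, T):
    ar[t] = 0.99 * ar[t - 1] + u[t]

# First-order sample autocorrelations of both series
for w in (rw, ar):
    wc = w - w.mean()
    print(np.sum(wc[1:] * wc[:-1]) / np.sum(wc**2))
```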


Regression with time series
Class Notes

Manuel Arellano

February 22, 2018

1 Classical regression model with time series

Model and assumptions  The basic assumption is

$$E(y_t \mid x_1, ..., x_T) = E(y_t \mid x_t) = x_t'\beta.$$

The first equality is always satisfied with iid observations whereas the second imposes linearity in the relationship. With dependent observations the first equality imposes restrictions on the stochastic process of $(y_t, x_t)$. In later sections we will study the nature of these restrictions and consider generalizations that are suitable for time series.

The previous assumption can be written in the form of an equation as follows:

$$y_t = x_t'\beta + u_t \qquad E(u_t \mid X) = 0$$

where $X = (x_1, ..., x_T)'$ and $y = (y_1, ..., y_T)'$. This is appropriate for a linear relationship between x and y and unobservables that are mean independent of past and future values of x.

The second assumption in the classical regression model is $Var(y \mid X) = \sigma^2 I_T$, which amounts to

$$E(u_t^2 \mid X) = \sigma^2 \ \text{for all } t$$
$$E(u_t u_{t-j} \mid X) = 0 \ \text{for all } t, j.$$

The case where $E(u_t u_{t-j} \mid X) \neq 0$ is called autocorrelation. An example is a Cobb-Douglas production function in which u represents multiplicative measurement error in the output level, independent of inputs at all periods. The error u is autocorrelated possibly as a result of temporal aggregation in the data.

In the context of the classical regression model with time series it is convenient to distinguish between conditional heteroskedasticity and unconditional heteroskedasticity. If $E(u_t^2 \mid X)$ does not depend on X there is no conditional heteroskedasticity and

$$E(u_t^2 \mid X) = E(u_t^2).$$

However, if $u_t$ is not stationary in variance then

$$E(u_t^2) = \sigma_t^2$$

where $\sigma_t^2$ is a function of t. In this situation we speak of unconditional heteroskedasticity.


On the other hand if $u_t$ is stationary then $E(u_t^2) = \sigma^2$, which is compatible with the possibility of conditional heteroskedasticity. For example,

$$E(u_t^2 \mid X) = \delta_0 + \delta_1 x_t^2$$
$$E(u_t^2) = \delta_0 + \delta_1 E(x_t^2) = \sigma^2.$$

In this case there will be unconditional homoskedasticity if $E(x_t^2)$ is constant for all t.

The same reasoning can be applied to conditional and unconditional autocovariances. If $u_t$ is stationary then

$$E(u_t u_{t-j}) = \gamma_j \ \text{for all } t.$$

However, it is possible that conditional autocovariances

$$E(u_t u_{t-j} \mid X) = \gamma_j(X)$$

depend on X, in which case we would have both conditional heteroskedasticity and autocorrelation.

OLS with dependent observations: robust inference  Consider the usual representation for the scaled estimation error:

$$\sqrt{T}(\widehat{\beta} - \beta) = \left(\frac{1}{T}\sum_{t=1}^{T}x_tx_t'\right)^{-1}\frac{1}{\sqrt{T}}\sum_{t=1}^{T}x_tu_t.$$

Letting $w_t = x_tu_t = x_t(y_t - x_t'\beta)$, we have that $E(w_t) = 0$. Moreover, if $(y_t, x_t)$ is stationary, so is $w_t$. Under some conditions, $\bar{w}_T \stackrel{p}{\to} 0$, $T^{-1}\sum_{t=1}^{T}x_tx_t' \stackrel{p}{\to} E(x_tx_t') > 0$, and $\sqrt{T}\bar{w}_T \stackrel{d}{\to} N(0, V)$ with $V = \sum_{j=-\infty}^{\infty}\Gamma_j$ and $\Gamma_j = E(w_tw_{t-j}') = E(u_tu_{t-j}x_tx_{t-j}')$. It then follows that $\widehat{\beta}$ is consistent and asymptotically normal:

$$\sqrt{T}(\widehat{\beta} - \beta) \stackrel{d}{\to} N(0, W)$$

where

$$W = [E(x_tx_t')]^{-1}V[E(x_tx_t')]^{-1}.$$

Recall that if observations are iid $V = E(u_t^2x_tx_t')$. If in addition there is absence of conditional heteroskedasticity $V = \sigma^2E(x_tx_t')$. On the other hand if $\gamma_j(X) = \gamma_j$ for all j:

$$V = \sum_{j=-\infty}^{\infty}E(u_tu_{t-j})E(x_tx_{t-j}').$$

The Newey-West estimate of V is

$$\widehat{V} = \widehat{\Gamma}_0 + \sum_{j=1}^{m}\left(1 - \frac{j}{m+1}\right)\left(\widehat{\Gamma}_j + \widehat{\Gamma}_j'\right)$$


where

$$\widehat{\Gamma}_j = \frac{1}{T-j}\sum_{t=j+1}^{T}\widehat{u}_t\widehat{u}_{t-j}x_tx_{t-j}'$$

and $\widehat{u}_t$ are OLS residuals. The corresponding estimate of W is

$$\widehat{W} = \left(\frac{1}{T}\sum_{t=1}^{T}x_tx_t'\right)^{-1}\widehat{V}\left(\frac{1}{T}\sum_{t=1}^{T}x_tx_t'\right)^{-1},$$

which boils down to the heteroskedasticity-consistent White standard error formula if m = 0 so that $\widehat{V} = \widehat{\Gamma}_0$.
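A compact sketch of OLS with these Newey-West (HAC) standard errors; the bandwidth and the simulated design (AR(1) errors, strictly exogenous regressor) are illustrative assumptions.

```python
import numpy as np

def ols_hac(y, X, m):
    """OLS coefficients with Newey-West standard errors (bandwidth m)."""
    T, k = X.shape
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    u = y - X @ beta
    s = X * u[:, None]                      # scores w_t = x_t u_t
    V = s.T @ s / T                         # Gamma_0
    for j in range(1, m + 1):
        Gj = s[j:].T @ s[:-j] / (T - j)
        V += (1 - j / (m + 1)) * (Gj + Gj.T)
    Sxx_inv = np.linalg.inv(X.T @ X / T)
    W = Sxx_inv @ V @ Sxx_inv
    return beta, np.sqrt(np.diag(W) / T)

rng = np.random.default_rng(5)
T = 500
x = rng.standard_normal(T)
u = np.zeros(T)
for t in range(1, T):
    u[t] = 0.6 * u[t - 1] + rng.standard_normal()
y = 1.0 + 2.0 * x + u
X = np.column_stack([np.ones(T), x])
print(ols_hac(y, X, m=8))
```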

If there is autocorrelation but no heteroskedasticity an alternative consistent estimator of V is

$$\widehat{V} = \widehat{\Gamma}_0 + \sum_{j=1}^{m}\left(\widehat{\Gamma}_j + \widehat{\Gamma}_j'\right) \quad \text{where} \quad \widehat{\Gamma}_j = \widehat{\gamma}_j\frac{1}{T-j}\sum_{t=j+1}^{T}x_tx_{t-j}'$$

and $\widehat{\gamma}_j = (T-j)^{-1}\sum_{t=j+1}^{T}\widehat{u}_t\widehat{u}_{t-j}$.

The previous limiting results do not depend on the validity of the assumptions of the classical linear regression model. They are valid for inference about regression coefficients that are regarded as estimates of linear projections from stationary and ergodic stochastic processes (or some alternative mixing conditions under which the results hold).

For example, the results can be used if $x_t$ is $y_{t-1}$ or contains lags of y among other variables, regardless of whether the model is autoregressive or not.

This is not to say that OLS will necessarily be a consistent estimator of a dynamic model with autocorrelation; in fact, in general it will not be. For example, take an ARMA(1,1) model:

$$y_t = \alpha + \rho y_{t-1} + u_t$$
$$u_t = v_t - \theta v_{t-1}.$$

The linear projection of $y_t$ on $y_{t-1}$ will not coincide with $\alpha + \rho y_{t-1}$ because in the ARMA(1,1) model $Cov(y_{t-1}, u_t) \neq 0$. The previous results for regressions would allow us to make inference about the linear projection coefficients, not about $(\alpha, \rho)$ in the ARMA(1,1) model.

Generalized least squares  Under the assumption $E(u_t \mid X) = 0$, OLS is not only consistent but also unbiased. Letting $u = (u_1, ..., u_T)'$, its variance given X in finite samples is given by

$$Var(\widehat{\beta} \mid X) = (X'X)^{-1}X'E(uu' \mid X)X(X'X)^{-1}.$$


If there is autocorrelation there is a changing dependence among data points. This suggests considering weighted least squares estimates in which different pairs of observations receive different weights. That is, estimators of the form

$$\widetilde{\beta} = (X'H'HX)^{-1}X'H'Hy$$

or equivalently OLS in the transformed regression

$$Hy = HX\beta + Hu$$

where H is a $T \times T$ matrix of weights.

We have previously dealt with the case in which H is diagonal. Here since the estimator only depends on $H'H$ we can limit ourselves to consider weight matrices that are lower triangular:

$$H = \begin{pmatrix} h_{11} & 0 & \cdots & 0 \\ h_{21} & h_{22} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ h_{T1} & h_{T2} & \cdots & h_{TT} \end{pmatrix},$$

so that

$$\widetilde{\beta} - \beta = \left(\sum_{t=1}^{T}x_t^*x_t^{*\prime}\right)^{-1}\sum_{t=1}^{T}x_t^*u_t^*$$

where $u_t^* = h_{t1}u_1 + h_{t2}u_2 + ... + h_{tt}u_t$ and similarly for $x_t^*$. The elements of H may be constant or functions of X: $H = H(X)$.

Under the strict exogeneity assumption $E(u \mid X) = 0$, $\widetilde{\beta}$ is unbiased (and consistent) for any H. In general, $\widetilde{\beta}$ will not be consistent for $\beta$ in a best linear predictor or in a conditional expectation model in the absence of the strict exogeneity assumption. We can obtain an asymptotic normality result for $\widetilde{\beta}$ similar to the one for OLS simply replacing $(y_t, x_t)$ with $(y_t^*, x_t^*)$.

Under some regularity conditions, as long as $Var(u \mid X) = \Omega(X) = (H'H)^{-1} = H^{-1}H'^{-1}$, we get the optimal GLS estimator, which satisfies:

$$\sqrt{T}(\widetilde{\beta} - \beta) \stackrel{d}{\to} N\left(0, \left[\text{plim}_{T\to\infty}\frac{1}{T}X'\Omega(X)^{-1}X\right]^{-1}\right).$$

Thus, a triangular factorization of the inverse covariance matrix of y given X is an efficient choice of weight matrix. Intuitively, this transformation produces errors such that their autocovariance matrix is an identity. Once again we obtain a generalized least squares statistic similar to the one we encountered before:

$$\widetilde{\beta} = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y.$$


Feasible GLS with AR(1) errors  A popular GLS parametric transformation arises when $u_t$ is a first-order autoregressive process. The details are as follows. Assuming that

$$u_t = \rho u_{t-1} + \varepsilon_t \qquad \varepsilon_t \sim iid(0, \sigma^2),$$

we have $\Omega = \sigma^2 V$ with

$$V = \frac{1}{1-\rho^2}\begin{pmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{T-1} \\ \rho & 1 & \rho & \cdots & \rho^{T-2} \\ \vdots & & & \ddots & \vdots \\ \rho^{T-1} & \rho^{T-2} & \rho^{T-3} & \cdots & 1 \end{pmatrix}.$$

It can be shown by direct multiplication that $V^{-1} = H'H$ with

$$H = \begin{pmatrix} \sqrt{1-\rho^2} & 0 & 0 & \cdots & 0 & 0 \\ -\rho & 1 & 0 & \cdots & 0 & 0 \\ 0 & -\rho & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & & \vdots \\ 0 & 0 & 0 & \cdots & -\rho & 1 \end{pmatrix}.$$

Thus, in this case

$$u^* = Hu = \begin{pmatrix} \sqrt{1-\rho^2}\,u_1 \\ u_2 - \rho u_1 \\ \vdots \\ u_T - \rho u_{T-1} \end{pmatrix} = \begin{pmatrix} \sqrt{1-\rho^2}\,u_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_T \end{pmatrix}.$$

If the first observation is omitted, GLS is equivalent to OLS in the transformed equation:

$$y_t - \rho y_{t-1} = (x_t - \rho x_{t-1})'\beta + \varepsilon_t,$$

which is equivalent to MLE given the first observation.

Letting $\widehat{\rho}$ be the autoregressive coefficient estimated from OLS residuals,1 the Cochrane-Orcutt procedure consists in constructing the pseudo differences $y_t - \widehat{\rho}y_{t-1}$ and $x_t - \widehat{\rho}x_{t-1}$ and estimating the transformed model by OLS.
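A minimal sketch of the two-step Cochrane-Orcutt procedure under these assumptions (simulated data; in practice the two steps are often iterated):

```python
import numpy as np

def cochrane_orcutt(y, X):
    """Two-step Cochrane-Orcutt: OLS -> rho from residuals -> OLS on
    pseudo differences (first observation dropped)."""
    b_ols = np.linalg.solve(X.T @ X, X.T @ y)
    u = y - X @ b_ols
    rho = (u[1:] @ u[:-1]) / (u[:-1] @ u[:-1])   # as in footnote 1
    y_star = y[1:] - rho * y[:-1]
    X_star = X[1:] - rho * X[:-1]
    b_gls = np.linalg.solve(X_star.T @ X_star, X_star.T @ y_star)
    return b_gls, rho

rng = np.random.default_rng(6)
T = 500
x = rng.standard_normal(T)
u = np.zeros(T)
for t in range(1, T):
    u[t] = 0.7 * u[t - 1] + rng.standard_normal()
y = 1.0 + 2.0 * x + u
X = np.column_stack([np.ones(T), x])
print(cochrane_orcutt(y, X))  # coefficients near (1.0, 2.0), rho near 0.7
```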

Alternatively, the full log-likelihood can be maximized with respect to all parameters:

$$L(\beta, \sigma^2, \rho) = -\frac{T}{2}\ln(2\pi) - \frac{1}{2}\ln|\Omega| - \frac{1}{2}(y - X\beta)'\Omega^{-1}(y - X\beta).$$

1 Specifically, $\widehat{\rho} = \sum_{t=2}^{T}\widehat{u}_t\widehat{u}_{t-1}/\sum_{t=2}^{T}\widehat{u}_{t-1}^2$ where $\widehat{u}_t = y_t - x_t'\widehat{\beta}$ and $\widehat{\beta} = (X'X)^{-1}X'y$.


2 Distributed lags and partial adjustment

2.1 Distributed lags

A generalization of the classical regression model that maintains the strict exogeneity assumption but allows for dynamic responses of y to changes in x is:

$$E(y_t \mid x_1, ..., x_T) = E(y_t \mid x_t, x_{t-1}, ..., x_{t-p}) = \delta + \beta_0x_t + \beta_1x_{t-1} + ... + \beta_px_{t-p}.$$

This is called a distributed lag model. We use a scalar regressor for simplicity. Formally, it is the same as a model with p + 2 exogenous regressors $(1, x_t, ..., x_{t-p})$.

The coefficient $\beta_0$ is the short run multiplier whereas the long run multiplier is given by

$$\gamma = \beta_0 + \beta_1 + ... + \beta_p.$$

Letting $x_1 = 1$ and $x_j = 0$ for $j \neq 1$, the distribution of lag responses is obtained from:

$$E_x(y_0) = \delta$$
$$E_x(y_1) = \delta + \beta_0$$
$$E_x(y_2) = \delta + \beta_1$$
$$\vdots$$
$$E_x(y_{p+1}) = \delta + \beta_p$$
$$E_x(y_{p+2}) = \delta.$$

The mean lag is given by $\left(\sum_{j=0}^{p}j\beta_j\right)/\sum_{j=0}^{p}\beta_j$ whereas the median lag is the lag at which 50 percent of the total effect has taken place.

If one is interested in long run effects it may be convenient to reformulate the equation. For example, if p = 2 we can go from

$$y_t = \delta + \beta_0x_t + \beta_1x_{t-1} + \beta_2x_{t-2} + u_t$$

to the following reparameterization:

$$y_t = \delta + \gamma x_t + \beta_1(x_{t-1} - x_t) + \beta_2(x_{t-2} - x_t) + u_t$$

where $\gamma = \beta_0 + \beta_1 + \beta_2$, or also

$$y_t = \delta + \gamma z_t + (\beta_1 - \beta_0)(x_{t-1} - z_t) + (\beta_2 - \beta_0)(x_{t-2} - z_t) + u_t$$

where $z_t = (x_t + x_{t-1} + x_{t-2})/3$.

Sometimes p is estimated as part of a model selection procedure. It is also common to model the lag structure, that is, to specify $(\beta_0, \beta_1, ..., \beta_p)$ as functions of a smaller set of parameters (e.g. the polynomial model introduced by Shirley Almon in 1965), especially when the $x_{t-j}$ are highly collinear and the information about individual $\beta_j$ is small.


Koyck distributed lags  A long-standing popular model specifies a geometric lag structure:

$$\beta_j = \beta_0\alpha^j \quad (j = 1, ..., p).$$

If we let $p \to \infty$ we have

$$E\left(y_t \mid \{x_s\}_{s=-\infty}^{T}\right) = \delta + \beta_0x_t + \alpha\beta_0x_{t-1} + \alpha^2\beta_0x_{t-2} + ... \quad (1)$$

or in equation form

$$y_t = \delta + \beta_0\sum_{j=0}^{\infty}\alpha^jx_{t-j} + u_t \qquad E(u_t \mid ..., x_{-1}, x_0, x_1, ..., x_T) = 0.$$

The long run effect in this case is $\gamma = \beta_0/(1-\alpha)$ if $|\alpha| < 1$.

Subtracting from (1) the lagged equation multiplied by $\alpha$:

$$E\left(y_t - \alpha y_{t-1} \mid \{x_s\}_{s=-\infty}^{T}\right) = \delta^* + \beta_0x_t$$

where $\delta^* = (1-\alpha)\delta$. Similarly,

$$y_t = \delta^* + \alpha y_{t-1} + \beta_0x_t + \varepsilon_t \quad (2)$$

where $\varepsilon_t = u_t - \alpha u_{t-1}$ so that $E(\varepsilon_t \mid ..., x_{-1}, x_0, x_1, ..., x_T) = 0$.

In general, $(\delta^*, \alpha, \beta_0)$ are not the coefficients of a population linear regression of $y_t$ on $(1, y_{t-1}, x_t)$ because in equation (2) $y_{t-1}$ is correlated with $\varepsilon_t$ through $u_{t-1}$. A similar situation arose in the case of ARMA(1,1) models.

For a given value of p, nonlinear least squares estimates of $(\delta, \alpha, \beta_0)$ solve:

$$\min \sum_{t=p+1}^{T}\left(y_t - \delta - \beta_0\sum_{j=0}^{p}\alpha^jx_{t-j}\right)^2.$$
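A sketch of this nonlinear least squares problem using a general-purpose numerical optimizer; the truncation p, starting values, and simulated design are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
T, p = 400, 50
delta, alpha, beta0 = 1.0, 0.6, 2.0
x = rng.standard_normal(T + p)
geo = beta0 * alpha**np.arange(p + 1)      # beta_j = beta0 * alpha^j
y = delta + np.array([geo @ x[t - p:t + 1][::-1] for t in range(p, T + p)])
y += rng.standard_normal(T)
x = x[p:]                                  # align x with y

def ssr(theta, y, x, p):
    d, a, b0 = theta
    # Column j holds x_{t-j}; rows before t = p are discarded
    lags = np.column_stack([np.roll(x, j) for j in range(p + 1)])
    fit = d + lags[p:] @ (b0 * a**np.arange(p + 1))
    return np.sum((y[p:] - fit)**2)

res = minimize(ssr, x0=[0.0, 0.5, 1.0], args=(y, x, p))
print(res.x)  # close to (1.0, 0.6, 2.0)
```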

2.2 Partial adjustment

An equation that includes lags of $y_t$ together with other explanatory variables and an error term (with properties to be discussed below) is often called a partial adjustment model. A simple version is:

$$y_t = \delta + \alpha y_{t-1} + \beta_0x_t + u_t. \quad (3)$$

The name comes from a hypothesis of gradual adjustment to an optimal target value $y_t^*$ when adjustment is costly:

$$y_t - y_{t-1} = \gamma(y_t^* - y_{t-1})$$

or

$$y_t = (1-\gamma)y_{t-1} + \gamma y_t^*. \quad (4)$$


Equation (4) gives rise to (3) if

$$y_t^* = \delta^* + \beta^*x_t + u_t^*$$

with $\delta = \gamma\delta^*$, $\beta_0 = \gamma\beta^*$, $u_t = \gamma u_t^*$, and $\alpha = (1-\gamma)$. The last coefficient captures the speed of adjustment, which is typically a quantity of interest (e.g. in models of investment, labor demand or consumption with habits).

The empirical content of a relationship like (3) depends on its statistical interpretation. One possibility is to regard $(\delta + \alpha y_{t-1} + \beta_0x_t)$ as the expectation of $y_t$ given $(y_1, ..., y_{t-1}, x_1, ..., x_T)$ so that

$$E(u_t \mid y_1, ..., y_{t-1}, x_1, ..., x_T) = 0.$$

This implies that $E(u_t \mid u_1, ..., u_{t-1}) = 0$ and therefore lack of serial correlation in $u_t$: $E(u_tu_{t-j}) = 0$ for $j > 0$. This interpretation is incompatible with the geometric distributed lag model and more generally with any model in which it is intended to allow for both dynamics and serial correlation.

Partial adjustment vs serial correlation  In a static model with serial correlation:

$$y_t = \delta + \beta x_t + u_t$$

the response of $y_t$ to a change in $x_t$ is static. There is just persistence in the error term $u_t$ (in the same way that there may be persistence in $x_t$). Examples include production functions and wage equations.

In a partial adjustment model

$$y_t = \delta + \alpha y_{t-1} + \beta_0x_t + u_t$$

the effect of x on y is dynamic (as seen in the discussion of geometric distributed lags).

In a static model with strictly exogenous x, serial correlation in u does not alter the fact that

$$\beta = \frac{Cov(x_t, y_t)}{Var(x_t)}.$$

In contrast, under the assumptions of the static model with autocorrelation, the linear projection of $y_t$ on $z_t = (y_{t-1}, x_t)'$ does not provide consistent estimates of the static model parameters:

$$\begin{pmatrix} \psi_1 \\ \psi_2 \end{pmatrix} = [Var(z_t)]^{-1}Cov(z_t, y_t) = \begin{pmatrix} Var(y_{t-1}) & Cov(y_{t-1}, x_t) \\ Cov(y_{t-1}, x_t) & Var(x_t) \end{pmatrix}^{-1}\begin{pmatrix} Cov(y_{t-1}, y_t) \\ Cov(x_t, y_t) \end{pmatrix}$$

$$= \begin{pmatrix} \beta^2Var(x_t) + \sigma_u^2 & \beta Cov(x_{t-1}, x_t) \\ \beta Cov(x_{t-1}, x_t) & Var(x_t) \end{pmatrix}^{-1}\begin{pmatrix} \beta^2Cov(x_t, x_{t-1}) + Cov(u_t, u_{t-1}) \\ \beta Var(x_t) \end{pmatrix}.$$

In the special case in which $Cov(x_{t-1}, x_t) = 0$ we have

$$\begin{pmatrix} \psi_1 \\ \psi_2 \end{pmatrix} = \begin{pmatrix} \dfrac{Cov(u_t, u_{t-1})}{\beta^2Var(x_t) + \sigma_u^2} \\ \beta \end{pmatrix}.$$

Thus, $\psi_2 = \beta$ but $\psi_1 \neq 0$ unless $Cov(u_t, u_{t-1}) = 0$.


Common factor restrictions  Is it possible to distinguish empirically between partial adjustment and serial correlation? A static model with AR(1) errors:

$$y_t = \delta + \beta x_t + u_t$$
$$u_t = \rho u_{t-1} + \varepsilon_t$$

upon substitution can be written in the form

$$(y_t - \delta - \beta x_t) = \rho(y_{t-1} - \delta - \beta x_{t-1}) + \varepsilon_t$$

or

$$y_t = (1-\rho)\delta + \beta x_t - \rho\beta x_{t-1} + \rho y_{t-1} + \varepsilon_t.$$

This equation can be regarded as a special case of a partial adjustment model without serial correlation:

$$y_t = \pi_0 + \psi_0x_t + \psi_1x_{t-1} + \alpha y_{t-1} + \varepsilon_t$$

subject to the restriction

$$\psi_1 = -\alpha\psi_0. \quad (5)$$

This type of constraint and its generalizations are Sargan's Comfac or common factor restrictions.2 They can be easily tested using a Wald statistic because the estimation under the alternative hypothesis can be done by OLS.
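A sketch of such a Wald test via the delta method, assuming homoskedastic errors for the covariance matrix (a robust covariance could be substituted); the simulated design satisfies the null, so the statistic should behave like a $\chi^2(1)$ draw.

```python
import numpy as np

def comfac_wald(y, x):
    """Wald test of psi1 + alpha*psi0 = 0 in the unrestricted OLS
    regression y_t = pi0 + psi0 x_t + psi1 x_{t-1} + alpha y_{t-1} + e_t."""
    Y = y[1:]
    Z = np.column_stack([np.ones(len(Y)), x[1:], x[:-1], y[:-1]])
    theta = np.linalg.solve(Z.T @ Z, Z.T @ Y)   # (pi0, psi0, psi1, alpha)
    e = Y - Z @ theta
    s2 = e @ e / (len(Y) - Z.shape[1])
    V = s2 * np.linalg.inv(Z.T @ Z)             # classical OLS covariance
    pi0, psi0, psi1, alpha = theta
    h = psi1 + alpha * psi0                     # restriction h(theta) = 0
    grad = np.array([0.0, alpha, 1.0, psi0])    # dh/dtheta (delta method)
    return h**2 / (grad @ V @ grad)             # ~ chi2(1) under H0

rng = np.random.default_rng(8)
T = 1000
x = rng.standard_normal(T)
u = np.zeros(T)
for t in range(1, T):
    u[t] = 0.5 * u[t - 1] + rng.standard_normal()
y = 1.0 + 2.0 * x + u                           # static model with AR(1) errors
print(comfac_wald(y, x))
```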

In fact, a Comfac restriction is always satisfied under the following null hypothesis:

$$H_0: y_t = \delta + \beta x_t + u_t \qquad E(u_t \mid x_1, ..., x_T) = 0.$$

To see this, let the linear projection of $y_t$ on $(1, x_t, x_{t-1}, y_{t-1})$ be

$$y_t = \pi_0 + \psi_0x_t + \psi_1x_{t-1} + \alpha y_{t-1} + \varepsilon_t \quad (6)$$

where $E^*(\varepsilon_t \mid x_t, x_{t-1}, y_{t-1}) = 0$. Due to the law of iterated projections

$$E^*(y_t \mid x_t, x_{t-1}) = \pi_0 + \psi_0x_t + \psi_1x_{t-1} + \alpha E^*(y_{t-1} \mid x_t, x_{t-1})$$

or

$$(\delta + \beta x_t) = \pi_0 + \psi_0x_t + \psi_1x_{t-1} + \alpha(\delta + \beta x_{t-1}).$$

Matching coefficients:

$$\pi_0 + \alpha\delta = \delta \implies \pi_0 = (1-\alpha)\delta$$
$$\psi_0 = \beta$$
$$\psi_1 + \alpha\beta = 0 \implies \text{Comfac restriction.}$$

2 Sargan, J. D. (1980): "Some Tests of Dynamic Specification for a Single Equation," Econometrica, 48, 879-897.


Thus, a test of the hypothesis H0: ψ1 + αβ = 0 in the linear projection (6) has power against a

static model with serial correlation and exogenous regressors even if the form of the autocorrelation

is not AR(1).

Partial adjustment with autocorrelation  A Comfac test has power to reject a static model with autocorrelation. However, identification of a dynamic model may be problematic if the model combines both partial adjustment and serial correlation. For example, OLS is not consistent in the partial adjustment model

$$y_t = \delta + \alpha y_{t-1} + \beta_0x_t + u_t$$

if $u_t$ is serially correlated. Nevertheless, if $u_t \sim$ AR(1): $u_t = \rho u_{t-1} + \varepsilon_t$ then

$$(y_t - \delta - \alpha y_{t-1} - \beta_0x_t) = \rho(y_{t-1} - \delta - \alpha y_{t-2} - \beta_0x_{t-1}) + \varepsilon_t$$

or

$$y_t = (1-\rho)\delta + (\alpha+\rho)y_{t-1} - \alpha\rho y_{t-2} + \beta_0x_t - \rho\beta_0x_{t-1} + \varepsilon_t,$$

giving rise to a new set of Comfac restrictions, which can be tested and enforced in estimation.

This type of model is an example of a stochastic relationship between variables in which the

regressors are not independent of the errors. The estimation problem for models of this type will be

considered in greater generality in the context of instrumental variable estimation.

3 Predetermined variables

In the classical regression model

$$y_t = x_t'\beta + u_t$$

the variable $x_t$ is strictly exogenous in the sense that

$$E(x_tu_s) = 0 \ \text{for all } t, s.$$

We say that a variable is predetermined if

$$E(x_tu_t) = 0, \quad E(x_tu_{t+1}) = 0, \quad E(x_tu_{t+2}) = 0, ...$$

but we do not exclude the possibility that

$$E(x_tu_{t-1}) \neq 0, \quad E(x_tu_{t-2}) \neq 0, ...$$

An example of a predetermined variable is $y_{t-1}$ in the AR(1) model. However, $y_{t-1}$ is not predetermined in the geometric distributed lag model. Another example of a predetermined variable arises in a relation between female labor force participation and children. Yet another example is in testing for market efficiency in foreign exchange markets (Hansen and Hodrick, 1980).3

An alternative weaker condition for predetermined variables is to simply require $E(u_t \mid x_t) = 0$ or

$$E(x_tu_t) = 0,$$

which is the assumption that implies consistency of OLS under standard regularity conditions.

Dynamic regression with sequential conditioning  Here we consider partial adjustment models for conditional means of the form

$$E(y_t \mid y_1, ..., y_{t-1}, x_1, ..., x_t).$$

In a linear specification with first order lags we have

$$y_t = \delta + \alpha y_{t-1} + \beta_0x_t + \beta_1x_{t-1} + u_t$$
$$E(u_t \mid y_1, ..., y_{t-1}, x_1, ..., x_t) = 0.$$

In this type of model $u_t$ is serially uncorrelated by construction. The regressor $x_t$ is predetermined in the sense of being correlated with past values of u or y (feedback).

If x is strictly exogenous then

$$E(u_t \mid y_1, ..., y_{t-1}, x_1, ..., x_t) = E(u_t \mid y_1, ..., y_{t-1}, x_1, ..., x_T),$$

so that a test of strict exogeneity in this context is a test of significance of future values of x in the regression (a Sims-type test).4

Conditional means of the form $E(y_t \mid y_1, ..., y_{t-1}, x_1, ..., x_t)$ are natural predictors in the time series context, but they may or may not correspond to quantities of economic interest in an application.

Estimation In regression models with sequential conditioning OLS is consistent but not unbi-

ased. In small samples the bias can be a problem. In an AR(1) model with a positive autoregressive

parameter the bias is negative.5

When x is a predetermined variable, regressions in linear transformations of the data such as GLS

are not justified in general and may lead to inconsistent estimates even if OLS is consistent.

3 Hansen, L. P. and R. J. Hodrick (1980): "Forward Exchange Rates as Optimal Predictors of Future Spot Rates: An Econometric Analysis," Journal of Political Economy, 88, 829-853.
4 Sims, C. A. (1972): "Money, Income, and Causality," American Economic Review, 62, 540-552.
5 Hurwicz, L. (1950): "Least Squares Bias in Time Series," in Koopmans, T. C. (ed.), Statistical Inference in Dynamic Economic Models, Cowles Commission Monograph No. 10, John Wiley, New York.


4 Granger causality

Let the observed time series be $y^T = (y_1, ..., y_T)$ and $x^T = (x_1, ..., x_T)$. The joint distribution of the data admits the following factorizations:6

$$f(y^T, x^T) = f(y^T \mid x^T)f(x^T) = \prod_{t=1}^{T}f(y_t \mid y^{t-1}, x^T)\prod_{t=1}^{T}f(x_t \mid x^{t-1}). \quad (7)$$

Also,7

$$f(y^T, x^T) = \prod_{t=1}^{T}f(y_t, x_t \mid y^{t-1}, x^{t-1}) = \prod_{t=1}^{T}f(y_t \mid y^{t-1}, x^t)\prod_{t=1}^{T}f(x_t \mid y^{t-1}, x^{t-1}). \quad (8)$$

• In an autoregressive univariate time series analysis one models the mean or other characteristics of the distribution $f(x_t \mid x^{t-1})$.

• In a VAR multivariate time series analysis one models the means of the joint distribution $f(y_t, x_t \mid y^{t-1}, x^{t-1})$.

• In a dynamic regression with sequential conditioning one models the mean of $f(y_t \mid y^{t-1}, x^t)$.

• In a classical regression model one models the means of $f(y^T \mid x^T)$ (in a static model assuming that $E(y_t \mid x^T) = E(y_t \mid x_t)$).

All these are different aspects of the joint distribution of the data we may be interested to study.

Granger non-causality  We say that y does not Granger cause x if8

$$E^*(x_t \mid x^{t-1}, y^{t-1}) = E^*(x_t \mid x^{t-1}), \quad (9)$$

or, using a definition based on distributions, if

$$f(x_t \mid x^{t-1}, y^{t-1}) = f(x_t \mid x^{t-1}). \quad (10)$$

It can be shown that (9) is equivalent to the Sims strict exogeneity condition:

$$E^*(y_t \mid x^T) = E^*(y_t \mid x^t) \quad (11)$$

and also that (10) is equivalent to the Chamberlain-Sims distributional strict exogeneity condition:9

$$f(y_t \mid y^{t-1}, x^T) = f(y_t \mid y^{t-1}, x^t). \quad (12)$$

6 Arellano, M. (1992): "On Exogeneity and Identifiability," Investigaciones Económicas, 16, 401-409.
7 With some abuse of notation $f(x_1 \mid x^0)$ denotes $f(x_1)$ and $f(y_1 \mid y^0, x^T)$ denotes $f(y_1 \mid x^T)$. Similarly, $f(y_1, x_1 \mid y^0, x^0)$ denotes $f(y_1, x_1)$, $f(y_1 \mid y^0, x^1)$ denotes $f(y_1 \mid x_1)$ and $f(x_1 \mid y^0, x^0)$ denotes $f(x_1)$.
8 Granger, C. W. J. (1969): "Investigating Causal Relations by Econometric Models and Cross-Spectral Methods," Econometrica, 37, 424-438.
9 Chamberlain, G. (1982): "The General Equivalence of Granger and Sims Causality," Econometrica, 50, 569-581.


Note that if (10) holds, the second components in the factorizations (7) and (8) of the distribution

of the data will coincide, so that the first components must also coincide.

These are notions of causality based on the idea of temporal ordering of predictors: the effect

cannot happen before the cause.

Finding Granger causality is not in itself evidence of causality. Due to the operation of unobserv-

ables and omitted variables, Granger causality does not imply nor is it implied by causality.

5 Cointegration

Error correction mechanism representation (ECM)  Consider a dynamic regression model

$$y_t = \delta + \alpha y_{t-1} + \beta_0x_t + \beta_1x_{t-1} + u_t.$$

If we subtract $y_{t-1}$ from both sides of the equation and add $-\beta_0x_{t-1} + \beta_0x_{t-1}$ to the right hand side we get:

$$y_t - y_{t-1} = \delta - (1-\alpha)y_{t-1} + \beta_0(x_t - x_{t-1}) + (\beta_0 + \beta_1)x_{t-1} + u_t$$

and also

$$\Delta y_t = \delta + \beta_0\Delta x_t - (1-\alpha)(y_{t-1} - \gamma x_{t-1}) + u_t \quad (13)$$

where $\gamma$ is the long run effect

$$\gamma = \frac{\beta_0 + \beta_1}{1-\alpha}.$$

Thus, $(y_{t-1} - \gamma x_{t-1})$ can be seen as the error in the long run relationship between y and x. According to (13) a large deviation in the long run error will have a negative impact on the change $\Delta y$ given $\Delta x$, hence the term "error correction mechanism" representation applied to (13).

Equation (13) is convenient for enforcing long-run restrictions in the estimation of partial adjustment models. For example, Davidson et al. (1978) imposed a long-run income elasticity of unity in the estimation of a consumption function using an equation like (13) subject to $\gamma = 1$.10

Cointegration  The ECM representation is specially useful in the case in which $y_t \sim I(1)$ and $x_t \sim I(1)$ but $y_t - \gamma x_t \sim I(0)$. In this situation one says that $(y_t, x_t)$ are cointegrated.11

More generally, we say that the variables in an $m \times 1$ time series vector $w_t$ are cointegrated if all the variables are I(1) but there is a linear combination that is I(0):

$$a'w_t \sim I(0)$$

for some vector a different from zero, which is called the cointegrating vector.

As an example let us consider the following model for $w_t = (y_t, x_t)'$:

$$y_t = \beta x_t + u_t$$
$$x_t = x_{t-1} + \varepsilon_t$$

where $(u_t, \varepsilon_t)$ are white noise.

In this example $x_t$ is a random walk and therefore I(1). The variable $y_t$ is also I(1) but $y_t - \beta x_t$ is I(0). The cointegrating vector is $(1, -\beta)$. The idea is that while there may be permanent changes over time in the individual time series, there is a long run relationship that keeps together the individual components, which is represented by $a'w_t$.

10 Davidson, J. E. H., D. F. Hendry, F. Srba, and S. Yeo (1978): "Econometric Modelling of the Aggregate Time-Series Relationship Between Consumers' Expenditure and Income in the United Kingdom," The Economic Journal, 88, 661-692.
11 Engle, R. F. and C. W. Granger (1987): "Co-integration and Error Correction: Representation, Estimation and Testing," Econometrica, 55, 251-276.

An early example of an error-correction model is Sargan's 1964 study of wages and prices in the UK.12

In Sargan’s model ∆yt and ∆xt are wage and price inflation respectively, whereas the error correction

term is a deviation of real wages from a productivity trend. This equilibrium term captures the role of

real-wage resistance in wage bargains as a mechanism for regaining losses from unanticipated inflation.

Literature development  Some important developments in the cointegration literature are:

• Representations for VAR processes of cointegrated multivariate time series. In general, it can be shown that if $w_t \sim I(1)$ is a cointegrated vector of time series then an ECM representation exists (Granger representation theorem).13

• Estimation of the cointegrating vector. Given that the time series are I(1), estimators of the cointegrating vector are "superconsistent".

• Cointegration tests (with and without knowledge of the cointegrating vector).

12 Sargan, J. D. (1964): "Wages and Prices in the United Kingdom: A Study in Econometric Methodology." In P. E. Hart, G. Mills and J. Whitaker (eds), Econometric Analysis for National Economic Planning, Colston Papers 16, London.
13 Granger, C. W. J. (1981): "Some Properties of Time Series Data and Their Use in Econometric Model Specification," Journal of Econometrics, 16, 121-130.


Instrumental Variables in a Market Model
Class Notes

Manuel Arellano
Revised: September 19, 2014

Model

• Consider the following market model. The demand equation is

$$q^d = a + bp + u \quad (1)$$

where $q^d$ is demand (for fish, say), p is price, and u is a preference shifter.

• The supply equation is

$$q^s = d + yp - gz + \varepsilon \quad (2)$$

where $q^s$ is supply, z is rain at sea, and $\varepsilon$ is a production shifter.

• In equilibrium $q^d = q^s = q$.

• The model determines q and p whereas z, u, and $\varepsilon$ are determined outside the model.

• So q and p are endogenous or internal to the model, whereas z, u, and $\varepsilon$ are exogenous or external to the model.

Reduced form

• Now imagine realizations of these variables (for the same market over time or for a cross-section of markets).

• Values of z, u, and $\varepsilon$ occur according to some probability distribution.

• Given these values, the model produces realizations of q and p given by the reduced form:

$$(y - b)q = (ay - bd) + bgz + (yu - b\varepsilon)$$
$$(y - b)p = (a - d) + gz + (u - \varepsilon).$$

Econometric problem

• Next, we formulate the following econometric problem:

1. We have data on p, q, and z but not on u and $\varepsilon$.
2. We believe model (1)-(2) is correct but we do not know the values of the parameters.
3. We want to use data and the model to infer the slope b of the demand equation.


Endogeneity

• In the econometric sense, we say that p is an endogenous explanatory variable in the demand equation if its realized values are correlated with those of the error term.

• Endogeneity in the econometric sense does not imply nor is it implied by endogeneity in the economic sense (of being a variable internally determined by the model), e.g. z is external but it would be endogenous in the supply equation if correlated with $\varepsilon$.

• The implication of endogeneity in the econometric sense is that the equation of interest, as it applies to realized values, is not a regression (which by construction would have lack of correlation between error and regressor).

• We do not really know empirically if the realized values of p are correlated with those of u because we do not have data on u, but the model lets us expect such correlation.

• If p and u were in fact uncorrelated then b would coincide with the OLS coefficient, but in general

$$\frac{Cov(p, q)}{Var(p)} = b + \frac{Cov(p, u)}{Var(p)}.$$

• So our theory lets us suspect that as a measure of b, the quantity $Cov(p, q)/Var(p)$ has a bias.

Instrumental variables

• In the econometric sense, z is exogenous in the demand equation if it is uncorrelated with the error term (again, we do not observe it; we assume it if we believe there is no association between variation in preferences for fish and rain at sea).

• Moreover, there is an exclusion restriction since the theory tells us that z has no effect on demand given u and p. In other words, if we write demand as

$$q = a + bp + jz + u,$$

the theory tells us that $j = 0$.

• Given this exclusion, the orthogonality condition $Cov(z, u) = 0$, or equivalently

$$Cov(z, q) = b\,Cov(z, p),$$

implies that, as long as $Cov(z, p) \neq 0$, b is determined as a ratio of data covariances:

$$b = \frac{Cov(z, q)}{Cov(z, p)}.$$

• If so we say that b is identified in the econometric problem that we posed. Otherwise, if $Cov(z, p) = 0$ then b is not identified.

• Essentially $Cov(z, p) \neq 0$ if $g \neq 0$. Thus, identification of demand depends on a property of supply.

• If $Cov(z, u) = 0$ (orthogonality) and $Cov(z, p) \neq 0$ (relevance) hold, z is a valid instrumental variable.
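A simulation sketch of this identification argument (all parameter names and values below are illustrative): the OLS coefficient of q on p is biased, while the IV ratio Cov(z, q)/Cov(z, p) recovers b.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 100_000
a, b = 10.0, -1.0          # demand: q = a + b p + u
d, ysl, g = 2.0, 1.0, 0.5  # supply: q = d + ysl p - g z + eps

z = rng.standard_normal(n)       # rain at sea (instrument)
u = rng.standard_normal(n)       # preference shifter
eps = rng.standard_normal(n)     # production shifter

# Reduced form from equating demand and supply
p = ((a - d) + g * z + (u - eps)) / (ysl - b)
q = a + b * p + u

b_ols = np.cov(p, q)[0, 1] / np.var(p)          # biased for b
b_iv = np.cov(z, q)[0, 1] / np.cov(z, p)[0, 1]  # close to b = -1
print(b_ols, b_iv)
```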


A graphical representation

• Suppose that demand is inelastic (b = 0), y = 1, and $Cov(\varepsilon, u) = 0$. The model is

$$q = a + u$$
$$q = d + p - gz + \varepsilon$$

with reduced form

$$q = a + u$$
$$p = (a - d) + gz + (u - \varepsilon).$$

• The "first-stage regression" is a regression of p on z and has slope g.

• The reduced-form quantity equation is a regression of q on z and has slope zero (bg).

• The OLS regression of q on p has positive slope unless $Var(u) = 0$ (a perfect fit):

$$\frac{Cov(p, u)}{Var(p)} = \frac{Var(u)}{Var(u) + Var(gz - \varepsilon)}.$$

• The relation between predicted q and p given z traces average demand for conjectural values of p if $g \neq 0$ (Figure 1):

$$E^*(q \mid z) = a + bE^*(p \mid z).$$

• If g = 0, $E^*(p \mid z) = E(p)$ and $E^*(q \mid z) = a + bE(p)$ for all z, so demand is not identified (Figure 2).

• Data points cluster along regression lines but not along the demand function (Figure 3).

[Figure 1: Identification]
[Figure 2: Underidentification]
[Figure 3: Scatters of data points]


Endogenous prices that are econometrically exogenous

• Suppose the observed quantity is measured with an error v:

$$q = q^* + v,$$

but the market model has no preference shifter:

$$q^* = a + bp$$
$$q^* = d + yp - gz + \varepsilon.$$

• The observed demand is

$$q = a + bp + v.$$

• The reduced form price equation is

$$(y - b)p = (a - d) + gz - \varepsilon.$$

• Contrary to the preference-shifter model, in the mismeasured demand model p does not directly depend on the demand equation error.

• Thus, if $Cov(v, \varepsilon) = 0$, p is econometrically exogenous in the demand equation even if p remains endogenous or internal to the model.

• In any case p is econometrically endogenous in the supply equation.

Identifying supply: Unobserved demand shifter as instrument

• If in the original model $Cov(u, \varepsilon) \neq 0$, the supply equation is not identified because there is no observed demand shifter that could be used as an instrumental variable.

• All we know is that the true values of (d, y, g) satisfy the two equations

$$E(q) = d + yE(p) - gE(z)$$
$$E(zq) = dE(z) + yE(zp) - gE(z^2),$$

but since there are three unknowns the system admits a multiplicity of solutions.

• However, if $Cov(u, \varepsilon) = 0$ the supply is identified because u can be used as an instrument.

• Intuitively, u is "observable" since the demand parameters are identified. Moreover, u is a relevant instrument in general, subject to a rank condition.

• The full set of moment equations is:

$$E\begin{pmatrix} \begin{pmatrix} 1 \\ z \end{pmatrix}(q - a - bp) \\ \begin{pmatrix} 1 \\ z \\ q - a - bp \end{pmatrix}(q - d - yp + gz) \end{pmatrix} = 0,$$

so that there are five equations for five unknowns.


The language of structural equations and statistical relationships

• Let F be the joint distribution of quantities, prices and rain at sea in a population of markets.

• Parameters of statistical relationships (e.g. regressions, IV estimands) are characteristics of F.

• Parameters of structural equations (e.g. demand and supply equations) are not in themselves characteristics of F. They are used to describe a set of hypothetical experiments.

• The traditional notation employed in econometrics has often failed to make a sharp distinction between the two.

• The potential outcome notation of Rubin (1974) and the do-calculus of Pearl (1994) have made this distinction explicit in their approaches to causality.


Instrumental variables
Class Notes

Manuel Arellano

March 8, 2018

1 Introduction

So far we have studied regression models. That is, models for the conditional expectation of one

variable given the values of other variables, or linear approximations to those expectations. Now we

wish to study relations between random variables that are not regressions. We have already seen some

examples: the relationship between yt and yt−1 in an ARMA(1,1) model, or the geometric distributed

lag model.

A linear regression model can be seen as a linear relationship between observable and unobservable variables with the property that the regressors are orthogonal to the unobservable term. For example, given two variables $(y_i, x_i)$, the regression of y on x is

$$y_i = \alpha + \beta x_i + u_i \quad (1)$$

where $\beta = Cov(y_i, x_i)/Var(x_i)$, therefore $Cov(x_i, u_i) = 0$.

Similarly, the regression of x on y is:

$$x_i = \gamma + \delta y_i + \epsilon_i$$

where $\delta = Cov(y_i, x_i)/Var(y_i)$, and $Cov(y_i, \epsilon_i) = 0$. Solving the latter for $y_i$ we can also write:

$$y_i = \alpha^{\dagger} + \beta^{\dagger}x_i + u_i^{\dagger} \quad (2)$$

with $\alpha^{\dagger} = -\gamma/\delta$, $\beta^{\dagger} = 1/\delta$, $u_i^{\dagger} = -\epsilon_i/\delta$.

Both (1) and (2) are statistical linear relationships between y and x. If we are interested in some economic relation between y and x, how should we choose between (1) and (2) or neither of the two? If the goal is to describe means, clearly we would opt for (1) if interested in the mean of y for given values of x, and we would opt for (2) if interested in the mean of x for given values of y.

In equation (2) $Cov(x, u^{\dagger}) \neq 0$ but $Cov(y, u^{\dagger}) = 0$ whereas in equation (1) the opposite is true. However, in the ARMA(1,1) model (referred to in the time series class notes) both the left-hand side and the right-hand side variables are correlated with the error term.

To answer a question of this kind we need a prior idea about the nature of the unobservables in the relationship. We first illustrate this situation by considering measurement error models.


2 Measurement error

Consider an exact relationship between the variables $y_i^*$ and $x_i^*$:

$$y_i^* = \alpha + \beta x_i^*.$$

Suppose that we observe $x_i^*$ without error but we observe an error-ridden measure of $y_i^*$:

$$y_i = y_i^* + v_i$$

where $v_i$ is a zero-mean measurement error independent of $x_i^*$. Therefore,

$$y_i = \alpha + \beta x_i^* + v_i.$$

In this case $\beta$ coincides with the slope coefficient in the regression of $y_i$ on $x_i^*$:

$$\beta = \frac{Cov(x_i^*, y_i)}{Var(x_i^*)}.$$

Now suppose that we observe $y_i^*$ without error but $x_i^*$ is measured with an error $\epsilon_i$ independent of $(y_i^*, x_i^*)$:

$$x_i = x_i^* + \epsilon_i.$$

The relation between the observed variables is

$$y_i^* = \alpha + \beta x_i + \zeta_i \quad (3)$$

where $\zeta_i = -\beta\epsilon_i$. In this case the error is independent of $y_i^*$ but is correlated with $x_i$. Thus, $\beta$ coincides with the inverse slope coefficient in the regression of $x_i$ on $y_i^*$:

$$\beta = \frac{Var(y_i^*)}{Cov(x_i, y_i^*)}. \quad (4)$$

In general, inverse regression may make sense if one suspects that the error term in the relationship between y and x is essentially driven by measurement error in x. As it will become clear later, (4) can be interpreted as an instrumental-variable parameter in the sense that $y_i^*$ is used as an instrument for $x_i$ in (3). Next, we consider measurement error in regression models as opposed to exact relationships.

2.1 Regression model with measurement error

Measurement error may be the result of conceptual differences between the variable of economic interest and the one available in the data, but it could also be the result of rounding errors or misreporting in survey data or administrative records.

Let us consider the regression model

$$y_i^* = \alpha + \beta x_i^* + u_i^*$$

where $u_i^*$ is independent of $x_i^*$. Below we distinguish two cases: one in which there is only measurement error in $y_i^*$ and another in which there is only measurement error in $x_i^*$.


Measurement error in $y_i^*$  We observe $y_i = y_i^* + v_i$ such that $v_i \perp (x_i^*, u_i^*)$. In this case,

$$y_i = \alpha + \beta x_i^* + (u_i^* + v_i),$$

so that

$$\beta = \frac{Cov(x_i^*, y_i^*)}{Var(x_i^*)} = \frac{Cov(x_i^*, y_i)}{Var(x_i^*)}.$$

The only difference with the original regression model is that the variance of the error term is larger due to the measurement error component, which means that the $R^2$ will be smaller:

$$R_*^2 = \frac{\beta^2Var(x_i^*)}{\beta^2Var(x_i^*) + \sigma_u^2}, \qquad R^2 = \frac{\beta^2Var(x_i^*)}{\beta^2Var(x_i^*) + \sigma_u^2 + \sigma_v^2},$$

so that the larger $\sigma_v^2$ the smaller $R^2$ will be relative to $R_*^2$:

$$R^2 = \frac{R_*^2}{1 + \frac{\sigma_v^2}{\beta^2Var(x_i^*) + \sigma_u^2}}.$$

Measurement error in $x_i^*$  Now $x_i = x_i^* + \epsilon_i$ such that $\epsilon_i \perp (x_i^*, u_i^*)$. In this case,

$$y_i^* = \alpha + \beta x_i + (u_i^* - \beta\epsilon_i).$$

Then the linear regression coefficient of $y_i^*$ on $x_i$ is

$$\frac{Cov(x_i, y_i^*)}{Var(x_i)} = \frac{Cov(x_i^*, y_i^*)}{Var(x_i^*) + \sigma_\epsilon^2} = \frac{\beta}{1 + \frac{\sigma_\epsilon^2}{Var(x_i^*)}} = \beta - \beta\left(\frac{\lambda}{1+\lambda}\right)$$

where $\lambda = \sigma_\epsilon^2/Var(x_i^*)$. Thus, OLS estimates will be biased for $\beta$ with a bias that depends on the noise-to-signal ratio $\lambda$. For example, if $\lambda = 1$ the regression coefficient will be half the size of the effect of interest.

An example: $y_i^*$ = consumption, $x_i^*$ = permanent income, $u_i^*$ = transitory consumption, $\epsilon_i$ = transitory income.

Identification using λ  If we have measurements of $\lambda$ or $\sigma_\epsilon^2$ then consistent estimation may be based on the following expressions:

$$\beta = (1+\lambda)\frac{Cov(x_i, y_i^*)}{Var(x_i)} = \frac{Cov(x_i, y_i^*)}{Var(x_i) - \sigma_\epsilon^2}. \quad (5)$$

More generally, if $x_i$ is a vector of variables measured with error, so that

$$y_i = x_i'\beta + (u_i - \epsilon_i'\beta) \qquad x_i = x_i^* + \epsilon_i, \quad E(\epsilon_i\epsilon_i') = \Omega,$$

a vector-valued generalization of (5) takes the form:

$$\beta = [E(x_ix_i') - \Omega]^{-1}E(x_iy_i).$$
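A simulation sketch of the attenuation bias and of the correction in (5), assuming $\sigma_\epsilon^2$ is known (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(10)
n, alpha, beta = 100_000, 1.0, 2.0
sig2_eps = 1.0                          # assumed known measurement error variance

x_star = rng.standard_normal(n)         # true regressor, Var = 1, so lambda = 1
y = alpha + beta * x_star + 0.5 * rng.standard_normal(n)
x = x_star + np.sqrt(sig2_eps) * rng.standard_normal(n)

b_ols = np.cov(x, y)[0, 1] / np.var(x)                 # about beta/2 when lambda = 1
b_corr = np.cov(x, y)[0, 1] / (np.var(x) - sig2_eps)   # corrected, close to beta
print(b_ols, b_corr)
```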


3 Instrumental-variable model

3.1 Identification

The set-up is as follows. We observe {y_i, x_i, z_i}_{i=1}^n with dim(x_i) = k, dim(z_i) = r such that

    y_i = x_i′β + u_i,    E(z_iu_i) = 0.

Typically there will be overlap between variables contained in x_i and z_i, for example a constant term ("control" variables). Variables in x_i that are absent from z_i are endogenous explanatory variables. Variables in z_i that are absent from x_i are external instruments.

The assumption E(z_iu_i) = 0 implies that β solves the system of r equations:

    E[z_i(y_i − x_i′β)] = 0

or

    E(z_ix_i′)β = E(z_iy_i).    (6)

If r < k, system (6) will have a multiplicity of solutions for β, so that β is not point identified. If r ≥ k and rank E(z_ix_i′) = k then β is identified. In estimation we will distinguish between the just-identified case (r = k) and the over-identified case (r > k).

If r = k and the rank condition holds we have

    β = [E(z_ix_i′)]⁻¹ E(z_iy_i).    (7)

In the simple case where x_i = (1, x_oi)′, z_i = (1, z_oi)′ and β = (β1, β2)′ we get

    β2 = Cov(z_oi, y_i) / Cov(z_oi, x_oi)

and

    β1 = E(y_i) − β2 E(x_oi).

In general, the OLS parameters will differ from the parameters in the instrumental-variable model. In the previous simple example we have:

    Cov(x_oi, y_i) / Var(x_oi) = β2 + Cov(x_oi, u_i) / Var(x_oi).    (8)

Sometimes the orthogonality between instruments and error term is expressed in the form of a stronger mean independence assumption instead of lack of correlation:

    E(u_i | z_i) = 0.
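As an illustration of (7) and of the OLS inconsistency in (8), here is a minimal Python simulation (a hypothetical data-generating process with a single endogenous regressor):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100_000
    z = rng.normal(0, 1, n)                          # external instrument
    u = rng.normal(0, 1, n)
    x = 0.8 * z + 0.5 * u + rng.normal(0, 1, n)      # endogenous: Cov(x, u) = 0.5
    y = 1.0 + 2.0 * x + u                            # true beta2 = 2

    b_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)   # ~ 2 + 0.5/1.89, about 2.26
    b_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]   # sample analog of (7): ~ 2.0
    print(b_ols, b_iv)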


3.2 Examples

Demand equation In this example the units are markets across space or over time, y_i is quantity, the endogenous explanatory variable is price and the external instrument is a supply shifter, such as weather variation in the case of an agricultural product. This is the classic example from the simultaneous equations literature.¹

Evaluation of a training program Here the units are workers, the endogenous explanatory variable is an indicator of participation in a training program and y_i is some subsequent labor market outcome, such as wages or employment status. The external instrument is an indicator of random assignment to access to the program. In this example we would expect the coefficient in the instrumental-variable regression to be positive, whereas the coefficient in the OLS regression could be negative.

Measurement error Consider the measurement error regression model:

    y_i = β1 + β2 x*_i + v_i

where we observe two measurements of x*_i with independent errors:

    x1_i = x*_i + ε1_i
    x2_i = x*_i + ε2_i.

All unobservables x*_i, v_i, ε1_i, ε2_i are mutually independent. In this example, we could have x_i = (1, x1_i)′, z_i = (1, x2_i)′ and u_i = v_i − β2 ε1_i; or alternatively x_i = (1, x2_i)′, z_i = (1, x1_i)′ and u_i = v_i − β2 ε2_i.

Time series regression with dynamics and serial correlation A simple example is the ARMA(1,1) model:

    y_t = β1 + β2 y_{t−1} + u_t
    u_t = ε_t + θε_{t−1}

where ε_t is a white noise error term. Here x_t = (1, y_{t−1})′ and z_t = (1, y_{t−2})′.
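The ARMA(1,1) case can be checked numerically. A minimal sketch (arbitrary parameter values, not from the notes) where OLS on y_{t−1} is inconsistent but y_{t−2} is a valid instrument:

    import numpy as np

    rng = np.random.default_rng(2)
    T, b1, b2, theta = 20_000, 0.5, 0.7, 0.4

    eps = rng.normal(0, 1, T)
    u = eps.copy()
    u[1:] += theta * eps[:-1]                        # MA(1) error: Cov(y_{t-1}, u_t) > 0
    y = np.zeros(T)
    for t in range(1, T):
        y[t] = b1 + b2 * y[t - 1] + u[t]

    y0, y1, y2 = y[2:], y[1:-1], y[:-2]              # y_t, y_{t-1}, y_{t-2}
    b_ols = np.cov(y1, y0)[0, 1] / np.var(y1, ddof=1)      # overshoots 0.7
    b_iv = np.cov(y2, y0)[0, 1] / np.cov(y2, y1)[0, 1]     # ~ 0.7
    print(b_ols, b_iv)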

3.3 Estimation

Simple IV estimator When r = k a simple instrumental-variable estimator is the sample counterpart of (7):

    β̂ = (∑_{i=1}^n z_ix_i′)⁻¹ ∑_{i=1}^n z_iy_i.

¹ Haavelmo, T. (1943): "The Statistical Implications of a System of Simultaneous Equations," Econometrica, 11, 1-12.


The estimation error is given by

    β̂ − β = (n⁻¹ ∑_{i=1}^n z_ix_i′)⁻¹ n⁻¹ ∑_{i=1}^n z_iu_i.

Thus, plim_{n→∞} β̂ = β if plim n⁻¹ ∑_{i=1}^n z_ix_i′ = E(z_ix_i′) = H, rank H = k, and plim n⁻¹ ∑_{i=1}^n z_iu_i = E(z_iu_i) = 0.

Also,

    √n(β̂ − β) →d N(0, H⁻¹WH′⁻¹)

if n^{−1/2} ∑_{i=1}^n z_iu_i →d N(0, W). When {y_i, x_i, z_i}_{i=1}^n is a random sample then W = E(u_i²z_iz_i′).

Overidentified IV If r > k the system (6) contains more equations than unknowns. To determine the population value of β we could solve any rank-preserving k linear combinations for some k × r matrix G:

    GE(z_ix_i′)β = GE(z_iy_i)

so that

    β = [GE(z_ix_i′)]⁻¹ GE(z_iy_i),    (9)

leading to consistent estimators of the form

    β̂_G = (∑_{i=1}^n Gz_ix_i′)⁻¹ ∑_{i=1}^n Gz_iy_i.    (10)

Note that while (9) is invariant to the choice of G if the model is correctly specified, the estimated quantity (10) will differ across choices of G due to sampling error. For example, if x_i = (1, x_oi)′ and z_i = (1, z1_i, z2_i)′ we will have

    Cov(z1_i, y_i) / Cov(z1_i, x_oi) = Cov(z2_i, y_i) / Cov(z2_i, x_oi)

in the population, but in the sample

    Ĉov(z1_i, y_i) / Ĉov(z1_i, x_oi) ≠ Ĉov(z2_i, y_i) / Ĉov(z2_i, x_oi).

Asymptotic normality Turning to large sample properties, repeating the previous asymptotic normality argument for (10), under iid sampling we get:

    √n(β̂_G − β) →d N(0, V_G)

with

    V_G = [GE(z_ix_i′)]⁻¹ GE(u_i²z_iz_i′)G′ [E(x_iz_i′)G′]⁻¹.    (11)

Thus, the large sample variance depends on the choice of G.


Optimality Let us now consider optimality following Sargan (1958).² For G = E(x_iz_i′)[E(u_i²z_iz_i′)]⁻¹ the matrix V_G equals

    V₀ = [E(x_iz_i′)[E(u_i²z_iz_i′)]⁻¹E(z_ix_i′)]⁻¹.

Moreover, it can be shown that for any other choice of G we have:³

    V_G − V₀ ≥ 0.

Therefore, estimators of the form

    β̂_{G_n} = (∑_{i=1}^n G_n z_ix_i′)⁻¹ ∑_{i=1}^n G_n z_iy_i    (12)

with a possibly stochastic G_n such that G_n →p E(x_iz_i′)[E(u_i²z_iz_i′)]⁻¹ (up to a multiplicative constant) are optimal in the sense of being minimum asymptotic variance within the class of linear instrumental-variable estimators that use z_i as instruments.

Under homoskedasticity E(u_i²z_iz_i′) = σ²E(z_iz_i′), therefore a choice of G_n such that

    G_n →p E(x_iz_i′)[E(z_iz_i′)]⁻¹ = Π

is optimal. The matrix Π contains the OLS population coefficients in linear regressions of the x_i variables on z_i.

Two-stage least squares Letting Π̂ = (∑_{i=1}^n x_iz_i′)(∑_{i=1}^n z_iz_i′)⁻¹ be the sample counterpart of Π, the two-stage least squares estimator is

    β̂_{2SLS} = (∑_{i=1}^n Π̂z_ix_i′)⁻¹ ∑_{i=1}^n Π̂z_iy_i    (13)

or in short

    β̂_{2SLS} = (∑_{i=1}^n x̂_ix_i′)⁻¹ ∑_{i=1}^n x̂_iy_i    (14)

where x̂_i = Π̂z_i is the vector of fitted values in the ("first-stage") regressions of the x_i variables on z_i:

    x_i = Πz_i + v_i.    (15)

² Sargan, J. D. (1958): "The Estimation of Economic Relationships Using Instrumental Variables," Econometrica, 26, 393-415.
³ To see this, let W⁻¹ = C′C, H̄ = CH, D = (GH)⁻¹GC⁻¹, and note that:

    V_G − V₀ = (GH)⁻¹GWG′(H′G′)⁻¹ − (H′W⁻¹H)⁻¹ = D[I − H̄(H̄′H̄)⁻¹H̄′]D′.

This is a positive semi-definite matrix because [I − H̄(H̄′H̄)⁻¹H̄′] is idempotent. This optimality result also applies to clustered and serially dependent data since it does not require that W equals E(u_i²z_iz_i′).


If a variable in x_i is also contained in z_i its fitted value will coincide with the variable itself and the corresponding element of v_i will be equal to zero.

Sometimes it is convenient to use matrix notation as follows:

    Π̂ = (X′Z)(Z′Z)⁻¹

so that

    β̂_{2SLS} = [(X′Z)(Z′Z)⁻¹(Z′X)]⁻¹ (X′Z)(Z′Z)⁻¹(Z′y)

and

    β̂_{2SLS} = (X̂′X)⁻¹X̂′y

where X̂ = Z(Z′Z)⁻¹(Z′X).

Note that β̂_{2SLS} is also the OLS regression of y on X̂:

    β̂_{2SLS} = (X̂′X̂)⁻¹X̂′y.

This interpretation of the 2SLS estimator is the one that originated its traditional name.

Two-stage least squares estimation relies on a powerful intuition: we use as instrument the linear combination of the instrumental variables that best predicts the endogenous explanatory variables in the linear projection sense.

Consistency of β̂_{2SLS} relies on n → ∞ for fixed r. Note that if r = n then X̂ = X so that 2SLS and OLS coincide. If r is less than n but close to it, one would expect 2SLS to be close to OLS.
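The algebra above translates directly into code. A minimal Python sketch with one endogenous regressor and two external instruments (all names and values illustrative), computing β̂_2SLS both from the matrix formula and as OLS on the first-stage fitted values:

    import numpy as np

    rng = np.random.default_rng(3)
    n = 50_000
    u = rng.normal(0, 1, n)
    z1, z2 = rng.normal(0, 1, (2, n))
    x_end = 0.6 * z1 + 0.3 * z2 + 0.5 * u + rng.normal(0, 1, n)
    y = 1.0 + 2.0 * x_end + u                        # true coefficients (1, 2)

    X = np.column_stack([np.ones(n), x_end])         # k = 2
    Z = np.column_stack([np.ones(n), z1, z2])        # r = 3: over-identified

    # [(X'Z)(Z'Z)^{-1}(Z'X)]^{-1} (X'Z)(Z'Z)^{-1}(Z'y)
    A = X.T @ Z @ np.linalg.solve(Z.T @ Z, Z.T)
    b_2sls = np.linalg.solve(A @ X, A @ y)

    # Equivalent: OLS of y on Xhat = Z(Z'Z)^{-1}Z'X
    Xhat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    b_check = np.linalg.solve(Xhat.T @ Xhat, Xhat.T @ y)
    print(b_2sls, b_check)                           # both ~ [1.0, 2.0]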

Robust standard errors Although its optimality requires homoskedasticity, 2SLS (like OLS) remains a popular estimator under more general conditions. Particularizing expression (11) to G = Π we obtain the asymptotic variance of the 2SLS estimator

    V_Π = [ΠE(z_iz_i′)Π′]⁻¹ ΠE(u_i²z_iz_i′)Π′ [ΠE(z_iz_i′)Π′]⁻¹.    (16)

Heteroskedasticity-robust standard errors and confidence intervals can be obtained from the estimated variance:

    V̂_Π = [Π̂(n⁻¹∑ z_iz_i′)Π̂′]⁻¹ Π̂(n⁻¹∑ û_i²z_iz_i′)Π̂′ [Π̂(n⁻¹∑ z_iz_i′)Π̂′]⁻¹
        = n(X̂′X̂)⁻¹ (∑_{i=1}^n û_i²x̂_ix̂_i′) (X̂′X̂)⁻¹

where the û_i are 2SLS residuals û_i = y_i − x_i′β̂_{2SLS}.

With homoskedastic errors, (16) boils down to

    V_Π = σ²[ΠE(z_iz_i′)Π′]⁻¹    (17)

where σ² = E(u_i²). In this case a consistent estimator of V_Π is simply

    V̂_Π = nσ̂²(X̂′X̂)⁻¹    (18)

where σ̂² = n⁻¹ ∑_{i=1}^n û_i².

Note that if the residual variance is calculated from fitted-value residuals y − X̂β̂_{2SLS} instead of û = y − Xβ̂_{2SLS}, we would get an inconsistent estimate of σ² and therefore also of V_Π in (17).
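A small helper implementing both variance estimates (a sketch; it can be applied to the simulated X, Z, y above). Note that the residuals are built from X, not from the fitted values X̂, exactly as the last remark requires:

    import numpy as np

    def tsls_with_se(y, X, Z):
        # 2SLS with heteroskedasticity-robust and homoskedastic standard errors
        n = len(y)
        Xhat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)       # first-stage fitted values
        b = np.linalg.solve(Xhat.T @ Xhat, Xhat.T @ y)     # 2SLS coefficients
        u = y - X @ b                                      # residuals from X, NOT from Xhat
        XXi = np.linalg.inv(Xhat.T @ Xhat)
        meat = (Xhat * (u**2)[:, None]).T @ Xhat           # sum of u_i^2 * xhat_i xhat_i'
        V_rob = XXi @ meat @ XXi                           # robust variance of beta-hat
        V_hom = (u @ u / n) * XXi                          # sigma2-hat * (Xhat'Xhat)^{-1}
        return b, np.sqrt(np.diag(V_rob)), np.sqrt(np.diag(V_hom))

These are variances of β̂ itself; multiplying them by n gives estimates of V_Π as defined in (16)-(18).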

3.4 Testing overidentifying restrictions

When r > k an IV estimator sets to zero k linear combinations of the moments:

    GE(z_ix_i′)β = GE(z_iy_i).

Thus, there remain r − k linearly independent combinations that are not set to zero in estimation but should be close to zero under correct specification. A test of overidentifying restrictions or Sargan test is a test of the null hypothesis that the remaining r − k linear combinations are equal to zero. Under classical errors the form of the statistic is given by

    S = û′Z(Z′Z)⁻¹Z′û / σ̂² →d χ²_{r−k}.    (19)

It is easy to see that S = nR² where R² is the R-squared in a regression of û on Z.

A sketch of the result in (19) is as follows. With classical errors n^{−1/2} ∑_{i=1}^n z_iu_i →d N[0, σ²E(z_iz_i′)] and therefore also

    (1/√n)(1/σ) C′Z′u →d N(0, I_r)

where we are using the factorization (Z′Z/n)⁻¹ = CC′.

Next, using

    û = y − Xβ̂_{2SLS} = u − X(β̂_{2SLS} − β)

and

    β̂_{2SLS} − β = [(X′Z)(Z′Z)⁻¹(Z′X)]⁻¹ (X′Z)(Z′Z)⁻¹Z′u,

we get

    h = (1/√n)(1/σ) C′Z′û = [I_r − B(B′B)⁻¹B′] (1/√n)(1/σ) C′Z′u

where B = C′(Z′X/n).

Since the probability limit of [I − B(B′B)⁻¹B′] is idempotent with rank r − k, it follows (replacing σ² by its consistent estimate û′û/n) that

    h′h = n û′Z(Z′Z)⁻¹Z′û / û′û →d χ²_{r−k}.
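A compact implementation of the Sargan statistic (a sketch; with the over-identified X, Z, y simulated earlier the statistic should be well below conventional critical values):

    import numpy as np
    from scipy import stats

    def sargan(y, X, Z):
        n, k = X.shape
        r = Z.shape[1]
        Xhat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
        u = y - X @ np.linalg.solve(Xhat.T @ Xhat, Xhat.T @ y)   # 2SLS residuals
        Pu = Z @ np.linalg.solve(Z.T @ Z, Z.T @ u)               # projection of u on Z
        S = n * (u @ Pu) / (u @ u)                               # = n * R^2 of u on Z
        return S, stats.chi2.sf(S, df=r - k)                     # statistic, p-value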


In the presence of heteroskedasticity, the statistic S in (19) is not asymptotically chi-square, not even under correct specification. An alternative robust Sargan statistic is:

    S_R = (û′Z)Ŵ⁻¹(Z′û) →d χ²_{r−k}    (20)

where Ŵ = ∑_{i=1}^n û_i²z_iz_i′ and û = y − Xβ̂_{G†_n} with G†_n = (X′Z)Ŵ⁻¹.

Contrary to β̂_{2SLS}, the IV estimator β̂_{G†_n} given by

    β̂_{G†_n} = [(X′Z)Ŵ⁻¹(Z′X)]⁻¹ (X′Z)Ŵ⁻¹(Z′y)    (21)

uses an optimal choice of G_n under heteroskedasticity. This improved IV estimator was studied by Halbert White in 1982 under the name two-stage instrumental variables (2SIV) estimator.⁴

⁴ White, H. (1982): "Instrumental Variables Regression with Independent Observations," Econometrica, 50, 483-499.
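A two-step sketch of (20)-(21): a preliminary 2SLS provides residuals to build Ŵ (a common implementation choice; the notes define Ŵ from the final residuals, which in practice would be iterated):

    import numpy as np
    from scipy import stats

    def two_stage_iv(y, X, Z):
        n, k = X.shape
        r = Z.shape[1]
        Xhat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)     # step 1: 2SLS residuals
        u0 = y - X @ np.linalg.solve(Xhat.T @ Xhat, Xhat.T @ y)
        W = (Z * (u0**2)[:, None]).T @ Z                 # What = sum u_i^2 z_i z_i'
        XZW = X.T @ Z @ np.linalg.inv(W)                 # step 2: estimator (21)
        b = np.linalg.solve(XZW @ Z.T @ X, XZW @ Z.T @ y)
        u = y - X @ b                                    # robust Sargan statistic (20)
        W = (Z * (u**2)[:, None]).T @ Z
        Zu = Z.T @ u
        S_R = Zu @ np.linalg.solve(W, Zu)
        return b, S_R, stats.chi2.sf(S_R, df=r - k)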


Treatment effect methods
Manuel Arellano
March 2018

• A treatment effect exercise is context-specific.

• The goal is to evaluate the impact of an existing policy by comparing the distribution of a chosen outcome variable for individuals affected by the policy (treatment group) with the distribution of unaffected individuals (control group).

• The aim is to choose the control and treatment groups in such a way that membership of one or the other either results from randomization or can be regarded as if it were the result of randomization.

• In this way one hopes to achieve the standards of empirical credibility on causal evidence that are typical of experimental biomedical studies.

Potential outcomes and causality

• Association and causation have always been known to be different, but a mathematical framework for an unambiguous characterization of statistical causal effects is surprisingly recent (Rubin 1974).

• Think of a population of individuals that are susceptible to treatment. Let Y1 be the outcome for an individual if exposed to treatment and let Y0 be the outcome for the same individual if not exposed. The treatment effect for that individual is Y1 − Y0.

• In general, individuals differ in how much they gain from treatment, so that we can imagine a distribution of gains over the population with mean

    α_ATE = E(Y1 − Y0).

• The average treatment effect so defined is a standard measure of the causal effect of treatment 1 relative to treatment 0 on the chosen outcome.

• Suppose that treatment has been administered to a fraction of the population, and we observe whether an individual has been treated or not (D = 1 or 0) and the person's outcome Y. Thus, we are observing Y1 for the treated and Y0 for the rest:

    Y = (1 − D)Y0 + DY1.


• Because Y1 and Y0 can never be observed for the same individual, the distribution of gains lacks empirical entity. It is just a conceptual device that can be related to observables.

• This notion of causality is statistical because it is not interested in finding out causal effects for specific individuals. Causality is defined in an average sense.

Connection with regression

• A standard measure of association between Y and D is:

    β = E(Y | D = 1) − E(Y | D = 0)
      = E(Y1 − Y0 | D = 1) + E(Y0 | D = 1) − E(Y0 | D = 0).

• The second expression makes it clear that in general β differs from the average gain for the treated (another standard measure of causality, which we call α_TT).

• The reason is that treated and nontreated units may have different average outcomes in the absence of treatment.

• For example, this will be the case if treatment status is the result of individual decisions, and those with low Y0 choose treatment more frequently than those with high Y0.


• From a structural model of D and Y one could obtain the implied average treatment effects, but here α_ATE or α_TT have been directly defined with respect to the distribution of potential outcomes, so that relative to a structure they are reduced-form causal effects.

• Econometrics has conventionally distinguished between reduced-form effects (uninterpretable but useful for prediction) and structural effects (associated with rules of behavior).

• The TE literature emphasizes "reduced-form causal effects" as an intermediate category between predictive and structural effects.

Social feedback

• The potential outcome representation is predicated on the assumption that the effect of treatment is independent of how many individuals receive treatment, so that the possibility of different outcomes depending on the treatment received by other units is ruled out.

• This excludes general equilibrium or feedback effects, as well as strategic interactions among agents.

• So the framework is not well suited to the evaluation of system-wide reforms which are intended to have substantial equilibrium effects.


Social experiments

• In the TE approach, a randomized field trial is regarded as the ideal research design.

• Observational studies are seen as "more speculative" attempts to generate the force of evidence of experiments.

• In a controlled experiment, treatment status is randomly assigned by the researcher, which by construction ensures:

    (Y0, Y1) ⊥ D.

In such a case, F(Y1 | D = 1) = F(Y1) and F(Y0 | D = 0) = F(Y0). The implication is α_ATE = α_TT = β.

• Analysis of the data takes a simple form: an unbiased estimate of α_ATE is the difference between the average outcomes for treatments and controls:

    α̂_ATE = Ȳ_T − Ȳ_C.

• In a randomized setting, there is no need to "control" for covariates, rendering multiple regression unnecessary, except if interested in effects for specific groups.
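In code, the experimental estimator is just a difference of means. A minimal simulated illustration (a hypothetical design with true ATE = 1; all names are illustrative):

    import numpy as np

    rng = np.random.default_rng(4)
    N = 10_000
    Y0 = rng.normal(0, 1, N)
    Y1 = Y0 + rng.normal(1.0, 0.5, N)            # heterogeneous gains, E(Y1 - Y0) = 1
    D = rng.integers(0, 2, N)                    # randomization: (Y0, Y1) independent of D
    Y = np.where(D == 1, Y1, Y0)                 # observed outcome

    ate_hat = Y[D == 1].mean() - Y[D == 0].mean()
    se = np.sqrt(Y[D == 1].var(ddof=1) / (D == 1).sum()
                 + Y[D == 0].var(ddof=1) / (D == 0).sum())
    print(ate_hat, se)                           # ~ 1.0 with a small standard error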


Matching

• There are many situations where experiments are too expensive, unfeasible, or unethical. A classical example is the analysis of the effects of smoking on mortality.

• Experiments guarantee the independence condition

    (Y1, Y0) ⊥ D

but with observational data it is not very plausible.

• A less demanding condition for nonexperimental data is:

    (Y1, Y0) ⊥ D | X.

• Conditional independence implies

    E(Y1 | X) = E(Y1 | D = 1, X) = E(Y | D = 1, X)
    E(Y0 | X) = E(Y0 | D = 0, X) = E(Y | D = 0, X).

Therefore, for α_ATE we can calculate (and similarly for α_TT):

    α_ATE = E(Y1 − Y0) = ∫ E(Y1 − Y0 | X) dF(X)
          = ∫ [E(Y | D = 1, X) − E(Y | D = 0, X)] dF(X).

• The following is a matching expression for α_TT = E(Y1 − Y0 | D = 1):

    E[Y − E(Y0 | D = 1, X) | D = 1] = E[Y − µ0(X) | D = 1]

where µ0(X) = E(Y | D = 0, X) is used as an imputation for Y0.


Relation with multiple regression

• If we specify E(Y | D, X) as a linear regression on D, X and D × X we have

    E(Y | D, X) = βD + γX + δDX

and

    E(Y | D = 1, X) − E(Y | D = 0, X) = β + δX,

so that

    α_ATE = β + δE(X)
    α_TT = β + δE(X | D = 1),

which can be easily estimated using linear regression (see the sketch after this list).

• Alternatively, we can treat E(Y | D = 1, X) and E(Y | D = 0, X) as nonparametric functions of X.

• The last approach is closer in spirit to the matching literature, which has emphasized direct comparisons, free from functional form assumptions and extrapolation.
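A simulated sketch of the regression route (an illustrative design in which selection into D depends only on X, so conditional independence holds; the true gain is 1 + 0.5X):

    import numpy as np

    rng = np.random.default_rng(5)
    N = 20_000
    X = rng.normal(1.0, 1.0, N)
    D = (rng.uniform(size=N) < 1 / (1 + np.exp(-(X - 1)))).astype(float)
    Y = 0.5 * X + D * (1.0 + 0.5 * X) + rng.normal(0, 1, N)

    R = np.column_stack([np.ones(N), D, X, D * X])       # regressors (1, D, X, DX)
    coef, *_ = np.linalg.lstsq(R, Y, rcond=None)
    beta, delta = coef[1], coef[3]

    ate = beta + delta * X.mean()                        # beta + delta*E(X) ~ 1.5
    att = beta + delta * X[D == 1].mean()                # larger, since E(X|D=1) > E(X)
    print(ate, att)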


The common support condition

• Suppose for the sake of the argument that X is a single covariate whose support lies in the range [X_MIN, X_MAX].

• The support for the subpopulation of the treated (D = 1) is [X_MIN, X_I] whereas the support for the controls (D = 0) is [X_0, X_MAX], with X_0 < X_I, so that

    Pr(D = 1 | X ∈ [X_MIN, X_0)) = 1
    0 < Pr(D = 1 | X ∈ [X_0, X_I]) < 1
    Pr(D = 1 | X ∈ (X_I, X_MAX]) = 0.

• The implication is that E(Y | D = 1, X) is only identified for values of X in the range [X_MIN, X_I] and E(Y | D = 0, X) is only identified for values of X in the range [X_0, X_MAX].

• Thus, we can only calculate the difference [E(Y | D = 1, X) − E(Y | D = 0, X)] for values of X in the intersection range [X_0, X_I], which implies that α_ATE is not identified. Only the average treatment effect of units with X ∈ [X_0, X_I] is identified.

• If we want to ensure identification, in addition to conditional independence we need the overlap assumption:

    0 < Pr(D = 1 | X) < 1 for all X in its support.


Lack of common support and parametric assumptions: a cautionary tale

• Suppose that E(Y1 | X) = E(Y0 | X) = m(X) for all X but the support is as before. We can only hope to establish that E(Y1 − Y0 | X = r) = 0 for r ∈ [X_0, X_I].

• Conditional independence holds, so E(Y | D = 1, X) = E(Y | D = 0, X) = m(X), which in our example is a nonlinear function of X.

• Suppose that we use linear projections in place of conditional expectations:

    E*(Y | D = 0, X) = β_0 + β_1 X
    E*(Y | D = 1, X) = γ_0 + γ_1 X

where

    (β_0, β_1) = arg min_{b0,b1} E_{X|D=0}{[E(Y | D = 0, X) − b0 − b1 X]²}
    (γ_0, γ_1) = arg min_{g0,g1} E_{X|D=1}{[E(Y | D = 1, X) − g0 − g1 X]²}.

• Given the form of m(X), f(X | D = 0) and f(X | D = 1) in the example, we shall get β_1 > γ_1. If we now project outside the observable ranges, we find a spurious negative treatment effect for large X and a spurious positive effect for small X.

• So α_ATE calculated as (γ_0 − β_0) + (γ_1 − β_1)E(X) may be positive, negative or close to zero depending on the form of the distributions involved, despite the fact that not only E(Y1 − Y0) = 0 but also E(Y1 − Y0 | X) = 0 for all values of X.

[Figures 1, 2 and 5, illustrating this example, are not reproduced here.]

Imputing missing outcomes (discrete X)

• Suppose X is discrete, takes on J values {x^j}_{j=1}^J, and we have a sample {X_i}_{i=1}^N. Let

    N_j = number of observations in cell j.
    N_jℓ = number of observations in cell j with D = ℓ.
    Ȳ_jℓ = mean outcome in cell j for D = ℓ.

• Thus, (Ȳ_j1 − Ȳ_j0) is the sample counterpart of

    E(Y | D = 1, X = x^j) − E(Y | D = 0, X = x^j),

which can be used to get the estimates

    α̂_ATE = ∑_{j=1}^J (Ȳ_j1 − Ȳ_j0) N_j/N,    α̂_TT = ∑_{j=1}^J (Ȳ_j1 − Ȳ_j0) N_j1/N_1.

• The formula for α̂_TT can also be written in the form

    α̂_TT = (1/N_1) ∑_{D_i=1} (Y_i − Ȳ_{j(i)0})

where j(i) is the cell of X_i. Thus, α̂_TT matches the outcome of each treated unit with the mean of the nontreated units in the same cell.

• To see this note that E[E(Y | D = 1, X) − E(Y | D = 0, X) | D = 1] = E[Y − E(Y | D = 0, X) | D = 1].
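A direct implementation of the cell-mean formulas (a sketch on simulated data with a discrete covariate; the true gain is 1 + 0.3X, so α_ATE ≈ 1.6):

    import numpy as np

    rng = np.random.default_rng(6)
    N, J = 20_000, 5
    X = rng.integers(0, J, N)                            # J cells
    D = (rng.uniform(size=N) < 0.2 + 0.12 * X).astype(int)
    Y = X + D * (1.0 + 0.3 * X) + rng.normal(0, 1, N)

    ate, att, N1 = 0.0, 0.0, (D == 1).sum()
    for j in range(J):
        cell = X == j
        diff = Y[cell & (D == 1)].mean() - Y[cell & (D == 0)].mean()
        ate += diff * cell.sum() / N                     # weight N_j / N
        att += diff * (cell & (D == 1)).sum() / N1       # weight N_j1 / N_1
    print(ate, att)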


Imputing missing outcomes (continuous X)

• A matching estimator can be regarded as a way of constructing imputations for missing potential outcomes so that gains Y1_i − Y0_i can be estimated for each unit.

• In the discrete case

    Ŷ0_i = Ȳ_{j(i)0} ≡ ∑_{k∈(D=0)} [1(X_k = X_i) / ∑_{ℓ∈(D=0)} 1(X_ℓ = X_i)] Y_k.

• In general

    Ŷ0_i = ∑_{k∈(D=0)} w(i, k) Y_k.

• Different matching estimators use different weighting schemes.

• Nearest neighbor matching:

    w(i, k) = 1 if k = arg min_ℓ ‖X_ℓ − X_i‖, and 0 otherwise,

with perhaps matching restricted to cases where ‖X_i − X_k‖ < ε for some ε. Usually applied in situations where the interest is in α_TT but also applicable to α_ATE.

• Kernel matching:

    w(i, k) = K((X_k − X_i)/g_N0) / ∑_{ℓ∈(D=0)} K((X_ℓ − X_i)/g_N0)

where K(·) is a kernel that downweights distant observations and g_N0 is a bandwidth parameter. Local linear approaches provide a generalization.
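Both weighting schemes in a single helper (a sketch for scalar X; the Gaussian kernel and the bandwidth are arbitrary illustrative choices):

    import numpy as np

    def att_matching(Y, D, X, h=None):
        # Impute Y0 for each treated unit from controls:
        # h=None -> single nearest neighbor; h>0 -> Gaussian kernel with bandwidth h
        Xt, Xc, Yc = X[D == 1], X[D == 0], Y[D == 0]
        dist = np.abs(Xt[:, None] - Xc[None, :])         # treated-by-controls distances
        if h is None:
            y0_hat = Yc[dist.argmin(axis=1)]
        else:
            w = np.exp(-0.5 * (dist / h) ** 2)
            y0_hat = (w * Yc).sum(axis=1) / w.sum(axis=1)
        return (Y[D == 1] - y0_hat).mean()

    rng = np.random.default_rng(7)
    N = 2_000
    X = rng.normal(0, 1, N)
    D = (rng.uniform(size=N) < 1 / (1 + np.exp(-X))).astype(int)
    Y = np.sin(X) + D * 1.0 + rng.normal(0, 0.5, N)      # constant gain of 1
    print(att_matching(Y, D, X), att_matching(Y, D, X, h=0.1))   # both ~ 1.0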


Methods based on the propensity score

• Rosenbaum and Rubin called

    p(X) = Pr(D = 1 | X)

the "propensity score" and proved that if (Y1, Y0) ⊥ D | X then

    (Y1, Y0) ⊥ D | p(X)

provided 0 < p(X) < 1 for all X.

• We want to prove that provided (Y1, Y0) ⊥ D | X then Pr(D = 1 | Y1, Y0, p(X)) = Pr(D = 1 | p(X)) ≡ p(X). Using the law of iterated expectations:

    E(D | Y1, Y0, p(X)) = E[E(D | Y1, Y0, X) | Y1, Y0, p(X)]
                        = E[E(D | X) | Y1, Y0, p(X)] = p(X).

• The result tells us that we can match units with very different values of X as long as they have similar values of p(X).

• These results suggest two-step procedures in which we begin by estimating the propensity score.


Weighting on the propensity score

• Under unconditional independence

    α_ATE = E(Y | D = 1) − E(Y | D = 0) = E(DY)/Pr(D = 1) − E[(1 − D)Y]/Pr(D = 0).

• Similarly, under conditional independence

    E(Y1 − Y0 | X) = E(Y | D = 1, X) − E(Y | D = 0, X)
                   = E(DY | X)/Pr(D = 1 | X) − E[(1 − D)Y | X]/Pr(D = 0 | X)
                   = E( DY/p(X) − (1 − D)Y/[1 − p(X)] | X )

so that

    α_ATE = E( DY/p(X) − (1 − D)Y/[1 − p(X)] ) = E( Y[D − p(X)] / {p(X)[1 − p(X)]} ).

• A simple estimator is

    α̂_ATE = (1/N) ∑_{i=1}^N ( D_iY_i/p̂(X_i) − (1 − D_i)Y_i/[1 − p̂(X_i)] )

where p̂(X_i) is an estimator of the propensity score.
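A sketch of the weighting estimator with a logit propensity score fitted by Newton's method (kept dependency-free; the specification of p(X) and all values are illustrative):

    import numpy as np

    rng = np.random.default_rng(8)
    N = 50_000
    X = rng.normal(0, 1, N)
    D = (rng.uniform(size=N) < 1 / (1 + np.exp(-X))).astype(float)
    Y = X + D * 2.0 + rng.normal(0, 1, N)                # true ATE = 2

    Xd = np.column_stack([np.ones(N), X])                # logit design matrix
    g = np.zeros(2)
    for _ in range(25):                                  # Newton steps for the logit MLE
        p = 1 / (1 + np.exp(-Xd @ g))
        H = Xd.T @ (Xd * (p * (1 - p))[:, None])
        g += np.linalg.solve(H, Xd.T @ (D - p))
    p_hat = 1 / (1 + np.exp(-Xd @ g))

    ate_hat = np.mean(D * Y / p_hat - (1 - D) * Y / (1 - p_hat))
    print(ate_hat)                                       # ~ 2.0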


Differences between matching and OLS

• Matching avoids functional form assumptions and emphasizes the common support condition.

• Matching focuses on a single parameter at a time, which is obtained through explicit aggregation.

The requirement of random variation in outcomes

• Matching works on the presumption that for X = x there is random variation in D, so that we can observe both Y1 and Y0. It fails if D is a deterministic function of X.

• There is a tension between the thought that if X is good enough then there may not be within-cell variation in D, and the suspicion that seeing enough variation in D for given X is an indication that exogeneity is at fault.


Instrumental variables

1. Instrumental variable assumptions

• Suppose we have non-experimental data with covariates, but cannot assume conditional independence as in matching:

    (Y1, Y0) ⊥ D | X.

• Suppose, however, that we have a variable Z that is an "exogenous source of variation in D" in the sense that it satisfies the independence assumption:

    (Y1, Y0) ⊥ Z | X

and the relevance assumption:

    Z ⊥̸ D | X.

• Matching can be regarded as a special case of IV in which Z = D, i.e. all variation in D is exogenous given X.


2. Instrumental-variable examples

Example 1: Non-compliance in randomized trials

• In a classic example, Z indicates assignment to treatment in an experimental design. Therefore, (Y1, Y0) ⊥ Z.

• However, "actual treatment" D differs from Z because some individuals in the treatment group decide not to treat (non-compliers). Z and D will be correlated in general.

• Assignment to treatment is not a valid instrument in the presence of externalities that benefit members of the treatment group even if they are not treated themselves. In such a case the exclusion restriction fails to hold.

• An example of this situation arises in a study of the effect of deworming on school participation in Kenya using school-level randomization (Miguel and Kremer, Econometrica, 2004).


Example 2: Ethnic enclaves and immigrant outcomes

• Interest in the effect of living in a highly concentrated ethnic area on labor market success. In Sweden 11% of the population was born abroad. Of those, more than 40% live in an ethnic enclave (Edin, Fredriksson and Åslund, QJE, 2003).

• The causal effect is ambiguous. Residential segregation lowers the acquisition rate of local skills, preventing access to good jobs. But enclaves act as opportunity-increasing networks by disseminating information to new immigrants.

• Immigrants in ethnic enclaves have 5% lower earnings, after controlling for age, education, gender, family background, country of origin, and year of immigration.

• But this association may not be causal if the decision to live in an enclave depends on expected opportunities.

• Swedish governments of 1985-1991 assigned initial areas of residence to refugee immigrants, motivated by the belief that dispersing immigrants promotes integration.

• Let Z indicate initial assignment (8 years before measuring the ethnic enclave indicator D). Edin et al. assumed that Z is independent of potential earnings Y0 and Y1.

• IV estimates implied a 13% gain for low-skill immigrants associated with a one standard deviation increase in ethnic concentration. For high-skill immigrants there was no effect.


Example 3: Vietnam veterans and civilian earnings

• Did military service in Vietnam have a negative effect on earnings? (Angrist, 1990).

• Here we have:
  • Instrumental variable: draft lottery eligibility.
  • Treatment variable: veteran status.
  • Outcome variable: log earnings.
  • Data: N = 11637 white men born 1950-1953.
  • March Population Surveys of 1979 and 1981-1985.

• This lottery was conducted annually during 1970-1974. It assigned numbers (from 1 to 365) to dates of birth in the cohorts being drafted. Men with the lowest numbers were called to serve, up to a ceiling determined every year by the Department of Defense.

• Abadie (2002) uses as instrument an indicator for lottery numbers lower than 100.

• The fact that draft eligibility affected the probability of enrollment, along with its random nature, makes this variable a good candidate to instrument "veteran status".

• There was a strong selection process in the military during the Vietnam period. Some volunteered, while others avoided enrollment using student or job deferments.

• Presumably, enrollment was influenced by future potential earnings.


3. Identification of causal effects in IV settings

• The question is whether the availability of an instrumental variable identifies causal effects. To answer it, I consider a binary Z, and abstract from conditioning.

Homogeneous effects

• If the causal effect is the same for every individual,

    Y1_i − Y0_i = α,

the availability of an IV allows us to identify α. This is the traditional situation in econometric models with endogenous explanatory variables.

• In the homogeneous case

    Y_i = Y0_i + (Y1_i − Y0_i)D_i = Y0_i + αD_i.

• Also, taking into account that Y0_i ⊥ Z_i,

    E(Y_i | Z_i = 1) = E(Y0_i) + αE(D_i | Z_i = 1)
    E(Y_i | Z_i = 0) = E(Y0_i) + αE(D_i | Z_i = 0).

• Subtracting both equations we obtain

    α = [E(Y_i | Z_i = 1) − E(Y_i | Z_i = 0)] / [E(D_i | Z_i = 1) − E(D_i | Z_i = 0)],

which determines α as long as

    E(D_i | Z_i = 1) ≠ E(D_i | Z_i = 0).

• We get the effect of D on Y through the effect of Z, because Z only affects Y through D.
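The Wald ratio in code (a sketch with a binary instrument and an endogenous treatment; the design and true α = 2 are illustrative):

    import numpy as np

    rng = np.random.default_rng(9)
    N = 100_000
    Z = rng.integers(0, 2, N)                            # binary instrument
    v = rng.normal(0, 1, N)
    D = ((0.8 * Z + v) > 0.6).astype(float)              # treatment driven by Z and v
    Y0 = 1.0 + v + rng.normal(0, 1, N)                   # Y0 depends on v: D is endogenous
    Y = Y0 + 2.0 * D                                     # homogeneous effect alpha = 2

    wald = ((Y[Z == 1].mean() - Y[Z == 0].mean())
            / (D[Z == 1].mean() - D[Z == 0].mean()))
    naive = Y[D == 1].mean() - Y[D == 0].mean()          # biased upward by selection on v
    print(wald, naive)                                   # ~ 2.0 vs. noticeably above 2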


Heterogeneous effects

Summary

• In the heterogeneous case the availability of IVs is not sufficient to identify a causal effect.

• An additional assumption that helps identification of causal effects is the following "monotonicity" condition: any person that was willing to treat if assigned to the control group would also be prepared to treat if assigned to the treatment group.

• The plausibility of this assumption depends on the context of application.

• Under monotonicity, the IV coefficient coincides with the average treatment effect for those whose value of D would change when changing the value of Z (local average treatment effect or LATE).


Indicator of potential treatment status

• In preparation for the discussion below let us introduce the following notation:

    D = D0 if Z = 0,    D = D1 if Z = 1.

• Given data on (Z, D) there are 4 observable groups but 8 underlying groups, which can be classified as never-takers, compliers, defiers, and always-takers.

Example

• Consider two levels of schooling (D = 0, 1, high school and college) with associated potential wages (Y0, Y1), so that individual returns are Y1 − Y0. Also consider an exogenous determinant of schooling Z with associated potential schooling levels (D0, D1). The IV Z is exogenous in the sense that it is independent of (Y0, Y1, D0, D1).

• An example of Z is proximity to college:
  • Z = 0: college far away.
  • Z = 1: college nearby.

• Defier with D = 1, Z = 0 (i.e. D1 = 0): a person who goes to college when it is far away but would not go if it were near.

• Defier with D = 0, Z = 1 (i.e. D0 = 1): a person who does not go to college when it is near but would go if it were far.


Table 1. Observable and latent types

              Z   D   D0       D1
    Type 1    0   0   0        0 or 1    Type 1A (D1 = 0): never-taker;  Type 1B (D1 = 1): complier
    Type 2    0   1   1        0 or 1    Type 2A (D1 = 0): defier;       Type 2B (D1 = 1): always-taker
    Type 3    1   0   0 or 1   0         Type 3A (D0 = 0): never-taker;  Type 3B (D0 = 1): defier
    Type 4    1   1   0 or 1   1         Type 4A (D0 = 0): complier;     Type 4B (D0 = 1): always-taker


Availability of an IV is not sufficient by itself to identify causal effects

• Note that since

    E(Y | Z = 1) = E(Y0) + E[(Y1 − Y0)D1]
    E(Y | Z = 0) = E(Y0) + E[(Y1 − Y0)D0],

we have

    E(Y | Z = 1) − E(Y | Z = 0) = E[(Y1 − Y0)(D1 − D0)]
        = E(Y1 − Y0 | D1 − D0 = 1) Pr(D1 − D0 = 1) − E(Y1 − Y0 | D1 − D0 = −1) Pr(D1 − D0 = −1).

• E(Y | Z = 1) − E(Y | Z = 0) could be negative and yet the causal effect be positive for everyone, as long as the probability of defiers is sufficiently large.


Additional assumption: eligibility rules

• An additional assumption that helps to identify α_TT is an eligibility rule of the form:

    Pr(D = 1 | Z = 0) = 0,

i.e. individuals with Z = 0 are denied treatment.

• In this situation:

    E(Y | Z = 1) = E(Y0) + E[(Y1 − Y0)D | Z = 1]
                 = E(Y0) + E(Y1 − Y0 | D = 1, Z = 1)E(D | Z = 1)

and since E(D | Z = 0) = 0,

    E(Y | Z = 0) = E(Y0) + E(Y1 − Y0 | D = 1, Z = 0)E(D | Z = 0) = E(Y0).

• Therefore,

    Wald parameter ≡ [E(Y | Z = 1) − E(Y | Z = 0)] / E(D | Z = 1) = E(Y1 − Y0 | D = 1, Z = 1).

• Moreover,

    α_TT ≡ E(Y1 − Y0 | D = 1) = E(Y1 − Y0 | D = 1, Z = 1).

This is so because Pr(Z = 1 | D = 1) = 1. That is,

    E(Y1 − Y0 | D = 1) = E(Y1 − Y0 | D = 1, Z = 1) Pr(Z = 1 | D = 1)
                        + E(Y1 − Y0 | D = 1, Z = 0)[1 − Pr(Z = 1 | D = 1)].

• Thus, if Pr(D = 1 | Z = 0) = 0 the IV coefficient coincides with the average treatment effect on the treated.


4. Local average treatment effects (LATE)

Monotonicity and LATEs

• If we rule out defiers, i.e. Pr(D1 − D0 = −1) = 0, we have

    E(Y | Z = 1) − E(Y | Z = 0) = E(Y1 − Y0 | D1 − D0 = 1) Pr(D1 − D0 = 1)

and

    E(D | Z = 1) − E(D | Z = 0) = E(D1) − E(D0) = Pr(D1 − D0 = 1).

• Therefore,

    E(Y1 − Y0 | D1 − D0 = 1) = [E(Y | Z = 1) − E(Y | Z = 0)] / [E(D | Z = 1) − E(D | Z = 0)].

• Imbens and Angrist called this parameter the "local average treatment effect" (LATE).

• Different IVs lead to different parameters, even under instrument validity, which is counter to standard GMM thinking.

• The policy relevance of a LATE parameter depends on the subpopulation of compliers defined by the instrument. The most relevant LATEs are those based on instruments that are policy variables (e.g. college fee policies or college creation).

• What happens if there are no compliers? In the absence of defiers, the probability of compliers satisfies

    Pr(D1 − D0 = 1) = E(D | Z = 1) − E(D | Z = 0).

So lack of compliers implies lack of instrument relevance, hence underidentification.


Distributions of potential wages for compliers

• Imbens and Rubin (1997) showed that under monotonicity not only the average treatment effect for compliers is identified but also the entire marginal distributions of Y0 and Y1 for compliers.

• Abadie (2002) gives a simple proof that suggests a Wald calculation. For any function h(·) let us consider

    W = h(Y)D = { W1 = h(Y1) if D = 1;  W0 = 0 if D = 0 }.

Because (W1, W0, D1, D0) are independent of Z, we can apply the LATE formula to W and get

    E(W1 − W0 | D1 − D0 = 1) = [E(W | Z = 1) − E(W | Z = 0)] / [E(D | Z = 1) − E(D | Z = 0)],

or substituting

    E(h(Y1) | D1 − D0 = 1) = [E(h(Y)D | Z = 1) − E(h(Y)D | Z = 0)] / [E(D | Z = 1) − E(D | Z = 0)].

• If we choose h(Y) = 1(Y ≤ r), the previous formula gives an expression for the cdf of Y1 for the compliers.


• Similarly, if we consider

    V = h(Y)(1 − D) = { V1 = h(Y0) if 1 − D = 1;  V0 = 0 if 1 − D = 0 },

then

    E(V1 − V0 | D1 − D0 = 1) = [E(V | Z = 1) − E(V | Z = 0)] / [E(1 − D | Z = 1) − E(1 − D | Z = 0)]

or

    E(h(Y0) | D1 − D0 = 1) = [E(h(Y)(1 − D) | Z = 1) − E(h(Y)(1 − D) | Z = 0)] / [E(1 − D | Z = 1) − E(1 − D | Z = 0)],

from which we can get the cdf of Y0 for the compliers, again setting h(Y) = 1(Y ≤ r).

• To see the intuition, suppose D is exogenous (Z = D); then the cdf of Y | D = 0 coincides with the cdf of Y0, and the cdf of Y | D = 1 coincides with the cdf of Y1.

• If we regress h(Y)D on D, the OLS regression coefficient is

    E[h(Y)D | D = 1] − E[h(Y)D | D = 0] = E[h(Y1)],

which for h(Y) = 1(Y ≤ r) gives us the cdf of Y1.

• Similarly, if we regress h(Y)(1 − D) on (1 − D), the regression coefficient is

    E[h(Y)(1 − D) | 1 − D = 1] − E[h(Y)(1 − D) | 1 − D = 0] = E[h(Y0)].

• In the IV case, we are running similar IV (instead of OLS) regressions using Z as an instrument, obtaining the expected h(Y1) and h(Y0) for compliers.
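A simulated check of these Wald-type formulas for the complier cdfs (an illustrative population with always-takers, compliers, and never-takers only, so monotonicity holds by construction):

    import numpy as np

    rng = np.random.default_rng(10)
    N = 200_000
    Z = rng.integers(0, 2, N)
    t = rng.uniform(size=N)                              # latent type, no defiers
    always, complier = t < 0.2, (t >= 0.2) & (t < 0.6)
    D = np.where(always, 1, np.where(complier, Z, 0)).astype(float)
    Y0 = rng.normal(0, 1, N)
    Y1 = Y0 + np.where(complier, 1.0, 0.5)               # complier gain = 1
    Y = np.where(D == 1, Y1, Y0)

    den1 = D[Z == 1].mean() - D[Z == 0].mean()           # Pr(complier) ~ 0.4
    den0 = (1 - D)[Z == 1].mean() - (1 - D)[Z == 0].mean()

    def complier_cdfs(r):
        h = (Y <= r).astype(float)
        F1 = ((h * D)[Z == 1].mean() - (h * D)[Z == 0].mean()) / den1
        F0 = ((h * (1 - D))[Z == 1].mean() - (h * (1 - D))[Z == 0].mean()) / den0
        return F0, F1

    print(complier_cdfs(0.0))    # F0(0) ~ 0.50; F1(0) ~ 0.16 (= Phi(-1)) for compliers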


Conditional estimation with instrumental variables

• So far we abstracted from the fact that the validity of the instrument may only be conditional on X: it may be that (Y0, Y1) ⊥ Z does not hold, but the following does:

    (Y0, Y1) ⊥ Z | X  (conditional independence)
    Z ⊥̸ D | X  (conditional relevance).

• For example, in the analysis of returns to college where Z is an indicator of proximity to college: the problem is that Z is not randomly assigned but chosen by parents, and this choice may depend on characteristics that subsequently affect wages. The validity of Z may be more credible given family background variables X.

• In a linear version of the problem:
  • First stage: regress D on Z and X → get D̂.
  • Second stage: regress Y on D̂ and X.

• In general we now have a conditional LATE given X:

    γ(X) = E(Y1 − Y0 | D1 ≠ D0, X).

• On the other hand, we have conditional IV estimands:

    β(X) = [E(Y | Z = 1, X) − E(Y | Z = 0, X)] / [E(D | Z = 1, X) − E(D | Z = 0, X)].


• What is the relevant aggregate effect? If the treatment effect is homogeneous given X,

    Y1 − Y0 = β(X),

then a parameter of interest is:

    E[β(X)] = ∫ β(X) dF(X).

• However, in the case of heterogeneous effects, it makes sense to consider an average treatment effect for the overall subpopulation of compliers:

    β_C = ∫ β(X) dF(X | compliers).

• Calculating β_C appears problematic because F(X | compliers) is unobservable, but

    β_C = ∫ β(X) [Pr(compliers | X) / Pr(compliers)] dF(X)
        = ∫ [E(Y | Z = 1, X) − E(Y | Z = 0, X)] [1 / Pr(compliers)] dF(X)

where

    Pr(compliers) = ∫ [E(D | Z = 1, X) − E(D | Z = 0, X)] dF(X).

• Therefore,

    β_C = ∫ [E(Y | Z = 1, X) − E(Y | Z = 0, X)] dF(X) / ∫ [E(D | Z = 1, X) − E(D | Z = 0, X)] dF(X),

which can be estimated as a ratio of matching estimators.
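A sketch of β_C as a ratio of cell-mean (matching) estimators for a binary X (an illustrative design in which both the complier share and the complier effect vary with X):

    import numpy as np

    rng = np.random.default_rng(11)
    N = 200_000
    X = rng.integers(0, 2, N)
    Z = rng.integers(0, 2, N)                            # valid instrument given X
    complier = rng.uniform(size=N) < 0.4 + 0.3 * X       # complier share varies with X
    D = np.where(complier, Z, 0).astype(float)           # the rest are never-takers
    Y = X + rng.normal(0, 1, N) + D * (1.0 + X)          # complier effect = 1 + X

    num = den = 0.0
    for x in (0, 1):
        m = X == x
        w = m.mean()                                     # dF(X)
        num += w * (Y[m & (Z == 1)].mean() - Y[m & (Z == 0)].mean())
        den += w * (D[m & (Z == 1)].mean() - D[m & (Z == 0)].mean())
    print(num / den)                                     # ~ (0.2*1 + 0.35*2)/0.55, about 1.64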