ESTIMATING AVERAGE TREATMENT EFFECTS: IV AND CONTROL...

ESTIMATING AVERAGE TREATMENT EFFECTS: IV ANDCONTROL FUNCTIONS

Jeff WooldridgeMichigan State University

BGSE/IZA Course in MicroeconometricsJuly 2009

1. Introduction2. Homogeneous Treatment Effect and Treatment Effects aFunction of Obervables3. Heterogeneous Treatment Effects and LATE4. Control Function Methods for Heterogeneous TreatmentEffects5. A Different Approach to Allowing Heterogeneous TreatmentEffects6. Nonlinear Response Models

1

1. Introduction

∙ Consider now the case where unconfoundedness, or selection on

observables, does not hold. Allow units to select into treatment based

on unobservables that affect the response.

∙ Even if eligibility is randomly assigned, actual enrollment in

programs may suffer from self selection. But randomized eligibility can

often be used as an instrumental variables.

∙ Identification and estimation are very simple in the constant treatment

effect case. Contemporary interest is often in the heterogeneous

treatment effect case.

2

∙ Three general approaches: (i) Study simple IV estimators to see if

they estimate anything of value under fairly weak assumptions; (ii) Add

more assumptions and try to estimate a parameter such as ate; (iii) Use

“structural” economic models without parametric assumptions, and try

to identify interesting policy or structural parameters.

3

2. Homogeneous Treatment Effect and Treatment Effects aFunction of Obervables

∙ In the simplest case we assume Y1 − Y0 is constant. Then

Y Y0 WY1 − Y0 Y0 W≡ 0 W V0

where 0 ≡ EY0 and V0 ≡ Y0 − 0. So, we have a simple

regression model with a constant slope.

4

∙ Suppose we have a single instrumental variable, Z. (At this point, Z

could be binary, continuous, or anything in between.) Recall the two

requirments for an instrument:

CovZ,W ≠ 0CovZ,V0 CovZ,Y0 0

5

∙ The first requirement (relevance) is fairly easily tested checked. The

second (exogeneity) is maintained. Note that having Z predetermined or

randomly assigned does not guarantee its exogeneity. The value of Z

could affect Y0 through other channels.

∙ Conveniently, the nature of Y is not restricted, but the treatment effect

is assumed constant. ate att is consistently estimated by the IV

estimator. [Remember, IV not usually unbiased.]

∙ Even if we restrict attention to consistency, it is not true that one

should use a “slightly” endogenous instrument rather than OLS.

6

∙Why? It is easy to show:

plim OLS V0W CorrW,V0

and

plim IV V0W

CorrZ,V0CorrZ,W

So, if CorrZ,W is small – that is, Z is a “weak” instrument – then

even a small correlation between Z and V0 can produce a larger

asymptotic bias than OLS.

7

∙ In economics, very common to see IV estimates that are larger in

magnitude than OLS estimates. Usually, other explanations are given

other than bad IV. Measurement error and heterogeneous treatment

effect (later) are among them.

∙Weak instruments lead to large asymptotic standard errors, too:

Avar N IV − V0

2

W2 Z,W2 .

When Z,W2 is small, the asymptotic variance can be very large. The

formula for the OLS estimator omits Z,W2 .

8

Adding Other Covariates and Instruments

∙ Suppose we need to add covariates before our instruments are

appropriately exogenous:

Yg g x − xg ug,g 0,1

where g EYg and x Ex.

∙We have made a functional form assumption once we assume Vg has

a zero mean, conditional on x. We also assume that netting out x makes

z exogenous. Easiest way to impose both:

Eug|x,z 0,g 0,1.

9

∙ In effect, we have made the standard exclusion restrictions plus a

linear functional form. Write

Y Y0 WY1 − Y0 0 x − x0 u0

W1 − 0 Wx − x Wu1 − u0

0 W x0 Wx − x u0 Wu1 − u0

where 1 − 0 and 0 0 − x0.

∙ The last term – the interaction of the treatment and the unobserved

gain from treatment, e ≡ u1 − u0, causes difficulties.

10

∙ In the simplest case, there is no heterogeneity except in the intercept:

0,u1 u0.Then

Y 0 W x0 u0,

and we can estimate the parameters by 2SLS using instruments 1,x,z.

∙ How might we exploit the binary nature of W?

1. (Not Recommended): Take the expected value conditional on all

exogenous variables:

EY|x,z 0 EW|x,z x0 Eu0|x,z

0 PW 1|x,z x0

because W is binary.

11

∙ Two-step procedure: (i) Estimate PW 1|x,z by probit (or logit)

and obtain the first-stage fitted probabilities, say

i 0 xi1 zi2. Should be able to reject 2 0 fairly

strongly. (ii) Use the i in place of Wi:

Yi on 1, i,xi, i 1, . . . ,N.

12

∙Why not recommended? (a) Inconsistent generally if model (probit in

this case) for PW 1|x,z is wrong. Then

EY|x,z ≠ 0 0 x1 z2 x0; (b) Need to adjust standard

errors for “generated regressor”; (c) No known efficiency properties;

(d) Tempting to achieve identification off of the nonlinearity in the

probit in the absense of instruments.

13

2. (Recommended): Use i as an IV, not a regressor. That is, estimate

Yi 0 Wi xi0 ui0,

by IV using instruments 1, i,xi.

∙ Not the same as using i as a regressor! The first stage when i is

used as an IV is

Ŵ i 0 1i xi2

and 0 ≠ 0, 1 ≠ 1, 2 ≠ 0.

14

∙Why recommended?

(a) Misspecification of probit model does not matter, providedW is

partially correlated with 0 x1 z2! So this method, like using

1,xi,zi as instruments, is robust to having the model for

PW 1|x,z wrong.

(b) No need to account for generated instruments in standard errors –

see Wooldridge (2002, Chapter 6).

(c) Estimator is efficient IV estimator if Varu0|x,z Varu0 and

probit model for W is correct.

15

∙ As a practical matter, there might not be much difference in the point

estimates between the approaches. Certainly not if the coefficients in

the regression Wi on 1, i,xi are close to 0,1,0. But using the fitted

values as a regressor has no advantages.

∙ Easy to modify to allow treatment to interact with x. After first stage

probit, estimate

Yi 0 Wi xi0 Wixi − x errori

by IV, using instruments 1, i,xi, i xi − x. Benefits same as

without interaction.

∙ Can safely ignore estimation error in sample mean, x.

16

∙ Note we have now allowed heterogeneous treatment effects but only

as a function of x. In particular,

x x − x,

and we can insert different values for x to see how the ATE varies with

observed characteristics.

17

3. Heterogeneous Treatment Effects and LATE

The Local Average Treatment Effect

∙What does IV estimate, in general, if the gain from treatment,

Y1 − Y0, is not constant? Imbens and Angrist (1994): Now

treatment is counterfactual, too. Let Z be the binary instrumental

variable, which is often randomized eligibility, so that Z is zero or one.

Let W0 be treatment status if Z 0 and let W1 be treatment status if

Z 1. (Some eligibles may not choose to participate; some

non-eligibles may find other ways to “participate.”)

18

∙We only observe one of the counterfactual treatments:

W 1 − ZW0 ZW1

Now write

Y 1 − WY0 WY1 Y0 W0Y1 − Y0 ZW1 − W0Y1 − Y0

19

Assumptions:

LATE.1: Z is independent of W0,W1,Y0,Y1

LATE.2: PW 1|Z 1 ≠ PW 1|Z 0

LATE.3: W1 ≥ W0 (no defiers, or monotonicity)

20

∙ This last condition means that if a unit would participates in the

program when not eligible, they would participate if made eligible.

(Vietnam draft lottery: a defier would be someone who would serve if

not drafted but would not serve if drafted. LATE.3 rules this out.)

∙ Imbens and Angrist define the local average treatment effect

(LATE) as

LATE EY1 − Y0|W1 − W0 1

which is the average treatment effect for those induced into treatment

by assignment. (That is, W0 0 and W1 1.)

21

∙ Imbens and Angrist call the subpopulation withW1 1, W0 0

the compliers. These individuals comply with assignment no matter

what their eligibility. That is, if they are assigned to the control group,

they would not participate, and if they are assigned to the treatment,

they do participate.

∙ One criticism of LATE is that it is a treatment effect for an

unidentifiable segment of the population: we do not know, on the basis

of observing Yi,Wi,Zi whether unit i is a complier.

22

∙ Consequently, LATE may not have “external validity,” that is, it may

not apply to evaluations of other programs even if the population is the

same.

∙ A related point is that two different binary instruments lead to

different groups of compliers in the same population. For example, if Y

is log of earnings, W is college attendance, and Z is living near a

college, the compliers are those who would attend college if one is

nearby, but not otherwise. But if Z is whether a grant is offered, the

compliers consist of those who would not attend college if they do not

receive a grant, but would attend if they do.

23

∙ The key result of IA is that, under the three LATE assumptions, the

usual IV estimator consistently estimates LATE. To see this, use the

independence assumption to write

EY|Z EY0 EW0Y1 − Y0 ZEW1 − W0Y1 − Y0

so that

EY|Z 1 − EY|Z 0 EW1 − W0Y1 − Y0.

24

∙ Because PW1 − W0 −1 0, W1 − W0 is a binary

variable, and

PW1 − W0 1 EW1 − W0 EW1 − EW0.

∙ But, again, Z is independent of W0,W1, and so

EW|Z 1 − ZEW0 ZEW1

which implies

EW1 − EW0 EW|Z 1 − EW|Z 0 PW 1|Z 1 − PW 1|Z 0 ≠ 0

by LATE.2.

26

∙We have shown that

EY1 − Y0|W1 − W0 1 EY|Z 1 − EY|Z 0PW 1|Z 1 − PW 1|Z 0 ,

and this is the probability limit of the simple IV estimator when Z and

W are both binary. In fact, the IV estimate is just the Wald estimate,

LATE N1−1∑i1

N ZiYi − N0−1∑i1

N 1 − ZiYiN1−1∑i1

N ZiWi − N0−1∑i1

N 1 − WiYi,

where N1 ∑i1N Zi and N0 N − N1. (Sometimes called a “grouping

estimator.”)

27

∙ This interpretation of IV has had wide influence in the treatment

effect literature. It is probably relied on too much to explain why IV

estimate is larger than OLS. Sometimes one forgets the possibility that

the IV is not exogenous.

∙ Literature is vast on allowing covariates, x, in the analysis and when z

is not a binary scalar. Flavor of results carries through.

28

4. Control Function Methods for Heterogeneous TreatmentEffects

∙ Suppose we think there are heterogeneous treatment effects but are

not satisfied with estimating LATE (or possibly the necessary

assumptions do not hold). The “old-fashioned” approach is to add more

assumptions and apply control function methods. This leads to the

so-called “switching regression” model, which is the same as a random

coefficient model.

∙ Recall that if Yg g xg ug we can write

Y 0 W x0 W x − x u0 W e

where e u1 − u0. Note that the ATE is still the coefficient on W.

29

∙We have to take seriously the underlying functional forms,

EYg|x g xg, and we will further restrict the conditional

distribution of Yg. In effect, this setup applies to Yg continuous.

(Contrast the approaches based on unconfoundedness, particularly PS

weighting and matching.)

∙ Even if we add the exogeneity assumptions

Eu0|x,z Ee|x,z 0,

it is generally not true that

30

EW e|x,z ≠ EW e.

and this is the condition for earlier IV estimators to be consistent for

if the interaction term W e is in the error term – see Wooldridge

(2003, Economics Letters).

∙ As it turns out, this condition can hold for continuous treatments, in

which case only 0 is estimated inconsistently. The IV estimator is

consistent for ate.

∙ For common binary response models for W, this mean independence

assumption is known not to hold.

31

∙ Instead, use a control function approach, that is, find EY|W,x,z:

EY|W,x,z 0 W x0 W x − x

Eu0|W,x,z WEe|W,x,z

∙We can estimate the expectations if we modelW. Probit is

convenient, and we now have to take the model seriously.

W 10 x1 z2 c ≥ 0

and

u0,u1,c|x,z ~ Multivariate Normal

∙ Can relax this a bit: Eu0|c,x,z and Eu1|c,x,z linear in c (and do

not depend on x,z) along with Dc|x,z standard normal.

32

∙ Expectations have the well-known “inverse Mills ratio” form:

EY|W,x,z 0 W x0 W x − x

Wr/r 1 − Wr/1 − r

where is the standard normal pdf and is the standard normal

cdf and r 1,x,z.

∙ This is an estimating equation for , and , too, when we want

x x − x.

33

∙ Two-Step Method (Heckman, but for “switching regression,” or

“endogenous treatment,” not sample selection):

(i) As before Run probit of Wi on 1,xi,zi and obtain i ri and

i ri.

(2) Run the OLS regression

Yi on 1,Wi,xi,Wi xi − x,Wi i/i, 1 − Wi i/1 − i

∙ There is a “generated regressor” problem. Need to adjust standard

errors and inference for first-stage estimation. Can use “heckit” in

Stata applied to treated and nontreated.

34

∙ If, say, we estimate the coefficients 0, 0′′ and 1, 1

′′ using

heckit on the Wi 0 and Wi 1 subgroups, important to form

x 1 − 0 x1 − 0

and, as a special case,

ate 1 − 0 x1 − 0.

It makes no sense to just look at 1 − 0 unless 1 0 has been

imposed or x is, coincidentally, close to zero.

35

∙ The ATEs are never obtained by looking at

ÊYi|Wi 1,xi,zi − ÊYi|Wi 0,xi,zi and averaging these

differences across i. We already know how and x depend on the

parameters. They do not depend on z!

∙ If the ATEs were based on EY|W,x,z there would be no

identification problem and never any functional form restrictions

because we could always estimate EY|W,x,z nonparametrically.

∙ Again, we use the expression for EY|W,x,z as an estimating

equation, not for computing treatment effects.

∙Whether separate estimations are carried out or the pooled regression

is used, can use the bootstrap for inference.

36

∙ The Stata command “treatreg” (typically using MLE, but sometimes

using the CF approach) does not apply to the full switching regression

case. In fact, it seems to apply only to the simple model with

homogeneous effect,

Y 0 W x0 u0,

in which case more robust IV methods are available. The CF regression

in this case is

Yi on 1,Wi,xi,Wi i/i 1 − Wi i/1 − i

∙ That is, treatreg imposes the restriction .

37

∙ As usual, need make sure z has at least one element that predicts

treatment once x is controlled for. (Use the first-stage probit.)

∙ In the pooled estimation in the general case, can use a joint test of the

two terms Wi i/i, 1 − Wi i/1 − i to test the null that W is

exogenous. Under the null, do not need to account for generated

regressors, so use a heteroskedasticity-robust Wald test.

38

∙ Extension of the usual Heckman approach. Suppose we allow lots of

heterogeneity in the two counterfactual equations:

Yig aig xibig g xig xidig uig

where aig,big is independent of xi,zi. We can write

Yi 0 Wi xi0 Wi xi − x ui0 Wi ei xidi0 Wixigi

where gi di1 − di0.

39

∙ Under sufficient normality and linear conditional expections of

heterogeneity given ci (the error in the probit for Wi) we get the control

function equation

EYi|Wi,xi,zi 0 Wi xi0 Wi xi − x

hWi,ri WihWi,ri hWi,rixi WihWi,rixi

where

hWi,ri Wiri − 1 − Wi−ri

40

∙ So the generalized residual from the probit, hWi,ri, gets added on

its own, interacted with Wi, interacted with xi, and interacted with

Wixi. (A total of 2K 1 terms.)

∙ The null that Wi is exogenous is a test that all terms have zero

coefficients. Usual heteroskedasticity-robust Wald test can be used.

∙ As usual, with a two-step estimation method, need to adjust standard

errors if conclude Wi is endogenous. (Bootstrap.)

41

∙ Even more flexibility: AllowWi to follow a heteroskedasticity probit,

that is

Wi 1ri exprioci ≥ 0

(where rio omits a constant) in which case the generalized residual is

hWi,ri,rio Wiexp−riori − 1 − Wi−exp−riori

∙ A lot of scope for studying robustness of findings even within

parametric models.

42

5. A Different Approach to Allowing HeterogeneousTreatment Effects

∙ Based on Wooldridge (2007, Advances in Econometrics). Go back to

Y 0 W x0 W x − x u0 W e

∙ Rather than find Eu0|W,x,z and Ee|W,x,z, now write

W e EW e|x,z a, where Ea|x,z 0 by definition. Then

Y 0 W x0 W x − x EW e|x,z u0 a,

where now the composite error, c u0 a, has zero mean conditional

on x,z.

43

∙ Turns out EW e|x,z is remarkably simple when W follows a probit:

EW e|x,z r 0 x1 z2

∙ The estimating equation is

Y 0 W x0 W x − x r q

Eq|x,z 0.

∙ But W is still endogenous in this equation, so we need an instrument.

(This is why the method differs from standard control function

approach.)

∙ The added regressor, r, is exogenous, and acts as its own

instrument. The natural instrument for W is r.

44

∙ Two-step procedure:

(i) As before, probit of W on 1,x,z. Now get i in addition to i.

(ii) Estimate

Yi 0 Wi xi0 Wi xi − x i errori

by IV, using instruments 1, i,xi, i xi − x, i.

∙ Generated regressor problem with i, so use delta method or

bootstrap. (Do not need to worry about generated instruments.)

45

∙ A test of whether W u1 − u0 is needed is just the usual

(heteroskedasticity-robust) t statistic on i, after IV estimation. Not a

test of endogeneity of W because W can still be correlated with u0.

∙ Drawback to this approach: Does not work with binary Z without

covariates: 0 2Z and 0 2Z are perfectly collinear when

Z is binary. Generally, might be sensitive to weak instruments. Need Z

to vary sufficiently.

∙ Control function approach can work with binary Z, at least in some

cases. But in CF case, test of interaction not robust to nonnormality

(endogeneity and functional form tied together).

46

∙ Can learn other things by studying the estimating equation. Assume

we have no covariates, so drop x and write

Y 0 W 0 2Z qEq|Z 0.

Via simulations, Angrist (1990, NBER Technical Working Paper)

studied the properties of the usual IV estimator in the equation without

0 2Z. In other words, the implicit error term is

0 2Z q. In general, the IV, Z (which is what Angrist used,

not 0 2Z), is correlated with 0 2Z. But not always. In

fact, in Angrist’s simulations, 0 2Z 2Z and Z has a

symmetric distribution about zero.

47

∙ Just like Z and Z2 are uncorrelated when Z is symmetrically

distributed about Z,

CovZ,2Z 0

(because is symmetric about zero). We already know

CovZ,q 0, and so Z is uncorrelated with 0 2Z q.

∙ Likely explains why Angrist’s simulation results for the simple IV

estimator look so promising for estimating .

48

∙ This method is likely less efficient than control function methods in

general. But it can be quite a bit easier with multiple treatment levels.

For example, suppose we replace W with a vector W1, . . . ,WJ. Then

the control function method requires computing the expectations

Eu0|W,x,z and Ee|W,x,z, which can be very difficult, whereas the

“correction function” method requires only calculation of EWje|x,z

for each j separately.

49

6. Nonlinear Response Models

∙What if Y is clearly a discrete outcome, such as a binary response?

Could take the Imbens and Angrist perspective and settle for

(hopefully) estimate LATE. But what if we want to estimate the ATE as

a function of covariates x?

∙ Unfortunately, no general, robust procedures. Joint MLE is the only

reliable method. That is, specify a model forW as a function of x,z

and also a model for Y.

50

∙ Some poor practices linger, especially in complicated discrete

response models (multinomial and conditional logit, for example). Even

if Y is binary, often see a two-step method applied.

∙ (1) Probit of Wi on xi,zi and get fitted values, i. (2) Probit of Yi

on i,xi.

∙ Coefficient on i is meaningless (except maybe for the sign). How do

we get the ATE from this? No current justification for these kinds of

approaches.

∙ Is it somehow possible that

51

∙ Can think of a similar Tobit method (when Y is a corner solution

response) that attempts to mimic “two-stage least squares.

52

∙What can we do that at least works under a set of assumptions?

∙ Suppose Yg are binary, and we are willing to assume, for g 0,1,

Yg 1g xg ug 0

ug|x,z ~ Normal0,1

∙We know that the ATE as a function of x is

x 1 x1 − 0 x0,

which is easy to estimate given estimates of the parameters.

53

∙ If we again assume

W 10 x1 z2 c ≥ 0

and

u0,u1,c|x,z ~ Multivariate Normal,

then estimation using standard software is straightforward: apply the

Heckman selection approach to the probit model for the control

Wi 0 and treated Wi 1 groups separately.

54

∙ In Stata, “heckprob.” This gives 0, 0 Wi 0, need instrument) and

1, 1 (Wi 1) and then we have

x 1 x1 − 0 x0,

which we can study at various values of x, or average across

subpopulations. The estimate of

ate Ex1 x1 − 0 x0

is

ate N−1∑i1

N

1 xi1 − 0 xi0.

55

∙ Simple test for null thatW is exogenous. Can show that the score

(Lagrange multiplier) test can be obtained as the following variable

addition test: (1) Probit of Wi on 1,xi,zi to get and the generalized

residual,

gri ≡ Wiri − 1 − Wi−ri

(2) Probit of Yi on 1,xi,Wi,gri and the usual MLE t statistic. Under the

null, no need to adjust the standard error for first-stage estimation.

∙ Cannot be theoretically justified as a correction under H1, but one

wonders for “local” forms of endogeneity.

56

∙ Rather than compute ate as earlier, can extend Blundell and Powell

(2003):

x N−1∑i1

N

x gri − x gri

where is the probit coefficient on Wi. Then average out across xi to

approximate ate. (Caution: I really have no idea if or when this works.)

57

∙ For testing the null, can do lots of different things. For example, with

multiple binary treatments Wi1, . . . ,WiG, can (assuming enough

instruments) compute a GR for eachWi1, then do probit

Yi on 1xi,Wi1, . . . ,WiG,gri1, . . . ,griG

and do a usual joint Wald test for joint significance of gri1, . . . ,griG.

Under null, they should be jointly significant.

58

∙ For a test, could use the Pearson (standardized) residual in place of

the generalized residual:

pri Wi − ri/ ri1 −ri

As with adding GR, might this approximate the average treatment

effect when averaged out of the second-stage probit?

∙ Recent work [Wooldridge (2008, manuscript)]: If Wi (scalar) is

endogenous and follows a probit, can use the same bivariate probit

(quasi-) log likelihood when Yi is a fractional response following a

probit mean function, without further retricting the distribution of Yi.

59

∙ If Yg is a corner (labor supply, charitable contributions) might

specify

Yg max0,g xg ug, g 1,2

and use the assumptions as above, except allow 02 and 1

2 as variances

of u0,u1. Under the multivariate normality of u0,u1,c (and being

independent of x,z), can use MLE. The estimated average treatment

effect as a function of x is

x 1 x1/1 11 x1/1

− 0 x0/0 00 x0/0

and average this across xi to estimate ate.

60

∙ For other functional forms, we can get by with fewer restrictions,

although often same parameters are imposed across regimes.

∙Write

EY|W,x,z, r1 exp x W r1,

where r1 is an unobservable. Terza’s (1998) control function method

based on

EY|W,x,z exp x WEexpr1|W,x,z.

61

∙When W follows a probit,

EY|W,x,z exp x WhW,0 x1 z2,

where

hW,0 x1 z2, exp2/2W r/z 1 − W1 − r/1 − r.

∙ Estimate using first stage probit, as before. Then, say, Poisson

QMLE with the above mean function to estimate , , and .

62

∙ Then the ATE as a function of x is estimated as

x exp x − exp x

∙What is the score test of H0 : 0 (W is exogenous)? Can show the

variable addition test is (1) Probit of Wi on 1,xi,zi, as usual; (2)

Poisson (say) regression of Yi on 1,xi,Wi, lgri where lgri is the log of

the usual generlized residual:

lgri Wi logri − 1 − Wi log−ri

63

∙ IV methods that work for any distribution of W are available

[Mullahy (1997)]. If

EY|W,x,z, r1 expx W r1,

and r1 is independent of z then

Eexp−x − WY|z Eexpr1|z 1,

where Eexpr1 1 is a normalization. The moment conditions are

Eexp−x − WY − 1|z 0.

64

ESTIMATING AVERAGE TREATMENT EFFECTS: IV AND CONTROL...

Documents

Transcript of ESTIMATING AVERAGE TREATMENT EFFECTS: IV AND CONTROL...