ESTIMATING AVERAGE TREATMENT EFFECTS: IV AND CONTROL...
Transcript of ESTIMATING AVERAGE TREATMENT EFFECTS: IV AND CONTROL...
ESTIMATING AVERAGE TREATMENT EFFECTS: IV ANDCONTROL FUNCTIONS
Jeff WooldridgeMichigan State University
BGSE/IZA Course in MicroeconometricsJuly 2009
1. Introduction2. Homogeneous Treatment Effect and Treatment Effects aFunction of Obervables3. Heterogeneous Treatment Effects and LATE4. Control Function Methods for Heterogeneous TreatmentEffects5. A Different Approach to Allowing Heterogeneous TreatmentEffects6. Nonlinear Response Models
1
1. Introduction
∙ Consider now the case where unconfoundedness, or selection on
observables, does not hold. Allow units to select into treatment based
on unobservables that affect the response.
∙ Even if eligibility is randomly assigned, actual enrollment in
programs may suffer from self selection. But randomized eligibility can
often be used as an instrumental variables.
∙ Identification and estimation are very simple in the constant treatment
effect case. Contemporary interest is often in the heterogeneous
treatment effect case.
2
∙ Three general approaches: (i) Study simple IV estimators to see if
they estimate anything of value under fairly weak assumptions; (ii) Add
more assumptions and try to estimate a parameter such as ate; (iii) Use
“structural” economic models without parametric assumptions, and try
to identify interesting policy or structural parameters.
3
2. Homogeneous Treatment Effect and Treatment Effects aFunction of Obervables
∙ In the simplest case we assume Y1 − Y0 is constant. Then
Y Y0 WY1 − Y0 Y0 W≡ 0 W V0
where 0 ≡ EY0 and V0 ≡ Y0 − 0. So, we have a simple
regression model with a constant slope.
4
∙ Suppose we have a single instrumental variable, Z. (At this point, Z
could be binary, continuous, or anything in between.) Recall the two
requirments for an instrument:
CovZ,W ≠ 0CovZ,V0 CovZ,Y0 0
5
∙ The first requirement (relevance) is fairly easily tested checked. The
second (exogeneity) is maintained. Note that having Z predetermined or
randomly assigned does not guarantee its exogeneity. The value of Z
could affect Y0 through other channels.
∙ Conveniently, the nature of Y is not restricted, but the treatment effect
is assumed constant. ate att is consistently estimated by the IV
estimator. [Remember, IV not usually unbiased.]
∙ Even if we restrict attention to consistency, it is not true that one
should use a “slightly” endogenous instrument rather than OLS.
6
∙Why? It is easy to show:
plim OLS V0W CorrW,V0
and
plim IV V0W
CorrZ,V0CorrZ,W
So, if CorrZ,W is small – that is, Z is a “weak” instrument – then
even a small correlation between Z and V0 can produce a larger
asymptotic bias than OLS.
7
∙ In economics, very common to see IV estimates that are larger in
magnitude than OLS estimates. Usually, other explanations are given
other than bad IV. Measurement error and heterogeneous treatment
effect (later) are among them.
∙Weak instruments lead to large asymptotic standard errors, too:
Avar N IV − V0
2
W2 Z,W2 .
When Z,W2 is small, the asymptotic variance can be very large. The
formula for the OLS estimator omits Z,W2 .
8
Adding Other Covariates and Instruments
∙ Suppose we need to add covariates before our instruments are
appropriately exogenous:
Yg g x − xg ug,g 0,1
where g EYg and x Ex.
∙We have made a functional form assumption once we assume Vg has
a zero mean, conditional on x. We also assume that netting out x makes
z exogenous. Easiest way to impose both:
Eug|x,z 0,g 0,1.
9
∙ In effect, we have made the standard exclusion restrictions plus a
linear functional form. Write
Y Y0 WY1 − Y0 0 x − x0 u0
W1 − 0 Wx − x Wu1 − u0
0 W x0 Wx − x u0 Wu1 − u0
where 1 − 0 and 0 0 − x0.
∙ The last term – the interaction of the treatment and the unobserved
gain from treatment, e ≡ u1 − u0, causes difficulties.
10
∙ In the simplest case, there is no heterogeneity except in the intercept:
0,u1 u0.Then
Y 0 W x0 u0,
and we can estimate the parameters by 2SLS using instruments 1,x,z.
∙ How might we exploit the binary nature of W?
1. (Not Recommended): Take the expected value conditional on all
exogenous variables:
EY|x,z 0 EW|x,z x0 Eu0|x,z
0 PW 1|x,z x0
because W is binary.
11
∙ Two-step procedure: (i) Estimate PW 1|x,z by probit (or logit)
and obtain the first-stage fitted probabilities, say
i 0 xi1 zi2. Should be able to reject 2 0 fairly
strongly. (ii) Use the i in place of Wi:
Yi on 1, i,xi, i 1, . . . ,N.
12
∙Why not recommended? (a) Inconsistent generally if model (probit in
this case) for PW 1|x,z is wrong. Then
EY|x,z ≠ 0 0 x1 z2 x0; (b) Need to adjust standard
errors for “generated regressor”; (c) No known efficiency properties;
(d) Tempting to achieve identification off of the nonlinearity in the
probit in the absense of instruments.
13
2. (Recommended): Use i as an IV, not a regressor. That is, estimate
Yi 0 Wi xi0 ui0,
by IV using instruments 1, i,xi.
∙ Not the same as using i as a regressor! The first stage when i is
used as an IV is
Ŵ i 0 1i xi2
and 0 ≠ 0, 1 ≠ 1, 2 ≠ 0.
14
∙Why recommended?
(a) Misspecification of probit model does not matter, providedW is
partially correlated with 0 x1 z2! So this method, like using
1,xi,zi as instruments, is robust to having the model for
PW 1|x,z wrong.
(b) No need to account for generated instruments in standard errors –
see Wooldridge (2002, Chapter 6).
(c) Estimator is efficient IV estimator if Varu0|x,z Varu0 and
probit model for W is correct.
15
∙ As a practical matter, there might not be much difference in the point
estimates between the approaches. Certainly not if the coefficients in
the regression Wi on 1, i,xi are close to 0,1,0. But using the fitted
values as a regressor has no advantages.
∙ Easy to modify to allow treatment to interact with x. After first stage
probit, estimate
Yi 0 Wi xi0 Wixi − x errori
by IV, using instruments 1, i,xi, i xi − x. Benefits same as
without interaction.
∙ Can safely ignore estimation error in sample mean, x.
16
∙ Note we have now allowed heterogeneous treatment effects but only
as a function of x. In particular,
x x − x,
and we can insert different values for x to see how the ATE varies with
observed characteristics.
17
3. Heterogeneous Treatment Effects and LATE
The Local Average Treatment Effect
∙What does IV estimate, in general, if the gain from treatment,
Y1 − Y0, is not constant? Imbens and Angrist (1994): Now
treatment is counterfactual, too. Let Z be the binary instrumental
variable, which is often randomized eligibility, so that Z is zero or one.
Let W0 be treatment status if Z 0 and let W1 be treatment status if
Z 1. (Some eligibles may not choose to participate; some
non-eligibles may find other ways to “participate.”)
18
∙We only observe one of the counterfactual treatments:
W 1 − ZW0 ZW1
Now write
Y 1 − WY0 WY1 Y0 W0Y1 − Y0 ZW1 − W0Y1 − Y0
19
Assumptions:
LATE.1: Z is independent of W0,W1,Y0,Y1
LATE.2: PW 1|Z 1 ≠ PW 1|Z 0
LATE.3: W1 ≥ W0 (no defiers, or monotonicity)
20
∙ This last condition means that if a unit would participates in the
program when not eligible, they would participate if made eligible.
(Vietnam draft lottery: a defier would be someone who would serve if
not drafted but would not serve if drafted. LATE.3 rules this out.)
∙ Imbens and Angrist define the local average treatment effect
(LATE) as
LATE EY1 − Y0|W1 − W0 1
which is the average treatment effect for those induced into treatment
by assignment. (That is, W0 0 and W1 1.)
21
∙ Imbens and Angrist call the subpopulation withW1 1, W0 0
the compliers. These individuals comply with assignment no matter
what their eligibility. That is, if they are assigned to the control group,
they would not participate, and if they are assigned to the treatment,
they do participate.
∙ One criticism of LATE is that it is a treatment effect for an
unidentifiable segment of the population: we do not know, on the basis
of observing Yi,Wi,Zi whether unit i is a complier.
22
∙ Consequently, LATE may not have “external validity,” that is, it may
not apply to evaluations of other programs even if the population is the
same.
∙ A related point is that two different binary instruments lead to
different groups of compliers in the same population. For example, if Y
is log of earnings, W is college attendance, and Z is living near a
college, the compliers are those who would attend college if one is
nearby, but not otherwise. But if Z is whether a grant is offered, the
compliers consist of those who would not attend college if they do not
receive a grant, but would attend if they do.
23
∙ The key result of IA is that, under the three LATE assumptions, the
usual IV estimator consistently estimates LATE. To see this, use the
independence assumption to write
EY|Z EY0 EW0Y1 − Y0 ZEW1 − W0Y1 − Y0
so that
EY|Z 1 − EY|Z 0 EW1 − W0Y1 − Y0.
24
Now
EW1 − W0Y1 − Y0
1 EY1 − Y0|W1 − W0 1PW1 − W0 1 0 EY1 − Y0|W1 − W0 0PW1 − W0 0
−1 EY1 − Y0|W1 − W0 −1PW1 − W0 −1 EY1 − Y0|W1 − W0 1PW
− EY1 − Y0|W1 − W0 −1PW1 − W0 −1 EY1 − Y0|W1 − W0 1PW
because PW1 − W0 −1 0 by the no defiers assumption.
25
∙ Because PW1 − W0 −1 0, W1 − W0 is a binary
variable, and
PW1 − W0 1 EW1 − W0 EW1 − EW0.
∙ But, again, Z is independent of W0,W1, and so
EW|Z 1 − ZEW0 ZEW1
which implies
EW1 − EW0 EW|Z 1 − EW|Z 0 PW 1|Z 1 − PW 1|Z 0 ≠ 0
by LATE.2.
26
∙We have shown that
EY1 − Y0|W1 − W0 1 EY|Z 1 − EY|Z 0PW 1|Z 1 − PW 1|Z 0 ,
and this is the probability limit of the simple IV estimator when Z and
W are both binary. In fact, the IV estimate is just the Wald estimate,
LATE N1−1∑i1
N ZiYi − N0−1∑i1
N 1 − ZiYiN1−1∑i1
N ZiWi − N0−1∑i1
N 1 − WiYi,
where N1 ∑i1N Zi and N0 N − N1. (Sometimes called a “grouping
estimator.”)
27
∙ This interpretation of IV has had wide influence in the treatment
effect literature. It is probably relied on too much to explain why IV
estimate is larger than OLS. Sometimes one forgets the possibility that
the IV is not exogenous.
∙ Literature is vast on allowing covariates, x, in the analysis and when z
is not a binary scalar. Flavor of results carries through.
28
4. Control Function Methods for Heterogeneous TreatmentEffects
∙ Suppose we think there are heterogeneous treatment effects but are
not satisfied with estimating LATE (or possibly the necessary
assumptions do not hold). The “old-fashioned” approach is to add more
assumptions and apply control function methods. This leads to the
so-called “switching regression” model, which is the same as a random
coefficient model.
∙ Recall that if Yg g xg ug we can write
Y 0 W x0 W x − x u0 W e
where e u1 − u0. Note that the ATE is still the coefficient on W.
29
∙We have to take seriously the underlying functional forms,
EYg|x g xg, and we will further restrict the conditional
distribution of Yg. In effect, this setup applies to Yg continuous.
(Contrast the approaches based on unconfoundedness, particularly PS
weighting and matching.)
∙ Even if we add the exogeneity assumptions
Eu0|x,z Ee|x,z 0,
it is generally not true that
30
EW e|x,z ≠ EW e.
and this is the condition for earlier IV estimators to be consistent for
if the interaction term W e is in the error term – see Wooldridge
(2003, Economics Letters).
∙ As it turns out, this condition can hold for continuous treatments, in
which case only 0 is estimated inconsistently. The IV estimator is
consistent for ate.
∙ For common binary response models for W, this mean independence
assumption is known not to hold.
31
∙ Instead, use a control function approach, that is, find EY|W,x,z:
EY|W,x,z 0 W x0 W x − x
Eu0|W,x,z WEe|W,x,z
∙We can estimate the expectations if we modelW. Probit is
convenient, and we now have to take the model seriously.
W 10 x1 z2 c ≥ 0
and
u0,u1,c|x,z ~ Multivariate Normal
∙ Can relax this a bit: Eu0|c,x,z and Eu1|c,x,z linear in c (and do
not depend on x,z) along with Dc|x,z standard normal.
32
∙ Expectations have the well-known “inverse Mills ratio” form:
EY|W,x,z 0 W x0 W x − x
Wr/r 1 − Wr/1 − r
where is the standard normal pdf and is the standard normal
cdf and r 1,x,z.
∙ This is an estimating equation for , and , too, when we want
x x − x.
33
∙ Two-Step Method (Heckman, but for “switching regression,” or
“endogenous treatment,” not sample selection):
(i) As before Run probit of Wi on 1,xi,zi and obtain i ri and
i ri.
(2) Run the OLS regression
Yi on 1,Wi,xi,Wi xi − x,Wi i/i, 1 − Wi i/1 − i
∙ There is a “generated regressor” problem. Need to adjust standard
errors and inference for first-stage estimation. Can use “heckit” in
Stata applied to treated and nontreated.
34
∙ If, say, we estimate the coefficients 0, 0′′ and 1, 1
′′ using
heckit on the Wi 0 and Wi 1 subgroups, important to form
x 1 − 0 x1 − 0
and, as a special case,
ate 1 − 0 x1 − 0.
It makes no sense to just look at 1 − 0 unless 1 0 has been
imposed or x is, coincidentally, close to zero.
35
∙ The ATEs are never obtained by looking at
ÊYi|Wi 1,xi,zi − ÊYi|Wi 0,xi,zi and averaging these
differences across i. We already know how and x depend on the
parameters. They do not depend on z!
∙ If the ATEs were based on EY|W,x,z there would be no
identification problem and never any functional form restrictions
because we could always estimate EY|W,x,z nonparametrically.
∙ Again, we use the expression for EY|W,x,z as an estimating
equation, not for computing treatment effects.
∙Whether separate estimations are carried out or the pooled regression
is used, can use the bootstrap for inference.
36
∙ The Stata command “treatreg” (typically using MLE, but sometimes
using the CF approach) does not apply to the full switching regression
case. In fact, it seems to apply only to the simple model with
homogeneous effect,
Y 0 W x0 u0,
in which case more robust IV methods are available. The CF regression
in this case is
Yi on 1,Wi,xi,Wi i/i 1 − Wi i/1 − i
∙ That is, treatreg imposes the restriction .
37
∙ As usual, need make sure z has at least one element that predicts
treatment once x is controlled for. (Use the first-stage probit.)
∙ In the pooled estimation in the general case, can use a joint test of the
two terms Wi i/i, 1 − Wi i/1 − i to test the null that W is
exogenous. Under the null, do not need to account for generated
regressors, so use a heteroskedasticity-robust Wald test.
38
∙ Extension of the usual Heckman approach. Suppose we allow lots of
heterogeneity in the two counterfactual equations:
Yig aig xibig g xig xidig uig
where aig,big is independent of xi,zi. We can write
Yi 0 Wi xi0 Wi xi − x ui0 Wi ei xidi0 Wixigi
where gi di1 − di0.
39
∙ Under sufficient normality and linear conditional expections of
heterogeneity given ci (the error in the probit for Wi) we get the control
function equation
EYi|Wi,xi,zi 0 Wi xi0 Wi xi − x
hWi,ri WihWi,ri hWi,rixi WihWi,rixi
where
hWi,ri Wiri − 1 − Wi−ri
40
∙ So the generalized residual from the probit, hWi,ri, gets added on
its own, interacted with Wi, interacted with xi, and interacted with
Wixi. (A total of 2K 1 terms.)
∙ The null that Wi is exogenous is a test that all terms have zero
coefficients. Usual heteroskedasticity-robust Wald test can be used.
∙ As usual, with a two-step estimation method, need to adjust standard
errors if conclude Wi is endogenous. (Bootstrap.)
41
∙ Even more flexibility: AllowWi to follow a heteroskedasticity probit,
that is
Wi 1ri exprioci ≥ 0
(where rio omits a constant) in which case the generalized residual is
hWi,ri,rio Wiexp−riori − 1 − Wi−exp−riori
∙ A lot of scope for studying robustness of findings even within
parametric models.
42
5. A Different Approach to Allowing HeterogeneousTreatment Effects
∙ Based on Wooldridge (2007, Advances in Econometrics). Go back to
Y 0 W x0 W x − x u0 W e
∙ Rather than find Eu0|W,x,z and Ee|W,x,z, now write
W e EW e|x,z a, where Ea|x,z 0 by definition. Then
Y 0 W x0 W x − x EW e|x,z u0 a,
where now the composite error, c u0 a, has zero mean conditional
on x,z.
43
∙ Turns out EW e|x,z is remarkably simple when W follows a probit:
EW e|x,z r 0 x1 z2
∙ The estimating equation is
Y 0 W x0 W x − x r q
Eq|x,z 0.
∙ But W is still endogenous in this equation, so we need an instrument.
(This is why the method differs from standard control function
approach.)
∙ The added regressor, r, is exogenous, and acts as its own
instrument. The natural instrument for W is r.
44
∙ Two-step procedure:
(i) As before, probit of W on 1,x,z. Now get i in addition to i.
(ii) Estimate
Yi 0 Wi xi0 Wi xi − x i errori
by IV, using instruments 1, i,xi, i xi − x, i.
∙ Generated regressor problem with i, so use delta method or
bootstrap. (Do not need to worry about generated instruments.)
45
∙ A test of whether W u1 − u0 is needed is just the usual
(heteroskedasticity-robust) t statistic on i, after IV estimation. Not a
test of endogeneity of W because W can still be correlated with u0.
∙ Drawback to this approach: Does not work with binary Z without
covariates: 0 2Z and 0 2Z are perfectly collinear when
Z is binary. Generally, might be sensitive to weak instruments. Need Z
to vary sufficiently.
∙ Control function approach can work with binary Z, at least in some
cases. But in CF case, test of interaction not robust to nonnormality
(endogeneity and functional form tied together).
46
∙ Can learn other things by studying the estimating equation. Assume
we have no covariates, so drop x and write
Y 0 W 0 2Z qEq|Z 0.
Via simulations, Angrist (1990, NBER Technical Working Paper)
studied the properties of the usual IV estimator in the equation without
0 2Z. In other words, the implicit error term is
0 2Z q. In general, the IV, Z (which is what Angrist used,
not 0 2Z), is correlated with 0 2Z. But not always. In
fact, in Angrist’s simulations, 0 2Z 2Z and Z has a
symmetric distribution about zero.
47
∙ Just like Z and Z2 are uncorrelated when Z is symmetrically
distributed about Z,
CovZ,2Z 0
(because is symmetric about zero). We already know
CovZ,q 0, and so Z is uncorrelated with 0 2Z q.
∙ Likely explains why Angrist’s simulation results for the simple IV
estimator look so promising for estimating .
48
∙ This method is likely less efficient than control function methods in
general. But it can be quite a bit easier with multiple treatment levels.
For example, suppose we replace W with a vector W1, . . . ,WJ. Then
the control function method requires computing the expectations
Eu0|W,x,z and Ee|W,x,z, which can be very difficult, whereas the
“correction function” method requires only calculation of EWje|x,z
for each j separately.
49
6. Nonlinear Response Models
∙What if Y is clearly a discrete outcome, such as a binary response?
Could take the Imbens and Angrist perspective and settle for
(hopefully) estimate LATE. But what if we want to estimate the ATE as
a function of covariates x?
∙ Unfortunately, no general, robust procedures. Joint MLE is the only
reliable method. That is, specify a model forW as a function of x,z
and also a model for Y.
50
∙ Some poor practices linger, especially in complicated discrete
response models (multinomial and conditional logit, for example). Even
if Y is binary, often see a two-step method applied.
∙ (1) Probit of Wi on xi,zi and get fitted values, i. (2) Probit of Yi
on i,xi.
∙ Coefficient on i is meaningless (except maybe for the sign). How do
we get the ATE from this? No current justification for these kinds of
approaches.
∙ Is it somehow possible that
51
∙ Can think of a similar Tobit method (when Y is a corner solution
response) that attempts to mimic “two-stage least squares.
52
∙What can we do that at least works under a set of assumptions?
∙ Suppose Yg are binary, and we are willing to assume, for g 0,1,
Yg 1g xg ug 0
ug|x,z ~ Normal0,1
∙We know that the ATE as a function of x is
x 1 x1 − 0 x0,
which is easy to estimate given estimates of the parameters.
53
∙ If we again assume
W 10 x1 z2 c ≥ 0
and
u0,u1,c|x,z ~ Multivariate Normal,
then estimation using standard software is straightforward: apply the
Heckman selection approach to the probit model for the control
Wi 0 and treated Wi 1 groups separately.
54
∙ In Stata, “heckprob.” This gives 0, 0 Wi 0, need instrument) and
1, 1 (Wi 1) and then we have
x 1 x1 − 0 x0,
which we can study at various values of x, or average across
subpopulations. The estimate of
ate Ex1 x1 − 0 x0
is
ate N−1∑i1
N
1 xi1 − 0 xi0.
55
∙ Simple test for null thatW is exogenous. Can show that the score
(Lagrange multiplier) test can be obtained as the following variable
addition test: (1) Probit of Wi on 1,xi,zi to get and the generalized
residual,
gri ≡ Wiri − 1 − Wi−ri
(2) Probit of Yi on 1,xi,Wi,gri and the usual MLE t statistic. Under the
null, no need to adjust the standard error for first-stage estimation.
∙ Cannot be theoretically justified as a correction under H1, but one
wonders for “local” forms of endogeneity.
56
∙ Rather than compute ate as earlier, can extend Blundell and Powell
(2003):
x N−1∑i1
N
x gri − x gri
where is the probit coefficient on Wi. Then average out across xi to
approximate ate. (Caution: I really have no idea if or when this works.)
57
∙ For testing the null, can do lots of different things. For example, with
multiple binary treatments Wi1, . . . ,WiG, can (assuming enough
instruments) compute a GR for eachWi1, then do probit
Yi on 1xi,Wi1, . . . ,WiG,gri1, . . . ,griG
and do a usual joint Wald test for joint significance of gri1, . . . ,griG.
Under null, they should be jointly significant.
58
∙ For a test, could use the Pearson (standardized) residual in place of
the generalized residual:
pri Wi − ri/ ri1 −ri
As with adding GR, might this approximate the average treatment
effect when averaged out of the second-stage probit?
∙ Recent work [Wooldridge (2008, manuscript)]: If Wi (scalar) is
endogenous and follows a probit, can use the same bivariate probit
(quasi-) log likelihood when Yi is a fractional response following a
probit mean function, without further retricting the distribution of Yi.
59
∙ If Yg is a corner (labor supply, charitable contributions) might
specify
Yg max0,g xg ug, g 1,2
and use the assumptions as above, except allow 02 and 1
2 as variances
of u0,u1. Under the multivariate normality of u0,u1,c (and being
independent of x,z), can use MLE. The estimated average treatment
effect as a function of x is
x 1 x1/1 11 x1/1
− 0 x0/0 00 x0/0
and average this across xi to estimate ate.
60
∙ For other functional forms, we can get by with fewer restrictions,
although often same parameters are imposed across regimes.
∙Write
EY|W,x,z, r1 exp x W r1,
where r1 is an unobservable. Terza’s (1998) control function method
based on
EY|W,x,z exp x WEexpr1|W,x,z.
61
∙When W follows a probit,
EY|W,x,z exp x WhW,0 x1 z2,
where
hW,0 x1 z2, exp2/2W r/z 1 − W1 − r/1 − r.
∙ Estimate using first stage probit, as before. Then, say, Poisson
QMLE with the above mean function to estimate , , and .
62
∙ Then the ATE as a function of x is estimated as
x exp x − exp x
∙What is the score test of H0 : 0 (W is exogenous)? Can show the
variable addition test is (1) Probit of Wi on 1,xi,zi, as usual; (2)
Poisson (say) regression of Yi on 1,xi,Wi, lgri where lgri is the log of
the usual generlized residual:
lgri Wi logri − 1 − Wi log−ri
63
∙ IV methods that work for any distribution of W are available
[Mullahy (1997)]. If
EY|W,x,z, r1 expx W r1,
and r1 is independent of z then
Eexp−x − WY|z Eexpr1|z 1,
where Eexpr1 1 is a normalization. The moment conditions are
Eexp−x − WY − 1|z 0.
64