Gov 2002: 7. Regression and Causality
Matthew Blackwell
October 15, 2015
1 58
Agnostic Regression
Regression and Causality
Regression with Heterogeneous Treatment Effects
2 58
Where are we? Where are we going?

• Last few weeks: using matching/weighting for estimating causal effects
• This week: how to use regression to estimate causal effects
• Regression is so widely used, it's good to know what it's actually estimating
• Goal: salvage regression from the ashes of 1980s textbooks
• Next week: panel data

Reminder: Email me and Stephen a half-page description of your proposed research project
3 58
1. Agnostic Regression
4 58
Regression as parametric modeling
• Gauss-Markov assumptions: linearity, iid sample, full rank X_i, zero conditional mean error, homoskedasticity
• OLS is BLUE; add normality of the errors and we get small-sample SEs
• What is the basic approach here? It is a model for the conditional distribution of Y_i given X_i:

[Y_i | X_i] ~ N(X_i'β, σ²)

• MLE from this model is the usual OLS estimator:

β̂_OLS = (Σ_{i=1}^N X_i X_i')⁻¹ (Σ_{i=1}^N X_i Y_i)
5 58
Agnostic views on regression
[Y_i | X_i] ~ N(X_i'β, σ²)

• Strong distributional assumption on Y_i
• Properties like BLUE or the MLE properties depend on these assumptions holding
• Alternative: take an agnostic view on regression. Use OLS without believing these assumptions
• Lose the distributional assumptions; focus on the conditional expectation function (CEF):

μ(x) = E[Y_i | X_i = x] = Σ_y y · P[Y_i = y | X_i = x]
6 58
Justifying linear regression
• Define linear regression:

β = argmin_b E[(Y_i − X_i'b)²]

• The solution to this is the following:

β = E[X_i X_i']⁻¹ E[X_i Y_i]

• Note that this β is the population coefficient vector, not the estimator yet
7 58
Regression anatomy
• Consider simple linear regression:

(α, β) = argmin_{a,b} E[(Y_i − a − b X_i)²]

• In this case, we can write the population/true slope β as:

β = E[X_i X_i']⁻¹ E[X_i Y_i] = Cov(Y_i, X_i) / V[X_i]

• With more covariates, β is more complicated, but we can still write it like this
• Let X̃_{ki} be the residual from a regression of X_{ki} on all the other independent variables. Then β_k, the coefficient on X_{ki}, is:

β_k = Cov(Y_i, X̃_{ki}) / V(X̃_{ki})
8 58
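The regression-anatomy formula can be checked numerically. A minimal sketch in R with simulated (hypothetical) data: the multiple-regression coefficient on one covariate equals the bivariate slope of the outcome on that covariate's residual.

```r
# Regression anatomy by simulation: the multiple-regression coefficient on x1
# equals Cov(y, x1.tilde) / Var(x1.tilde), where x1.tilde is the residual from
# regressing x1 on the other covariates (Frisch-Waugh-Lovell).
set.seed(02138)
n  <- 1000
x2 <- rnorm(n)
x1 <- 0.5 * x2 + rnorm(n)               # x1 correlated with x2
y  <- 1 + 2 * x1 - x2 + rnorm(n)

b1 <- coef(lm(y ~ x1 + x2))["x1"]       # multiple-regression coefficient
x1.tilde   <- resid(lm(x1 ~ x2))        # residualized x1
b1.anatomy <- cov(y, x1.tilde) / var(x1.tilde)

all.equal(unname(b1), b1.anatomy)       # TRUE (identical up to rounding)
```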
Justification 1 Linear CEFs
• Justification 1: if the CEF is linear, the population regression function is it. That is, if E[Y_i | X_i] = X_i'b, then b = β
• When would we expect the CEF to be linear? Two cases:
  1. Outcome and covariates are multivariate normal
  2. Linear regression model is saturated
• A model is saturated if there are as many parameters as there are possible combinations of the X_i variables
9 58
Saturated model example
• Two binary variables: X_{1i} for incumbency status and X_{2i} for party of the candidate
• Four possible values of X_i, four possible values of μ(X_i):

E[Y_i | X_{1i} = 0, X_{2i} = 0] = α
E[Y_i | X_{1i} = 1, X_{2i} = 0] = α + β
E[Y_i | X_{1i} = 0, X_{2i} = 1] = α + γ
E[Y_i | X_{1i} = 1, X_{2i} = 1] = α + β + γ + δ

• We can write the CEF as follows:

E[Y_i | X_{1i}, X_{2i}] = α + β X_{1i} + γ X_{2i} + δ (X_{1i} X_{2i})
10 58
Saturated models example
E[Y_i | X_{1i}, X_{2i}] = α + β X_{1i} + γ X_{2i} + δ (X_{1i} X_{2i})

• Basically, each value of μ(X_i) is being estimated separately: within-strata estimation. No borrowing of information across values of X_i
• Requires a set of dummies for each categorical variable plus all interactions
• Or a series of dummies for each unique combination of X_i
• This makes linearity hold mechanically, so linearity is not an assumption: just a fact about saturated CEFs
• Saturated models for limited dependent variables = A-OK
11 58
Saturated model example
• Washington (AER) data on the effects of daughters
• We'll look at the relationship between voting and number of kids (causal)

girls <- foreign::read.dta("girls.dta")
head(girls[, c("name", "totchi", "aauw")])
name totchi aauw
1 ABERCROMBIE NEIL 0 100
2 ACKERMAN GARY L 3 88
3 ADERHOLT ROBERT B 0 0
4 ALLEN THOMAS H 2 100
5 ANDREWS ROBERT E 2 100
6 ARCHER WR 7 0
12 58
Linear model
summary(lm(aauw ~ totchi, data = girls))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    61.31       1.81   33.81   <2e-16
totchi         -5.33       0.62   -8.59   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 42 on 1733 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.0408, Adjusted R-squared: 0.0403
F-statistic: 73.8 on 1 and 1733 DF, p-value: <2e-16
13 58
Saturated model

summary(lm(aauw ~ as.factor(totchi), data = girls))

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)            56.41       2.76   20.42  < 2e-16
as.factor(totchi)1      5.45       4.11    1.33   0.1851
as.factor(totchi)2     -3.80       3.27   -1.16   0.2454
as.factor(totchi)3    -13.65       3.45   -3.95  8.1e-05
as.factor(totchi)4    -19.31       4.01   -4.82  1.6e-06
as.factor(totchi)5    -15.46       4.85   -3.19   0.0015
as.factor(totchi)6    -33.59      10.42   -3.22   0.0013
as.factor(totchi)7    -17.13      11.41   -1.50   0.1336
as.factor(totchi)8    -55.33      12.28   -4.51  7.0e-06
as.factor(totchi)9    -50.41      24.08   -2.09   0.0364
as.factor(totchi)10   -53.41      20.90   -2.56   0.0107
as.factor(totchi)12   -56.41      41.53   -1.36   0.1745
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41 on 1723 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.0506, Adjusted R-squared: 0.0446
F-statistic: 8.36 on 11 and 1723 DF, p-value: 1.84e-14

14 58
Saturated model minus the constant

summary(lm(aauw ~ as.factor(totchi) - 1, data = girls))

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
as.factor(totchi)0     56.41       2.76   20.42   <2e-16
as.factor(totchi)1     61.86       3.05   20.31   <2e-16
as.factor(totchi)2     52.62       1.75   30.13   <2e-16
as.factor(totchi)3     42.76       2.07   20.62   <2e-16
as.factor(totchi)4     37.11       2.90   12.79   <2e-16
as.factor(totchi)5     40.95       3.99   10.27   <2e-16
as.factor(totchi)6     22.82      10.05    2.27   0.0233
as.factor(totchi)7     39.29      11.07    3.55   0.0004
as.factor(totchi)8      1.08      11.96    0.09   0.9278
as.factor(totchi)9      6.00      23.92    0.25   0.8020
as.factor(totchi)10     3.00      20.72    0.14   0.8849
as.factor(totchi)12     0.00      41.43    0.00   1.0000
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41 on 1723 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.587, Adjusted R-squared: 0.584
F-statistic: 204 on 12 and 1723 DF, p-value: <2e-16

15 58
Compare to within-strata means
• The saturated model makes no assumptions about the between-strata relationships
• It just calculates within-strata means

c1 <- coef(lm(aauw ~ as.factor(totchi) - 1, data = girls))
c2 <- with(girls, tapply(aauw, totchi, mean, na.rm = TRUE))
rbind(c1, c2)

    0  1  2  3  4  5  6  7   8 9 10 12
c1 56 62 53 43 37 41 23 39 1.1 6  3  0
c2 56 62 53 43 37 41 23 39 1.1 6  3  0
16 58
Other justifications for OLS
• Justification 2: X_i'β is the best linear predictor (in a mean-squared-error sense) of Y_i. Why? β = argmin_b E[(Y_i − X_i'b)²]
• Justification 3: X_i'β provides the minimum mean-squared-error linear approximation to E[Y_i | X_i]
• Even if the CEF is not linear, a linear regression provides the best linear approximation to that CEF
• Don't need to believe the assumptions (linearity) in order to use regression as a good approximation to the CEF
• Warning: if the CEF is very nonlinear, then this approximation could be terrible
17 58
The error terms
• Let's define the error term e_i ≡ Y_i − X_i'β, so that:

Y_i = X_i'β + [Y_i − X_i'β] = X_i'β + e_i

• Note that the residual e_i is uncorrelated with X_i:

E[X_i e_i] = E[X_i (Y_i − X_i'β)]
           = E[X_i Y_i] − E[X_i X_i' β]
           = E[X_i Y_i] − E[X_i X_i' E[X_i X_i']⁻¹ E[X_i Y_i]]
           = E[X_i Y_i] − E[X_i X_i'] E[X_i X_i']⁻¹ E[X_i Y_i]
           = E[X_i Y_i] − E[X_i Y_i] = 0

• No assumptions on the linearity of E[Y_i | X_i]
18 58
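The population orthogonality E[X_i e_i] = 0 has an exact sample analogue: OLS residuals are orthogonal to every column of the design matrix by construction, whatever the true CEF looks like. A small sketch with simulated (hypothetical) data:

```r
# OLS residuals are orthogonal to the regressors by construction,
# even when the true CEF is decidedly nonlinear.
set.seed(02138)
n <- 500
x <- rnorm(n)
y <- sin(2 * x) + rnorm(n)       # nonlinear CEF; no linearity assumed

fit <- lm(y ~ x)
X <- model.matrix(fit)           # columns: intercept, x
crossprod(X, resid(fit))         # both entries are zero up to machine precision
```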
OLS estimator
• We know the population value of β is:

β = E[X_i X_i']⁻¹ E[X_i Y_i]

• How do we get an estimator of this?
• Plug-in principle: replace the population expectations with their sample versions:

β̂ = ((1/N) Σ_i X_i X_i')⁻¹ (1/N) Σ_i X_i Y_i

• If you work through the matrix algebra, this turns out to be:

β̂ = (X'X)⁻¹ X'y
19 58
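The closed form (X'X)⁻¹ X'y can be verified directly against lm(). A minimal sketch with simulated (hypothetical) data:

```r
# The matrix formula (X'X)^{-1} X'y reproduces coef(lm) exactly.
set.seed(02138)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

X <- cbind(1, x1, x2)                      # design matrix with intercept
beta.hat <- solve(t(X) %*% X, t(X) %*% y)  # (X'X)^{-1} X'y
cbind(beta.hat, coef(lm(y ~ x1 + x2)))     # the two columns match
```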
Asymptotic OLS inference
• With this representation in hand, we can write the OLS estimator as follows:

β̂ = β + (Σ_i X_i X_i')⁻¹ Σ_i X_i e_i

• Core idea: Σ_i X_i e_i is a sum of random variables, so the CLT applies
• That, plus some simple asymptotic theory, allows us to say:

√N (β̂ − β) →d N(0, Ω)

• Converges in distribution to a Normal distribution with mean vector 0 and covariance matrix Ω:

Ω = E[X_i X_i']⁻¹ E[X_i X_i' e_i²] E[X_i X_i']⁻¹

• No linearity assumption needed

20 58
Estimating the variance
• In large samples, then:

√N (β̂ − β) ∼ N(0, Ω)

• How to estimate Ω? Plug-in principle again:

Ω̂ = (Σ_i X_i X_i')⁻¹ (Σ_i X_i X_i' ê_i²) (Σ_i X_i X_i')⁻¹

• Replace e_i with its empirical counterpart, the residuals ê_i = Y_i − X_i'β̂
• Replace the population moments of X_i with their sample counterparts
• The square roots of the diagonal of this covariance matrix are the "robust" or Huber-White standard errors that Stata commonly reports
21 58
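The sandwich formula above is easy to compute by hand. A sketch with simulated heteroskedastic (hypothetical) data; the cross-check against the sandwich package is left as a comment, since that package may not be installed:

```r
# "Robust" (HC0) standard errors by hand, following the sandwich formula:
# (X'X)^{-1} (X' diag(ehat^2) X) (X'X)^{-1}.
set.seed(02138)
n <- 500
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = exp(x / 2))    # heteroskedastic errors

fit  <- lm(y ~ x)
X    <- model.matrix(fit)
ehat <- resid(fit)

bread <- solve(crossprod(X))                  # (X'X)^{-1}
meat  <- crossprod(X * ehat)                  # sum of ehat_i^2 * x_i x_i'
vcov.hc0 <- bread %*% meat %*% bread
sqrt(diag(vcov.hc0))                          # robust standard errors
# all.equal(vcov.hc0, sandwich::vcovHC(fit, type = "HC0"))  # if installed
```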
Heteroskedasticity
• No assumption of homoskedasticity
• Heteroskedasticity will definitely occur when:
  - the CEF is linear, but σ²(x) = V[Y_i | X_i = x] is not constant in x
  - E[Y_i | X_i] is not linear, but we use linear regression to approximate it
22 58
2. Regression and Causality
23 58
Regression and causality
• Most econometrics textbooks: regression is defined without respect to causality
• But then, when is β̂ "biased"? The above derivations work for some E[Y_i | X_i]
• The question then is: when does knowing the CEF tell us something about causality?
• MHE argues that a regression is causal when the CEF it approximates is causal. Identification is king.
• We will show that, under certain conditions, a regression of the outcome on the treatment and the covariates can recover a causal parameter, but perhaps not the one in which we are interested
24 58
Review
• Quick reminder: we have potential outcomes Y_i(1) and Y_i(0) and two parameters, the ATE and the ATT:

τ = E[Y_i(1) − Y_i(0)]
τ_ATT = E[Y_i(1) − Y_i(0) | D_i = 1]

• We have shown in past weeks that these effects are identified when ignorability holds. MHE calls this the conditional independence assumption (CIA)
25 58
Linear constant effects model, binary treatment

• Experiment: with a simple experiment, we can rewrite the consistency assumption as a regression formula:

Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
    = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
    = E[Y_i(0)] + τ D_i + (Y_i(0) − E[Y_i(0)])
    = μ₀ + τ D_i + v_{0i}

• Note that if ignorability holds (as in an experiment) for Y_i(0), then it will also hold for v_{0i}, since E[Y_i(0)] is constant. Thus this satisfies the usual assumptions for regression
26 58
Now with covariates
• Now assume no unmeasured confounders: Y_i(d) ⊥⊥ D_i | X_i
• We will assume a linear model for the potential outcomes:

Y_i(d) = α + τ · d + η_i

• Remember that linearity isn't an assumption if D_i is binary
• Effect of D_i is constant here; the η_i are the only source of individual variation, and we have E[η_i] = 0
• The consistency assumption allows us to write this as:

Y_i = α + τ D_i + η_i
27 58
Covariates in the error
• Let's assume that η_i is linear in X_i: η_i = X_i'γ + ν_i
• The new error is uncorrelated with X_i: E[ν_i | X_i] = 0
• This is an assumption! It might be false!
• Plug into the above:

E[Y_i(d) | X_i] = E[Y_i | D_i, X_i] = α + τ D_i + E[η_i | X_i]
               = α + τ D_i + X_i'γ + E[ν_i | X_i]
               = α + τ D_i + X_i'γ
28 58
Summing up regression with constant effects

• Reviewing the assumptions we've used:
  - no unmeasured confounders
  - constant treatment effects
  - linearity of the treatment/covariates
• Under these, we can run the following regression to estimate the ATE, τ:

Y_i = α + τ D_i + X_i'γ + ν_i

• Works with continuous or ordinal D_i if the effect of these variables is truly linear
29 58
OLS constant effects simulation
Model with linear covariates, constant 0 effect of treatment:

library(mvtnorm)
n <- 100
p <- 4
X <- rmvnorm(n = 100, mean = rep(0, p))
gamma <- c(27.4, 13.7, 13.7, 13.7)
y <- 210 + X %*% c(gamma) + rnorm(n)
alpha <- c(-1, 0.5, -0.5, -0.1)
dprobs <- boot::inv.logit(X %*% alpha)
d <- rbinom(n, size = 1, prob = dprobs)
30 58
OLS with no covariates
summary(lm(y ~ d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   220.36       4.58   48.13  < 2e-16
d             -28.69       6.54   -4.39 0.000029
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 33 on 98 degrees of freedom
Multiple R-squared: 0.164, Adjusted R-squared: 0.156
F-statistic: 19.2 on 1 and 98 DF, p-value: 0.000029
31 58
OLS with covariates
summary(lm(y ~ d + X))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  209.952      0.159  1320.7   <2e-16
d             -0.128      0.253    -0.5     0.62
X1            27.368      0.126   217.5   <2e-16
X2            13.677      0.114   120.1   <2e-16
X3            13.673      0.130   105.1   <2e-16
X4            13.570      0.106   128.0   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom
Multiple R-squared: 0.999, Adjusted R-squared: 0.999
F-statistic: 2.44e+04 on 5 and 94 DF, p-value: <2e-16
32 58
What happens with nonlinearity
Suppose we can only observe the following covariates:

z1 <- exp(X[, 1] / 2)
z2 <- X[, 2] / (1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3] / 25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

Implies that Y_i and D_i are nonlinear functions of the observed covariates, involving terms like log(Z_{1i}), cube and square roots, and interactions.

The regression is a nonlinear function of the observed covariates.
33 58
When linearity goes wrong
summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.8728    30.0799    0.73    0.469
d            -6.9292     3.3854   -2.05    0.043
z1           36.3110     2.6970   13.46   <2e-16
z2           -2.9033     3.6619   -0.79    0.430
z3           86.2022    43.1030    2.00    0.048
z4            0.4021     0.0329   12.23   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14 on 94 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF, p-value: <2e-16
34 58
3. Regression with Heterogeneous Treatment Effects
35 58
Heterogeneous effects binary treatment
• Completely randomized experiment:

Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
    = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
    = μ₀ + τ_i D_i + (Y_i(0) − μ₀)
    = μ₀ + τ D_i + (Y_i(0) − μ₀) + (τ_i − τ) · D_i
    = μ₀ + τ D_i + ε_i

• Error term now includes two components:
  1. "Baseline" variation in the outcome: (Y_i(0) − μ₀)
  2. Variation in the treatment effect: (τ_i − τ)
• Easy to verify that under the experiment, E[ε_i | D_i] = 0
• Thus OLS estimates the ATE with no covariates
36 58
Adding covariates
• What happens with no unmeasured confounders? Need to condition on X_i now
• Remember identification of the ATE/ATT using iterated expectations
• ATE is the weighted sum of CATEs:

τ = Σ_x τ(x) P[X_i = x]

• ATE/ATT are weighted averages of CATEs
• What about the regression estimand, τ_R? How does it relate to the ATE/ATT?
37 58
Heterogeneous effects and regression
• Let's investigate this under a saturated regression model:

Y_i = Σ_x B_{xi} α_x + τ_R D_i + e_i

• Use a dummy variable for each unique combination of X_i: B_{xi} = 1(X_i = x)
• Linear in X_i by construction
38 58
Investigating the regression coefficient
• How can we investigate τ_R? We can rely on the regression anatomy:

τ_R = Cov(Y_i, D_i − E[D_i | X_i]) / V(D_i − E[D_i | X_i])

• D_i − E[D_i | X_i] is the residual from a regression of D_i on the full set of dummies
• With a little work, we can show:

τ_R = E[τ(X_i)(D_i − E[D_i | X_i])²] / E[(D_i − E[D_i | X_i])²] = E[τ(X_i) σ²_d(X_i)] / E[σ²_d(X_i)]

• σ²_d(x) = V[D_i | X_i = x] is the conditional variance of treatment assignment
39 58
ATE versus OLS
τ_R = E[τ(X_i) W_i] = Σ_x τ(x) · (σ²_d(x) / E[σ²_d(X_i)]) · P[X_i = x]

• Compare to the ATE:

τ = E[τ(X_i)] = Σ_x τ(x) P[X_i = x]

• Both weight strata relative to their size, P[X_i = x]
• OLS weights strata higher if their treatment variance σ²_d(x) is high relative to the average variance across strata, E[σ²_d(X_i)]
• The ATE weights strata only by their size
40 58
Regression weighting
W_i = σ²_d(X_i) / E[σ²_d(X_i)]

• Why does OLS weight like this?
• OLS is a minimum-variance estimator: it gives more weight to more precise within-strata estimates
• Within-strata estimates are most precise when the treatment is evenly spread and thus has the highest variance
• If D_i is binary, then the conditional variance is:

σ²_d(x) = P[D_i = 1 | X_i = x](1 − P[D_i = 1 | X_i = x])
        = e(x)(1 − e(x))

• Maximum variance with P[D_i = 1 | X_i = x] = 1/2
41 58
OLS weighting example

• Binary covariate:

P[X_i = 1] = 0.75,  P[X_i = 0] = 0.25
P[D_i = 1 | X_i = 1] = 0.9,  P[D_i = 1 | X_i = 0] = 0.5
σ²_d(1) = 0.09,  σ²_d(0) = 0.25
τ(1) = 1,  τ(0) = −1

• Implies the ATE is τ = 0.5
• Average conditional variance: E[σ²_d(X_i)] = 0.13
• Weights: for X_i = 1, 0.09/0.13 = 0.692; for X_i = 0, 0.25/0.13 = 1.92

τ_R = E[τ(X_i) W_i]
    = τ(1) W(1) P[X_i = 1] + τ(0) W(0) P[X_i = 0]
    = 1 × 0.692 × 0.75 + (−1) × 1.92 × 0.25
    = 0.039
42 58
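The example's numbers can be checked by simulation; the data-generating process below just encodes the hypothetical probabilities from the slide above.

```r
# Two-strata example: OLS on D plus a dummy for X recovers the
# variance-weighted estimand (about 0.039), not the ATE (0.5).
set.seed(02138)
n <- 2e5
x <- rbinom(n, 1, 0.75)                       # P[X = 1] = 0.75
d <- rbinom(n, 1, ifelse(x == 1, 0.9, 0.5))   # propensity differs by stratum
tau <- ifelse(x == 1, 1, -1)                  # tau(1) = 1, tau(0) = -1
y <- tau * d + rnorm(n, sd = 0.1)

coef(lm(y ~ d + x))["d"]                      # close to 0.039, far from 0.5
```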
When will OLS estimate the ATE?

• When does τ = τ_R?
• Constant treatment effects: τ(x) = τ = τ_R
• Constant probability of treatment: e(x) = P[D_i = 1 | X_i = x] = e
  - Implies that the OLS weights are 1
• An incorrect linearity assumption in X_i will lead to more bias
43 58
Other ways to use regression
• What's the path forward?
  - Accept the bias (might be relatively small with saturated models)
  - Use a different regression approach
• Let μ_d(x) = E[Y_i(d) | X_i = x] be the CEF for the potential outcome under D_i = d
• By consistency and no unmeasured confounders, we have μ_d(x) = E[Y_i | D_i = d, X_i = x]
• Estimate a regression of Y_i on X_i among the D_i = d group
• Then μ̂_d(x) is just a predicted value from the regression for X_i = x
• How can we use this?
44 58
Imputation estimators
• Impute the treated potential outcomes with Ŷ_i(1) = μ̂₁(X_i)
• Impute the control potential outcomes with Ŷ_i(0) = μ̂₀(X_i)
• Procedure:
  - Regress Y_i on X_i in the treated group and get predicted values for all units (treated or control)
  - Regress Y_i on X_i in the control group and get predicted values for all units (treated or control)
  - Take the average difference between these predicted values
• More mathematically, it looks like this:

τ̂_imp = (1/N) Σ_i [μ̂₁(X_i) − μ̂₀(X_i)]

• Sometimes called an imputation estimator
45 58
Simple imputation estimator
• Use predict() from the within-group models on the data from the entire sample
• Useful trick: use a model on the entire data and model.frame() to get the right design matrix

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61
46 58
Notes on imputation estimators
• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE
• Why don't people use this?
  - Most people don't know the results we've been talking about
  - Harder to implement than vanilla OLS
• Can use linear regression to estimate μ̂_d(x) = x'β̂_d
• Recent trend is to estimate μ̂_d(x) via non-parametric methods such as kernel regression, local linear regression, regression trees, etc.
  - Easiest is generalized additive models (GAMs)
47 58
Imputation estimator visualization
[Figure: scatterplot of y against x for treated and control units, with within-group regression fits]
48 58
Imputation estimator visualization
[Figure: scatterplot of y against x for treated and control units, with within-group regression fits]
49 58
Imputation estimator visualization
[Figure: scatterplot of y against x for treated and control units, with within-group regression fits]
50 58
Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i

[Figure: scatterplot of y against x with nonlinear CEFs for treated and control units]
51 58
Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i

[Figure: scatterplot of y against x with nonlinear CEFs for treated and control units]
52 58
Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i

[Figure: scatterplot of y against x with nonlinear CEFs for treated and control units]
53 58
Using semiparametric regression

• Here, the CEFs are nonlinear, but we don't know their form
• We can use GAMs from the mgcv package for a flexible estimate

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204  Scale est. = 0.0071351  n = 30
54 58
Using GAMs
[Figure: y against x with GAM fits for the treated and control groups]
55 58
Using GAMs
[Figure: y against x with GAM fits for the treated and control groups]
56 58
Using GAMs
[Figure: y against x with GAM fits for the treated and control groups]
57 58
Limited dependent variables
• Usual advice: model the data from first principles: logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is A-OK with limited DVs when:
  - Binary treatment and no covariates (just diff-in-means)
  - Binary treatment, discrete covariates, and saturated models (stratified diff-in-means)
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models
  - Could go wrong in small samples
  - If using nonlinear models, always get effects on the scale of the outcome
58 58
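The diff-in-means claim is easy to verify: with a binary treatment and no covariates, the OLS slope is numerically identical to the difference in means, even for a binary outcome. A minimal sketch with simulated (hypothetical) data:

```r
# With a binary outcome and a binary treatment (no covariates),
# the OLS slope equals the difference in means exactly.
set.seed(02138)
n <- 1000
d <- rbinom(n, 1, 0.5)
y <- rbinom(n, 1, 0.3 + 0.2 * d)   # binary outcome, true effect 0.2

diff.means <- mean(y[d == 1]) - mean(y[d == 0])
ols.slope  <- unname(coef(lm(y ~ d))["d"])
all.equal(diff.means, ols.slope)   # TRUE
```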
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
-
Agnostic Regression
Regression and Causality
Regression with Heterogeneous Treatment Effects
2 58
Where are we Where are we going
bull Last few weeks using matching weighting for estimatingcausal effects
bull This week how to use regression to estimate causal effectsbull Regression is so widely used itrsquos good to know what itrsquos
actually estimatingbull Goal salvage regression from the ashes of 1980rsquos textbooksbull Next week panel data
Reminder Email me and Stephen a half-page description of yourproposed research project
3 58
1 AgnosticRegression
4 58
Regression as parametric modeling
bull Gauss-Markov assumptions linearity iid sample full rank 119883119894 zero conditional mean error
homoskedasticity
bull OLS is BLUE plus normality of the errors and we get smallsample SEs
bull What is the basic approach here It is a model for theconditional distribution of 119884119894 given 119883119894
[119884119894|119883119894] sim 119873(119883prime119894 120573 1205901113569)
bull MLE from this model is the usual OLS estimator OLS
OLS =
โกโขโขโขโขโขโฃ1198731114012119894=1113568
119883119894119883prime119894
โคโฅโฅโฅโฅโฅโฆ
minus11135681198731114012119894=1113568
119883119894119884119894
5 58
Agnostic views on regression
[119884119894|119883119894] sim 119873(119883prime119894 120573 1205901113569)
bull Strong distributional assumption on 119884119894bull Properties like BLUE or MLE properties depend on these
assumptions holdingbull Alternative take an agnostic view on regression
Use OLS without believing these assumptions
bull Lose the distributional assumptions focus on the conditionalexpectation function (CEF)
120583(119909) = 120124[119884119894|119883119894 = 119909] = 1114012119910119910 sdot โ[119884119894 = 119910|119883119894 = 119909]
6 58
Justifying linear regression
bull Define linear regression
120573 = argmin119887
120124[(119884119894 minus 119883prime119894 119887)1113569]
bull The solution to this is the following
120573 = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]
bull Note that the is the population coefficient vector not theestimator yet
7 58
Regression anatomy
bull Consider simple linear regression
(120572 120573) = argmin119886119887
120124 1114094(119884119894 minus 119886 minus 119887119883119894)11135691114097
bull In this case we can write the populationtrue slope 120573 as
120573 = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894] =
Cov(119884119894 119883119894)120141[119883119894]
bull With more covariates 120573 is more complicated but we can stillwrite it like this
bull Let 119896119894 be the residual from a regression of 119883119896119894 on all the otherindependent variables Then 120573119896 the coefficient for 119883119896119894 is
120573119896 =Cov(119884119894 119896119894)
119881(119896119894)
8 58
Justification 1 Linear CEFs
bull Justification 1 if the CEF is linear the population regressionfunction is it That is if 119864[119884119894|119883119894] = 119883prime
119894 119887 then 119887 = 120573bull When would we expect the CEF to be linear Two cases
1 Outcome and covariates are multivariate normal2 Linear regression model is saturated
bull A model is saturated if there are as many parameters asthere are possible combination of the 119883119894 variables
9 58
Saturated model example
bull Two binary variables 1198831113568119894 for incumbency status and 1198831113569119894 forparty of the candidate
bull Four possible values of 119883119894 four possible values of 120583(119883119894)
119864[119884119894|1198831113568119894 = 01198831113569119894 = 0] = 120572119864[119884119894|1198831113568119894 = 11198831113569119894 = 0] = 120572 + 120573119864[119884119894|1198831113568119894 = 01198831113569119894 = 1] = 120572 + 120574119864[119884119894|1198831113568119894 = 11198831113569119894 = 1] = 120572 + 120573 + 120574 + 120575
bull We can write the CEF as follows
119864[119884119894|1198831113568119894 1198831113569119894] = 120572 + 1205731198831113568119894 + 1205741198831113569119894 + 120575(11988311135681198941198831113569119894)
10 58
Saturated models example

E[Y_i | X_1i, X_2i] = α + β X_1i + γ X_2i + δ (X_1i X_2i)

• Basically, each value of μ(X_i) is being estimated separately: within-strata estimation. No borrowing of information from across values of X_i.
• Requires a set of dummies for each categorical variable plus all interactions.
• Or a series of dummies for each unique combination of X_i.
• This makes linearity hold mechanically, and so linearity is not an assumption. Just a fact about saturated CEFs.
• Thus, saturated models for limited dependent variables = A-OK.

11/58
Saturated model example

• Washington (AER) data on the effects of daughters.
• We'll look at the relationship between voting and number of kids (causal?).

girls <- foreign::read.dta("girls.dta")
head(girls[, c("name", "totchi", "aauw")])

               name totchi aauw
1  ABERCROMBIE NEIL      0  100
2   ACKERMAN GARY L      3   88
3 ADERHOLT ROBERT B      0    0
4    ALLEN THOMAS H      2  100
5  ANDREWS ROBERT E      2  100
6         ARCHER WR      7    0

12/58
Linear model

summary(lm(aauw ~ totchi, data = girls))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    61.31       1.81   33.81   <2e-16
totchi         -5.33       0.62   -8.59   <2e-16
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 42 on 1733 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.0408, Adjusted R-squared: 0.0403
F-statistic: 73.8 on 1 and 1733 DF, p-value: <2e-16

13/58
Saturated model

summary(lm(aauw ~ as.factor(totchi), data = girls))

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)            56.41       2.76   20.42  < 2e-16
as.factor(totchi)1      5.45       4.11    1.33   0.1851
as.factor(totchi)2     -3.80       3.27   -1.16   0.2454
as.factor(totchi)3    -13.65       3.45   -3.95  8.1e-05
as.factor(totchi)4    -19.31       4.01   -4.82  1.6e-06
as.factor(totchi)5    -15.46       4.85   -3.19   0.0015
as.factor(totchi)6    -33.59      10.42   -3.22   0.0013
as.factor(totchi)7    -17.13      11.41   -1.50   0.1336
as.factor(totchi)8    -55.33      12.28   -4.51  7.0e-06
as.factor(totchi)9    -50.41      24.08   -2.09   0.0364
as.factor(totchi)10   -53.41      20.90   -2.56   0.0107
as.factor(totchi)12   -56.41      41.53   -1.36   0.1745
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41 on 1723 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.0506, Adjusted R-squared: 0.0446
F-statistic: 8.36 on 11 and 1723 DF, p-value: 1.84e-14

14/58
Saturated model, minus the constant

summary(lm(aauw ~ as.factor(totchi) - 1, data = girls))

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
as.factor(totchi)0     56.41       2.76   20.42   <2e-16
as.factor(totchi)1     61.86       3.05   20.31   <2e-16
as.factor(totchi)2     52.62       1.75   30.13   <2e-16
as.factor(totchi)3     42.76       2.07   20.62   <2e-16
as.factor(totchi)4     37.11       2.90   12.79   <2e-16
as.factor(totchi)5     40.95       3.99   10.27   <2e-16
as.factor(totchi)6     22.82      10.05    2.27   0.0233
as.factor(totchi)7     39.29      11.07    3.55   0.0004
as.factor(totchi)8      1.08      11.96    0.09   0.9278
as.factor(totchi)9      6.00      23.92    0.25   0.8020
as.factor(totchi)10     3.00      20.72    0.14   0.8849
as.factor(totchi)12     0.00      41.43    0.00   1.0000
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41 on 1723 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.587, Adjusted R-squared: 0.584
F-statistic: 204 on 12 and 1723 DF, p-value: <2e-16

15/58
Compare to within-strata means

• The saturated model makes no assumptions about the between-strata relationships.
• Just calculates within-strata means.

c1 <- coef(lm(aauw ~ as.factor(totchi) - 1, data = girls))
c2 <- with(girls, tapply(aauw, totchi, mean, na.rm = TRUE))
rbind(c1, c2)

    0  1  2  3  4  5  6  7   8 9 10 12
c1 56 62 53 43 37 41 23 39 1.1 6  3  0
c2 56 62 53 43 37 41 23 39 1.1 6  3  0

16/58
Other justifications for OLS

• Justification 2: X_i'β is the best linear predictor (in a mean-squared error sense) of Y_i. Why? β = argmin_b E[(Y_i − X_i'b)²].
• Justification 3: X_i'β provides the minimum mean squared error linear approximation to E[Y_i | X_i].
• Even if the CEF is not linear, a linear regression provides the best linear approximation to that CEF.
• Don't need to believe the assumptions (linearity) in order to use regression as a good approximation to the CEF.
• Warning: if the CEF is very nonlinear, then this approximation could be terrible.

17/58
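A deterministic toy example of "best linear approximation" (my own illustration, not from the slides): take X uniform on {0, 1, 2} and a quadratic CEF.

```r
# Nonlinear CEF: Y = X^2 on the three points x = 0, 1, 2. OLS returns
# the MSE-minimizing line through them: slope 2, intercept -1/3. The
# line is not the CEF, but no other line does better.
x <- c(0, 1, 2)
y <- x^2
b <- coef(lm(y ~ x))   # (-1/3, 2)
```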
The error terms

• Let's define the error term e_i ≡ Y_i − X_i'β, so that:

Y_i = X_i'β + [Y_i − X_i'β] = X_i'β + e_i

• Note that the residual e_i is uncorrelated with X_i:

E[X_i e_i] = E[X_i (Y_i − X_i'β)]
           = E[X_i Y_i] − E[X_i X_i'β]
           = E[X_i Y_i] − E[X_i X_i' E[X_i X_i']⁻¹ E[X_i Y_i]]
           = E[X_i Y_i] − E[X_i X_i'] E[X_i X_i']⁻¹ E[X_i Y_i]
           = E[X_i Y_i] − E[X_i Y_i] = 0

• No assumptions on the linearity of E[Y_i | X_i].

18/58
OLS estimator

• We know the population value of β is:

β = E[X_i X_i']⁻¹ E[X_i Y_i]

• How do we get an estimator of this?
• Plug-in principle: replace population expectations with sample versions:

β̂ = [ (1/N) Σ_i X_i X_i' ]⁻¹ (1/N) Σ_i X_i Y_i

• If you work through the matrix algebra, this turns out to be:

β̂ = (𝐗'𝐗)⁻¹ 𝐗'𝐲

19/58
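The plug-in idea can be checked directly in base R (simulated data; names here are illustrative): the sample-moment formula matches lm(), and the residuals are exactly orthogonal to the design, mirroring the population derivation above.

```r
# Plug-in OLS: sample moments replace population moments.
set.seed(7)
n <- 200
X <- cbind(1, rnorm(n), rnorm(n))                     # design with a constant
y <- drop(X %*% c(1, 2, -1) + rnorm(n))

beta_hat <- solve(crossprod(X)) %*% crossprod(X, y)   # (X'X)^{-1} X'y
fit <- lm(y ~ X[, 2] + X[, 3])

# in-sample residuals are exactly orthogonal to the columns of X
orth <- crossprod(X, y - X %*% beta_hat)
```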
Asymptotic OLS inference

• With this representation in hand, we can write the OLS estimator as follows:

β̂ = β + [ Σ_i X_i X_i' ]⁻¹ Σ_i X_i e_i

• Core idea: Σ_i X_i e_i is a sum of r.v.s, so the CLT applies.
• That, plus some simple asymptotic theory, allows us to say:

√N (β̂ − β) ⇝ N(0, Ω)

• Converges in distribution to a Normal distribution with mean vector 0 and covariance matrix Ω:

Ω = E[X_i X_i']⁻¹ E[X_i X_i' e_i²] E[X_i X_i']⁻¹

• No linearity assumption needed!

20/58
Estimating the variance

• In large samples, then:

√N (β̂ − β) ∼ N(0, Ω)

• How to estimate Ω? Plug-in principle again!

Ω̂ = [ Σ_i X_i X_i' ]⁻¹ [ Σ_i X_i X_i' ê_i² ] [ Σ_i X_i X_i' ]⁻¹

• Replace e_i with its empirical counterpart, the residuals: ê_i = Y_i − X_i'β̂.
• Replace the population moments of X_i with their sample counterparts.
• The square roots of the diagonals of this covariance matrix are the "robust" or Huber-White standard errors that Stata commonly reports.

21/58
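The sandwich formula above can be computed by hand in a few lines; a sketch on simulated heteroskedastic data (variable names are mine, and this is the plain HC0 variant, not production code):

```r
# Hand-rolled HC0 "robust" variance matching the plug-in formula.
set.seed(2)
n <- 500
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n) * (1 + abs(x))   # error variance depends on x
X <- cbind(1, x)
u <- resid(lm(y ~ x))

bread <- solve(crossprod(X))    # (sum_i X_i X_i')^{-1}
meat  <- crossprod(X * u)       # sum_i X_i X_i' uhat_i^2
vcov_hc0  <- bread %*% meat %*% bread
robust_se <- sqrt(diag(vcov_hc0))
```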
Heteroskedasticity

• No assumptions of homoskedasticity.
• Heteroskedasticity will definitely occur when:
  - the CEF is linear, but σ²(x) = V[Y_i | X_i = x] is not constant in x;
  - E[Y_i | X_i] is not linear, but we use the linear regression to approximate it.

22/58
2. Regression and Causality

23/58
Regression and causality

• Most econometrics textbooks: regression defined without respect to causality.
• But then when is β̂ "biased"? The above derivations work for some E[Y_i | X_i].
• The question, then, is: when does knowing the CEF tell us something about causality?
• MHE argues that a regression is causal when the CEF it approximates is causal. Identification is king.
• We will show that under certain conditions, a regression of the outcome on the treatment and the covariates can recover a causal parameter, but perhaps not the one in which we are interested.

24/58
Review

• Quick reminder: we have potential outcomes Y_i(1) and Y_i(0), and two parameters, the ATE and ATT:

τ = E[Y_i(1) − Y_i(0)]
τ_ATT = E[Y_i(1) − Y_i(0) | D_i = 1]

• We have shown in past weeks that these effects are identified when ignorability holds. MHE calls this the conditional independence assumption (CIA).

25/58
Linear constant effects model, binary treatment

• Experiment: with a simple experiment, we can rewrite the consistency assumption as a regression formula:

Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
    = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
    = E[Y_i(0)] + τ D_i + (Y_i(0) − E[Y_i(0)])
    = μ_0 + τ D_i + v_0i

• Note that if ignorability holds (as in an experiment) for Y_i(0), then it will also hold for v_0i, since E[Y_i(0)] is constant. Thus this satisfies the usual assumptions for regression.

26/58
Now with covariates

• Now assume no unmeasured confounders: Y_i(d) ⊥⊥ D_i | X_i.
• We will assume a linear model for the potential outcomes:

Y_i(d) = α + τ · d + η_i

• Remember that linearity isn't an assumption if D_i is binary.
• Effect of D_i is constant here; the η_i are the only source of individual variation, and we have E[η_i] = 0.
• The consistency assumption allows us to write this as:

Y_i = α + τ D_i + η_i

27/58
Covariates in the error

• Let's assume that η_i is linear in X_i: η_i = X_i'γ + ν_i.
• The new error is uncorrelated with X_i: E[ν_i | X_i] = 0.
• This is an assumption! Might be false!
• Plug into the above:

E[Y_i(d) | X_i] = E[Y_i | D_i, X_i] = α + τ D_i + E[η_i | X_i]
               = α + τ D_i + X_i'γ + E[ν_i | X_i]
               = α + τ D_i + X_i'γ

28/58
Summing up regression with constant effects

• Reviewing the assumptions we've used:
  - no unmeasured confounders,
  - constant treatment effects,
  - linearity of the treatment/covariates.
• Under these, we can run the following regression to estimate the ATE, τ:

Y_i = α + τ D_i + X_i'γ + ν_i

• Works with continuous or ordinal D_i if the effect of these variables is truly linear.

29/58
OLS constant effects simulation

Model with linear covariates, constant 0 effect of treatment:

library(mvtnorm)
n <- 100
p <- 4
X <- rmvnorm(n = 100, mean = rep(0, p))
gamma <- c(2.74, 1.37, 1.37, 1.37)
y <- 21 + X %*% c(gamma) + rnorm(n)
alpha <- c(-1, 0.5, -0.5, -0.1)
d.probs <- boot::inv.logit(X %*% alpha)
d <- rbinom(n, size = 1, prob = d.probs)

30/58
OLS with no covariates

summary(lm(y ~ d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   22.036      0.458   48.13  < 2e-16
d             -2.869      0.654   -4.39 0.000029
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.3 on 98 degrees of freedom
Multiple R-squared: 0.164, Adjusted R-squared: 0.156
F-statistic: 19.2 on 1 and 98 DF, p-value: 0.000029

31/58
OLS with covariates

summary(lm(y ~ d + X))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  20.9952      0.159  132.07   <2e-16
d            -0.128       0.253   -0.51     0.62
X1            2.7368      0.126   21.75   <2e-16
X2            1.3677      0.114   12.01   <2e-16
X3            1.3673      0.130   10.51   <2e-16
X4            1.3570      0.106   12.80   <2e-16
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom
Multiple R-squared: 0.999, Adjusted R-squared: 0.999
F-statistic: 2.44e+04 on 5 and 94 DF, p-value: <2e-16

32/58
What happens with nonlinearity?

Suppose we can only observe the following covariates:

z1 <- exp(X[, 1] / 2)
z2 <- X[, 2] / (1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3] / 2.5 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

Implies that Y_i and D_i are functions of log(Z_1i), Z_2i, Z_2i Z²_1i, Z^{1/3}_3i / log(Z_1i), and Z^{1/2}_4i.

The regression is a nonlinear function of the observed covariates.

33/58
When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.8728    30.0799    0.73    0.469
d            -6.9292     3.3854   -2.05    0.043
z1           36.3110     2.6970   13.46   <2e-16
z2           -2.9033     3.6619   -0.79    0.430
z3          862.0220   431.0300    2.00    0.048
z4            0.4021     0.0329   12.23   <2e-16
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.4 on 94 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF, p-value: <2e-16

34/58
3. Regression with Heterogeneous Treatment Effects

35/58
Heterogeneous effects, binary treatment

• Completely randomized experiment:

Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
    = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
    = μ_0 + τ_i D_i + (Y_i(0) − μ_0)
    = μ_0 + τ D_i + (Y_i(0) − μ_0) + (τ_i − τ) · D_i
    = μ_0 + τ D_i + ε_i

• The error term now includes two components:
  1. "Baseline" variation in the outcome: (Y_i(0) − μ_0).
  2. Variation in the treatment effect: (τ_i − τ).
• Easy to verify that under an experiment, E[ε_i | D_i] = 0.
• Thus, OLS estimates the ATE with no covariates.

36/58
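This can be seen numerically in a simulated randomized experiment (the data below are mine, for illustration): with heterogeneous unit-level effects, the OLS coefficient on D is exactly the difference in means.

```r
# Randomized experiment with heterogeneous effects tau_i.
set.seed(3)
n   <- 1000
y0  <- rnorm(n)
tau <- rnorm(n, mean = 2)          # unit-level effects vary
y1  <- y0 + tau
d   <- rbinom(n, 1, 0.5)           # randomized treatment
y   <- d * y1 + (1 - d) * y0       # consistency / switching equation

ols_coef   <- unname(coef(lm(y ~ d))["d"])
diff_means <- mean(y[d == 1]) - mean(y[d == 0])
```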
Adding covariates

• What happens with no unmeasured confounders? We need to condition on X_i now.
• Remember identification of the ATE/ATT using iterated expectations.
• The ATE is the weighted sum of CATEs:

τ = Σ_x τ(x) Pr[X_i = x]

• The ATE/ATT are weighted averages of CATEs.
• What about the regression estimand, τ_R? How does it relate to the ATE/ATT?

37/58
Heterogeneous effects and regression

• Let's investigate this under a saturated regression model:

Y_i = Σ_x B_xi α_x + τ_R D_i + e_i

• Use a dummy variable for each unique combination of X_i: B_xi = 1(X_i = x).
• Linear in X_i by construction.

38/58
Investigating the regression coefficient

• How can we investigate τ_R? Well, we can rely on the regression anatomy:

τ_R = Cov(Y_i, D_i − E[D_i | X_i]) / V(D_i − E[D_i | X_i])

• D_i − E[D_i | X_i] is the residual from a regression of D_i on the full set of dummies.
• With a little work, we can show:

τ_R = E[τ(X_i)(D_i − E[D_i | X_i])²] / E[(D_i − E[D_i | X_i])²] = E[τ(X_i) σ²_d(X_i)] / E[σ²_d(X_i)]

• σ²_d(x) = V[D_i | X_i = x] is the conditional variance of treatment assignment.

39/58
ATE versus OLS

τ_R = E[τ(X_i) W_i] = Σ_x τ(x) [σ²_d(x) / E[σ²_d(X_i)]] P[X_i = x]

• Compare to the ATE:

τ = E[τ(X_i)] = Σ_x τ(x) P[X_i = x]

• Both weight strata relative to their size (P[X_i = x]).
• OLS weights strata higher if the treatment variance in those strata (σ²_d(x)) is high relative to the average variance across strata (E[σ²_d(X_i)]).
• The ATE weights only by their size.

40/58
Regression weighting

W_i = σ²_d(X_i) / E[σ²_d(X_i)]

• Why does OLS weight like this?
• OLS is a minimum-variance estimator: more weight to more precise within-strata estimates.
• Within-strata estimates are most precise when the treatment is evenly spread and thus has the highest variance.
• If D_i is binary, then we know the conditional variance will be:

σ²_d(x) = P[D_i = 1 | X_i = x] (1 − P[D_i = 1 | X_i = x]) = e(x)(1 − e(x))

• Maximum variance with P[D_i = 1 | X_i = x] = 1/2.

41/58
OLS weighting example

• Binary covariate:

P[X_i = 1] = 0.75    P[X_i = 0] = 0.25
P[D_i = 1 | X_i = 1] = 0.9    P[D_i = 1 | X_i = 0] = 0.5
σ²_d(1) = 0.09    σ²_d(0) = 0.25
τ(1) = 1    τ(0) = −1

• Implies the ATE is τ = 0.5.
• Average conditional variance: E[σ²_d(X_i)] = 0.13.
• ⇝ weights for X_i = 1 are 0.09/0.13 = 0.692; for X_i = 0, 0.25/0.13 = 1.92.

τ_R = E[τ(X_i) W_i]
    = τ(1) W(1) P[X_i = 1] + τ(0) W(0) P[X_i = 0]
    = 1 × 0.692 × 0.75 + (−1) × 1.92 × 0.25
    = 0.039

42/58
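The arithmetic above is easy to reproduce with exact population quantities (no simulation; object names are mine):

```r
# OLS weighting example: tau_R lands near 0.04, far from the ATE of 0.5.
p_x   <- c(0.25, 0.75)               # P[X = 0], P[X = 1]
e_x   <- c(0.50, 0.90)               # P[D = 1 | X = x]
tau_x <- c(-1, 1)                    # CATEs

var_d   <- e_x * (1 - e_x)           # 0.25 and 0.09
avg_var <- sum(var_d * p_x)          # E[sigma^2_d(X)] = 0.13
w_x     <- var_d / avg_var           # weights: 1.92 and 0.692
tau_R   <- sum(tau_x * w_x * p_x)    # about 0.039
tau_ate <- sum(tau_x * p_x)          # 0.5
```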
When will OLS estimate the ATE?

• When does τ = τ_R?
• Constant treatment effects: τ(x) = τ = τ_R.
• Constant probability of treatment: e(x) = P[D_i = 1 | X_i = x] = e. Implies that the OLS weights are 1.
• An incorrect linearity assumption in X_i will lead to more bias.

43/58
Other ways to use regression

• What's the path forward?
  - Accept the bias (might be relatively small with saturated models).
  - Use a different regression approach.
• Let μ_d(x) = E[Y_i(d) | X_i = x] be the CEF for the potential outcome under D_i = d.
• By consistency and no unmeasured confounders, we have μ_d(x) = E[Y_i | D_i = d, X_i = x].
• Estimate a regression of Y_i on X_i among the D_i = d group.
• Then μ̂_d(x) is just a predicted value from the regression for X_i = x.
• How can we use this?

44/58
Imputation estimators

• Impute the treated potential outcomes with Ŷ_i(1) = μ̂_1(X_i).
• Impute the control potential outcomes with Ŷ_i(0) = μ̂_0(X_i).
• Procedure:
  - Regress Y_i on X_i in the treated group and get predicted values for all units (treated or control).
  - Regress Y_i on X_i in the control group and get predicted values for all units (treated or control).
  - Take the average difference between these predicted values.
• More mathematically, it looks like this:

τ̂_imp = (1/N) Σ_i [μ̂_1(X_i) − μ̂_0(X_i)]

• Sometimes called an imputation estimator.

45/58
Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample.
• Useful trick: use a model on the entire data and model.frame() to get the right design matrix.

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61

46/58
Notes on imputation estimators

• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE.
• Why don't people use this?
  - Most people don't know the results we've been talking about.
  - Harder to implement than vanilla OLS.
• Can use linear regression to estimate μ̂_d(x) = x'β̂_d.
• A recent trend is to estimate μ̂_d(x) via non-parametric methods such as kernel regression, local linear regression, regression trees, etc.
• Easiest is generalized additive models (GAMs).

47/58
Imputation estimator visualization

[Figure: scatterplot of y against x for treated and control units, built up over three animation slides.]

48-50/58
Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i.

[Figure: scatterplot of y against x for treated and control units, built up over three animation slides.]

51-53/58
Using semiparametric regression

• Here, the CEFs are nonlinear, but we don't know their form.
• We can use GAMs from the mgcv package for a flexible estimate:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204  Scale est. = 0.0071351  n = 30

54/58
Using GAMs

[Figure: scatterplot of y against x for treated and control units with GAM fits, built up over three animation slides.]

55-57/58
Limited dependent variables

• Usual advice: model the data from first principles. Logit/probit for binary, Poisson for counts, etc.
• OLS is A-OK with limited DVs when:
  - binary treatment and no covariates (just diff-in-means);
  - binary treatment, discrete covariates, and saturated models (stratified diff-in-means).
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong.
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models.
  - Could go wrong in small samples.
  - If using nonlinear models, always get effects on the scale of the outcome.

58/58
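The first "A-OK" case is easy to verify directly (simulated 0/1 data, names are mine): with a binary treatment and no covariates, OLS on a limited DV is exactly the difference in proportions.

```r
# Binary outcome, binary treatment: the OLS slope equals the
# difference in means between the two groups.
set.seed(4)
n <- 800
d <- rbinom(n, 1, 0.5)
y <- rbinom(n, 1, 0.3 + 0.2 * d)    # limited dependent variable

ols        <- unname(coef(lm(y ~ d))["d"])
diff_means <- mean(y[d == 1]) - mean(y[d == 0])
```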
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
-
Where are we Where are we going
bull Last few weeks using matching weighting for estimatingcausal effects
bull This week how to use regression to estimate causal effectsbull Regression is so widely used itrsquos good to know what itrsquos
actually estimatingbull Goal salvage regression from the ashes of 1980rsquos textbooksbull Next week panel data
Reminder Email me and Stephen a half-page description of yourproposed research project
3 58
1 AgnosticRegression
4 58
Regression as parametric modeling
bull Gauss-Markov assumptions linearity iid sample full rank 119883119894 zero conditional mean error
homoskedasticity
bull OLS is BLUE plus normality of the errors and we get smallsample SEs
bull What is the basic approach here It is a model for theconditional distribution of 119884119894 given 119883119894
[119884119894|119883119894] sim 119873(119883prime119894 120573 1205901113569)
bull MLE from this model is the usual OLS estimator OLS
OLS =
โกโขโขโขโขโขโฃ1198731114012119894=1113568
119883119894119883prime119894
โคโฅโฅโฅโฅโฅโฆ
minus11135681198731114012119894=1113568
119883119894119884119894
5 58
Agnostic views on regression
[119884119894|119883119894] sim 119873(119883prime119894 120573 1205901113569)
bull Strong distributional assumption on 119884119894bull Properties like BLUE or MLE properties depend on these
assumptions holdingbull Alternative take an agnostic view on regression
Use OLS without believing these assumptions
bull Lose the distributional assumptions focus on the conditionalexpectation function (CEF)
120583(119909) = 120124[119884119894|119883119894 = 119909] = 1114012119910119910 sdot โ[119884119894 = 119910|119883119894 = 119909]
6 58
Justifying linear regression
bull Define linear regression
120573 = argmin119887
120124[(119884119894 minus 119883prime119894 119887)1113569]
bull The solution to this is the following
120573 = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]
bull Note that the is the population coefficient vector not theestimator yet
7 58
Regression anatomy
bull Consider simple linear regression
(120572 120573) = argmin119886119887
120124 1114094(119884119894 minus 119886 minus 119887119883119894)11135691114097
bull In this case we can write the populationtrue slope 120573 as
120573 = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894] =
Cov(119884119894 119883119894)120141[119883119894]
bull With more covariates 120573 is more complicated but we can stillwrite it like this
bull Let 119896119894 be the residual from a regression of 119883119896119894 on all the otherindependent variables Then 120573119896 the coefficient for 119883119896119894 is
120573119896 =Cov(119884119894 119896119894)
119881(119896119894)
8 58
Justification 1 Linear CEFs
bull Justification 1 if the CEF is linear the population regressionfunction is it That is if 119864[119884119894|119883119894] = 119883prime
119894 119887 then 119887 = 120573bull When would we expect the CEF to be linear Two cases
1 Outcome and covariates are multivariate normal2 Linear regression model is saturated
bull A model is saturated if there are as many parameters asthere are possible combination of the 119883119894 variables
9 58
Saturated model example
bull Two binary variables 1198831113568119894 for incumbency status and 1198831113569119894 forparty of the candidate
bull Four possible values of 119883119894 four possible values of 120583(119883119894)
119864[119884119894|1198831113568119894 = 01198831113569119894 = 0] = 120572119864[119884119894|1198831113568119894 = 11198831113569119894 = 0] = 120572 + 120573119864[119884119894|1198831113568119894 = 01198831113569119894 = 1] = 120572 + 120574119864[119884119894|1198831113568119894 = 11198831113569119894 = 1] = 120572 + 120573 + 120574 + 120575
bull We can write the CEF as follows
119864[119884119894|1198831113568119894 1198831113569119894] = 120572 + 1205731198831113568119894 + 1205741198831113569119894 + 120575(11988311135681198941198831113569119894)
10 58
Saturated models example
119864[119884119894|1198831113568119894 1198831113569119894] = 120572 + 1205731198831113568119894 + 1205741198831113569119894 + 120575(11988311135681198941198831113569119894)
bull Basically each value of 120583(119883119894) is being estimated separately within-strata estimation No borrowing of information from across values of 119883119894
bull Requires a set of dummies for each categorical variable plusall interactions
bull Or a series of dummies for each unique combination of 119883119894bull This makes linearity hold mechanically and so linearity is
not an assumption Just a fact about saturated CEFs saturated models for limited dependent variables = A-OK
11 58
Saturated model example
bull Washington (AER) data on the effects of daughtersbull Wersquoll look at the relationship between voting and number of
kids (causal)
girls lt- foreignreaddta(rdquogirlsdtardquo)
head(girls[ c(rdquonamerdquo rdquototchirdquo rdquoaauwrdquo)])
name totchi aauw
1 ABERCROMBIE NEIL 0 100
2 ACKERMAN GARY L 3 88
3 ADERHOLT ROBERT B 0 0
4 ALLEN THOMAS H 2 100
5 ANDREWS ROBERT E 2 100
6 ARCHER WR 7 0
12 58
Linear model
summary(lm(aauw ~ totchi data = girls))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 6131 181 3381 lt2e-16
totchi -533 062 -859 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 42 on 1733 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared 00408 Adjusted R-squared 00403
F-statistic 738 on 1 and 1733 DF p-value lt2e-16
13 58
Saturated modelsummary(lm(aauw ~ asfactor(totchi) data = girls))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 5641 276 2042 lt 2e-16
asfactor(totchi)1 545 411 133 01851
asfactor(totchi)2 -380 327 -116 02454
asfactor(totchi)3 -1365 345 -395 81e-05
asfactor(totchi)4 -1931 401 -482 16e-06
asfactor(totchi)5 -1546 485 -319 00015
asfactor(totchi)6 -3359 1042 -322 00013
asfactor(totchi)7 -1713 1141 -150 01336
asfactor(totchi)8 -5533 1228 -451 70e-06
asfactor(totchi)9 -5041 2408 -209 00364
asfactor(totchi)10 -5341 2090 -256 00107
asfactor(totchi)12 -5641 4153 -136 01745
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 41 on 1723 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared 00506 Adjusted R-squared 00446
F-statistic 836 on 11 and 1723 DF p-value 184e-1414 58
Saturated model minus the constantsummary(lm(aauw ~ asfactor(totchi) - 1 data = girls))
Coefficients
Estimate Std Error t value Pr(gt|t|)
asfactor(totchi)0 5641 276 2042 lt2e-16
asfactor(totchi)1 6186 305 2031 lt2e-16
asfactor(totchi)2 5262 175 3013 lt2e-16
asfactor(totchi)3 4276 207 2062 lt2e-16
asfactor(totchi)4 3711 290 1279 lt2e-16
asfactor(totchi)5 4095 399 1027 lt2e-16
asfactor(totchi)6 2282 1005 227 00233
asfactor(totchi)7 3929 1107 355 00004
asfactor(totchi)8 108 1196 009 09278
asfactor(totchi)9 600 2392 025 08020
asfactor(totchi)10 300 2072 014 08849
asfactor(totchi)12 000 4143 000 10000
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 41 on 1723 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared 0587 Adjusted R-squared 0584
F-statistic 204 on 12 and 1723 DF p-value lt2e-1615 58
Compare to within-strata means
bull The saturated model makes no assumptions about thebetween-strata relationships
bull Just calculates within-strata means
c1 lt- coef(lm(aauw ~ asfactor(totchi) - 1 data = girls))
c2 lt- with(girls tapply(aauw totchi mean narm = TRUE))
rbind(c1 c2)
0 1 2 3 4 5 6 7 8 9 10 12
c1 56 62 53 43 37 41 23 39 11 6 3 0
c2 56 62 53 43 37 41 23 39 11 6 3 0
16 58
Other justifications for OLS
bull Justification 2 119883prime119894 120573 is the best linear predictor (in a
mean-squared error sense) of 119884119894 Why 120573 = argmin119887 120124[(119884119894 minus 119883prime
119894 119887)1113569]
bull Justification 3 119883prime119894 120573 provides the minimum mean squared
error linear approxmiation to 119864[119884119894|119883119894]bull Even if the CEF is not linear a linear regression provides the
best linear approximation to that CEFbull Donrsquot need to believe the assumptions (linearity) in order to
use regression as a good approximation to the CEFbull Warning if the CEF is very nonlinear then this approximation
could be terrible
17 58
The error terms

• Let's define the error term e_i ≡ Y_i − X_i'β, so that

    Y_i = X_i'β + [Y_i − X_i'β] = X_i'β + e_i

• Note: the residual e_i is uncorrelated with X_i:

    E[X_i e_i] = E[X_i(Y_i − X_i'β)]
               = E[X_i Y_i] − E[X_i X_i'β]
               = E[X_i Y_i] − E[X_i X_i' E[X_i X_i']⁻¹ E[X_i Y_i]]
               = E[X_i Y_i] − E[X_i X_i'] E[X_i X_i']⁻¹ E[X_i Y_i]
               = E[X_i Y_i] − E[X_i Y_i] = 0

• No assumptions on the linearity of E[Y_i|X_i]

18/58
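This orthogonality is purely mechanical, so it is easy to check numerically. A minimal sketch in Python/NumPy (rather than the course's R), with a deliberately nonlinear CEF so that linearity plays no role:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
# deliberately nonlinear CEF: E[Y|X] = x^2
y = x**2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])      # design matrix with intercept
beta = np.linalg.solve(X.T @ X, X.T @ y)  # best linear predictor / OLS
e = y - X @ beta                          # residuals

# sample analogue of E[X_i e_i]: zero up to floating-point error
print(np.abs(X.T @ e / n).max())
```

The residuals are exactly orthogonal to the columns of X by the normal equations, even though the CEF here is quadratic.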
OLS estimator

• We know the population value of β is

    β = E[X_i X_i']⁻¹ E[X_i Y_i]

• How do we get an estimator of this?
• Plug-in principle: replace population expectations with sample versions:

    β̂ = [(1/N) Σ_i X_i X_i']⁻¹ (1/N) Σ_i X_i Y_i

• If you work through the matrix algebra, this turns out to be

    β̂ = (X'X)⁻¹ X'y

19/58
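The equivalence of the plug-in moment form and the matrix form can be checked directly. A small NumPy sketch (Python rather than the deck's R; the data are simulated for illustration, and the 1/N factors cancel):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

# plug-in version: [(1/N) sum_i x_i x_i']^{-1} (1/N) sum_i x_i y_i
Sxx = (X[:, :, None] * X[:, None, :]).mean(axis=0)  # average outer product
Sxy = (X * y[:, None]).mean(axis=0)
beta_plugin = np.linalg.solve(Sxx, Sxy)

# matrix form: (X'X)^{-1} X'y
beta_matrix = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(beta_plugin, beta_matrix))  # True
```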
Asymptotic OLS inference

• With this representation in hand, we can write the OLS estimator as follows:

    β̂ = β + [Σ_i X_i X_i']⁻¹ Σ_i X_i e_i

• Core idea: Σ_i X_i e_i is a sum of r.v.s, so the CLT applies
• That plus some simple asymptotic theory allows us to say

    √N(β̂ − β) →d N(0, Ω)

• Converges in distribution to a Normal distribution with mean vector 0 and covariance matrix Ω:

    Ω = E[X_i X_i']⁻¹ E[X_i X_i' e_i²] E[X_i X_i']⁻¹

• No linearity assumption needed

20/58
Estimating the variance

• In large samples, then,

    √N(β̂ − β) ∼ N(0, Ω)

• How to estimate Ω? Plug-in principle again:

    Ω̂ = [Σ_i X_i X_i']⁻¹ [Σ_i X_i X_i' ê_i²] [Σ_i X_i X_i']⁻¹

• Replace e_i with its empirical counterpart, the residual ê_i = Y_i − X_i'β̂
• Replace the population moments of X_i with their sample counterparts
• The square roots of the diagonal of this covariance matrix are the "robust" or Huber-White standard errors that Stata commonly reports

21/58
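The sandwich formula above can be sketched in a few lines of NumPy (Python rather than the deck's R; the data-generating process here is made up for illustration). With errors whose variance grows with |x|, the robust standard errors pull away from the classical homoskedastic ones:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
# heteroskedastic errors: standard deviation grows with |x|
y = 1.0 + 2.0 * x + rng.normal(size=n) * (0.5 + np.abs(x))

beta = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta

bread = np.linalg.inv(X.T @ X)
meat = (X * e[:, None]**2).T @ X   # sum_i x_i x_i' e_i^2
vcov = bread @ meat @ bread        # Huber-White (HC0) sandwich
se_robust = np.sqrt(np.diag(vcov))

# classical homoskedastic SEs for comparison
sigma2 = e @ e / (n - 2)
se_classical = np.sqrt(np.diag(sigma2 * bread))
print(se_robust, se_classical)
```

With this design, the robust SE for the slope is noticeably larger than the classical one, since high-variance observations sit at extreme x values.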
Heteroskedasticity

• No assumption of homoskedasticity
• Heteroskedasticity will definitely occur when:
  - the CEF is linear, but σ²(x) = V[Y_i|X_i = x] is not constant in x
  - E[Y_i|X_i] is not linear, but we use linear regression to approximate it

22/58
2. Regression and Causality

23/58
Regression and causality

• In most econometrics textbooks, regression is defined without respect to causality
• But then, when is β̂ "biased"? The above derivations work for some E[Y_i|X_i]
• The question then is: when does knowing the CEF tell us something about causality?
• MHE argues that a regression is causal when the CEF it approximates is causal. Identification is king!
• We will show that under certain conditions, a regression of the outcome on the treatment and the covariates can recover a causal parameter, but perhaps not the one in which we are interested

24/58
Review

• Quick reminder: we have potential outcomes Y_i(1) and Y_i(0) and two parameters, the ATE and ATT:

    τ = E[Y_i(1) − Y_i(0)]
    τ_ATT = E[Y_i(1) − Y_i(0)|D_i = 1]

• We have shown in past weeks that these effects are identified when ignorability holds. MHE calls this the conditional independence assumption (CIA)

25/58
Linear constant effects model, binary treatment

• Experiment: with a simple experiment, we can rewrite the consistency assumption as a regression formula:

    Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
        = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
        = E[Y_i(0)] + τ D_i + (Y_i(0) − E[Y_i(0)])
        = μ_0 + τ D_i + v_0i

• Note that if ignorability holds (as in an experiment) for Y_i(0), then it will also hold for v_0i, since E[Y_i(0)] is constant. Thus this satisfies the usual assumptions for regression

26/58
Now with covariates

• Now assume no unmeasured confounders: Y_i(d) ⫫ D_i | X_i
• We will assume a linear model for the potential outcomes:

    Y_i(d) = α + τ·d + η_i

• Remember that linearity isn't an assumption if D_i is binary
• Effect of D_i is constant here; the η_i are the only source of individual variation, and we have E[η_i] = 0
• The consistency assumption allows us to write this as

    Y_i = α + τ D_i + η_i

27/58
Covariates in the error

• Let's assume that η_i is linear in X_i: η_i = X_i'γ + ν_i
• The new error is uncorrelated with X_i: E[ν_i|X_i] = 0
• This is an assumption! Might be false!
• Plug into the above:

    E[Y_i(d)|X_i] = E[Y_i|D_i, X_i] = α + τ D_i + E[η_i|X_i]
                  = α + τ D_i + X_i'γ + E[ν_i|X_i]
                  = α + τ D_i + X_i'γ

28/58
Summing up regression with constant effects

• Reviewing the assumptions we've used:
  - no unmeasured confounders
  - constant treatment effects
  - linearity of the treatment/covariates
• Under these, we can run the following regression to estimate the ATE, τ:

    Y_i = α + τ D_i + X_i'γ + ν_i

• Works with continuous or ordinal D_i if the effect of these variables is truly linear

29/58
OLS constant effects simulation

Model with linear covariates, constant 0 effect of treatment:

library(mvtnorm)
n <- 100
p <- 4
X <- rmvnorm(n = 100, mean = rep(0, p))
gamma <- c(2.74, 1.37, 1.37, 1.37)
y <- 21 + X %*% c(gamma) + rnorm(n)
alpha <- c(-1, 0.5, -0.5, -0.1)
dprobs <- boot::inv.logit(X %*% alpha)
d <- rbinom(n, size = 1, prob = dprobs)

30/58
OLS with no covariates

summary(lm(y ~ d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   22.036      0.458   48.13  < 2e-16
d             -2.869      0.654   -4.39 0.000029
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.3 on 98 degrees of freedom
Multiple R-squared: 0.164, Adjusted R-squared: 0.156
F-statistic: 19.2 on 1 and 98 DF, p-value: 0.000029

31/58
OLS with covariates

summary(lm(y ~ d + X))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  20.9952      0.159  132.07   <2e-16
d            -0.128       0.253   -0.5      0.62
X1            2.7368      0.126   21.75   <2e-16
X2            1.3677      0.114   12.01   <2e-16
X3            1.3673      0.130   10.51   <2e-16
X4            1.3570      0.106   12.80   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom
Multiple R-squared: 0.999, Adjusted R-squared: 0.999
F-statistic: 2.44e+04 on 5 and 94 DF, p-value: <2e-16

32/58
What happens with nonlinearity?

Suppose we can only observe the following covariates:

z1 <- exp(X[, 1]/2)
z2 <- X[, 2]/(1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3]/25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2
Implies that Y_i and D_i are functions of log(Z_i1), Z_i2, Z_i2·Z²_i1, 1/log(Z_i1), Z_i3^(1/3)/log(Z_i1), and Z_i4^(1/2)

Regression is a nonlinear function of the observed covariates

33/58
When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.8728    30.0799    0.73    0.469
d            -6.9292     3.3854   -2.05    0.043
z1            3.6311     0.2697   13.46   <2e-16
z2           -2.9033     3.6619   -0.79    0.430
z3            8.62022    4.31030   2.00    0.048
z4            0.4021     0.0329   12.23   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.4 on 94 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF, p-value: <2e-16

34/58
3. Regression with Heterogeneous Treatment Effects

35/58
Heterogeneous effects, binary treatment

• Completely randomized experiment:

    Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
        = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
        = μ_0 + τ_i D_i + (Y_i(0) − μ_0)
        = μ_0 + τ D_i + (Y_i(0) − μ_0) + (τ_i − τ)·D_i
        = μ_0 + τ D_i + ε_i

• Error term now includes two components:
  1. "Baseline" variation in the outcome: (Y_i(0) − μ_0)
  2. Variation in the treatment effect: (τ_i − τ)
• Easy to verify that under an experiment, E[ε_i|D_i] = 0
• Thus, OLS estimates the ATE with no covariates

36/58
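The claim above is easy to verify in a few lines: with a single binary regressor, the OLS slope is algebraically the difference in means, even with heterogeneous unit-level effects. A NumPy sketch (Python rather than the deck's R, with a simulated experiment):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
y0 = rng.normal(size=n)              # baseline outcomes Y_i(0)
tau_i = 1.0 + rng.normal(size=n)     # heterogeneous unit-level effects
d = rng.binomial(1, 0.5, size=n)     # completely randomized treatment
y = y0 + tau_i * d

# OLS of y on (1, d) ...
X = np.column_stack([np.ones(n), d])
beta = np.linalg.solve(X.T @ X, X.T @ y)

# ... reproduces the difference in means exactly
diff_means = y[d == 1].mean() - y[d == 0].mean()
print(np.isclose(beta[1], diff_means))  # True
```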
Adding covariates

• What happens with no unmeasured confounders? Need to condition on X_i now
• Remember identification of the ATE/ATT using iterated expectations
• The ATE is the weighted sum of CATEs:

    τ = Σ_x τ(x) Pr[X_i = x]

• ATE/ATT are weighted averages of CATEs
• What about the regression estimand, τ_R? How does it relate to the ATE/ATT?

37/58
Heterogeneous effects and regression

• Let's investigate this under a saturated regression model:

    Y_i = Σ_x B_xi α_x + τ_R D_i + e_i

• Use a dummy variable for each unique combination of X_i: B_xi = 1(X_i = x)
• Linear in X_i by construction

38/58
Investigating the regression coefficient

• How can we investigate τ_R? Well, we can rely on the regression anatomy:

    τ_R = Cov(Y_i, D_i − E[D_i|X_i]) / V(D_i − E[D_i|X_i])

• D_i − E[D_i|X_i] is the residual from a regression of D_i on the full set of dummies
• With a little work, we can show:

    τ_R = E[τ(X_i)(D_i − E[D_i|X_i])²] / E[(D_i − E[D_i|X_i])²] = E[τ(X_i)σ²_d(X_i)] / E[σ²_d(X_i)]

• σ²_d(x) = V[D_i|X_i = x] is the conditional variance of treatment assignment

39/58
ATE versus OLS

    τ_R = E[τ(X_i) W_i] = Σ_x τ(x) (σ²_d(x) / E[σ²_d(X_i)]) Pr[X_i = x]

• Compare to the ATE:

    τ = E[τ(X_i)] = Σ_x τ(x) Pr[X_i = x]

• Both weight strata relative to their size (Pr[X_i = x])
• OLS weights strata higher if the treatment variance in those strata (σ²_d(x)) is high relative to the average variance across strata (E[σ²_d(X_i)])
• The ATE weights only by their size

40/58
Regression weighting

    W_i = σ²_d(X_i) / E[σ²_d(X_i)]

• Why does OLS weight like this?
• OLS is a minimum-variance estimator → more weight to more precise within-strata estimates
• Within-strata estimates are most precise when the treatment is evenly spread and thus has the highest variance
• If D_i is binary, then we know the conditional variance will be

    σ²_d(x) = Pr[D_i = 1|X_i = x](1 − Pr[D_i = 1|X_i = x]) = e(x)(1 − e(x))

• Maximum variance with Pr[D_i = 1|X_i = x] = 1/2

41/58
OLS weighting example

• Binary covariate:

    Pr[X_i = 1] = 0.75    Pr[X_i = 0] = 0.25
    Pr[D_i = 1|X_i = 1] = 0.9    Pr[D_i = 1|X_i = 0] = 0.5
    σ²_d(1) = 0.09    σ²_d(0) = 0.25
    τ(1) = 1    τ(0) = −1

• Implies the ATE is τ = 0.5
• Average conditional variance: E[σ²_d(X_i)] = 0.13
• ⇒ weights: for X_i = 1, 0.09/0.13 = 0.692; for X_i = 0, 0.25/0.13 = 1.92

    τ_R = E[τ(X_i) W_i]
        = τ(1) W(1) Pr[X_i = 1] + τ(0) W(0) Pr[X_i = 0]
        = 1 × 0.692 × 0.75 + −1 × 1.92 × 0.25
        = 0.039

42/58
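The arithmetic in this example can be reproduced directly. A small Python sketch of the slide's numbers (the exact value is about 0.0385, which matches the slide's 0.039 once the weights are rounded first):

```python
# Reproducing the worked example: OLS's variance weighting
# shrinks the estimand nearly to zero even though the ATE is 0.5.
p_x = {1: 0.75, 0: 0.25}     # Pr[X = x]
e_x = {1: 0.9, 0: 0.5}       # propensity scores e(x)
tau_x = {1: 1.0, 0: -1.0}    # CATEs tau(x)

var_d = {x: e_x[x] * (1 - e_x[x]) for x in p_x}   # sigma^2_d(x)
avg_var = sum(var_d[x] * p_x[x] for x in p_x)     # E[sigma^2_d(X)] = 0.13

ate = sum(tau_x[x] * p_x[x] for x in p_x)
tau_r = sum(tau_x[x] * (var_d[x] / avg_var) * p_x[x] for x in p_x)

print(round(ate, 3), round(tau_r, 3))
```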
When will OLS estimate the ATE?

• When does τ = τ_R?
• Constant treatment effects: τ(x) = τ = τ_R
• Constant probability of treatment: e(x) = Pr[D_i = 1|X_i = x] = e
  - Implies that the OLS weights are 1
• Incorrect linearity assumption in X_i will lead to more bias

43/58
Other ways to use regression

• What's the path forward?
  - Accept the bias (might be relatively small with saturated models)
  - Use a different regression approach
• Let μ_d(x) = E[Y_i(d)|X_i = x] be the CEF for the potential outcome under D_i = d
• By consistency and no unmeasured confounders, we have μ_d(x) = E[Y_i|D_i = d, X_i = x]
• Estimate a regression of Y_i on X_i among the D_i = d group
• Then μ̂_d(x) is just a predicted value from that regression at X_i = x
• How can we use this?

44/58
Imputation estimators

• Impute the treated potential outcomes with Ŷ_i(1) = μ̂_1(X_i)
• Impute the control potential outcomes with Ŷ_i(0) = μ̂_0(X_i)
• Procedure:
  - regress Y_i on X_i in the treated group and get predicted values for all units (treated or control)
  - regress Y_i on X_i in the control group and get predicted values for all units (treated or control)
  - take the average difference between these predicted values
• More mathematically, it looks like this:

    τ̂_imp = (1/N) Σ_i [μ̂_1(X_i) − μ̂_0(X_i)]

• Sometimes called an imputation estimator

45/58
Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample
• Useful trick: fit a model on the entire data and use model.frame() to get the right design matrix

## heterogeneous effects
yhet <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(yhet ~ d + X)
mod1 <- lm(yhet ~ X, subset = d == 1)
mod0 <- lm(yhet ~ X, subset = d == 0)
y1imps <- predict(mod1, model.frame(mod))
y0imps <- predict(mod0, model.frame(mod))
mean(y1imps - y0imps)

[1] 0.61

46/58
Notes on imputation estimators

• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE
• Why don't people use this?
  - Most people don't know the results we've been talking about
  - Harder to implement than vanilla OLS
• Can use linear regression to estimate μ̂_d(x) = x'β̂_d
• Recent trend is to estimate μ̂_d(x) via non-parametric methods, such as:
  - kernel regression, local linear regression, regression trees, etc.
  - easiest is generalized additive models (GAMs)

47/58
Imputation estimator visualization

[Figure, slides 48-50: y plotted against x for treated and control units, with within-group regression fits used to impute the missing potential outcomes]

48-50/58
Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i

[Figure, slides 51-53: y plotted against x for treated and control units, where the relationship between y and x is nonlinear]

51-53/58
Using semiparametric regression

• Here the CEFs are nonlinear, but we don't know their form
• We can use GAMs from the mgcv package for a flexible estimate:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204  Scale est. = 0.0071351  n = 30

54/58
Using GAMs

[Figure, slides 55-57: y plotted against x for treated and control units, with smooth GAM fits for each group]

55-57/58
Limited dependent variables

• Usual advice: model the data from first principles. Logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is a-ok with limited DVs when:
  - binary treatment and no covariates (just diff-in-means)
  - binary treatment, discrete covariates, and saturated models (stratified diff-in-means)
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models
  - Could go wrong in small samples
  - If using nonlinear models, always get effects on the scale of the outcome

58/58
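The first "a-ok" case is easy to see numerically: with a binary outcome, a binary treatment, and no covariates, the linear probability model's slope is exactly the difference in proportions (the risk difference). A NumPy sketch (Python rather than the deck's R, with made-up response probabilities):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
d = rng.binomial(1, 0.5, size=n)
# binary outcome: Pr[Y=1] is 0.3 under control, 0.5 under treatment
y = rng.binomial(1, np.where(d == 1, 0.5, 0.3))

# linear probability model: OLS of y on (1, d)
X = np.column_stack([np.ones(n), d])
beta = np.linalg.solve(X.T @ X, X.T @ y)

# the OLS slope equals the difference in proportions exactly
diff = y[d == 1].mean() - y[d == 0].mean()
print(np.isclose(beta[1], diff))  # True
```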
Compare to within-strata means
bull The saturated model makes no assumptions about thebetween-strata relationships
bull Just calculates within-strata means
c1 lt- coef(lm(aauw ~ asfactor(totchi) - 1 data = girls))
c2 lt- with(girls tapply(aauw totchi mean narm = TRUE))
rbind(c1 c2)
0 1 2 3 4 5 6 7 8 9 10 12
c1 56 62 53 43 37 41 23 39 11 6 3 0
c2 56 62 53 43 37 41 23 39 11 6 3 0
16 58
Other justifications for OLS
bull Justification 2 119883prime119894 120573 is the best linear predictor (in a
mean-squared error sense) of 119884119894 Why 120573 = argmin119887 120124[(119884119894 minus 119883prime
119894 119887)1113569]
bull Justification 3 119883prime119894 120573 provides the minimum mean squared
error linear approxmiation to 119864[119884119894|119883119894]bull Even if the CEF is not linear a linear regression provides the
best linear approximation to that CEFbull Donrsquot need to believe the assumptions (linearity) in order to
use regression as a good approximation to the CEFbull Warning if the CEF is very nonlinear then this approximation
could be terrible
17 58
The error terms
bull Letrsquos define the error term 119890119894 equiv 119884119894 minus 119883prime119894 120573 so that
119884119894 = 119883prime119894 120573 + [119884119894 minus 119883prime
119894 120573] = 119883prime119894 120573 + 119890119894
bull Note the residual 119890119894 is uncorrelated with 119883119894
120124[119883119894119890119894] = 120124[119883119894(119884119894 minus 119883prime119894 120573)]
= 120124[119883119894119884119894] minus 120124[119883119894119883prime119894 120573]
= 120124[119883119894119884119894] minus 120124 1114094119883119894119883prime119894120124[119883119894119883prime
119894 ]minus1113568120124[119883119894119884119894]1114097= 120124[119883119894119884119894] minus 120124[119883119894119883prime
119894 ]120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]
= 120124[119883119894119884119894] minus 120124[119883119894119884119894] = 0
bull No assumptions on the linearity of 120124[119884119894|119883119894]
18 58
OLS estimator
bull We know the population value of 120573 is
120573 = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]
bull How do we get an estimator of thisbull Plug-in principle replace population expectation with
sample versions
=
โกโขโขโขโขโขโฃ1119873
1114012119894119883119894119883prime
119894
โคโฅโฅโฅโฅโฅโฆ
minus11135681119873
1114012119894119883119894119884119894
bull If you work through the matrix algebra this turns out to be
= (119831prime119831)minus1113568 119831prime119858
19 58
Asymptotic OLS inference
bull With this representation in hand we can write the OLSestimator as follows
= 120573 +
โกโขโขโขโขโขโฃ1114012
119894119883119894119883prime
119894
โคโฅโฅโฅโฅโฅโฆ
minus1113568
1114012119894119883119894119890119894
bull Core idea sum119894119883119894119890119894 is the sum of rvs so the CLT applies
bull That plus some simple asymptotic theory allows us to say
radic119873( minus 120573) 119873(0ฮฉ)
bull Converges in distribution to a Normal distribution with meanvector 0 and covariance matrix ฮฉ
ฮฉ = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119883prime
119894 1198901113569119894 ]120124[119883119894119883prime119894 ]minus1113568
bull No linearity assumption needed20 58
Estimating the variance
bull In large samples then
radic119873( minus 120573) sim 119873(0ฮฉ)
bull How to estimate ฮฉ Plug-in principle again
1114023ฮฉ =
โกโขโขโขโขโขโฃ1114012
119894119883119894119883prime
119894
โคโฅโฅโฅโฅโฅโฆ
minus1113568 โกโขโขโขโขโขโฃ1114012119894119883119894119883prime
119894 1113569119894
โคโฅโฅโฅโฅโฅโฆ
โกโขโขโขโขโขโฃ1114012
119894119883119894119883prime
119894
โคโฅโฅโฅโฅโฅโฆ
minus1113568
bull Replace 119890119894 with its emprical counterpart (residuals)119894 = 119884119894 minus 119883prime
119894 bull Replace the population moments of 119883119894 with their sample
counterpartsbull The square root of the diagonals of this covariance matrix are
the ldquorobustrdquo or Huber-White standard errors that Statacommonly report
21 58
Heteroskedasticity
bull No assumptions of homoskedasticitybull Heteroskedaticity will definitely occur when
CEF is linear but the 1205901113569(119909) = 120141[119884119894|119883119894 = 119909] is not constant in 119909 119864[119884119894|119883119894] is not linear but we use the linear regression to
approxmiate it
22 58
2 Regression andCausality
23 58
Regression and causality
bull Most econometrics textbooks regression defined withoutrespect to causality
bull But then when is ldquobiasedrdquo The above derivations work forsome 120124[119884119894|119883119894]
bull The question then is when does knowing the CEF tell ussomething about causality
bull MHE argues that a regression is causal when the CEF itapproximates is causal Identification is king
bull We will show that under certain conditions a regression of theoutcome on the treatment and the covariates can recover acausal parameter but perhaps not the one in which we areinterested
24 58
Review
bull Quick reminder we have potential outcomes 119884119894(1) and 119884119894(0)and two parameters the ATE and ATT
120591 = 119864[119884119894(1) minus 119884119894(0)]120591ATT = 119864[119884119894(1) minus 119884119894(0)|119863119894 = 1]
bull We have shown in past weeks that these effects are identifiedwhen ignorability holds MHE calls this the conditionalindependence assumption (CIA)
25 58
Linear constant effects model binarytreatment
bull Experiment with a simple experiment we can rewrite theconsistency assumption to be a regression formula
119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894
= 120124[119884119894(0)] + 120591119863119894 + (119884119894(0) minus 120124[119884119894(0)])= 1205831113567 + 120591119863119894 + 1199071113567119894
bull Note that if ignorability holds (as in an experiment) for 119884119894(0)then it will also hold for 1199071113567119894 since 120124[119884119894(0)] is constant Thusthis satifies the usual assumptions for regression
26 58
Now with covariates
bull Now assume no unmeasured confounders 119884119894(119889) โโ 119863119894|119883119894bull We will assume a linear model for the potential outcomes
119884119894(119889) = 120572 + 120591 sdot 119889 + 120578119894
bull Remember that linearity isnrsquot an assumption if 119863119894 is binarybull Effect of 119863119894 is constant here the 120578119894 are the only source of
individual variation and we have 119864[120578119894] = 0bull Consistency assumption allows us to write this as
119884119894 = 120572 + 120591119863119894 + 120578119894
27 58
Covariates in the error
• Let's assume that η_i is linear in X_i: η_i = X_i′γ + ν_i
• New error is uncorrelated with X_i: E[ν_i | X_i] = 0
• This is an assumption! Might be false!
• Plug into the above:

E[Y_i(d) | X_i] = E[Y_i | D_i, X_i] = α + τ D_i + E[η_i | X_i]
                = α + τ D_i + X_i′γ + E[ν_i | X_i]
                = α + τ D_i + X_i′γ
28 58
Summing up: regression with constant effects

• Reviewing the assumptions we've used:
  – no unmeasured confounders
  – constant treatment effects
  – linearity of the treatment/covariates
• Under these, we can run the following regression to estimate the ATE, τ:

Y_i = α + τ D_i + X_i′γ + ν_i

• Works with continuous or ordinal D_i if the effect of these variables is truly linear
29 58
OLS constant effects simulation
Model with linear covariates and a constant 0 effect of the treatment:

library(mvtnorm)
n <- 100
p <- 4
X <- rmvnorm(n = 100, mean = rep(0, p))
gamma <- c(2.74, 1.37, 1.37, 1.37)
y <- 2.1 + X %*% gamma + rnorm(n)
alpha <- c(-1, 0.5, -0.5, -0.1)
d.probs <- boot::inv.logit(X %*% alpha)
d <- rbinom(n, size = 1, prob = d.probs)
30 58
OLS with no covariates
summary(lm(y ~ d))
Coefficients
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.2036      0.458    4.81  < 2e-16
d            -2.869       0.654   -4.39 0.000029
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.3 on 98 degrees of freedom
Multiple R-squared: 0.164,  Adjusted R-squared: 0.156
F-statistic: 19.2 on 1 and 98 DF,  p-value: 0.000029
31 58
OLS with covariates
summary(lm(y ~ d + X))
Coefficients
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.09952      0.159   13.21   <2e-16
d           -0.128        0.253   -0.51     0.62
X1           2.7368       0.126   21.75   <2e-16
X2           1.3677       0.114   12.01   <2e-16
X3           1.3673       0.130   10.51   <2e-16
X4           1.3570       0.106   12.80   <2e-16
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom
Multiple R-squared: 0.999,  Adjusted R-squared: 0.999
F-statistic: 2.44e+04 on 5 and 94 DF,  p-value: <2e-16
32 58
What happens with nonlinearity
Suppose we can only observe the following covariates:

z1 <- exp(X[, 1] / 2)
z2 <- X[, 2] / (1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3] / 25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2
Implies that Y_i and D_i are functions of log(Z_i1), Z_i2, Z²_i1 Z_i2, 1/log(Z_i1), Z_i3/log(Z_i1), and Z_i4^(1/2)
Regression is a nonlinear function of the observed covariates
33 58
When linearity goes wrong
summary(lm(y ~ d + z1 + z2 + z3 + z4))
Coefficients
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.8728    30.0799    0.73    0.469
d            -6.9292     3.3854   -2.05    0.043
z1            3.6311     0.2697   13.46   <2e-16
z2           -2.9033     3.6619   -0.79    0.430
z3            8.6202     4.3103    2.00    0.048
z4            0.4021     0.0329   12.23   <2e-16
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.4 on 94 degrees of freedom
Multiple R-squared: 0.855,  Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF,  p-value: <2e-16
34 58
3. Regression with Heterogeneous Treatment Effects
35 58
Heterogeneous effects, binary treatment

• Completely randomized experiment:

Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
    = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
    = μ_0 + τ_i D_i + (Y_i(0) − μ_0)
    = μ_0 + τ D_i + (Y_i(0) − μ_0) + (τ_i − τ) D_i
    = μ_0 + τ D_i + ε_i

• Error term now includes two components:
  1. "Baseline" variation in the outcome: (Y_i(0) − μ_0)
  2. Variation in the treatment effect: (τ_i − τ)
• Easy to verify that under an experiment, E[ε_i | D_i] = 0
• Thus, OLS estimates the ATE with no covariates
36 58
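A small simulation backs this up — my own illustration (Python/NumPy, assuming normal baseline outcomes and normally distributed unit-level effects τ_i): with heterogeneous effects and complete randomization, the bivariate OLS slope still recovers the sample ATE.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
y0 = rng.normal(0, 1, n)
tau_i = rng.normal(2, 1, n)          # heterogeneous unit-level effects
y1 = y0 + tau_i
d = rng.binomial(1, 0.5, n)          # completely randomized treatment
y = d * y1 + (1 - d) * y0            # observed outcome

sample_ate = (y1 - y0).mean()        # average of the tau_i
slope = np.cov(y, d)[0, 1] / np.var(d, ddof=1)  # OLS slope = diff-in-means
print(slope, sample_ate)             # agree closely at this sample size
```

The error term ε_i here bundles both the baseline variation and (τ_i − τ)D_i, but randomization keeps it mean-independent of D_i, which is all OLS needs.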
Adding covariates
• What happens with no unmeasured confounders? Need to condition on X_i now
• Remember identification of the ATE/ATT using iterated expectations
• ATE is the weighted sum of CATEs:

τ = Σ_x τ(x) Pr[X_i = x]

• ATE/ATT are weighted averages of CATEs
• What about the regression estimand, τ_R? How does it relate to the ATE/ATT?
37 58
Heterogeneous effects and regression
• Let's investigate this under a saturated regression model:

Y_i = Σ_x B_xi α_x + τ_R D_i + e_i

• Use a dummy variable for each unique combination of X_i: B_xi = 1(X_i = x)
• Linear in X_i by construction
38 58
Investigating the regression coefficient
• How can we investigate τ_R? Well, we can rely on the regression anatomy:

τ_R = Cov(Y_i, D_i − E[D_i | X_i]) / Var(D_i − E[D_i | X_i])

• D_i − E[D_i | X_i] is the residual from a regression of D_i on the full set of dummies
• With a little work, we can show:

τ_R = E[τ(X_i) (D_i − E[D_i | X_i])²] / E[(D_i − E[D_i | X_i])²] = E[τ(X_i) σ²_d(X_i)] / E[σ²_d(X_i)]

• σ²_d(x) = Var[D_i | X_i = x] is the conditional variance of treatment assignment
39 58
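This identity has an exact finite-sample analogue: with saturated dummies, the coefficient on D_i equals a variance-weighted average of the within-stratum differences in means. A sketch (my own, in Python/NumPy with a single binary covariate, residualizing D_i by hand as the anatomy formula suggests):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2_000
x = rng.integers(0, 2, n)                 # one binary covariate (two strata)
p = np.where(x == 1, 0.9, 0.5)            # propensity differs by stratum
d = rng.binomial(1, p)
y = 1.0 * d * x - 1.0 * d * (1 - x) + rng.normal(0, 1, n)  # tau(1)=1, tau(0)=-1

# tau_R via regression anatomy: residualize D on the strata dummies
d_resid = d - np.where(x == 1, d[x == 1].mean(), d[x == 0].mean())
tau_r = (d_resid * y).sum() / (d_resid ** 2).sum()

# same number as a variance-weighted average of within-stratum diff-in-means
num = den = 0.0
for s in (0, 1):
    m = x == s
    phat = d[m].mean()                    # within-stratum treated share
    cate = y[m & (d == 1)].mean() - y[m & (d == 0)].mean()
    w = m.sum() * phat * (1 - phat)       # n_x * sigma^2_d(x)
    num += w * cate
    den += w
print(tau_r, num / den)                   # identical up to floating point
```

The equality is exact in-sample, not just asymptotic, which is why τ_R pulls toward strata where the treatment variance is large.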
ATE versus OLS
τ_R = E[τ(X_i) W_i] = Σ_x τ(x) · (σ²_d(x) / E[σ²_d(X_i)]) · Pr[X_i = x]

• Compare to the ATE:

τ = E[τ(X_i)] = Σ_x τ(x) Pr[X_i = x]

• Both weight strata relative to their size (Pr[X_i = x])
• OLS weights strata higher if the treatment variance in those strata (σ²_d(x)) is high relative to the average variance across strata (E[σ²_d(X_i)])
• The ATE weights strata only by their size
40 58
Regression weighting
W_i = σ²_d(X_i) / E[σ²_d(X_i)]

• Why does OLS weight like this?
• OLS is a minimum-variance estimator ⇝ more weight to more precise within-strata estimates
• Within-strata estimates are most precise when the treatment is evenly spread and thus has the highest variance
• If D_i is binary, then we know the conditional variance will be:

σ²_d(x) = Pr[D_i = 1 | X_i = x] (1 − Pr[D_i = 1 | X_i = x]) = e(x)(1 − e(x))

• Maximum variance with Pr[D_i = 1 | X_i = x] = 1/2
41 58
OLS weighting example

• Binary covariate:

Pr[X_i = 1] = 0.75    Pr[X_i = 0] = 0.25
Pr[D_i = 1 | X_i = 1] = 0.9    Pr[D_i = 1 | X_i = 0] = 0.5
σ²_d(1) = 0.09    σ²_d(0) = 0.25
τ(1) = 1    τ(0) = −1

• Implies the ATE is τ = 0.5
• Average conditional variance: E[σ²_d(X_i)] = 0.13
• ⇝ weights: for X_i = 1, 0.09/0.13 = 0.692; for X_i = 0, 0.25/0.13 = 1.92

τ_R = E[τ(X_i) W_i]
    = τ(1) W(1) Pr[X_i = 1] + τ(0) W(0) Pr[X_i = 0]
    = 1 × 0.692 × 0.75 + (−1) × 1.92 × 0.25
    = 0.039
42 58
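The slide's arithmetic is easy to reproduce (my own sketch; the unrounded value is 0.005/0.13 ≈ 0.0385, which matches the slide's 0.039 once the weights are rounded first):

```python
# strata sizes, propensities, and CATEs taken from the slide
pr_x = {1: 0.75, 0: 0.25}
e = {1: 0.9, 0: 0.5}
tau = {1: 1.0, 0: -1.0}

var_d = {x: e[x] * (1 - e[x]) for x in (0, 1)}      # sigma^2_d(x)
avg_var = sum(var_d[x] * pr_x[x] for x in (0, 1))   # E[sigma^2_d(X)] = 0.13
w = {x: var_d[x] / avg_var for x in (0, 1)}         # OLS weights

ate = sum(tau[x] * pr_x[x] for x in (0, 1))
tau_r = sum(tau[x] * w[x] * pr_x[x] for x in (0, 1))
print(ate, round(tau_r, 3))                         # -> 0.5 0.038
```

A stark case: the ATE is 0.5, but the regression estimand is nearly zero because OLS drastically upweights the small, evenly treated stratum with τ(0) = −1.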
When will OLS estimate the ATE
• When does τ = τ_R?
  – Constant treatment effects: τ(x) = τ = τ_R
  – Constant probability of treatment: e(x) = Pr[D_i = 1 | X_i = x] = e
    ⇝ implies that the OLS weights are 1
• Incorrect linearity assumption in X_i will lead to more bias
43 58
Other ways to use regression
• What's the path forward?
  – Accept the bias (might be relatively small with saturated models)
  – Use a different regression approach
• Let μ_d(x) = E[Y_i(d) | X_i = x] be the CEF for the potential outcome under D_i = d
• By consistency and no unmeasured confounders, we have μ_d(x) = E[Y_i | D_i = d, X_i = x]
• Estimate a regression of Y_i on X_i among the D_i = d group
• Then μ̂_d(x) is just a predicted value from that regression at X_i = x
• How can we use this?
44 58
Imputation estimators
• Impute the treated potential outcomes with Ŷ_i(1) = μ̂_1(X_i)
• Impute the control potential outcomes with Ŷ_i(0) = μ̂_0(X_i)
• Procedure:
  – Regress Y_i on X_i in the treated group and get predicted values for all units (treated or control)
  – Regress Y_i on X_i in the control group and get predicted values for all units (treated or control)
  – Take the average difference between these predicted values
• More mathematically, it looks like this:

τ̂_imp = (1/N) Σ_i [μ̂_1(X_i) − μ̂_0(X_i)]

• Sometimes called an imputation estimator
45 58
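The same procedure in a language-agnostic form — a minimal sketch (my own, in Python/NumPy, with made-up noiseless linear CEFs μ_1(x) = 2 + 3x and μ_0(x) = 1 + 2x so the answer can be checked exactly): fit within each treatment group, impute both potential outcomes for every unit, and average the difference.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.normal(0, 1, n)
d = rng.binomial(1, 0.5, n)
y = np.where(d == 1, 2 + 3 * x, 1 + 2 * x)   # mu1(x) = 2+3x, mu0(x) = 1+2x

X = np.column_stack([np.ones(n), x])         # design matrix with intercept
beta1, *_ = np.linalg.lstsq(X[d == 1], y[d == 1], rcond=None)
beta0, *_ = np.linalg.lstsq(X[d == 0], y[d == 0], rcond=None)

# impute both potential outcomes for every unit, then average the difference
tau_imp = (X @ beta1 - X @ beta0).mean()
print(tau_imp)   # equals 1 + mean(x) here, since mu1(x) - mu0(x) = 1 + x
```

With noiseless linear CEFs the within-group fits are exact, so the estimator returns the in-sample average of μ_1(X_i) − μ_0(X_i) to machine precision.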
Simple imputation estimator
• Use predict() from the within-group models on the data from the entire sample
• Useful trick: fit a model on the entire data and use model.frame() to get the right design matrix

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61
46 58
Notes on imputation estimators
• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE
• Why don't people use this?
  – Most people don't know the results we've been talking about
  – Harder to implement than vanilla OLS
• Can use linear regression to estimate μ̂_d(x) = x′β̂_d
• Recent trend is to estimate μ̂_d(x) via non-parametric methods, such as:
  – kernel regression, local linear regression, regression trees, etc.
  – easiest is generalized additive models (GAMs)
47 58
Imputation estimator visualization
[Figure: scatterplot of y against x for treated and control units, building up the group-specific regression fits used for imputation]
48 58
Imputation estimator visualization
[Figure: scatterplot of y against x for treated and control units, building up the group-specific regression fits used for imputation]
49 58
Imputation estimator visualization
[Figure: scatterplot of y against x for treated and control units, building up the group-specific regression fits used for imputation]
50 58
Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i:
[Figure: scatterplot of y against x with nonlinear CEFs for treated and control units]
51 58
Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i:
[Figure: scatterplot of y against x with nonlinear CEFs for treated and control units]
52 58
Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i:
[Figure: scatterplot of y against x with nonlinear CEFs for treated and control units]
53 58
Using semiparametric regression

• Here, the CEFs are nonlinear, but we don't know their form
• We can use GAMs from the mgcv package for flexible estimation:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204   Scale est. = 0.0071351   n = 30
54 58
Using GAMs
[Figure: GAM fits of y on x for treated and control units]
55 58
Using GAMs
[Figure: GAM fits of y on x for treated and control units]
56 58
Using GAMs
[Figure: GAM fits of y on x for treated and control units]
57 58
Limited dependent variables
• Usual advice: model the data from first principles! Logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is A-OK with limited DVs when:
  – binary treatment and no covariates (just diff-in-means)
  – binary treatment, discrete covariates, and saturated models (stratified diff-in-means)
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong
• Even in unsaturated models, the marginal effects from OLS are often decent compared to nonlinear models
  – could go wrong in small samples
  – if using nonlinear models, always get effects on the scale of the outcome
58 58
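The first "OLS is A-OK" case above is an algebraic fact, not an approximation. A quick check (my own sketch, Python/NumPy): with a binary outcome, a binary treatment, and no covariates, the OLS coefficient is exactly the difference in proportions (the risk difference).

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5_000
d = rng.binomial(1, 0.5, n)
y = rng.binomial(1, np.where(d == 1, 0.6, 0.4))   # binary outcome (limited DV)

slope = np.cov(y, d)[0, 1] / np.var(d, ddof=1)    # OLS slope of y on d
risk_diff = y[d == 1].mean() - y[d == 0].mean()
print(slope, risk_diff)   # identical: OLS = diff-in-means even for a binary DV
```

Since the linear model here is saturated in D_i, no distributional assumption is being imposed; the logit/probit question only bites once unsaturated covariate adjustment enters.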
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
-
Regression as parametric modeling
bull Gauss-Markov assumptions linearity iid sample full rank 119883119894 zero conditional mean error
homoskedasticity
bull OLS is BLUE plus normality of the errors and we get smallsample SEs
bull What is the basic approach here It is a model for theconditional distribution of 119884119894 given 119883119894
[119884119894|119883119894] sim 119873(119883prime119894 120573 1205901113569)
bull MLE from this model is the usual OLS estimator OLS
OLS =
โกโขโขโขโขโขโฃ1198731114012119894=1113568
119883119894119883prime119894
โคโฅโฅโฅโฅโฅโฆ
minus11135681198731114012119894=1113568
119883119894119884119894
5 58
Agnostic views on regression
[119884119894|119883119894] sim 119873(119883prime119894 120573 1205901113569)
bull Strong distributional assumption on 119884119894bull Properties like BLUE or MLE properties depend on these
assumptions holdingbull Alternative take an agnostic view on regression
Use OLS without believing these assumptions
bull Lose the distributional assumptions focus on the conditionalexpectation function (CEF)
120583(119909) = 120124[119884119894|119883119894 = 119909] = 1114012119910119910 sdot โ[119884119894 = 119910|119883119894 = 119909]
6 58
Justifying linear regression
bull Define linear regression
120573 = argmin119887
120124[(119884119894 minus 119883prime119894 119887)1113569]
bull The solution to this is the following
120573 = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]
bull Note that the is the population coefficient vector not theestimator yet
7 58
Regression anatomy
bull Consider simple linear regression
(120572 120573) = argmin119886119887
120124 1114094(119884119894 minus 119886 minus 119887119883119894)11135691114097
bull In this case we can write the populationtrue slope 120573 as
120573 = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894] =
Cov(119884119894 119883119894)120141[119883119894]
bull With more covariates 120573 is more complicated but we can stillwrite it like this
bull Let 119896119894 be the residual from a regression of 119883119896119894 on all the otherindependent variables Then 120573119896 the coefficient for 119883119896119894 is
120573119896 =Cov(119884119894 119896119894)
119881(119896119894)
8 58
Justification 1 Linear CEFs
bull Justification 1 if the CEF is linear the population regressionfunction is it That is if 119864[119884119894|119883119894] = 119883prime
119894 119887 then 119887 = 120573bull When would we expect the CEF to be linear Two cases
1 Outcome and covariates are multivariate normal2 Linear regression model is saturated
bull A model is saturated if there are as many parameters asthere are possible combination of the 119883119894 variables
9 58
Saturated model example
bull Two binary variables 1198831113568119894 for incumbency status and 1198831113569119894 forparty of the candidate
bull Four possible values of 119883119894 four possible values of 120583(119883119894)
119864[119884119894|1198831113568119894 = 01198831113569119894 = 0] = 120572119864[119884119894|1198831113568119894 = 11198831113569119894 = 0] = 120572 + 120573119864[119884119894|1198831113568119894 = 01198831113569119894 = 1] = 120572 + 120574119864[119884119894|1198831113568119894 = 11198831113569119894 = 1] = 120572 + 120573 + 120574 + 120575
bull We can write the CEF as follows
119864[119884119894|1198831113568119894 1198831113569119894] = 120572 + 1205731198831113568119894 + 1205741198831113569119894 + 120575(11988311135681198941198831113569119894)
10 58
Saturated models example
119864[119884119894|1198831113568119894 1198831113569119894] = 120572 + 1205731198831113568119894 + 1205741198831113569119894 + 120575(11988311135681198941198831113569119894)
bull Basically each value of 120583(119883119894) is being estimated separately within-strata estimation No borrowing of information from across values of 119883119894
bull Requires a set of dummies for each categorical variable plusall interactions
bull Or a series of dummies for each unique combination of 119883119894bull This makes linearity hold mechanically and so linearity is
not an assumption Just a fact about saturated CEFs saturated models for limited dependent variables = A-OK
11 58
Saturated model example
bull Washington (AER) data on the effects of daughtersbull Wersquoll look at the relationship between voting and number of
kids (causal)
girls lt- foreignreaddta(rdquogirlsdtardquo)
head(girls[ c(rdquonamerdquo rdquototchirdquo rdquoaauwrdquo)])
name totchi aauw
1 ABERCROMBIE NEIL 0 100
2 ACKERMAN GARY L 3 88
3 ADERHOLT ROBERT B 0 0
4 ALLEN THOMAS H 2 100
5 ANDREWS ROBERT E 2 100
6 ARCHER WR 7 0
12 58
Linear model
summary(lm(aauw ~ totchi data = girls))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 6131 181 3381 lt2e-16
totchi -533 062 -859 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 42 on 1733 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared 00408 Adjusted R-squared 00403
F-statistic 738 on 1 and 1733 DF p-value lt2e-16
13 58
Saturated modelsummary(lm(aauw ~ asfactor(totchi) data = girls))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 5641 276 2042 lt 2e-16
asfactor(totchi)1 545 411 133 01851
asfactor(totchi)2 -380 327 -116 02454
asfactor(totchi)3 -1365 345 -395 81e-05
asfactor(totchi)4 -1931 401 -482 16e-06
asfactor(totchi)5 -1546 485 -319 00015
asfactor(totchi)6 -3359 1042 -322 00013
asfactor(totchi)7 -1713 1141 -150 01336
asfactor(totchi)8 -5533 1228 -451 70e-06
asfactor(totchi)9 -5041 2408 -209 00364
asfactor(totchi)10 -5341 2090 -256 00107
asfactor(totchi)12 -5641 4153 -136 01745
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 41 on 1723 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared 00506 Adjusted R-squared 00446
F-statistic 836 on 11 and 1723 DF p-value 184e-1414 58
Saturated model minus the constantsummary(lm(aauw ~ asfactor(totchi) - 1 data = girls))
Coefficients
Estimate Std Error t value Pr(gt|t|)
asfactor(totchi)0 5641 276 2042 lt2e-16
asfactor(totchi)1 6186 305 2031 lt2e-16
asfactor(totchi)2 5262 175 3013 lt2e-16
asfactor(totchi)3 4276 207 2062 lt2e-16
asfactor(totchi)4 3711 290 1279 lt2e-16
asfactor(totchi)5 4095 399 1027 lt2e-16
asfactor(totchi)6 2282 1005 227 00233
asfactor(totchi)7 3929 1107 355 00004
asfactor(totchi)8 108 1196 009 09278
asfactor(totchi)9 600 2392 025 08020
asfactor(totchi)10 300 2072 014 08849
asfactor(totchi)12 000 4143 000 10000
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 41 on 1723 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared 0587 Adjusted R-squared 0584
F-statistic 204 on 12 and 1723 DF p-value lt2e-1615 58
Compare to within-strata means
bull The saturated model makes no assumptions about thebetween-strata relationships
bull Just calculates within-strata means
c1 lt- coef(lm(aauw ~ asfactor(totchi) - 1 data = girls))
c2 lt- with(girls tapply(aauw totchi mean narm = TRUE))
rbind(c1 c2)
0 1 2 3 4 5 6 7 8 9 10 12
c1 56 62 53 43 37 41 23 39 11 6 3 0
c2 56 62 53 43 37 41 23 39 11 6 3 0
16 58
Other justifications for OLS
bull Justification 2 119883prime119894 120573 is the best linear predictor (in a
mean-squared error sense) of 119884119894 Why 120573 = argmin119887 120124[(119884119894 minus 119883prime
119894 119887)1113569]
bull Justification 3 119883prime119894 120573 provides the minimum mean squared
error linear approxmiation to 119864[119884119894|119883119894]bull Even if the CEF is not linear a linear regression provides the
best linear approximation to that CEFbull Donrsquot need to believe the assumptions (linearity) in order to
use regression as a good approximation to the CEFbull Warning if the CEF is very nonlinear then this approximation
could be terrible
17 58
The error terms
bull Letrsquos define the error term 119890119894 equiv 119884119894 minus 119883prime119894 120573 so that
119884119894 = 119883prime119894 120573 + [119884119894 minus 119883prime
119894 120573] = 119883prime119894 120573 + 119890119894
bull Note the residual 119890119894 is uncorrelated with 119883119894
120124[119883119894119890119894] = 120124[119883119894(119884119894 minus 119883prime119894 120573)]
= 120124[119883119894119884119894] minus 120124[119883119894119883prime119894 120573]
= 120124[119883119894119884119894] minus 120124 1114094119883119894119883prime119894120124[119883119894119883prime
119894 ]minus1113568120124[119883119894119884119894]1114097= 120124[119883119894119884119894] minus 120124[119883119894119883prime
119894 ]120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]
= 120124[119883119894119884119894] minus 120124[119883119894119884119894] = 0
bull No assumptions on the linearity of 120124[119884119894|119883119894]
18 58
OLS estimator
bull We know the population value of 120573 is
120573 = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]
bull How do we get an estimator of thisbull Plug-in principle replace population expectation with
sample versions
=
โกโขโขโขโขโขโฃ1119873
1114012119894119883119894119883prime
119894
โคโฅโฅโฅโฅโฅโฆ
minus11135681119873
1114012119894119883119894119884119894
bull If you work through the matrix algebra this turns out to be
= (119831prime119831)minus1113568 119831prime119858
19 58
Asymptotic OLS inference
bull With this representation in hand we can write the OLSestimator as follows
= 120573 +
โกโขโขโขโขโขโฃ1114012
119894119883119894119883prime
119894
โคโฅโฅโฅโฅโฅโฆ
minus1113568
1114012119894119883119894119890119894
bull Core idea sum119894119883119894119890119894 is the sum of rvs so the CLT applies
bull That plus some simple asymptotic theory allows us to say
radic119873( minus 120573) 119873(0ฮฉ)
bull Converges in distribution to a Normal distribution with meanvector 0 and covariance matrix ฮฉ
ฮฉ = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119883prime
119894 1198901113569119894 ]120124[119883119894119883prime119894 ]minus1113568
bull No linearity assumption needed20 58
Estimating the variance
bull In large samples then
radic119873( minus 120573) sim 119873(0ฮฉ)
bull How to estimate ฮฉ Plug-in principle again
1114023ฮฉ =
โกโขโขโขโขโขโฃ1114012
119894119883119894119883prime
119894
โคโฅโฅโฅโฅโฅโฆ
minus1113568 โกโขโขโขโขโขโฃ1114012119894119883119894119883prime
119894 1113569119894
โคโฅโฅโฅโฅโฅโฆ
โกโขโขโขโขโขโฃ1114012
119894119883119894119883prime
119894
โคโฅโฅโฅโฅโฅโฆ
minus1113568
bull Replace 119890119894 with its emprical counterpart (residuals)119894 = 119884119894 minus 119883prime
119894 bull Replace the population moments of 119883119894 with their sample
counterpartsbull The square root of the diagonals of this covariance matrix are
the ldquorobustrdquo or Huber-White standard errors that Statacommonly report
21 58
Heteroskedasticity
bull No assumptions of homoskedasticitybull Heteroskedaticity will definitely occur when
CEF is linear but the 1205901113569(119909) = 120141[119884119894|119883119894 = 119909] is not constant in 119909 119864[119884119894|119883119894] is not linear but we use the linear regression to
approxmiate it
22 58
2 Regression andCausality
23 58
Regression and causality
bull Most econometrics textbooks regression defined withoutrespect to causality
bull But then when is ldquobiasedrdquo The above derivations work forsome 120124[119884119894|119883119894]
bull The question then is when does knowing the CEF tell ussomething about causality
bull MHE argues that a regression is causal when the CEF itapproximates is causal Identification is king
bull We will show that under certain conditions a regression of theoutcome on the treatment and the covariates can recover acausal parameter but perhaps not the one in which we areinterested
24 58
Review
bull Quick reminder we have potential outcomes 119884119894(1) and 119884119894(0)and two parameters the ATE and ATT
120591 = 119864[119884119894(1) minus 119884119894(0)]120591ATT = 119864[119884119894(1) minus 119884119894(0)|119863119894 = 1]
bull We have shown in past weeks that these effects are identifiedwhen ignorability holds MHE calls this the conditionalindependence assumption (CIA)
25 58
Linear constant effects model binarytreatment
bull Experiment with a simple experiment we can rewrite theconsistency assumption to be a regression formula
119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894
= 120124[119884119894(0)] + 120591119863119894 + (119884119894(0) minus 120124[119884119894(0)])= 1205831113567 + 120591119863119894 + 1199071113567119894
bull Note that if ignorability holds (as in an experiment) for 119884119894(0)then it will also hold for 1199071113567119894 since 120124[119884119894(0)] is constant Thusthis satifies the usual assumptions for regression
26 58
Now with covariates
bull Now assume no unmeasured confounders 119884119894(119889) โโ 119863119894|119883119894bull We will assume a linear model for the potential outcomes
119884119894(119889) = 120572 + 120591 sdot 119889 + 120578119894
bull Remember that linearity isnrsquot an assumption if 119863119894 is binarybull Effect of 119863119894 is constant here the 120578119894 are the only source of
individual variation and we have 119864[120578119894] = 0bull Consistency assumption allows us to write this as
119884119894 = 120572 + 120591119863119894 + 120578119894
27 58
Covariates in the error
bull Letrsquos assume that 120578119894 is linear in 119883119894 120578119894 = 119883prime119894 120574 + 120584119894
bull New error is uncorrelated with 119883119894 120124[120584119894|119883119894] = 0bull This is an assumption Might be falsebull Plug into the above
120124[119884119894(119889)|119883119894] = 119864[119884119894|119863119894 119883119894] = 120572 + 120591119863119894 + 119864[120578119894|119883119894]= 120572 + 120591119863119894 + 119883prime
119894 120574 + 119864[120584119894|119883119894]= 120572 + 120591119863119894 + 119883prime
119894 120574
28 58
Summing up regression with constanteffects
bull Reviewing the assumptions wersquove used no unmeasured confounders constant treatment effects linearity of the treatmentcovariates
bull Under these we can run the following regression to estimatethe ATE 120591
119884119894 = 120572 + 120591119863119894 + 119883prime119894 120574 + 120584119894
bull Works with continuous or ordinal 119863119894 if linearity in the effect ofthese variables is truly linear
29 58
OLS constant effects simulation
Model with linear covariates constant 0 effect of treatment
library(mvtnorm)
n lt- 100
p lt- 4
X lt- rmvnorm(n = 100 mean = rep(0 p))
gamma lt- c(274 137 137 137)
y lt- 210 + X c(gamma) + rnorm(n)
alpha lt- c(-1 05 -05 -01)
dprobs lt- bootinvlogit(X alpha)
d lt- rbinom(n size = 1 prob = dprobs)
30 58
OLS with no covariates
summary(lm(y ~ d))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 22036 458 4813 lt 2e-16
d -2869 654 -439 0000029
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 33 on 98 degrees of freedom
Multiple R-squared 0164 Adjusted R-squared 0156
F-statistic 192 on 1 and 98 DF p-value 0000029
31 58
OLS with covariates
summary(lm(y ~ d + X))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 209952 0159 13207 lt2e-16
d -0128 0253 -05 062
X1 27368 0126 2175 lt2e-16
X2 13677 0114 1201 lt2e-16
X3 13673 0130 1051 lt2e-16
X4 13570 0106 1280 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 1 on 94 degrees of freedom
Multiple R-squared 0999 Adjusted R-squared 0999
F-statistic 244e+04 on 5 and 94 DF p-value lt2e-16
32 58
What happens with nonlinearity
Suppose we can only observe the following covariates
z1 lt- exp(X[ 1]2)
z2 lt- X[ 2](1 + exp(X[ 1])) + 10
z3 lt- (X[ 1] X[ 3]25 + 06)^3
z4 lt- (X[ 2] + X[ 4] + 20)^2
Implies that 119884119894 and 119863119894 are functions of log(1198851198941113568) 1198851198941113569 119885111356911989411135681198851198941113569
1 log(1198851198941113568) 1198851198941113570 log(1198851198941113568) and 119883111356811135691198941113571
Regression is a nonlinear function of the observed covariates
33 58
When linearity goes wrong
summary(lm(y ~ d + z1 + z2 + z3 + z4))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 218728 300799 073 0469
d -69292 33854 -205 0043
z1 363110 26970 1346 lt2e-16
z2 -29033 36619 -079 0430
z3 862022 431030 200 0048
z4 04021 00329 1223 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 14 on 94 degrees of freedom
Multiple R-squared 0855 Adjusted R-squared 0847
F-statistic 111 on 5 and 94 DF p-value lt2e-16
34 58
3 Regression withHeterogeneousTreatment Effects
35 58
Heterogeneous effects binary treatment
bull Completely randomized experiment
119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894
= 1205831113567 + 120591119894119863119894 + (119884119894(0) minus 1205831113567)= 1205831113567 + 120591119863119894 + (119884119894(0) minus 1205831113567) + (120591119894 minus 120591) sdot 119863119894
= 1205831113567 + 120591119863119894 + 120576119894
bull Error term now includes two components1 ldquoBaselinerdquo variation in the outcome (119884119894(0) minus 1205831113567)2 Variation in the treatment effect (120591119894 minus 120591)
bull Easy to verify that under experiment 120124[120576119894|119863119894] = 0bull Thus OLS estimates the ATE with no covariates
36 58
Adding covariates
bull What happens with no unmeasured confounders Need tocondition on 119883119894 now
bull Remember identification of the ATEATT using iteratedexpectations
bull ATE is the weighted sum of CATEs
120591 = 1114012119909120591(119909) Pr[119883119894 = 119909]
bull ATEATT are weighted averages of CATEsbull What about the regression estimand 120591119877 How does it related
to the ATEATT
37 58
Heterogeneous effects and regression
bull Letrsquos investigate this under a saturated regression model
119884119894 = 1114012119909119861119909119894120572119909 + 120591119877119863119894 + 119890119894
bull Use a dummy variable for each unique combination of 119883119894119861119909119894 = 120128(119883119894 = 119909)
bull Linear in 119883119894 by construction
38 58
Investigating the regression coefficient
bull How can we investigate 120591119877 Well we can rely on theregression anatomy
120591119877 = Cov(119884119894 119863119894 minus 119864[119863119894|119883119894])120141(119863119894 minus 119864[119863119894|119883119894])
bull 119863119894 minus 120124[119863119894|119883119894] is the residual from a regression of 119863119894 on the fullset of dummies
bull With a little work we can show
120591119877 =120124 1114094120591(119883119894)(119863119894 minus 120124[119863119894|119883119894])11135691114097
120124[(119863119894 minus 119864[119863119894|119883119894])1113569]=
120124[120591(119883119894)1205901113569119889(119883119894)]120124[1205901113569119889(119883119894)]
bull 1205901113569119889(119909) = 120141[119863119894|119883119894 = 119909] is the conditional variance of treatmentassignment
39 58
ATE versus OLS

$\tau_R = \mathbb{E}[\tau(X_i) W_i] = \sum_x \tau(x)\, \dfrac{\sigma_d^2(x)}{\mathbb{E}[\sigma_d^2(X_i)]}\, \Pr[X_i = x]$

• Compare to the ATE:

  $\tau = \mathbb{E}[\tau(X_i)] = \sum_x \tau(x) \Pr[X_i = x]$

• Both weight strata relative to their size ($\Pr[X_i = x]$)
• OLS weights a stratum more heavily when its treatment variance ($\sigma_d^2(x)$) is high relative to the average variance across strata ($\mathbb{E}[\sigma_d^2(X_i)]$)
• The ATE weights strata only by their size

40/58
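A small simulation makes the gap concrete (a Python sketch with made-up numbers, not from the slides): with one binary covariate, an OLS regression of the outcome on the treatment and the covariate dummy recovers the variance-weighted quantity above, not the ATE.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.binomial(1, 0.5, n)              # binary covariate, P[X=1] = 0.5
e = np.where(x == 1, 0.9, 0.5)           # propensity scores by stratum
d = rng.binomial(1, e)
tau_x = np.where(x == 1, 2.0, 0.0)       # CATEs: tau(1) = 2, tau(0) = 0, so ATE = 1
y = x + tau_x * d + rng.normal(0, 1, n)  # baseline depends on x

# additive OLS of y on d and the x dummy (saturated in x)
Z = np.column_stack([np.ones(n), d, x])
tau_R = np.linalg.lstsq(Z, y, rcond=None)[0][1]

# variance-weighted formula: E[tau(X) sigma_d^2(X)] / E[sigma_d^2(X)]
s2 = np.array([0.5 * 0.5, 0.9 * 0.1])    # sigma_d^2(0), sigma_d^2(1)
p = np.array([0.5, 0.5])
tau_vals = np.array([0.0, 2.0])
tau_formula = (tau_vals * s2 * p).sum() / (s2 * p).sum()
print(tau_R, tau_formula)                # both ~0.53, well below the ATE of 1
```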
Regression weighting

$W_i = \dfrac{\sigma_d^2(X_i)}{\mathbb{E}[\sigma_d^2(X_i)]}$

• Why does OLS weight like this?
• OLS is a minimum-variance estimator: more weight goes to more precise within-strata estimates
• Within-strata estimates are most precise when the treatment is evenly spread and thus has the highest variance
• If $D_i$ is binary, then we know the conditional variance will be

  $\sigma_d^2(x) = \Pr[D_i = 1 \mid X_i = x]\,(1 - \Pr[D_i = 1 \mid X_i = x]) = e(x)\,(1 - e(x))$

• Maximum variance with $\Pr[D_i = 1 \mid X_i = x] = 1/2$

41/58
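A one-line check of the last bullet (a Python sketch, not from the slides): the binary-treatment conditional variance $e(x)(1 - e(x))$ is maximized at $e(x) = 1/2$, where it equals $1/4$.

```python
import numpy as np

# Var(D | X = x) = e(x)(1 - e(x)) for binary D, over a grid of propensities
e = np.linspace(0, 1, 1001)
var = e * (1 - e)
print(e[np.argmax(var)], var.max())   # 0.5 0.25
```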
OLS weighting example

• Binary covariate:

  $\Pr[X_i = 1] = 0.75, \quad \Pr[X_i = 0] = 0.25$
  $\Pr[D_i = 1 \mid X_i = 1] = 0.9, \quad \Pr[D_i = 1 \mid X_i = 0] = 0.5$
  $\sigma_d^2(1) = 0.09, \quad \sigma_d^2(0) = 0.25$
  $\tau(1) = 1, \quad \tau(0) = -1$

• Implies the ATE is $\tau = 0.5$
• Average conditional variance: $\mathbb{E}[\sigma_d^2(X_i)] = 0.13$
• Weights: for $X_i = 1$, $0.09/0.13 = 0.692$; for $X_i = 0$, $0.25/0.13 = 1.92$

  $\tau_R = \mathbb{E}[\tau(X_i) W_i]$
  $\quad = \tau(1) W(1) \Pr[X_i = 1] + \tau(0) W(0) \Pr[X_i = 0]$
  $\quad = 1 \times 0.692 \times 0.75 + (-1) \times 1.92 \times 0.25$
  $\quad = 0.039$

42/58
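The slide's arithmetic replicates directly (a Python sketch; the exact value is 0.0385, which the slide rounds to 0.039):

```python
# strata: X = 1 (share 0.75) and X = 0 (share 0.25)
p = {1: 0.75, 0: 0.25}
e = {1: 0.9, 0: 0.5}          # P[D = 1 | X = x]
tau = {1: 1.0, 0: -1.0}       # CATEs; ATE = 0.5

s2 = {x: e[x] * (1 - e[x]) for x in (0, 1)}    # 0.25 and 0.09
avg_s2 = sum(s2[x] * p[x] for x in (0, 1))     # E[sigma_d^2(X)] = 0.13
w = {x: s2[x] / avg_s2 for x in (0, 1)}        # 1.923 and 0.692
tau_R = sum(tau[x] * w[x] * p[x] for x in (0, 1))
print(round(avg_s2, 2), round(tau_R, 4))       # 0.13 0.0385
```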
When will OLS estimate the ATE?

• When does $\tau = \tau_R$?
• Constant treatment effects: $\tau(x) = \tau = \tau_R$
• Constant probability of treatment: $e(x) = \Pr[D_i = 1 \mid X_i = x] = e$
  - Implies that the OLS weights are 1
• An incorrect linearity assumption in $X_i$ will lead to more bias

43/58
Other ways to use regression

• What's the path forward?
  - Accept the bias (might be relatively small with saturated models)
  - Use a different regression approach
• Let $\mu_d(x) = \mathbb{E}[Y_i(d) \mid X_i = x]$ be the CEF for the potential outcome under $D_i = d$
• By consistency and no unmeasured confounders, we have $\mu_d(x) = \mathbb{E}[Y_i \mid D_i = d, X_i = x]$
• Estimate a regression of $Y_i$ on $X_i$ among the $D_i = d$ group
• Then $\hat{\mu}_d(x)$ is just a predicted value from the regression for $X_i = x$
• How can we use this?

44/58
Imputation estimators

• Impute the treated potential outcomes with $\widehat{Y}_i(1) = \hat{\mu}_1(X_i)$
• Impute the control potential outcomes with $\widehat{Y}_i(0) = \hat{\mu}_0(X_i)$
• Procedure:
  - Regress $Y_i$ on $X_i$ in the treated group and get predicted values for all units (treated or control)
  - Regress $Y_i$ on $X_i$ in the control group and get predicted values for all units (treated or control)
  - Take the average difference between these predicted values
• More mathematically, it looks like this:

  $\hat{\tau}_{imp} = \dfrac{1}{N} \sum_i \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)$

• Sometimes called an imputation estimator

45/58
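The three-step procedure above can be sketched in a few lines (a Python/NumPy illustration of my own on simulated data with a known ATE of 1; the slides implement the same idea in R on the next slide's data):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
x = rng.normal(size=n)
d = rng.binomial(1, 0.5, n)
# effect is 1 + 0.5x, so the ATE is 1
y = x + 1.0 * d + 0.5 * d * x + rng.normal(0, 0.1, n)

def fit_predict(xs, ys, x_all):
    # OLS of y on (1, x) within one group, predictions for every unit
    Z = np.column_stack([np.ones(len(xs)), xs])
    b = np.linalg.lstsq(Z, ys, rcond=None)[0]
    return b[0] + b[1] * x_all

y1_imp = fit_predict(x[d == 1], y[d == 1], x)   # mu1-hat(X_i) for all i
y0_imp = fit_predict(x[d == 0], y[d == 0], x)   # mu0-hat(X_i) for all i
tau_imp = np.mean(y1_imp - y0_imp)
print(tau_imp)                                  # close to the ATE of 1
```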
Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample
• Useful trick: fit a model on the entire data and use model.frame() to get the right design matrix

## heterogeneous effects
yhet <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(yhet ~ d + X)
mod1 <- lm(yhet ~ X, subset = d == 1)
mod0 <- lm(yhet ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

## [1] 0.61

46/58
Notes on imputation estimators

• If the $\hat{\mu}_d(x)$ are consistent estimators, then $\hat{\tau}_{imp}$ is consistent for the ATE
• Why don't people use this?
  - Most people don't know the results we've been talking about
  - Harder to implement than vanilla OLS
• Can use linear regression to estimate $\hat{\mu}_d(x) = x'\hat{\beta}_d$
• Recent trend is to estimate $\hat{\mu}_d(x)$ via non-parametric methods such as:
  - Kernel regression, local linear regression, regression trees, etc.
  - Easiest is generalized additive models (GAMs)

47/58
Imputation estimator visualization

[Figure: scatterplot of y against x with separate fitted regression lines for the treated and control groups]

48/58
Nonlinear relationships

• Same idea, but with a nonlinear relationship between $Y_i$ and $X_i$

[Figure: scatterplot of y against x with nonlinear within-group fits for the treated and control groups]

51/58
Using semiparametric regression

• Here the CEFs are nonlinear, but we don't know their form
• We can use GAMs from the mgcv package for a flexible estimate

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

## Family: gaussian
## Link function: identity
##
## Formula:
## y ~ s(x)
##
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.0225     0.0154   -1.46     0.16
##
## Approximate significance of smooth terms:
##       edf Ref.df    F p-value
## s(x) 6.03   7.08 41.3  <2e-16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.909   Deviance explained = 92.8%
## GCV = 0.0093204  Scale est. = 0.0071351  n = 30

54/58
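The same imputation logic works with any flexible within-group fit. As a stand-in for the GAM smooths (a Python sketch of my own; a degree-5 polynomial plays the role of s(x) here, and the data are simulated with a nonlinear CEF and a constant effect of 0.5):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2_000
x = rng.uniform(-2, 2, n)
d = rng.binomial(1, 0.5, n)
y = np.sin(x) + 0.5 * d + rng.normal(0, 0.1, n)   # nonlinear CEF, effect 0.5

def flex_fit_predict(xs, ys, x_all, deg=5):
    # flexible within-group fit; a polynomial stands in for a GAM smooth
    c = np.polyfit(xs, ys, deg)
    return np.polyval(c, x_all)

y1 = flex_fit_predict(x[d == 1], y[d == 1], x)
y0 = flex_fit_predict(x[d == 0], y[d == 0], x)
tau_flex = np.mean(y1 - y0)
print(tau_flex)                                   # close to 0.5
```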
Using GAMs

[Figure: scatterplot of y against x with GAM fits for the treated and control groups]

55/58
Limited dependent variables

• Usual advice: model the data from first principles
  - Logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is a-ok with limited DVs when:
  - Binary treatment and no covariates (just diff-in-means)
  - Binary treatment, discrete covariates, and saturated models (stratified diff-in-means)
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models
  - Could go wrong in small samples
  - If using nonlinear models, always get effects on the scale of the outcome

58/58
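The first "a-ok" case is easy to verify numerically (a Python sketch, not from the slides): with a binary outcome and a binary treatment, the OLS slope is exactly the difference in proportions.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
d = rng.binomial(1, 0.5, n)
y = rng.binomial(1, np.where(d == 1, 0.7, 0.4))   # binary outcome

# linear probability model: y on (1, d)
X = np.column_stack([np.ones(n), d])
slope = np.linalg.lstsq(X, y, rcond=None)[0][1]
diff_prop = y[d == 1].mean() - y[d == 0].mean()
print(slope, diff_prop)                           # identical, both ~0.3
```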
Justifying linear regression

• Define linear regression:

  β = argmin_b E[(Y_i − X_i′b)²]

• The solution to this is the following:

  β = E[X_i X_i′]⁻¹ E[X_i Y_i]

• Note that this is the population coefficient vector, not the estimator yet.
Regression anatomy

• Consider simple linear regression:

  (α, β) = argmin_{a,b} E[(Y_i − a − b·X_i)²]

• In this case, we can write the population/true slope β as:

  β = E[X_i X_i′]⁻¹ E[X_i Y_i] = Cov(Y_i, X_i) / V[X_i]

• With more covariates, β is more complicated, but we can still write it like this.
• Let X̃_{ki} be the residual from a regression of X_{ki} on all the other independent variables. Then β_k, the coefficient for X_{ki}, is:

  β_k = Cov(Y_i, X̃_{ki}) / V(X̃_{ki})
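The regression-anatomy formula is easy to check numerically. A minimal sketch in Python (the course code is in R; all names and the simulated data here are ours): residualize one covariate on the others, and the slope of the outcome on that residual matches the multiple-regression coefficient.

```python
# Numerical check of regression anatomy: beta_k = Cov(Y, X~_k) / V(X~_k),
# where X~_k is the residual from regressing X_k on the other covariates.
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)            # covariates are correlated
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

# Full regression of y on (1, x1, x2)
X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

# Residualize x2 with respect to (1, x1)
Z = np.column_stack([np.ones(n), x1])
x2_tilde = x2 - Z @ np.linalg.lstsq(Z, x2, rcond=None)[0]

# Anatomy formula for the x2 coefficient
beta2_anatomy = np.cov(y, x2_tilde)[0, 1] / np.var(x2_tilde, ddof=1)
```

The two numbers agree to machine precision, which is the Frisch-Waugh-Lovell logic behind the slide.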
Justification 1: Linear CEFs

• Justification 1: if the CEF is linear, the population regression function is it. That is, if E[Y_i|X_i] = X_i′b, then b = β.
• When would we expect the CEF to be linear? Two cases:
  1. Outcome and covariates are multivariate normal.
  2. Linear regression model is saturated.
• A model is saturated if there are as many parameters as there are possible combinations of the X_i variables.
Saturated model example

• Two binary variables: X_{1i} for incumbency status and X_{2i} for party of the candidate.
• Four possible values of X_i, four possible values of μ(X_i):

  E[Y_i | X_{1i} = 0, X_{2i} = 0] = α
  E[Y_i | X_{1i} = 1, X_{2i} = 0] = α + β
  E[Y_i | X_{1i} = 0, X_{2i} = 1] = α + γ
  E[Y_i | X_{1i} = 1, X_{2i} = 1] = α + β + γ + δ

• We can write the CEF as follows:

  E[Y_i | X_{1i}, X_{2i}] = α + β·X_{1i} + γ·X_{2i} + δ·(X_{1i} X_{2i})
Saturated models example

  E[Y_i | X_{1i}, X_{2i}] = α + β·X_{1i} + γ·X_{2i} + δ·(X_{1i} X_{2i})

• Basically, each value of μ(X_i) is being estimated separately:
  - within-strata estimation
  - no borrowing of information from across values of X_i
• Requires a set of dummies for each categorical variable plus all interactions.
• Or a series of dummies for each unique combination of X_i.
• This makes linearity hold mechanically, and so linearity is not an assumption:
  - just a fact about saturated CEFs
  - saturated models for limited dependent variables = A-OK
Saturated model example

• Washington (AER) data on the effects of daughters.
• We'll look at the relationship between voting and number of kids (causal?).

girls <- foreign::read.dta("girls.dta")
head(girls[, c("name", "totchi", "aauw")])

                 name totchi aauw
1   ABERCROMBIE, NEIL      0  100
2    ACKERMAN, GARY L      3   88
3  ADERHOLT, ROBERT B      0    0
4     ALLEN, THOMAS H      2  100
5   ANDREWS, ROBERT E      2  100
6        ARCHER, W.R.      7    0
Linear model

summary(lm(aauw ~ totchi, data = girls))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    61.31       1.81   33.81   <2e-16
totchi         -5.33       0.62   -8.59   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 42 on 1733 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.0408, Adjusted R-squared: 0.0403
F-statistic: 73.8 on 1 and 1733 DF, p-value: <2e-16
Saturated model

summary(lm(aauw ~ as.factor(totchi), data = girls))

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)            56.41       2.76   20.42  < 2e-16
as.factor(totchi)1      5.45       4.11    1.33   0.1851
as.factor(totchi)2     -3.80       3.27   -1.16   0.2454
as.factor(totchi)3    -13.65       3.45   -3.95  8.1e-05
as.factor(totchi)4    -19.31       4.01   -4.82  1.6e-06
as.factor(totchi)5    -15.46       4.85   -3.19   0.0015
as.factor(totchi)6    -33.59      10.42   -3.22   0.0013
as.factor(totchi)7    -17.13      11.41   -1.50   0.1336
as.factor(totchi)8    -55.33      12.28   -4.51  7.0e-06
as.factor(totchi)9    -50.41      24.08   -2.09   0.0364
as.factor(totchi)10   -53.41      20.90   -2.56   0.0107
as.factor(totchi)12   -56.41      41.53   -1.36   0.1745
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41 on 1723 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.0506, Adjusted R-squared: 0.0446
F-statistic: 8.36 on 11 and 1723 DF, p-value: 1.84e-14
Saturated model minus the constant

summary(lm(aauw ~ as.factor(totchi) - 1, data = girls))

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
as.factor(totchi)0     56.41       2.76   20.42   <2e-16
as.factor(totchi)1     61.86       3.05   20.31   <2e-16
as.factor(totchi)2     52.62       1.75   30.13   <2e-16
as.factor(totchi)3     42.76       2.07   20.62   <2e-16
as.factor(totchi)4     37.11       2.90   12.79   <2e-16
as.factor(totchi)5     40.95       3.99   10.27   <2e-16
as.factor(totchi)6     22.82      10.05    2.27   0.0233
as.factor(totchi)7     39.29      11.07    3.55   0.0004
as.factor(totchi)8      1.08      11.96    0.09   0.9278
as.factor(totchi)9      6.00      23.92    0.25   0.8020
as.factor(totchi)10     3.00      20.72    0.14   0.8849
as.factor(totchi)12     0.00      41.43    0.00   1.0000
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41 on 1723 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.587, Adjusted R-squared: 0.584
F-statistic: 204 on 12 and 1723 DF, p-value: <2e-16
Compare to within-strata means

• The saturated model makes no assumptions about the between-strata relationships.
• Just calculates within-strata means.

c1 <- coef(lm(aauw ~ as.factor(totchi) - 1, data = girls))
c2 <- with(girls, tapply(aauw, totchi, mean, na.rm = TRUE))
rbind(c1, c2)

    0  1  2  3  4  5  6  7  8 9 10 12
c1 56 62 53 43 37 41 23 39 11 6  3  0
c2 56 62 53 43 37 41 23 39 11 6  3  0
Other justifications for OLS

• Justification 2: X_i′β is the best linear predictor (in a mean-squared error sense) of Y_i:
  - Why? β = argmin_b E[(Y_i − X_i′b)²]
• Justification 3: X_i′β provides the minimum mean squared error linear approximation to E[Y_i|X_i].
• Even if the CEF is not linear, a linear regression provides the best linear approximation to that CEF.
• Don't need to believe the assumptions (linearity) in order to use regression as a good approximation to the CEF.
• Warning: if the CEF is very nonlinear, then this approximation could be terrible.
The error terms

• Let's define the error term e_i ≡ Y_i − X_i′β, so that:

  Y_i = X_i′β + [Y_i − X_i′β] = X_i′β + e_i

• Note that the residual e_i is uncorrelated with X_i:

  E[X_i e_i] = E[X_i (Y_i − X_i′β)]
             = E[X_i Y_i] − E[X_i X_i′ β]
             = E[X_i Y_i] − E[ X_i X_i′ E[X_i X_i′]⁻¹ E[X_i Y_i] ]
             = E[X_i Y_i] − E[X_i X_i′] E[X_i X_i′]⁻¹ E[X_i Y_i]
             = E[X_i Y_i] − E[X_i Y_i] = 0

• No assumptions on the linearity of E[Y_i|X_i]!
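This orthogonality also holds exactly in any sample, by the normal equations, even when the true CEF is deliberately nonlinear. A quick Python check (illustrative names, simulated data ours):

```python
# The sample analogue of E[X_i e_i] is exactly zero for OLS residuals,
# regardless of whether the CEF is actually linear.
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.normal(size=n) + (X[:, 1] > 0)        # step-function CEF: not linear

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta
moment = X.T @ e / n                           # (1/N) sum_i X_i * e_i
```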
OLS estimator

• We know the population value of β is:

  β = E[X_i X_i′]⁻¹ E[X_i Y_i]

• How do we get an estimator of this?
• Plug-in principle: replace population expectations with sample versions:

  β̂ = [ (1/N) Σ_i X_i X_i′ ]⁻¹ (1/N) Σ_i X_i Y_i

• If you work through the matrix algebra, this turns out to be:

  β̂ = (X′X)⁻¹ X′y
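The plug-in (moment) form and the familiar matrix form give identical numbers; a small Python sketch (names and data ours, not the course code):

```python
# Moment form [ (1/N) sum X_i X_i' ]^{-1} (1/N) sum X_i Y_i
# equals the matrix form (X'X)^{-1} X'y: the 1/N factors cancel.
import numpy as np

rng = np.random.default_rng(2)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

Sxx = X.T @ X / n                          # sample E[X_i X_i']
Sxy = X.T @ y / n                          # sample E[X_i Y_i]
beta_moment = np.linalg.solve(Sxx, Sxy)

beta_matrix = np.linalg.solve(X.T @ X, X.T @ y)
```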
Asymptotic OLS inference

• With this representation in hand, we can write the OLS estimator as follows:

  β̂ = β + [ Σ_i X_i X_i′ ]⁻¹ Σ_i X_i e_i

• Core idea: Σ_i X_i e_i is the sum of r.v.s, so the CLT applies.
• That, plus some simple asymptotic theory, allows us to say:

  √N (β̂ − β) ⇝ N(0, Ω)

• Converges in distribution to a Normal distribution with mean vector 0 and covariance matrix Ω:

  Ω = E[X_i X_i′]⁻¹ E[X_i X_i′ e_i²] E[X_i X_i′]⁻¹

• No linearity assumption needed!
Estimating the variance

• In large samples, then:

  √N (β̂ − β) ∼ N(0, Ω)

• How to estimate Ω? Plug-in principle again!

  Ω̂ = [ Σ_i X_i X_i′ ]⁻¹ [ Σ_i X_i X_i′ ê_i² ] [ Σ_i X_i X_i′ ]⁻¹

• Replace e_i with its empirical counterpart, the residual ê_i = Y_i − X_i′β̂.
• Replace the population moments of X_i with their sample counterparts.
• The square roots of the diagonals of this covariance matrix are the "robust" or Huber-White standard errors that Stata commonly reports.
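The sandwich formula above can be sketched in a few lines of Python (this is the HC0 flavor; variable names and the simulated heteroskedastic data are ours):

```python
# Huber-White sandwich: (sum X X')^{-1} (sum X X' e^2) (sum X X')^{-1},
# whose diagonal square roots are the robust standard errors.
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
y = 1 + 2 * x + np.abs(x) * rng.normal(size=n)   # error variance depends on x
X = np.column_stack([np.ones(n), x])

beta = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta

bread = np.linalg.inv(X.T @ X)
meat = (X * e[:, None] ** 2).T @ X               # sum_i X_i X_i' * e_i^2
vcov_hc0 = bread @ meat @ bread                  # robust variance of beta-hat
se_robust = np.sqrt(np.diag(vcov_hc0))
```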
Heteroskedasticity

• No assumptions of homoskedasticity!
• Heteroskedasticity will definitely occur when:
  - the CEF is linear, but σ²(x) = V[Y_i|X_i = x] is not constant in x
  - E[Y_i|X_i] is not linear, but we use linear regression to approximate it
2. Regression and Causality
Regression and causality

• Most econometrics textbooks: regression defined without respect to causality.
• But then when is β̂ "biased"? The above derivations work for any CEF E[Y_i|X_i].
• The question then is: when does knowing the CEF tell us something about causality?
• MHE argues that a regression is causal when the CEF it approximates is causal. Identification is king.
• We will show that under certain conditions, a regression of the outcome on the treatment and the covariates can recover a causal parameter, but perhaps not the one in which we are interested.
Review

• Quick reminder: we have potential outcomes Y_i(1) and Y_i(0) and two parameters, the ATE and ATT:

  τ = E[Y_i(1) − Y_i(0)]
  τ_ATT = E[Y_i(1) − Y_i(0) | D_i = 1]

• We have shown in past weeks that these effects are identified when ignorability holds. MHE calls this the conditional independence assumption (CIA).
Linear constant effects model, binary treatment

• Experiment: with a simple experiment, we can rewrite the consistency assumption as a regression formula:

  Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
      = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
      = E[Y_i(0)] + τ·D_i + (Y_i(0) − E[Y_i(0)])
      = μ_0 + τ·D_i + v_{0i}

• Note that if ignorability holds (as in an experiment) for Y_i(0), then it will also hold for v_{0i}, since E[Y_i(0)] is constant. Thus, this satisfies the usual assumptions for regression.
Now with covariates

• Now assume no unmeasured confounders: Y_i(d) ⊥⊥ D_i | X_i.
• We will assume a linear model for the potential outcomes:

  Y_i(d) = α + τ·d + η_i

• Remember that linearity isn't an assumption if D_i is binary.
• Effect of D_i is constant here; the η_i are the only source of individual variation, and we have E[η_i] = 0.
• Consistency assumption allows us to write this as:

  Y_i = α + τ·D_i + η_i
Covariates in the error

• Let's assume that η_i is linear in X_i: η_i = X_i′γ + ν_i.
• New error is uncorrelated with X_i: E[ν_i | X_i] = 0.
  - This is an assumption! Might be false!
• Plug into the above:

  E[Y_i(d) | X_i] = E[Y_i | D_i, X_i] = α + τ·D_i + E[η_i | X_i]
                  = α + τ·D_i + X_i′γ + E[ν_i | X_i]
                  = α + τ·D_i + X_i′γ
Summing up regression with constant effects

• Reviewing the assumptions we've used:
  - no unmeasured confounders
  - constant treatment effects
  - linearity of the treatment/covariates
• Under these, we can run the following regression to estimate the ATE, τ:

  Y_i = α + τ·D_i + X_i′γ + ν_i

• Works with continuous or ordinal D_i if the effect of these variables is truly linear.
OLS constant effects simulation

Model with linear covariates, constant 0 effect of treatment:

library(mvtnorm)
n <- 100
p <- 4
X <- rmvnorm(n = 100, mean = rep(0, p))
gamma <- c(27.4, 13.7, 13.7, 13.7)
y <- 210 + X %*% c(gamma) + rnorm(n)
alpha <- c(-1, 0.5, -0.5, -0.1)
d.probs <- boot::inv.logit(X %*% alpha)
d <- rbinom(n, size = 1, prob = d.probs)
OLS with no covariates

summary(lm(y ~ d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   220.36       4.58   48.13  < 2e-16
d             -28.69       6.54   -4.39 0.000029
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 33 on 98 degrees of freedom
Multiple R-squared: 0.164, Adjusted R-squared: 0.156
F-statistic: 19.2 on 1 and 98 DF, p-value: 0.000029
OLS with covariates

summary(lm(y ~ d + X))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  209.952      0.159  1320.7   <2e-16
d             -0.128      0.253    -0.5     0.62
X1            27.368      0.126   217.5   <2e-16
X2            13.677      0.114   120.1   <2e-16
X3            13.673      0.130   105.1   <2e-16
X4            13.570      0.106   128.0   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom
Multiple R-squared: 0.999, Adjusted R-squared: 0.999
F-statistic: 2.44e+04 on 5 and 94 DF, p-value: <2e-16
What happens with nonlinearity?

Suppose we can only observe the following covariates:

z1 <- exp(X[, 1] / 2)
z2 <- X[, 2] / (1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3] / 25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

Implies that Y_i and D_i are functions of log(Z_{1i}), Z_{2i}, Z_{2i}·Z²_{1i}, 1/log(Z_{1i}), Z^{1/3}_{3i}/log(Z_{1i}), and Z^{1/2}_{4i}.

Regression is a nonlinear function of the observed covariates!
When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  218.728    300.799    0.73    0.469
d            -69.292     33.854   -2.05    0.043
z1           363.110     26.970   13.46   <2e-16
z2           -29.033     36.619   -0.79    0.430
z3          8620.22    4310.30     2.00    0.048
z4             0.4021     0.0329  12.23   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14 on 94 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF, p-value: <2e-16
3. Regression with Heterogeneous Treatment Effects
Heterogeneous effects, binary treatment

• Completely randomized experiment:

  Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
      = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
      = μ_0 + τ_i·D_i + (Y_i(0) − μ_0)
      = μ_0 + τ·D_i + (Y_i(0) − μ_0) + (τ_i − τ)·D_i
      = μ_0 + τ·D_i + ε_i

• Error term now includes two components:
  1. "Baseline" variation in the outcome: (Y_i(0) − μ_0)
  2. Variation in the treatment effect: (τ_i − τ)
• Easy to verify that under the experiment, E[ε_i | D_i] = 0.
• Thus, OLS estimates the ATE with no covariates.
Adding covariates

• What happens with no unmeasured confounders? Need to condition on X_i now.
• Remember identification of the ATE/ATT using iterated expectations.
• ATE is the weighted sum of CATEs:

  τ = Σ_x τ(x) Pr[X_i = x]

• ATE/ATT are weighted averages of CATEs.
• What about the regression estimand, τ_R? How does it relate to the ATE/ATT?
Heterogeneous effects and regression

• Let's investigate this under a saturated regression model:

  Y_i = Σ_x B_{xi} α_x + τ_R·D_i + e_i

• Use a dummy variable for each unique combination of X_i: B_{xi} = 1(X_i = x).
• Linear in X_i by construction!
Investigating the regression coefficient

• How can we investigate τ_R? Well, we can rely on the regression anatomy:

  τ_R = Cov(Y_i, D_i − E[D_i|X_i]) / V(D_i − E[D_i|X_i])

• D_i − E[D_i|X_i] is the residual from a regression of D_i on the full set of dummies.
• With a little work, we can show:

  τ_R = E[τ(X_i)(D_i − E[D_i|X_i])²] / E[(D_i − E[D_i|X_i])²] = E[τ(X_i) σ²_d(X_i)] / E[σ²_d(X_i)]

• σ²_d(x) = V[D_i | X_i = x] is the conditional variance of treatment assignment.
ATE versus OLS

  τ_R = E[τ(X_i) W_i] = Σ_x τ(x) · (σ²_d(x) / E[σ²_d(X_i)]) · Pr[X_i = x]

• Compare to the ATE:

  τ = E[τ(X_i)] = Σ_x τ(x) Pr[X_i = x]

• Both weight strata relative to their size (Pr[X_i = x]).
• OLS weights strata higher if the treatment variance in those strata (σ²_d(x)) is high relative to the average variance across strata (E[σ²_d(X_i)]).
• The ATE weights only by their size.
Regression weighting

  W_i = σ²_d(X_i) / E[σ²_d(X_i)]

• Why does OLS weight like this?
• OLS is a minimum-variance estimator ⇒ more weight to more precise within-strata estimates.
• Within-strata estimates are most precise when the treatment is evenly spread and thus has the highest variance.
• If D_i is binary, then we know the conditional variance will be:

  σ²_d(x) = Pr[D_i = 1 | X_i = x] (1 − Pr[D_i = 1 | X_i = x]) = e(x)(1 − e(x))

• Maximum variance with Pr[D_i = 1 | X_i = x] = 1/2.
OLS weighting example

• Binary covariate:

  Pr[X_i = 1] = 0.75   Pr[X_i = 0] = 0.25
  Pr[D_i = 1 | X_i = 1] = 0.9   Pr[D_i = 1 | X_i = 0] = 0.5
  σ²_d(1) = 0.09   σ²_d(0) = 0.25
  τ(1) = 1   τ(0) = −1

• Implies the ATE is τ = 0.5.
• Average conditional variance: E[σ²_d(X_i)] = 0.13.
• ⇒ weights for X_i = 1: 0.09/0.13 = 0.692; for X_i = 0: 0.25/0.13 = 1.92.

  τ_R = E[τ(X_i) W_i]
      = τ(1)·W(1)·Pr[X_i = 1] + τ(0)·W(0)·Pr[X_i = 0]
      = 1 × 0.692 × 0.75 + (−1) × 1.92 × 0.25
      = 0.039
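The arithmetic on this slide can be replicated in a few lines (Python, names ours):

```python
# OLS weighting example: variance-weighted average of CATEs vs. the ATE.
p_x = {1: 0.75, 0: 0.25}      # Pr[X_i = x]
e_x = {1: 0.9, 0: 0.5}        # Pr[D_i = 1 | X_i = x]
tau_x = {1: 1.0, 0: -1.0}     # conditional ATEs tau(x)

var_d = {x: e_x[x] * (1 - e_x[x]) for x in p_x}   # sigma^2_d(x): 0.09 and 0.25
avg_var = sum(var_d[x] * p_x[x] for x in p_x)      # E[sigma^2_d(X_i)] = 0.13

tau_ate = sum(tau_x[x] * p_x[x] for x in p_x)      # size-weighted: 0.5
tau_r = sum(tau_x[x] * (var_d[x] / avg_var) * p_x[x] for x in p_x)  # approx 0.039
```

With the weights pulled toward the low-propensity stratum, the regression estimand is nearly zero even though the ATE is 0.5.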
When will OLS estimate the ATE?

• When does τ = τ_R?
• Constant treatment effects: τ(x) = τ = τ_R.
• Constant probability of treatment: e(x) = Pr[D_i = 1 | X_i = x] = e.
  - Implies that the OLS weights are 1.
• Incorrect linearity assumption in X_i will lead to more bias.
Other ways to use regression

• What's the path forward?
  - Accept the bias (might be relatively small with saturated models)
  - Use a different regression approach
• Let μ_d(x) = E[Y_i(d) | X_i = x] be the CEF for the potential outcome under D_i = d.
• By consistency and no unmeasured confounders, we have μ_d(x) = E[Y_i | D_i = d, X_i = x].
• Estimate a regression of Y_i on X_i among the D_i = d group.
• Then μ̂_d(x) is just a predicted value from that regression at X_i = x.
• How can we use this?
Imputation estimators

• Impute the treated potential outcomes with Ŷ_i(1) = μ̂_1(X_i).
• Impute the control potential outcomes with Ŷ_i(0) = μ̂_0(X_i).
• Procedure:
  - Regress Y_i on X_i in the treated group and get predicted values for all units (treated or control).
  - Regress Y_i on X_i in the control group and get predicted values for all units (treated or control).
  - Take the average difference between these predicted values.
• More mathematically, it looks like this:

  τ̂_imp = (1/N) Σ_i [ μ̂_1(X_i) − μ̂_0(X_i) ]

• Sometimes called an imputation estimator.
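The procedure above can be sketched in Python under an assumed linear CEF in each group (the course implementation is in R; the data-generating process and names here are ours):

```python
# Imputation estimator: fit OLS separately in each treatment group,
# predict both potential outcomes for everyone, average the difference.
import numpy as np

rng = np.random.default_rng(4)
n = 1000
x = rng.normal(size=n)
d = rng.binomial(1, 1 / (1 + np.exp(-x)))            # treatment depends on x
y = 1 + 0.5 * x + d * (2 + x) + rng.normal(size=n)   # heterogeneous effect 2 + x

def fit_group(mask):
    """OLS of y on (1, x) within one treatment group; returns coefficients."""
    Xg = np.column_stack([np.ones(mask.sum()), x[mask]])
    return np.linalg.lstsq(Xg, y[mask], rcond=None)[0]

b1, b0 = fit_group(d == 1), fit_group(d == 0)
X_all = np.column_stack([np.ones(n), x])
tau_imp = np.mean(X_all @ b1 - X_all @ b0)           # estimate of the ATE (true value 2)
```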
Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample.
• Useful trick: use a model on the entire data and model.frame() to get the right design matrix.

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61
Notes on imputation estimators

• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE.
• Why don't people use this?
  - Most people don't know the results we've been talking about.
  - Harder to implement than vanilla OLS.
• Can use linear regression to estimate μ̂_d(x) = x′β̂_d.
• Recent trend is to estimate μ̂_d(x) via non-parametric methods such as kernel regression, local linear regression, regression trees, etc.
  - Easiest is generalized additive models (GAMs).
Imputation estimator visualization

[Figure, shown in three build-up steps: scatterplot of y against x for treated and control units, with the within-group fitted lines used to impute each unit's missing potential outcome.]
Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i.

[Figure, shown in three build-up steps: scatterplot of y against x for treated and control units with nonlinear CEFs.]
Using semiparametric regression

• Here, the CEFs are nonlinear, but we don't know their form.
• We can use GAMs from the mgcv package for a flexible estimate:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204   Scale est. = 0.0071351   n = 30
Using GAMs

[Figure, shown in three build-up steps: the GAM-estimated treated and control CEFs overlaid on the scatterplot of y against x.]
Limited dependent variables

• Usual advice: model the data from first principles! Logit/probit for binary, Poisson for counts, etc.
• OLS is A-OK with limited DVs when:
  - Binary treatment and no covariates (just diff-in-means)
  - Binary treatment, discrete covariates, and saturated models (stratified diff-in-means)
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong.
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models.
  - Could go wrong in small samples.
  - If using nonlinear models, always get effects on the scale of the outcome!
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
-
Regression anatomy
bull Consider simple linear regression
(120572 120573) = argmin119886119887
120124 1114094(119884119894 minus 119886 minus 119887119883119894)11135691114097
bull In this case we can write the populationtrue slope 120573 as
120573 = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894] =
Cov(119884119894 119883119894)120141[119883119894]
bull With more covariates 120573 is more complicated but we can stillwrite it like this
bull Let 119896119894 be the residual from a regression of 119883119896119894 on all the otherindependent variables Then 120573119896 the coefficient for 119883119896119894 is
120573119896 =Cov(119884119894 119896119894)
119881(119896119894)
8 58
Justification 1 Linear CEFs
bull Justification 1 if the CEF is linear the population regressionfunction is it That is if 119864[119884119894|119883119894] = 119883prime
119894 119887 then 119887 = 120573bull When would we expect the CEF to be linear Two cases
1 Outcome and covariates are multivariate normal2 Linear regression model is saturated
bull A model is saturated if there are as many parameters asthere are possible combination of the 119883119894 variables
9 58
Saturated model example
bull Two binary variables 1198831113568119894 for incumbency status and 1198831113569119894 forparty of the candidate
bull Four possible values of 119883119894 four possible values of 120583(119883119894)
119864[119884119894|1198831113568119894 = 01198831113569119894 = 0] = 120572119864[119884119894|1198831113568119894 = 11198831113569119894 = 0] = 120572 + 120573119864[119884119894|1198831113568119894 = 01198831113569119894 = 1] = 120572 + 120574119864[119884119894|1198831113568119894 = 11198831113569119894 = 1] = 120572 + 120573 + 120574 + 120575
bull We can write the CEF as follows
119864[119884119894|1198831113568119894 1198831113569119894] = 120572 + 1205731198831113568119894 + 1205741198831113569119894 + 120575(11988311135681198941198831113569119894)
10 58
Saturated models example
119864[119884119894|1198831113568119894 1198831113569119894] = 120572 + 1205731198831113568119894 + 1205741198831113569119894 + 120575(11988311135681198941198831113569119894)
bull Basically each value of 120583(119883119894) is being estimated separately within-strata estimation No borrowing of information from across values of 119883119894
bull Requires a set of dummies for each categorical variable plusall interactions
bull Or a series of dummies for each unique combination of 119883119894bull This makes linearity hold mechanically and so linearity is
not an assumption Just a fact about saturated CEFs saturated models for limited dependent variables = A-OK
11 58
Saturated model example
bull Washington (AER) data on the effects of daughtersbull Wersquoll look at the relationship between voting and number of
kids (causal)
girls lt- foreignreaddta(rdquogirlsdtardquo)
head(girls[ c(rdquonamerdquo rdquototchirdquo rdquoaauwrdquo)])
name totchi aauw
1 ABERCROMBIE NEIL 0 100
2 ACKERMAN GARY L 3 88
3 ADERHOLT ROBERT B 0 0
4 ALLEN THOMAS H 2 100
5 ANDREWS ROBERT E 2 100
6 ARCHER WR 7 0
12 58
Linear model
summary(lm(aauw ~ totchi data = girls))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 6131 181 3381 lt2e-16
totchi -533 062 -859 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 42 on 1733 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared 00408 Adjusted R-squared 00403
F-statistic 738 on 1 and 1733 DF p-value lt2e-16
13 58
Saturated modelsummary(lm(aauw ~ asfactor(totchi) data = girls))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 5641 276 2042 lt 2e-16
asfactor(totchi)1 545 411 133 01851
asfactor(totchi)2 -380 327 -116 02454
asfactor(totchi)3 -1365 345 -395 81e-05
asfactor(totchi)4 -1931 401 -482 16e-06
asfactor(totchi)5 -1546 485 -319 00015
asfactor(totchi)6 -3359 1042 -322 00013
asfactor(totchi)7 -1713 1141 -150 01336
asfactor(totchi)8 -5533 1228 -451 70e-06
asfactor(totchi)9 -5041 2408 -209 00364
asfactor(totchi)10 -5341 2090 -256 00107
asfactor(totchi)12 -5641 4153 -136 01745
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 41 on 1723 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared 00506 Adjusted R-squared 00446
F-statistic 836 on 11 and 1723 DF p-value 184e-1414 58
Saturated model minus the constantsummary(lm(aauw ~ asfactor(totchi) - 1 data = girls))
Coefficients
Estimate Std Error t value Pr(gt|t|)
asfactor(totchi)0 5641 276 2042 lt2e-16
asfactor(totchi)1 6186 305 2031 lt2e-16
asfactor(totchi)2 5262 175 3013 lt2e-16
asfactor(totchi)3 4276 207 2062 lt2e-16
asfactor(totchi)4 3711 290 1279 lt2e-16
asfactor(totchi)5 4095 399 1027 lt2e-16
asfactor(totchi)6 2282 1005 227 00233
asfactor(totchi)7 3929 1107 355 00004
asfactor(totchi)8 108 1196 009 09278
asfactor(totchi)9 600 2392 025 08020
asfactor(totchi)10 300 2072 014 08849
asfactor(totchi)12 000 4143 000 10000
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 41 on 1723 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared 0587 Adjusted R-squared 0584
F-statistic 204 on 12 and 1723 DF p-value lt2e-1615 58
Compare to within-strata means

• The saturated model makes no assumptions about the between-strata relationships
• Just calculates within-strata means

c1 <- coef(lm(aauw ~ as.factor(totchi) - 1, data = girls))
c2 <- with(girls, tapply(aauw, totchi, mean, na.rm = TRUE))
rbind(c1, c2)
    0  1  2  3  4  5  6  7  8  9 10 12
c1 56 62 53 43 37 41 23 39 11  6  3  0
c2 56 62 53 43 37 41 23 39 11  6  3  0

16/58
Other justifications for OLS

• Justification 2: X_i'β is the best linear predictor (in a mean-squared error sense) of Y_i. Why? β = argmin_b E[(Y_i − X_i'b)²]
• Justification 3: X_i'β provides the minimum mean squared error linear approximation to E[Y_i|X_i]
• Even if the CEF is not linear, a linear regression provides the best linear approximation to that CEF
• Don't need to believe the assumptions (linearity) in order to use regression as a good approximation to the CEF
• Warning: if the CEF is very nonlinear, then this approximation could be terrible

17/58
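As a quick check of Justifications 2 and 3 (a sketch, not from the slides): in a simple regression, the lm() slope equals the plug-in best-linear-predictor slope Cov(X, Y)/V(X), even when the CEF is nonlinear.

```r
# Sketch (not from the slides): with a nonlinear CEF, OLS still
# recovers the best linear predictor, whose slope is Cov(X, Y)/V(X).
set.seed(42)
n <- 1e5
x <- rnorm(n)
y <- exp(x) + rnorm(n)        # CEF E[Y | X = x] = exp(x), not linear
b.ols <- coef(lm(y ~ x))[2]   # OLS slope
b.blp <- cov(x, y) / var(x)   # sample best-linear-predictor slope
c(b.ols, b.blp)               # numerically identical
```

The fit is the best *linear* approximation; it says nothing about how good that approximation is here.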
The error terms

• Let's define the error term e_i ≡ Y_i − X_i'β, so that

  Y_i = X_i'β + [Y_i − X_i'β] = X_i'β + e_i

• Note the residual e_i is uncorrelated with X_i:

  E[X_i e_i] = E[X_i (Y_i − X_i'β)]
             = E[X_i Y_i] − E[X_i X_i'β]
             = E[X_i Y_i] − E[ X_i X_i' E[X_i X_i']⁻¹ E[X_i Y_i] ]
             = E[X_i Y_i] − E[X_i X_i'] E[X_i X_i']⁻¹ E[X_i Y_i]
             = E[X_i Y_i] − E[X_i Y_i] = 0

• No assumptions on the linearity of E[Y_i|X_i]

18/58
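The zero-covariance property above holds exactly in-sample for OLS residuals; a small numerical sketch (not from the slides):

```r
# Sketch: sample analogue of E[X_i e_i] = 0. OLS residuals are
# orthogonal to every column of the design matrix by construction,
# even though the CEF here is not linear in X.
set.seed(1)
n <- 500
X <- cbind(1, rnorm(n), runif(n))   # design matrix with constant
y <- X[, 2]^2 + rnorm(n)            # nonlinear CEF
e <- residuals(lm(y ~ X - 1))       # X already contains the constant
crossprod(X, e)                     # all entries zero up to rounding
```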
OLS estimator

• We know the population value of β is

  β = E[X_i X_i']⁻¹ E[X_i Y_i]

• How do we get an estimator of this?
• Plug-in principle: replace population expectations with sample versions:

  β̂ = [ (1/N) Σ_i X_i X_i' ]⁻¹ (1/N) Σ_i X_i Y_i

• If you work through the matrix algebra, this turns out to be

  β̂ = (X'X)⁻¹ X'y

19/58
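A sketch (not from the slides) confirming the matrix formula against lm():

```r
# Sketch: the plug-in estimator (X'X)^{-1} X'y matches coef(lm()).
set.seed(2)
n <- 200
X <- cbind(1, rnorm(n), rnorm(n))
y <- drop(X %*% c(1, 2, -1)) + rnorm(n)
beta.hat <- solve(crossprod(X), crossprod(X, y))  # (X'X)^{-1} X'y
max(abs(beta.hat - coef(lm(y ~ X[, -1]))))        # ~ 0
```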
Asymptotic OLS inference

• With this representation in hand, we can write the OLS estimator as follows:

  β̂ = β + [ Σ_i X_i X_i' ]⁻¹ Σ_i X_i e_i

• Core idea: Σ_i X_i e_i is a sum of r.v.s, so the CLT applies
• That, plus some simple asymptotic theory, allows us to say:

  √N (β̂ − β) →d N(0, Ω)

• Converges in distribution to a Normal distribution with mean vector 0 and covariance matrix Ω:

  Ω = E[X_i X_i']⁻¹ E[X_i X_i' e_i²] E[X_i X_i']⁻¹

• No linearity assumption needed!

20/58
Estimating the variance

• In large samples, then,

  √N (β̂ − β) ∼ N(0, Ω)

• How to estimate Ω? Plug-in principle again!

  Ω̂ = [ Σ_i X_i X_i' ]⁻¹ [ Σ_i X_i X_i' ê_i² ] [ Σ_i X_i X_i' ]⁻¹

• Replace e_i with its empirical counterpart, the residuals ê_i = Y_i − X_i'β̂
• Replace the population moments of X_i with their sample counterparts
• The square roots of the diagonal of this covariance matrix are the "robust" or Huber-White standard errors that Stata commonly reports

21/58
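The sandwich formula can be computed by hand; a sketch (not from the slides) of the HC0 version, which applies no finite-sample degrees-of-freedom correction:

```r
# Sketch: HC0 robust variance, bread %*% meat %*% bread.
set.seed(3)
n <- 500
X <- cbind(1, rnorm(n))
y <- drop(X %*% c(1, 2)) + rnorm(n) * exp(X[, 2])  # heteroskedastic
e.hat <- residuals(lm(y ~ X - 1))
bread <- solve(crossprod(X))        # (sum_i X_i X_i')^{-1}
meat <- crossprod(X * e.hat)        # sum_i X_i X_i' ehat_i^2
V.hc0 <- bread %*% meat %*% bread
sqrt(diag(V.hc0))                   # robust standard errors
```

The sandwich package's vcovHC(fit, type = "HC0") implements the same calculation.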
Heteroskedasticity

• No assumptions of homoskedasticity
• Heteroskedasticity will definitely occur when:
  - the CEF is linear, but σ²(x) = V[Y_i|X_i = x] is not constant in x
  - E[Y_i|X_i] is not linear, but we use linear regression to approximate it

22/58
2. Regression and Causality

23/58
Regression and causality

• Most econometrics textbooks: regression defined without respect to causality
• But then when is β̂ "biased"? The above derivations work for some CEF E[Y_i|X_i]
• The question then is: when does knowing the CEF tell us something about causality?
• MHE argues that a regression is causal when the CEF it approximates is causal. Identification is king.
• We will show that under certain conditions, a regression of the outcome on the treatment and the covariates can recover a causal parameter, but perhaps not the one in which we are interested

24/58
Review

• Quick reminder: we have potential outcomes Y_i(1) and Y_i(0) and two parameters, the ATE and ATT:

  τ = E[Y_i(1) − Y_i(0)]
  τ_ATT = E[Y_i(1) − Y_i(0) | D_i = 1]

• We have shown in past weeks that these effects are identified when ignorability holds. MHE calls this the conditional independence assumption (CIA).

25/58
Linear constant effects model, binary treatment

• Experiment: with a simple experiment, we can rewrite the consistency assumption as a regression formula:

  Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
      = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
      = E[Y_i(0)] + τ D_i + (Y_i(0) − E[Y_i(0)])
      = μ_0 + τ D_i + v_0i

• Note that if ignorability holds (as in an experiment) for Y_i(0), then it will also hold for v_0i, since E[Y_i(0)] is constant. Thus, this satisfies the usual assumptions for regression.

26/58
Now with covariates

• Now assume no unmeasured confounders: Y_i(d) ⊥⊥ D_i | X_i
• We will assume a linear model for the potential outcomes:

  Y_i(d) = α + τ·d + η_i

• Remember that linearity isn't an assumption if D_i is binary
• Effect of D_i is constant here; the η_i are the only source of individual variation, and we have E[η_i] = 0
• The consistency assumption allows us to write this as:

  Y_i = α + τ D_i + η_i

27/58
Covariates in the error

• Let's assume that η_i is linear in X_i: η_i = X_i'γ + ν_i
• New error is uncorrelated with X_i: E[ν_i|X_i] = 0
• This is an assumption! Might be false!
• Plug into the above:

  E[Y_i(d)|X_i] = E[Y_i|D_i, X_i] = α + τ D_i + E[η_i|X_i]
                                  = α + τ D_i + X_i'γ + E[ν_i|X_i]
                                  = α + τ D_i + X_i'γ

28/58
Summing up regression with constant effects

• Reviewing the assumptions we've used:
  - no unmeasured confounders
  - constant treatment effects
  - linearity of the treatment/covariates
• Under these, we can run the following regression to estimate the ATE, τ:

  Y_i = α + τ D_i + X_i'γ + ν_i

• Works with continuous or ordinal D_i if the effect of these variables is truly linear

29/58
OLS constant effects simulation

Model with linear covariates, constant 0 effect of treatment:

library(mvtnorm)
n <- 100
p <- 4
X <- rmvnorm(n = 100, mean = rep(0, p))
gamma <- c(27.4, 13.7, 13.7, 13.7)
y <- 210 + X %*% c(gamma) + rnorm(n)
alpha <- c(-1, 0.5, -0.5, -0.1)
d.probs <- boot::inv.logit(X %*% alpha)
d <- rbinom(n, size = 1, prob = d.probs)

30/58
OLS with no covariates

summary(lm(y ~ d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   220.36       4.58   48.13  < 2e-16
d             -28.69       6.54   -4.39 0.000029
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 33 on 98 degrees of freedom
Multiple R-squared: 0.164, Adjusted R-squared: 0.156
F-statistic: 19.2 on 1 and 98 DF, p-value: 0.000029

31/58
OLS with covariates

summary(lm(y ~ d + X))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 209.952       0.159 1320.7   <2e-16
d            -0.128       0.253   -0.5     0.62
X1           27.368       0.126  217.5   <2e-16
X2           13.677       0.114  120.1   <2e-16
X3           13.673       0.130  105.1   <2e-16
X4           13.570       0.106  128.0   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom
Multiple R-squared: 0.999, Adjusted R-squared: 0.999
F-statistic: 2.44e+04 on 5 and 94 DF, p-value: <2e-16

32/58
What happens with nonlinearity?

Suppose we can only observe the following covariates:

z1 <- exp(X[, 1]/2)
z2 <- X[, 2]/(1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3]/25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

Implies that Y_i and D_i are functions of log(Z_i1), Z_i2, Z_i1² Z_i2, 1/log(Z_i1), Z_i3, log(Z_i1), and Z_i4^(1/2)

Regression is a nonlinear function of the observed covariates

33/58
When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.8728    30.0799    0.73    0.469
d            -6.9292     3.3854   -2.05    0.043
z1           36.3110     2.6970   13.46   <2e-16
z2           -2.9033     3.6619   -0.79    0.430
z3           86.2022    43.1030    2.00    0.048
z4            0.4021     0.0329   12.23   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14 on 94 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF, p-value: <2e-16

34/58
3. Regression with Heterogeneous Treatment Effects

35/58
Heterogeneous effects, binary treatment

• Completely randomized experiment:

  Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
      = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
      = μ_0 + τ_i D_i + (Y_i(0) − μ_0)
      = μ_0 + τ D_i + (Y_i(0) − μ_0) + (τ_i − τ)·D_i
      = μ_0 + τ D_i + ε_i

• The error term now includes two components:
  1. "Baseline" variation in the outcome: (Y_i(0) − μ_0)
  2. Variation in the treatment effect: (τ_i − τ)
• Easy to verify that under an experiment, E[ε_i|D_i] = 0
• Thus, OLS estimates the ATE with no covariates

36/58
Adding covariates

• What happens with no unmeasured confounders? Need to condition on X_i now.
• Remember identification of the ATE/ATT using iterated expectations
• The ATE is the weighted sum of CATEs:

  τ = Σ_x τ(x) Pr[X_i = x]

• The ATE/ATT are weighted averages of CATEs
• What about the regression estimand, τ_R? How does it relate to the ATE/ATT?

37/58
Heterogeneous effects and regression

• Let's investigate this under a saturated regression model:

  Y_i = Σ_x B_xi α_x + τ_R D_i + e_i

• Use a dummy variable for each unique combination of X_i: B_xi = 1(X_i = x)
• Linear in X_i by construction

38/58
Investigating the regression coefficient

• How can we investigate τ_R? Well, we can rely on the regression anatomy:

  τ_R = Cov(Y_i, D_i − E[D_i|X_i]) / V(D_i − E[D_i|X_i])

• D_i − E[D_i|X_i] is the residual from a regression of D_i on the full set of dummies
• With a little work, we can show:

  τ_R = E[τ(X_i)(D_i − E[D_i|X_i])²] / E[(D_i − E[D_i|X_i])²] = E[τ(X_i) σ_d²(X_i)] / E[σ_d²(X_i)]

• σ_d²(x) = V[D_i|X_i = x] is the conditional variance of treatment assignment

39/58
ATE versus OLS

  τ_R = E[τ(X_i) W_i] = Σ_x τ(x) (σ_d²(x) / E[σ_d²(X_i)]) Pr[X_i = x]

• Compare to the ATE:

  τ = E[τ(X_i)] = Σ_x τ(x) Pr[X_i = x]

• Both weight strata relative to their size (Pr[X_i = x])
• OLS weights strata higher if the treatment variance in those strata (σ_d²(x)) is high relative to the average variance across strata (E[σ_d²(X_i)])
• The ATE weights only by their size

40/58
Regression weighting

  W_i = σ_d²(X_i) / E[σ_d²(X_i)]

• Why does OLS weight like this?
• OLS is a minimum-variance estimator: more weight to more precise within-strata estimates
• Within-strata estimates are most precise when the treatment is evenly spread and thus has the highest variance
• If D_i is binary, then we know the conditional variance will be:

  σ_d²(x) = Pr[D_i = 1|X_i = x] (1 − Pr[D_i = 1|X_i = x]) = e(x)(1 − e(x))

• Maximum variance with Pr[D_i = 1|X_i = x] = 1/2

41/58
OLS weighting example

• Binary covariate:

  Pr[X_i = 1] = 0.75    Pr[X_i = 0] = 0.25
  Pr[D_i = 1|X_i = 1] = 0.9    Pr[D_i = 1|X_i = 0] = 0.5
  σ_d²(1) = 0.09    σ_d²(0) = 0.25
  τ(1) = 1    τ(0) = −1

• Implies the ATE is τ = 0.5
• Average conditional variance: E[σ_d²(X_i)] = 0.13
• Weights: for X_i = 1, 0.09/0.13 = 0.692; for X_i = 0, 0.25/0.13 = 1.92

  τ_R = E[τ(X_i) W_i]
      = τ(1) W(1) Pr[X_i = 1] + τ(0) W(0) Pr[X_i = 0]
      = 1 × 0.692 × 0.75 + −1 × 1.92 × 0.25
      = 0.039

42/58
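The arithmetic in this example can be replicated in a few lines (a sketch):

```r
# Sketch: reproducing the OLS weighting example numbers.
p.x <- c(0.25, 0.75)          # Pr[X = 0], Pr[X = 1]
e.x <- c(0.5, 0.9)            # propensity scores by stratum
tau.x <- c(-1, 1)             # CATEs
s2 <- e.x * (1 - e.x)         # sigma_d^2(x): 0.25, 0.09
w <- s2 / sum(s2 * p.x)       # OLS weights: 1.92, 0.692
tau <- sum(tau.x * p.x)       # ATE = 0.5
tau.r <- sum(tau.x * w * p.x) # regression estimand ~ 0.039
round(c(tau, tau.r), 3)
```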
When will OLS estimate the ATE?

• When does τ = τ_R?
• Constant treatment effects: τ(x) = τ = τ_R
• Constant probability of treatment: e(x) = Pr[D_i = 1|X_i = x] = e
  - Implies that the OLS weights are 1
• An incorrect linearity assumption in X_i will lead to more bias

43/58
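To see the second condition at work (a sketch with made-up strata probabilities and CATEs): when the propensity score is constant, every weight equals 1 and the regression estimand coincides with the ATE.

```r
# Sketch: constant propensity score => OLS weights all equal 1,
# so tau_R = ATE regardless of how the CATEs vary.
p.x <- c(0.2, 0.3, 0.5)          # hypothetical Pr[X = x]
tau.x <- c(-2, 0, 3)             # hypothetical CATEs
e.x <- rep(0.4, 3)               # same Pr[D = 1 | X = x] in every stratum
s2 <- e.x * (1 - e.x)
w <- s2 / sum(s2 * p.x)          # = (1, 1, 1)
tau <- sum(tau.x * p.x)          # ATE
tau.r <- sum(tau.x * w * p.x)    # regression estimand
c(tau, tau.r)                    # identical
```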
Other ways to use regression

• What's the path forward?
  - Accept the bias (might be relatively small with saturated models)
  - Use a different regression approach
• Let μ_d(x) = E[Y_i(d)|X_i = x] be the CEF for the potential outcome under D_i = d
• By consistency and no unmeasured confounders, we have μ_d(x) = E[Y_i|D_i = d, X_i = x]
• Estimate a regression of Y_i on X_i among the D_i = d group
• Then μ̂_d(x) is just a predicted value from that regression at X_i = x
• How can we use this?

44/58
Imputation estimators

• Impute the treated potential outcomes with Ŷ_i(1) = μ̂_1(X_i)
• Impute the control potential outcomes with Ŷ_i(0) = μ̂_0(X_i)
• Procedure:
  - Regress Y_i on X_i in the treated group and get predicted values for all units (treated or control)
  - Regress Y_i on X_i in the control group and get predicted values for all units (treated or control)
  - Take the average difference between these predicted values
• More mathematically, it looks like this:

  τ̂_imp = (1/N) Σ_i [μ̂_1(X_i) − μ̂_0(X_i)]

• Sometimes called an imputation estimator

45/58
Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample
• Useful trick: use a model on the entire data and model.frame() to get the right design matrix

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)
[1] 0.61

46/58
Notes on imputation estimators

• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE
• Why don't people use this?
  - Most people don't know the results we've been talking about
  - Harder to implement than vanilla OLS
• Can use linear regression to estimate μ̂_d(x) = x'β̂_d
• Recent trend is to estimate μ̂_d(x) via non-parametric methods such as:
  - Kernel regression, local linear regression, regression trees, etc.
  - Easiest is generalized additive models (GAMs)

47/58
Imputation estimator visualization

[Figures, slides 48-50: scatterplots of y against x for treated and control units, building up the within-group regression fits used to impute the missing potential outcomes]

48-50/58
Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i

[Figures, slides 51-53: scatterplots of y against x for treated and control units where the true CEFs are nonlinear]

51-53/58
Using semiparametric regression

• Here, the CEFs are nonlinear, but we don't know their form
• We can use GAMs from the mgcv package for a flexible estimate

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204  Scale est. = 0.0071351  n = 30

54/58
Using GAMs

[Figures, slides 55-57: GAM fits for the treated and control groups overlaid on the data, used to impute the potential outcomes]

55-57/58
Limited dependent variables

• Usual advice: model the data from first principles: logit/probit for binary, Poisson for counts, etc.
• OLS is a-ok with limited DVs when:
  - Binary treatment and no covariates (just diff-in-means)
  - Binary treatment, discrete covariates, and saturated models (stratified diff-in-means)
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models
  - Could go wrong in small samples
  - If using nonlinear models, always get effects on the scale of the outcome

58/58
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
Justification 1: Linear CEFs

• Justification 1: if the CEF is linear, the population regression function is it. That is, if E[Y_i|X_i] = X_i'b, then b = β.
• When would we expect the CEF to be linear? Two cases:
  1. Outcome and covariates are multivariate normal
  2. Linear regression model is saturated
• A model is saturated if there are as many parameters as there are possible combinations of the X_i variables

9/58
Saturated model example

• Two binary variables: X_1i for incumbency status and X_2i for party of the candidate
• Four possible values of X_i, four possible values of μ(X_i):

  E[Y_i|X_1i = 0, X_2i = 0] = α
  E[Y_i|X_1i = 1, X_2i = 0] = α + β
  E[Y_i|X_1i = 0, X_2i = 1] = α + γ
  E[Y_i|X_1i = 1, X_2i = 1] = α + β + γ + δ

• We can write the CEF as follows:

  E[Y_i|X_1i, X_2i] = α + β X_1i + γ X_2i + δ (X_1i X_2i)

10/58
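A simulated sketch (not from the slides) of why saturation makes linearity automatic: the interacted regression reproduces the within-cell means exactly.

```r
# Sketch: a saturated model with two binary covariates fits each of
# the four cell means exactly.
set.seed(4)
n <- 400
x1 <- rbinom(n, 1, 0.5)
x2 <- rbinom(n, 1, 0.5)
y <- 1 + 2 * x1 - x2 + 0.5 * x1 * x2 + rnorm(n)
b <- coef(lm(y ~ x1 * x2))               # alpha, beta, gamma, delta
fit11 <- sum(b)                          # fitted CEF at (x1, x2) = (1, 1)
mean11 <- mean(y[x1 == 1 & x2 == 1])     # raw cell mean
c(fit11, mean11)                         # identical
```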
Saturated models example

  E[Y_i|X_1i, X_2i] = α + β X_1i + γ X_2i + δ (X_1i X_2i)

• Basically, each value of μ(X_i) is being estimated separately:
  - within-strata estimation
  - No borrowing of information from across values of X_i
• Requires a set of dummies for each categorical variable plus all interactions
• Or a series of dummies for each unique combination of X_i
• This makes linearity hold mechanically, so linearity is not an assumption:
  - Just a fact about saturated CEFs
  - saturated models for limited dependent variables = A-OK

11/58
Saturated model example

• Washington (AER) data on the effects of daughters
• We'll look at the relationship between voting and number of kids (causal)

girls <- foreign::read.dta("girls.dta")
head(girls[, c("name", "totchi", "aauw")])

                name totchi aauw
1  ABERCROMBIE, NEIL      0  100
2   ACKERMAN, GARY L      3   88
3 ADERHOLT, ROBERT B      0    0
4    ALLEN, THOMAS H      2  100
5  ANDREWS, ROBERT E      2  100
6         ARCHER, WR      7    0

12/58
Linear model

summary(lm(aauw ~ totchi, data = girls))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    61.31       1.81   33.81   <2e-16
totchi         -5.33       0.62   -8.59   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 42 on 1733 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.0408, Adjusted R-squared: 0.0403
F-statistic: 73.8 on 1 and 1733 DF, p-value: <2e-16

13/58
Saturated modelsummary(lm(aauw ~ asfactor(totchi) data = girls))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 5641 276 2042 lt 2e-16
asfactor(totchi)1 545 411 133 01851
asfactor(totchi)2 -380 327 -116 02454
asfactor(totchi)3 -1365 345 -395 81e-05
asfactor(totchi)4 -1931 401 -482 16e-06
asfactor(totchi)5 -1546 485 -319 00015
asfactor(totchi)6 -3359 1042 -322 00013
asfactor(totchi)7 -1713 1141 -150 01336
asfactor(totchi)8 -5533 1228 -451 70e-06
asfactor(totchi)9 -5041 2408 -209 00364
asfactor(totchi)10 -5341 2090 -256 00107
asfactor(totchi)12 -5641 4153 -136 01745
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 41 on 1723 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared 00506 Adjusted R-squared 00446
F-statistic 836 on 11 and 1723 DF p-value 184e-1414 58
Saturated model minus the constantsummary(lm(aauw ~ asfactor(totchi) - 1 data = girls))
Coefficients
Estimate Std Error t value Pr(gt|t|)
asfactor(totchi)0 5641 276 2042 lt2e-16
asfactor(totchi)1 6186 305 2031 lt2e-16
asfactor(totchi)2 5262 175 3013 lt2e-16
asfactor(totchi)3 4276 207 2062 lt2e-16
asfactor(totchi)4 3711 290 1279 lt2e-16
asfactor(totchi)5 4095 399 1027 lt2e-16
asfactor(totchi)6 2282 1005 227 00233
asfactor(totchi)7 3929 1107 355 00004
asfactor(totchi)8 108 1196 009 09278
asfactor(totchi)9 600 2392 025 08020
asfactor(totchi)10 300 2072 014 08849
asfactor(totchi)12 000 4143 000 10000
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 41 on 1723 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared 0587 Adjusted R-squared 0584
F-statistic 204 on 12 and 1723 DF p-value lt2e-1615 58
Compare to within-strata means
bull The saturated model makes no assumptions about thebetween-strata relationships
bull Just calculates within-strata means
c1 lt- coef(lm(aauw ~ asfactor(totchi) - 1 data = girls))
c2 lt- with(girls tapply(aauw totchi mean narm = TRUE))
rbind(c1 c2)
0 1 2 3 4 5 6 7 8 9 10 12
c1 56 62 53 43 37 41 23 39 11 6 3 0
c2 56 62 53 43 37 41 23 39 11 6 3 0
16 58
Other justifications for OLS
bull Justification 2 119883prime119894 120573 is the best linear predictor (in a
mean-squared error sense) of 119884119894 Why 120573 = argmin119887 120124[(119884119894 minus 119883prime
119894 119887)1113569]
bull Justification 3 119883prime119894 120573 provides the minimum mean squared
error linear approxmiation to 119864[119884119894|119883119894]bull Even if the CEF is not linear a linear regression provides the
best linear approximation to that CEFbull Donrsquot need to believe the assumptions (linearity) in order to
use regression as a good approximation to the CEFbull Warning if the CEF is very nonlinear then this approximation
could be terrible
17 58
The error terms
bull Letrsquos define the error term 119890119894 equiv 119884119894 minus 119883prime119894 120573 so that
119884119894 = 119883prime119894 120573 + [119884119894 minus 119883prime
119894 120573] = 119883prime119894 120573 + 119890119894
bull Note the residual 119890119894 is uncorrelated with 119883119894
120124[119883119894119890119894] = 120124[119883119894(119884119894 minus 119883prime119894 120573)]
= 120124[119883119894119884119894] minus 120124[119883119894119883prime119894 120573]
= 120124[119883119894119884119894] minus 120124 1114094119883119894119883prime119894120124[119883119894119883prime
119894 ]minus1113568120124[119883119894119884119894]1114097= 120124[119883119894119884119894] minus 120124[119883119894119883prime
119894 ]120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]
= 120124[119883119894119884119894] minus 120124[119883119894119884119894] = 0
bull No assumptions on the linearity of 120124[119884119894|119883119894]
18 58
OLS estimator
bull We know the population value of 120573 is
120573 = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]
bull How do we get an estimator of thisbull Plug-in principle replace population expectation with
sample versions
=
โกโขโขโขโขโขโฃ1119873
1114012119894119883119894119883prime
119894
โคโฅโฅโฅโฅโฅโฆ
minus11135681119873
1114012119894119883119894119884119894
bull If you work through the matrix algebra this turns out to be
= (119831prime119831)minus1113568 119831prime119858
19 58
Asymptotic OLS inference
bull With this representation in hand we can write the OLSestimator as follows
= 120573 +
โกโขโขโขโขโขโฃ1114012
119894119883119894119883prime
119894
โคโฅโฅโฅโฅโฅโฆ
minus1113568
1114012119894119883119894119890119894
bull Core idea sum119894119883119894119890119894 is the sum of rvs so the CLT applies
bull That plus some simple asymptotic theory allows us to say
radic119873( minus 120573) 119873(0ฮฉ)
bull Converges in distribution to a Normal distribution with meanvector 0 and covariance matrix ฮฉ
ฮฉ = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119883prime
119894 1198901113569119894 ]120124[119883119894119883prime119894 ]minus1113568
bull No linearity assumption needed20 58
Estimating the variance
bull In large samples then
radic119873( minus 120573) sim 119873(0ฮฉ)
bull How to estimate ฮฉ Plug-in principle again
1114023ฮฉ =
โกโขโขโขโขโขโฃ1114012
119894119883119894119883prime
119894
โคโฅโฅโฅโฅโฅโฆ
minus1113568 โกโขโขโขโขโขโฃ1114012119894119883119894119883prime
119894 1113569119894
โคโฅโฅโฅโฅโฅโฆ
โกโขโขโขโขโขโฃ1114012
119894119883119894119883prime
119894
โคโฅโฅโฅโฅโฅโฆ
minus1113568
bull Replace 119890119894 with its emprical counterpart (residuals)119894 = 119884119894 minus 119883prime
119894 bull Replace the population moments of 119883119894 with their sample
counterpartsbull The square root of the diagonals of this covariance matrix are
the ldquorobustrdquo or Huber-White standard errors that Statacommonly report
21 58
Heteroskedasticity
bull No assumptions of homoskedasticitybull Heteroskedaticity will definitely occur when
CEF is linear but the 1205901113569(119909) = 120141[119884119894|119883119894 = 119909] is not constant in 119909 119864[119884119894|119883119894] is not linear but we use the linear regression to
approxmiate it
22 58
2 Regression andCausality
23 58
Regression and causality
bull Most econometrics textbooks regression defined withoutrespect to causality
bull But then when is ldquobiasedrdquo The above derivations work forsome 120124[119884119894|119883119894]
bull The question then is when does knowing the CEF tell ussomething about causality
bull MHE argues that a regression is causal when the CEF itapproximates is causal Identification is king
bull We will show that under certain conditions a regression of theoutcome on the treatment and the covariates can recover acausal parameter but perhaps not the one in which we areinterested
24 58
Review
bull Quick reminder we have potential outcomes 119884119894(1) and 119884119894(0)and two parameters the ATE and ATT
120591 = 119864[119884119894(1) minus 119884119894(0)]120591ATT = 119864[119884119894(1) minus 119884119894(0)|119863119894 = 1]
bull We have shown in past weeks that these effects are identifiedwhen ignorability holds MHE calls this the conditionalindependence assumption (CIA)
25 58
Linear constant effects model binarytreatment
bull Experiment with a simple experiment we can rewrite theconsistency assumption to be a regression formula
119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894
= 120124[119884119894(0)] + 120591119863119894 + (119884119894(0) minus 120124[119884119894(0)])= 1205831113567 + 120591119863119894 + 1199071113567119894
bull Note that if ignorability holds (as in an experiment) for 119884119894(0)then it will also hold for 1199071113567119894 since 120124[119884119894(0)] is constant Thusthis satifies the usual assumptions for regression
26 58
Now with covariates
bull Now assume no unmeasured confounders 119884119894(119889) โโ 119863119894|119883119894bull We will assume a linear model for the potential outcomes
119884119894(119889) = 120572 + 120591 sdot 119889 + 120578119894
bull Remember that linearity isnrsquot an assumption if 119863119894 is binarybull Effect of 119863119894 is constant here the 120578119894 are the only source of
individual variation and we have 119864[120578119894] = 0bull Consistency assumption allows us to write this as
119884119894 = 120572 + 120591119863119894 + 120578119894
27 58
Covariates in the error
bull Letrsquos assume that 120578119894 is linear in 119883119894 120578119894 = 119883prime119894 120574 + 120584119894
bull New error is uncorrelated with 119883119894 120124[120584119894|119883119894] = 0bull This is an assumption Might be falsebull Plug into the above
120124[119884119894(119889)|119883119894] = 119864[119884119894|119863119894 119883119894] = 120572 + 120591119863119894 + 119864[120578119894|119883119894]= 120572 + 120591119863119894 + 119883prime
119894 120574 + 119864[120584119894|119883119894]= 120572 + 120591119863119894 + 119883prime
119894 120574
28 58
Summing up regression with constanteffects
bull Reviewing the assumptions wersquove used no unmeasured confounders constant treatment effects linearity of the treatmentcovariates
bull Under these we can run the following regression to estimatethe ATE 120591
119884119894 = 120572 + 120591119863119894 + 119883prime119894 120574 + 120584119894
bull Works with continuous or ordinal 119863119894 if linearity in the effect ofthese variables is truly linear
29 58
OLS, constant effects: simulation

Model with linear covariates and a constant 0 effect of treatment:

library(mvtnorm)
n <- 100
p <- 4
X <- rmvnorm(n = 100, mean = rep(0, p))
gamma <- c(2.74, 1.37, 1.37, 1.37)
y <- 21 + X %*% c(gamma) + rnorm(n)
alpha <- c(-1, 0.5, -0.5, -0.1)
d.probs <- boot::inv.logit(X %*% alpha)
d <- rbinom(n, size = 1, prob = d.probs)

30/58
OLS with no covariates

summary(lm(y ~ d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   22.036      0.458   48.13  < 2e-16 ***
d             -2.869      0.654   -4.39 0.000029 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.3 on 98 degrees of freedom
Multiple R-squared: 0.164, Adjusted R-squared: 0.156
F-statistic: 19.2 on 1 and 98 DF, p-value: 0.000029

31/58
OLS with covariates

summary(lm(y ~ d + X))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  20.9952      0.159  132.07   <2e-16 ***
d            -0.128       0.253   -0.50     0.62
X1            2.7368      0.126   21.75   <2e-16 ***
X2            1.3677      0.114   12.01   <2e-16 ***
X3            1.3673      0.130   10.51   <2e-16 ***
X4            1.3570      0.106   12.80   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom
Multiple R-squared: 0.999, Adjusted R-squared: 0.999
F-statistic: 2.44e+04 on 5 and 94 DF, p-value: <2e-16

32/58
What happens with nonlinearity?

Suppose we can only observe the following covariates:

z1 <- exp(X[, 1] / 2)
z2 <- X[, 2] / (1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3] / 25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

Implies that Y_i and D_i are functions of log(Z_i1), Z_i2, Z²_i1·Z_i2, 1/log(Z_i1), Z_i3^(1/3)/log(Z_i1), and Z_i4^(1/2).

The true regression function is nonlinear in the observed covariates.

33/58
When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.8728    30.0799    0.73    0.469
d            -6.9292     3.3854   -2.05    0.043 *
z1           36.3110     2.6970   13.46   <2e-16 ***
z2           -2.9033     3.6619   -0.79    0.430
z3           86.2022    43.1030    2.00    0.048 *
z4            0.4021     0.0329   12.23   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.4 on 94 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF, p-value: <2e-16

34/58
3. Regression with Heterogeneous Treatment Effects

35/58
Heterogeneous effects, binary treatment

• Completely randomized experiment:

Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0)
    = Y_i(0) + (Y_i(1) - Y_i(0)) D_i
    = μ_0 + τ_i D_i + (Y_i(0) - μ_0)
    = μ_0 + τ D_i + (Y_i(0) - μ_0) + (τ_i - τ)·D_i
    = μ_0 + τ D_i + ε_i

• The error term now includes two components:
  1. "Baseline" variation in the outcome: (Y_i(0) - μ_0)
  2. Variation in the treatment effect: (τ_i - τ) D_i
• Easy to verify that under an experiment, E[ε_i | D_i] = 0.
• Thus, OLS estimates the ATE with no covariates.

36/58
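The last two bullets are easy to check by simulation. The sketch below is my own illustration (the data-generating process and variable names are invented, not from the slides): even with heterogeneous unit-level effects τ_i, the bivariate OLS coefficient on a randomized treatment centers on the ATE.

```r
## Randomized binary treatment with heterogeneous effects: OLS of Y on D
## alone still targets the ATE, since E[eps_i | D_i] = 0 by randomization.
set.seed(2138)
n <- 100000
tau.i <- rnorm(n, mean = 2)           # unit-level effects; ATE = 2
y0 <- rnorm(n)                        # baseline outcomes Y_i(0)
d <- rbinom(n, size = 1, prob = 0.5)  # randomized treatment
y <- y0 + tau.i * d                   # observed outcomes
coef(lm(y ~ d))["d"]                  # close to the ATE of 2
```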
Adding covariates

• What happens with no unmeasured confounders? We need to condition on X_i now.
• Remember identification of the ATE/ATT using iterated expectations.
• The ATE is the weighted sum of CATEs:

τ = Σ_x τ(x) Pr[X_i = x]

• The ATE/ATT are weighted averages of CATEs.
• What about the regression estimand, τ_R? How does it relate to the ATE/ATT?

37/58
Heterogeneous effects and regression

• Let's investigate this under a saturated regression model:

Y_i = Σ_x B_xi α_x + τ_R D_i + e_i

• Use a dummy variable for each unique combination of X_i: B_xi = 1(X_i = x).
• Linear in X_i by construction.

38/58
Investigating the regression coefficient

• How can we investigate τ_R? We can rely on the regression anatomy:

τ_R = Cov(Y_i, D_i - E[D_i | X_i]) / V(D_i - E[D_i | X_i])

• D_i - E[D_i | X_i] is the residual from a regression of D_i on the full set of dummies.
• With a little work, we can show:

τ_R = E[τ(X_i)(D_i - E[D_i | X_i])²] / E[(D_i - E[D_i | X_i])²] = E[τ(X_i) σ²_d(X_i)] / E[σ²_d(X_i)]

• σ²_d(x) = V[D_i | X_i = x] is the conditional variance of treatment assignment.

39/58
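The regression-anatomy formula can be verified numerically. The snippet below is my own illustration (invented data-generating process, hypothetical variable names): by the Frisch-Waugh-Lovell theorem, the coefficient on D_i from the saturated regression equals the coefficient from a simple regression of Y_i on the residualized treatment, D_i - Ê[D_i | X_i].

```r
## Saturated regression vs. regression on the residualized treatment.
set.seed(2138)
n <- 1000
x <- sample(0:2, n, replace = TRUE)               # discrete covariate, 3 strata
d <- rbinom(n, 1, prob = c(0.3, 0.5, 0.7)[x + 1]) # varying propensities
y <- 1 + 2 * x + (1 + x) * d + rnorm(n)           # heterogeneous effects

tau.R <- coef(lm(y ~ d + as.factor(x)))["d"]      # saturated model

d.tilde <- residuals(lm(d ~ as.factor(x)))        # D_i - Ehat[D_i | X_i]
tau.anatomy <- coef(lm(y ~ d.tilde))["d.tilde"]

all.equal(unname(tau.R), unname(tau.anatomy))     # TRUE: they coincide
```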
ATE versus OLS

τ_R = E[τ(X_i) W_i] = Σ_x τ(x) · (σ²_d(x) / E[σ²_d(X_i)]) · P[X_i = x]

• Compare to the ATE:

τ = E[τ(X_i)] = Σ_x τ(x) P[X_i = x]

• Both weight strata relative to their size, P[X_i = x].
• OLS weights strata more heavily if the treatment variance in those strata, σ²_d(x), is high relative to the average variance across strata, E[σ²_d(X_i)].
• The ATE weights strata only by their size.

40/58
Regression weighting

W_i = σ²_d(X_i) / E[σ²_d(X_i)]

• Why does OLS weight like this?
• OLS is a minimum-variance estimator: it puts more weight on more precise within-strata estimates.
• Within-strata estimates are most precise when the treatment is evenly spread and thus has the highest variance.
• If D_i is binary, then we know the conditional variance will be:

σ²_d(x) = P[D_i = 1 | X_i = x](1 - P[D_i = 1 | X_i = x]) = e(x)(1 - e(x))

• Maximum variance with P[D_i = 1 | X_i = x] = 1/2.

41/58
OLS weighting example

• Binary covariate:

P[X_i = 1] = 0.75    P[X_i = 0] = 0.25
P[D_i = 1 | X_i = 1] = 0.9    P[D_i = 1 | X_i = 0] = 0.5
σ²_d(1) = 0.09    σ²_d(0) = 0.25
τ(1) = 1    τ(0) = -1

• Implies the ATE is τ = 0.5.
• Average conditional variance: E[σ²_d(X_i)] = 0.13.
• Weights: for X_i = 1, 0.09/0.13 = 0.692; for X_i = 0, 0.25/0.13 = 1.92.

τ_R = E[τ(X_i) W_i]
    = τ(1) W(1) P[X_i = 1] + τ(0) W(0) P[X_i = 0]
    = 1 × 0.692 × 0.75 + (-1) × 1.92 × 0.25
    = 0.039

42/58
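The slide's arithmetic can be reproduced directly in R; all of the numbers below come from the slide itself.

```r
## OLS weighting with a binary covariate: ATE vs. the regression estimand.
p.x   <- c(0.25, 0.75)         # P[X = 0], P[X = 1]
tau.x <- c(-1, 1)              # CATEs: tau(0), tau(1)
e.x   <- c(0.5, 0.9)           # propensity scores: e(0), e(1)
s2    <- e.x * (1 - e.x)       # sigma^2_d(x): 0.25 and 0.09
E.s2  <- sum(s2 * p.x)         # E[sigma^2_d(X)] = 0.13
w     <- s2 / E.s2             # OLS weights: 1.92 and 0.692
tau   <- sum(tau.x * p.x)      # ATE = 0.5
tau.R <- sum(tau.x * w * p.x)  # exactly 0.005/0.13, about 0.0385
```

The exact value is 0.005/0.13 ≈ 0.0385; the slide's 0.039 comes from carrying the rounded weights 0.692 and 1.92 through the final sum.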
When will OLS estimate the ATE?

• When does τ = τ_R?
• Constant treatment effects: τ(x) = τ = τ_R.
• Constant probability of treatment: e(x) = P[D_i = 1 | X_i = x] = e.
  - Implies that the OLS weights are 1.
• An incorrect linearity assumption in X_i will lead to more bias.

43/58
Other ways to use regression

• What's the path forward?
  - Accept the bias (it might be relatively small with saturated models).
  - Use a different regression approach.
• Let μ_d(x) = E[Y_i(d) | X_i = x] be the CEF for the potential outcome under D_i = d.
• By consistency and no unmeasured confounders, we have μ_d(x) = E[Y_i | D_i = d, X_i = x].
• Estimate a regression of Y_i on X_i among the D_i = d group.
• Then μ̂_d(x) is just a predicted value from that regression at X_i = x.
• How can we use this?

44/58
Imputation estimators

• Impute the treated potential outcomes with Ŷ_i(1) = μ̂_1(X_i).
• Impute the control potential outcomes with Ŷ_i(0) = μ̂_0(X_i).
• Procedure:
  - Regress Y_i on X_i in the treated group and get predicted values for all units (treated or control).
  - Regress Y_i on X_i in the control group and get predicted values for all units (treated or control).
  - Take the average difference between these predicted values.
• More mathematically, it looks like this:

τ̂_imp = (1/N) Σ_i [μ̂_1(X_i) - μ̂_0(X_i)]

• Sometimes called an imputation estimator.

45/58
Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample.
• Useful trick: fit a model on the entire data and use model.frame() to get the right design matrix.

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61

46/58
Notes on imputation estimators

• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE.
• Why don't people use this?
  - Most people don't know the results we've been talking about.
  - Harder to implement than vanilla OLS.
• Can use linear regression to estimate μ̂_d(x) = x'β̂_d.
• Recent trend is to estimate μ̂_d(x) via non-parametric methods such as:
  - kernel regression, local linear regression, regression trees, etc.
  - Easiest is generalized additive models (GAMs).

47/58
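To make the GAM suggestion concrete, here is a sketch of the imputation estimator with gam() fits in each treatment group. Everything here (the data-generating process, the true ATE of 0.5, the variable names) is invented for illustration:

```r
## Imputation estimator with flexible within-group fits from mgcv.
library(mgcv)
set.seed(2138)
n <- 500
x <- runif(n, -2, 2)
d <- rbinom(n, size = 1, prob = plogis(x))      # treatment depends on x
y <- sin(2 * x) + 0.5 * d + rnorm(n, sd = 0.2)  # nonlinear CEF; ATE = 0.5

mod1 <- gam(y ~ s(x), subset = d == 1)          # fit within each group
mod0 <- gam(y ~ s(x), subset = d == 0)
newdat <- data.frame(x = x)                     # predict for *all* units
tau.imp <- mean(predict(mod1, newdat) - predict(mod0, newdat))
tau.imp                                         # should be close to 0.5
```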
Imputation estimator visualization

[Figure, slides 48-50/58: y plotted against x for treated and control units, with the within-group regression fits used to impute the missing potential outcomes.]
Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i.

[Figure, slides 51-53/58: y plotted against x for treated and control units, where the CEFs are nonlinear.]
Using semiparametric regression

• Here the CEFs are nonlinear, but we don't know their form.
• We can use GAMs from the mgcv package for flexible estimation:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204  Scale est. = 0.0071351  n = 30

54/58
Using GAMs

[Figure, slides 55-57/58: y plotted against x for treated and control units, with the fitted GAM curves for each group.]
Limited dependent variables

• Usual advice: model the data from first principles. Logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is A-OK with limited DVs when:
  - binary treatment and no covariates (just a difference in means)
  - binary treatment, discrete covariates, and saturated models (stratified differences in means)
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong.
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models.
  - Could go wrong in small samples.
  - If using nonlinear models, always get effects on the scale of the outcome.

58/58
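A quick numerical check of the first "A-OK" case, on invented binary data: with a binary treatment and no covariates, the OLS slope on a binary outcome is identical to the difference in means, so no distributional model is doing any work.

```r
## OLS on a limited DV with a binary treatment = difference in means.
set.seed(2138)
n <- 200
d <- rbinom(n, size = 1, prob = 0.5)
y <- rbinom(n, size = 1, prob = 0.3 + 0.2 * d)   # binary outcome
ols.coef  <- unname(coef(lm(y ~ d))["d"])
diff.mean <- mean(y[d == 1]) - mean(y[d == 0])
all.equal(ols.coef, diff.mean)                   # TRUE
```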
Linear constant effects model binarytreatment
bull Experiment with a simple experiment we can rewrite theconsistency assumption to be a regression formula
119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894
= 120124[119884119894(0)] + 120591119863119894 + (119884119894(0) minus 120124[119884119894(0)])= 1205831113567 + 120591119863119894 + 1199071113567119894
bull Note that if ignorability holds (as in an experiment) for 119884119894(0)then it will also hold for 1199071113567119894 since 120124[119884119894(0)] is constant Thusthis satifies the usual assumptions for regression
26 58
Now with covariates
bull Now assume no unmeasured confounders 119884119894(119889) โโ 119863119894|119883119894bull We will assume a linear model for the potential outcomes
119884119894(119889) = 120572 + 120591 sdot 119889 + 120578119894
bull Remember that linearity isnrsquot an assumption if 119863119894 is binarybull Effect of 119863119894 is constant here the 120578119894 are the only source of
individual variation and we have 119864[120578119894] = 0bull Consistency assumption allows us to write this as
119884119894 = 120572 + 120591119863119894 + 120578119894
27 58
Covariates in the error
bull Letrsquos assume that 120578119894 is linear in 119883119894 120578119894 = 119883prime119894 120574 + 120584119894
bull New error is uncorrelated with 119883119894 120124[120584119894|119883119894] = 0bull This is an assumption Might be falsebull Plug into the above
120124[119884119894(119889)|119883119894] = 119864[119884119894|119863119894 119883119894] = 120572 + 120591119863119894 + 119864[120578119894|119883119894]= 120572 + 120591119863119894 + 119883prime
119894 120574 + 119864[120584119894|119883119894]= 120572 + 120591119863119894 + 119883prime
119894 120574
28 58
Summing up regression with constanteffects
bull Reviewing the assumptions wersquove used no unmeasured confounders constant treatment effects linearity of the treatmentcovariates
bull Under these we can run the following regression to estimatethe ATE 120591
119884119894 = 120572 + 120591119863119894 + 119883prime119894 120574 + 120584119894
bull Works with continuous or ordinal 119863119894 if linearity in the effect ofthese variables is truly linear
29 58
OLS constant effects simulation
Model with linear covariates constant 0 effect of treatment
library(mvtnorm)
n lt- 100
p lt- 4
X lt- rmvnorm(n = 100 mean = rep(0 p))
gamma lt- c(274 137 137 137)
y lt- 210 + X c(gamma) + rnorm(n)
alpha lt- c(-1 05 -05 -01)
dprobs lt- bootinvlogit(X alpha)
d lt- rbinom(n size = 1 prob = dprobs)
30 58
OLS with no covariates
summary(lm(y ~ d))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 22036 458 4813 lt 2e-16
d -2869 654 -439 0000029
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 33 on 98 degrees of freedom
Multiple R-squared 0164 Adjusted R-squared 0156
F-statistic 192 on 1 and 98 DF p-value 0000029
31 58
OLS with covariates
summary(lm(y ~ d + X))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 209952 0159 13207 lt2e-16
d -0128 0253 -05 062
X1 27368 0126 2175 lt2e-16
X2 13677 0114 1201 lt2e-16
X3 13673 0130 1051 lt2e-16
X4 13570 0106 1280 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 1 on 94 degrees of freedom
Multiple R-squared 0999 Adjusted R-squared 0999
F-statistic 244e+04 on 5 and 94 DF p-value lt2e-16
32 58
What happens with nonlinearity
Suppose we can only observe the following covariates
z1 lt- exp(X[ 1]2)
z2 lt- X[ 2](1 + exp(X[ 1])) + 10
z3 lt- (X[ 1] X[ 3]25 + 06)^3
z4 lt- (X[ 2] + X[ 4] + 20)^2
Implies that 119884119894 and 119863119894 are functions of log(1198851198941113568) 1198851198941113569 119885111356911989411135681198851198941113569
1 log(1198851198941113568) 1198851198941113570 log(1198851198941113568) and 119883111356811135691198941113571
Regression is a nonlinear function of the observed covariates
33 58
When linearity goes wrong
summary(lm(y ~ d + z1 + z2 + z3 + z4))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 218728 300799 073 0469
d -69292 33854 -205 0043
z1 363110 26970 1346 lt2e-16
z2 -29033 36619 -079 0430
z3 862022 431030 200 0048
z4 04021 00329 1223 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 14 on 94 degrees of freedom
Multiple R-squared 0855 Adjusted R-squared 0847
F-statistic 111 on 5 and 94 DF p-value lt2e-16
34 58
3 Regression withHeterogeneousTreatment Effects
35 58
Heterogeneous effects binary treatment
bull Completely randomized experiment
119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894
= 1205831113567 + 120591119894119863119894 + (119884119894(0) minus 1205831113567)= 1205831113567 + 120591119863119894 + (119884119894(0) minus 1205831113567) + (120591119894 minus 120591) sdot 119863119894
= 1205831113567 + 120591119863119894 + 120576119894
bull Error term now includes two components1 ldquoBaselinerdquo variation in the outcome (119884119894(0) minus 1205831113567)2 Variation in the treatment effect (120591119894 minus 120591)
bull Easy to verify that under experiment 120124[120576119894|119863119894] = 0bull Thus OLS estimates the ATE with no covariates
36 58
Adding covariates
bull What happens with no unmeasured confounders Need tocondition on 119883119894 now
bull Remember identification of the ATEATT using iteratedexpectations
bull ATE is the weighted sum of CATEs
120591 = 1114012119909120591(119909) Pr[119883119894 = 119909]
bull ATEATT are weighted averages of CATEsbull What about the regression estimand 120591119877 How does it related
to the ATEATT
37 58
Heterogeneous effects and regression
bull Letrsquos investigate this under a saturated regression model
119884119894 = 1114012119909119861119909119894120572119909 + 120591119877119863119894 + 119890119894
bull Use a dummy variable for each unique combination of 119883119894119861119909119894 = 120128(119883119894 = 119909)
bull Linear in 119883119894 by construction
38 58
Investigating the regression coefficient
bull How can we investigate 120591119877 Well we can rely on theregression anatomy
120591119877 = Cov(119884119894 119863119894 minus 119864[119863119894|119883119894])120141(119863119894 minus 119864[119863119894|119883119894])
bull 119863119894 minus 120124[119863119894|119883119894] is the residual from a regression of 119863119894 on the fullset of dummies
bull With a little work we can show
120591119877 =120124 1114094120591(119883119894)(119863119894 minus 120124[119863119894|119883119894])11135691114097
120124[(119863119894 minus 119864[119863119894|119883119894])1113569]=
120124[120591(119883119894)1205901113569119889(119883119894)]120124[1205901113569119889(119883119894)]
bull 1205901113569119889(119909) = 120141[119863119894|119883119894 = 119909] is the conditional variance of treatmentassignment
39 58
ATE versus OLS
120591119877 = 120124[120591(119883119894)119882119894] = 1114012119909120591(119909)
1205901113569119889(119909)120124[1205901113569119889(119883119894)]
โ[119883119894 = 119909]
bull Compare to the ATE
120591 = 120124[120591(119883119894)] = 1114012119909120591(119909)โ[119883119894 = 119909]
bull Both weight strata relative to their size (โ[119883119894 = 119909])bull OLS weights strata higher if the treatment variance in those
strata (1205901113569119889(119909)) is higher in those strata relative to the averagevariance across strata (120124[1205901113569119889(119883119894)])
bull The ATE weights only by their size
40 58
Regression weighting
119882119894 =1205901113569119889(119883119894)
120124[1205901113569119889(119883119894)]
bull Why does OLS weight like thisbull OLS is a minimum-variance estimator more weight to
more precise within-strata estimatesbull Within-strata estimates are most precise when the treatment
is evenly spread and thus has the highest variancebull If 119863119894 is binary then we know the conditional variance will be
1205901113569119889(119909) = โ[119863119894 = 1|119883119894 = 119909] (1 minus โ[119863119894 = 1|119883119894 = 119909])= 119890(119909) (1 minus 119890(119909))
bull Maximum variance with โ[119863119894 = 1|119883119894 = 119909] = 12
41 58
OLS weighting examplebull Binary covariate
โ[119883119894 = 1] = 075 โ[119883119894 = 0] = 025โ[119863119894 = 1|119883119894 = 1] = 09 โ[119863119894 = 1|119883119894 = 0] = 05
1205901113569119889(1) = 009 1205901113569119889(0) = 025120591(1) = 1 120591(0) = minus1
bull Implies the ATE is 120591 = 05bull Average conditional variance 120124[1205901113569119889(119883119894)] = 013bull weights for 119883119894 = 1 are 009013 = 0692 for 119883119894 = 0
025013 = 192120591119877 = 120124[120591(119883119894)119882119894]
= 120591(1)119882(1)โ[119883119894 = 1] + 120591(0)119882(0)โ[119883119894 = 0]= 1 times 0692 times 075 + minus1 times 192 times 025= 0039
42 58
When will OLS estimate the ATE
bull When does 120591 = 120591119877bull Constant treatment effects 120591(119909) = 120591 = 120591119877bull Constant probability of treatment 119890(119909) = โ[119863119894 = 1|119883119894 = 119909] = 119890
Implies that the OLS weights are 1
bull Incorrect linearity assumption in 119883119894 will lead to more bias
43 58
Other ways to use regression
bull Whatrsquos the path forward Accept the bias (might be relatively small with saturated
models) Use a different regression approach
bull Let 120583119889(119909) = 120124[119884119894(119889)|119883119894 = 119909] be the CEF for the potentialoutcome under 119863119894 = 119889
bull By consistency and nuc we have 120583119889(119909) = 120124[119884119894|119863119894 = 119889119883119894 = 119909]bull Estimate a regression of 119884119894 on 119883119894 among the 119863119894 = 119889 groupbull Then 119889(119909) is just a predicted value from the regression for
119883119894 = 119909bull How can we use this
44 58
Imputation estimators
• Impute the treated potential outcomes with Ŷ_i(1) = μ̂₁(X_i)
• Impute the control potential outcomes with Ŷ_i(0) = μ̂₀(X_i)
• Procedure:
  - Regress Y_i on X_i in the treated group and get predicted values for all units (treated or control)
  - Regress Y_i on X_i in the control group and get predicted values for all units (treated or control)
  - Take the average difference between these predicted values
• More formally, it looks like this:

τ̂_imp = (1/N) Σ_i [ μ̂₁(X_i) − μ̂₀(X_i) ]

• Sometimes called an imputation estimator
45 58
Simple imputation estimator
• Use predict() from the within-group models on the data from the entire sample
• Useful trick: use a model on the entire data and model.frame() to get the right design matrix

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61
46 58
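The same procedure can be sketched outside of R. Below is an illustrative numpy version on made-up simulated data (the data-generating process is hypothetical, not the one from the slides; the true ATE here is 1 by construction):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical confounded data with a heterogeneous effect (true ATE = 1).
n = 2_000
x = rng.normal(size=n)
d = rng.binomial(1, 1 / (1 + np.exp(-x)))        # treatment depends on x
y = x + d * (1 + 0.5 * x) + rng.normal(size=n)   # tau_i = 1 + 0.5 * x_i

X = np.column_stack([np.ones(n), x])
b1 = np.linalg.lstsq(X[d == 1], y[d == 1], rcond=None)[0]  # treated-group fit
b0 = np.linalg.lstsq(X[d == 0], y[d == 0], rcond=None)[0]  # control-group fit

# Impute both potential outcomes for every unit; average the difference.
tau_imp = np.mean(X @ b1 - X @ b0)
```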
Notes on imputation estimators
• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE
• Why don't people use this?
  - Most people don't know the results we've been talking about
  - Harder to implement than vanilla OLS
• Can use linear regression to estimate μ̂_d(x) = x′β̂_d
• Recent trend is to estimate μ̂_d(x) via non-parametric methods such as:
  - Kernel regression, local linear regression, regression trees, etc.
  - Easiest is generalized additive models (GAMs)
47 58
Imputation estimator visualization

[Figure, shown in three build-up steps across slides 48-50: simulated data, y plotted against x, with treated and control units shown separately along with the within-group regression fits used for imputation]
Nonlinear relationships
• Same idea, but with a nonlinear relationship between Y_i and X_i

[Figure, shown in three build-up steps across slides 51-53: y against x with visibly nonlinear treated and control CEFs and the corresponding within-group fits]
Using semiparametric regression
• Here the CEFs are nonlinear, but we don't know their form
• We can use GAMs from the mgcv package for flexible estimation

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204  Scale est. = 0.0071351  n = 30
54 58
Using GAMs

[Figure, shown in three build-up steps across slides 55-57: the estimated GAM fits for the treated and control groups tracking the nonlinear CEFs of y against x]
Limited dependent variables
• Usual advice: model the data from first principles: logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is a-ok with limited DVs when:
  - Binary treatment and no covariates (just diff-in-means)
  - Binary treatment, discrete covariates, and saturated models (stratified diff-in-means)
• Imposing a model on LDVs in these cases imposes a distributional assumption, which could be wrong
• Even in unsaturated models, the marginal effects from OLS are often decent compared to nonlinear models
  - Could go wrong in small samples
  - If using nonlinear models, always get effects on the scale of the outcome
58 58
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
Saturated models example

E[Y_i | X_{1i}, X_{2i}] = α + β X_{1i} + γ X_{2i} + δ (X_{1i} X_{2i})

• Basically, each value of μ(X_i) is being estimated separately: within-strata estimation, with no borrowing of information from across values of X_i
• Requires a set of dummies for each categorical variable plus all interactions
• Or a series of dummies for each unique combination of X_i
• This makes linearity hold mechanically, and so linearity is not an assumption: just a fact about saturated CEFs
• ⇝ saturated models for limited dependent variables = A-OK
11 58
Saturated model example
• Washington (AER) data on the effects of daughters
• We'll look at the relationship between voting and number of kids (causal)

girls <- foreign::read.dta("girls.dta")
head(girls[, c("name", "totchi", "aauw")])

                name totchi aauw
1  ABERCROMBIE, NEIL      0  100
2   ACKERMAN, GARY L      3   88
3 ADERHOLT, ROBERT B      0    0
4    ALLEN, THOMAS H      2  100
5  ANDREWS, ROBERT E      2  100
6         ARCHER, WR      7    0
12 58
Linear model
summary(lm(aauw ~ totchi, data = girls))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    61.31       1.81   33.81   <2e-16
totchi         -5.33       0.62   -8.59   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 42 on 1733 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.0408,  Adjusted R-squared: 0.0403
F-statistic: 73.8 on 1 and 1733 DF,  p-value: <2e-16
13 58
Saturated model

summary(lm(aauw ~ as.factor(totchi), data = girls))

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)            56.41       2.76   20.42  < 2e-16
as.factor(totchi)1      5.45       4.11    1.33   0.1851
as.factor(totchi)2     -3.80       3.27   -1.16   0.2454
as.factor(totchi)3    -13.65       3.45   -3.95  8.1e-05
as.factor(totchi)4    -19.31       4.01   -4.82  1.6e-06
as.factor(totchi)5    -15.46       4.85   -3.19   0.0015
as.factor(totchi)6    -33.59      10.42   -3.22   0.0013
as.factor(totchi)7    -17.13      11.41   -1.50   0.1336
as.factor(totchi)8    -55.33      12.28   -4.51  7.0e-06
as.factor(totchi)9    -50.41      24.08   -2.09   0.0364
as.factor(totchi)10   -53.41      20.90   -2.56   0.0107
as.factor(totchi)12   -56.41      41.53   -1.36   0.1745
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41 on 1723 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.0506,  Adjusted R-squared: 0.0446
F-statistic: 8.36 on 11 and 1723 DF,  p-value: 1.84e-14

14 58
Saturated model minus the constant

summary(lm(aauw ~ as.factor(totchi) - 1, data = girls))

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
as.factor(totchi)0     56.41       2.76   20.42   <2e-16
as.factor(totchi)1     61.86       3.05   20.31   <2e-16
as.factor(totchi)2     52.62       1.75   30.13   <2e-16
as.factor(totchi)3     42.76       2.07   20.62   <2e-16
as.factor(totchi)4     37.11       2.90   12.79   <2e-16
as.factor(totchi)5     40.95       3.99   10.27   <2e-16
as.factor(totchi)6     22.82      10.05    2.27   0.0233
as.factor(totchi)7     39.29      11.07    3.55   0.0004
as.factor(totchi)8      1.08      11.96    0.09   0.9278
as.factor(totchi)9      6.00      23.92    0.25   0.8020
as.factor(totchi)10     3.00      20.72    0.14   0.8849
as.factor(totchi)12     0.00      41.43    0.00   1.0000
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41 on 1723 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.587,  Adjusted R-squared: 0.584
F-statistic: 204 on 12 and 1723 DF,  p-value: <2e-16

15 58
Compare to within-strata means
• The saturated model makes no assumptions about the between-strata relationships
• It just calculates within-strata means:

c1 <- coef(lm(aauw ~ as.factor(totchi) - 1, data = girls))
c2 <- with(girls, tapply(aauw, totchi, mean, na.rm = TRUE))
rbind(c1, c2)

    0  1  2  3  4  5  6  7  8 9 10 12
c1 56 62 53 43 37 41 23 39 11 6  3  0
c2 56 62 53 43 37 41 23 39 11 6  3  0
16 58
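The "saturated model = within-strata means" fact is easy to check in any language. A small illustrative sketch (numpy, made-up simulated data, not the Washington data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: a categorical covariate with 4 levels and a continuous outcome.
x = rng.integers(0, 4, size=200)
y = 2.0 * x + rng.normal(size=200)

# Saturated, intercept-free design: one dummy per level of x.
D = (x[:, None] == np.arange(4)[None, :]).astype(float)
beta = np.linalg.lstsq(D, y, rcond=None)[0]

# The fitted coefficients are exactly the within-strata means.
cell_means = np.array([y[x == k].mean() for k in range(4)])
```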
Other justifications for OLS
• Justification 2: X_i′β is the best linear predictor (in a mean-squared-error sense) of Y_i
  - Why? β = argmin_b E[(Y_i − X_i′b)²]
• Justification 3: X_i′β provides the minimum mean-squared-error linear approximation to E[Y_i | X_i]
• Even if the CEF is not linear, a linear regression provides the best linear approximation to that CEF
• Don't need to believe the assumptions (linearity) in order to use regression as a good approximation to the CEF
• Warning: if the CEF is very nonlinear, then this approximation could be terrible
17 58
The error terms
• Let's define the error term e_i ≡ Y_i − X_i′β, so that

Y_i = X_i′β + [Y_i − X_i′β] = X_i′β + e_i

• Note that the error e_i is uncorrelated with X_i:

E[X_i e_i] = E[X_i (Y_i − X_i′β)]
           = E[X_i Y_i] − E[X_i X_i′ β]
           = E[X_i Y_i] − E[ X_i X_i′ E[X_i X_i′]⁻¹ E[X_i Y_i] ]
           = E[X_i Y_i] − E[X_i X_i′] E[X_i X_i′]⁻¹ E[X_i Y_i]
           = E[X_i Y_i] − E[X_i Y_i] = 0

• No assumptions on the linearity of E[Y_i | X_i]
18 58
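The population orthogonality E[X_i e_i] = 0 has an exact in-sample analogue: OLS residuals are mechanically orthogonal to the design matrix, even when the CEF is nonlinear. A small numpy check on illustrative data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Y need not be linear in X: orthogonality of residuals holds regardless.
n = 10_000
x = rng.normal(size=n)
y = np.sin(x) + rng.normal(size=n)     # nonlinear CEF

X = np.column_stack([np.ones(n), x])   # design with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta                       # in-sample residuals

# X'e = 0 up to floating-point error, by construction of OLS.
orth = X.T @ e
```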
OLS estimator
• We know the population value of β is

β = E[X_i X_i′]⁻¹ E[X_i Y_i]

• How do we get an estimator of this?
• Plug-in principle: replace population expectations with their sample versions:

β̂ = [ (1/N) Σ_i X_i X_i′ ]⁻¹ (1/N) Σ_i X_i Y_i

• If you work through the matrix algebra, this turns out to be

β̂ = (X′X)⁻¹ X′y
19 58
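The plug-in (sample-moment) form and the matrix form are the same object. A quick numpy sketch on made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data with made-up coefficients.
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

# Plug-in form: (1/N sum X_i X_i')^{-1} (1/N sum X_i Y_i).
beta_hat = np.linalg.solve(X.T @ X / n, X.T @ y / n)

# Matrix form (X'X)^{-1} X'y, here via least squares.
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
```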
Asymptotic OLS inference
• With this representation in hand, we can write the OLS estimator as follows:

β̂ = β + [ Σ_i X_i X_i′ ]⁻¹ Σ_i X_i e_i

• Core idea: Σ_i X_i e_i is a sum of random variables, so the CLT applies
• That, plus some simple asymptotic theory, allows us to say:

√N(β̂ − β) ⇝ N(0, Ω)

• Converges in distribution to a Normal distribution with mean vector 0 and covariance matrix Ω:

Ω = E[X_i X_i′]⁻¹ E[X_i X_i′ e_i²] E[X_i X_i′]⁻¹

• No linearity assumption needed

20 58
Estimating the variance
• In large samples, then,

√N(β̂ − β) ~ N(0, Ω)

• How to estimate Ω? Plug-in principle again:

Ω̂ = [ Σ_i X_i X_i′ ]⁻¹ [ Σ_i X_i X_i′ ê_i² ] [ Σ_i X_i X_i′ ]⁻¹

• Replace e_i with its empirical counterpart, the residuals ê_i = Y_i − X_i′β̂
• Replace the population moments of X_i with their sample counterparts
• The square roots of the diagonal of this covariance matrix are the "robust" or Huber-White standard errors that Stata commonly reports
21 58
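This sandwich estimator is short to write out by hand. A sketch in numpy (the HC0 flavor, on illustrative heteroskedastic data):

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative data where the error variance grows with |x|.
n = 1_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(size=n) * (1 + np.abs(x))

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
e_hat = y - X @ beta_hat

# Sandwich: (sum X X')^{-1} (sum X X' e^2) (sum X X')^{-1}
bread = np.linalg.inv(X.T @ X)
meat = (X * e_hat[:, None] ** 2).T @ X
omega_hat = bread @ meat @ bread
robust_se = np.sqrt(np.diag(omega_hat))   # HC0 "robust" standard errors
```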
Heteroskedasticity
• No assumption of homoskedasticity
• Heteroskedasticity will definitely occur when:
  - the CEF is linear, but σ²(x) = V[Y_i | X_i = x] is not constant in x
  - E[Y_i | X_i] is not linear, but we use linear regression to approximate it
22 58
2. Regression and Causality
23 58
Regression and causality
• In most econometrics textbooks, regression is defined without respect to causality
• But then, when is β̂ "biased"? The above derivations work for any CEF E[Y_i | X_i]
• The question, then, is: when does knowing the CEF tell us something about causality?
• MHE argues that a regression is causal when the CEF it approximates is causal. Identification is king.
• We will show that under certain conditions, a regression of the outcome on the treatment and the covariates can recover a causal parameter, but perhaps not the one in which we are interested
24 58
Review
• Quick reminder: we have potential outcomes Y_i(1) and Y_i(0) and two parameters, the ATE and the ATT:

τ = E[Y_i(1) − Y_i(0)]
τ_ATT = E[Y_i(1) − Y_i(0) | D_i = 1]

• We have shown in past weeks that these effects are identified when ignorability holds. MHE calls this the conditional independence assumption (CIA).
25 58
Linear constant effects model, binary treatment

• Start with a simple experiment; with constant effects, we can rewrite the consistency assumption as a regression formula:

Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
    = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
    = E[Y_i(0)] + τ D_i + (Y_i(0) − E[Y_i(0)])
    = μ₀ + τ D_i + v_{0i}

• Note that if ignorability holds (as in an experiment) for Y_i(0), then it will also hold for v_{0i}, since E[Y_i(0)] is constant. Thus this satisfies the usual assumptions for regression.
26 58
Now with covariates
• Now assume no unmeasured confounders: Y_i(d) ⫫ D_i | X_i
• We will assume a linear model for the potential outcomes:

Y_i(d) = α + τ·d + η_i

• Remember that linearity isn't an assumption if D_i is binary
• The effect of D_i is constant here; the η_i are the only source of individual variation, and we have E[η_i] = 0
• The consistency assumption allows us to write this as:

Y_i = α + τ D_i + η_i
27 58
Covariates in the error
• Let's assume that η_i is linear in X_i: η_i = X_i′γ + ν_i
• New error is uncorrelated with X_i: E[ν_i | X_i] = 0
• This is an assumption; it might be false!
• Plug into the above:

E[Y_i(d) | X_i] = E[Y_i | D_i, X_i] = α + τ D_i + E[η_i | X_i]
                = α + τ D_i + X_i′γ + E[ν_i | X_i]
                = α + τ D_i + X_i′γ
28 58
Summing up regression with constant effects

• Reviewing the assumptions we've used:
  - no unmeasured confounders
  - constant treatment effects
  - linearity of the treatment/covariates
• Under these, we can run the following regression to estimate the ATE, τ:

Y_i = α + τ D_i + X_i′γ + ν_i

• Works with continuous or ordinal D_i if the effect of these variables is truly linear
29 58
OLS constant effects simulation
Model with linear covariates, constant 0 effect of treatment:

library(mvtnorm)
n <- 100
p <- 4
X <- rmvnorm(n = 100, mean = rep(0, p))
gamma <- c(2.74, 1.37, 1.37, 1.37)
y <- 21 + X %*% c(gamma) + rnorm(n)
alpha <- c(-1, 0.5, -0.5, -0.1)
d.probs <- boot::inv.logit(X %*% alpha)
d <- rbinom(n, size = 1, prob = d.probs)
30 58
OLS with no covariates
summary(lm(y ~ d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   22.036      0.458   48.13  < 2e-16
d             -2.869      0.654   -4.39 0.000029
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.3 on 98 degrees of freedom
Multiple R-squared: 0.164,  Adjusted R-squared: 0.156
F-statistic: 19.2 on 1 and 98 DF,  p-value: 0.000029
31 58
OLS with covariates
summary(lm(y ~ d + X))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  20.9952      0.159  132.07   <2e-16
d            -0.128       0.253   -0.5      0.62
X1            2.7368      0.126   21.75   <2e-16
X2            1.3677      0.114   12.01   <2e-16
X3            1.3673      0.130   10.51   <2e-16
X4            1.3570      0.106   12.80   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom
Multiple R-squared: 0.999,  Adjusted R-squared: 0.999
F-statistic: 2.44e+04 on 5 and 94 DF,  p-value: <2e-16
32 58
What happens with nonlinearity
Suppose we can only observe the following covariates:

z1 <- exp(X[, 1]/2)
z2 <- X[, 2]/(1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3]/25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

Inverting these, X_{1i} = 2 log(Z_{1i}), X_{2i} = (Z_{2i} − 10)(1 + Z_{1i}²), and so on, so Y_i and D_i are highly nonlinear functions of the observed Z's

⇝ the regression is a nonlinear function of the observed covariates
33 58
When linearity goes wrong
summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.8728    30.0799    0.73    0.469
d            -6.9292     3.3854   -2.05    0.043
z1           36.3110     2.6970   13.46   <2e-16
z2           -2.9033     3.6619   -0.79    0.430
z3          862.0220   431.0300    2.00    0.048
z4            0.4021     0.0329   12.23   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.4 on 94 degrees of freedom
Multiple R-squared: 0.855,  Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF,  p-value: <2e-16
34 58
3. Regression with Heterogeneous Treatment Effects
35 58
Heterogeneous effects, binary treatment

• Completely randomized experiment:

Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
    = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
    = μ₀ + τ_i D_i + (Y_i(0) − μ₀)
    = μ₀ + τ D_i + (Y_i(0) − μ₀) + (τ_i − τ)·D_i
    = μ₀ + τ D_i + ε_i

• The error term now includes two components:
  1. "Baseline" variation in the outcome: (Y_i(0) − μ₀)
  2. Variation in the treatment effect: (τ_i − τ)D_i
• Easy to verify that under an experiment, E[ε_i | D_i] = 0
• Thus, OLS estimates the ATE with no covariates
36 58
Adding covariates
• What happens under no unmeasured confounders? We need to condition on X_i now
• Remember identification of the ATE/ATT using iterated expectations
• The ATE is the weighted sum of the CATEs:

τ = Σ_x τ(x) P[X_i = x]

• The ATE/ATT are weighted averages of the CATEs
• What about the regression estimand, τ^R? How does it relate to the ATE/ATT?
37 58
Heterogeneous effects and regression
• Let's investigate this under a saturated regression model:

Y_i = Σ_x B_{xi} α_x + τ^R D_i + e_i

• Use a dummy variable for each unique combination of X_i: B_{xi} = 1(X_i = x)
• Linear in X_i by construction
38 58
Investigating the regression coefficient
• How can we investigate τ^R? We can rely on the regression anatomy:

τ^R = Cov(Y_i, D_i − E[D_i | X_i]) / V(D_i − E[D_i | X_i])

• D_i − E[D_i | X_i] is the residual from a regression of D_i on the full set of dummies
• With a little work, we can show:

τ^R = E[ τ(X_i) (D_i − E[D_i | X_i])² ] / E[ (D_i − E[D_i | X_i])² ] = E[τ(X_i) σ²_d(X_i)] / E[σ²_d(X_i)]

• σ²_d(x) = V[D_i | X_i = x] is the conditional variance of treatment assignment
39 58
ATE versus OLS
τ^R = E[τ(X_i) W_i] = Σ_x τ(x) ( σ²_d(x) / E[σ²_d(X_i)] ) P[X_i = x]

• Compare to the ATE:

τ = E[τ(X_i)] = Σ_x τ(x) P[X_i = x]

• Both weight strata relative to their size, P[X_i = x]
• OLS weights strata higher if the treatment variance in those strata, σ²_d(x), is high relative to the average variance across strata, E[σ²_d(X_i)]
• The ATE weights only by size
40 58
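This weighting result can be checked by simulation. A sketch (Python, with hypothetical propensities and CATEs) comparing the saturated-regression coefficient on D to the analytic weighted estimand; here the ATE is 1 but the regression estimand is 0.8:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

# Hypothetical setup: binary X with different propensities and CATEs by stratum.
x = rng.binomial(1, 0.5, size=n)                 # P[X = 1] = 0.5
e_x = np.where(x == 1, 0.8, 0.4)                 # e(x) = P[D = 1 | X = x]
d = rng.binomial(1, e_x)
tau_x = np.where(x == 1, 2.0, 0.0)               # CATEs: tau(1) = 2, tau(0) = 0
y = tau_x * d + rng.normal(size=n)

# Saturated regression: one dummy per X stratum (no intercept) plus D.
Z = np.column_stack([x == 0, x == 1, d]).astype(float)
tau_r_hat = np.linalg.lstsq(Z, y, rcond=None)[0][2]

# Analytic regression estimand E[tau(X) s2_d(X)] / E[s2_d(X)], s2_d(x) = e(x)(1 - e(x)).
s2 = e_x * (1 - e_x)
tau_r = (tau_x * s2).mean() / s2.mean()          # ~0.8 here, while the ATE is 1
```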
Regression weighting
W_i = σ²_d(X_i) / E[σ²_d(X_i)]

• Why does OLS weight like this?
• OLS is a minimum-variance estimator: it gives more weight to more precise within-strata estimates
• Within-strata estimates are most precise when the treatment is evenly spread and thus has the highest variance
• If D_i is binary, then we know the conditional variance will be:

σ²_d(x) = P[D_i = 1 | X_i = x] (1 − P[D_i = 1 | X_i = x]) = e(x)(1 − e(x))

• Maximum variance with P[D_i = 1 | X_i = x] = 1/2
41 58
OLS weighting examplebull Binary covariate
โ[119883119894 = 1] = 075 โ[119883119894 = 0] = 025โ[119863119894 = 1|119883119894 = 1] = 09 โ[119863119894 = 1|119883119894 = 0] = 05
1205901113569119889(1) = 009 1205901113569119889(0) = 025120591(1) = 1 120591(0) = minus1
bull Implies the ATE is 120591 = 05bull Average conditional variance 120124[1205901113569119889(119883119894)] = 013bull weights for 119883119894 = 1 are 009013 = 0692 for 119883119894 = 0
025013 = 192120591119877 = 120124[120591(119883119894)119882119894]
= 120591(1)119882(1)โ[119883119894 = 1] + 120591(0)119882(0)โ[119883119894 = 0]= 1 times 0692 times 075 + minus1 times 192 times 025= 0039
42 58
When will OLS estimate the ATE
bull When does 120591 = 120591119877bull Constant treatment effects 120591(119909) = 120591 = 120591119877bull Constant probability of treatment 119890(119909) = โ[119863119894 = 1|119883119894 = 119909] = 119890
Implies that the OLS weights are 1
bull Incorrect linearity assumption in 119883119894 will lead to more bias
43 58
Other ways to use regression
bull Whatrsquos the path forward Accept the bias (might be relatively small with saturated
models) Use a different regression approach
bull Let 120583119889(119909) = 120124[119884119894(119889)|119883119894 = 119909] be the CEF for the potentialoutcome under 119863119894 = 119889
bull By consistency and nuc we have 120583119889(119909) = 120124[119884119894|119863119894 = 119889119883119894 = 119909]bull Estimate a regression of 119884119894 on 119883119894 among the 119863119894 = 119889 groupbull Then 119889(119909) is just a predicted value from the regression for
119883119894 = 119909bull How can we use this
44 58
Imputation estimators
bull Impute the treated potential outcomes with 1114023119884119894(1) = 1113568(119883119894)bull Impute the control potential outcomes with 1114023119884119894(0) = 1113567(119883119894)bull Procedure
Regress 119884119894 on 119883119894 in the treated group and get predicted valuesfor all units (treated or control)
Regress 119884119894 on 119883119894 in the control group and get predicted valuesfor all units (treated or control)
Take the average difference between these predicted values
bull More mathematically look like this
120591119894119898119901 =1119873
11140121198941113568(119883119894) minus 1113567(119883119894)
bull Sometimes called an imputation estimator
45 58
Simple imputation estimator
bull Use predict() from the within-group models on the data fromthe entire sample
bull Useful trick use a model on the entire data andmodelframe() to get the right design matrix
heterogeneous effects
yhet lt- ifelse(d == 1 y + rnorm(n 0 5) y)
mod lt- lm(yhet ~ d + X)
mod1 lt- lm(yhet ~ X subset = d == 1)
mod0 lt- lm(yhet ~ X subset = d == 0)
y1imps lt- predict(mod1 modelframe(mod))
y0imps lt- predict(mod0 modelframe(mod))
mean(y1imps - y0imps)
[1] 061
46 58
Notes on imputation estimators
bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE
bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS
bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods
such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)
47 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
48 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
49 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
50 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
51 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
52 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
53 58
Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible
estimatelibrary(mgcv)
mod0 lt- gam(y ~ s(x) subset = d == 0)
summary(mod0)
Family gaussian
Link function identity
Formula
y ~ s(x)
Parametric coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) -00225 00154 -146 016
Approximate significance of smooth terms
edf Refdf F p-value
s(x) 603 708 413 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
R-sq(adj) = 0909 Deviance explained = 928
GCV = 00093204 Scale est = 00071351 n = 30
54 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
55 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
56 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
57 58
Limited dependent variables
bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc
bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models
(stratified diff-in-means)
bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong
bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models
Could go wrong in small samples If using nonlinear models always get effects on the scale of the
outcome
58 58
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
-
Saturated model example
bull Washington (AER) data on the effects of daughtersbull Wersquoll look at the relationship between voting and number of
kids (causal)
girls lt- foreignreaddta(rdquogirlsdtardquo)
head(girls[ c(rdquonamerdquo rdquototchirdquo rdquoaauwrdquo)])
name totchi aauw
1 ABERCROMBIE NEIL 0 100
2 ACKERMAN GARY L 3 88
3 ADERHOLT ROBERT B 0 0
4 ALLEN THOMAS H 2 100
5 ANDREWS ROBERT E 2 100
6 ARCHER WR 7 0
12 58
Linear model
summary(lm(aauw ~ totchi data = girls))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 6131 181 3381 lt2e-16
totchi -533 062 -859 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 42 on 1733 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared 00408 Adjusted R-squared 00403
F-statistic 738 on 1 and 1733 DF p-value lt2e-16
13 58
Saturated modelsummary(lm(aauw ~ asfactor(totchi) data = girls))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 5641 276 2042 lt 2e-16
asfactor(totchi)1 545 411 133 01851
asfactor(totchi)2 -380 327 -116 02454
asfactor(totchi)3 -1365 345 -395 81e-05
asfactor(totchi)4 -1931 401 -482 16e-06
asfactor(totchi)5 -1546 485 -319 00015
asfactor(totchi)6 -3359 1042 -322 00013
asfactor(totchi)7 -1713 1141 -150 01336
asfactor(totchi)8 -5533 1228 -451 70e-06
asfactor(totchi)9 -5041 2408 -209 00364
asfactor(totchi)10 -5341 2090 -256 00107
asfactor(totchi)12 -5641 4153 -136 01745
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 41 on 1723 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared 00506 Adjusted R-squared 00446
F-statistic 836 on 11 and 1723 DF p-value 184e-1414 58
Saturated model minus the constantsummary(lm(aauw ~ asfactor(totchi) - 1 data = girls))
Coefficients
Estimate Std Error t value Pr(gt|t|)
asfactor(totchi)0 5641 276 2042 lt2e-16
asfactor(totchi)1 6186 305 2031 lt2e-16
asfactor(totchi)2 5262 175 3013 lt2e-16
asfactor(totchi)3 4276 207 2062 lt2e-16
asfactor(totchi)4 3711 290 1279 lt2e-16
asfactor(totchi)5 4095 399 1027 lt2e-16
asfactor(totchi)6 2282 1005 227 00233
asfactor(totchi)7 3929 1107 355 00004
asfactor(totchi)8 108 1196 009 09278
asfactor(totchi)9 600 2392 025 08020
asfactor(totchi)10 300 2072 014 08849
asfactor(totchi)12 000 4143 000 10000
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 41 on 1723 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared 0587 Adjusted R-squared 0584
F-statistic 204 on 12 and 1723 DF p-value lt2e-1615 58
Compare to within-strata means
bull The saturated model makes no assumptions about thebetween-strata relationships
bull Just calculates within-strata means
c1 lt- coef(lm(aauw ~ asfactor(totchi) - 1 data = girls))
c2 lt- with(girls tapply(aauw totchi mean narm = TRUE))
rbind(c1 c2)
0 1 2 3 4 5 6 7 8 9 10 12
c1 56 62 53 43 37 41 23 39 11 6 3 0
c2 56 62 53 43 37 41 23 39 11 6 3 0
16 58
Other justifications for OLS
bull Justification 2 119883prime119894 120573 is the best linear predictor (in a
mean-squared error sense) of 119884119894 Why 120573 = argmin119887 120124[(119884119894 minus 119883prime
119894 119887)1113569]
bull Justification 3 119883prime119894 120573 provides the minimum mean squared
error linear approxmiation to 119864[119884119894|119883119894]bull Even if the CEF is not linear a linear regression provides the
best linear approximation to that CEFbull Donrsquot need to believe the assumptions (linearity) in order to
use regression as a good approximation to the CEFbull Warning if the CEF is very nonlinear then this approximation
could be terrible
17 58
The error terms
bull Letrsquos define the error term 119890119894 equiv 119884119894 minus 119883prime119894 120573 so that
119884119894 = 119883prime119894 120573 + [119884119894 minus 119883prime
119894 120573] = 119883prime119894 120573 + 119890119894
bull Note the residual 119890119894 is uncorrelated with 119883119894
120124[119883119894119890119894] = 120124[119883119894(119884119894 minus 119883prime119894 120573)]
= 120124[119883119894119884119894] minus 120124[119883119894119883prime119894 120573]
= 120124[119883119894119884119894] minus 120124 1114094119883119894119883prime119894120124[119883119894119883prime
119894 ]minus1113568120124[119883119894119884119894]1114097= 120124[119883119894119884119894] minus 120124[119883119894119883prime
119894 ]120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]
= 120124[119883119894119884119894] minus 120124[119883119894119884119894] = 0
bull No assumptions on the linearity of 120124[119884119894|119883119894]
18 58
OLS estimator
bull We know the population value of 120573 is
120573 = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]
bull How do we get an estimator of thisbull Plug-in principle replace population expectation with
sample versions
=
โกโขโขโขโขโขโฃ1119873
1114012119894119883119894119883prime
119894
โคโฅโฅโฅโฅโฅโฆ
minus11135681119873
1114012119894119883119894119884119894
bull If you work through the matrix algebra this turns out to be
= (119831prime119831)minus1113568 119831prime119858
19 58
Asymptotic OLS inference

• With this representation in hand, we can write the OLS estimator as follows:

  β̂ = β + [ Σ_i X_i X_i' ]^{-1} Σ_i X_i e_i

• Core idea: Σ_i X_i e_i is a sum of r.v.s, so the CLT applies.
• That, plus some simple asymptotic theory, allows us to say:

  √N (β̂ − β) →d N(0, Ω)

• Converges in distribution to a Normal distribution with mean vector 0 and covariance matrix

  Ω = E[X_i X_i']^{-1} E[X_i X_i' e_i²] E[X_i X_i']^{-1}

• No linearity assumption needed!
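A quick Monte Carlo sketch of this asymptotic claim, using deliberately non-Normal errors (the setup is assumed for illustration):

```r
## sqrt(N)(beta-hat - beta) is approximately Normal even with t(5) errors
set.seed(5)
N <- 200
reps <- 2000
beta <- 2
draws <- replicate(reps, {
  x <- rnorm(N)
  y <- 1 + beta * x + rt(N, df = 5)   # heavy-tailed, non-Normal errors
  coef(lm(y ~ x))[2]
})
z <- sqrt(N) * (draws - beta)
c(mean = mean(z), sd = sd(z))         # mean near 0, sd near sqrt(5/3)
```

With x standard Normal and homoskedastic errors of variance 5/3, the sandwich collapses to the usual σ²/E[x²], so the simulated sd should sit near √(5/3) ≈ 1.29.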
Estimating the variance

• In large samples, then:

  √N (β̂ − β) ∼ N(0, Ω)

• How to estimate Ω? Plug-in principle again:

  Ω̂ = [ Σ_i X_i X_i' ]^{-1} [ Σ_i X_i X_i' ê_i² ] [ Σ_i X_i X_i' ]^{-1}

• Replace e_i with its empirical counterpart (the residuals): ê_i = Y_i − X_i'β̂.
• Replace the population moments of X_i with their sample counterparts.
• The square roots of the diagonal of this covariance matrix are the "robust" or Huber-White standard errors that Stata commonly reports.
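The sandwich formula is easy to compute by hand; a minimal sketch with simulated heteroskedastic data (names are illustrative):

```r
## HC0 "robust" variance via the plug-in sandwich formula
set.seed(7)
N <- 1000
x <- runif(N)
y <- 1 + 2 * x + rnorm(N, sd = x)     # heteroskedastic errors
X <- cbind(1, x)

beta_hat <- solve(t(X) %*% X, t(X) %*% y)
e_hat <- c(y - X %*% beta_hat)        # residuals replace e_i

bread <- solve(t(X) %*% X)
meat  <- t(X) %*% (X * e_hat^2)       # sum_i X_i X_i' e_i-hat^2
V_hat <- bread %*% meat %*% bread
robust_se <- sqrt(diag(V_hat))        # Huber-White standard errors
robust_se
```

This should agree with `sandwich::vcovHC(lm(y ~ x), type = "HC0")` if you have the sandwich package installed.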
Heteroskedasticity

• No assumptions of homoskedasticity!
• Heteroskedasticity will definitely occur when:
  - the CEF is linear, but σ²(x) = V[Y_i | X_i = x] is not constant in x;
  - E[Y_i | X_i] is not linear, but we use linear regression to approximate it.
2. Regression and Causality
Regression and causality

• Most econometrics textbooks: regression is defined without respect to causality.
• But then, when is β̂ "biased"? The above derivations work for any E[Y_i | X_i].
• The question, then: when does knowing the CEF tell us something about causality?
• MHE argues that a regression is causal when the CEF it approximates is causal. Identification is king.
• We will show that, under certain conditions, a regression of the outcome on the treatment and the covariates can recover a causal parameter, but perhaps not the one in which we are interested.
Review

• Quick reminder: we have potential outcomes Y_i(1) and Y_i(0), and two parameters, the ATE and ATT:

  τ = E[Y_i(1) − Y_i(0)]
  τ_ATT = E[Y_i(1) − Y_i(0) | D_i = 1]

• We have shown in past weeks that these effects are identified when ignorability holds. MHE calls this the conditional independence assumption (CIA).
Linear constant effects model, binary treatment

• Experiment: with a simple experiment, we can rewrite the consistency assumption as a regression formula:

  Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
      = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
      = E[Y_i(0)] + τ D_i + (Y_i(0) − E[Y_i(0)])
      = μ₀ + τ D_i + v_{0i}

• Note that if ignorability holds (as in an experiment) for Y_i(0), then it will also hold for v_{0i}, since E[Y_i(0)] is constant. Thus, this satisfies the usual assumptions for regression.
Now with covariates

• Now assume no unmeasured confounders: Y_i(d) ⫫ D_i | X_i.
• We will assume a linear model for the potential outcomes:

  Y_i(d) = α + τ·d + η_i

• Remember that linearity isn't an assumption if D_i is binary.
• The effect of D_i is constant here; the η_i are the only source of individual variation, and we have E[η_i] = 0.
• The consistency assumption allows us to write this as:

  Y_i = α + τ D_i + η_i
Covariates in the error

• Let's assume that η_i is linear in X_i: η_i = X_i'γ + ν_i.
• The new error is uncorrelated with X_i: E[ν_i | X_i] = 0.
• This is an assumption! Might be false!
• Plug into the above:

  E[Y_i(d) | X_i] = E[Y_i | D_i = d, X_i] = α + τ D_i + E[η_i | X_i]
                  = α + τ D_i + X_i'γ + E[ν_i | X_i]
                  = α + τ D_i + X_i'γ
Summing up regression with constant effects

• Reviewing the assumptions we've used:
  - no unmeasured confounders,
  - constant treatment effects,
  - linearity of the treatment/covariates.
• Under these, we can run the following regression to estimate the ATE, τ:

  Y_i = α + τ D_i + X_i'γ + ν_i

• Works with continuous or ordinal D_i if the effect of these variables is truly linear.
OLS, constant effects simulation

Model with linear covariates, constant 0 effect of treatment:

library(mvtnorm)
n <- 100
p <- 4
X <- rmvnorm(n = 100, mean = rep(0, p))
gamma <- c(27.4, 13.7, 13.7, 13.7)
y <- 210 + X %*% c(gamma) + rnorm(n)
alpha <- c(-1, 0.5, -0.5, -0.1)
d.probs <- boot::inv.logit(X %*% alpha)
d <- rbinom(n, size = 1, prob = d.probs)
OLS with no covariates

summary(lm(y ~ d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   220.36       4.58   48.13  < 2e-16
d             -28.69       6.54   -4.39 0.000029
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 33 on 98 degrees of freedom
Multiple R-squared: 0.164,  Adjusted R-squared: 0.156
F-statistic: 19.2 on 1 and 98 DF,  p-value: 0.000029
OLS with covariates

summary(lm(y ~ d + X))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  209.952      0.159  1320.7   <2e-16
d             -0.128      0.253    -0.5     0.62
X1            27.368      0.126   217.5   <2e-16
X2            13.677      0.114   120.1   <2e-16
X3            13.673      0.130   105.1   <2e-16
X4            13.570      0.106   128.0   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom
Multiple R-squared: 0.999,  Adjusted R-squared: 0.999
F-statistic: 2.44e+04 on 5 and 94 DF,  p-value: <2e-16
What happens with nonlinearity?

Suppose we can only observe the following covariates:

z1 <- exp(X[, 1]/2)
z2 <- X[, 2]/(1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3]/25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

Implies that Y_i and D_i are functions of log(Z_{i1}), Z_{i2}, Z_{i1}² Z_{i2}, 1/log(Z_{i1}), Z_{i3}^{1/3}/log(Z_{i1}), and Z_{i4}^{1/2}.

The true regression is a nonlinear function of the observed covariates.
When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  218.728    300.799    0.73    0.469
d            -69.292     33.854   -2.05    0.043
z1           363.110     26.970   13.46   <2e-16
z2           -29.033     36.619   -0.79    0.430
z3           862.022    431.030    2.00    0.048
z4            0.4021     0.0329   12.23   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14 on 94 degrees of freedom
Multiple R-squared: 0.855,  Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF,  p-value: <2e-16
3. Regression with Heterogeneous Treatment Effects
Heterogeneous effects, binary treatment

• Completely randomized experiment:

  Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
      = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
      = μ₀ + τ_i D_i + (Y_i(0) − μ₀)
      = μ₀ + τ D_i + (Y_i(0) − μ₀) + (τ_i − τ)·D_i
      = μ₀ + τ D_i + ε_i

• The error term now includes two components:
  1. "Baseline" variation in the outcome: (Y_i(0) − μ₀).
  2. Variation in the treatment effect: (τ_i − τ).
• Easy to verify that, under the experiment, E[ε_i | D_i] = 0.
• Thus, OLS estimates the ATE with no covariates.
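A small simulation of this point, with unit-level effects drawn at random (the setup is assumed for illustration):

```r
## Heterogeneous effects under randomization: OLS recovers the ATE
set.seed(3)
n <- 10000
tau_i <- rnorm(n, mean = 2)        # unit-level effects, ATE = 2
y0 <- rnorm(n)
y1 <- y0 + tau_i
d <- rbinom(n, 1, 0.5)             # randomized treatment
y <- ifelse(d == 1, y1, y0)        # consistency
coef(lm(y ~ d))["d"]               # close to the ATE of 2
```

Even though τ_i varies across units, the simple regression of Y on D targets the average effect.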
Adding covariates

• What happens with no unmeasured confounders? Need to condition on X_i now.
• Remember identification of the ATE/ATT using iterated expectations.
• The ATE is the weighted sum of CATEs:

  τ = Σ_x τ(x) Pr[X_i = x]

• The ATE/ATT are weighted averages of CATEs.
• What about the regression estimand, τ^R? How does it relate to the ATE/ATT?
Heterogeneous effects and regression

• Let's investigate this under a saturated regression model:

  Y_i = Σ_x B_{xi} α_x + τ^R D_i + e_i

• Use a dummy variable for each unique combination of X_i: B_{xi} = 1(X_i = x).
• Linear in X_i by construction!
Investigating the regression coefficient

• How can we investigate τ^R? Well, we can rely on the regression anatomy:

  τ^R = Cov(Y_i, D_i − E[D_i | X_i]) / V(D_i − E[D_i | X_i])

• D_i − E[D_i | X_i] is the residual from a regression of D_i on the full set of dummies.
• With a little work, we can show:

  τ^R = E[ τ(X_i) (D_i − E[D_i | X_i])² ] / E[ (D_i − E[D_i | X_i])² ]
      = E[ τ(X_i) σ²_d(X_i) ] / E[ σ²_d(X_i) ]

• σ²_d(x) = V[D_i | X_i = x] is the conditional variance of treatment assignment.
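The regression-anatomy identity is purely algebraic, so it is easy to verify numerically; a sketch with an assumed discrete covariate:

```r
## The coefficient on D from the regression with X dummies equals the
## coefficient from regressing Y on the residualized treatment
set.seed(1)
n <- 500
x <- sample(0:2, n, replace = TRUE)
d <- rbinom(n, 1, prob = c(0.2, 0.5, 0.8)[x + 1])
y <- 1 + 2 * d + 0.5 * x + rnorm(n)

full <- lm(y ~ d + factor(x))           # dummies for each level of x
d_resid <- resid(lm(d ~ factor(x)))     # D_i minus E-hat[D_i | X_i]
anatomy <- lm(y ~ d_resid)

all.equal(unname(coef(full)["d"]), unname(coef(anatomy)["d_resid"]))
```

This is the Frisch-Waugh-Lovell logic: the two coefficients agree exactly in sample, whatever the true CEF looks like.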
ATE versus OLS

  τ^R = E[τ(X_i) W_i] = Σ_x τ(x) · (σ²_d(x) / E[σ²_d(X_i)]) · Pr[X_i = x]

• Compare to the ATE:

  τ = E[τ(X_i)] = Σ_x τ(x) Pr[X_i = x]

• Both weight strata relative to their size (Pr[X_i = x]).
• OLS weights a stratum higher if the treatment variance in that stratum (σ²_d(x)) is high relative to the average variance across strata (E[σ²_d(X_i)]).
• The ATE weights strata only by their size.
Regression weighting

  W_i = σ²_d(X_i) / E[σ²_d(X_i)]

• Why does OLS weight like this?
• OLS is a minimum-variance estimator: it gives more weight to more precise within-strata estimates.
• Within-strata estimates are most precise when the treatment is evenly spread and thus has the highest variance.
• If D_i is binary, then we know the conditional variance will be:

  σ²_d(x) = Pr[D_i = 1 | X_i = x] (1 − Pr[D_i = 1 | X_i = x]) = e(x)(1 − e(x))

• Maximum variance with Pr[D_i = 1 | X_i = x] = 1/2.
OLS weighting example

• Binary covariate:

  Pr[X_i = 1] = 0.75,  Pr[X_i = 0] = 0.25
  Pr[D_i = 1 | X_i = 1] = 0.9,  Pr[D_i = 1 | X_i = 0] = 0.5
  σ²_d(1) = 0.09,  σ²_d(0) = 0.25
  τ(1) = 1,  τ(0) = −1

• Implies the ATE is τ = 0.5.
• Average conditional variance: E[σ²_d(X_i)] = 0.13.
• Weights: for X_i = 1, 0.09/0.13 = 0.692; for X_i = 0, 0.25/0.13 = 1.92.

  τ^R = E[τ(X_i) W_i]
      = τ(1) W(1) Pr[X_i = 1] + τ(0) W(0) Pr[X_i = 0]
      = 1 × 0.692 × 0.75 + (−1) × 1.92 × 0.25
      = 0.039
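The same arithmetic in R; with unrounded weights the answer is ≈ 0.0385, which rounds to 0.038 (the 0.039 above reflects rounding the weights first):

```r
## OLS weighting example computed numerically
p_x   <- c(0.25, 0.75)     # Pr[X = 0], Pr[X = 1]
e_x   <- c(0.5, 0.9)       # Pr[D = 1 | X = x]
tau_x <- c(-1, 1)          # CATEs: tau(0), tau(1)

s2_d    <- e_x * (1 - e_x)                     # 0.25, 0.09
ate     <- sum(tau_x * p_x)                    # size-weighted: 0.5
mean_s2 <- sum(s2_d * p_x)                     # E[sigma^2_d(X)] = 0.13
tau_R   <- sum(tau_x * (s2_d / mean_s2) * p_x) # variance-weighted
round(c(ate = ate, tau_R = tau_R), 3)
```

Changing either the CATEs or the propensity scores in this snippet shows how quickly τ^R can drift away from τ.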
When will OLS estimate the ATE?

• When does τ = τ^R?
• Constant treatment effects: τ(x) = τ = τ^R.
• Constant probability of treatment: e(x) = Pr[D_i = 1 | X_i = x] = e.
  - Implies that the OLS weights are all 1.
• An incorrect linearity assumption in X_i will lead to more bias.
Other ways to use regression

• What's the path forward?
  - Accept the bias (it might be relatively small with saturated models).
  - Use a different regression approach.
• Let μ_d(x) = E[Y_i(d) | X_i = x] be the CEF for the potential outcome under D_i = d.
• By consistency and no unmeasured confounders, we have μ_d(x) = E[Y_i | D_i = d, X_i = x].
• Estimate a regression of Y_i on X_i among the D_i = d group.
• Then μ̂_d(x) is just a predicted value from that regression at X_i = x.
• How can we use this?
Imputation estimators

• Impute the treated potential outcomes with Ŷ_i(1) = μ̂₁(X_i).
• Impute the control potential outcomes with Ŷ_i(0) = μ̂₀(X_i).
• Procedure:
  - Regress Y_i on X_i in the treated group and get predicted values for all units (treated or control).
  - Regress Y_i on X_i in the control group and get predicted values for all units (treated or control).
  - Take the average difference between these predicted values.
• More mathematically, it looks like this:

  τ̂_imp = (1/N) Σ_i [ μ̂₁(X_i) − μ̂₀(X_i) ]

• Sometimes called an imputation estimator.
Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample.
• Useful trick: use a model on the entire data and model.frame() to get the right design matrix.

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61
Notes on imputation estimators

• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE.
• Why don't people use this?
  - Most people don't know the results we've been talking about.
  - Harder to implement than vanilla OLS.
• Can use linear regression to estimate μ̂_d(x) = x'β̂_d.
• Recent trend is to estimate μ̂_d(x) via non-parametric methods, such as:
  - kernel regression, local linear regression, regression trees, etc.
  - Easiest is generalized additive models (GAMs).
Imputation estimator visualization

[Figure (three slides): scatterplot of y against x, with separate fitted regression lines for the treated and control groups used to impute the missing potential outcomes.]
Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i.

[Figure (three slides): scatterplot of y against x for the treated and control groups, where the true CEFs are nonlinear.]
Using semiparametric regression

• Here, the CEFs are nonlinear, but we don't know their form.
• We can use GAMs from the mgcv package for a flexible estimate:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204  Scale est. = 0.0071351  n = 30
Using GAMs

[Figure (three slides): the GAM fits for the treated and control groups flexibly track the nonlinear CEFs across x.]
Limited dependent variables

• Usual advice: model the data from first principles. Logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is OK with limited DVs when:
  - binary treatment and no covariates (just diff-in-means);
  - binary treatment, discrete covariates, and saturated models (stratified diff-in-means).
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong.
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models.
  - Could go wrong in small samples.
  - If using nonlinear models, always get effects on the scale of the outcome!
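A sketch of the first case: with a binary outcome, a binary treatment, and no covariates, the OLS coefficient is exactly the difference in means (data simulated for illustration):

```r
## OLS on a limited DV with binary treatment, no covariates
set.seed(2)
n <- 200
d <- rbinom(n, 1, 0.5)
y <- rbinom(n, 1, 0.3 + 0.2 * d)    # binary outcome

ols_coef <- unname(coef(lm(y ~ d))["d"])
diff_means <- mean(y[d == 1]) - mean(y[d == 0])
all.equal(ols_coef, diff_means)     # TRUE: identical up to rounding
```

No distributional model for Y is needed here; the equivalence is algebraic.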
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
error linear approxmiation to 119864[119884119894|119883119894]bull Even if the CEF is not linear a linear regression provides the
best linear approximation to that CEFbull Donrsquot need to believe the assumptions (linearity) in order to
use regression as a good approximation to the CEFbull Warning if the CEF is very nonlinear then this approximation
could be terrible
17 58
The error terms
bull Letrsquos define the error term 119890119894 equiv 119884119894 minus 119883prime119894 120573 so that
119884119894 = 119883prime119894 120573 + [119884119894 minus 119883prime
119894 120573] = 119883prime119894 120573 + 119890119894
bull Note the residual 119890119894 is uncorrelated with 119883119894
120124[119883119894119890119894] = 120124[119883119894(119884119894 minus 119883prime119894 120573)]
= 120124[119883119894119884119894] minus 120124[119883119894119883prime119894 120573]
= 120124[119883119894119884119894] minus 120124 1114094119883119894119883prime119894120124[119883119894119883prime
119894 ]minus1113568120124[119883119894119884119894]1114097= 120124[119883119894119884119894] minus 120124[119883119894119883prime
119894 ]120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]
= 120124[119883119894119884119894] minus 120124[119883119894119884119894] = 0
bull No assumptions on the linearity of 120124[119884119894|119883119894]
18 58
OLS estimator
bull We know the population value of 120573 is
120573 = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]
bull How do we get an estimator of thisbull Plug-in principle replace population expectation with
sample versions
=
โกโขโขโขโขโขโฃ1119873
1114012119894119883119894119883prime
119894
โคโฅโฅโฅโฅโฅโฆ
minus11135681119873
1114012119894119883119894119884119894
bull If you work through the matrix algebra this turns out to be
= (119831prime119831)minus1113568 119831prime119858
19 58
Asymptotic OLS inference
bull With this representation in hand we can write the OLSestimator as follows
= 120573 +
โกโขโขโขโขโขโฃ1114012
119894119883119894119883prime
119894
โคโฅโฅโฅโฅโฅโฆ
minus1113568
1114012119894119883119894119890119894
bull Core idea sum119894119883119894119890119894 is the sum of rvs so the CLT applies
bull That plus some simple asymptotic theory allows us to say
radic119873( minus 120573) 119873(0ฮฉ)
bull Converges in distribution to a Normal distribution with meanvector 0 and covariance matrix ฮฉ
ฮฉ = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119883prime
119894 1198901113569119894 ]120124[119883119894119883prime119894 ]minus1113568
bull No linearity assumption needed20 58
Estimating the variance
bull In large samples then
radic119873( minus 120573) sim 119873(0ฮฉ)
bull How to estimate ฮฉ Plug-in principle again
1114023ฮฉ =
โกโขโขโขโขโขโฃ1114012
119894119883119894119883prime
119894
โคโฅโฅโฅโฅโฅโฆ
minus1113568 โกโขโขโขโขโขโฃ1114012119894119883119894119883prime
119894 1113569119894
โคโฅโฅโฅโฅโฅโฆ
โกโขโขโขโขโขโฃ1114012
119894119883119894119883prime
119894
โคโฅโฅโฅโฅโฅโฆ
minus1113568
bull Replace 119890119894 with its emprical counterpart (residuals)119894 = 119884119894 minus 119883prime
119894 bull Replace the population moments of 119883119894 with their sample
counterpartsbull The square root of the diagonals of this covariance matrix are
the ldquorobustrdquo or Huber-White standard errors that Statacommonly report
21 58
Heteroskedasticity
bull No assumptions of homoskedasticitybull Heteroskedaticity will definitely occur when
CEF is linear but the 1205901113569(119909) = 120141[119884119894|119883119894 = 119909] is not constant in 119909 119864[119884119894|119883119894] is not linear but we use the linear regression to
approxmiate it
22 58
2 Regression andCausality
23 58
Regression and causality
bull Most econometrics textbooks regression defined withoutrespect to causality
bull But then when is ldquobiasedrdquo The above derivations work forsome 120124[119884119894|119883119894]
bull The question then is when does knowing the CEF tell ussomething about causality
bull MHE argues that a regression is causal when the CEF itapproximates is causal Identification is king
bull We will show that under certain conditions a regression of theoutcome on the treatment and the covariates can recover acausal parameter but perhaps not the one in which we areinterested
24 58
Review
bull Quick reminder we have potential outcomes 119884119894(1) and 119884119894(0)and two parameters the ATE and ATT
120591 = 119864[119884119894(1) minus 119884119894(0)]120591ATT = 119864[119884119894(1) minus 119884119894(0)|119863119894 = 1]
bull We have shown in past weeks that these effects are identifiedwhen ignorability holds MHE calls this the conditionalindependence assumption (CIA)
25 58
Linear constant effects model binarytreatment
bull Experiment with a simple experiment we can rewrite theconsistency assumption to be a regression formula
119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894
= 120124[119884119894(0)] + 120591119863119894 + (119884119894(0) minus 120124[119884119894(0)])= 1205831113567 + 120591119863119894 + 1199071113567119894
bull Note that if ignorability holds (as in an experiment) for 119884119894(0)then it will also hold for 1199071113567119894 since 120124[119884119894(0)] is constant Thusthis satifies the usual assumptions for regression
26 58
Now with covariates
bull Now assume no unmeasured confounders 119884119894(119889) โโ 119863119894|119883119894bull We will assume a linear model for the potential outcomes
119884119894(119889) = 120572 + 120591 sdot 119889 + 120578119894
bull Remember that linearity isnrsquot an assumption if 119863119894 is binarybull Effect of 119863119894 is constant here the 120578119894 are the only source of
individual variation and we have 119864[120578119894] = 0bull Consistency assumption allows us to write this as
119884119894 = 120572 + 120591119863119894 + 120578119894
27 58
Covariates in the error
bull Letrsquos assume that 120578119894 is linear in 119883119894 120578119894 = 119883prime119894 120574 + 120584119894
bull New error is uncorrelated with 119883119894 120124[120584119894|119883119894] = 0bull This is an assumption Might be falsebull Plug into the above
120124[119884119894(119889)|119883119894] = 119864[119884119894|119863119894 119883119894] = 120572 + 120591119863119894 + 119864[120578119894|119883119894]= 120572 + 120591119863119894 + 119883prime
119894 120574 + 119864[120584119894|119883119894]= 120572 + 120591119863119894 + 119883prime
119894 120574
28 58
Summing up regression with constanteffects
bull Reviewing the assumptions wersquove used no unmeasured confounders constant treatment effects linearity of the treatmentcovariates
bull Under these we can run the following regression to estimatethe ATE 120591
119884119894 = 120572 + 120591119863119894 + 119883prime119894 120574 + 120584119894
bull Works with continuous or ordinal 119863119894 if linearity in the effect ofthese variables is truly linear
29 58
OLS constant effects simulation
Model with linear covariates constant 0 effect of treatment
library(mvtnorm)
n lt- 100
p lt- 4
X lt- rmvnorm(n = 100 mean = rep(0 p))
gamma lt- c(274 137 137 137)
y lt- 210 + X c(gamma) + rnorm(n)
alpha lt- c(-1 05 -05 -01)
dprobs lt- bootinvlogit(X alpha)
d lt- rbinom(n size = 1 prob = dprobs)
30 58
OLS with no covariates
summary(lm(y ~ d))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 22036 458 4813 lt 2e-16
d -2869 654 -439 0000029
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 33 on 98 degrees of freedom
Multiple R-squared 0164 Adjusted R-squared 0156
F-statistic 192 on 1 and 98 DF p-value 0000029
31 58
OLS with covariates
summary(lm(y ~ d + X))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 209952 0159 13207 lt2e-16
d -0128 0253 -05 062
X1 27368 0126 2175 lt2e-16
X2 13677 0114 1201 lt2e-16
X3 13673 0130 1051 lt2e-16
X4 13570 0106 1280 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 1 on 94 degrees of freedom
Multiple R-squared 0999 Adjusted R-squared 0999
F-statistic 244e+04 on 5 and 94 DF p-value lt2e-16
32 58
What happens with nonlinearity
Suppose we can only observe the following covariates
z1 lt- exp(X[ 1]2)
z2 lt- X[ 2](1 + exp(X[ 1])) + 10
z3 lt- (X[ 1] X[ 3]25 + 06)^3
z4 lt- (X[ 2] + X[ 4] + 20)^2
Implies that 119884119894 and 119863119894 are functions of log(1198851198941113568) 1198851198941113569 119885111356911989411135681198851198941113569
1 log(1198851198941113568) 1198851198941113570 log(1198851198941113568) and 119883111356811135691198941113571
Regression is a nonlinear function of the observed covariates
33 58
When linearity goes wrong
summary(lm(y ~ d + z1 + z2 + z3 + z4))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 218728 300799 073 0469
d -69292 33854 -205 0043
z1 363110 26970 1346 lt2e-16
z2 -29033 36619 -079 0430
z3 862022 431030 200 0048
z4 04021 00329 1223 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 14 on 94 degrees of freedom
Multiple R-squared 0855 Adjusted R-squared 0847
F-statistic 111 on 5 and 94 DF p-value lt2e-16
34 58
3 Regression withHeterogeneousTreatment Effects
35 58
Heterogeneous effects binary treatment
bull Completely randomized experiment
119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894
= 1205831113567 + 120591119894119863119894 + (119884119894(0) minus 1205831113567)= 1205831113567 + 120591119863119894 + (119884119894(0) minus 1205831113567) + (120591119894 minus 120591) sdot 119863119894
= 1205831113567 + 120591119863119894 + 120576119894
bull Error term now includes two components1 ldquoBaselinerdquo variation in the outcome (119884119894(0) minus 1205831113567)2 Variation in the treatment effect (120591119894 minus 120591)
bull Easy to verify that under experiment 120124[120576119894|119863119894] = 0bull Thus OLS estimates the ATE with no covariates
36 58
Adding covariates
bull What happens with no unmeasured confounders Need tocondition on 119883119894 now
bull Remember identification of the ATEATT using iteratedexpectations
bull ATE is the weighted sum of CATEs
120591 = 1114012119909120591(119909) Pr[119883119894 = 119909]
bull ATEATT are weighted averages of CATEsbull What about the regression estimand 120591119877 How does it related
to the ATEATT
37 58
Heterogeneous effects and regression
bull Letrsquos investigate this under a saturated regression model
119884119894 = 1114012119909119861119909119894120572119909 + 120591119877119863119894 + 119890119894
bull Use a dummy variable for each unique combination of 119883119894119861119909119894 = 120128(119883119894 = 119909)
bull Linear in 119883119894 by construction
38 58
Investigating the regression coefficient
bull How can we investigate 120591119877 Well we can rely on theregression anatomy
120591119877 = Cov(119884119894 119863119894 minus 119864[119863119894|119883119894])120141(119863119894 minus 119864[119863119894|119883119894])
bull 119863119894 minus 120124[119863119894|119883119894] is the residual from a regression of 119863119894 on the fullset of dummies
bull With a little work we can show
120591119877 =120124 1114094120591(119883119894)(119863119894 minus 120124[119863119894|119883119894])11135691114097
120124[(119863119894 minus 119864[119863119894|119883119894])1113569]=
120124[120591(119883119894)1205901113569119889(119883119894)]120124[1205901113569119889(119883119894)]
bull 1205901113569119889(119909) = 120141[119863119894|119883119894 = 119909] is the conditional variance of treatmentassignment
39 58
ATE versus OLS

  τ_R = E[τ(X_i) W_i] = Σ_x τ(x) · [σ²_d(x) / E[σ²_d(X_i)]] · Pr[X_i = x]

• Compare to the ATE:

  τ = E[τ(X_i)] = Σ_x τ(x) Pr[X_i = x]

• Both weight strata relative to their size (Pr[X_i = x])
• OLS weights strata more heavily when the treatment variance in a stratum (σ²_d(x)) is high relative to the average variance across strata (E[σ²_d(X_i)])
• The ATE weights strata only by their size

40/58
Regression weighting

  W_i = σ²_d(X_i) / E[σ²_d(X_i)]

• Why does OLS weight like this?
• OLS is a minimum-variance estimator: it puts more weight on more precise within-strata estimates
• Within-strata estimates are most precise when the treatment is evenly spread and thus has the highest variance
• If D_i is binary, then we know the conditional variance will be

  σ²_d(x) = Pr[D_i = 1 | X_i = x](1 − Pr[D_i = 1 | X_i = x]) = e(x)(1 − e(x))

• Maximum variance when Pr[D_i = 1 | X_i = x] = 1/2

41/58
OLS weighting example

• Binary covariate:

  Pr[X_i = 1] = 0.75    Pr[X_i = 0] = 0.25
  Pr[D_i = 1 | X_i = 1] = 0.9    Pr[D_i = 1 | X_i = 0] = 0.5
  σ²_d(1) = 0.09    σ²_d(0) = 0.25
  τ(1) = 1    τ(0) = −1

• Implies the ATE is τ = 0.5
• Average conditional variance: E[σ²_d(X_i)] = 0.13
• Weights: for X_i = 1, 0.09/0.13 = 0.692; for X_i = 0, 0.25/0.13 = 1.92

  τ_R = E[τ(X_i) W_i]
      = τ(1) W(1) Pr[X_i = 1] + τ(0) W(0) Pr[X_i = 0]
      = 1 × 0.692 × 0.75 + (−1) × 1.92 × 0.25
      = 0.039

42/58
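The arithmetic in this example can be verified in a few lines of R (a quick sketch; the exact value is 0.5/13 ≈ 0.0385):

```r
# Numerical check of the OLS weighting example from this slide
p.x <- c(0.25, 0.75)              # Pr[X_i = 0], Pr[X_i = 1]
e.x <- c(0.5, 0.9)                # Pr[D_i = 1 | X_i = x]
tau.x <- c(-1, 1)                 # CATEs: tau(0), tau(1)
s2 <- e.x * (1 - e.x)             # conditional variances: 0.25, 0.09
tau <- sum(tau.x * p.x)           # ATE = 0.5
w <- s2 / sum(s2 * p.x)           # OLS weights: 1.92, 0.692
tau.R <- sum(tau.x * w * p.x)     # regression estimand, about 0.0385
```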
When will OLS estimate the ATE?

• When does τ = τ_R?
• Constant treatment effects: τ(x) = τ = τ_R
• Constant probability of treatment: e(x) = Pr[D_i = 1 | X_i = x] = e
  - Implies that the OLS weights are all 1
• An incorrect linearity assumption in X_i will lead to more bias

43/58
Other ways to use regression

• What's the path forward?
  - Accept the bias (might be relatively small with saturated models)
  - Use a different regression approach
• Let μ_d(x) = E[Y_i(d) | X_i = x] be the CEF for the potential outcome under D_i = d
• By consistency and no unmeasured confounders, we have μ_d(x) = E[Y_i | D_i = d, X_i = x]
• Estimate a regression of Y_i on X_i among the D_i = d group
• Then μ̂_d(x) is just a predicted value from that regression at X_i = x
• How can we use this?

44/58
Imputation estimators

• Impute the treated potential outcomes with Ŷ_i(1) = μ̂₁(X_i)
• Impute the control potential outcomes with Ŷ_i(0) = μ̂₀(X_i)
• Procedure:
  1. Regress Y_i on X_i in the treated group and get predicted values for all units (treated or control)
  2. Regress Y_i on X_i in the control group and get predicted values for all units (treated or control)
  3. Take the average difference between these predicted values
• More formally, this looks like:

  τ̂_imp = (1/N) Σ_i [μ̂₁(X_i) − μ̂₀(X_i)]

• Sometimes called an imputation estimator

45/58
Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample
• Useful trick: fit a model on the entire data and use model.frame() to get the right design matrix

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61

46/58
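A fully self-contained version of the same estimator on simulated data (a sketch with made-up variable names; the true effect is set to 2 by construction, so the estimate should land near 2):

```r
# Imputation estimator on simulated data (illustrative sketch)
set.seed(2138)
n <- 1000
x <- rnorm(n)
d <- rbinom(n, 1, plogis(x))           # treatment depends on x
y <- 1 + 2 * d + x + rnorm(n)          # true treatment effect is 2
dat <- data.frame(y = y, d = d, x = x)
mod1 <- lm(y ~ x, data = dat, subset = d == 1)
mod0 <- lm(y ~ x, data = dat, subset = d == 0)
tau.imp <- mean(predict(mod1, dat) - predict(mod0, dat))
tau.imp                                 # close to 2
```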
Notes on imputation estimators

• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE
• Why don't people use this?
  - Most people don't know the results we've been talking about
  - Harder to implement than vanilla OLS
• Can use linear regression to estimate μ̂_d(x) = x′β̂_d
• Recent trend is to estimate μ̂_d(x) via nonparametric methods, such as:
  - kernel regression, local linear regression, regression trees, etc.
  - easiest is generalized additive models (GAMs)

47/58
Imputation estimator visualization

[Figure: simulated data, y plotted against x, treated and control units marked separately]

48/58

Imputation estimator visualization

[Figure: same plot, build-up continued]

49/58

Imputation estimator visualization

[Figure: same plot, build-up continued]

50/58
Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i

[Figure: simulated data, y plotted against x, treated and control units marked separately]

51/58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i

[Figure: same plot, build-up continued]

52/58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i

[Figure: same plot, build-up continued]

53/58
Using semiparametric regression

• Here the CEFs are nonlinear, but we don't know their form
• We can use GAMs from the mgcv package for flexible estimation:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204   Scale est. = 0.0071351   n = 30

54/58
Using GAMs

[Figure: simulated data, y plotted against x, treated and control units marked separately]

55/58

Using GAMs

[Figure: same plot, build-up continued]

56/58

Using GAMs

[Figure: same plot, build-up continued]

57/58
Limited dependent variables

• Usual advice: model the data from first principles
  - Logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is a-ok with limited DVs when:
  - binary treatment and no covariates (just diff-in-means)
  - binary treatment, discrete covariates, and saturated models (stratified diff-in-means)
• Imposing a model on LDVs in these cases imposes a distributional assumption, which could be wrong
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models
  - Could go wrong in small samples
  - If using nonlinear models, always get effects on the scale of the outcome

58/58
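The first a-ok case is easy to demonstrate: with a binary outcome and a binary treatment, the OLS slope is exactly the difference in means. A quick sketch on simulated data:

```r
# With binary y and binary d, the OLS coefficient on d equals the
# difference in group means (simulated data, illustrative only)
set.seed(5)
n <- 200
d <- rbinom(n, 1, 0.5)
y <- rbinom(n, 1, 0.3 + 0.2 * d)
tau.ols <- unname(coef(lm(y ~ d))["d"])       # OLS slope
tau.dim <- mean(y[d == 1]) - mean(y[d == 0])  # difference in means
all.equal(tau.ols, tau.dim)                   # TRUE
```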
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
Saturated model

summary(lm(aauw ~ as.factor(totchi), data = girls))

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)            56.41       2.76   20.42  < 2e-16
as.factor(totchi)1      5.45       4.11    1.33   0.1851
as.factor(totchi)2     -3.80       3.27   -1.16   0.2454
as.factor(totchi)3    -13.65       3.45   -3.95  8.1e-05
as.factor(totchi)4    -19.31       4.01   -4.82  1.6e-06
as.factor(totchi)5    -15.46       4.85   -3.19   0.0015
as.factor(totchi)6    -33.59      10.42   -3.22   0.0013
as.factor(totchi)7    -17.13      11.41   -1.50   0.1336
as.factor(totchi)8    -55.33      12.28   -4.51  7.0e-06
as.factor(totchi)9    -50.41      24.08   -2.09   0.0364
as.factor(totchi)10   -53.41      20.90   -2.56   0.0107
as.factor(totchi)12   -56.41      41.53   -1.36   0.1745
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41 on 1723 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.0506, Adjusted R-squared: 0.0446
F-statistic: 8.36 on 11 and 1723 DF, p-value: 1.84e-14

14/58
Saturated model minus the constant

summary(lm(aauw ~ as.factor(totchi) - 1, data = girls))

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
as.factor(totchi)0     56.41       2.76   20.42   <2e-16
as.factor(totchi)1     61.86       3.05   20.31   <2e-16
as.factor(totchi)2     52.62       1.75   30.13   <2e-16
as.factor(totchi)3     42.76       2.07   20.62   <2e-16
as.factor(totchi)4     37.11       2.90   12.79   <2e-16
as.factor(totchi)5     40.95       3.99   10.27   <2e-16
as.factor(totchi)6     22.82      10.05    2.27   0.0233
as.factor(totchi)7     39.29      11.07    3.55   0.0004
as.factor(totchi)8      1.08      11.96    0.09   0.9278
as.factor(totchi)9      6.00      23.92    0.25   0.8020
as.factor(totchi)10     3.00      20.72    0.14   0.8849
as.factor(totchi)12     0.00      41.43    0.00   1.0000
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41 on 1723 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.587, Adjusted R-squared: 0.584
F-statistic: 204 on 12 and 1723 DF, p-value: <2e-16

15/58
Compare to within-strata means

• The saturated model makes no assumptions about the between-strata relationships
• It just calculates the within-strata means:

c1 <- coef(lm(aauw ~ as.factor(totchi) - 1, data = girls))
c2 <- with(girls, tapply(aauw, totchi, mean, na.rm = TRUE))
rbind(c1, c2)

    0  1  2  3  4  5  6  7  8  9 10 12
c1 56 62 53 43 37 41 23 39 11  6  3  0
c2 56 62 53 43 37 41 23 39 11  6  3  0

16/58
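This equivalence is easy to reproduce on simulated data (a sketch with hypothetical variable names): a no-intercept regression on strata dummies recovers the within-strata means exactly.

```r
# Saturated-dummy OLS equals the within-strata means (simulated data)
set.seed(3)
g <- sample(0:3, 500, replace = TRUE)    # discrete covariate
y <- 2 * g + rnorm(500)
c1 <- coef(lm(y ~ as.factor(g) - 1))     # dummy-regression coefficients
c2 <- tapply(y, g, mean)                 # within-strata means
max(abs(c1 - c2))                        # numerically zero
```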
Other justifications for OLS

• Justification 2: X_i′β is the best linear predictor (in a mean-squared-error sense) of Y_i
  - Why? β = argmin_b E[(Y_i − X_i′b)²]
• Justification 3: X_i′β provides the minimum-mean-squared-error linear approximation to E[Y_i | X_i]
• Even if the CEF is not linear, a linear regression provides the best linear approximation to that CEF
• Don't need to believe the assumptions (linearity) in order to use regression as a good approximation to the CEF
• Warning: if the CEF is very nonlinear, then this approximation could be terrible

17/58
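A small sketch of the best-linear-predictor property: even when the CEF is deliberately nonlinear, the lm() slope coincides with the BLP slope Cov(X, Y)/V(X) (simulated data, illustrative only):

```r
# OLS slope equals the best-linear-predictor slope cov(x, y) / var(x),
# even though the CEF here is nonlinear (simulated data)
set.seed(4)
n <- 1e4
x <- rnorm(n)
y <- x^2 + x + rnorm(n)           # nonlinear CEF: E[y | x] = x^2 + x
b.blp <- cov(x, y) / var(x)       # sample analog of the BLP slope
b.ols <- unname(coef(lm(y ~ x))[2])
all.equal(b.blp, b.ols)           # TRUE: identical by construction
```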
The error terms

• Let's define the error term e_i ≡ Y_i − X_i′β, so that

  Y_i = X_i′β + [Y_i − X_i′β] = X_i′β + e_i

• Note that the error e_i is uncorrelated with X_i:

  E[X_i e_i] = E[X_i (Y_i − X_i′β)]
             = E[X_i Y_i] − E[X_i X_i′β]
             = E[X_i Y_i] − E[X_i X_i′ E[X_i X_i′]⁻¹ E[X_i Y_i]]
             = E[X_i Y_i] − E[X_i X_i′] E[X_i X_i′]⁻¹ E[X_i Y_i]
             = E[X_i Y_i] − E[X_i Y_i] = 0

• No assumptions on the linearity of E[Y_i | X_i]

18/58
OLS estimator

• We know the population value of β is

  β = E[X_i X_i′]⁻¹ E[X_i Y_i]

• How do we get an estimator of this?
• Plug-in principle: replace the population expectations with their sample versions:

  β̂ = [ (1/N) Σ_i X_i X_i′ ]⁻¹ (1/N) Σ_i X_i Y_i

• If you work through the matrix algebra, this turns out to be

  β̂ = (X′X)⁻¹ X′y

19/58
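The matrix formula is straightforward to check against lm() (a sketch on simulated data):

```r
# The closed-form (X'X)^(-1) X'y matches lm()'s coefficients
set.seed(1)
n <- 200
X <- cbind(1, rnorm(n), rnorm(n))              # design matrix with intercept
y <- drop(X %*% c(1, 2, -3)) + rnorm(n)
beta.hat <- drop(solve(t(X) %*% X, t(X) %*% y))
fit <- lm(y ~ X[, 2] + X[, 3])
max(abs(beta.hat - coef(fit)))                 # numerically zero
```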
Asymptotic OLS inference

• With this representation in hand, we can write the OLS estimator as follows:

  β̂ = β + [ Σ_i X_i X_i′ ]⁻¹ Σ_i X_i e_i

• Core idea: Σ_i X_i e_i is a sum of random variables, so the CLT applies
• That, plus some simple asymptotic theory, allows us to say

  √N (β̂ − β) →d N(0, Ω)

• Converges in distribution to a Normal distribution with mean vector 0 and covariance matrix Ω:

  Ω = E[X_i X_i′]⁻¹ E[X_i X_i′ e_i²] E[X_i X_i′]⁻¹

• No linearity assumption needed

20/58
Estimating the variance

• In large samples, then,

  √N (β̂ − β) ∼ N(0, Ω)

• How to estimate Ω? Plug-in principle again:

  Ω̂ = [ Σ_i X_i X_i′ ]⁻¹ [ Σ_i X_i X_i′ ê_i² ] [ Σ_i X_i X_i′ ]⁻¹

• Replace e_i with its empirical counterpart, the residual ê_i = Y_i − X_i′β̂
• Replace the population moments of X_i with their sample counterparts
• The square roots of the diagonal of this covariance matrix are the "robust" or Huber-White standard errors that Stata commonly reports

21/58
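The sandwich formula is short enough to compute by hand. A sketch on simulated heteroskedastic data (illustrative; this is the HC0 variant, with no finite-sample correction):

```r
# Huber-White (HC0) sandwich variance computed directly
set.seed(2)
n <- 500
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = abs(x))   # heteroskedastic errors
X <- cbind(1, x)                          # design matrix
fit <- lm(y ~ x)
e.hat <- residuals(fit)
bread <- solve(t(X) %*% X)
meat <- t(X) %*% (X * e.hat^2)            # sum of x_i x_i' * e.hat_i^2
vcov.hw <- bread %*% meat %*% bread
sqrt(diag(vcov.hw))                       # robust SEs for intercept and slope
```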
Heteroskedasticity

• No assumptions of homoskedasticity here
• Heteroskedasticity will definitely occur when:
  - the CEF is linear, but σ²(x) = V[Y_i | X_i = x] is not constant in x
  - E[Y_i | X_i] is not linear, but we use linear regression to approximate it

22/58
2. Regression and Causality

23/58
Regression and causality

• In most econometrics textbooks, regression is defined without respect to causality
• But then when is β̂ "biased"? The above derivations work for some CEF E[Y_i | X_i]
• The question, then, is when does knowing the CEF tell us something about causality?
• MHE argues that a regression is causal when the CEF it approximates is causal. Identification is king.
• We will show that under certain conditions, a regression of the outcome on the treatment and the covariates can recover a causal parameter, but perhaps not the one in which we are interested

24/58
Review

• Quick reminder: we have potential outcomes Y_i(1) and Y_i(0), and two parameters, the ATE and the ATT:

  τ = E[Y_i(1) − Y_i(0)]
  τ_ATT = E[Y_i(1) − Y_i(0) | D_i = 1]

• We have shown in past weeks that these effects are identified when ignorability holds; MHE calls this the conditional independence assumption (CIA)

25/58
Linear constant-effects model, binary treatment

• Start with a simple experiment: we can rewrite the consistency assumption as a regression formula:

  Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
      = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
      = E[Y_i(0)] + τ D_i + (Y_i(0) − E[Y_i(0)])
      = μ₀ + τ D_i + v_{0i}

• Note that if ignorability holds (as in an experiment) for Y_i(0), then it will also hold for v_{0i}, since E[Y_i(0)] is constant. Thus this satisfies the usual assumptions for regression.

26/58
Now with covariates

• Now assume no unmeasured confounders: Y_i(d) ⊥⊥ D_i | X_i
• We will assume a linear model for the potential outcomes:

  Y_i(d) = α + τ·d + η_i

• Remember that linearity isn't an assumption if D_i is binary
• The effect of D_i is constant here; the η_i are the only source of individual variation, and we have E[η_i] = 0
• The consistency assumption allows us to write this as

  Y_i = α + τ D_i + η_i

27/58
Covariates in the error

• Let's assume that η_i is linear in X_i: η_i = X_i′γ + ν_i
• The new error is uncorrelated with X_i: E[ν_i | X_i] = 0
  - This is an assumption! It might be false!
• Plug into the above:

  E[Y_i(d) | X_i] = E[Y_i | D_i, X_i] = α + τ D_i + E[η_i | X_i]
                                      = α + τ D_i + X_i′γ + E[ν_i | X_i]
                                      = α + τ D_i + X_i′γ

28/58
Summing up regression with constant effects

• Reviewing the assumptions we've used:
  - no unmeasured confounders
  - constant treatment effects
  - linearity of the treatment/covariates
• Under these, we can run the following regression to estimate the ATE, τ:

  Y_i = α + τ D_i + X_i′γ + ν_i

• Works with continuous or ordinal D_i if the effect of these variables is truly linear

29/58
OLS constant-effects simulation

Model with linear covariates and a constant zero effect of the treatment:

library(mvtnorm)
n <- 100
p <- 4
X <- rmvnorm(n = 100, mean = rep(0, p))
gamma <- c(27.4, 13.7, 13.7, 13.7)
y <- 210 + X %*% gamma + rnorm(n)
alpha <- c(-1, 0.5, -0.5, -0.1)
d.probs <- boot::inv.logit(X %*% alpha)
d <- rbinom(n, size = 1, prob = d.probs)

30/58
OLS with no covariates

summary(lm(y ~ d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   220.36       4.58   48.13  < 2e-16
d             -28.69       6.54   -4.39 0.000029
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 33 on 98 degrees of freedom
Multiple R-squared: 0.164, Adjusted R-squared: 0.156
F-statistic: 19.2 on 1 and 98 DF, p-value: 0.000029

31/58
OLS with covariates

summary(lm(y ~ d + X))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 209.952      0.159 1320.7    <2e-16
d            -0.128      0.253   -0.5      0.62
X1           27.368      0.126  217.5    <2e-16
X2           13.677      0.114  120.1    <2e-16
X3           13.673      0.130  105.1    <2e-16
X4           13.570      0.106  128.0    <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom
Multiple R-squared: 0.999, Adjusted R-squared: 0.999
F-statistic: 2.44e+04 on 5 and 94 DF, p-value: <2e-16

32/58
What happens with nonlinearity?

Suppose we can only observe the following covariates:

z1 <- exp(X[, 1] / 2)
z2 <- X[, 2] / (1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3] / 25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2
• Implies that Y_i and D_i are functions of log(Z_{i1}), Z_{i2}, Z_{i1}² Z_{i2}, 1/log(Z_{i1}), Z_{i3}^{1/3}/log(Z_{i1}), and Z_{i4}^{1/2}
• Regression is then a nonlinear function of the observed covariates

33/58
When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.8728    30.0799    0.73    0.469
d            -6.9292     3.3854   -2.05    0.043
z1           36.3110     2.6970   13.46   <2e-16
z2           -2.9033     3.6619   -0.79    0.430
z3           86.2022    43.1030    2.00    0.048
z4            0.4021     0.0329   12.23   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14 on 94 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF, p-value: <2e-16

34/58
34 58
3 Regression withHeterogeneousTreatment Effects
35 58
Heterogeneous effects binary treatment
bull Completely randomized experiment
119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894
= 1205831113567 + 120591119894119863119894 + (119884119894(0) minus 1205831113567)= 1205831113567 + 120591119863119894 + (119884119894(0) minus 1205831113567) + (120591119894 minus 120591) sdot 119863119894
= 1205831113567 + 120591119863119894 + 120576119894
bull Error term now includes two components1 ldquoBaselinerdquo variation in the outcome (119884119894(0) minus 1205831113567)2 Variation in the treatment effect (120591119894 minus 120591)
bull Easy to verify that under experiment 120124[120576119894|119863119894] = 0bull Thus OLS estimates the ATE with no covariates
36 58
Adding covariates
bull What happens with no unmeasured confounders Need tocondition on 119883119894 now
bull Remember identification of the ATEATT using iteratedexpectations
bull ATE is the weighted sum of CATEs
120591 = 1114012119909120591(119909) Pr[119883119894 = 119909]
bull ATEATT are weighted averages of CATEsbull What about the regression estimand 120591119877 How does it related
to the ATEATT
37 58
Heterogeneous effects and regression
bull Letrsquos investigate this under a saturated regression model
119884119894 = 1114012119909119861119909119894120572119909 + 120591119877119863119894 + 119890119894
bull Use a dummy variable for each unique combination of 119883119894119861119909119894 = 120128(119883119894 = 119909)
bull Linear in 119883119894 by construction
38 58
Investigating the regression coefficient
bull How can we investigate 120591119877 Well we can rely on theregression anatomy
120591119877 = Cov(119884119894 119863119894 minus 119864[119863119894|119883119894])120141(119863119894 minus 119864[119863119894|119883119894])
bull 119863119894 minus 120124[119863119894|119883119894] is the residual from a regression of 119863119894 on the fullset of dummies
bull With a little work we can show
120591119877 =120124 1114094120591(119883119894)(119863119894 minus 120124[119863119894|119883119894])11135691114097
120124[(119863119894 minus 119864[119863119894|119883119894])1113569]=
120124[120591(119883119894)1205901113569119889(119883119894)]120124[1205901113569119889(119883119894)]
bull 1205901113569119889(119909) = 120141[119863119894|119883119894 = 119909] is the conditional variance of treatmentassignment
39 58
ATE versus OLS
120591119877 = 120124[120591(119883119894)119882119894] = 1114012119909120591(119909)
1205901113569119889(119909)120124[1205901113569119889(119883119894)]
โ[119883119894 = 119909]
bull Compare to the ATE
120591 = 120124[120591(119883119894)] = 1114012119909120591(119909)โ[119883119894 = 119909]
bull Both weight strata relative to their size (โ[119883119894 = 119909])bull OLS weights strata higher if the treatment variance in those
strata (1205901113569119889(119909)) is higher in those strata relative to the averagevariance across strata (120124[1205901113569119889(119883119894)])
bull The ATE weights only by their size
40 58
Regression weighting
119882119894 =1205901113569119889(119883119894)
120124[1205901113569119889(119883119894)]
bull Why does OLS weight like thisbull OLS is a minimum-variance estimator more weight to
more precise within-strata estimatesbull Within-strata estimates are most precise when the treatment
is evenly spread and thus has the highest variancebull If 119863119894 is binary then we know the conditional variance will be
1205901113569119889(119909) = โ[119863119894 = 1|119883119894 = 119909] (1 minus โ[119863119894 = 1|119883119894 = 119909])= 119890(119909) (1 minus 119890(119909))
bull Maximum variance with โ[119863119894 = 1|119883119894 = 119909] = 12
41 58
OLS weighting examplebull Binary covariate
โ[119883119894 = 1] = 075 โ[119883119894 = 0] = 025โ[119863119894 = 1|119883119894 = 1] = 09 โ[119863119894 = 1|119883119894 = 0] = 05
1205901113569119889(1) = 009 1205901113569119889(0) = 025120591(1) = 1 120591(0) = minus1
bull Implies the ATE is 120591 = 05bull Average conditional variance 120124[1205901113569119889(119883119894)] = 013bull weights for 119883119894 = 1 are 009013 = 0692 for 119883119894 = 0
025013 = 192120591119877 = 120124[120591(119883119894)119882119894]
= 120591(1)119882(1)โ[119883119894 = 1] + 120591(0)119882(0)โ[119883119894 = 0]= 1 times 0692 times 075 + minus1 times 192 times 025= 0039
42 58
When will OLS estimate the ATE
bull When does 120591 = 120591119877bull Constant treatment effects 120591(119909) = 120591 = 120591119877bull Constant probability of treatment 119890(119909) = โ[119863119894 = 1|119883119894 = 119909] = 119890
Implies that the OLS weights are 1
bull Incorrect linearity assumption in 119883119894 will lead to more bias
43 58
Other ways to use regression
bull Whatrsquos the path forward Accept the bias (might be relatively small with saturated
models) Use a different regression approach
bull Let 120583119889(119909) = 120124[119884119894(119889)|119883119894 = 119909] be the CEF for the potentialoutcome under 119863119894 = 119889
bull By consistency and nuc we have 120583119889(119909) = 120124[119884119894|119863119894 = 119889119883119894 = 119909]bull Estimate a regression of 119884119894 on 119883119894 among the 119863119894 = 119889 groupbull Then 119889(119909) is just a predicted value from the regression for
119883119894 = 119909bull How can we use this
44 58
Imputation estimators
bull Impute the treated potential outcomes with 1114023119884119894(1) = 1113568(119883119894)bull Impute the control potential outcomes with 1114023119884119894(0) = 1113567(119883119894)bull Procedure
Regress 119884119894 on 119883119894 in the treated group and get predicted valuesfor all units (treated or control)
Regress 119884119894 on 119883119894 in the control group and get predicted valuesfor all units (treated or control)
Take the average difference between these predicted values
bull More mathematically look like this
120591119894119898119901 =1119873
11140121198941113568(119883119894) minus 1113567(119883119894)
bull Sometimes called an imputation estimator
45 58
Simple imputation estimator
bull Use predict() from the within-group models on the data fromthe entire sample
bull Useful trick use a model on the entire data andmodelframe() to get the right design matrix
heterogeneous effects
yhet lt- ifelse(d == 1 y + rnorm(n 0 5) y)
mod lt- lm(yhet ~ d + X)
mod1 lt- lm(yhet ~ X subset = d == 1)
mod0 lt- lm(yhet ~ X subset = d == 0)
y1imps lt- predict(mod1 modelframe(mod))
y0imps lt- predict(mod0 modelframe(mod))
mean(y1imps - y0imps)
[1] 061
46 58
Notes on imputation estimators
bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE
bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS
bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods
such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)
47 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
48 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
49 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
50 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
51 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
52 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
53 58
Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible
estimatelibrary(mgcv)
mod0 lt- gam(y ~ s(x) subset = d == 0)
summary(mod0)
Family gaussian
Link function identity
Formula
y ~ s(x)
Parametric coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) -00225 00154 -146 016
Approximate significance of smooth terms
edf Refdf F p-value
s(x) 603 708 413 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
R-sq(adj) = 0909 Deviance explained = 928
GCV = 00093204 Scale est = 00071351 n = 30
54 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
55 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
56 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
57 58
Limited dependent variables
bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc
bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models
(stratified diff-in-means)
bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong
bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models
Could go wrong in small samples If using nonlinear models always get effects on the scale of the
outcome
58 58
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
-
Saturated model minus the constantsummary(lm(aauw ~ asfactor(totchi) - 1 data = girls))
Coefficients
Estimate Std Error t value Pr(gt|t|)
asfactor(totchi)0 5641 276 2042 lt2e-16
asfactor(totchi)1 6186 305 2031 lt2e-16
asfactor(totchi)2 5262 175 3013 lt2e-16
asfactor(totchi)3 4276 207 2062 lt2e-16
asfactor(totchi)4 3711 290 1279 lt2e-16
asfactor(totchi)5 4095 399 1027 lt2e-16
asfactor(totchi)6 2282 1005 227 00233
asfactor(totchi)7 3929 1107 355 00004
asfactor(totchi)8 108 1196 009 09278
asfactor(totchi)9 600 2392 025 08020
asfactor(totchi)10 300 2072 014 08849
asfactor(totchi)12 000 4143 000 10000
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 41 on 1723 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared 0587 Adjusted R-squared 0584
F-statistic 204 on 12 and 1723 DF p-value lt2e-1615 58
Compare to within-strata means
bull The saturated model makes no assumptions about thebetween-strata relationships
bull Just calculates within-strata means
c1 lt- coef(lm(aauw ~ asfactor(totchi) - 1 data = girls))
c2 lt- with(girls tapply(aauw totchi mean narm = TRUE))
rbind(c1 c2)
0 1 2 3 4 5 6 7 8 9 10 12
c1 56 62 53 43 37 41 23 39 11 6 3 0
c2 56 62 53 43 37 41 23 39 11 6 3 0
16 58
Other justifications for OLS
bull Justification 2 119883prime119894 120573 is the best linear predictor (in a
mean-squared error sense) of 119884119894 Why 120573 = argmin119887 120124[(119884119894 minus 119883prime
119894 119887)1113569]
bull Justification 3 119883prime119894 120573 provides the minimum mean squared
error linear approxmiation to 119864[119884119894|119883119894]bull Even if the CEF is not linear a linear regression provides the
best linear approximation to that CEFbull Donrsquot need to believe the assumptions (linearity) in order to
use regression as a good approximation to the CEFbull Warning if the CEF is very nonlinear then this approximation
could be terrible
17 58
The error terms
bull Letrsquos define the error term 119890119894 equiv 119884119894 minus 119883prime119894 120573 so that
119884119894 = 119883prime119894 120573 + [119884119894 minus 119883prime
119894 120573] = 119883prime119894 120573 + 119890119894
bull Note the residual 119890119894 is uncorrelated with 119883119894
120124[119883119894119890119894] = 120124[119883119894(119884119894 minus 119883prime119894 120573)]
= 120124[119883119894119884119894] minus 120124[119883119894119883prime119894 120573]
= 120124[119883119894119884119894] minus 120124 1114094119883119894119883prime119894120124[119883119894119883prime
119894 ]minus1113568120124[119883119894119884119894]1114097= 120124[119883119894119884119894] minus 120124[119883119894119883prime
119894 ]120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]
= 120124[119883119894119884119894] minus 120124[119883119894119884119894] = 0
bull No assumptions on the linearity of 120124[119884119894|119883119894]
18 58
OLS estimator
bull We know the population value of 120573 is
120573 = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]
bull How do we get an estimator of thisbull Plug-in principle replace population expectation with
sample versions
=
โกโขโขโขโขโขโฃ1119873
1114012119894119883119894119883prime
119894
โคโฅโฅโฅโฅโฅโฆ
minus11135681119873
1114012119894119883119894119884119894
bull If you work through the matrix algebra this turns out to be
= (119831prime119831)minus1113568 119831prime119858
19 58
Asymptotic OLS inference

• With this representation in hand, we can write the OLS estimator as follows:

$$\hat{\beta} = \beta + \left[\sum_i X_i X_i'\right]^{-1} \sum_i X_i e_i$$

• Core idea: $\sum_i X_i e_i$ is a sum of random variables, so the CLT applies
• That, plus some simple asymptotic theory, allows us to say:

$$\sqrt{N}(\hat{\beta} - \beta) \rightsquigarrow N(0, \Omega)$$

• Converges in distribution to a Normal distribution with mean vector 0 and covariance matrix

$$\Omega = E[X_i X_i']^{-1} E[X_i X_i' e_i^2]\, E[X_i X_i']^{-1}$$

• No linearity assumption needed

20/58
Estimating the variance

• In large samples, then,

$$\sqrt{N}(\hat{\beta} - \beta) \approx N(0, \Omega)$$

• How to estimate $\Omega$? Plug-in principle again:

$$\hat{\Omega} = \left[\sum_i X_i X_i'\right]^{-1} \left[\sum_i X_i X_i' \hat{e}_i^2\right] \left[\sum_i X_i X_i'\right]^{-1}$$

• Replace $e_i$ with its empirical counterpart, the residuals $\hat{e}_i = Y_i - X_i'\hat{\beta}$
• Replace the population moments of $X_i$ with their sample counterparts
• The square roots of the diagonals of this covariance matrix are the "robust" or Huber-White standard errors that Stata commonly reports

21/58
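The sandwich above can be assembled in a few lines of R. This sketch builds it by hand rather than using a package, so the variable names are mine; the result is the same quantity that the sandwich package computes as "HC0":

```r
# Robust (HC0) sandwich variance: bread %*% meat %*% bread.
set.seed(2)
n <- 500
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = exp(x / 2))  # heteroskedastic errors
fit <- lm(y ~ x)
Xmat <- model.matrix(fit)
e.hat <- resid(fit)
bread <- solve(t(Xmat) %*% Xmat)
meat <- t(Xmat) %*% (Xmat * e.hat^2)        # sum_i X_i X_i' e_i^2
V.hc0 <- bread %*% meat %*% bread
sqrt(diag(V.hc0))                           # robust standard errors
```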
Heteroskedasticity

• No assumption of homoskedasticity
• Heteroskedasticity will definitely occur when:
  - the CEF is linear, but $\sigma^2(x) = V[Y_i|X_i = x]$ is not constant in $x$
  - $E[Y_i|X_i]$ is not linear, but we use linear regression to approximate it

22/58
2/ Regression and Causality

23/58
Regression and causality

• In most econometrics textbooks, regression is defined without respect to causality
• But then, when is $\hat{\beta}$ "biased"? The derivations above go through for any CEF $E[Y_i|X_i]$
• The question, then, is: when does knowing the CEF tell us something about causality?
• MHE argues that a regression is causal when the CEF it approximates is causal. Identification is king.
• We will show that, under certain conditions, a regression of the outcome on the treatment and the covariates can recover a causal parameter, but perhaps not the one in which we are interested

24/58
Review

• Quick reminder: we have potential outcomes $Y_i(1)$ and $Y_i(0)$ and two parameters, the ATE and ATT:

$$\tau = E[Y_i(1) - Y_i(0)]$$
$$\tau_{\text{ATT}} = E[Y_i(1) - Y_i(0) \mid D_i = 1]$$

• We have shown in past weeks that these effects are identified when ignorability holds; MHE calls this the conditional independence assumption (CIA)

25/58
Linear constant effects model, binary treatment

• Start with a simple experiment: we can rewrite the consistency assumption as a regression formula:

$$\begin{aligned}
Y_i &= D_i Y_i(1) + (1 - D_i) Y_i(0) \\
&= Y_i(0) + (Y_i(1) - Y_i(0)) D_i \\
&= E[Y_i(0)] + \tau D_i + (Y_i(0) - E[Y_i(0)]) \\
&= \mu_0 + \tau D_i + v_{0i}
\end{aligned}$$

• Note that if ignorability holds (as in an experiment) for $Y_i(0)$, then it will also hold for $v_{0i}$, since $E[Y_i(0)]$ is constant. Thus this satisfies the usual assumptions for regression

26/58
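A quick simulated experiment (illustrative, not from the slides) confirms that the OLS slope on $D_i$ is exactly the difference in means:

```r
# In a randomized experiment with a constant effect tau = 2, the OLS
# coefficient on d equals the treated-minus-control difference in means.
set.seed(3)
n <- 1000
y0 <- rnorm(n, mean = 10)
y1 <- y0 + 2                          # constant treatment effect
d <- rbinom(n, size = 1, prob = 0.5)  # random assignment
y <- ifelse(d == 1, y1, y0)
tau.ols <- unname(coef(lm(y ~ d))["d"])
tau.dim <- mean(y[d == 1]) - mean(y[d == 0])
c(ols = tau.ols, diff.means = tau.dim)
```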
Now with covariates

• Now assume no unmeasured confounders: $Y_i(d) \perp\!\!\!\perp D_i \mid X_i$
• We will assume a linear model for the potential outcomes:

$$Y_i(d) = \alpha + \tau \cdot d + \eta_i$$

• Remember that linearity isn't an assumption if $D_i$ is binary
• The effect of $D_i$ is constant here; the $\eta_i$ are the only source of individual variation, and we have $E[\eta_i] = 0$
• The consistency assumption allows us to write this as

$$Y_i = \alpha + \tau D_i + \eta_i$$

27/58
Covariates in the error

• Let's assume that $\eta_i$ is linear in $X_i$: $\eta_i = X_i'\gamma + \nu_i$
• The new error is uncorrelated with $X_i$: $E[\nu_i|X_i] = 0$
• This is an assumption! It might be false!
• Plug into the above:

$$\begin{aligned}
E[Y_i(d)|X_i] = E[Y_i|D_i, X_i] &= \alpha + \tau D_i + E[\eta_i|X_i] \\
&= \alpha + \tau D_i + X_i'\gamma + E[\nu_i|X_i] \\
&= \alpha + \tau D_i + X_i'\gamma
\end{aligned}$$

28/58
Summing up regression with constant effects

• Reviewing the assumptions we've used:
  - no unmeasured confounders
  - constant treatment effects
  - linearity of the treatment/covariates
• Under these, we can run the following regression to estimate the ATE, $\tau$:

$$Y_i = \alpha + \tau D_i + X_i'\gamma + \nu_i$$

• Works with continuous or ordinal $D_i$ if the effect of these variables is truly linear

29/58
OLS constant effects simulation

Model with linear covariates and a constant 0 effect of the treatment:

library(mvtnorm)
n <- 100
p <- 4
X <- rmvnorm(n = 100, mean = rep(0, p))
gamma <- c(27.4, 13.7, 13.7, 13.7)
y <- 210 + X %*% c(gamma) + rnorm(n)
alpha <- c(-1, 0.5, -0.5, -0.1)
d.probs <- boot::inv.logit(X %*% alpha)
d <- rbinom(n, size = 1, prob = d.probs)

30/58
OLS with no covariates

summary(lm(y ~ d))

## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   220.36       4.58   48.13  < 2e-16
## d             -28.69       6.54   -4.39 0.000029
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33 on 98 degrees of freedom
## Multiple R-squared:  0.164,  Adjusted R-squared:  0.156
## F-statistic: 19.2 on 1 and 98 DF,  p-value: 0.000029

31/58
OLS with covariates

summary(lm(y ~ d + X))

## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  209.952      0.159  1320.7   <2e-16
## d             -0.128      0.253    -0.5     0.62
## X1            27.368      0.126   217.5   <2e-16
## X2            13.677      0.114   120.1   <2e-16
## X3            13.673      0.130   105.1   <2e-16
## X4            13.570      0.106   128.0   <2e-16
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1 on 94 degrees of freedom
## Multiple R-squared:  0.999,  Adjusted R-squared:  0.999
## F-statistic: 2.44e+04 on 5 and 94 DF,  p-value: <2e-16

32/58
What happens with nonlinearity?

Suppose we can only observe the following covariates:

z1 <- exp(X[, 1] / 2)
z2 <- X[, 2] / (1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3] / 25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

Implies that $Y_i$ and $D_i$ are functions of $\log(Z_{i1})$, $Z_{i2}$, $Z_{i1}^2 Z_{i2}$, $1/\log(Z_{i1})$, $Z_{i3}^{1/3}/\log(Z_{i1})$, and $Z_{i4}^{1/2}$

The regression is a nonlinear function of the observed covariates

33/58
When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  218.728    300.799    0.73    0.469
## d            -69.292     33.854   -2.05    0.043
## z1           363.110     26.970   13.46   <2e-16
## z2           -29.033     36.619   -0.79    0.430
## z3           862.022    431.030    2.00    0.048
## z4            0.4021     0.0329   12.23   <2e-16
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14 on 94 degrees of freedom
## Multiple R-squared:  0.855,  Adjusted R-squared:  0.847
## F-statistic: 111 on 5 and 94 DF,  p-value: <2e-16

34/58
3/ Regression with Heterogeneous Treatment Effects

35/58
Heterogeneous effects, binary treatment

• Completely randomized experiment:

$$\begin{aligned}
Y_i &= D_i Y_i(1) + (1 - D_i) Y_i(0) \\
&= Y_i(0) + (Y_i(1) - Y_i(0)) D_i \\
&= \mu_0 + \tau_i D_i + (Y_i(0) - \mu_0) \\
&= \mu_0 + \tau D_i + (Y_i(0) - \mu_0) + (\tau_i - \tau) \cdot D_i \\
&= \mu_0 + \tau D_i + \varepsilon_i
\end{aligned}$$

• The error term now includes two components:
  1. "Baseline" variation in the outcome: $(Y_i(0) - \mu_0)$
  2. Variation in the treatment effect: $(\tau_i - \tau)$
• Easy to verify that under an experiment, $E[\varepsilon_i|D_i] = 0$
• Thus OLS estimates the ATE with no covariates

36/58
Adding covariates

• What happens with no unmeasured confounders? We need to condition on $X_i$ now
• Remember identification of the ATE/ATT using iterated expectations
• The ATE is the weighted sum of conditional average treatment effects (CATEs):

$$\tau = \sum_x \tau(x) \Pr[X_i = x]$$

• The ATE/ATT are weighted averages of CATEs
• What about the regression estimand, $\tau_R$? How does it relate to the ATE/ATT?

37/58
Heterogeneous effects and regression

• Let's investigate this under a saturated regression model:

$$Y_i = \sum_x B_{xi}\alpha_x + \tau_R D_i + e_i$$

• Use a dummy variable for each unique combination of $X_i$: $B_{xi} = \mathbb{I}(X_i = x)$
• Linear in $X_i$ by construction

38/58
Investigating the regression coefficient

• How can we investigate $\tau_R$? We can rely on the regression anatomy:

$$\tau_R = \frac{\mathrm{Cov}(Y_i,\, D_i - E[D_i|X_i])}{V(D_i - E[D_i|X_i])}$$

• $D_i - E[D_i|X_i]$ is the residual from a regression of $D_i$ on the full set of dummies
• With a little work, we can show:

$$\tau_R = \frac{E\big[\tau(X_i)(D_i - E[D_i|X_i])^2\big]}{E\big[(D_i - E[D_i|X_i])^2\big]} = \frac{E[\tau(X_i)\,\sigma_d^2(X_i)]}{E[\sigma_d^2(X_i)]}$$

• $\sigma_d^2(x) = V[D_i|X_i = x]$ is the conditional variance of treatment assignment

39/58
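The regression-anatomy identity can be verified in a sample: the coefficient on $D_i$ in the saturated regression equals the slope from the residualized treatment. A sketch with made-up strata (illustrative):

```r
# Regression anatomy: the coefficient on d from the full regression
# equals cov(y, d.res) / var(d.res), where d.res residualizes d on the
# strata dummies.
set.seed(4)
n <- 1000
x <- sample(c("a", "b", "c"), n, replace = TRUE)
e.x <- c(a = 0.2, b = 0.5, c = 0.8)[x]   # propensity varies by stratum
d <- rbinom(n, 1, e.x)
y <- 1 + 2 * d + (x == "b") - (x == "c") + rnorm(n)
full <- lm(y ~ d + factor(x))
d.res <- resid(lm(d ~ factor(x)))
anatomy <- cov(y, d.res) / var(d.res)
c(full = unname(coef(full)["d"]), anatomy = anatomy)
```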
ATE versus OLS

$$\tau_R = E[\tau(X_i) W_i] = \sum_x \tau(x)\, \frac{\sigma_d^2(x)}{E[\sigma_d^2(X_i)]}\, \Pr[X_i = x]$$

• Compare to the ATE:

$$\tau = E[\tau(X_i)] = \sum_x \tau(x) \Pr[X_i = x]$$

• Both weight strata relative to their size, $\Pr[X_i = x]$
• OLS weights a stratum higher if the treatment variance $\sigma_d^2(x)$ in that stratum is high relative to the average variance across strata, $E[\sigma_d^2(X_i)]$
• The ATE weights strata only by their size

40/58
Regression weighting

$$W_i = \frac{\sigma_d^2(X_i)}{E[\sigma_d^2(X_i)]}$$

• Why does OLS weight like this?
• OLS is a minimum-variance estimator ⇝ more weight to more precise within-strata estimates
• Within-strata estimates are most precise when the treatment is evenly spread, and thus has the highest variance
• If $D_i$ is binary, then we know the conditional variance will be

$$\sigma_d^2(x) = \Pr[D_i = 1|X_i = x]\,(1 - \Pr[D_i = 1|X_i = x]) = e(x)(1 - e(x))$$

• Maximum variance with $\Pr[D_i = 1|X_i = x] = 1/2$

41/58
OLS weighting example

• Binary covariate:

$$\Pr[X_i = 1] = 0.75 \qquad \Pr[X_i = 0] = 0.25$$
$$\Pr[D_i = 1|X_i = 1] = 0.9 \qquad \Pr[D_i = 1|X_i = 0] = 0.5$$
$$\sigma_d^2(1) = 0.09 \qquad \sigma_d^2(0) = 0.25$$
$$\tau(1) = 1 \qquad \tau(0) = -1$$

• Implies the ATE is $\tau = 0.5$
• Average conditional variance: $E[\sigma_d^2(X_i)] = 0.13$
• ⇝ weights: for $X_i = 1$, $0.09/0.13 = 0.692$; for $X_i = 0$, $0.25/0.13 = 1.92$

$$\begin{aligned}
\tau_R &= E[\tau(X_i) W_i] \\
&= \tau(1) W(1) \Pr[X_i = 1] + \tau(0) W(0) \Pr[X_i = 0] \\
&= 1 \times 0.692 \times 0.75 + -1 \times 1.92 \times 0.25 \\
&= 0.039
\end{aligned}$$

42/58
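The slide's arithmetic can be checked in a few lines; with unrounded weights the exact value is $0.005/0.13 \approx 0.0385$, which rounds to the 0.039 shown above:

```r
# Reproduce the weighting example: the ATE is 0.5, but tau_R is near 0.
p.x <- c(0.25, 0.75)            # P[X = 0], P[X = 1]
e.x <- c(0.5, 0.9)              # P[D = 1 | X = x]
tau.x <- c(-1, 1)               # CATEs tau(0), tau(1)
v.x <- e.x * (1 - e.x)          # sigma^2_d(x)
ate <- sum(tau.x * p.x)
tau.r <- sum(tau.x * v.x * p.x) / sum(v.x * p.x)
c(ATE = ate, OLS = tau.r)
```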
When will OLS estimate the ATE?

• When does $\tau = \tau_R$?
• Constant treatment effects: $\tau(x) = \tau = \tau_R$
• Constant probability of treatment: $e(x) = \Pr[D_i = 1|X_i = x] = e$
  - Implies that the OLS weights are 1
• An incorrect linearity assumption in $X_i$ will lead to more bias

43/58
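A simulation sketch of the two cases (illustrative, not from the slides), using the heterogeneous effects from the example above: a constant propensity recovers the ATE of 0.5, while a stratum-varying propensity pulls the OLS estimate toward zero:

```r
# With strata dummies, constant e(x) recovers the ATE; varying e(x)
# over-weights the high-variance (X = 0) stratum.
set.seed(5)
n <- 100000
x <- rbinom(n, 1, 0.75)
tau.i <- ifelse(x == 1, 1, -1)                   # CATEs as on slide 42
d.vary <- rbinom(n, 1, ifelse(x == 1, 0.9, 0.5))
d.const <- rbinom(n, 1, 0.5)
y.vary <- tau.i * d.vary + x + rnorm(n)
y.const <- tau.i * d.const + x + rnorm(n)
b.vary <- unname(coef(lm(y.vary ~ d.vary + x))["d.vary"])
b.const <- unname(coef(lm(y.const ~ d.const + x))["d.const"])
round(c(vary = b.vary, const = b.const), 2)
```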
Other ways to use regression

• What's the path forward?
  - Accept the bias (it might be relatively small with saturated models)
  - Use a different regression approach
• Let $\mu_d(x) = E[Y_i(d)|X_i = x]$ be the CEF for the potential outcome under $D_i = d$
• By consistency and no unmeasured confounders, we have $\mu_d(x) = E[Y_i|D_i = d, X_i = x]$
• Estimate a regression of $Y_i$ on $X_i$ among the $D_i = d$ group
• Then $\hat{\mu}_d(x)$ is just a predicted value from that regression at $X_i = x$
• How can we use this?

44/58
Imputation estimators

• Impute the treated potential outcomes with $\hat{Y}_i(1) = \hat{\mu}_1(X_i)$
• Impute the control potential outcomes with $\hat{Y}_i(0) = \hat{\mu}_0(X_i)$
• Procedure:
  - Regress $Y_i$ on $X_i$ in the treated group and get predicted values for all units (treated or control)
  - Regress $Y_i$ on $X_i$ in the control group and get predicted values for all units (treated or control)
  - Take the average difference between these predicted values
• More mathematically, this looks like:

$$\hat{\tau}_{imp} = \frac{1}{N}\sum_i \big[\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)\big]$$

• Sometimes called an imputation estimator

45/58
Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample
• Useful trick: fit a model on the entire data and use model.frame() to get the right design matrix

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

## [1] 0.61

46/58
Notes on imputation estimators

• If the $\hat{\mu}_d(x)$ are consistent estimators, then $\hat{\tau}_{imp}$ is consistent for the ATE
• Why don't people use this?
  - Most people don't know the results we've been talking about
  - Harder to implement than vanilla OLS
• Can use linear regression to estimate $\hat{\mu}_d(x) = x'\hat{\beta}_d$
• Recent trend is to estimate $\hat{\mu}_d(x)$ via nonparametric methods such as kernel regression, local linear regression, regression trees, etc.
  - Easiest is generalized additive models (GAMs)

47/58
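A self-contained sketch of the whole pipeline under heterogeneous effects (the data-generating choices and names here are illustrative, not from the slides):

```r
# Imputation estimator: fit separate regressions by treatment group,
# predict both potential outcomes for every unit, average the difference.
set.seed(6)
n <- 5000
x <- rnorm(n)
d <- rbinom(n, 1, plogis(x))     # propensity depends on x (confounding)
tau.i <- 1 + x                   # heterogeneous effects; true ATE = 1
y <- x + d * tau.i + rnorm(n)
dat <- data.frame(y, d, x)
mod1 <- lm(y ~ x, data = dat, subset = d == 1)
mod0 <- lm(y ~ x, data = dat, subset = d == 0)
tau.imp <- mean(predict(mod1, dat) - predict(mod0, dat))
tau.imp                          # close to the true ATE of 1
```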
Imputation estimator visualization

[Figure, slides 48-50: scatterplot of y against x for treated and control units, with the within-group regression fits used to impute each group's potential outcomes]

48-50/58
Nonlinear relationships

• Same idea, but with a nonlinear relationship between $Y_i$ and $X_i$

[Figure, slides 51-53: treated and control scatterplots with nonlinear group-specific CEFs]

51-53/58
Using semiparametric regression

• Here the CEFs are nonlinear, but we don't know their form
• We can use GAMs from the mgcv package for a flexible estimate:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

## Family: gaussian
## Link function: identity
##
## Formula:
## y ~ s(x)
##
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.0225     0.0154   -1.46     0.16
##
## Approximate significance of smooth terms:
##       edf Ref.df    F p-value
## s(x) 6.03   7.08 41.3  <2e-16
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.909   Deviance explained = 92.8%
## GCV = 0.0093204  Scale est. = 0.0071351  n = 30

54/58
Using GAMs

[Figure, slides 55-57: GAM fits for the treated and control groups track the nonlinear CEFs]

55-57/58
Limited dependent variables

• Usual advice: model the data from first principles: logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is A-OK with limited DVs when:
  - binary treatment and no covariates (just a difference in means)
  - binary treatment, discrete covariates, and saturated models (stratified differences in means)
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong
• Even in unsaturated models, the marginal effects from OLS are often decent compared to nonlinear models
  - Could go wrong in small samples
  - If using nonlinear models, always get effects on the scale of the outcome

58/58
Compare to within-strata means

• The saturated model makes no assumptions about the between-strata relationships
• It just calculates within-strata means:

c1 <- coef(lm(aauw ~ as.factor(totchi) - 1, data = girls))
c2 <- with(girls, tapply(aauw, totchi, mean, na.rm = TRUE))
rbind(c1, c2)

##     0  1  2  3  4  5  6  7  8 9 10 12
## c1 56 62 53 43 37 41 23 39 11 6  3  0
## c2 56 62 53 43 37 41 23 39 11 6  3  0

16/58
Other justifications for OLS

• Justification 2: $X_i'\beta$ is the best linear predictor (in a mean squared error sense) of $Y_i$
  - Why? $\beta = \arg\min_b E[(Y_i - X_i'b)^2]$
• Justification 3: $X_i'\beta$ provides the minimum mean squared error linear approximation to $E[Y_i|X_i]$
• Even if the CEF is not linear, a linear regression provides the best linear approximation to that CEF
• Don't need to believe the assumptions (linearity) in order to use regression as a good approximation to the CEF
• Warning: if the CEF is very nonlinear, then this approximation could be terrible

17/58
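Justification 2 is easy to check numerically: perturbing the OLS coefficients in any direction can only raise the in-sample mean squared error. A minimal sketch (illustrative):

```r
# OLS coefficients minimize mean squared error among linear predictors,
# even when the true CEF is nonlinear.
set.seed(7)
n <- 500
x <- rnorm(n)
y <- sin(x) + rnorm(n, sd = 0.2)          # nonlinear CEF
Xmat <- cbind(1, x)
b.ols <- drop(solve(t(Xmat) %*% Xmat, t(Xmat) %*% y))
mse <- function(b) mean((y - Xmat %*% b)^2)
c(ols = mse(b.ols), perturbed = mse(b.ols + c(0.1, -0.1)))
```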
The error terms
bull Letrsquos define the error term 119890119894 equiv 119884119894 minus 119883prime119894 120573 so that
119884119894 = 119883prime119894 120573 + [119884119894 minus 119883prime
119894 120573] = 119883prime119894 120573 + 119890119894
bull Note the residual 119890119894 is uncorrelated with 119883119894
120124[119883119894119890119894] = 120124[119883119894(119884119894 minus 119883prime119894 120573)]
= 120124[119883119894119884119894] minus 120124[119883119894119883prime119894 120573]
= 120124[119883119894119884119894] minus 120124 1114094119883119894119883prime119894120124[119883119894119883prime
119894 ]minus1113568120124[119883119894119884119894]1114097= 120124[119883119894119884119894] minus 120124[119883119894119883prime
119894 ]120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]
= 120124[119883119894119884119894] minus 120124[119883119894119884119894] = 0
bull No assumptions on the linearity of 120124[119884119894|119883119894]
18 58
OLS estimator
bull We know the population value of 120573 is
120573 = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]
bull How do we get an estimator of thisbull Plug-in principle replace population expectation with
sample versions
=
โกโขโขโขโขโขโฃ1119873
1114012119894119883119894119883prime
119894
โคโฅโฅโฅโฅโฅโฆ
minus11135681119873
1114012119894119883119894119884119894
bull If you work through the matrix algebra this turns out to be
= (119831prime119831)minus1113568 119831prime119858
19 58
Asymptotic OLS inference
bull With this representation in hand we can write the OLSestimator as follows
= 120573 +
โกโขโขโขโขโขโฃ1114012
119894119883119894119883prime
119894
โคโฅโฅโฅโฅโฅโฆ
minus1113568
1114012119894119883119894119890119894
bull Core idea sum119894119883119894119890119894 is the sum of rvs so the CLT applies
bull That plus some simple asymptotic theory allows us to say
radic119873( minus 120573) 119873(0ฮฉ)
bull Converges in distribution to a Normal distribution with meanvector 0 and covariance matrix ฮฉ
ฮฉ = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119883prime
119894 1198901113569119894 ]120124[119883119894119883prime119894 ]minus1113568
bull No linearity assumption needed20 58
Estimating the variance
bull In large samples then
radic119873( minus 120573) sim 119873(0ฮฉ)
bull How to estimate ฮฉ Plug-in principle again
1114023ฮฉ =
โกโขโขโขโขโขโฃ1114012
119894119883119894119883prime
119894
โคโฅโฅโฅโฅโฅโฆ
minus1113568 โกโขโขโขโขโขโฃ1114012119894119883119894119883prime
119894 1113569119894
โคโฅโฅโฅโฅโฅโฆ
โกโขโขโขโขโขโฃ1114012
119894119883119894119883prime
119894
โคโฅโฅโฅโฅโฅโฆ
minus1113568
bull Replace 119890119894 with its emprical counterpart (residuals)119894 = 119884119894 minus 119883prime
119894 bull Replace the population moments of 119883119894 with their sample
counterpartsbull The square root of the diagonals of this covariance matrix are
the ldquorobustrdquo or Huber-White standard errors that Statacommonly report
21 58
Heteroskedasticity
bull No assumptions of homoskedasticitybull Heteroskedaticity will definitely occur when
CEF is linear but the 1205901113569(119909) = 120141[119884119894|119883119894 = 119909] is not constant in 119909 119864[119884119894|119883119894] is not linear but we use the linear regression to
approxmiate it
22 58
2 Regression andCausality
23 58
Regression and causality
bull Most econometrics textbooks regression defined withoutrespect to causality
bull But then when is ldquobiasedrdquo The above derivations work forsome 120124[119884119894|119883119894]
bull The question then is when does knowing the CEF tell ussomething about causality
bull MHE argues that a regression is causal when the CEF itapproximates is causal Identification is king
bull We will show that under certain conditions a regression of theoutcome on the treatment and the covariates can recover acausal parameter but perhaps not the one in which we areinterested
24 58
Review
bull Quick reminder we have potential outcomes 119884119894(1) and 119884119894(0)and two parameters the ATE and ATT
120591 = 119864[119884119894(1) minus 119884119894(0)]120591ATT = 119864[119884119894(1) minus 119884119894(0)|119863119894 = 1]
bull We have shown in past weeks that these effects are identifiedwhen ignorability holds MHE calls this the conditionalindependence assumption (CIA)
25 58
Linear constant effects model binarytreatment
bull Experiment with a simple experiment we can rewrite theconsistency assumption to be a regression formula
119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894
= 120124[119884119894(0)] + 120591119863119894 + (119884119894(0) minus 120124[119884119894(0)])= 1205831113567 + 120591119863119894 + 1199071113567119894
bull Note that if ignorability holds (as in an experiment) for 119884119894(0)then it will also hold for 1199071113567119894 since 120124[119884119894(0)] is constant Thusthis satifies the usual assumptions for regression
26 58
Now with covariates
bull Now assume no unmeasured confounders 119884119894(119889) โโ 119863119894|119883119894bull We will assume a linear model for the potential outcomes
119884119894(119889) = 120572 + 120591 sdot 119889 + 120578119894
bull Remember that linearity isnrsquot an assumption if 119863119894 is binarybull Effect of 119863119894 is constant here the 120578119894 are the only source of
individual variation and we have 119864[120578119894] = 0bull Consistency assumption allows us to write this as
119884119894 = 120572 + 120591119863119894 + 120578119894
27 58
Covariates in the error
bull Letrsquos assume that 120578119894 is linear in 119883119894 120578119894 = 119883prime119894 120574 + 120584119894
bull New error is uncorrelated with 119883119894 120124[120584119894|119883119894] = 0bull This is an assumption Might be falsebull Plug into the above
120124[119884119894(119889)|119883119894] = 119864[119884119894|119863119894 119883119894] = 120572 + 120591119863119894 + 119864[120578119894|119883119894]= 120572 + 120591119863119894 + 119883prime
119894 120574 + 119864[120584119894|119883119894]= 120572 + 120591119863119894 + 119883prime
119894 120574
28 58
Summing up regression with constanteffects
bull Reviewing the assumptions wersquove used no unmeasured confounders constant treatment effects linearity of the treatmentcovariates
bull Under these we can run the following regression to estimatethe ATE 120591
119884119894 = 120572 + 120591119863119894 + 119883prime119894 120574 + 120584119894
bull Works with continuous or ordinal 119863119894 if linearity in the effect ofthese variables is truly linear
29 58
OLS constant effects simulation
Model with linear covariates constant 0 effect of treatment
library(mvtnorm)
n lt- 100
p lt- 4
X lt- rmvnorm(n = 100 mean = rep(0 p))
gamma lt- c(274 137 137 137)
y lt- 210 + X c(gamma) + rnorm(n)
alpha lt- c(-1 05 -05 -01)
dprobs lt- bootinvlogit(X alpha)
d lt- rbinom(n size = 1 prob = dprobs)
30 58
OLS with no covariates
summary(lm(y ~ d))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 22036 458 4813 lt 2e-16
d -2869 654 -439 0000029
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 33 on 98 degrees of freedom
Multiple R-squared 0164 Adjusted R-squared 0156
F-statistic 192 on 1 and 98 DF p-value 0000029
31 58
OLS with covariates
summary(lm(y ~ d + X))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 209952 0159 13207 lt2e-16
d -0128 0253 -05 062
X1 27368 0126 2175 lt2e-16
X2 13677 0114 1201 lt2e-16
X3 13673 0130 1051 lt2e-16
X4 13570 0106 1280 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 1 on 94 degrees of freedom
Multiple R-squared 0999 Adjusted R-squared 0999
F-statistic 244e+04 on 5 and 94 DF p-value lt2e-16
32 58
What happens with nonlinearity
Suppose we can only observe the following covariates
z1 lt- exp(X[ 1]2)
z2 lt- X[ 2](1 + exp(X[ 1])) + 10
z3 lt- (X[ 1] X[ 3]25 + 06)^3
z4 lt- (X[ 2] + X[ 4] + 20)^2
Implies that 119884119894 and 119863119894 are functions of log(1198851198941113568) 1198851198941113569 119885111356911989411135681198851198941113569
1 log(1198851198941113568) 1198851198941113570 log(1198851198941113568) and 119883111356811135691198941113571
Regression is a nonlinear function of the observed covariates
33 58
When linearity goes wrong
summary(lm(y ~ d + z1 + z2 + z3 + z4))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 218728 300799 073 0469
d -69292 33854 -205 0043
z1 363110 26970 1346 lt2e-16
z2 -29033 36619 -079 0430
z3 862022 431030 200 0048
z4 04021 00329 1223 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 14 on 94 degrees of freedom
Multiple R-squared 0855 Adjusted R-squared 0847
F-statistic 111 on 5 and 94 DF p-value lt2e-16
34 58
3 Regression withHeterogeneousTreatment Effects
35 58
Heterogeneous effects binary treatment
bull Completely randomized experiment
119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894
= 1205831113567 + 120591119894119863119894 + (119884119894(0) minus 1205831113567)= 1205831113567 + 120591119863119894 + (119884119894(0) minus 1205831113567) + (120591119894 minus 120591) sdot 119863119894
= 1205831113567 + 120591119863119894 + 120576119894
bull Error term now includes two components1 ldquoBaselinerdquo variation in the outcome (119884119894(0) minus 1205831113567)2 Variation in the treatment effect (120591119894 minus 120591)
bull Easy to verify that under experiment 120124[120576119894|119863119894] = 0bull Thus OLS estimates the ATE with no covariates
36 58
Adding covariates
bull What happens with no unmeasured confounders Need tocondition on 119883119894 now
bull Remember identification of the ATEATT using iteratedexpectations
bull ATE is the weighted sum of CATEs
120591 = 1114012119909120591(119909) Pr[119883119894 = 119909]
bull ATEATT are weighted averages of CATEsbull What about the regression estimand 120591119877 How does it related
to the ATEATT
37 58
Heterogeneous effects and regression
bull Letrsquos investigate this under a saturated regression model
119884119894 = 1114012119909119861119909119894120572119909 + 120591119877119863119894 + 119890119894
bull Use a dummy variable for each unique combination of 119883119894119861119909119894 = 120128(119883119894 = 119909)
bull Linear in 119883119894 by construction
38 58
Investigating the regression coefficient
bull How can we investigate 120591119877 Well we can rely on theregression anatomy
120591119877 = Cov(119884119894 119863119894 minus 119864[119863119894|119883119894])120141(119863119894 minus 119864[119863119894|119883119894])
bull 119863119894 minus 120124[119863119894|119883119894] is the residual from a regression of 119863119894 on the fullset of dummies
bull With a little work we can show
120591119877 =120124 1114094120591(119883119894)(119863119894 minus 120124[119863119894|119883119894])11135691114097
120124[(119863119894 minus 119864[119863119894|119883119894])1113569]=
120124[120591(119883119894)1205901113569119889(119883119894)]120124[1205901113569119889(119883119894)]
bull 1205901113569119889(119909) = 120141[119863119894|119883119894 = 119909] is the conditional variance of treatmentassignment
39 58
ATE versus OLS
120591119877 = 120124[120591(119883119894)119882119894] = 1114012119909120591(119909)
1205901113569119889(119909)120124[1205901113569119889(119883119894)]
โ[119883119894 = 119909]
bull Compare to the ATE
120591 = 120124[120591(119883119894)] = 1114012119909120591(119909)โ[119883119894 = 119909]
bull Both weight strata relative to their size (โ[119883119894 = 119909])bull OLS weights strata higher if the treatment variance in those
strata (1205901113569119889(119909)) is higher in those strata relative to the averagevariance across strata (120124[1205901113569119889(119883119894)])
bull The ATE weights only by their size
40 58
Regression weighting
119882119894 =1205901113569119889(119883119894)
120124[1205901113569119889(119883119894)]
bull Why does OLS weight like thisbull OLS is a minimum-variance estimator more weight to
more precise within-strata estimatesbull Within-strata estimates are most precise when the treatment
is evenly spread and thus has the highest variancebull If 119863119894 is binary then we know the conditional variance will be
1205901113569119889(119909) = โ[119863119894 = 1|119883119894 = 119909] (1 minus โ[119863119894 = 1|119883119894 = 119909])= 119890(119909) (1 minus 119890(119909))
bull Maximum variance with โ[119863119894 = 1|119883119894 = 119909] = 12
41 58
OLS weighting examplebull Binary covariate
โ[119883119894 = 1] = 075 โ[119883119894 = 0] = 025โ[119863119894 = 1|119883119894 = 1] = 09 โ[119863119894 = 1|119883119894 = 0] = 05
1205901113569119889(1) = 009 1205901113569119889(0) = 025120591(1) = 1 120591(0) = minus1
bull Implies the ATE is 120591 = 05bull Average conditional variance 120124[1205901113569119889(119883119894)] = 013bull weights for 119883119894 = 1 are 009013 = 0692 for 119883119894 = 0
025013 = 192120591119877 = 120124[120591(119883119894)119882119894]
= 120591(1)119882(1)โ[119883119894 = 1] + 120591(0)119882(0)โ[119883119894 = 0]= 1 times 0692 times 075 + minus1 times 192 times 025= 0039
42 58
When will OLS estimate the ATE
bull When does 120591 = 120591119877bull Constant treatment effects 120591(119909) = 120591 = 120591119877bull Constant probability of treatment 119890(119909) = โ[119863119894 = 1|119883119894 = 119909] = 119890
Implies that the OLS weights are 1
bull Incorrect linearity assumption in 119883119894 will lead to more bias
43 58
Other ways to use regression
• What's the path forward?
  - Accept the bias (it might be relatively small with saturated models)
  - Use a different regression approach
• Let $\mu_d(x) = E[Y_i(d) \mid X_i = x]$ be the CEF for the potential outcome under $D_i = d$
• By consistency and no unmeasured confounders, we have $\mu_d(x) = E[Y_i \mid D_i = d, X_i = x]$
• Estimate a regression of $Y_i$ on $X_i$ among the $D_i = d$ group
• Then $\hat{\mu}_d(x)$ is just a predicted value from that regression at $X_i = x$
• How can we use this?
44 58
Imputation estimators
• Impute the treated potential outcomes with $\hat{Y}_i(1) = \hat{\mu}_1(X_i)$
• Impute the control potential outcomes with $\hat{Y}_i(0) = \hat{\mu}_0(X_i)$
• Procedure:
  - Regress $Y_i$ on $X_i$ in the treated group and get predicted values for all units (treated or control)
  - Regress $Y_i$ on $X_i$ in the control group and get predicted values for all units (treated or control)
  - Take the average difference between these predicted values
• More formally, this looks like:
$$\hat{\tau}_{imp} = \frac{1}{N}\sum_i \left[\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)\right]$$
• Sometimes called an imputation estimator
45 58
Simple imputation estimator
• Use predict() from the within-group models on data from the entire sample
• Useful trick: fit a model on the entire data and use model.frame() to get the right design matrix

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61
46 58
Notes on imputation estimators
• If the $\hat{\mu}_d(x)$ are consistent estimators, then $\hat{\tau}_{imp}$ is consistent for the ATE
• Why don't people use this?
  - Most people don't know the results we've been talking about
  - Harder to implement than vanilla OLS
• Can use linear regression to estimate $\hat{\mu}_d(x) = x'\hat{\beta}_d$
• Recent trend is to estimate $\hat{\mu}_d(x)$ via nonparametric methods such as kernel regression, local linear regression, regression trees, etc.
• Easiest is generalized additive models (GAMs)
47 58
Imputation estimator visualization

[Figure: simulated y plotted against x for treated and control units, built up over three slides: the raw data, the within-group regression fits, and the imputed values from each fit extended over the whole sample]

48 58
Nonlinear relationships

• Same idea, but with a nonlinear relationship between $Y_i$ and $X_i$

[Figure: y plotted against x for treated and control units whose CEFs are nonlinear, built up over three slides; linear within-group fits approximate these CEFs poorly]

51 58
Using semiparametric regression

• Here the CEFs are nonlinear, but we don't know their form
• We can use GAMs from the mgcv package for flexible estimation

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204  Scale est. = 0.0071351  n = 30
54 58
Using GAMs

[Figure: the nonlinear data with flexible GAM fits for the treated and control groups, built up over three slides; the smooth fits track the nonlinear CEFs closely]

55 58
Limited dependent variables
• Usual advice: model the data from first principles (logit/probit for binary outcomes, Poisson for counts, etc.)
• OLS is a-ok with limited DVs when:
  - binary treatment and no covariates (just the difference in means)
  - binary treatment, discrete covariates, and saturated models (stratified differences in means)
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong
• Even in unsaturated models, the marginal effects from OLS are often decent compared to nonlinear models
  - Could go wrong in small samples
  - If using nonlinear models, always get effects on the scale of the outcome
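The first "OLS is a-ok" case can be checked in a couple of lines on hypothetical simulated data: even with a binary outcome, the OLS coefficient on a binary treatment is exactly the difference in means.

```r
## Binary outcome, binary treatment: lm() reproduces the diff-in-means
set.seed(1)
n <- 1000
d <- rbinom(n, 1, 0.5)
y <- rbinom(n, 1, 0.3 + 0.2 * d)          # true risk difference of 0.2
ols_coef <- coef(lm(y ~ d))["d"]
diff_means <- mean(y[d == 1]) - mean(y[d == 0])
```

The two quantities agree to machine precision, so no distributional assumption is being imposed by using OLS here.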
58 58
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
Other justifications for OLS
• Justification 2: $X_i'\beta$ is the best linear predictor (in a mean-squared-error sense) of $Y_i$
  - Why? $\beta = \arg\min_b E[(Y_i - X_i'b)^2]$
• Justification 3: $X_i'\beta$ provides the minimum mean-squared-error linear approximation to $E[Y_i \mid X_i]$
• Even if the CEF is not linear, a linear regression provides the best linear approximation to that CEF
• You don't need to believe the assumptions (linearity) in order to use regression as a good approximation to the CEF
• Warning: if the CEF is very nonlinear, then this approximation could be terrible
17 58
The error terms
• Let's define the error term $e_i \equiv Y_i - X_i'\beta$, so that
$$Y_i = X_i'\beta + [Y_i - X_i'\beta] = X_i'\beta + e_i$$
• Note that the error $e_i$ is uncorrelated with $X_i$:
$$\begin{aligned}
E[X_i e_i] &= E[X_i(Y_i - X_i'\beta)] \\
&= E[X_i Y_i] - E[X_i X_i'\beta] \\
&= E[X_i Y_i] - E\left[X_i X_i'\, E[X_i X_i']^{-1} E[X_i Y_i]\right] \\
&= E[X_i Y_i] - E[X_i X_i']\, E[X_i X_i']^{-1} E[X_i Y_i] \\
&= E[X_i Y_i] - E[X_i Y_i] = 0
\end{aligned}$$
• No assumptions on the linearity of $E[Y_i \mid X_i]$
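The same orthogonality holds exactly in-sample for the OLS residuals; a quick sketch with hypothetical simulated data:

```r
## OLS residuals are exactly orthogonal to the columns of X
set.seed(1)
N <- 200
X <- cbind(1, rnorm(N), rnorm(N))                # design matrix with intercept
y <- X %*% c(1, 2, -1) + rnorm(N)
beta_hat <- solve(crossprod(X), crossprod(X, y)) # (X'X)^{-1} X'y
e_hat <- y - X %*% beta_hat
crossprod(X, e_hat)                              # numerically a zero vector
```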
18 58
OLS estimator
• We know the population value of $\beta$ is
$$\beta = E[X_i X_i']^{-1} E[X_i Y_i]$$
• How do we get an estimator of this?
• Plug-in principle: replace the population expectations with their sample versions:
$$\hat{\beta} = \left[\frac{1}{N}\sum_i X_i X_i'\right]^{-1} \frac{1}{N}\sum_i X_i Y_i$$
• If you work through the matrix algebra, this turns out to be
$$\hat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$$
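A quick numerical sketch (hypothetical simulated data) that the plug-in moment formula, the matrix formula, and lm() all agree:

```r
## Moment version vs. (X'X)^{-1} X'y vs. lm(): all identical
set.seed(2)
N <- 500
x1 <- rnorm(N); x2 <- rnorm(N)
X <- cbind(1, x1, x2)
y <- drop(X %*% c(1, 2, -1) + rnorm(N))
b_moment <- solve(crossprod(X) / N, crossprod(X, y) / N)  # the 1/N factors cancel
b_matrix <- solve(crossprod(X), crossprod(X, y))
b_lm     <- coef(lm(y ~ x1 + x2))
```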
19 58
Asymptotic OLS inference
• With this representation in hand, we can write the OLS estimator as follows:
$$\hat{\beta} = \beta + \left[\sum_i X_i X_i'\right]^{-1} \sum_i X_i e_i$$
• Core idea: $\sum_i X_i e_i$ is a sum of random variables, so the CLT applies
• That, plus some simple asymptotic theory, allows us to say
$$\sqrt{N}(\hat{\beta} - \beta) \xrightarrow{d} N(0, \Omega)$$
• Converges in distribution to a Normal distribution with mean vector 0 and covariance matrix $\Omega$:
$$\Omega = E[X_i X_i']^{-1}\, E[X_i X_i' e_i^2]\, E[X_i X_i']^{-1}$$
• No linearity assumption needed

20 58
Estimating the variance
• In large samples, then, $\sqrt{N}(\hat{\beta} - \beta)$ is approximately $N(0, \Omega)$
• How to estimate $\Omega$? Plug-in principle again:
$$\hat{\Omega} = \left[\sum_i X_i X_i'\right]^{-1} \left[\sum_i X_i X_i' \hat{e}_i^2\right] \left[\sum_i X_i X_i'\right]^{-1}$$
• Replace $e_i$ with its empirical counterpart, the residuals $\hat{e}_i = Y_i - X_i'\hat{\beta}$
• Replace the population moments of $X_i$ with their sample counterparts
• The square roots of the diagonal of this covariance matrix are the "robust" or Huber-White standard errors that Stata commonly reports
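The sandwich can be computed directly from this formula; a sketch on hypothetical heteroskedastic data (with the sandwich package installed, this should match vcovHC(mod, type = "HC0")):

```r
## Huber-White (HC0) standard errors computed from the sandwich formula
set.seed(3)
N <- 300
x <- rnorm(N)
y <- 1 + 2 * x + rnorm(N) * (1 + abs(x))   # heteroskedastic errors
X <- cbind(1, x)
b <- solve(crossprod(X), crossprod(X, y))
e_hat <- c(y - X %*% b)

bread <- solve(crossprod(X))               # [sum_i X_i X_i']^{-1}
meat  <- crossprod(X * e_hat)              # sum_i X_i X_i' e_i^2
V_hc0 <- bread %*% meat %*% bread
sqrt(diag(V_hc0))                          # robust standard errors
```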
21 58
Heteroskedasticity
• No assumption of homoskedasticity
• Heteroskedasticity will definitely occur when:
  - the CEF is linear, but $\sigma^2(x) = V[Y_i \mid X_i = x]$ is not constant in $x$
  - $E[Y_i \mid X_i]$ is not linear, but we use linear regression to approximate it
22 58
2 Regression and Causality
23 58
Regression and causality
• Most econometrics textbooks: regression is defined without respect to causality
• But then, when is OLS "biased"? The derivations above hold for whatever CEF $E[Y_i \mid X_i]$ happens to be
• The question, then, is: when does knowing the CEF tell us something about causality?
• MHE argues that a regression is causal when the CEF it approximates is causal. Identification is king
• We will show that, under certain conditions, a regression of the outcome on the treatment and the covariates can recover a causal parameter, but perhaps not the one in which we are interested
24 58
Review
• Quick reminder: we have potential outcomes $Y_i(1)$ and $Y_i(0)$ and two parameters, the ATE and ATT:
$$\tau = E[Y_i(1) - Y_i(0)]$$
$$\tau_{ATT} = E[Y_i(1) - Y_i(0) \mid D_i = 1]$$
• We have shown in past weeks that these effects are identified when ignorability holds. MHE calls this the conditional independence assumption (CIA)
25 58
Linear constant effects model, binary treatment

• Experiment: with a simple experiment, we can rewrite the consistency assumption as a regression formula:
$$\begin{aligned}
Y_i &= D_i Y_i(1) + (1 - D_i)Y_i(0) \\
&= Y_i(0) + (Y_i(1) - Y_i(0))D_i \\
&= E[Y_i(0)] + \tau D_i + (Y_i(0) - E[Y_i(0)]) \\
&= \mu_0 + \tau D_i + v_{0i}
\end{aligned}$$
• Note that if ignorability holds (as in an experiment) for $Y_i(0)$, then it will also hold for $v_{0i}$, since $E[Y_i(0)]$ is constant. Thus this satisfies the usual assumptions for regression
26 58
Now with covariates
• Now assume no unmeasured confounders: $Y_i(d) \perp\!\!\!\perp D_i \mid X_i$
• We will assume a linear model for the potential outcomes:
$$Y_i(d) = \alpha + \tau \cdot d + \eta_i$$
• Remember that linearity isn't an assumption if $D_i$ is binary
• The effect of $D_i$ is constant here; the $\eta_i$ are the only source of individual variation, and we have $E[\eta_i] = 0$
• The consistency assumption allows us to write this as
$$Y_i = \alpha + \tau D_i + \eta_i$$
27 58
Covariates in the error
• Let's assume that $\eta_i$ is linear in $X_i$: $\eta_i = X_i'\gamma + \nu_i$
• The new error is mean-independent of $X_i$: $E[\nu_i \mid X_i] = 0$
• This is an assumption. It might be false!
• Plug into the above:
$$\begin{aligned}
E[Y_i(d) \mid X_i] = E[Y_i \mid D_i, X_i] &= \alpha + \tau D_i + E[\eta_i \mid X_i] \\
&= \alpha + \tau D_i + X_i'\gamma + E[\nu_i \mid X_i] \\
&= \alpha + \tau D_i + X_i'\gamma
\end{aligned}$$
28 58
Summing up regression with constant effects

• Reviewing the assumptions we've used:
  - no unmeasured confounders
  - constant treatment effects
  - linearity of the treatment/covariates
• Under these, we can run the following regression to estimate the ATE, $\tau$:
$$Y_i = \alpha + \tau D_i + X_i'\gamma + \nu_i$$
• Works with continuous or ordinal $D_i$ if the effect of these variables is truly linear
29 58
OLS constant effects simulation
Model with linear covariates and a constant treatment effect of 0:

library(mvtnorm)
n <- 100
p <- 4
X <- rmvnorm(n = 100, mean = rep(0, p))
gamma <- c(2.74, 1.37, 1.37, 1.37)
y <- 21 + X %*% c(gamma) + rnorm(n)
alpha <- c(-1, 0.5, -0.5, -0.1)
d.probs <- boot::inv.logit(X %*% alpha)
d <- rbinom(n, size = 1, prob = d.probs)
30 58
OLS with no covariates
summary(lm(y ~ d))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   22.036      0.458   48.13  < 2e-16
d             -2.869      0.654   -4.39 0.000029
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.3 on 98 degrees of freedom
Multiple R-squared: 0.164, Adjusted R-squared: 0.156
F-statistic: 19.2 on 1 and 98 DF, p-value: 0.000029
31 58
OLS with covariates
summary(lm(y ~ d + X))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  20.9952      0.159  132.07   <2e-16
d            -0.128       0.253   -0.50     0.62
X1            2.7368      0.126   21.75   <2e-16
X2            1.3677      0.114   12.01   <2e-16
X3            1.3673      0.130   10.51   <2e-16
X4            1.3570      0.106   12.80   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom
Multiple R-squared: 0.999, Adjusted R-squared: 0.999
F-statistic: 2.44e+04 on 5 and 94 DF, p-value: <2e-16
32 58
What happens with nonlinearity
Suppose we can only observe the following covariates:

z1 <- exp(X[, 1]/2)
z2 <- X[, 2]/(1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3]/25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

Inverting these implies that $Y_i$ and $D_i$ are functions of terms like $\log(Z_{i1})$, $Z_{i2}$, $Z_{i1}^2 Z_{i2}$, $Z_{i3}^{1/3}/\log(Z_{i1})$, and $Z_{i4}^{1/2}$

The regression is therefore a nonlinear function of the observed covariates
33 58
When linearity goes wrong
summary(lm(y ~ d + z1 + z2 + z3 + z4))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.8728    30.0799    0.73    0.469
d            -6.9292     3.3854   -2.05    0.043
z1           36.3110     2.6970   13.46   <2e-16
z2           -2.9033     3.6619   -0.79    0.430
z3           86.2022    43.1030    2.00    0.048
z4            0.4021     0.0329   12.23   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.4 on 94 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF, p-value: <2e-16
34 58
3 Regression with Heterogeneous Treatment Effects
35 58
Heterogeneous effects, binary treatment

• Completely randomized experiment:
$$\begin{aligned}
Y_i &= D_i Y_i(1) + (1 - D_i)Y_i(0) \\
&= Y_i(0) + (Y_i(1) - Y_i(0))D_i \\
&= \mu_0 + \tau_i D_i + (Y_i(0) - \mu_0) \\
&= \mu_0 + \tau D_i + (Y_i(0) - \mu_0) + (\tau_i - \tau) \cdot D_i \\
&= \mu_0 + \tau D_i + \varepsilon_i
\end{aligned}$$
• The error term now includes two components:
  1. "Baseline" variation in the outcome: $(Y_i(0) - \mu_0)$
  2. Variation in the treatment effect: $(\tau_i - \tau)$
• Easy to verify that, under the experiment, $E[\varepsilon_i \mid D_i] = 0$
• Thus, OLS estimates the ATE with no covariates
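A quick simulation sketch (hypothetical data) of this point: with randomized treatment and heterogeneous individual effects, the bivariate OLS slope recovers the ATE:

```r
## Randomized D with heterogeneous effects: the lm slope is close to the ATE
set.seed(4)
n <- 1e5
tau_i <- rnorm(n, mean = 2, sd = 1)   # individual effects, so ATE = 2
y0 <- rnorm(n)                        # baseline outcomes Y_i(0)
d <- rbinom(n, 1, 0.5)                # completely randomized treatment
y <- y0 + tau_i * d                   # observed outcome
coef(lm(y ~ d))["d"]                  # close to the ATE of 2
```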
36 58
Adding covariates
• What happens with no unmeasured confounders? We need to condition on $X_i$ now
• Remember identification of the ATE/ATT using iterated expectations
• The ATE is the weighted sum of CATEs:
$$\tau = \sum_x \tau(x)\Pr[X_i = x]$$
• The ATE/ATT are weighted averages of CATEs
• What about the regression estimand, $\tau^R$? How does it relate to the ATE/ATT?
37 58
Heterogeneous effects and regression
• Let's investigate this under a saturated regression model:
$$Y_i = \sum_x B_{xi}\alpha_x + \tau^R D_i + e_i$$
• Use a dummy variable for each unique combination of $X_i$: $B_{xi} = \mathbb{1}(X_i = x)$
• Linear in $X_i$ by construction
38 58
Investigating the regression coefficient
• How can we investigate $\tau^R$? We can rely on the regression anatomy:
$$\tau^R = \frac{\mathrm{Cov}(Y_i, D_i - E[D_i \mid X_i])}{V(D_i - E[D_i \mid X_i])}$$
• $D_i - E[D_i \mid X_i]$ is the residual from a regression of $D_i$ on the full set of dummies
• With a little work, we can show:
$$\tau^R = \frac{E\left[\tau(X_i)(D_i - E[D_i \mid X_i])^2\right]}{E[(D_i - E[D_i \mid X_i])^2]} = \frac{E[\tau(X_i)\sigma^2_d(X_i)]}{E[\sigma^2_d(X_i)]}$$
• $\sigma^2_d(x) = V[D_i \mid X_i = x]$ is the conditional variance of treatment assignment
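A simulation sketch of this result (hypothetical numbers: a single binary covariate with unequal propensity scores and opposite-signed CATEs):

```r
## Saturated OLS recovers the variance-weighted estimand, not the ATE
set.seed(5)
n <- 1e6
x <- rbinom(n, 1, 0.75)               # P[X = 1] = 0.75
e_x <- ifelse(x == 1, 0.9, 0.5)       # propensity scores
d <- rbinom(n, 1, e_x)
tau_x <- ifelse(x == 1, 1, -1)        # tau(1) = 1, tau(0) = -1, so ATE = 0.5
y <- tau_x * d + rnorm(n)
coef(lm(y ~ d + x))["d"]              # near 0.038, far from the ATE of 0.5
```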
39 58
The error terms
bull Letrsquos define the error term 119890119894 equiv 119884119894 minus 119883prime119894 120573 so that
119884119894 = 119883prime119894 120573 + [119884119894 minus 119883prime
119894 120573] = 119883prime119894 120573 + 119890119894
bull Note the residual 119890119894 is uncorrelated with 119883119894
120124[119883119894119890119894] = 120124[119883119894(119884119894 minus 119883prime119894 120573)]
= 120124[119883119894119884119894] minus 120124[119883119894119883prime119894 120573]
= 120124[119883119894119884119894] minus 120124 1114094119883119894119883prime119894120124[119883119894119883prime
119894 ]minus1113568120124[119883119894119884119894]1114097= 120124[119883119894119884119894] minus 120124[119883119894119883prime
119894 ]120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]
= 120124[119883119894119884119894] minus 120124[119883119894119884119894] = 0
bull No assumptions on the linearity of 120124[119884119894|119883119894]
18 58
OLS estimator
bull We know the population value of 120573 is
120573 = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]
bull How do we get an estimator of thisbull Plug-in principle replace population expectation with
sample versions
=
โกโขโขโขโขโขโฃ1119873
1114012119894119883119894119883prime
119894
โคโฅโฅโฅโฅโฅโฆ
minus11135681119873
1114012119894119883119894119884119894
bull If you work through the matrix algebra this turns out to be
= (119831prime119831)minus1113568 119831prime119858
19 58
Asymptotic OLS inference
bull With this representation in hand we can write the OLSestimator as follows
= 120573 +
โกโขโขโขโขโขโฃ1114012
119894119883119894119883prime
119894
โคโฅโฅโฅโฅโฅโฆ
minus1113568
1114012119894119883119894119890119894
bull Core idea sum119894119883119894119890119894 is the sum of rvs so the CLT applies
bull That plus some simple asymptotic theory allows us to say
radic119873( minus 120573) 119873(0ฮฉ)
bull Converges in distribution to a Normal distribution with meanvector 0 and covariance matrix ฮฉ
ฮฉ = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119883prime
119894 1198901113569119894 ]120124[119883119894119883prime119894 ]minus1113568
bull No linearity assumption needed20 58
Estimating the variance
bull In large samples then
radic119873( minus 120573) sim 119873(0ฮฉ)
bull How to estimate ฮฉ Plug-in principle again
1114023ฮฉ =
โกโขโขโขโขโขโฃ1114012
119894119883119894119883prime
119894
โคโฅโฅโฅโฅโฅโฆ
minus1113568 โกโขโขโขโขโขโฃ1114012119894119883119894119883prime
119894 1113569119894
โคโฅโฅโฅโฅโฅโฆ
โกโขโขโขโขโขโฃ1114012
119894119883119894119883prime
119894
โคโฅโฅโฅโฅโฅโฆ
minus1113568
bull Replace 119890119894 with its emprical counterpart (residuals)119894 = 119884119894 minus 119883prime
119894 bull Replace the population moments of 119883119894 with their sample
counterpartsbull The square root of the diagonals of this covariance matrix are
the ldquorobustrdquo or Huber-White standard errors that Statacommonly report
21 58
Heteroskedasticity
bull No assumptions of homoskedasticitybull Heteroskedaticity will definitely occur when
CEF is linear but the 1205901113569(119909) = 120141[119884119894|119883119894 = 119909] is not constant in 119909 119864[119884119894|119883119894] is not linear but we use the linear regression to
approxmiate it
22 58
2 Regression andCausality
23 58
Regression and causality
bull Most econometrics textbooks regression defined withoutrespect to causality
bull But then when is ldquobiasedrdquo The above derivations work forsome 120124[119884119894|119883119894]
bull The question then is when does knowing the CEF tell ussomething about causality
bull MHE argues that a regression is causal when the CEF itapproximates is causal Identification is king
bull We will show that under certain conditions a regression of theoutcome on the treatment and the covariates can recover acausal parameter but perhaps not the one in which we areinterested
24 58
Review
bull Quick reminder we have potential outcomes 119884119894(1) and 119884119894(0)and two parameters the ATE and ATT
120591 = 119864[119884119894(1) minus 119884119894(0)]120591ATT = 119864[119884119894(1) minus 119884119894(0)|119863119894 = 1]
bull We have shown in past weeks that these effects are identifiedwhen ignorability holds MHE calls this the conditionalindependence assumption (CIA)
25 58
Linear constant effects model binarytreatment
bull Experiment with a simple experiment we can rewrite theconsistency assumption to be a regression formula
119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894
= 120124[119884119894(0)] + 120591119863119894 + (119884119894(0) minus 120124[119884119894(0)])= 1205831113567 + 120591119863119894 + 1199071113567119894
bull Note that if ignorability holds (as in an experiment) for 119884119894(0)then it will also hold for 1199071113567119894 since 120124[119884119894(0)] is constant Thusthis satifies the usual assumptions for regression
26 58
Now with covariates
bull Now assume no unmeasured confounders 119884119894(119889) โโ 119863119894|119883119894bull We will assume a linear model for the potential outcomes
119884119894(119889) = 120572 + 120591 sdot 119889 + 120578119894
bull Remember that linearity isnrsquot an assumption if 119863119894 is binarybull Effect of 119863119894 is constant here the 120578119894 are the only source of
individual variation and we have 119864[120578119894] = 0bull Consistency assumption allows us to write this as
119884119894 = 120572 + 120591119863119894 + 120578119894
27 58
Covariates in the error
bull Letrsquos assume that 120578119894 is linear in 119883119894 120578119894 = 119883prime119894 120574 + 120584119894
bull New error is uncorrelated with 119883119894 120124[120584119894|119883119894] = 0bull This is an assumption Might be falsebull Plug into the above
120124[119884119894(119889)|119883119894] = 119864[119884119894|119863119894 119883119894] = 120572 + 120591119863119894 + 119864[120578119894|119883119894]= 120572 + 120591119863119894 + 119883prime
119894 120574 + 119864[120584119894|119883119894]= 120572 + 120591119863119894 + 119883prime
119894 120574
28 58
Summing up regression with constanteffects
bull Reviewing the assumptions wersquove used no unmeasured confounders constant treatment effects linearity of the treatmentcovariates
bull Under these we can run the following regression to estimatethe ATE 120591
119884119894 = 120572 + 120591119863119894 + 119883prime119894 120574 + 120584119894
bull Works with continuous or ordinal 119863119894 if linearity in the effect ofthese variables is truly linear
29 58
OLS constant effects simulation
Model with linear covariates constant 0 effect of treatment
library(mvtnorm)
n lt- 100
p lt- 4
X lt- rmvnorm(n = 100 mean = rep(0 p))
gamma lt- c(274 137 137 137)
y lt- 210 + X c(gamma) + rnorm(n)
alpha lt- c(-1 05 -05 -01)
dprobs lt- bootinvlogit(X alpha)
d lt- rbinom(n size = 1 prob = dprobs)
30 58
OLS with no covariates
summary(lm(y ~ d))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 22036 458 4813 lt 2e-16
d -2869 654 -439 0000029
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 33 on 98 degrees of freedom
Multiple R-squared 0164 Adjusted R-squared 0156
F-statistic 192 on 1 and 98 DF p-value 0000029
31 58
OLS with covariates
summary(lm(y ~ d + X))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 209952 0159 13207 lt2e-16
d -0128 0253 -05 062
X1 27368 0126 2175 lt2e-16
X2 13677 0114 1201 lt2e-16
X3 13673 0130 1051 lt2e-16
X4 13570 0106 1280 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 1 on 94 degrees of freedom
Multiple R-squared 0999 Adjusted R-squared 0999
F-statistic 244e+04 on 5 and 94 DF p-value lt2e-16
32 58
What happens with nonlinearity?

Suppose we can only observe the following covariates:

z1 <- exp(X[, 1]/2)
z2 <- X[, 2]/(1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3]/25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

Implies that $Y_i$ and $D_i$ are nonlinear functions of the $Z$'s: terms like $\log(Z_{i1})$, $Z_{i2}$, $Z_{i2}Z_{i1}^2$, $Z_{i3}^{1/3}/\log(Z_{i1})$, and $Z_{i4}^{1/2}$.

Regression is a nonlinear function of the observed covariates.
33 58
When linearity goes wrong
summary(lm(y ~ d + z1 + z2 + z3 + z4))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.8728    30.0799    0.73    0.469
d            -6.9292     3.3854   -2.05    0.043
z1           36.3110     2.6970   13.46   <2e-16
z2           -2.9033     3.6619   -0.79    0.430
z3           86.2022    43.1030    2.00    0.048
z4            0.4021     0.0329   12.23   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14 on 94 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF, p-value: <2e-16
34 58
3 Regression with Heterogeneous Treatment Effects
35 58
Heterogeneous effects, binary treatment

• Completely randomized experiment:

$$\begin{aligned}
Y_i &= D_iY_i(1) + (1 - D_i)Y_i(0) \\
&= Y_i(0) + (Y_i(1) - Y_i(0))D_i \\
&= \mu_0 + \tau_i D_i + (Y_i(0) - \mu_0) \\
&= \mu_0 + \tau D_i + (Y_i(0) - \mu_0) + (\tau_i - \tau)\cdot D_i \\
&= \mu_0 + \tau D_i + \varepsilon_i
\end{aligned}$$

• The error term now includes two components:
  1. "Baseline" variation in the outcome: $(Y_i(0) - \mu_0)$
  2. Variation in the treatment effect: $(\tau_i - \tau)D_i$
• Easy to verify that under an experiment, $\mathbb{E}[\varepsilon_i \mid D_i] = 0$.
• Thus, OLS estimates the ATE with no covariates.
36 58
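A quick simulation (a hypothetical setup, not the one used later in these slides) confirms this: with individual-level effect heterogeneity and random assignment, the coefficient on $D_i$ from a regression of $Y_i$ on $D_i$ alone is exactly the difference in means and sits close to the true ATE.

```r
set.seed(1234)
n <- 10000
x <- rnorm(n)
y0 <- x + rnorm(n)                     # control potential outcome
tau.i <- 1 + 0.5 * x                   # heterogeneous effects, E[tau.i] = 1
y1 <- y0 + tau.i
d <- rbinom(n, size = 1, prob = 0.5)   # random assignment
y <- ifelse(d == 1, y1, y0)

fit <- lm(y ~ d)
coef(fit)["d"]                         # close to the ATE of 1
mean(y[d == 1]) - mean(y[d == 0])      # identical to the coefficient
```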
Adding covariates

• What happens with no unmeasured confounders? We need to condition on $X_i$ now.
• Remember identification of the ATE/ATT using iterated expectations.
• The ATE is the weighted sum of conditional average treatment effects (CATEs):

$$\tau = \sum_x \tau(x)\Pr[X_i = x]$$

• The ATE/ATT are weighted averages of CATEs.
• What about the regression estimand, $\tau_R$? How does it relate to the ATE/ATT?
37 58
Heterogeneous effects and regression

• Let's investigate this under a saturated regression model:

$$Y_i = \sum_x B_{xi}\alpha_x + \tau_R D_i + e_i$$

• Use a dummy variable for each unique combination of $X_i$: $B_{xi} = \mathbb{1}(X_i = x)$
• Linear in $X_i$ by construction.
38 58
Investigating the regression coefficient

• How can we investigate $\tau_R$? We can rely on regression anatomy:

$$\tau_R = \frac{\mathrm{Cov}(Y_i, D_i - \mathbb{E}[D_i \mid X_i])}{\mathbb{V}(D_i - \mathbb{E}[D_i \mid X_i])}$$

• $D_i - \mathbb{E}[D_i \mid X_i]$ is the residual from a regression of $D_i$ on the full set of dummies.
• With a little work, we can show:

$$\tau_R = \frac{\mathbb{E}\left[\tau(X_i)(D_i - \mathbb{E}[D_i \mid X_i])^2\right]}{\mathbb{E}[(D_i - \mathbb{E}[D_i \mid X_i])^2]} = \frac{\mathbb{E}[\tau(X_i)\sigma^2_d(X_i)]}{\mathbb{E}[\sigma^2_d(X_i)]}$$

• $\sigma^2_d(x) = \mathbb{V}[D_i \mid X_i = x]$ is the conditional variance of treatment assignment.
39 58
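The regression-anatomy formula can be checked numerically: residualize $D_i$ on the stratum dummies and regress $Y_i$ on that residual; the slope matches the coefficient on $D_i$ from the saturated regression exactly. A sketch on simulated data (not the slides' example):

```r
set.seed(2138)
n <- 5000
x <- sample(1:3, n, replace = TRUE)        # discrete covariate (strata)
e.x <- c(0.2, 0.5, 0.8)[x]                 # propensity score by stratum
d <- rbinom(n, size = 1, prob = e.x)
y <- x + d * (x - 2) + rnorm(n)            # heterogeneous effects, tau(x) = x - 2

full <- lm(y ~ factor(x) + d)              # saturated regression
d.tilde <- residuals(lm(d ~ factor(x)))    # D_i minus its stratum mean
anatomy <- lm(y ~ d.tilde)

coef(full)["d"]            # tau_R from the saturated model
coef(anatomy)["d.tilde"]   # same number via regression anatomy
```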
ATE versus OLS

$$\tau_R = \mathbb{E}[\tau(X_i)W_i] = \sum_x \tau(x)\,\frac{\sigma^2_d(x)}{\mathbb{E}[\sigma^2_d(X_i)]}\,\mathbb{P}[X_i = x]$$

• Compare to the ATE:

$$\tau = \mathbb{E}[\tau(X_i)] = \sum_x \tau(x)\,\mathbb{P}[X_i = x]$$

• Both weight strata relative to their size, $\mathbb{P}[X_i = x]$.
• OLS weights a stratum more heavily if its treatment variance $\sigma^2_d(x)$ is high relative to the average variance across strata, $\mathbb{E}[\sigma^2_d(X_i)]$.
• The ATE weights strata only by their size.
40 58
Regression weighting

$$W_i = \frac{\sigma^2_d(X_i)}{\mathbb{E}[\sigma^2_d(X_i)]}$$

• Why does OLS weight like this?
• OLS is a minimum-variance estimator: it gives more weight to more precise within-strata estimates.
• Within-strata estimates are most precise when the treatment is evenly spread and thus has the highest variance.
• If $D_i$ is binary, then we know the conditional variance will be

$$\sigma^2_d(x) = \mathbb{P}[D_i = 1 \mid X_i = x]\,(1 - \mathbb{P}[D_i = 1 \mid X_i = x]) = e(x)(1 - e(x))$$

• Maximum variance when $\mathbb{P}[D_i = 1 \mid X_i = x] = 1/2$.
41 58
OLS weighting example

• Binary covariate:

$$\begin{aligned}
\mathbb{P}[X_i = 1] &= 0.75 & \mathbb{P}[X_i = 0] &= 0.25 \\
\mathbb{P}[D_i = 1 \mid X_i = 1] &= 0.9 & \mathbb{P}[D_i = 1 \mid X_i = 0] &= 0.5 \\
\sigma^2_d(1) &= 0.09 & \sigma^2_d(0) &= 0.25 \\
\tau(1) &= 1 & \tau(0) &= -1
\end{aligned}$$

• Implies the ATE is $\tau = 0.5$.
• Average conditional variance: $\mathbb{E}[\sigma^2_d(X_i)] = 0.13$.
• $\Rightarrow$ weights for $X_i = 1$ are $0.09/0.13 = 0.692$; for $X_i = 0$, $0.25/0.13 = 1.92$.

$$\begin{aligned}
\tau_R = \mathbb{E}[\tau(X_i)W_i] &= \tau(1)W(1)\mathbb{P}[X_i = 1] + \tau(0)W(0)\mathbb{P}[X_i = 0] \\
&= 1 \times 0.692 \times 0.75 + -1 \times 1.92 \times 0.25 \\
&= 0.039
\end{aligned}$$
42 58
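The slide's arithmetic is easy to reproduce; with unrounded weights the regression estimand is exactly $1/26 \approx 0.0385$ (the 0.039 on the slide comes from rounding the weights first):

```r
px  <- c(0.25, 0.75)      # P[X = 0], P[X = 1]
e.x <- c(0.5, 0.9)        # P[D = 1 | X = x]
tau <- c(-1, 1)           # CATEs: tau(0), tau(1)

s2   <- e.x * (1 - e.x)   # conditional variances: 0.25, 0.09
W    <- s2 / sum(s2 * px) # OLS weights: 1.92, 0.692
tauR <- sum(tau * W * px) # regression estimand
ate  <- sum(tau * px)     # ATE = 0.5

c(ate = ate, tauR = tauR)
```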
When will OLS estimate the ATE?

• When does $\tau = \tau_R$?
• Constant treatment effects: $\tau(x) = \tau = \tau_R$.
• Constant probability of treatment: $e(x) = \mathbb{P}[D_i = 1 \mid X_i = x] = e$. This implies that the OLS weights are all 1.
• An incorrect linearity assumption in $X_i$ will lead to more bias.
43 58
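To see the constant-propensity condition in the weighting formula, redo the example from the previous slide with $e(x)$ held fixed: every weight is exactly 1 and $\tau_R$ collapses to the ATE.

```r
px  <- c(0.25, 0.75)      # stratum sizes from the example
tau <- c(-1, 1)           # CATEs
e.x <- c(0.6, 0.6)        # constant propensity score

s2   <- e.x * (1 - e.x)   # identical in every stratum
W    <- s2 / sum(s2 * px) # all weights equal 1
tauR <- sum(tau * W * px)

c(weights = W, tauR = tauR)  # tauR equals the ATE of 0.5
```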
Other ways to use regression

• What's the path forward?
  • Accept the bias (it might be relatively small with saturated models), or
  • use a different regression approach.
• Let $\mu_d(x) = \mathbb{E}[Y_i(d) \mid X_i = x]$ be the CEF for the potential outcome under $D_i = d$.
• By consistency and no unmeasured confounders, we have $\mu_d(x) = \mathbb{E}[Y_i \mid D_i = d, X_i = x]$.
• Estimate a regression of $Y_i$ on $X_i$ among the $D_i = d$ group.
• Then $\widehat{\mu}_d(x)$ is just a predicted value from the regression for $X_i = x$.
• How can we use this?
44 58
Imputation estimators

• Impute the treated potential outcomes with $\widehat{Y}_i(1) = \widehat{\mu}_1(X_i)$.
• Impute the control potential outcomes with $\widehat{Y}_i(0) = \widehat{\mu}_0(X_i)$.
• Procedure:
  • Regress $Y_i$ on $X_i$ in the treated group and get predicted values for all units (treated or control).
  • Regress $Y_i$ on $X_i$ in the control group and get predicted values for all units (treated or control).
  • Take the average difference between these predicted values.
• More mathematically, it looks like this:

$$\widehat{\tau}_{imp} = \frac{1}{N}\sum_i \widehat{\mu}_1(X_i) - \widehat{\mu}_0(X_i)$$

• Sometimes called an imputation estimator.
45 58
Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample.
• Useful trick: use a model on the entire data and model.frame() to get the right design matrix.

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61
46 58
Notes on imputation estimators

• If the $\widehat{\mu}_d(x)$ are consistent estimators, then $\widehat{\tau}_{imp}$ is consistent for the ATE.
• Why don't people use this?
  • Most people don't know the results we've been talking about.
  • It's harder to implement than vanilla OLS.
• Can use linear regression to estimate $\widehat{\mu}_d(x) = x'\widehat{\beta}_d$.
• A recent trend is to estimate $\widehat{\mu}_d(x)$ via nonparametric methods such as:
  • kernel regression, local linear regression, regression trees, etc.
  • Easiest is generalized additive models (GAMs).
47 58
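As a sketch of the nonparametric route, any within-group smoother can play the role of $\widehat{\mu}_d(x)$; here base R's loess() stands in for the kernel/GAM estimators mentioned above, on simulated data with a nonlinear CEF and a true constant effect of 1 (surface = "direct" lets loess predict beyond each group's x-range):

```r
set.seed(60637)
n <- 400
x <- runif(n, -2, 2)
d <- rbinom(n, size = 1, prob = 0.5)
y <- sin(x) + d + rnorm(n, sd = 0.2)       # nonlinear CEF, effect = 1
dat <- data.frame(y, x, d)

ctl <- loess.control(surface = "direct")    # allow extrapolation
mu1 <- loess(y ~ x, data = dat[dat$d == 1, ], control = ctl)
mu0 <- loess(y ~ x, data = dat[dat$d == 0, ], control = ctl)

# impute both potential outcomes for every unit, then average the difference
y1.imps <- predict(mu1, newdata = dat)
y0.imps <- predict(mu0, newdata = dat)
tau.imp <- mean(y1.imps - y0.imps)
tau.imp   # close to the true effect of 1
```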
Imputation estimator visualization

[Figure: scatterplots of $y$ against $x$ for treated and control units, building up the within-group regression fits used for imputation (slides 48-50)]

50 58
Nonlinear relationships

• Same idea, but with a nonlinear relationship between $Y_i$ and $X_i$.

[Figure: scatterplots of $y$ against $x$ for treated and control units with nonlinear CEFs (slides 51-53)]

53 58
Using semiparametric regression

• Here the CEFs are nonlinear, but we don't know their form.
• We can use GAMs from the mgcv package for flexible estimation:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204  Scale est. = 0.0071351  n = 30
54 58
Using GAMs

[Figure: scatterplots of $y$ against $x$ with the estimated GAM fits for treated and control units (slides 55-57)]

57 58
Limited dependent variables

• Usual advice: model the data from first principles. Logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is a-ok with limited DVs when:
  • binary treatment and no covariates (just diff-in-means), or
  • binary treatment, discrete covariates, and saturated models (stratified diff-in-means).
• Imposing a model on LDVs in these cases imposes a distributional assumption, which could be wrong.
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models.
  • It could go wrong in small samples.
  • If using nonlinear models, always get effects on the scale of the outcome.
58 58
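The saturated-model bullet can be verified directly: with a binary outcome, binary treatment, and one binary covariate, the fully interacted OLS fit reproduces the stratum-specific differences in means exactly (a sketch on simulated data):

```r
set.seed(12345)
n <- 1000
x <- rbinom(n, size = 1, prob = 0.4)
d <- rbinom(n, size = 1, prob = ifelse(x == 1, 0.7, 0.3))
y <- rbinom(n, size = 1, prob = plogis(-1 + d + x))  # limited (binary) DV

fit <- lm(y ~ d * x)   # saturated: one parameter per cell

# stratified differences in means
dim0 <- mean(y[d == 1 & x == 0]) - mean(y[d == 0 & x == 0])
dim1 <- mean(y[d == 1 & x == 1]) - mean(y[d == 0 & x == 1])

coef(fit)["d"]                      # equals dim0
coef(fit)["d"] + coef(fit)["d:x"]   # equals dim1
```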
OLS estimator

• We know the population value of $\beta$ is:

$$\beta = \mathbb{E}[X_iX_i']^{-1}\mathbb{E}[X_iY_i]$$

• How do we get an estimator of this?
• Plug-in principle: replace the population expectations with their sample versions:

$$\widehat{\beta} = \left[\frac{1}{N}\sum_i X_iX_i'\right]^{-1}\frac{1}{N}\sum_i X_iY_i$$

• If you work through the matrix algebra, this turns out to be:

$$\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$$
19 58
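The plug-in formula can be checked against lm() in a few lines (a sketch on simulated data):

```r
set.seed(42)
n <- 200
X <- cbind(1, rnorm(n), rnorm(n))   # design matrix; first column is the intercept
beta <- c(1, 2, -3)
y <- X %*% beta + rnorm(n)

# plug-in / matrix-algebra estimator: (X'X)^{-1} X'y
beta.hat <- solve(t(X) %*% X, t(X) %*% y)

fit <- lm(y ~ X - 1)   # suppress lm's intercept; X already has one
cbind(manual = drop(beta.hat), lm = coef(fit))
```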
Asymptotic OLS inference

• With this representation in hand, we can write the OLS estimator as follows:

$$\widehat{\beta} = \beta + \left[\sum_i X_iX_i'\right]^{-1}\sum_i X_ie_i$$

• Core idea: $\sum_i X_ie_i$ is a sum of random variables, so the CLT applies.
• That, plus some simple asymptotic theory, allows us to say:

$$\sqrt{N}(\widehat{\beta} - \beta) \rightsquigarrow N(0, \Omega)$$

• It converges in distribution to a Normal distribution with mean vector 0 and covariance matrix $\Omega$:

$$\Omega = \mathbb{E}[X_iX_i']^{-1}\mathbb{E}[X_iX_i'e_i^2]\mathbb{E}[X_iX_i']^{-1}$$

• No linearity assumption needed.

20 58
Estimating the variance

• In large samples, then:

$$\sqrt{N}(\widehat{\beta} - \beta) \sim N(0, \Omega)$$

• How do we estimate $\Omega$? Plug-in principle again:

$$\widehat{\Omega} = \left[\sum_i X_iX_i'\right]^{-1}\left[\sum_i X_iX_i'\widehat{e}_i^2\right]\left[\sum_i X_iX_i'\right]^{-1}$$

• Replace $e_i$ with its empirical counterpart, the residuals $\widehat{e}_i = Y_i - X_i'\widehat{\beta}$.
• Replace the population moments of $X_i$ with their sample counterparts.
• The square roots of the diagonal of this covariance matrix are the "robust" or Huber-White standard errors that Stata commonly reports.
21 58
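The sandwich estimator is a few lines of matrix algebra, and for a single binary regressor its slope variance has a closed form, $\sum_{i:D_i=0}\widehat{e}_i^2/n_0^2 + \sum_{i:D_i=1}\widehat{e}_i^2/n_1^2$, which makes a handy check (a sketch; no packages needed):

```r
set.seed(2002)
n <- 150
d <- rbinom(n, size = 1, prob = 0.4)
y <- 1 + 2 * d + rnorm(n, sd = ifelse(d == 1, 3, 1))  # heteroskedastic errors

X <- cbind(1, d)
fit <- lm(y ~ d)
e.hat <- residuals(fit)

# HC0 sandwich: bread %*% meat %*% bread
bread <- solve(t(X) %*% X)
meat  <- t(X) %*% (X * e.hat^2)   # sum of e_i^2 * x_i x_i'
V.hc0 <- bread %*% meat %*% bread
sqrt(diag(V.hc0))                 # robust (Huber-White) standard errors

# closed-form check for the binary-regressor slope
n1 <- sum(d == 1); n0 <- sum(d == 0)
v.slope <- sum(e.hat[d == 0]^2) / n0^2 + sum(e.hat[d == 1]^2) / n1^2
```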
Heteroskedasticity

• No assumption of homoskedasticity here.
• Heteroskedasticity will definitely occur when:
  • the CEF is linear, but $\sigma^2(x) = \mathbb{V}[Y_i \mid X_i = x]$ is not constant in $x$, or
  • $\mathbb{E}[Y_i \mid X_i]$ is not linear, but we use linear regression to approximate it.
22 58
2 Regression and Causality
23 58
Regression and causality

• In most econometrics textbooks, regression is defined without respect to causality.
• But then when is $\widehat{\beta}$ "biased"? The above derivations work for some CEF $\mathbb{E}[Y_i \mid X_i]$.
• The question, then, is: when does knowing the CEF tell us something about causality?
• MHE argues that a regression is causal when the CEF it approximates is causal. Identification is king.
• We will show that, under certain conditions, a regression of the outcome on the treatment and the covariates can recover a causal parameter, but perhaps not the one in which we are interested.
24 58
Review

• Quick reminder: we have potential outcomes $Y_i(1)$ and $Y_i(0)$ and two parameters, the ATE and the ATT:

$$\begin{aligned}
\tau &= \mathbb{E}[Y_i(1) - Y_i(0)] \\
\tau_{\text{ATT}} &= \mathbb{E}[Y_i(1) - Y_i(0) \mid D_i = 1]
\end{aligned}$$

• We have shown in past weeks that these effects are identified when ignorability holds. MHE calls this the conditional independence assumption (CIA).
25 58
Linear constant effects model, binary treatment

• Start with a simple experiment; with a constant effect ($Y_i(1) - Y_i(0) = \tau$), we can rewrite the consistency assumption as a regression formula:

$$\begin{aligned}
Y_i &= D_iY_i(1) + (1 - D_i)Y_i(0) \\
&= Y_i(0) + (Y_i(1) - Y_i(0))D_i \\
&= \mathbb{E}[Y_i(0)] + \tau D_i + (Y_i(0) - \mathbb{E}[Y_i(0)]) \\
&= \mu_0 + \tau D_i + v_{0i}
\end{aligned}$$

• Note that if ignorability holds for $Y_i(0)$ (as in an experiment), then it will also hold for $v_{0i}$, since $\mathbb{E}[Y_i(0)]$ is constant. Thus this satisfies the usual assumptions for regression.
26 58
Now with covariates

• Now assume no unmeasured confounders: $Y_i(d) \perp\!\!\!\perp D_i \mid X_i$.
• We will assume a linear model for the potential outcomes:

$$Y_i(d) = \alpha + \tau \cdot d + \eta_i$$

• Remember that linearity isn't an assumption if $D_i$ is binary.
• The effect of $D_i$ is constant here; the $\eta_i$ are the only source of individual variation, and we have $\mathbb{E}[\eta_i] = 0$.
• The consistency assumption allows us to write this as:

$$Y_i = \alpha + \tau D_i + \eta_i$$
27 58
Covariates in the error

• Let's assume that $\eta_i$ is linear in $X_i$: $\eta_i = X_i'\gamma + \nu_i$.
• The new error is uncorrelated with $X_i$: $\mathbb{E}[\nu_i \mid X_i] = 0$.
• This is an assumption! It might be false!
• Plug into the above:

$$\begin{aligned}
\mathbb{E}[Y_i(d) \mid X_i] = \mathbb{E}[Y_i \mid D_i, X_i] &= \alpha + \tau D_i + \mathbb{E}[\eta_i \mid X_i] \\
&= \alpha + \tau D_i + X_i'\gamma + \mathbb{E}[\nu_i \mid X_i] \\
&= \alpha + \tau D_i + X_i'\gamma
\end{aligned}$$
28 58
Asymptotic OLS inference
bull With this representation in hand we can write the OLSestimator as follows
= 120573 +
โกโขโขโขโขโขโฃ1114012
119894119883119894119883prime
119894
โคโฅโฅโฅโฅโฅโฆ
minus1113568
1114012119894119883119894119890119894
bull Core idea sum119894119883119894119890119894 is the sum of rvs so the CLT applies
bull That plus some simple asymptotic theory allows us to say
radic119873( minus 120573) 119873(0ฮฉ)
bull Converges in distribution to a Normal distribution with meanvector 0 and covariance matrix ฮฉ
ฮฉ = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119883prime
119894 1198901113569119894 ]120124[119883119894119883prime119894 ]minus1113568
bull No linearity assumption needed20 58
Estimating the variance
bull In large samples then
radic119873( minus 120573) sim 119873(0ฮฉ)
bull How to estimate ฮฉ Plug-in principle again
1114023ฮฉ =
โกโขโขโขโขโขโฃ1114012
119894119883119894119883prime
119894
โคโฅโฅโฅโฅโฅโฆ
minus1113568 โกโขโขโขโขโขโฃ1114012119894119883119894119883prime
119894 1113569119894
โคโฅโฅโฅโฅโฅโฆ
โกโขโขโขโขโขโฃ1114012
119894119883119894119883prime
119894
โคโฅโฅโฅโฅโฅโฆ
minus1113568
bull Replace 119890119894 with its emprical counterpart (residuals)119894 = 119884119894 minus 119883prime
119894 bull Replace the population moments of 119883119894 with their sample
counterpartsbull The square root of the diagonals of this covariance matrix are
the ldquorobustrdquo or Huber-White standard errors that Statacommonly report
21 58
Heteroskedasticity
bull No assumptions of homoskedasticitybull Heteroskedaticity will definitely occur when
CEF is linear but the 1205901113569(119909) = 120141[119884119894|119883119894 = 119909] is not constant in 119909 119864[119884119894|119883119894] is not linear but we use the linear regression to
approxmiate it
22 58
2 Regression andCausality
23 58
Regression and causality
bull Most econometrics textbooks regression defined withoutrespect to causality
bull But then when is ldquobiasedrdquo The above derivations work forsome 120124[119884119894|119883119894]
bull The question then is when does knowing the CEF tell ussomething about causality
bull MHE argues that a regression is causal when the CEF itapproximates is causal Identification is king
bull We will show that under certain conditions a regression of theoutcome on the treatment and the covariates can recover acausal parameter but perhaps not the one in which we areinterested
24 58
Review
bull Quick reminder we have potential outcomes 119884119894(1) and 119884119894(0)and two parameters the ATE and ATT
120591 = 119864[119884119894(1) minus 119884119894(0)]120591ATT = 119864[119884119894(1) minus 119884119894(0)|119863119894 = 1]
bull We have shown in past weeks that these effects are identifiedwhen ignorability holds MHE calls this the conditionalindependence assumption (CIA)
25 58
Linear constant effects model binarytreatment
bull Experiment with a simple experiment we can rewrite theconsistency assumption to be a regression formula
119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894
= 120124[119884119894(0)] + 120591119863119894 + (119884119894(0) minus 120124[119884119894(0)])= 1205831113567 + 120591119863119894 + 1199071113567119894
bull Note that if ignorability holds (as in an experiment) for 119884119894(0)then it will also hold for 1199071113567119894 since 120124[119884119894(0)] is constant Thusthis satifies the usual assumptions for regression
26 58
Now with covariates

• Now assume no unmeasured confounders: $Y_i(d) \perp\!\!\!\perp D_i \mid X_i$
• We will assume a linear model for the potential outcomes:
$$Y_i(d) = \alpha + \tau \cdot d + \eta_i$$
• Remember that linearity isn't an assumption if $D_i$ is binary
• The effect of $D_i$ is constant here; the $\eta_i$ are the only source of individual variation, and we have $\mathbb{E}[\eta_i] = 0$
• The consistency assumption allows us to write this as:
$$Y_i = \alpha + \tau D_i + \eta_i$$

27/58
Covariates in the error

• Let's assume that $\eta_i$ is linear in $X_i$: $\eta_i = X_i'\gamma + \nu_i$
• The new error is uncorrelated with $X_i$: $\mathbb{E}[\nu_i \mid X_i] = 0$
• This is an assumption! It might be false!
• Plug into the above:
$$\begin{aligned}
\mathbb{E}[Y_i(d) \mid X_i] = \mathbb{E}[Y_i \mid D_i, X_i] &= \alpha + \tau D_i + \mathbb{E}[\eta_i \mid X_i] \\
&= \alpha + \tau D_i + X_i'\gamma + \mathbb{E}[\nu_i \mid X_i] \\
&= \alpha + \tau D_i + X_i'\gamma
\end{aligned}$$

28/58
Summing up regression with constant effects

• Reviewing the assumptions we've used:
- no unmeasured confounders
- constant treatment effects
- linearity of the treatment/covariates
• Under these, we can run the following regression to estimate the ATE $\tau$:
$$Y_i = \alpha + \tau D_i + X_i'\gamma + \nu_i$$
• Works with continuous or ordinal $D_i$, if the effect of these variables is truly linear

29/58
OLS constant-effects simulation

Model with linear covariates and a constant 0 effect of treatment:

library(mvtnorm)
n <- 100
p <- 4
X <- rmvnorm(n = 100, mean = rep(0, p))
gamma <- c(2.74, 1.37, 1.37, 1.37)
y <- 21.0 + X %*% c(gamma) + rnorm(n)
alpha <- c(-1, 0.5, -0.5, -0.1)
d.probs <- boot::inv.logit(X %*% alpha)
d <- rbinom(n, size = 1, prob = d.probs)

30/58
OLS with no covariates

summary(lm(y ~ d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   22.036      0.458   48.13  < 2e-16
d             -2.869      0.654   -4.39 0.000029
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.3 on 98 degrees of freedom
Multiple R-squared: 0.164, Adjusted R-squared: 0.156
F-statistic: 19.2 on 1 and 98 DF, p-value: 0.000029

31/58
OLS with covariates

summary(lm(y ~ d + X))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  20.9952      0.159  132.07   <2e-16
d            -0.128       0.253   -0.50     0.62
X1            2.7368      0.126   21.75   <2e-16
X2            1.3677      0.114   12.01   <2e-16
X3            1.3673      0.130   10.51   <2e-16
X4            1.3570      0.106   12.80   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom
Multiple R-squared: 0.999, Adjusted R-squared: 0.999
F-statistic: 2.44e+04 on 5 and 94 DF, p-value: <2e-16

32/58
What happens with nonlinearity?

Suppose we can only observe the following covariates:

z1 <- exp(X[, 1]/2)
z2 <- X[, 2]/(1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3]/25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

Implies that $Y_i$ and $D_i$ are functions of $\log(Z_{i1})$, $Z_{i2}$, $Z_{i2}Z_{i1}$, $1/\log(Z_{i1})$, $Z_{i3}/\log(Z_{i1})$, and $Z_{i4}^{1/2}$.

Regression is a nonlinear function of the observed covariates.

33/58
When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.18728    3.00799    0.73    0.469
d           -0.69292    0.33854   -2.05    0.043
z1           3.63110    0.26970   13.46   <2e-16
z2          -0.29033    0.36619   -0.79    0.430
z3           8.62022    4.31030    2.00    0.048
z4           0.04021    0.00329   12.23   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.4 on 94 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF, p-value: <2e-16

34/58
3. Regression with Heterogeneous Treatment Effects

35/58
Heterogeneous effects, binary treatment

• Completely randomized experiment:
$$\begin{aligned}
Y_i &= D_i Y_i(1) + (1 - D_i) Y_i(0) \\
&= Y_i(0) + (Y_i(1) - Y_i(0)) D_i \\
&= \mu_0 + \tau_i D_i + (Y_i(0) - \mu_0) \\
&= \mu_0 + \tau D_i + (Y_i(0) - \mu_0) + (\tau_i - \tau) \cdot D_i \\
&= \mu_0 + \tau D_i + \varepsilon_i
\end{aligned}$$
• The error term now includes two components:
1. "Baseline" variation in the outcome: $(Y_i(0) - \mu_0)$
2. Variation in the treatment effect: $(\tau_i - \tau)$
• Easy to verify that under an experiment, $\mathbb{E}[\varepsilon_i \mid D_i] = 0$
• Thus, OLS estimates the ATE with no covariates

36/58
Adding covariates

• What happens with no unmeasured confounders? We need to condition on $X_i$ now
• Remember identification of the ATE/ATT using iterated expectations
• The ATE is the weighted sum of CATEs:
$$\tau = \sum_x \tau(x) \Pr[X_i = x]$$
• The ATE/ATT are weighted averages of CATEs
• What about the regression estimand $\tau_R$? How does it relate to the ATE/ATT?

37/58
Heterogeneous effects and regression

• Let's investigate this under a saturated regression model:
$$Y_i = \sum_x B_{xi}\alpha_x + \tau_R D_i + e_i$$
• Use a dummy variable for each unique combination of $X_i$: $B_{xi} = \mathbb{1}(X_i = x)$
• Linear in $X_i$ by construction

38/58
Investigating the regression coefficient

• How can we investigate $\tau_R$? We can rely on regression anatomy:
$$\tau_R = \frac{\mathrm{Cov}(Y_i, D_i - \mathbb{E}[D_i \mid X_i])}{\mathbb{V}(D_i - \mathbb{E}[D_i \mid X_i])}$$
• $D_i - \mathbb{E}[D_i \mid X_i]$ is the residual from a regression of $D_i$ on the full set of dummies
• With a little work, we can show:
$$\tau_R = \frac{\mathbb{E}\left[\tau(X_i)\,(D_i - \mathbb{E}[D_i \mid X_i])^2\right]}{\mathbb{E}\left[(D_i - \mathbb{E}[D_i \mid X_i])^2\right]} = \frac{\mathbb{E}[\tau(X_i)\,\sigma_d^2(X_i)]}{\mathbb{E}[\sigma_d^2(X_i)]}$$
• $\sigma_d^2(x) = \mathbb{V}[D_i \mid X_i = x]$ is the conditional variance of treatment assignment

39/58
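The claim that the saturated-regression coefficient on $D_i$ is a treatment-variance-weighted average of within-stratum effects holds exactly in finite samples as well. A sketch (Python, hypothetical simulated strata, not from the slides): regress $Y_i$ on $D_i$ plus a full set of stratum dummies and compare the coefficient to the weighted average of within-stratum differences in means.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
x = rng.integers(0, 3, n)                      # discrete covariate with 3 strata
e = np.array([0.2, 0.5, 0.8])[x]               # stratum-specific propensities (assumed)
d = rng.binomial(1, e)
tau_x = np.array([-1.0, 0.0, 2.0])[x]          # heterogeneous effects (assumed)
y = 0.5 * x + tau_x * d + rng.normal(0, 1, n)

# saturated regression: Y on D plus one dummy per stratum (dummies span the intercept)
B = (x[:, None] == np.arange(3)).astype(float)
Xmat = np.column_stack([d, B])
tau_R = np.linalg.lstsq(Xmat, y, rcond=None)[0][0]

# treatment-variance-weighted average of within-stratum diff-in-means
num = den = 0.0
for s in range(3):
    m = x == s
    var_d = d[m].mean() * (1 - d[m].mean())    # sigma^2_d(x), MLE variance of D
    cate = y[m & (d == 1)].mean() - y[m & (d == 0)].mean()
    num += m.mean() * var_d * cate
    den += m.mean() * var_d
print(tau_R, num / den)                        # identical up to floating point
```

The equality follows from the Frisch-Waugh-Lovell theorem: residualizing $D_i$ on the dummies demeans it within strata, so each stratum contributes in proportion to its size times its treatment variance.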
ATE versus OLS

$$\tau_R = \mathbb{E}[\tau(X_i) W_i] = \sum_x \tau(x)\,\frac{\sigma_d^2(x)}{\mathbb{E}[\sigma_d^2(X_i)]}\,\Pr[X_i = x]$$
• Compare to the ATE:
$$\tau = \mathbb{E}[\tau(X_i)] = \sum_x \tau(x) \Pr[X_i = x]$$
• Both weight strata relative to their size, $\Pr[X_i = x]$
• OLS weights a stratum higher if the treatment variance in that stratum, $\sigma_d^2(x)$, is high relative to the average variance across strata, $\mathbb{E}[\sigma_d^2(X_i)]$
• The ATE weights strata only by their size

40/58
Regression weighting

$$W_i = \frac{\sigma_d^2(X_i)}{\mathbb{E}[\sigma_d^2(X_i)]}$$
• Why does OLS weight like this?
• OLS is a minimum-variance estimator: it gives more weight to more precise within-stratum estimates
• Within-stratum estimates are most precise when the treatment is evenly spread and thus has the highest variance
• If $D_i$ is binary, then we know the conditional variance will be
$$\sigma_d^2(x) = \Pr[D_i = 1 \mid X_i = x]\,(1 - \Pr[D_i = 1 \mid X_i = x]) = e(x)(1 - e(x))$$
• Maximum variance when $\Pr[D_i = 1 \mid X_i = x] = 1/2$

41/58
OLS weighting example

• Binary covariate:
$$\begin{aligned}
\Pr[X_i = 1] &= 0.75 & \Pr[X_i = 0] &= 0.25 \\
\Pr[D_i = 1 \mid X_i = 1] &= 0.9 & \Pr[D_i = 1 \mid X_i = 0] &= 0.5 \\
\sigma_d^2(1) &= 0.09 & \sigma_d^2(0) &= 0.25 \\
\tau(1) &= 1 & \tau(0) &= -1
\end{aligned}$$
• Implies the ATE is $\tau = 0.5$
• Average conditional variance: $\mathbb{E}[\sigma_d^2(X_i)] = 0.13$
• Weights: for $X_i = 1$, $0.09/0.13 = 0.692$; for $X_i = 0$, $0.25/0.13 = 1.92$
$$\begin{aligned}
\tau_R &= \mathbb{E}[\tau(X_i) W_i] \\
&= \tau(1) W(1) \Pr[X_i = 1] + \tau(0) W(0) \Pr[X_i = 0] \\
&= 1 \times 0.692 \times 0.75 + (-1) \times 1.92 \times 0.25 \\
&= 0.039
\end{aligned}$$

42/58
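A quick check of the arithmetic above (a Python sketch of the same two-stratum example):

```python
# strata: X = 1 and X = 0
p_x   = {1: 0.75, 0: 0.25}            # P[X_i = x]
e_x   = {1: 0.90, 0: 0.50}            # P[D_i = 1 | X_i = x]
tau_x = {1: 1.0,  0: -1.0}            # CATEs

var_d = {x: e_x[x] * (1 - e_x[x]) for x in p_x}    # 0.09 and 0.25
avg_var = sum(p_x[x] * var_d[x] for x in p_x)      # E[sigma^2_d(X_i)] = 0.13

tau = sum(p_x[x] * tau_x[x] for x in p_x)          # ATE = 0.5
tau_R = sum(p_x[x] * tau_x[x] * var_d[x] / avg_var for x in p_x)
print(tau, tau_R)   # exact value is 0.005/0.13 ~ 0.0385; the slide's 0.039
                    # comes from rounding the weights to 0.692 and 1.92
```

The sign of the CATEs barely matters for the ATE here, but the OLS weighting nearly cancels the estimand to zero because the small stratum has far higher treatment variance.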
When will OLS estimate the ATE?

• When does $\tau = \tau_R$?
• Constant treatment effects: $\tau(x) = \tau = \tau_R$
• Constant probability of treatment: $e(x) = \Pr[D_i = 1 \mid X_i = x] = e$
- Implies that the OLS weights are 1
• An incorrect linearity assumption in $X_i$ will lead to more bias

43/58
Other ways to use regression

• What's the path forward?
- Accept the bias (it might be relatively small with saturated models)
- Use a different regression approach
• Let $\mu_d(x) = \mathbb{E}[Y_i(d) \mid X_i = x]$ be the CEF for the potential outcome under $D_i = d$
• By consistency and no unmeasured confounders, we have $\mu_d(x) = \mathbb{E}[Y_i \mid D_i = d, X_i = x]$
• Estimate a regression of $Y_i$ on $X_i$ among the $D_i = d$ group
• Then $\hat{\mu}_d(x)$ is just a predicted value from that regression at $X_i = x$
• How can we use this?

44/58
Imputation estimators

• Impute the treated potential outcomes with $\hat{Y}_i(1) = \hat{\mu}_1(X_i)$
• Impute the control potential outcomes with $\hat{Y}_i(0) = \hat{\mu}_0(X_i)$
• Procedure:
- Regress $Y_i$ on $X_i$ in the treated group and get predicted values for all units (treated or control)
- Regress $Y_i$ on $X_i$ in the control group and get predicted values for all units (treated or control)
- Take the average difference between these predicted values
• More formally, this looks like:
$$\hat{\tau}_{\text{imp}} = \frac{1}{N}\sum_i \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)$$
• Sometimes called an imputation estimator

45/58
Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample
• Useful trick: fit a model on the entire data and use model.frame() to get the right design matrix

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61

46/58
Notes on imputation estimators

• If the $\hat{\mu}_d(x)$ are consistent estimators, then $\hat{\tau}_{\text{imp}}$ is consistent for the ATE
• Why don't people use this?
- Most people don't know the results we've been talking about
- Harder to implement than vanilla OLS
• Can use linear regression to estimate $\hat{\mu}_d(x) = x'\hat{\beta}_d$
• A recent trend is to estimate $\hat{\mu}_d(x)$ via non-parametric methods, such as:
- kernel regression, local linear regression, regression trees, etc.
- Easiest is generalized additive models (GAMs)

47/58
Imputation estimator visualization

[Figure: y vs. x scatterplot, treated and control]

48/58

Imputation estimator visualization

[Figure: y vs. x scatterplot, treated and control]

49/58

Imputation estimator visualization

[Figure: y vs. x scatterplot, treated and control]

50/58
Nonlinear relationships

• Same idea, but with a nonlinear relationship between $Y_i$ and $X_i$:

[Figure: y vs. x scatterplot, treated and control]

51/58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between $Y_i$ and $X_i$:

[Figure: y vs. x scatterplot, treated and control]

52/58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between $Y_i$ and $X_i$:

[Figure: y vs. x scatterplot, treated and control]

53/58
Using semiparametric regression

• Here the CEFs are nonlinear, but we don't know their form
• We can use GAMs from the mgcv package to flexibly estimate them:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204  Scale est. = 0.0071351  n = 30

54/58
Using GAMs

[Figure: y vs. x scatterplot with GAM fits, treated and control]

55/58

Using GAMs

[Figure: y vs. x scatterplot with GAM fits, treated and control]

56/58

Using GAMs

[Figure: y vs. x scatterplot with GAM fits, treated and control]

57/58
Limited dependent variables

• The usual advice: model the data from first principles
- logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is OK with limited DVs when:
- binary treatment and no covariates (just the difference in means)
- binary treatment, discrete covariates, and saturated models (stratified differences in means)
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models
- Could go wrong in small samples
- If using nonlinear models, always get effects on the scale of the outcome

58/58
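The first bullet under "OLS is OK" can be seen directly (a Python sketch with hypothetical simulated data): with a binary outcome and a binary treatment, the OLS slope is exactly the difference in proportions, so no distributional model for $Y_i$ is invoked.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000
d = rng.binomial(1, 0.4, n)
# binary outcome with different success probabilities by treatment (assumed)
y = rng.binomial(1, np.where(d == 1, 0.7, 0.3))

# OLS of the binary outcome on the binary treatment
Xmat = np.column_stack([np.ones(n), d])
slope = np.linalg.lstsq(Xmat, y, rcond=None)[0][1]

diff = y[d == 1].mean() - y[d == 0].mean()
print(slope, diff)   # identical: the linear probability model here is saturated
```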
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
Estimating the variance

• In large samples, then:
$$\sqrt{N}(\hat{\beta} - \beta) \sim N(0, \Omega)$$
• How to estimate $\Omega$? The plug-in principle again:
$$\hat{\Omega} = \left[\sum_i X_i X_i'\right]^{-1} \left[\sum_i X_i X_i' \hat{e}_i^2\right] \left[\sum_i X_i X_i'\right]^{-1}$$
• Replace $e_i$ with its empirical counterpart, the residuals $\hat{e}_i = Y_i - X_i'\hat{\beta}$
• Replace the population moments of $X_i$ with their sample counterparts
• The square roots of the diagonal of this covariance matrix are the "robust" or Huber-White standard errors that Stata commonly reports

21/58
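The plug-in sandwich above maps directly to code (a numpy sketch with simulated heteroskedastic data, not from the slides). As a check on the implementation: with a single binary regressor, the HC0 variance of the slope reduces to $\hat{\sigma}_1^2/n_1 + \hat{\sigma}_0^2/n_0$, the familiar variance of a difference in means.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2_000
d = rng.binomial(1, 0.3, n)
# heteroskedastic errors: treated units have larger variance (assumed)
y = 1.0 + 2.0 * d + rng.normal(0, np.where(d == 1, 3.0, 1.0))

X = np.column_stack([np.ones(n), d])
beta = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta                          # residuals e_hat_i

bread = np.linalg.inv(X.T @ X)
meat = (X * e[:, None]**2).T @ X          # sum_i e_i^2 x_i x_i'
omega = bread @ meat @ bread              # HC0 sandwich estimate
robust_se = np.sqrt(np.diag(omega))

# with one binary regressor, HC0 slope variance = s1^2/n1 + s0^2/n0
s1 = e[d == 1].var()                      # MLE variances (divide by n_d)
s0 = e[d == 0].var()
check = np.sqrt(s1 / (d == 1).sum() + s0 / (d == 0).sum())
print(robust_se[1], check)                # identical up to floating point
```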
Heteroskedasticity
bull No assumptions of homoskedasticitybull Heteroskedaticity will definitely occur when
CEF is linear but the 1205901113569(119909) = 120141[119884119894|119883119894 = 119909] is not constant in 119909 119864[119884119894|119883119894] is not linear but we use the linear regression to
approxmiate it
22 58
2 Regression andCausality
23 58
Regression and causality
bull Most econometrics textbooks regression defined withoutrespect to causality
bull But then when is ldquobiasedrdquo The above derivations work forsome 120124[119884119894|119883119894]
bull The question then is when does knowing the CEF tell ussomething about causality
bull MHE argues that a regression is causal when the CEF itapproximates is causal Identification is king
bull We will show that under certain conditions a regression of theoutcome on the treatment and the covariates can recover acausal parameter but perhaps not the one in which we areinterested
24 58
Review
bull Quick reminder we have potential outcomes 119884119894(1) and 119884119894(0)and two parameters the ATE and ATT
120591 = 119864[119884119894(1) minus 119884119894(0)]120591ATT = 119864[119884119894(1) minus 119884119894(0)|119863119894 = 1]
bull We have shown in past weeks that these effects are identifiedwhen ignorability holds MHE calls this the conditionalindependence assumption (CIA)
25 58
Linear constant effects model binarytreatment
bull Experiment with a simple experiment we can rewrite theconsistency assumption to be a regression formula
119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894
= 120124[119884119894(0)] + 120591119863119894 + (119884119894(0) minus 120124[119884119894(0)])= 1205831113567 + 120591119863119894 + 1199071113567119894
bull Note that if ignorability holds (as in an experiment) for 119884119894(0)then it will also hold for 1199071113567119894 since 120124[119884119894(0)] is constant Thusthis satifies the usual assumptions for regression
26 58
Now with covariates
bull Now assume no unmeasured confounders 119884119894(119889) โโ 119863119894|119883119894bull We will assume a linear model for the potential outcomes
119884119894(119889) = 120572 + 120591 sdot 119889 + 120578119894
bull Remember that linearity isnrsquot an assumption if 119863119894 is binarybull Effect of 119863119894 is constant here the 120578119894 are the only source of
individual variation and we have 119864[120578119894] = 0bull Consistency assumption allows us to write this as
119884119894 = 120572 + 120591119863119894 + 120578119894
27 58
Covariates in the error
bull Letrsquos assume that 120578119894 is linear in 119883119894 120578119894 = 119883prime119894 120574 + 120584119894
bull New error is uncorrelated with 119883119894 120124[120584119894|119883119894] = 0bull This is an assumption Might be falsebull Plug into the above
120124[119884119894(119889)|119883119894] = 119864[119884119894|119863119894 119883119894] = 120572 + 120591119863119894 + 119864[120578119894|119883119894]= 120572 + 120591119863119894 + 119883prime
119894 120574 + 119864[120584119894|119883119894]= 120572 + 120591119863119894 + 119883prime
119894 120574
28 58
Summing up regression with constanteffects
bull Reviewing the assumptions wersquove used no unmeasured confounders constant treatment effects linearity of the treatmentcovariates
bull Under these we can run the following regression to estimatethe ATE 120591
119884119894 = 120572 + 120591119863119894 + 119883prime119894 120574 + 120584119894
bull Works with continuous or ordinal 119863119894 if linearity in the effect ofthese variables is truly linear
29 58
OLS constant effects simulation
Model with linear covariates constant 0 effect of treatment
library(mvtnorm)
n lt- 100
p lt- 4
X lt- rmvnorm(n = 100 mean = rep(0 p))
gamma lt- c(274 137 137 137)
y lt- 210 + X c(gamma) + rnorm(n)
alpha lt- c(-1 05 -05 -01)
dprobs lt- bootinvlogit(X alpha)
d lt- rbinom(n size = 1 prob = dprobs)
30 58
OLS with no covariates
summary(lm(y ~ d))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 22036 458 4813 lt 2e-16
d -2869 654 -439 0000029
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 33 on 98 degrees of freedom
Multiple R-squared 0164 Adjusted R-squared 0156
F-statistic 192 on 1 and 98 DF p-value 0000029
31 58
OLS with covariates
summary(lm(y ~ d + X))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 209952 0159 13207 lt2e-16
d -0128 0253 -05 062
X1 27368 0126 2175 lt2e-16
X2 13677 0114 1201 lt2e-16
X3 13673 0130 1051 lt2e-16
X4 13570 0106 1280 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 1 on 94 degrees of freedom
Multiple R-squared 0999 Adjusted R-squared 0999
F-statistic 244e+04 on 5 and 94 DF p-value lt2e-16
32 58
What happens with nonlinearity
Suppose we can only observe the following covariates
z1 lt- exp(X[ 1]2)
z2 lt- X[ 2](1 + exp(X[ 1])) + 10
z3 lt- (X[ 1] X[ 3]25 + 06)^3
z4 lt- (X[ 2] + X[ 4] + 20)^2
Implies that 119884119894 and 119863119894 are functions of log(1198851198941113568) 1198851198941113569 119885111356911989411135681198851198941113569
1 log(1198851198941113568) 1198851198941113570 log(1198851198941113568) and 119883111356811135691198941113571
Regression is a nonlinear function of the observed covariates
33 58
When linearity goes wrong
summary(lm(y ~ d + z1 + z2 + z3 + z4))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 218728 300799 073 0469
d -69292 33854 -205 0043
z1 363110 26970 1346 lt2e-16
z2 -29033 36619 -079 0430
z3 862022 431030 200 0048
z4 04021 00329 1223 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 14 on 94 degrees of freedom
Multiple R-squared 0855 Adjusted R-squared 0847
F-statistic 111 on 5 and 94 DF p-value lt2e-16
34 58
3 Regression withHeterogeneousTreatment Effects
35 58
Heterogeneous effects binary treatment
bull Completely randomized experiment
119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894
= 1205831113567 + 120591119894119863119894 + (119884119894(0) minus 1205831113567)= 1205831113567 + 120591119863119894 + (119884119894(0) minus 1205831113567) + (120591119894 minus 120591) sdot 119863119894
= 1205831113567 + 120591119863119894 + 120576119894
bull Error term now includes two components1 ldquoBaselinerdquo variation in the outcome (119884119894(0) minus 1205831113567)2 Variation in the treatment effect (120591119894 minus 120591)
bull Easy to verify that under experiment 120124[120576119894|119863119894] = 0bull Thus OLS estimates the ATE with no covariates
36 58
Adding covariates
bull What happens with no unmeasured confounders Need tocondition on 119883119894 now
bull Remember identification of the ATEATT using iteratedexpectations
bull ATE is the weighted sum of CATEs
120591 = 1114012119909120591(119909) Pr[119883119894 = 119909]
bull ATEATT are weighted averages of CATEsbull What about the regression estimand 120591119877 How does it related
to the ATEATT
37 58
Heterogeneous effects and regression
bull Letrsquos investigate this under a saturated regression model
119884119894 = 1114012119909119861119909119894120572119909 + 120591119877119863119894 + 119890119894
bull Use a dummy variable for each unique combination of 119883119894119861119909119894 = 120128(119883119894 = 119909)
bull Linear in 119883119894 by construction
38 58
Investigating the regression coefficient
bull How can we investigate 120591119877 Well we can rely on theregression anatomy
120591119877 = Cov(119884119894 119863119894 minus 119864[119863119894|119883119894])120141(119863119894 minus 119864[119863119894|119883119894])
bull 119863119894 minus 120124[119863119894|119883119894] is the residual from a regression of 119863119894 on the fullset of dummies
bull With a little work we can show
120591119877 =120124 1114094120591(119883119894)(119863119894 minus 120124[119863119894|119883119894])11135691114097
120124[(119863119894 minus 119864[119863119894|119883119894])1113569]=
120124[120591(119883119894)1205901113569119889(119883119894)]120124[1205901113569119889(119883119894)]
bull 1205901113569119889(119909) = 120141[119863119894|119883119894 = 119909] is the conditional variance of treatmentassignment
39 58
ATE versus OLS
120591119877 = 120124[120591(119883119894)119882119894] = 1114012119909120591(119909)
1205901113569119889(119909)120124[1205901113569119889(119883119894)]
โ[119883119894 = 119909]
bull Compare to the ATE
120591 = 120124[120591(119883119894)] = 1114012119909120591(119909)โ[119883119894 = 119909]
bull Both weight strata relative to their size (โ[119883119894 = 119909])bull OLS weights strata higher if the treatment variance in those
strata (1205901113569119889(119909)) is higher in those strata relative to the averagevariance across strata (120124[1205901113569119889(119883119894)])
bull The ATE weights only by their size
40 58
Regression weighting
119882119894 =1205901113569119889(119883119894)
120124[1205901113569119889(119883119894)]
bull Why does OLS weight like thisbull OLS is a minimum-variance estimator more weight to
more precise within-strata estimatesbull Within-strata estimates are most precise when the treatment
is evenly spread and thus has the highest variancebull If 119863119894 is binary then we know the conditional variance will be
1205901113569119889(119909) = โ[119863119894 = 1|119883119894 = 119909] (1 minus โ[119863119894 = 1|119883119894 = 119909])= 119890(119909) (1 minus 119890(119909))
bull Maximum variance with โ[119863119894 = 1|119883119894 = 119909] = 12
41 58
OLS weighting examplebull Binary covariate
โ[119883119894 = 1] = 075 โ[119883119894 = 0] = 025โ[119863119894 = 1|119883119894 = 1] = 09 โ[119863119894 = 1|119883119894 = 0] = 05
1205901113569119889(1) = 009 1205901113569119889(0) = 025120591(1) = 1 120591(0) = minus1
bull Implies the ATE is 120591 = 05bull Average conditional variance 120124[1205901113569119889(119883119894)] = 013bull weights for 119883119894 = 1 are 009013 = 0692 for 119883119894 = 0
025013 = 192120591119877 = 120124[120591(119883119894)119882119894]
= 120591(1)119882(1)โ[119883119894 = 1] + 120591(0)119882(0)โ[119883119894 = 0]= 1 times 0692 times 075 + minus1 times 192 times 025= 0039
42 58
When will OLS estimate the ATE
bull When does 120591 = 120591119877bull Constant treatment effects 120591(119909) = 120591 = 120591119877bull Constant probability of treatment 119890(119909) = โ[119863119894 = 1|119883119894 = 119909] = 119890
Implies that the OLS weights are 1
bull Incorrect linearity assumption in 119883119894 will lead to more bias
43 58
Other ways to use regression
bull Whatrsquos the path forward Accept the bias (might be relatively small with saturated
models) Use a different regression approach
bull Let 120583119889(119909) = 120124[119884119894(119889)|119883119894 = 119909] be the CEF for the potentialoutcome under 119863119894 = 119889
bull By consistency and nuc we have 120583119889(119909) = 120124[119884119894|119863119894 = 119889119883119894 = 119909]bull Estimate a regression of 119884119894 on 119883119894 among the 119863119894 = 119889 groupbull Then 119889(119909) is just a predicted value from the regression for
119883119894 = 119909bull How can we use this
44 58
Imputation estimators
bull Impute the treated potential outcomes with 1114023119884119894(1) = 1113568(119883119894)bull Impute the control potential outcomes with 1114023119884119894(0) = 1113567(119883119894)bull Procedure
Regress 119884119894 on 119883119894 in the treated group and get predicted valuesfor all units (treated or control)
Regress 119884119894 on 119883119894 in the control group and get predicted valuesfor all units (treated or control)
Take the average difference between these predicted values
bull More mathematically look like this
120591119894119898119901 =1119873
11140121198941113568(119883119894) minus 1113567(119883119894)
bull Sometimes called an imputation estimator
45 58
Simple imputation estimator
bull Use predict() from the within-group models on the data fromthe entire sample
bull Useful trick use a model on the entire data andmodelframe() to get the right design matrix
heterogeneous effects
yhet lt- ifelse(d == 1 y + rnorm(n 0 5) y)
mod lt- lm(yhet ~ d + X)
mod1 lt- lm(yhet ~ X subset = d == 1)
mod0 lt- lm(yhet ~ X subset = d == 0)
y1imps lt- predict(mod1 modelframe(mod))
y0imps lt- predict(mod0 modelframe(mod))
mean(y1imps - y0imps)
[1] 061
46 58
Notes on imputation estimators
bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE
bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS
bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods
such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)
47 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
48 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
49 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
50 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
51 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
52 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
53 58
Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible
estimatelibrary(mgcv)
mod0 lt- gam(y ~ s(x) subset = d == 0)
summary(mod0)
Family gaussian
Link function identity
Formula
y ~ s(x)
Parametric coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) -00225 00154 -146 016
Approximate significance of smooth terms
edf Refdf F p-value
s(x) 603 708 413 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
R-sq(adj) = 0909 Deviance explained = 928
GCV = 00093204 Scale est = 00071351 n = 30
54 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
55 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
56 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
57 58
Limited dependent variables
bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc
bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models
(stratified diff-in-means)
bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong
bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models
Could go wrong in small samples If using nonlinear models always get effects on the scale of the
outcome
58 58
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
-
Heteroskedasticity
bull No assumptions of homoskedasticitybull Heteroskedaticity will definitely occur when
CEF is linear but the 1205901113569(119909) = 120141[119884119894|119883119894 = 119909] is not constant in 119909 119864[119884119894|119883119894] is not linear but we use the linear regression to
approxmiate it
22 58
2 Regression andCausality
23 58
Regression and causality
bull Most econometrics textbooks regression defined withoutrespect to causality
bull But then when is ldquobiasedrdquo The above derivations work forsome 120124[119884119894|119883119894]
bull The question then is when does knowing the CEF tell ussomething about causality
bull MHE argues that a regression is causal when the CEF itapproximates is causal Identification is king
bull We will show that under certain conditions a regression of theoutcome on the treatment and the covariates can recover acausal parameter but perhaps not the one in which we areinterested
24 58
Review
bull Quick reminder we have potential outcomes 119884119894(1) and 119884119894(0)and two parameters the ATE and ATT
120591 = 119864[119884119894(1) minus 119884119894(0)]120591ATT = 119864[119884119894(1) minus 119884119894(0)|119863119894 = 1]
bull We have shown in past weeks that these effects are identifiedwhen ignorability holds MHE calls this the conditionalindependence assumption (CIA)
25 58
Linear constant effects model binarytreatment
bull Experiment with a simple experiment we can rewrite theconsistency assumption to be a regression formula
119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894
= 120124[119884119894(0)] + 120591119863119894 + (119884119894(0) minus 120124[119884119894(0)])= 1205831113567 + 120591119863119894 + 1199071113567119894
bull Note that if ignorability holds (as in an experiment) for 119884119894(0)then it will also hold for 1199071113567119894 since 120124[119884119894(0)] is constant Thusthis satifies the usual assumptions for regression
26 58
Now with covariates

• Now assume no unmeasured confounders: Y_i(d) ⊥⊥ D_i | X_i
• We will assume a linear model for the potential outcomes:

  Y_i(d) = α + τ·d + η_i

• Remember that linearity isn't an assumption if D_i is binary
• The effect of D_i is constant; here the η_i are the only source of individual variation, and we have E[η_i] = 0
• The consistency assumption allows us to write this as:

  Y_i = α + τ D_i + η_i
27 58
Covariates in the error

• Let's assume that η_i is linear in X_i: η_i = X_i'γ + ν_i
• The new error is uncorrelated with X_i: E[ν_i | X_i] = 0
• This is an assumption! It might be false!
• Plug into the above:

  E[Y_i(d) | X_i] = E[Y_i | D_i, X_i] = α + τ D_i + E[η_i | X_i]
                  = α + τ D_i + X_i'γ + E[ν_i | X_i]
                  = α + τ D_i + X_i'γ
28 58
Summing up regression with constant effects

• Reviewing the assumptions we've used: no unmeasured confounders; constant treatment effects; linearity of the treatment/covariates
• Under these, we can run the following regression to estimate the ATE, τ:

  Y_i = α + τ D_i + X_i'γ + ν_i

• Works with continuous or ordinal D_i if the effect of these variables is truly linear
29 58
OLS, constant effects simulation

Model with linear covariates and a constant 0 effect of treatment:

library(mvtnorm)
n <- 100
p <- 4
X <- rmvnorm(n = 100, mean = rep(0, p))
gamma <- c(2.74, 1.37, 1.37, 1.37)
y <- 21.0 + X %*% c(gamma) + rnorm(n)
alpha <- c(-1, 0.5, -0.5, -0.1)
d.probs <- boot::inv.logit(X %*% alpha)
d <- rbinom(n, size = 1, prob = d.probs)
30 58
OLS with no covariates

summary(lm(y ~ d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   22.036      0.458   48.13  < 2e-16
d             -2.869      0.654   -4.39 0.000029
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.3 on 98 degrees of freedom
Multiple R-squared: 0.164, Adjusted R-squared: 0.156
F-statistic: 19.2 on 1 and 98 DF, p-value: 0.000029
31 58
OLS with covariates

summary(lm(y ~ d + X))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  20.9952     0.159  132.07   <2e-16
d            -0.128      0.253   -0.5      0.62
X1            2.7368     0.126   21.75   <2e-16
X2            1.3677     0.114   12.01   <2e-16
X3            1.3673     0.130   10.51   <2e-16
X4            1.3570     0.106   12.80   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom
Multiple R-squared: 0.999, Adjusted R-squared: 0.999
F-statistic: 2.44e+04 on 5 and 94 DF, p-value: <2e-16
32 58
What happens with nonlinearity?

Suppose we can only observe the following covariates:

z1 <- exp(X[, 1]/2)
z2 <- X[, 2]/(1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3]/25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

Implies that Y_i and D_i are functions of log(Z_{i1}), Z_{i2}, Z_{i1}² Z_{i2}, 1/log(Z_{i1}), Z_{i3}^{1/3}/log(Z_{i1}), and Z_{i4}^{1/2}.

Regression is a nonlinear function of the observed covariates.
33 58
When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.18728    3.00799    0.73    0.469
d           -0.69292    0.33854   -2.05    0.043
z1           3.63110    0.26970   13.46   <2e-16
z2          -0.29033    0.36619   -0.79    0.430
z3           8.62022    4.31030    2.00    0.048
z4           0.4021     0.0329    12.23   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.4 on 94 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF, p-value: <2e-16
34 58
3. Regression with Heterogeneous Treatment Effects
35 58
Heterogeneous effects, binary treatment

• Completely randomized experiment:

  Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
      = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
      = μ_0 + τ_i D_i + (Y_i(0) − μ_0)
      = μ_0 + τ D_i + (Y_i(0) − μ_0) + (τ_i − τ)·D_i
      = μ_0 + τ D_i + ε_i

• The error term now includes two components:
  1. "Baseline" variation in the outcome: (Y_i(0) − μ_0)
  2. Variation in the treatment effect: (τ_i − τ)
• Easy to verify that under an experiment, E[ε_i | D_i] = 0
• Thus, OLS estimates the ATE with no covariates
36 58
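This claim is easy to check by simulation. A hedged sketch (Python, with made-up parameters) where effects vary across units but treatment is completely randomized:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
d = rng.binomial(1, 0.5, size=n)          # completely randomized treatment
tau_i = rng.normal(2.0, 1.0, size=n)      # unit-level effects, ATE = 2
y0 = rng.normal(size=n)                   # Y_i(0)
y = y0 + tau_i * d                        # observed outcome

# OLS of y on (1, d): with randomization, the slope targets the ATE
X = np.column_stack([np.ones(n), d])
slope = np.linalg.lstsq(X, y, rcond=None)[0][1]
print(slope)                              # close to 2
```

Even though the regression "error" contains the effect heterogeneity (τ_i − τ)·D_i, randomization makes it mean-independent of D_i, so the slope is consistent for the ATE.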
Adding covariates

• What happens with no unmeasured confounders? We need to condition on X_i now.
• Remember identification of the ATE/ATT using iterated expectations
• The ATE is the weighted sum of CATEs:

  τ = Σ_x τ(x) Pr[X_i = x]

• The ATE/ATT are weighted averages of CATEs
• What about the regression estimand, τ_R? How does it relate to the ATE/ATT?
37 58
Heterogeneous effects and regression

• Let's investigate this under a saturated regression model:

  Y_i = Σ_x B_{xi} α_x + τ_R D_i + e_i

• Use a dummy variable for each unique combination of X_i: B_{xi} = 1(X_i = x)
• Linear in X_i by construction
38 58
Investigating the regression coefficient

• How can we investigate τ_R? We can rely on regression anatomy:

  τ_R = Cov(Y_i, D_i − E[D_i | X_i]) / Var(D_i − E[D_i | X_i])

• D_i − E[D_i | X_i] is the residual from a regression of D_i on the full set of dummies
• With a little work, we can show:

  τ_R = E[τ(X_i)(D_i − E[D_i | X_i])²] / E[(D_i − E[D_i | X_i])²] = E[τ(X_i)σ²_d(X_i)] / E[σ²_d(X_i)]

• σ²_d(x) = Var[D_i | X_i = x] is the conditional variance of treatment assignment
39 58
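Regression anatomy is straightforward to verify numerically. A Python sketch (the three-stratum setup is hypothetical) comparing the coefficient on D_i from a saturated regression with the ratio built from the residualized treatment:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
x = rng.integers(0, 3, size=n)                 # discrete covariate, 3 strata
e = np.array([0.2, 0.5, 0.8])[x]               # e(x): treatment probabilities
d = rng.binomial(1, e)
tau_x = np.array([-1.0, 0.0, 2.0])[x]          # CATEs tau(x)
y = x + tau_x * d + rng.normal(size=n)

# saturated OLS: one dummy per stratum plus D
B = (x[:, None] == np.arange(3)).astype(float)
tau_R = np.linalg.lstsq(np.column_stack([B, d]), y, rcond=None)[0][3]

# regression anatomy: residualize D on the dummies (within-stratum demeaning)
d_resid = d - np.array([d[x == k].mean() for k in range(3)])[x]
tau_anatomy = (y * d_resid).sum() / (d_resid ** 2).sum()
print(tau_R, tau_anatomy)                      # equal up to floating point
```

The match is exact (Frisch–Waugh–Lovell): regressing D_i on stratum dummies just demeans it within strata, so the saturated-model coefficient equals the simple ratio above.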
ATE versus OLS

  τ_R = E[τ(X_i) W_i] = Σ_x τ(x) · (σ²_d(x) / E[σ²_d(X_i)]) · Pr[X_i = x]

• Compare to the ATE:

  τ = E[τ(X_i)] = Σ_x τ(x) Pr[X_i = x]

• Both weight strata relative to their size (Pr[X_i = x])
• OLS weights a stratum higher if the treatment variance in that stratum (σ²_d(x)) is high relative to the average variance across strata (E[σ²_d(X_i)])
• The ATE weights strata only by their size
40 58
Regression weighting

  W_i = σ²_d(X_i) / E[σ²_d(X_i)]

• Why does OLS weight like this?
• OLS is a minimum-variance estimator: it puts more weight on more precise within-strata estimates
• Within-strata estimates are most precise when the treatment is evenly spread, and thus has the highest variance
• If D_i is binary, then we know the conditional variance will be:

  σ²_d(x) = Pr[D_i = 1 | X_i = x] (1 − Pr[D_i = 1 | X_i = x]) = e(x)(1 − e(x))

• Maximum variance when Pr[D_i = 1 | X_i = x] = 1/2
41 58
OLS weighting example

• Binary covariate:

  Pr[X_i = 1] = 0.75,  Pr[X_i = 0] = 0.25
  Pr[D_i = 1 | X_i = 1] = 0.9,  Pr[D_i = 1 | X_i = 0] = 0.5
  σ²_d(1) = 0.09,  σ²_d(0) = 0.25
  τ(1) = 1,  τ(0) = −1

• Implies the ATE is τ = 0.5
• Average conditional variance: E[σ²_d(X_i)] = 0.13
• Weights: for X_i = 1, 0.09/0.13 = 0.692; for X_i = 0, 0.25/0.13 = 1.92

  τ_R = E[τ(X_i) W_i]
      = τ(1) W(1) Pr[X_i = 1] + τ(0) W(0) Pr[X_i = 0]
      = 1 × 0.692 × 0.75 + (−1) × 1.92 × 0.25
      = 0.039
42 58
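The arithmetic above can be checked in a few lines. A small Python sketch of the same example (note the exact value is 0.005/0.13 ≈ 0.0385; the 0.039 above reflects the rounded weights):

```python
# strata sizes, propensities, and CATEs from the example above
p_x = {1: 0.75, 0: 0.25}           # Pr[X = x]
e_x = {1: 0.9,  0: 0.5}            # Pr[D = 1 | X = x]
tau = {1: 1.0,  0: -1.0}           # tau(x)

var_d   = {x: e_x[x] * (1 - e_x[x]) for x in (0, 1)}   # sigma^2_d(x)
avg_var = sum(var_d[x] * p_x[x] for x in (0, 1))       # E[sigma^2_d(X)] = 0.13
w       = {x: var_d[x] / avg_var for x in (0, 1)}      # OLS weights W(x)

ate   = sum(tau[x] * p_x[x] for x in (0, 1))           # 0.5
tau_R = sum(tau[x] * w[x] * p_x[x] for x in (0, 1))    # 0.005 / 0.13
print(ate, tau_R)
```

Because the low-variance stratum (X = 1) carries the positive effect, the variance weighting nearly cancels the ATE: OLS recovers ≈ 0.04 instead of 0.5.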
When will OLS estimate the ATE?

• When does τ = τ_R?
• Constant treatment effects: τ(x) = τ = τ_R
• Constant probability of treatment: e(x) = Pr[D_i = 1 | X_i = x] = e
  - Implies that the OLS weights are 1
• An incorrect linearity assumption in X_i will lead to more bias
43 58
Other ways to use regression

• What's the path forward?
  - Accept the bias (it might be relatively small with saturated models)
  - Use a different regression approach
• Let μ_d(x) = E[Y_i(d) | X_i = x] be the CEF for the potential outcome under D_i = d
• By consistency and no unmeasured confounders, we have μ_d(x) = E[Y_i | D_i = d, X_i = x]
• Estimate a regression of Y_i on X_i among the D_i = d group
• Then μ̂_d(x) is just a predicted value from the regression for X_i = x
• How can we use this?
44 58
Imputation estimators

• Impute the treated potential outcomes with Ŷ_i(1) = μ̂_1(X_i)
• Impute the control potential outcomes with Ŷ_i(0) = μ̂_0(X_i)
• Procedure:
  - Regress Y_i on X_i in the treated group and get predicted values for all units (treated or control)
  - Regress Y_i on X_i in the control group and get predicted values for all units (treated or control)
  - Take the average difference between these predicted values
• More mathematically, it looks like this:

  τ̂_imp = (1/N) Σ_i [μ̂_1(X_i) − μ̂_0(X_i)]

• Sometimes called an imputation estimator
45 58
Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample
• Useful trick: fit a model on the entire data and use model.frame() to get the right design matrix

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61
46 58
Notes on imputation estimators

• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE
• Why don't people use this?
  - Most people don't know the results we've been talking about
  - Harder to implement than vanilla OLS
• Can use linear regression to estimate μ̂_d(x) = x'β̂_d
• Recent trend is to estimate μ̂_d(x) via non-parametric methods such as:
  - Kernel regression, local linear regression, regression trees, etc.
  - Easiest is generalized additive models (GAMs)
47 58
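As a complement to the R code above, here is a hedged Python sketch of the same imputation idea, with simulated data and linear group-specific fits (all names and parameter values are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
x = rng.normal(size=n)
d = rng.binomial(1, 1 / (1 + np.exp(-x)))    # treatment depends on x
y = 2.0 + x + (1.0 + 0.5 * x) * d + rng.normal(size=n)   # ATE = 1

def group_fit(mask):
    # OLS of y on (1, x) within one treatment group
    A = np.column_stack([np.ones(mask.sum()), x[mask]])
    return np.linalg.lstsq(A, y[mask], rcond=None)[0]

b1, b0 = group_fit(d == 1), group_fit(d == 0)
mu1 = b1[0] + b1[1] * x                      # imputed Y_i(1) for everyone
mu0 = b0[0] + b0[1] * x                      # imputed Y_i(0) for everyone
tau_imp = np.mean(mu1 - mu0)
print(tau_imp)                               # close to the ATE of 1
```

Here the effects are heterogeneous and treatment probability varies with x, yet the average difference of imputed potential outcomes targets the ATE because each group's CEF is correctly modeled.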
Imputation estimator visualization
[figure: scatterplot of y against x with separate fitted regression lines for treated and control units]
48 58
Imputation estimator visualization
[figure: scatterplot of y against x with separate fitted regression lines for treated and control units]
49 58
Imputation estimator visualization
[figure: scatterplot of y against x with separate fitted regression lines for treated and control units]
50 58
Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i

[figure: scatterplot of y against x with nonlinear CEFs for treated and control units]
51 58
Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i

[figure: scatterplot of y against x with nonlinear CEFs for treated and control units]
52 58
Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i

[figure: scatterplot of y against x with nonlinear CEFs for treated and control units]
53 58
Using semiparametric regression

• Here, the CEFs are nonlinear, but we don't know their form
• We can use GAMs from the mgcv package for a flexible estimate:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df   F p-value
s(x) 6.03   7.08 413  <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204   Scale est. = 0.0071351   n = 30
54 58
Using GAMs
[figure: treated and control GAM fits overlaid on the scatterplot of y against x]
55 58
Using GAMs
[figure: treated and control GAM fits overlaid on the scatterplot of y against x]
56 58
Using GAMs
[figure: treated and control GAM fits overlaid on the scatterplot of y against x]
57 58
Limited dependent variables

• Usual advice: model the data from first principles
  - Logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is A-OK with limited DVs when:
  - Binary treatment and no covariates (just diff-in-means)
  - Binary treatment, discrete covariates, and saturated models (stratified diff-in-means)
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models
  - Could go wrong in small samples
  - If using nonlinear models, always get effects on the scale of the outcome
58 58
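The saturated-model bullet can be verified directly: with a binary outcome, a binary treatment, one binary covariate, and a fully interacted OLS model, the implied within-stratum effects match the stratified differences in means exactly. A Python sketch (simulated data; the probabilities are made up):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30_000
x = rng.binomial(1, 0.4, size=n)                 # binary covariate
d = rng.binomial(1, np.where(x == 1, 0.7, 0.3))  # treatment depends on x
y = rng.binomial(1, 0.2 + 0.3 * x + 0.2 * d)     # binary (limited) outcome

# saturated OLS: intercept, x, d, and x:d interaction
A = np.column_stack([np.ones(n), x, d, x * d]).astype(float)
b = np.linalg.lstsq(A, y, rcond=None)[0]

for k in (0, 1):
    dim = y[(x == k) & (d == 1)].mean() - y[(x == k) & (d == 0)].mean()
    print(k, dim, b[2] + k * b[3])               # identical in each stratum
```

Because the saturated model fits all four cell means perfectly, no distributional assumption is imposed on the binary outcome; OLS is simply reproducing stratified differences in means.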
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
-
2 Regression andCausality
23 58
Regression and causality
bull Most econometrics textbooks regression defined withoutrespect to causality
bull But then when is ldquobiasedrdquo The above derivations work forsome 120124[119884119894|119883119894]
bull The question then is when does knowing the CEF tell ussomething about causality
bull MHE argues that a regression is causal when the CEF itapproximates is causal Identification is king
bull We will show that under certain conditions a regression of theoutcome on the treatment and the covariates can recover acausal parameter but perhaps not the one in which we areinterested
24 58
Review
bull Quick reminder we have potential outcomes 119884119894(1) and 119884119894(0)and two parameters the ATE and ATT
120591 = 119864[119884119894(1) minus 119884119894(0)]120591ATT = 119864[119884119894(1) minus 119884119894(0)|119863119894 = 1]
bull We have shown in past weeks that these effects are identifiedwhen ignorability holds MHE calls this the conditionalindependence assumption (CIA)
25 58
Linear constant effects model binarytreatment
bull Experiment with a simple experiment we can rewrite theconsistency assumption to be a regression formula
119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894
= 120124[119884119894(0)] + 120591119863119894 + (119884119894(0) minus 120124[119884119894(0)])= 1205831113567 + 120591119863119894 + 1199071113567119894
bull Note that if ignorability holds (as in an experiment) for 119884119894(0)then it will also hold for 1199071113567119894 since 120124[119884119894(0)] is constant Thusthis satifies the usual assumptions for regression
26 58
Now with covariates
bull Now assume no unmeasured confounders 119884119894(119889) โโ 119863119894|119883119894bull We will assume a linear model for the potential outcomes
119884119894(119889) = 120572 + 120591 sdot 119889 + 120578119894
bull Remember that linearity isnrsquot an assumption if 119863119894 is binarybull Effect of 119863119894 is constant here the 120578119894 are the only source of
individual variation and we have 119864[120578119894] = 0bull Consistency assumption allows us to write this as
119884119894 = 120572 + 120591119863119894 + 120578119894
27 58
Covariates in the error
bull Letrsquos assume that 120578119894 is linear in 119883119894 120578119894 = 119883prime119894 120574 + 120584119894
bull New error is uncorrelated with 119883119894 120124[120584119894|119883119894] = 0bull This is an assumption Might be falsebull Plug into the above
120124[119884119894(119889)|119883119894] = 119864[119884119894|119863119894 119883119894] = 120572 + 120591119863119894 + 119864[120578119894|119883119894]= 120572 + 120591119863119894 + 119883prime
119894 120574 + 119864[120584119894|119883119894]= 120572 + 120591119863119894 + 119883prime
119894 120574
28 58
Summing up regression with constanteffects
bull Reviewing the assumptions wersquove used no unmeasured confounders constant treatment effects linearity of the treatmentcovariates
bull Under these we can run the following regression to estimatethe ATE 120591
119884119894 = 120572 + 120591119863119894 + 119883prime119894 120574 + 120584119894
bull Works with continuous or ordinal 119863119894 if linearity in the effect ofthese variables is truly linear
29 58
OLS constant effects simulation
Model with linear covariates constant 0 effect of treatment
library(mvtnorm)
n lt- 100
p lt- 4
X lt- rmvnorm(n = 100 mean = rep(0 p))
gamma lt- c(274 137 137 137)
y lt- 210 + X c(gamma) + rnorm(n)
alpha lt- c(-1 05 -05 -01)
dprobs lt- bootinvlogit(X alpha)
d lt- rbinom(n size = 1 prob = dprobs)
30 58
OLS with no covariates
summary(lm(y ~ d))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 22036 458 4813 lt 2e-16
d -2869 654 -439 0000029
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 33 on 98 degrees of freedom
Multiple R-squared 0164 Adjusted R-squared 0156
F-statistic 192 on 1 and 98 DF p-value 0000029
31 58
OLS with covariates
summary(lm(y ~ d + X))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 209952 0159 13207 lt2e-16
d -0128 0253 -05 062
X1 27368 0126 2175 lt2e-16
X2 13677 0114 1201 lt2e-16
X3 13673 0130 1051 lt2e-16
X4 13570 0106 1280 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 1 on 94 degrees of freedom
Multiple R-squared 0999 Adjusted R-squared 0999
F-statistic 244e+04 on 5 and 94 DF p-value lt2e-16
32 58
What happens with nonlinearity
Suppose we can only observe the following covariates
z1 lt- exp(X[ 1]2)
z2 lt- X[ 2](1 + exp(X[ 1])) + 10
z3 lt- (X[ 1] X[ 3]25 + 06)^3
z4 lt- (X[ 2] + X[ 4] + 20)^2
Implies that 119884119894 and 119863119894 are functions of log(1198851198941113568) 1198851198941113569 119885111356911989411135681198851198941113569
1 log(1198851198941113568) 1198851198941113570 log(1198851198941113568) and 119883111356811135691198941113571
Regression is a nonlinear function of the observed covariates
33 58
When linearity goes wrong
summary(lm(y ~ d + z1 + z2 + z3 + z4))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 218728 300799 073 0469
d -69292 33854 -205 0043
z1 363110 26970 1346 lt2e-16
z2 -29033 36619 -079 0430
z3 862022 431030 200 0048
z4 04021 00329 1223 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 14 on 94 degrees of freedom
Multiple R-squared 0855 Adjusted R-squared 0847
F-statistic 111 on 5 and 94 DF p-value lt2e-16
34 58
3 Regression withHeterogeneousTreatment Effects
35 58
Heterogeneous effects binary treatment
bull Completely randomized experiment
119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894
= 1205831113567 + 120591119894119863119894 + (119884119894(0) minus 1205831113567)= 1205831113567 + 120591119863119894 + (119884119894(0) minus 1205831113567) + (120591119894 minus 120591) sdot 119863119894
= 1205831113567 + 120591119863119894 + 120576119894
bull Error term now includes two components1 ldquoBaselinerdquo variation in the outcome (119884119894(0) minus 1205831113567)2 Variation in the treatment effect (120591119894 minus 120591)
bull Easy to verify that under experiment 120124[120576119894|119863119894] = 0bull Thus OLS estimates the ATE with no covariates
36 58
Adding covariates
bull What happens with no unmeasured confounders Need tocondition on 119883119894 now
bull Remember identification of the ATEATT using iteratedexpectations
bull ATE is the weighted sum of CATEs
120591 = 1114012119909120591(119909) Pr[119883119894 = 119909]
bull ATEATT are weighted averages of CATEsbull What about the regression estimand 120591119877 How does it related
to the ATEATT
37 58
Heterogeneous effects and regression
bull Letrsquos investigate this under a saturated regression model
119884119894 = 1114012119909119861119909119894120572119909 + 120591119877119863119894 + 119890119894
bull Use a dummy variable for each unique combination of 119883119894119861119909119894 = 120128(119883119894 = 119909)
bull Linear in 119883119894 by construction
38 58
Investigating the regression coefficient
bull How can we investigate 120591119877 Well we can rely on theregression anatomy
120591119877 = Cov(119884119894 119863119894 minus 119864[119863119894|119883119894])120141(119863119894 minus 119864[119863119894|119883119894])
bull 119863119894 minus 120124[119863119894|119883119894] is the residual from a regression of 119863119894 on the fullset of dummies
bull With a little work we can show
120591119877 =120124 1114094120591(119883119894)(119863119894 minus 120124[119863119894|119883119894])11135691114097
120124[(119863119894 minus 119864[119863119894|119883119894])1113569]=
120124[120591(119883119894)1205901113569119889(119883119894)]120124[1205901113569119889(119883119894)]
bull 1205901113569119889(119909) = 120141[119863119894|119883119894 = 119909] is the conditional variance of treatmentassignment
39 58
ATE versus OLS
120591119877 = 120124[120591(119883119894)119882119894] = 1114012119909120591(119909)
1205901113569119889(119909)120124[1205901113569119889(119883119894)]
โ[119883119894 = 119909]
bull Compare to the ATE
120591 = 120124[120591(119883119894)] = 1114012119909120591(119909)โ[119883119894 = 119909]
bull Both weight strata relative to their size (โ[119883119894 = 119909])bull OLS weights strata higher if the treatment variance in those
strata (1205901113569119889(119909)) is higher in those strata relative to the averagevariance across strata (120124[1205901113569119889(119883119894)])
bull The ATE weights only by their size
40 58
Regression weighting
119882119894 =1205901113569119889(119883119894)
120124[1205901113569119889(119883119894)]
bull Why does OLS weight like thisbull OLS is a minimum-variance estimator more weight to
more precise within-strata estimatesbull Within-strata estimates are most precise when the treatment
is evenly spread and thus has the highest variancebull If 119863119894 is binary then we know the conditional variance will be
1205901113569119889(119909) = โ[119863119894 = 1|119883119894 = 119909] (1 minus โ[119863119894 = 1|119883119894 = 119909])= 119890(119909) (1 minus 119890(119909))
bull Maximum variance with โ[119863119894 = 1|119883119894 = 119909] = 12
41 58
OLS weighting examplebull Binary covariate
โ[119883119894 = 1] = 075 โ[119883119894 = 0] = 025โ[119863119894 = 1|119883119894 = 1] = 09 โ[119863119894 = 1|119883119894 = 0] = 05
1205901113569119889(1) = 009 1205901113569119889(0) = 025120591(1) = 1 120591(0) = minus1
bull Implies the ATE is 120591 = 05bull Average conditional variance 120124[1205901113569119889(119883119894)] = 013bull weights for 119883119894 = 1 are 009013 = 0692 for 119883119894 = 0
025013 = 192120591119877 = 120124[120591(119883119894)119882119894]
= 120591(1)119882(1)โ[119883119894 = 1] + 120591(0)119882(0)โ[119883119894 = 0]= 1 times 0692 times 075 + minus1 times 192 times 025= 0039
42 58
When will OLS estimate the ATE
bull When does 120591 = 120591119877bull Constant treatment effects 120591(119909) = 120591 = 120591119877bull Constant probability of treatment 119890(119909) = โ[119863119894 = 1|119883119894 = 119909] = 119890
Implies that the OLS weights are 1
bull Incorrect linearity assumption in 119883119894 will lead to more bias
43 58
Other ways to use regression
bull Whatrsquos the path forward Accept the bias (might be relatively small with saturated
models) Use a different regression approach
bull Let 120583119889(119909) = 120124[119884119894(119889)|119883119894 = 119909] be the CEF for the potentialoutcome under 119863119894 = 119889
bull By consistency and nuc we have 120583119889(119909) = 120124[119884119894|119863119894 = 119889119883119894 = 119909]bull Estimate a regression of 119884119894 on 119883119894 among the 119863119894 = 119889 groupbull Then 119889(119909) is just a predicted value from the regression for
119883119894 = 119909bull How can we use this
44 58
Imputation estimators
bull Impute the treated potential outcomes with 1114023119884119894(1) = 1113568(119883119894)bull Impute the control potential outcomes with 1114023119884119894(0) = 1113567(119883119894)bull Procedure
Regress 119884119894 on 119883119894 in the treated group and get predicted valuesfor all units (treated or control)
Regress 119884119894 on 119883119894 in the control group and get predicted valuesfor all units (treated or control)
Take the average difference between these predicted values
bull More mathematically look like this
120591119894119898119901 =1119873
11140121198941113568(119883119894) minus 1113567(119883119894)
bull Sometimes called an imputation estimator
45 58
Simple imputation estimator
bull Use predict() from the within-group models on the data fromthe entire sample
bull Useful trick use a model on the entire data andmodelframe() to get the right design matrix
heterogeneous effects
yhet lt- ifelse(d == 1 y + rnorm(n 0 5) y)
mod lt- lm(yhet ~ d + X)
mod1 lt- lm(yhet ~ X subset = d == 1)
mod0 lt- lm(yhet ~ X subset = d == 0)
y1imps lt- predict(mod1 modelframe(mod))
y0imps lt- predict(mod0 modelframe(mod))
mean(y1imps - y0imps)
[1] 061
46 58
Notes on imputation estimators
bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE
bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS
bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods
such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)
47 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
48 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
49 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
50 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
51 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
52 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
53 58
Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible
estimatelibrary(mgcv)
mod0 lt- gam(y ~ s(x) subset = d == 0)
summary(mod0)
Family gaussian
Link function identity
Formula
y ~ s(x)
Parametric coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) -00225 00154 -146 016
Approximate significance of smooth terms
edf Refdf F p-value
s(x) 603 708 413 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
R-sq(adj) = 0909 Deviance explained = 928
GCV = 00093204 Scale est = 00071351 n = 30
54 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
55 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
56 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
57 58
Limited dependent variables
bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc
bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models
(stratified diff-in-means)
bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong
bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models
Could go wrong in small samples If using nonlinear models always get effects on the scale of the
outcome
58 58
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
-
Regression and causality
bull Most econometrics textbooks regression defined withoutrespect to causality
bull But then when is ldquobiasedrdquo The above derivations work forsome 120124[119884119894|119883119894]
bull The question then is when does knowing the CEF tell ussomething about causality
bull MHE argues that a regression is causal when the CEF itapproximates is causal Identification is king
bull We will show that under certain conditions a regression of theoutcome on the treatment and the covariates can recover acausal parameter but perhaps not the one in which we areinterested
24 58
Review
bull Quick reminder we have potential outcomes 119884119894(1) and 119884119894(0)and two parameters the ATE and ATT
120591 = 119864[119884119894(1) minus 119884119894(0)]120591ATT = 119864[119884119894(1) minus 119884119894(0)|119863119894 = 1]
bull We have shown in past weeks that these effects are identifiedwhen ignorability holds MHE calls this the conditionalindependence assumption (CIA)
25 58
Linear constant effects model binarytreatment
bull Experiment with a simple experiment we can rewrite theconsistency assumption to be a regression formula
119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894
= 120124[119884119894(0)] + 120591119863119894 + (119884119894(0) minus 120124[119884119894(0)])= 1205831113567 + 120591119863119894 + 1199071113567119894
bull Note that if ignorability holds (as in an experiment) for 119884119894(0)then it will also hold for 1199071113567119894 since 120124[119884119894(0)] is constant Thusthis satifies the usual assumptions for regression
26 58
Now with covariates
bull Now assume no unmeasured confounders 119884119894(119889) โโ 119863119894|119883119894bull We will assume a linear model for the potential outcomes
119884119894(119889) = 120572 + 120591 sdot 119889 + 120578119894
bull Remember that linearity isnrsquot an assumption if 119863119894 is binarybull Effect of 119863119894 is constant here the 120578119894 are the only source of
individual variation and we have 119864[120578119894] = 0bull Consistency assumption allows us to write this as
119884119894 = 120572 + 120591119863119894 + 120578119894
27 58
Covariates in the error
bull Letrsquos assume that 120578119894 is linear in 119883119894 120578119894 = 119883prime119894 120574 + 120584119894
bull New error is uncorrelated with 119883119894 120124[120584119894|119883119894] = 0bull This is an assumption Might be falsebull Plug into the above
120124[119884119894(119889)|119883119894] = 119864[119884119894|119863119894 119883119894] = 120572 + 120591119863119894 + 119864[120578119894|119883119894]= 120572 + 120591119863119894 + 119883prime
119894 120574 + 119864[120584119894|119883119894]= 120572 + 120591119863119894 + 119883prime
119894 120574
28 58
Summing up regression with constanteffects
bull Reviewing the assumptions wersquove used no unmeasured confounders constant treatment effects linearity of the treatmentcovariates
bull Under these we can run the following regression to estimatethe ATE 120591
119884119894 = 120572 + 120591119863119894 + 119883prime119894 120574 + 120584119894
bull Works with continuous or ordinal 119863119894 if linearity in the effect ofthese variables is truly linear
29 58
OLS constant effects simulation

Model with linear covariates, constant 0 effect of treatment:

library(mvtnorm)
n <- 100
p <- 4
X <- rmvnorm(n = 100, mean = rep(0, p))
gamma <- c(2.74, 1.37, 1.37, 1.37)
y <- 21 + X %*% c(gamma) + rnorm(n)
alpha <- c(-1, 0.5, -0.5, -0.1)
d.probs <- boot::inv.logit(X %*% alpha)
d <- rbinom(n, size = 1, prob = d.probs)

30/58
OLS with no covariates

summary(lm(y ~ d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   22.036      0.458   48.13  < 2e-16
d             -2.869      0.654   -4.39 0.000029
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.3 on 98 degrees of freedom
Multiple R-squared: 0.164, Adjusted R-squared: 0.156
F-statistic: 19.2 on 1 and 98 DF, p-value: 0.000029

31/58
OLS with covariates

summary(lm(y ~ d + X))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  20.9952      0.159  132.07   <2e-16
d            -0.128       0.253   -0.50     0.62
X1            2.7368      0.126   21.75   <2e-16
X2            1.3677      0.114   12.01   <2e-16
X3            1.3673      0.130   10.51   <2e-16
X4            1.3570      0.106   12.80   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom
Multiple R-squared: 0.999, Adjusted R-squared: 0.999
F-statistic: 2.44e+04 on 5 and 94 DF, p-value: <2e-16

32/58
What happens with nonlinearity?

Suppose we can only observe the following covariates:

z1 <- exp(X[, 1]/2)
z2 <- X[, 2]/(1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3]/25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

Implies that Y_i and D_i are functions of log(Z_{i1}), Z_{i2}, Z_{i2}·Z_{i1}², Z_{i3}^{1/3}/log(Z_{i1}), and Z_{i4}^{1/2}

Regression is a nonlinear function of the observed covariates

33/58
When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.18728    3.00799    0.73    0.469
d           -0.69292    0.33854   -2.05    0.043
z1           3.63110    0.26970   13.46   <2e-16
z2          -0.29033    0.36619   -0.79    0.430
z3           8.62022    4.31030    2.00    0.048
z4           0.04021    0.00329   12.23   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.4 on 94 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF, p-value: <2e-16

34/58
3. Regression with Heterogeneous Treatment Effects

35/58
Heterogeneous effects, binary treatment

• Completely randomized experiment:

Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
    = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
    = μ_0 + τ_i D_i + (Y_i(0) − μ_0)
    = μ_0 + τD_i + (Y_i(0) − μ_0) + (τ_i − τ)·D_i
    = μ_0 + τD_i + ε_i

• Error term now includes two components:
  1. "Baseline" variation in the outcome: (Y_i(0) − μ_0)
  2. Variation in the treatment effect: (τ_i − τ)
• Easy to verify that, under an experiment, E[ε_i | D_i] = 0
• Thus, OLS estimates the ATE with no covariates

36/58
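The claim on this slide can be checked with a quick simulation: under complete randomization, the OLS slope on D_i recovers the ATE even when unit-level effects vary. This is a sketch with made-up parameters (an ATE of 2), not an example from the slides:

```r
## Sketch: under complete randomization, the OLS slope on D estimates the
## ATE even when unit-level effects tau_i vary (hypothetical parameters)
set.seed(02138)
n <- 100000
tau_i <- rnorm(n, mean = 2)          # heterogeneous effects, ATE = 2
y0 <- rnorm(n)                       # baseline variation in Y_i(0)
y1 <- y0 + tau_i
d <- rbinom(n, 1, 0.5)               # randomized treatment
y <- ifelse(d == 1, y1, y0)          # observed outcome
coef(lm(y ~ d))["d"]                 # approximately 2
```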
Adding covariates

• What happens with no unmeasured confounders? Need to condition on X_i now
• Remember identification of the ATE/ATT using iterated expectations
• ATE is the weighted sum of CATEs:

τ = Σ_x τ(x) P[X_i = x]

• ATE/ATT are weighted averages of CATEs
• What about the regression estimand, τ_R? How does it relate to the ATE/ATT?

37/58
Heterogeneous effects and regression

• Let's investigate this under a saturated regression model:

Y_i = Σ_x B_{xi} α_x + τ_R D_i + e_i

• Use a dummy variable for each unique combination of X_i: B_{xi} = 1(X_i = x)
• Linear in X_i by construction

38/58
Investigating the regression coefficient

• How can we investigate τ_R? Well, we can rely on the regression anatomy:

τ_R = Cov(Y_i, D_i − E[D_i | X_i]) / V(D_i − E[D_i | X_i])

• D_i − E[D_i | X_i] is the residual from a regression of D_i on the full set of dummies
• With a little work, we can show:

τ_R = E[τ(X_i)(D_i − E[D_i | X_i])²] / E[(D_i − E[D_i | X_i])²]
    = E[τ(X_i)σ²_d(X_i)] / E[σ²_d(X_i)]

• σ²_d(x) = V[D_i | X_i = x] is the conditional variance of treatment assignment

39/58
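The regression-anatomy identity can be illustrated numerically: by the Frisch-Waugh-Lovell theorem, the coefficient on D_i from the saturated regression equals the coefficient from regressing Y_i on the residualized treatment. This is a sketch with a hypothetical three-level covariate, not data from the slides:

```r
## Sketch: coefficient on D from the saturated model equals the coefficient
## from regressing Y on the residual of D given the X dummies (FWL theorem)
set.seed(02138)
n <- 5000
x <- sample(0:2, n, replace = TRUE)            # discrete covariate
d <- rbinom(n, 1, c(0.2, 0.5, 0.8)[x + 1])     # propensity varies by stratum
y <- x + (1 + x) * d + rnorm(n)                # heterogeneous effects by stratum
d_resid <- residuals(lm(d ~ factor(x)))        # D minus estimated E[D | X]
coef(lm(y ~ factor(x) + d))["d"]               # saturated regression
coef(lm(y ~ d_resid))["d_resid"]               # same number
```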
ATE versus OLS

τ_R = E[τ(X_i)W_i] = Σ_x τ(x) (σ²_d(x) / E[σ²_d(X_i)]) P[X_i = x]

• Compare to the ATE:

τ = E[τ(X_i)] = Σ_x τ(x) P[X_i = x]

• Both weight strata relative to their size, P[X_i = x]
• OLS weights strata higher if the treatment variance in those strata, σ²_d(x), is high relative to the average variance across strata, E[σ²_d(X_i)]
• The ATE weights only by their size

40/58
Regression weighting

W_i = σ²_d(X_i) / E[σ²_d(X_i)]

• Why does OLS weight like this?
• OLS is a minimum-variance estimator: more weight to more precise within-strata estimates
• Within-strata estimates are most precise when the treatment is evenly spread and thus has the highest variance
• If D_i is binary, then we know the conditional variance will be:

σ²_d(x) = P[D_i = 1 | X_i = x](1 − P[D_i = 1 | X_i = x]) = e(x)(1 − e(x))

• Maximum variance with P[D_i = 1 | X_i = x] = 1/2

41/58
OLS weighting example

• Binary covariate:

P[X_i = 1] = 0.75,  P[X_i = 0] = 0.25
P[D_i = 1 | X_i = 1] = 0.9,  P[D_i = 1 | X_i = 0] = 0.5
σ²_d(1) = 0.09,  σ²_d(0) = 0.25
τ(1) = 1,  τ(0) = −1

• Implies the ATE is τ = 0.5
• Average conditional variance: E[σ²_d(X_i)] = 0.13
• Weights: for X_i = 1, 0.09/0.13 = 0.692; for X_i = 0, 0.25/0.13 = 1.92

τ_R = E[τ(X_i)W_i]
    = τ(1)W(1)P[X_i = 1] + τ(0)W(0)P[X_i = 0]
    = 1 × 0.692 × 0.75 + (−1) × 1.92 × 0.25
    = 0.039

42/58
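The arithmetic in this example can be reproduced directly, using only the numbers from the slide:

```r
## Verify the OLS weighting example numerically (numbers from the slide)
p_x   <- c(0.25, 0.75)            # P[X_i = 0], P[X_i = 1]
e_x   <- c(0.5, 0.9)              # P[D_i = 1 | X_i = x]
tau_x <- c(-1, 1)                 # CATEs: tau(0), tau(1)
var_d <- e_x * (1 - e_x)          # sigma^2_d(x): 0.25, 0.09
avg_var <- sum(var_d * p_x)       # E[sigma^2_d(X_i)] = 0.13
w <- var_d / avg_var              # OLS weights: 1.92, 0.692
sum(tau_x * p_x)                  # ATE = 0.5
sum(tau_x * w * p_x)              # tau_R = 0.039
```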
When will OLS estimate the ATE?

• When does τ = τ_R?
• Constant treatment effects: τ(x) = τ = τ_R
• Constant probability of treatment: e(x) = P[D_i = 1 | X_i = x] = e
  - Implies that the OLS weights are 1
• Incorrect linearity assumption in X_i will lead to more bias

43/58
Other ways to use regression

• What's the path forward?
  - Accept the bias (might be relatively small with saturated models)
  - Use a different regression approach
• Let μ_d(x) = E[Y_i(d) | X_i = x] be the CEF for the potential outcome under D_i = d
• By consistency and no unmeasured confounders, we have μ_d(x) = E[Y_i | D_i = d, X_i = x]
• Estimate a regression of Y_i on X_i among the D_i = d group
• Then μ̂_d(x) is just a predicted value from the regression for X_i = x
• How can we use this?

44/58
Imputation estimators

• Impute the treated potential outcomes with Ŷ_i(1) = μ̂_1(X_i)
• Impute the control potential outcomes with Ŷ_i(0) = μ̂_0(X_i)
• Procedure:
  - Regress Y_i on X_i in the treated group and get predicted values for all units (treated or control)
  - Regress Y_i on X_i in the control group and get predicted values for all units (treated or control)
  - Take the average difference between these predicted values
• More mathematically, it looks like this:

τ̂_imp = (1/N) Σ_i [μ̂_1(X_i) − μ̂_0(X_i)]

• Sometimes called an imputation estimator

45/58
Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample
• Useful trick: fit a model on the entire data and use model.frame() to get the right design matrix

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61

46/58
Notes on imputation estimators

• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE
• Why don't people use this?
  - Most people don't know the results we've been talking about
  - Harder to implement than vanilla OLS
• Can use linear regression to estimate μ̂_d(x) = x'β̂_d
• Recent trend is to estimate μ̂_d(x) via non-parametric methods, such as:
  - Kernel regression, local linear regression, regression trees, etc.
  - Easiest is generalized additive models (GAMs)

47/58
Imputation estimator visualization

[Figures, slides 48-50/58: scatterplot of y against x for treated and control units, with the within-group regression fits and imputed values built up across the three slides]
Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i

[Figures, slides 51-53/58: y against x with nonlinear treated and control CEFs, fits built up across the three slides]
Using semiparametric regression

• Here, CEFs are nonlinear, but we don't know their form
• We can use GAMs from the mgcv package for a flexible estimate:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204  Scale est. = 0.0071351  n = 30

54/58
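The slide fits only the control-group GAM; a full imputation estimator fits both groups and averages the difference in predictions over the whole sample. A self-contained sketch on simulated data (the data-generating process and effect size of 0.5 here are invented for illustration, not the slides' example):

```r
## Sketch: GAM-based imputation estimator on simulated nonlinear CEFs
library(mgcv)
set.seed(02138)
n <- 500
x <- runif(n, -2, 2)
d <- rbinom(n, 1, plogis(x))                  # treatment depends on x
y <- sin(x) + 0.5 * d + rnorm(n, sd = 0.25)   # true ATE = 0.5
mod1 <- gam(y ~ s(x), subset = d == 1)        # treated-group CEF
mod0 <- gam(y ~ s(x), subset = d == 0)        # control-group CEF
newd <- data.frame(x = x)
mean(predict(mod1, newdata = newd) -
     predict(mod0, newdata = newd))           # close to 0.5
```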
Using GAMs

[Figures, slides 55-57/58: the treated- and control-group GAM fits plotted over the data, built up across the three slides]
Limited dependent variables

• Usual advice: model the data from first principles
  - Logit/probit for binary, Poisson for counts, etc.
• OLS is a-ok with limited DVs when:
  - Binary treatment and no covariates (just diff-in-means)
  - Binary treatment, discrete covariates, and saturated models (stratified diff-in-means)
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models
  - Could go wrong in small samples
  - If using nonlinear models, always get effects on the scale of the outcome

58/58
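One way to see the "marginal effect from OLS is often decent" point: compare the linear probability model's coefficient on the treatment to the logit average marginal effect on the probability scale. This sketch uses a hypothetical data-generating process, for illustration only:

```r
## Sketch: LPM coefficient vs. logit average marginal effect, binary outcome
set.seed(02138)
n <- 10000
x <- rnorm(n)
d <- rbinom(n, 1, 0.5)                          # randomized treatment
y <- rbinom(n, 1, plogis(-0.5 + 0.8 * d + x))   # binary outcome
lpm <- coef(lm(y ~ d + x))["d"]                 # OLS / linear probability model
logit <- glm(y ~ d + x, family = binomial)
ame <- mean(predict(logit, newdata = data.frame(d = 1, x = x), type = "response") -
            predict(logit, newdata = data.frame(d = 0, x = x), type = "response"))
c(lpm = unname(lpm), logit_ame = ame)           # typically very close
```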
Review

• Quick reminder: we have potential outcomes Y_i(1) and Y_i(0) and two parameters, the ATE and ATT:

τ = E[Y_i(1) − Y_i(0)]
τ_ATT = E[Y_i(1) − Y_i(0) | D_i = 1]

• We have shown in past weeks that these effects are identified when ignorability holds; MHE calls this the conditional independence assumption (CIA)

25/58

Linear constant effects model, binary treatment

• Experiment: with a simple experiment, we can rewrite the consistency assumption to be a regression formula:

Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
    = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
    = E[Y_i(0)] + τD_i + (Y_i(0) − E[Y_i(0)])
    = μ_0 + τD_i + v_{0i}

• Note that if ignorability holds (as in an experiment) for Y_i(0), then it will also hold for v_{0i}, since E[Y_i(0)] is constant. Thus this satisfies the usual assumptions for regression

26/58
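The algebra above implies that, with a binary treatment and no covariates, the OLS coefficient on D_i is exactly the difference in sample means. A minimal check on hypothetical simulated data:

```r
## Sketch: the OLS slope on a binary D equals the difference in means
set.seed(02138)
n <- 1000
d <- rbinom(n, 1, 0.5)
y <- 1 + 2 * d + rnorm(n)               # constant effect tau = 2
coef(lm(y ~ d))["d"]                    # OLS slope
mean(y[d == 1]) - mean(y[d == 0])       # identical value
```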
40 58
Regression weighting
119882119894 =1205901113569119889(119883119894)
120124[1205901113569119889(119883119894)]
bull Why does OLS weight like thisbull OLS is a minimum-variance estimator more weight to
more precise within-strata estimatesbull Within-strata estimates are most precise when the treatment
is evenly spread and thus has the highest variancebull If 119863119894 is binary then we know the conditional variance will be
1205901113569119889(119909) = โ[119863119894 = 1|119883119894 = 119909] (1 minus โ[119863119894 = 1|119883119894 = 119909])= 119890(119909) (1 minus 119890(119909))
bull Maximum variance with โ[119863119894 = 1|119883119894 = 119909] = 12
41 58
OLS weighting examplebull Binary covariate
โ[119883119894 = 1] = 075 โ[119883119894 = 0] = 025โ[119863119894 = 1|119883119894 = 1] = 09 โ[119863119894 = 1|119883119894 = 0] = 05
1205901113569119889(1) = 009 1205901113569119889(0) = 025120591(1) = 1 120591(0) = minus1
bull Implies the ATE is 120591 = 05bull Average conditional variance 120124[1205901113569119889(119883119894)] = 013bull weights for 119883119894 = 1 are 009013 = 0692 for 119883119894 = 0
025013 = 192120591119877 = 120124[120591(119883119894)119882119894]
= 120591(1)119882(1)โ[119883119894 = 1] + 120591(0)119882(0)โ[119883119894 = 0]= 1 times 0692 times 075 + minus1 times 192 times 025= 0039
42 58
When will OLS estimate the ATE
bull When does 120591 = 120591119877bull Constant treatment effects 120591(119909) = 120591 = 120591119877bull Constant probability of treatment 119890(119909) = โ[119863119894 = 1|119883119894 = 119909] = 119890
Implies that the OLS weights are 1
bull Incorrect linearity assumption in 119883119894 will lead to more bias
43 58
Other ways to use regression
bull Whatrsquos the path forward Accept the bias (might be relatively small with saturated
models) Use a different regression approach
bull Let 120583119889(119909) = 120124[119884119894(119889)|119883119894 = 119909] be the CEF for the potentialoutcome under 119863119894 = 119889
bull By consistency and nuc we have 120583119889(119909) = 120124[119884119894|119863119894 = 119889119883119894 = 119909]bull Estimate a regression of 119884119894 on 119883119894 among the 119863119894 = 119889 groupbull Then 119889(119909) is just a predicted value from the regression for
119883119894 = 119909bull How can we use this
44 58
Imputation estimators
bull Impute the treated potential outcomes with 1114023119884119894(1) = 1113568(119883119894)bull Impute the control potential outcomes with 1114023119884119894(0) = 1113567(119883119894)bull Procedure
Regress 119884119894 on 119883119894 in the treated group and get predicted valuesfor all units (treated or control)
Regress 119884119894 on 119883119894 in the control group and get predicted valuesfor all units (treated or control)
Take the average difference between these predicted values
bull More mathematically look like this
120591119894119898119901 =1119873
11140121198941113568(119883119894) minus 1113567(119883119894)
bull Sometimes called an imputation estimator
45 58
Simple imputation estimator
bull Use predict() from the within-group models on the data fromthe entire sample
bull Useful trick use a model on the entire data andmodelframe() to get the right design matrix
heterogeneous effects
yhet lt- ifelse(d == 1 y + rnorm(n 0 5) y)
mod lt- lm(yhet ~ d + X)
mod1 lt- lm(yhet ~ X subset = d == 1)
mod0 lt- lm(yhet ~ X subset = d == 0)
y1imps lt- predict(mod1 modelframe(mod))
y0imps lt- predict(mod0 modelframe(mod))
mean(y1imps - y0imps)
[1] 061
46 58
Notes on imputation estimators
bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE
bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS
bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods
such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)
47 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
48 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
49 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
50 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
51 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
52 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
53 58
Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible
estimatelibrary(mgcv)
mod0 lt- gam(y ~ s(x) subset = d == 0)
summary(mod0)
Family gaussian
Link function identity
Formula
y ~ s(x)
Parametric coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) -00225 00154 -146 016
Approximate significance of smooth terms
edf Refdf F p-value
s(x) 603 708 413 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
R-sq(adj) = 0909 Deviance explained = 928
GCV = 00093204 Scale est = 00071351 n = 30
54 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
55 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
56 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
57 58
Limited dependent variables
bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc
bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models
(stratified diff-in-means)
bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong
bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models
Could go wrong in small samples If using nonlinear models always get effects on the scale of the
outcome
58 58
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
-
Now with covariates

• Now assume no unmeasured confounders: Y_i(d) ⊥⊥ D_i | X_i
• We will assume a linear model for the potential outcomes:

    Y_i(d) = α + τ·d + η_i

• Remember that linearity isn't an assumption if D_i is binary
• The effect of D_i is constant here; the η_i are the only source of individual variation, and we have E[η_i] = 0
• The consistency assumption allows us to write this as:

    Y_i = α + τD_i + η_i

27/58
Covariates in the error

• Let's assume that η_i is linear in X_i: η_i = X_i′γ + ν_i
• New error is uncorrelated with X_i: E[ν_i | X_i] = 0
• This is an assumption! Might be false!
• Plug into the above:

    E[Y_i(d) | X_i] = E[Y_i | D_i, X_i] = α + τD_i + E[η_i | X_i]
                    = α + τD_i + X_i′γ + E[ν_i | X_i]
                    = α + τD_i + X_i′γ

28/58
Summing up regression with constant effects

• Reviewing the assumptions we've used:
  - no unmeasured confounders
  - constant treatment effects
  - linearity of the treatment/covariates
• Under these, we can run the following regression to estimate the ATE, τ:

    Y_i = α + τD_i + X_i′γ + ν_i

• Works with continuous or ordinal D_i if the effect of these variables is truly linear

29/58
OLS constant effects simulation

Model with linear covariates, constant 0 effect of treatment:

library(mvtnorm)
n <- 100
p <- 4
X <- rmvnorm(n = 100, mean = rep(0, p))
gamma <- c(2.74, 1.37, 1.37, 1.37)
y <- 21.0 + X %*% c(gamma) + rnorm(n)
alpha <- c(-1, 0.5, -0.5, -0.1)
d.probs <- boot::inv.logit(X %*% alpha)
d <- rbinom(n, size = 1, prob = d.probs)

30/58
OLS with no covariates

summary(lm(y ~ d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   22.036      0.458   48.13  < 2e-16
d             -2.869      0.654   -4.39 0.000029
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.3 on 98 degrees of freedom
Multiple R-squared: 0.164, Adjusted R-squared: 0.156
F-statistic: 19.2 on 1 and 98 DF, p-value: 0.000029

31/58
OLS with covariates

summary(lm(y ~ d + X))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  20.9952      0.159  132.07   <2e-16
d            -0.1280      0.253   -0.50     0.62
X1            2.7368      0.126   21.75   <2e-16
X2            1.3677      0.114   12.01   <2e-16
X3            1.3673      0.130   10.51   <2e-16
X4            1.3570      0.106   12.80   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom
Multiple R-squared: 0.999, Adjusted R-squared: 0.999
F-statistic: 2.44e+04 on 5 and 94 DF, p-value: <2e-16

32/58
What happens with nonlinearity?

Suppose we can only observe the following covariates:

z1 <- exp(X[, 1]/2)
z2 <- X[, 2]/(1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3]/25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2
Implies that Y_i and D_i are functions of log(Z_i1), Z_i2, Z_i1²·Z_i2, 1/log(Z_i1), Z_i3^(1/3)/log(Z_i1), and Z_i4^(1/2)

Regression is a nonlinear function of the observed covariates

33/58
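A quick numerical check of this claim (a Python sketch rather than the slides' R; the sample size and seed are mine): each original covariate can be recovered as a nonlinear function of the observed Z's, which is exactly why a model that is linear in Z is misspecified.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 4))
x1, x2, x3, x4 = X.T

# the slide's observed covariates
z1 = np.exp(x1 / 2)
z2 = x2 / (1 + np.exp(x1)) + 10
z3 = (x1 * x3 / 25 + 0.6) ** 3
z4 = (x2 + x4 + 20) ** 2

# inverting the transformations: X is nonlinear in Z
x1_back = 2 * np.log(z1)                       # uses z1^2 = exp(x1)
x2_back = (z2 - 10) * (1 + z1 ** 2)
x3_back = 25 * (np.cbrt(z3) - 0.6) / x1_back   # blows up only if x1 = 0 exactly
x4_back = np.sqrt(z4) - 20 - x2_back
```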
When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.18728    3.00799    0.73    0.469
d           -0.69292    0.33854   -2.05    0.043
z1           3.63110    0.26970   13.46   <2e-16
z2          -0.29033    0.36619   -0.79    0.430
z3           8.62022    4.31030    2.00    0.048
z4           0.04021    0.00329   12.23   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.4 on 94 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF, p-value: <2e-16

34/58
3/ Regression with Heterogeneous Treatment Effects

35/58
Heterogeneous effects, binary treatment

• Completely randomized experiment:

    Y_i = D_i·Y_i(1) + (1 − D_i)·Y_i(0)
        = Y_i(0) + (Y_i(1) − Y_i(0))·D_i
        = μ_0 + τ_i·D_i + (Y_i(0) − μ_0)
        = μ_0 + τ·D_i + (Y_i(0) − μ_0) + (τ_i − τ)·D_i
        = μ_0 + τ·D_i + ε_i

• Error term now includes two components:
  1. "Baseline" variation in the outcome: (Y_i(0) − μ_0)
  2. Variation in the treatment effect: (τ_i − τ)
• Easy to verify that under the experiment, E[ε_i | D_i] = 0
• Thus, OLS estimates the ATE with no covariates

36/58
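The last two bullets can be checked by simulation. A minimal sketch (in Python rather than the slides' R, with made-up parameters): under randomization, the OLS slope of Y on a binary D equals the difference in means and recovers the ATE of 2 even though the unit-level effects τ_i vary widely.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
y0 = rng.normal(0.0, 1.0, n)        # baseline outcomes Y_i(0)
tau_i = rng.normal(2.0, 3.0, n)     # heterogeneous unit effects, ATE = 2
y1 = y0 + tau_i                     # Y_i(1)
d = rng.binomial(1, 0.5, n)         # randomized treatment
y = np.where(d == 1, y1, y0)        # observed outcome

# OLS slope of Y on D: identical to the difference in means for binary D
slope = np.cov(y, d)[0, 1] / np.var(d, ddof=1)
diff_means = y[d == 1].mean() - y[d == 0].mean()
```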
Adding covariates

• What happens with no unmeasured confounders? Need to condition on X_i now
• Remember identification of the ATE/ATT using iterated expectations
• ATE is the weighted sum of CATEs:

    τ = Σ_x τ(x) Pr[X_i = x]

• ATE/ATT are weighted averages of CATEs
• What about the regression estimand, τ_R? How does it relate to the ATE/ATT?

37/58
Heterogeneous effects and regression

• Let's investigate this under a saturated regression model:

    Y_i = Σ_x B_xi·α_x + τ_R·D_i + e_i

• Use a dummy variable for each unique combination of X_i: B_xi = 1(X_i = x)
• Linear in X_i by construction

38/58
Investigating the regression coefficient

• How can we investigate τ_R? Well, we can rely on the regression anatomy:

    τ_R = Cov(Y_i, D_i − E[D_i | X_i]) / V(D_i − E[D_i | X_i])

• D_i − E[D_i | X_i] is the residual from a regression of D_i on the full set of dummies
• With a little work, we can show:

    τ_R = E[τ(X_i)·(D_i − E[D_i | X_i])²] / E[(D_i − E[D_i | X_i])²]
        = E[τ(X_i)·σ²_d(X_i)] / E[σ²_d(X_i)]

• σ²_d(x) = V[D_i | X_i = x] is the conditional variance of treatment assignment

39/58
ATE versus OLS

    τ_R = E[τ(X_i)·W_i] = Σ_x τ(x) · (σ²_d(x) / E[σ²_d(X_i)]) · P[X_i = x]

• Compare to the ATE:

    τ = E[τ(X_i)] = Σ_x τ(x)·P[X_i = x]

• Both weight strata relative to their size, P[X_i = x]
• OLS weights a stratum higher if the treatment variance in that stratum, σ²_d(x), is high relative to the average variance across strata, E[σ²_d(X_i)]
• The ATE weights strata only by their size

40/58
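This weighting result can be verified numerically. A sketch (Python, with strata and effects of my own choosing): the D coefficient from the saturated regression matches the variance-weighted formula, not the ATE.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.binomial(1, 0.75, n)              # binary covariate, P[X = 1] = 0.75
e = np.where(x == 1, 0.9, 0.5)            # propensity score e(x)
d = rng.binomial(1, e)                    # treatment
tau = np.where(x == 1, 1.0, -1.0)         # CATEs: tau(1) = 1, tau(0) = -1
y = 0.5 * x + tau * d + rng.normal(0.0, 1.0, n)

# saturated OLS: y ~ 1 + x + d (x binary, so the dummy model is just x)
Z = np.column_stack([np.ones(n), x, d])
coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
tau_R_hat = coef[2]

# variance-weighted formula: E[tau(X) sigma2_d(X)] / E[sigma2_d(X)]
s2 = e * (1 - e)
tau_R_formula = (tau * s2).mean() / s2.mean()
```

Here the ATE is 0.5, but both `tau_R_hat` and `tau_R_formula` come out near 0.04: OLS down-weights the large X = 1 stratum because its treatment variance is low.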
Regression weighting

    W_i = σ²_d(X_i) / E[σ²_d(X_i)]

• Why does OLS weight like this?
• OLS is a minimum-variance estimator: more weight to more precise within-strata estimates
• Within-strata estimates are most precise when the treatment is evenly spread and thus has the highest variance
• If D_i is binary, then we know the conditional variance will be:

    σ²_d(x) = P[D_i = 1 | X_i = x]·(1 − P[D_i = 1 | X_i = x])
            = e(x)·(1 − e(x))

• Maximum variance with P[D_i = 1 | X_i = x] = 1/2

41/58
OLS weighting example

• Binary covariate:

    P[X_i = 1] = 0.75,    P[X_i = 0] = 0.25
    P[D_i = 1 | X_i = 1] = 0.9,    P[D_i = 1 | X_i = 0] = 0.5
    σ²_d(1) = 0.09,    σ²_d(0) = 0.25
    τ(1) = 1,    τ(0) = −1

• Implies the ATE is τ = 0.5
• Average conditional variance: E[σ²_d(X_i)] = 0.13
• Weights: for X_i = 1, 0.09/0.13 = 0.692; for X_i = 0, 0.25/0.13 = 1.92

    τ_R = E[τ(X_i)·W_i]
        = τ(1)·W(1)·P[X_i = 1] + τ(0)·W(0)·P[X_i = 0]
        = 1 × 0.692 × 0.75 + −1 × 1.92 × 0.25
        = 0.039

42/58
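The arithmetic above can be reproduced directly; the exact value is 0.0385 (the slide's 0.039 comes from rounding the weights first).

```python
# Reproducing the slide's OLS weighting example exactly.
p_x = {1: 0.75, 0: 0.25}       # stratum sizes P[X = x]
e_x = {1: 0.9, 0: 0.5}         # propensity scores e(x)
tau_x = {1: 1.0, 0: -1.0}      # CATEs tau(x)

s2 = {x: e_x[x] * (1 - e_x[x]) for x in p_x}       # sigma^2_d(x): 0.09 and 0.25
avg_s2 = sum(s2[x] * p_x[x] for x in p_x)          # E[sigma^2_d(X)] = 0.13
ate = sum(tau_x[x] * p_x[x] for x in p_x)          # tau = 0.5
tau_R = sum(tau_x[x] * (s2[x] / avg_s2) * p_x[x] for x in p_x)
```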
When will OLS estimate the ATE?

• When does τ = τ_R?
• Constant treatment effects: τ(x) = τ = τ_R
• Constant probability of treatment: e(x) = P[D_i = 1 | X_i = x] = e
  - Implies that the OLS weights are 1
• Incorrect linearity assumption in X_i will lead to more bias

43/58
Other ways to use regression

• What's the path forward?
  - Accept the bias (might be relatively small with saturated models)
  - Use a different regression approach
• Let μ_d(x) = E[Y_i(d) | X_i = x] be the CEF for the potential outcome under D_i = d
• By consistency and no unmeasured confounders, we have μ_d(x) = E[Y_i | D_i = d, X_i = x]
• Estimate a regression of Y_i on X_i among the D_i = d group
• Then μ̂_d(x) is just a predicted value from the regression for X_i = x
• How can we use this?

44/58
Imputation estimators

• Impute the treated potential outcomes with Ŷ_i(1) = μ̂_1(X_i)
• Impute the control potential outcomes with Ŷ_i(0) = μ̂_0(X_i)
• Procedure:
  - Regress Y_i on X_i in the treated group and get predicted values for all units (treated or control)
  - Regress Y_i on X_i in the control group and get predicted values for all units (treated or control)
  - Take the average difference between these predicted values
• More formally, it looks like this:

    τ̂_imp = (1/N) Σ_i [μ̂_1(X_i) − μ̂_0(X_i)]

• Sometimes called an imputation estimator

45/58
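The procedure can be sketched numerically (in Python rather than R; the data-generating process is invented for illustration): regress within each treatment group, predict for everyone, and average the difference. With confounding, the naive difference in means is biased but the imputation estimator recovers the ATE.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
x = rng.normal(0.0, 1.0, n)                          # confounder
d = rng.binomial(1, 1 / (1 + np.exp(-x)))            # treatment depends on x
y = 1 + 2 * d + 1.5 * x + rng.normal(0.0, 1.0, n)    # true ATE = 2

def fit_predict(xs, ys, x_all):
    # OLS of y on (1, x) within one treatment group, predictions for all units
    Z = np.column_stack([np.ones(len(xs)), xs])
    beta, *_ = np.linalg.lstsq(Z, ys, rcond=None)
    return beta[0] + beta[1] * x_all

mu1_hat = fit_predict(x[d == 1], y[d == 1], x)   # imputed Y_i(1) for everyone
mu0_hat = fit_predict(x[d == 0], y[d == 0], x)   # imputed Y_i(0) for everyone
tau_imp = np.mean(mu1_hat - mu0_hat)
```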
Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample
• Useful trick: use a model on the entire data and model.frame() to get the right design matrix

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61

46/58
Notes on imputation estimators

• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE
• Why don't people use this?
  - Most people don't know the results we've been talking about
  - Harder to implement than vanilla OLS
• Can use linear regression to estimate μ̂_d(x) = x′β_d
• Recent trend is to estimate μ̂_d(x) via non-parametric methods such as:
  - Kernel regression, local linear regression, regression trees, etc.
  - Easiest is generalized additive models (GAMs)

47/58
Imputation estimator visualization

[Figure, shown in three builds: scatterplot of y against x for treated and control units; within-group regression lines are fit and their predicted values impute each unit's missing potential outcome]

48–50/58
Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i

[Figure, shown in three builds: scatterplot of y against x for treated and control units whose CEFs are nonlinear]

51–53/58
Using semiparametric regression

• Here, CEFs are nonlinear, but we don't know their form
• We can use GAMs from the mgcv package for flexible estimation:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204   Scale est. = 0.0071351   n = 30

54/58
Using GAMs

[Figure, shown in three builds: GAM fits for the treated and control groups track the nonlinear CEFs, and their predicted values impute the missing potential outcomes]

55–57/58
Limited dependent variables

• Usual advice: model the data from first principles
  - Logit/probit for binary, Poisson for counts, etc.
• OLS is a-ok with limited DVs when:
  - Binary treatment and no covariates (just diff-in-means)
  - Binary treatment, discrete covariates, and saturated models (stratified diff-in-means)
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models
  - Could go wrong in small samples
  - If using nonlinear models, always get effects on the scale of the outcome

58/58
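The first "a-ok" case can be checked directly (a Python sketch with made-up probabilities): with a binary treatment and no covariates, the OLS slope is exactly the difference in means, even when the outcome itself is binary.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
d = rng.binomial(1, 0.4, n)                        # binary treatment
y = rng.binomial(1, np.where(d == 1, 0.7, 0.5))    # binary outcome, effect = 0.2

# OLS slope on a binary regressor reduces to the difference in means,
# with no distributional model for y required
slope = np.cov(y, d)[0, 1] / np.var(d, ddof=1)
diff_means = y[d == 1].mean() - y[d == 0].mean()
```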
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
-
Covariates in the error
bull Letrsquos assume that 120578119894 is linear in 119883119894 120578119894 = 119883prime119894 120574 + 120584119894
bull New error is uncorrelated with 119883119894 120124[120584119894|119883119894] = 0bull This is an assumption Might be falsebull Plug into the above
120124[119884119894(119889)|119883119894] = 119864[119884119894|119863119894 119883119894] = 120572 + 120591119863119894 + 119864[120578119894|119883119894]= 120572 + 120591119863119894 + 119883prime
119894 120574 + 119864[120584119894|119883119894]= 120572 + 120591119863119894 + 119883prime
119894 120574
28 58
Summing up regression with constanteffects
bull Reviewing the assumptions wersquove used no unmeasured confounders constant treatment effects linearity of the treatmentcovariates
bull Under these we can run the following regression to estimatethe ATE 120591
119884119894 = 120572 + 120591119863119894 + 119883prime119894 120574 + 120584119894
bull Works with continuous or ordinal 119863119894 if linearity in the effect ofthese variables is truly linear
29 58
OLS constant effects simulation
Model with linear covariates constant 0 effect of treatment
library(mvtnorm)
n lt- 100
p lt- 4
X lt- rmvnorm(n = 100 mean = rep(0 p))
gamma lt- c(274 137 137 137)
y lt- 210 + X c(gamma) + rnorm(n)
alpha lt- c(-1 05 -05 -01)
dprobs lt- bootinvlogit(X alpha)
d lt- rbinom(n size = 1 prob = dprobs)
30 58
OLS with no covariates
summary(lm(y ~ d))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 22036 458 4813 lt 2e-16
d -2869 654 -439 0000029
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 33 on 98 degrees of freedom
Multiple R-squared 0164 Adjusted R-squared 0156
F-statistic 192 on 1 and 98 DF p-value 0000029
31 58
OLS with covariates
summary(lm(y ~ d + X))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 209952 0159 13207 lt2e-16
d -0128 0253 -05 062
X1 27368 0126 2175 lt2e-16
X2 13677 0114 1201 lt2e-16
X3 13673 0130 1051 lt2e-16
X4 13570 0106 1280 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 1 on 94 degrees of freedom
Multiple R-squared 0999 Adjusted R-squared 0999
F-statistic 244e+04 on 5 and 94 DF p-value lt2e-16
32 58
What happens with nonlinearity
Suppose we can only observe the following covariates
z1 lt- exp(X[ 1]2)
z2 lt- X[ 2](1 + exp(X[ 1])) + 10
z3 lt- (X[ 1] X[ 3]25 + 06)^3
z4 lt- (X[ 2] + X[ 4] + 20)^2
Implies that 119884119894 and 119863119894 are functions of log(1198851198941113568) 1198851198941113569 119885111356911989411135681198851198941113569
1 log(1198851198941113568) 1198851198941113570 log(1198851198941113568) and 119883111356811135691198941113571
Regression is a nonlinear function of the observed covariates
33 58
When linearity goes wrong
summary(lm(y ~ d + z1 + z2 + z3 + z4))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 218728 300799 073 0469
d -69292 33854 -205 0043
z1 363110 26970 1346 lt2e-16
z2 -29033 36619 -079 0430
z3 862022 431030 200 0048
z4 04021 00329 1223 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 14 on 94 degrees of freedom
Multiple R-squared 0855 Adjusted R-squared 0847
F-statistic 111 on 5 and 94 DF p-value lt2e-16
34 58
3 Regression withHeterogeneousTreatment Effects
35 58
Heterogeneous effects binary treatment
bull Completely randomized experiment
119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894
= 1205831113567 + 120591119894119863119894 + (119884119894(0) minus 1205831113567)= 1205831113567 + 120591119863119894 + (119884119894(0) minus 1205831113567) + (120591119894 minus 120591) sdot 119863119894
= 1205831113567 + 120591119863119894 + 120576119894
bull Error term now includes two components1 ldquoBaselinerdquo variation in the outcome (119884119894(0) minus 1205831113567)2 Variation in the treatment effect (120591119894 minus 120591)
bull Easy to verify that under experiment 120124[120576119894|119863119894] = 0bull Thus OLS estimates the ATE with no covariates
36 58
Adding covariates
bull What happens with no unmeasured confounders Need tocondition on 119883119894 now
bull Remember identification of the ATEATT using iteratedexpectations
bull ATE is the weighted sum of CATEs
120591 = 1114012119909120591(119909) Pr[119883119894 = 119909]
bull ATEATT are weighted averages of CATEsbull What about the regression estimand 120591119877 How does it related
to the ATEATT
37 58
Heterogeneous effects and regression
bull Letrsquos investigate this under a saturated regression model
119884119894 = 1114012119909119861119909119894120572119909 + 120591119877119863119894 + 119890119894
bull Use a dummy variable for each unique combination of 119883119894119861119909119894 = 120128(119883119894 = 119909)
bull Linear in 119883119894 by construction
38 58
Investigating the regression coefficient
bull How can we investigate 120591119877 Well we can rely on theregression anatomy
120591119877 = Cov(119884119894 119863119894 minus 119864[119863119894|119883119894])120141(119863119894 minus 119864[119863119894|119883119894])
bull 119863119894 minus 120124[119863119894|119883119894] is the residual from a regression of 119863119894 on the fullset of dummies
bull With a little work we can show
120591119877 =120124 1114094120591(119883119894)(119863119894 minus 120124[119863119894|119883119894])11135691114097
120124[(119863119894 minus 119864[119863119894|119883119894])1113569]=
120124[120591(119883119894)1205901113569119889(119883119894)]120124[1205901113569119889(119883119894)]
bull 1205901113569119889(119909) = 120141[119863119894|119883119894 = 119909] is the conditional variance of treatmentassignment
39 58
ATE versus OLS
120591119877 = 120124[120591(119883119894)119882119894] = 1114012119909120591(119909)
1205901113569119889(119909)120124[1205901113569119889(119883119894)]
โ[119883119894 = 119909]
bull Compare to the ATE
120591 = 120124[120591(119883119894)] = 1114012119909120591(119909)โ[119883119894 = 119909]
bull Both weight strata relative to their size (โ[119883119894 = 119909])bull OLS weights strata higher if the treatment variance in those
strata (1205901113569119889(119909)) is higher in those strata relative to the averagevariance across strata (120124[1205901113569119889(119883119894)])
bull The ATE weights only by their size
40 58
Regression weighting
119882119894 =1205901113569119889(119883119894)
120124[1205901113569119889(119883119894)]
bull Why does OLS weight like thisbull OLS is a minimum-variance estimator more weight to
more precise within-strata estimatesbull Within-strata estimates are most precise when the treatment
is evenly spread and thus has the highest variancebull If 119863119894 is binary then we know the conditional variance will be
1205901113569119889(119909) = โ[119863119894 = 1|119883119894 = 119909] (1 minus โ[119863119894 = 1|119883119894 = 119909])= 119890(119909) (1 minus 119890(119909))
bull Maximum variance with โ[119863119894 = 1|119883119894 = 119909] = 12
41 58
OLS weighting examplebull Binary covariate
โ[119883119894 = 1] = 075 โ[119883119894 = 0] = 025โ[119863119894 = 1|119883119894 = 1] = 09 โ[119863119894 = 1|119883119894 = 0] = 05
1205901113569119889(1) = 009 1205901113569119889(0) = 025120591(1) = 1 120591(0) = minus1
bull Implies the ATE is 120591 = 05bull Average conditional variance 120124[1205901113569119889(119883119894)] = 013bull weights for 119883119894 = 1 are 009013 = 0692 for 119883119894 = 0
025013 = 192120591119877 = 120124[120591(119883119894)119882119894]
= 120591(1)119882(1)โ[119883119894 = 1] + 120591(0)119882(0)โ[119883119894 = 0]= 1 times 0692 times 075 + minus1 times 192 times 025= 0039
42 58
When will OLS estimate the ATE
bull When does 120591 = 120591119877bull Constant treatment effects 120591(119909) = 120591 = 120591119877bull Constant probability of treatment 119890(119909) = โ[119863119894 = 1|119883119894 = 119909] = 119890
Implies that the OLS weights are 1
bull Incorrect linearity assumption in 119883119894 will lead to more bias
43 58
Other ways to use regression
bull Whatrsquos the path forward Accept the bias (might be relatively small with saturated
models) Use a different regression approach
bull Let 120583119889(119909) = 120124[119884119894(119889)|119883119894 = 119909] be the CEF for the potentialoutcome under 119863119894 = 119889
bull By consistency and nuc we have 120583119889(119909) = 120124[119884119894|119863119894 = 119889119883119894 = 119909]bull Estimate a regression of 119884119894 on 119883119894 among the 119863119894 = 119889 groupbull Then 119889(119909) is just a predicted value from the regression for
119883119894 = 119909bull How can we use this
44 58
Imputation estimators
bull Impute the treated potential outcomes with 1114023119884119894(1) = 1113568(119883119894)bull Impute the control potential outcomes with 1114023119884119894(0) = 1113567(119883119894)bull Procedure
Regress 119884119894 on 119883119894 in the treated group and get predicted valuesfor all units (treated or control)
Regress 119884119894 on 119883119894 in the control group and get predicted valuesfor all units (treated or control)
Take the average difference between these predicted values
bull More mathematically look like this
120591119894119898119901 =1119873
11140121198941113568(119883119894) minus 1113567(119883119894)
bull Sometimes called an imputation estimator
45 58
Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample.
• Useful trick: fit a model on the entire data and use model.frame() to get the right design matrix.

## heterogeneous effects
y_het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y_het ~ d + X)
mod1 <- lm(y_het ~ X, subset = d == 1)
mod0 <- lm(y_het ~ X, subset = d == 0)
y1_imps <- predict(mod1, model.frame(mod))
y0_imps <- predict(mod0, model.frame(mod))
mean(y1_imps - y0_imps)

## [1] 0.61
Notes on imputation estimators

• If the μ̂_d(x) are consistent estimators of the CEFs, then τ̂_imp is consistent for the ATE.
• Why don't people use this?
  – Most people don't know the results we've been talking about.
  – Harder to implement than vanilla OLS.
• Can use linear regression to estimate μ̂_d(x) = x′β̂_d.
• Recent trend is to estimate μ̂_d(x) via non-parametric methods, such as:
  – Kernel regression, local linear regression, regression trees, etc.
  – Easiest is generalized additive models (GAMs).
Imputation estimator visualization

[Figure: simulated data, y plotted against x for treated and control units; the within-group regression lines are used to impute each unit's missing potential outcome.]
Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i.

[Figure: simulated data with nonlinear CEFs, y plotted against x for treated and control units.]
Using semiparametric regression

• Here the CEFs are nonlinear, but we don't know their form.
• We can use GAMs from the mgcv package for flexible estimates:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

## Family: gaussian
## Link function: identity
##
## Formula:
## y ~ s(x)
##
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.0225     0.0154   -1.46     0.16
##
## Approximate significance of smooth terms:
##       edf Ref.df    F p-value
## s(x) 6.03   7.08 41.3  <2e-16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.909   Deviance explained = 92.8%
## GCV = 0.0093204  Scale est. = 0.0071351  n = 30
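The slide fits only the control-group smooth; a minimal sketch of the full GAM imputation estimator on freshly simulated data (the sine CEF and the constant 0.5 effect here are illustrative assumptions, not the slides' data) might look like:

```r
library(mgcv)  # provides gam() and s()

set.seed(02138)
n <- 500
x <- runif(n, -2, 2)
d <- rbinom(n, 1, 0.5)
# assumed nonlinear CEF, with a constant treatment effect of 0.5
y <- sin(1.5 * x) + 0.5 * d + rnorm(n, sd = 0.25)

# fit a smooth of y on x separately within each treatment group
mod1 <- gam(y ~ s(x), subset = d == 1)
mod0 <- gam(y ~ s(x), subset = d == 0)

# impute both potential outcomes for every unit, then average the difference
nd <- data.frame(x = x)
tau_imp <- mean(predict(mod1, nd) - predict(mod0, nd))
tau_imp
```

With this much data the estimate should land near the true effect of 0.5 even though neither group's CEF is linear.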
Using GAMs

[Figure: GAM fits for the treated and control groups capture the nonlinear CEFs; differences between the two fitted curves impute the unit-level effects.]
Limited dependent variables

• Usual advice: model the data from first principles.
  – Logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is a-ok with limited DVs when:
  – Binary treatment and no covariates (just diff-in-means)
  – Binary treatment, discrete covariates, and saturated models (stratified diff-in-means)
• Imposing a model on LDVs in these cases imposes a distributional assumption, which could be wrong.
• Even in unsaturated models, the marginal effects from OLS are often decent compared to nonlinear models.
  – Could go wrong in small samples.
  – If using nonlinear models, always get effects on the scale of the outcome.
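A small sketch of the first "a-ok" case (simulated data, not from the slides): with a binary treatment and no covariates, the OLS coefficient, the difference in means, and the logit-implied risk difference all coincide, because the logit model is saturated and its fitted probabilities are exactly the group means.

```r
set.seed(02138)
n <- 500
d <- rbinom(n, 1, 0.5)
y <- rbinom(n, 1, 0.3 + 0.2 * d)  # binary outcome, true risk difference 0.2

diff_means <- mean(y[d == 1]) - mean(y[d == 0])
ols <- unname(coef(lm(y ~ d))["d"])

# logit with just the treatment dummy is saturated:
# predicted probabilities equal the group means
logit <- glm(y ~ d, family = binomial)
p1 <- predict(logit, data.frame(d = 1), type = "response")
p0 <- predict(logit, data.frame(d = 0), type = "response")

round(c(diff_means = diff_means, ols = ols, logit = unname(p1 - p0)), 4)
```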
Summing up regression with constant effects

• Reviewing the assumptions we've used:
  – no unmeasured confounders
  – constant treatment effects
  – linearity in the treatment/covariates
• Under these, we can run the following regression to estimate the ATE, τ:

    Y_i = α + τ D_i + X_i′γ + ν_i

• Works with continuous or ordinal D_i if the effect of these variables is truly linear.
OLS, constant effects simulation

Model with linear covariates and a constant, zero effect of treatment:

library(mvtnorm)
n <- 100
p <- 4
X <- rmvnorm(n = 100, mean = rep(0, p))
gamma <- c(2.74, 1.37, 1.37, 1.37)
y <- 21 + X %*% c(gamma) + rnorm(n)
alpha <- c(-1, 0.5, -0.5, -0.1)
d.probs <- boot::inv.logit(X %*% alpha)
d <- rbinom(n, size = 1, prob = d.probs)
OLS with no covariates

summary(lm(y ~ d))

## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   22.036      0.458   48.13  < 2e-16
## d             -2.869      0.654   -4.39 0.000029
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.3 on 98 degrees of freedom
## Multiple R-squared: 0.164, Adjusted R-squared: 0.156
## F-statistic: 19.2 on 1 and 98 DF, p-value: 0.000029
OLS with covariates

summary(lm(y ~ d + X))

## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  20.9952      0.159  132.07   <2e-16
## d            -0.128       0.253   -0.5      0.62
## X1            2.7368      0.126   21.75   <2e-16
## X2            1.3677      0.114   12.01   <2e-16
## X3            1.3673      0.130   10.51   <2e-16
## X4            1.3570      0.106   12.80   <2e-16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1 on 94 degrees of freedom
## Multiple R-squared: 0.999, Adjusted R-squared: 0.999
## F-statistic: 2.44e+04 on 5 and 94 DF, p-value: <2e-16
What happens with nonlinearity?

Suppose we can only observe the following covariates:

z1 <- exp(X[, 1]/2)
z2 <- X[, 2]/(1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3]/25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

Inverting these, Y_i and D_i are functions of terms like log(Z_i1), Z_i2, Z_i1² Z_i2, 1/log(Z_i1), Z_i3^(1/3), and Z_i4^(1/2).

→ The true regression is a nonlinear function of the observed covariates.
When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.8728    30.0799     0.73    0.469
## d           -0.69292    0.33854   -2.05    0.043
## z1           3.63110    0.26970   13.46   <2e-16
## z2          -2.9033     3.6619    -0.79    0.430
## z3          86.2022    43.1030     2.00    0.048
## z4           0.4021     0.0329    12.23   <2e-16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.4 on 94 degrees of freedom
## Multiple R-squared: 0.855, Adjusted R-squared: 0.847
## F-statistic: 111 on 5 and 94 DF, p-value: <2e-16

The true treatment effect is 0, but the misspecified linear model finds a significant negative effect.
3. Regression with Heterogeneous Treatment Effects
Heterogeneous effects, binary treatment

• Completely randomized experiment:

    Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
        = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
        = μ_0 + τ_i D_i + (Y_i(0) − μ_0)
        = μ_0 + τ D_i + (Y_i(0) − μ_0) + (τ_i − τ) · D_i
        = μ_0 + τ D_i + ε_i

• The error term now includes two components:
  1. "Baseline" variation in the outcome: (Y_i(0) − μ_0)
  2. Variation in the treatment effect: (τ_i − τ) · D_i
• Easy to verify that under an experiment, E[ε_i | D_i] = 0.
• Thus, OLS estimates the ATE with no covariates.
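A quick simulation sketch of this claim (the numbers here are made up for illustration): with completely randomized treatment and heterogeneous unit-level effects, the bivariate OLS coefficient still targets the ATE.

```r
set.seed(02138)
n <- 100000
tau_i <- rnorm(n, mean = 2, sd = 1)  # unit-level effects; ATE = 2
y0 <- rnorm(n)                       # baseline outcomes Y_i(0)
d <- rbinom(n, 1, 0.5)               # completely randomized treatment
y <- y0 + tau_i * d                  # observed outcome under consistency
coef(lm(y ~ d))["d"]                 # close to the ATE of 2
```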
Adding covariates

• What happens with no unmeasured confounders? We need to condition on X_i now.
• Remember identification of the ATE/ATT using iterated expectations.
• The ATE is the weighted sum of CATEs:

    τ = Σ_x τ(x) P[X_i = x]

• ATE/ATT are weighted averages of CATEs.
• What about the regression estimand τ^R? How does it relate to the ATE/ATT?
Heterogeneous effects and regression

• Let's investigate this under a saturated regression model:

    Y_i = Σ_x B_xi α_x + τ^R D_i + e_i

• Use a dummy variable for each unique combination of X_i: B_xi = 1(X_i = x)
• Linear in X_i by construction.
Investigating the regression coefficient

• How can we investigate τ^R? We can rely on the regression anatomy:

    τ^R = Cov(Y_i, D_i − E[D_i | X_i]) / V(D_i − E[D_i | X_i])

• D_i − E[D_i | X_i] is the residual from a regression of D_i on the full set of dummies.
• With a little work, we can show:

    τ^R = E[τ(X_i)(D_i − E[D_i | X_i])²] / E[(D_i − E[D_i | X_i])²]
        = E[τ(X_i) σ²_d(X_i)] / E[σ²_d(X_i)]

• σ²_d(x) = V[D_i | X_i = x] is the conditional variance of treatment assignment.
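This regression-anatomy identity is the Frisch–Waugh–Lovell decomposition, and it can be checked numerically in a few lines (a sketch with made-up strata, not the course's code):

```r
set.seed(02138)
n <- 1000
x <- sample(1:3, n, replace = TRUE)   # discrete covariate with 3 strata
e_x <- c(0.2, 0.5, 0.8)[x]            # stratum-level propensity scores
d <- rbinom(n, 1, e_x)
y <- x + x * d + rnorm(n)             # heterogeneous effects: tau(x) = x

# coefficient on d from the saturated regression
tau_r <- unname(coef(lm(y ~ factor(x) + d))["d"])

# regression anatomy: residualize d on the stratum dummies
d_res <- residuals(lm(d ~ factor(x)))
anatomy <- cov(y, d_res) / var(d_res)

all.equal(tau_r, anatomy)  # the two agree
```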
ATE versus OLS

    τ^R = E[τ(X_i) W_i] = Σ_x τ(x) (σ²_d(x) / E[σ²_d(X_i)]) P[X_i = x]

• Compare to the ATE:

    τ = E[τ(X_i)] = Σ_x τ(x) P[X_i = x]

• Both weight strata relative to their size (P[X_i = x]).
• OLS weights a stratum more heavily if the treatment variance in that stratum (σ²_d(x)) is high relative to the average variance across strata (E[σ²_d(X_i)]).
• The ATE weights strata only by their size.
Regression weighting

    W_i = σ²_d(X_i) / E[σ²_d(X_i)]

• Why does OLS weight like this?
• OLS is a minimum-variance estimator → more weight to more precise within-strata estimates.
• Within-strata estimates are most precise when the treatment is evenly spread, and thus has the highest variance.
• If D_i is binary, then the conditional variance is:

    σ²_d(x) = P[D_i = 1 | X_i = x](1 − P[D_i = 1 | X_i = x]) = e(x)(1 − e(x))

• Maximum variance at P[D_i = 1 | X_i = x] = 1/2.
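The last claim is easy to confirm numerically (a one-off sketch):

```r
# conditional variance of a binary treatment as a function of e(x)
e <- seq(0.01, 0.99, by = 0.01)
v <- e * (1 - e)
e[which.max(v)]  # maximized at e = 0.5, where v = 0.25
```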
OLS weighting example

• Binary covariate:

    P[X_i = 1] = 0.75    P[X_i = 0] = 0.25
    P[D_i = 1 | X_i = 1] = 0.9    P[D_i = 1 | X_i = 0] = 0.5
    σ²_d(1) = 0.09    σ²_d(0) = 0.25
    τ(1) = 1    τ(0) = −1

• Implies the ATE is τ = 0.5.
• Average conditional variance: E[σ²_d(X_i)] = 0.13.
• Weights: for X_i = 1, 0.09/0.13 = 0.692; for X_i = 0, 0.25/0.13 = 1.92.

    τ^R = E[τ(X_i) W_i]
        = τ(1) W(1) P[X_i = 1] + τ(0) W(0) P[X_i = 0]
        = 1 × 0.692 × 0.75 + (−1) × 1.92 × 0.25
        = 0.039
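The slide's arithmetic can be reproduced directly (a sketch using only the numbers above):

```r
p_x   <- c(0.25, 0.75)        # P[X = 0], P[X = 1]
tau_x <- c(-1, 1)             # CATEs: tau(0), tau(1)
e_x   <- c(0.5, 0.9)          # propensities: e(0), e(1)
v_x   <- e_x * (1 - e_x)      # conditional variances: 0.25 and 0.09

ate   <- sum(tau_x * p_x)     # weights strata by size only
w_x   <- v_x / sum(v_x * p_x) # OLS weights W(x)
tau_r <- sum(tau_x * w_x * p_x)  # variance-weighted OLS estimand

round(c(ate = ate, tau_r = tau_r), 3)
```

The exact value of τ^R here is 0.5/13 ≈ 0.0385; the slide's 0.039 comes from rounding the weights before multiplying.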
When will OLS estimate the ATE
bull When does 120591 = 120591119877bull Constant treatment effects 120591(119909) = 120591 = 120591119877bull Constant probability of treatment 119890(119909) = โ[119863119894 = 1|119883119894 = 119909] = 119890
Implies that the OLS weights are 1
bull Incorrect linearity assumption in 119883119894 will lead to more bias
43 58
Other ways to use regression
bull Whatrsquos the path forward Accept the bias (might be relatively small with saturated
models) Use a different regression approach
bull Let 120583119889(119909) = 120124[119884119894(119889)|119883119894 = 119909] be the CEF for the potentialoutcome under 119863119894 = 119889
bull By consistency and nuc we have 120583119889(119909) = 120124[119884119894|119863119894 = 119889119883119894 = 119909]bull Estimate a regression of 119884119894 on 119883119894 among the 119863119894 = 119889 groupbull Then 119889(119909) is just a predicted value from the regression for
119883119894 = 119909bull How can we use this
44 58
Imputation estimators
bull Impute the treated potential outcomes with 1114023119884119894(1) = 1113568(119883119894)bull Impute the control potential outcomes with 1114023119884119894(0) = 1113567(119883119894)bull Procedure
Regress 119884119894 on 119883119894 in the treated group and get predicted valuesfor all units (treated or control)
Regress 119884119894 on 119883119894 in the control group and get predicted valuesfor all units (treated or control)
Take the average difference between these predicted values
bull More mathematically look like this
120591119894119898119901 =1119873
11140121198941113568(119883119894) minus 1113567(119883119894)
bull Sometimes called an imputation estimator
45 58
Simple imputation estimator
bull Use predict() from the within-group models on the data fromthe entire sample
bull Useful trick use a model on the entire data andmodelframe() to get the right design matrix
heterogeneous effects
yhet lt- ifelse(d == 1 y + rnorm(n 0 5) y)
mod lt- lm(yhet ~ d + X)
mod1 lt- lm(yhet ~ X subset = d == 1)
mod0 lt- lm(yhet ~ X subset = d == 0)
y1imps lt- predict(mod1 modelframe(mod))
y0imps lt- predict(mod0 modelframe(mod))
mean(y1imps - y0imps)
[1] 061
46 58
Notes on imputation estimators
bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE
bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS
bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods
such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)
47 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
48 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
49 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
50 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
51 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
52 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
53 58
Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible
estimatelibrary(mgcv)
mod0 lt- gam(y ~ s(x) subset = d == 0)
summary(mod0)
Family gaussian
Link function identity
Formula
y ~ s(x)
Parametric coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) -00225 00154 -146 016
Approximate significance of smooth terms
edf Refdf F p-value
s(x) 603 708 413 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
R-sq(adj) = 0909 Deviance explained = 928
GCV = 00093204 Scale est = 00071351 n = 30
54 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
55 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
56 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
57 58
Limited dependent variables
bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc
bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models
(stratified diff-in-means)
bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong
bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models
Could go wrong in small samples If using nonlinear models always get effects on the scale of the
outcome
58 58
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
-
OLS constant effects simulation
Model with linear covariates constant 0 effect of treatment
library(mvtnorm)
n lt- 100
p lt- 4
X lt- rmvnorm(n = 100 mean = rep(0 p))
gamma lt- c(274 137 137 137)
y lt- 210 + X c(gamma) + rnorm(n)
alpha lt- c(-1 05 -05 -01)
dprobs lt- bootinvlogit(X alpha)
d lt- rbinom(n size = 1 prob = dprobs)
30 58
OLS with no covariates
summary(lm(y ~ d))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 22036 458 4813 lt 2e-16
d -2869 654 -439 0000029
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 33 on 98 degrees of freedom
Multiple R-squared 0164 Adjusted R-squared 0156
F-statistic 192 on 1 and 98 DF p-value 0000029
31 58
OLS with covariates
summary(lm(y ~ d + X))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 209952 0159 13207 lt2e-16
d -0128 0253 -05 062
X1 27368 0126 2175 lt2e-16
X2 13677 0114 1201 lt2e-16
X3 13673 0130 1051 lt2e-16
X4 13570 0106 1280 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 1 on 94 degrees of freedom
Multiple R-squared 0999 Adjusted R-squared 0999
F-statistic 244e+04 on 5 and 94 DF p-value lt2e-16
32 58
What happens with nonlinearity
Suppose we can only observe the following covariates
z1 lt- exp(X[ 1]2)
z2 lt- X[ 2](1 + exp(X[ 1])) + 10
z3 lt- (X[ 1] X[ 3]25 + 06)^3
z4 lt- (X[ 2] + X[ 4] + 20)^2
Implies that 119884119894 and 119863119894 are functions of log(1198851198941113568) 1198851198941113569 119885111356911989411135681198851198941113569
1 log(1198851198941113568) 1198851198941113570 log(1198851198941113568) and 119883111356811135691198941113571
Regression is a nonlinear function of the observed covariates
33 58
When linearity goes wrong
summary(lm(y ~ d + z1 + z2 + z3 + z4))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 218728 300799 073 0469
d -69292 33854 -205 0043
z1 363110 26970 1346 lt2e-16
z2 -29033 36619 -079 0430
z3 862022 431030 200 0048
z4 04021 00329 1223 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 14 on 94 degrees of freedom
Multiple R-squared 0855 Adjusted R-squared 0847
F-statistic 111 on 5 and 94 DF p-value lt2e-16
34 58
3 Regression withHeterogeneousTreatment Effects
35 58
Heterogeneous effects binary treatment
bull Completely randomized experiment
119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894
= 1205831113567 + 120591119894119863119894 + (119884119894(0) minus 1205831113567)= 1205831113567 + 120591119863119894 + (119884119894(0) minus 1205831113567) + (120591119894 minus 120591) sdot 119863119894
= 1205831113567 + 120591119863119894 + 120576119894
bull Error term now includes two components1 ldquoBaselinerdquo variation in the outcome (119884119894(0) minus 1205831113567)2 Variation in the treatment effect (120591119894 minus 120591)
bull Easy to verify that under experiment 120124[120576119894|119863119894] = 0bull Thus OLS estimates the ATE with no covariates
36 58
Adding covariates
bull What happens with no unmeasured confounders Need tocondition on 119883119894 now
bull Remember identification of the ATEATT using iteratedexpectations
bull ATE is the weighted sum of CATEs
120591 = 1114012119909120591(119909) Pr[119883119894 = 119909]
bull ATEATT are weighted averages of CATEsbull What about the regression estimand 120591119877 How does it related
to the ATEATT
37 58
Heterogeneous effects and regression
bull Letrsquos investigate this under a saturated regression model
119884119894 = 1114012119909119861119909119894120572119909 + 120591119877119863119894 + 119890119894
bull Use a dummy variable for each unique combination of 119883119894119861119909119894 = 120128(119883119894 = 119909)
bull Linear in 119883119894 by construction
38 58
Investigating the regression coefficient
bull How can we investigate 120591119877 Well we can rely on theregression anatomy
120591119877 = Cov(119884119894 119863119894 minus 119864[119863119894|119883119894])120141(119863119894 minus 119864[119863119894|119883119894])
bull 119863119894 minus 120124[119863119894|119883119894] is the residual from a regression of 119863119894 on the fullset of dummies
bull With a little work we can show
120591119877 =120124 1114094120591(119883119894)(119863119894 minus 120124[119863119894|119883119894])11135691114097
120124[(119863119894 minus 119864[119863119894|119883119894])1113569]=
120124[120591(119883119894)1205901113569119889(119883119894)]120124[1205901113569119889(119883119894)]
bull 1205901113569119889(119909) = 120141[119863119894|119883119894 = 119909] is the conditional variance of treatmentassignment
39 58
ATE versus OLS
120591119877 = 120124[120591(119883119894)119882119894] = 1114012119909120591(119909)
1205901113569119889(119909)120124[1205901113569119889(119883119894)]
โ[119883119894 = 119909]
bull Compare to the ATE
120591 = 120124[120591(119883119894)] = 1114012119909120591(119909)โ[119883119894 = 119909]
bull Both weight strata relative to their size (โ[119883119894 = 119909])bull OLS weights strata higher if the treatment variance in those
strata (1205901113569119889(119909)) is higher in those strata relative to the averagevariance across strata (120124[1205901113569119889(119883119894)])
bull The ATE weights only by their size
40 58
Regression weighting
119882119894 =1205901113569119889(119883119894)
120124[1205901113569119889(119883119894)]
bull Why does OLS weight like thisbull OLS is a minimum-variance estimator more weight to
more precise within-strata estimatesbull Within-strata estimates are most precise when the treatment
is evenly spread and thus has the highest variancebull If 119863119894 is binary then we know the conditional variance will be
1205901113569119889(119909) = โ[119863119894 = 1|119883119894 = 119909] (1 minus โ[119863119894 = 1|119883119894 = 119909])= 119890(119909) (1 minus 119890(119909))
bull Maximum variance with โ[119863119894 = 1|119883119894 = 119909] = 12
41 58
OLS weighting examplebull Binary covariate
โ[119883119894 = 1] = 075 โ[119883119894 = 0] = 025โ[119863119894 = 1|119883119894 = 1] = 09 โ[119863119894 = 1|119883119894 = 0] = 05
1205901113569119889(1) = 009 1205901113569119889(0) = 025120591(1) = 1 120591(0) = minus1
bull Implies the ATE is 120591 = 05bull Average conditional variance 120124[1205901113569119889(119883119894)] = 013bull weights for 119883119894 = 1 are 009013 = 0692 for 119883119894 = 0
025013 = 192120591119877 = 120124[120591(119883119894)119882119894]
= 120591(1)119882(1)โ[119883119894 = 1] + 120591(0)119882(0)โ[119883119894 = 0]= 1 times 0692 times 075 + minus1 times 192 times 025= 0039
42 58
When will OLS estimate the ATE
bull When does 120591 = 120591119877bull Constant treatment effects 120591(119909) = 120591 = 120591119877bull Constant probability of treatment 119890(119909) = โ[119863119894 = 1|119883119894 = 119909] = 119890
Implies that the OLS weights are 1
bull Incorrect linearity assumption in 119883119894 will lead to more bias
43 58
Other ways to use regression
bull Whatrsquos the path forward Accept the bias (might be relatively small with saturated
models) Use a different regression approach
bull Let 120583119889(119909) = 120124[119884119894(119889)|119883119894 = 119909] be the CEF for the potentialoutcome under 119863119894 = 119889
bull By consistency and nuc we have 120583119889(119909) = 120124[119884119894|119863119894 = 119889119883119894 = 119909]bull Estimate a regression of 119884119894 on 119883119894 among the 119863119894 = 119889 groupbull Then 119889(119909) is just a predicted value from the regression for
119883119894 = 119909bull How can we use this
44 58
Imputation estimators
bull Impute the treated potential outcomes with 1114023119884119894(1) = 1113568(119883119894)bull Impute the control potential outcomes with 1114023119884119894(0) = 1113567(119883119894)bull Procedure
Regress 119884119894 on 119883119894 in the treated group and get predicted valuesfor all units (treated or control)
Regress 119884119894 on 119883119894 in the control group and get predicted valuesfor all units (treated or control)
Take the average difference between these predicted values
bull More mathematically look like this
120591119894119898119901 =1119873
11140121198941113568(119883119894) minus 1113567(119883119894)
bull Sometimes called an imputation estimator
45 58
Simple imputation estimator
bull Use predict() from the within-group models on the data fromthe entire sample
bull Useful trick use a model on the entire data andmodelframe() to get the right design matrix
heterogeneous effects
yhet lt- ifelse(d == 1 y + rnorm(n 0 5) y)
mod lt- lm(yhet ~ d + X)
mod1 lt- lm(yhet ~ X subset = d == 1)
mod0 lt- lm(yhet ~ X subset = d == 0)
y1imps lt- predict(mod1 modelframe(mod))
y0imps lt- predict(mod0 modelframe(mod))
mean(y1imps - y0imps)
[1] 061
46 58
Notes on imputation estimators
bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE
bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS
bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods
such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)
47 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
48 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
49 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
50 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
51 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
52 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
53 58
Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible
estimatelibrary(mgcv)
mod0 lt- gam(y ~ s(x) subset = d == 0)
summary(mod0)
Family gaussian
Link function identity
Formula
y ~ s(x)
Parametric coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) -00225 00154 -146 016
Approximate significance of smooth terms
edf Refdf F p-value
s(x) 603 708 413 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
R-sq(adj) = 0909 Deviance explained = 928
GCV = 00093204 Scale est = 00071351 n = 30
54 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
55 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
56 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
57 58
Limited dependent variables
bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc
bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models
(stratified diff-in-means)
bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong
bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models
Could go wrong in small samples If using nonlinear models always get effects on the scale of the
outcome
58 58
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
-
OLS with no covariates
summary(lm(y ~ d))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 22036 458 4813 lt 2e-16
d -2869 654 -439 0000029
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 33 on 98 degrees of freedom
Multiple R-squared 0164 Adjusted R-squared 0156
F-statistic 192 on 1 and 98 DF p-value 0000029
31 58
OLS with covariates
summary(lm(y ~ d + X))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 209952 0159 13207 lt2e-16
d -0128 0253 -05 062
X1 27368 0126 2175 lt2e-16
X2 13677 0114 1201 lt2e-16
X3 13673 0130 1051 lt2e-16
X4 13570 0106 1280 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 1 on 94 degrees of freedom
Multiple R-squared 0999 Adjusted R-squared 0999
F-statistic 244e+04 on 5 and 94 DF p-value lt2e-16
32 58
What happens with nonlinearity
Suppose we can only observe the following covariates
z1 lt- exp(X[ 1]2)
z2 lt- X[ 2](1 + exp(X[ 1])) + 10
z3 lt- (X[ 1] X[ 3]25 + 06)^3
z4 lt- (X[ 2] + X[ 4] + 20)^2
Implies that 119884119894 and 119863119894 are functions of log(1198851198941113568) 1198851198941113569 119885111356911989411135681198851198941113569
1 log(1198851198941113568) 1198851198941113570 log(1198851198941113568) and 119883111356811135691198941113571
Regression is a nonlinear function of the observed covariates
33 58
When linearity goes wrong
summary(lm(y ~ d + z1 + z2 + z3 + z4))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 218728 300799 073 0469
d -69292 33854 -205 0043
z1 363110 26970 1346 lt2e-16
z2 -29033 36619 -079 0430
z3 862022 431030 200 0048
z4 04021 00329 1223 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 14 on 94 degrees of freedom
Multiple R-squared 0855 Adjusted R-squared 0847
F-statistic 111 on 5 and 94 DF p-value lt2e-16
34 58
3 Regression withHeterogeneousTreatment Effects
35 58
Heterogeneous effects binary treatment
bull Completely randomized experiment
119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894
= 1205831113567 + 120591119894119863119894 + (119884119894(0) minus 1205831113567)= 1205831113567 + 120591119863119894 + (119884119894(0) minus 1205831113567) + (120591119894 minus 120591) sdot 119863119894
= 1205831113567 + 120591119863119894 + 120576119894
bull Error term now includes two components1 ldquoBaselinerdquo variation in the outcome (119884119894(0) minus 1205831113567)2 Variation in the treatment effect (120591119894 minus 120591)
bull Easy to verify that under experiment 120124[120576119894|119863119894] = 0bull Thus OLS estimates the ATE with no covariates
36 58
Adding covariates
bull What happens with no unmeasured confounders Need tocondition on 119883119894 now
bull Remember identification of the ATEATT using iteratedexpectations
bull ATE is the weighted sum of CATEs
120591 = 1114012119909120591(119909) Pr[119883119894 = 119909]
bull ATEATT are weighted averages of CATEsbull What about the regression estimand 120591119877 How does it related
to the ATEATT
37 58
Heterogeneous effects and regression
bull Letrsquos investigate this under a saturated regression model
119884119894 = 1114012119909119861119909119894120572119909 + 120591119877119863119894 + 119890119894
bull Use a dummy variable for each unique combination of 119883119894119861119909119894 = 120128(119883119894 = 119909)
bull Linear in 119883119894 by construction
38 58
Investigating the regression coefficient
bull How can we investigate 120591119877 Well we can rely on theregression anatomy
120591119877 = Cov(119884119894 119863119894 minus 119864[119863119894|119883119894])120141(119863119894 minus 119864[119863119894|119883119894])
bull 119863119894 minus 120124[119863119894|119883119894] is the residual from a regression of 119863119894 on the fullset of dummies
bull With a little work we can show
120591119877 =120124 1114094120591(119883119894)(119863119894 minus 120124[119863119894|119883119894])11135691114097
120124[(119863119894 minus 119864[119863119894|119883119894])1113569]=
120124[120591(119883119894)1205901113569119889(119883119894)]120124[1205901113569119889(119883119894)]
bull 1205901113569119889(119909) = 120141[119863119894|119883119894 = 119909] is the conditional variance of treatmentassignment
39 58
ATE versus OLS

    τ^R = E[τ(X_i) W_i] = Σ_x τ(x) · (σ²_d(x) / E[σ²_d(X_i)]) · Pr[X_i = x]

• Compare to the ATE:

    τ = E[τ(X_i)] = Σ_x τ(x) Pr[X_i = x]

• Both weight strata relative to their size, Pr[X_i = x].
• OLS weights strata more heavily when the treatment variance in that stratum, σ²_d(x), is high relative to the average variance across strata, E[σ²_d(X_i)].
• The ATE weights strata only by their size.
Regression weighting

    W_i = σ²_d(X_i) / E[σ²_d(X_i)]

• Why does OLS weight like this?
• OLS is a minimum-variance estimator: it gives more weight to more precise within-strata estimates.
• Within-strata estimates are most precise when the treatment is evenly spread and thus has the highest variance.
• If D_i is binary, then we know the conditional variance will be

    σ²_d(x) = Pr[D_i = 1|X_i = x] (1 − Pr[D_i = 1|X_i = x]) = e(x)(1 − e(x))

• Maximum variance when Pr[D_i = 1|X_i = x] = 1/2.
OLS weighting example

• Binary covariate:

    Pr[X_i = 1] = 0.75,  Pr[X_i = 0] = 0.25
    Pr[D_i = 1|X_i = 1] = 0.9,  Pr[D_i = 1|X_i = 0] = 0.5
    σ²_d(1) = 0.09,  σ²_d(0) = 0.25
    τ(1) = 1,  τ(0) = −1

• Implies the ATE is τ = 0.5.
• Average conditional variance: E[σ²_d(X_i)] = 0.13.
• Weights: for X_i = 1, 0.09/0.13 = 0.692; for X_i = 0, 0.25/0.13 = 1.92.

    τ^R = E[τ(X_i) W_i]
        = τ(1) W(1) Pr[X_i = 1] + τ(0) W(0) Pr[X_i = 0]
        = 1 × 0.692 × 0.75 + (−1) × 1.92 × 0.25
        = 0.039
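The slide's arithmetic is easy to reproduce. A quick check in Python (used here just for the arithmetic; the numbers are the slide's):

```python
# Reproducing the slide's variance-weighted estimand by hand
p   = {1: 0.75, 0: 0.25}          # Pr[X_i = x]
e   = {1: 0.90, 0: 0.50}          # Pr[D_i = 1 | X_i = x]
tau = {1: 1.0,  0: -1.0}          # CATEs tau(x)

s2 = {x: e[x] * (1 - e[x]) for x in (0, 1)}     # sigma^2_d(x): 0.09 and 0.25
avg_s2 = sum(s2[x] * p[x] for x in (0, 1))      # E[sigma^2_d(X_i)] = 0.13
W = {x: s2[x] / avg_s2 for x in (0, 1)}         # weights: 0.692 and 1.92

ate   = sum(tau[x] * p[x] for x in (0, 1))            # 0.5
tau_R = sum(tau[x] * W[x] * p[x] for x in (0, 1))     # about 0.04
```

Despite strong positive and negative CATEs that average to 0.5, the regression estimand is nearly zero because OLS up-weights the small, high-variance stratum.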
When will OLS estimate the ATE?

• When does τ = τ^R?
• Constant treatment effects: τ(x) = τ = τ^R.
• Constant probability of treatment: e(x) = Pr[D_i = 1|X_i = x] = e. This implies that the OLS weights are all 1.
• An incorrect linearity assumption in X_i will lead to additional bias.
Other ways to use regression

• What's the path forward?
  – Accept the bias (it might be relatively small with saturated models).
  – Use a different regression approach.
• Let μ_d(x) = E[Y_i(d)|X_i = x] be the CEF for the potential outcome under D_i = d.
• By consistency and no unmeasured confounders, we have μ_d(x) = E[Y_i|D_i = d, X_i = x].
• Estimate a regression of Y_i on X_i among the D_i = d group.
• Then μ̂_d(x) is just a predicted value from that regression at X_i = x.
• How can we use this?
Imputation estimators

• Impute the treated potential outcomes with Ŷ_i(1) = μ̂_1(X_i).
• Impute the control potential outcomes with Ŷ_i(0) = μ̂_0(X_i).
• Procedure:
  1. Regress Y_i on X_i in the treated group and get predicted values for all units (treated or control).
  2. Regress Y_i on X_i in the control group and get predicted values for all units (treated or control).
  3. Take the average difference between these predicted values.
• More formally:

    τ̂_imp = (1/N) Σ_i [μ̂_1(X_i) − μ̂_0(X_i)]

• Sometimes called an imputation estimator.
Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample.
• Useful trick: fit a model on the entire data and use model.frame() to get the right design matrix.

## heterogeneous effects
yhet <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(yhet ~ d + X)
mod1 <- lm(yhet ~ X, subset = d == 1)
mod0 <- lm(yhet ~ X, subset = d == 0)
y1imps <- predict(mod1, model.frame(mod))
y0imps <- predict(mod0, model.frame(mod))
mean(y1imps - y0imps)

## [1] 0.61
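The same three-step procedure can be sketched outside R as well. A minimal Python version with simulated data (all names and numbers are illustrative, not from the course):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4000

# Made-up data with a heterogeneous effect: tau_i = 1 + 0.5 * x_i,
# so the ATE is 1 because E[x] = 0
x = rng.normal(size=n)
d = rng.binomial(1, 0.5, n)
y = x + d * (1.0 + 0.5 * x) + rng.normal(0, 1, n)

def ols(xcol, ycol):
    """Intercept and slope from a one-regressor least-squares fit."""
    Z = np.column_stack([np.ones(len(xcol)), xcol])
    return np.linalg.lstsq(Z, ycol, rcond=None)[0]

# Steps 1 and 2: fit separate regressions within each treatment group,
# then predict both potential outcomes for *every* unit
b1, b0 = ols(x[d == 1], y[d == 1]), ols(x[d == 0], y[d == 0])
mu1 = b1[0] + b1[1] * x
mu0 = b0[0] + b0[1] * x

# Step 3: average the difference in imputed potential outcomes
tau_imp = np.mean(mu1 - mu0)      # close to the ATE of 1
```

Unlike the single pooled OLS coefficient, this estimator averages the imputed differences with equal weight on every unit, so it targets the ATE directly.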
Notes on imputation estimators

• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE.
• Why don't people use this?
  – Most people don't know the results we've been talking about.
  – Harder to implement than vanilla OLS.
• Can use linear regression to estimate μ̂_d(x) = x′β_d.
• A recent trend is to estimate μ̂_d(x) via nonparametric methods:
  – Kernel regression, local linear regression, regression trees, etc.
  – Easiest is generalized additive models (GAMs).
Imputation estimator visualization

[Figure, built up over three slides: scatterplot of y against x for treated and control units, with the within-group regression fits and imputed values added in turn; x ranges from −2 to 3, y from −0.5 to 1.5.]
Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i.

[Figure, built up over three slides: scatterplot of y against x for treated and control units with nonlinear CEFs; x ranges from −2 to 2, y from −1.0 to 0.5.]
Using semiparametric regression

• Here the CEFs are nonlinear, but we don't know their form.
• We can use GAMs from the mgcv package for a flexible estimate:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204  Scale est. = 0.0071351  n = 30
Using GAMs

[Figure, built up over three slides: the same scatterplot with the GAM fits for the treated and control groups overlaid; x ranges from −2 to 2, y from −1.0 to 0.5.]
Limited dependent variables

• Usual advice: model the data from first principles. Logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is a-ok with limited DVs when:
  – Binary treatment and no covariates (just the difference in means).
  – Binary treatment, discrete covariates, and saturated models (stratified differences in means).
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong.
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models.
  – Could go wrong in small samples.
  – If using nonlinear models, always get effects on the scale of the outcome.
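The first bullet under "OLS is a-ok" is an exact algebraic identity, not an approximation. A quick sketch (Python, with illustrative data) showing that the OLS slope on a binary treatment equals the difference in means even when the outcome is binary:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000

# Binary outcome, binary treatment, no covariates
d = rng.binomial(1, 0.5, n)
y = rng.binomial(1, np.where(d == 1, 0.7, 0.4)).astype(float)

# OLS slope of y on d (with an intercept)...
slope = np.linalg.lstsq(np.column_stack([np.ones(n), d]), y, rcond=None)[0][1]

# ...is numerically identical to the difference in means,
# even though the outcome is binary
diff_means = y[d == 1].mean() - y[d == 0].mean()
```

No distributional model for the binary outcome is needed for this estimand; the linear probability model and the stratified comparison coincide here.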
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
OLS with covariates

summary(lm(y ~ d + X))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  20.9952      0.159  132.07   <2e-16
d            -0.128       0.253   -0.5      0.62
X1            2.7368      0.126   21.75   <2e-16
X2            1.3677      0.114   12.01   <2e-16
X3            1.3673      0.130   10.51   <2e-16
X4            1.3570      0.106   12.80   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom
Multiple R-squared: 0.999,  Adjusted R-squared: 0.999
F-statistic: 2.44e+04 on 5 and 94 DF,  p-value: <2e-16
What happens with nonlinearity?

Suppose we can only observe the following covariates:

z1 <- exp(X[, 1]/2)
z2 <- X[, 2]/(1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3]/25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

This implies that Y_i and D_i are functions of log(Z_{i1}), Z_{i2}, Z²_{i1}Z_{i2}, 1/log(Z_{i1}), Z^{1/3}_{i3}/log(Z_{i1}), and Z^{1/2}_{i4}.

Regression is a nonlinear function of the observed covariates.
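The claim can be verified by inverting the transformations: each original covariate is recoverable only as a nonlinear function of the observed z's. A numerical check in Python mirroring the R definitions above (z3 is omitted since the others suffice to make the point):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 4))

# The observed covariates, mirroring the R code above
z1 = np.exp(X[:, 0] / 2)
z2 = X[:, 1] / (1 + np.exp(X[:, 0])) + 10
z4 = (X[:, 1] + X[:, 3] + 20) ** 2

# Inverting the transformations: each original covariate is a
# *nonlinear* function of the observed ones
x1 = 2 * np.log(z1)                    # recovers X[:, 0]
x2 = (z2 - 10) * (1 + z1 ** 2)         # recovers X[:, 1], since exp(x1) = z1^2
x4 = np.sqrt(z4) - 20 - x2             # recovers X[:, 3]
```

So a regression that is linear in the z's is badly misspecified for a CEF that is linear in the original X's.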
When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.8728    30.0799    0.73    0.469
d            -6.9292     3.3854   -2.05    0.043
z1           36.3110     2.6970   13.46   <2e-16
z2           -2.9033     3.6619   -0.79    0.430
z3          862.0220   431.0300    2.00    0.048
z4            0.4021     0.0329   12.23   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14 on 94 degrees of freedom
Multiple R-squared: 0.855,  Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF,  p-value: <2e-16
3. Regression with Heterogeneous Treatment Effects
Heterogeneous effects: binary treatment

• Completely randomized experiment:

    Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
        = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
        = μ_0 + τ_i D_i + (Y_i(0) − μ_0)
        = μ_0 + τ D_i + (Y_i(0) − μ_0) + (τ_i − τ) · D_i
        = μ_0 + τ D_i + ε_i

• The error term now includes two components:
  1. "Baseline" variation in the outcome: (Y_i(0) − μ_0).
  2. Variation in the treatment effect: (τ_i − τ) · D_i.
• It is easy to verify that under the experiment, E[ε_i | D_i] = 0.
• Thus OLS estimates the ATE with no covariates.
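The decomposition above can be checked by simulation: even with unit-level effects folded into the error term, random assignment makes the no-covariate OLS slope (the difference in means) recover the ATE. A minimal sketch with illustrative numbers:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

# Unit-level effects tau_i vary, but the ATE is 2
y0 = rng.normal(0, 1, n)
tau_i = rng.normal(2.0, 1.0, n)
d = rng.binomial(1, 0.5, n)            # completely randomized treatment
y = y0 + tau_i * d

# With no covariates, OLS of y on d is just the difference in means,
# which recovers the ATE despite the heterogeneity in the error
tau_hat = y[d == 1].mean() - y[d == 0].mean()
```

Heterogeneity does raise the residual variance in the treated group, so it affects precision, but not the estimand.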
Adding covariates
bull What happens with no unmeasured confounders Need tocondition on 119883119894 now
bull Remember identification of the ATEATT using iteratedexpectations
bull ATE is the weighted sum of CATEs
120591 = 1114012119909120591(119909) Pr[119883119894 = 119909]
bull ATEATT are weighted averages of CATEsbull What about the regression estimand 120591119877 How does it related
to the ATEATT
37 58
Heterogeneous effects and regression
bull Letrsquos investigate this under a saturated regression model
119884119894 = 1114012119909119861119909119894120572119909 + 120591119877119863119894 + 119890119894
bull Use a dummy variable for each unique combination of 119883119894119861119909119894 = 120128(119883119894 = 119909)
bull Linear in 119883119894 by construction
38 58
Investigating the regression coefficient
bull How can we investigate 120591119877 Well we can rely on theregression anatomy
120591119877 = Cov(119884119894 119863119894 minus 119864[119863119894|119883119894])120141(119863119894 minus 119864[119863119894|119883119894])
bull 119863119894 minus 120124[119863119894|119883119894] is the residual from a regression of 119863119894 on the fullset of dummies
bull With a little work we can show
120591119877 =120124 1114094120591(119883119894)(119863119894 minus 120124[119863119894|119883119894])11135691114097
120124[(119863119894 minus 119864[119863119894|119883119894])1113569]=
120124[120591(119883119894)1205901113569119889(119883119894)]120124[1205901113569119889(119883119894)]
bull 1205901113569119889(119909) = 120141[119863119894|119883119894 = 119909] is the conditional variance of treatmentassignment
39 58
ATE versus OLS
120591119877 = 120124[120591(119883119894)119882119894] = 1114012119909120591(119909)
1205901113569119889(119909)120124[1205901113569119889(119883119894)]
โ[119883119894 = 119909]
bull Compare to the ATE
120591 = 120124[120591(119883119894)] = 1114012119909120591(119909)โ[119883119894 = 119909]
bull Both weight strata relative to their size (โ[119883119894 = 119909])bull OLS weights strata higher if the treatment variance in those
strata (1205901113569119889(119909)) is higher in those strata relative to the averagevariance across strata (120124[1205901113569119889(119883119894)])
bull The ATE weights only by their size
40 58
Regression weighting
119882119894 =1205901113569119889(119883119894)
120124[1205901113569119889(119883119894)]
bull Why does OLS weight like thisbull OLS is a minimum-variance estimator more weight to
more precise within-strata estimatesbull Within-strata estimates are most precise when the treatment
is evenly spread and thus has the highest variancebull If 119863119894 is binary then we know the conditional variance will be
1205901113569119889(119909) = โ[119863119894 = 1|119883119894 = 119909] (1 minus โ[119863119894 = 1|119883119894 = 119909])= 119890(119909) (1 minus 119890(119909))
bull Maximum variance with โ[119863119894 = 1|119883119894 = 119909] = 12
41 58
OLS weighting examplebull Binary covariate
โ[119883119894 = 1] = 075 โ[119883119894 = 0] = 025โ[119863119894 = 1|119883119894 = 1] = 09 โ[119863119894 = 1|119883119894 = 0] = 05
1205901113569119889(1) = 009 1205901113569119889(0) = 025120591(1) = 1 120591(0) = minus1
bull Implies the ATE is 120591 = 05bull Average conditional variance 120124[1205901113569119889(119883119894)] = 013bull weights for 119883119894 = 1 are 009013 = 0692 for 119883119894 = 0
025013 = 192120591119877 = 120124[120591(119883119894)119882119894]
= 120591(1)119882(1)โ[119883119894 = 1] + 120591(0)119882(0)โ[119883119894 = 0]= 1 times 0692 times 075 + minus1 times 192 times 025= 0039
42 58
When will OLS estimate the ATE
bull When does 120591 = 120591119877bull Constant treatment effects 120591(119909) = 120591 = 120591119877bull Constant probability of treatment 119890(119909) = โ[119863119894 = 1|119883119894 = 119909] = 119890
Implies that the OLS weights are 1
bull Incorrect linearity assumption in 119883119894 will lead to more bias
43 58
Other ways to use regression
bull Whatrsquos the path forward Accept the bias (might be relatively small with saturated
models) Use a different regression approach
bull Let 120583119889(119909) = 120124[119884119894(119889)|119883119894 = 119909] be the CEF for the potentialoutcome under 119863119894 = 119889
bull By consistency and nuc we have 120583119889(119909) = 120124[119884119894|119863119894 = 119889119883119894 = 119909]bull Estimate a regression of 119884119894 on 119883119894 among the 119863119894 = 119889 groupbull Then 119889(119909) is just a predicted value from the regression for
119883119894 = 119909bull How can we use this
44 58
Imputation estimators
bull Impute the treated potential outcomes with 1114023119884119894(1) = 1113568(119883119894)bull Impute the control potential outcomes with 1114023119884119894(0) = 1113567(119883119894)bull Procedure
Regress 119884119894 on 119883119894 in the treated group and get predicted valuesfor all units (treated or control)
Regress 119884119894 on 119883119894 in the control group and get predicted valuesfor all units (treated or control)
Take the average difference between these predicted values
bull More mathematically look like this
120591119894119898119901 =1119873
11140121198941113568(119883119894) minus 1113567(119883119894)
bull Sometimes called an imputation estimator
45 58
Simple imputation estimator
bull Use predict() from the within-group models on the data fromthe entire sample
bull Useful trick use a model on the entire data andmodelframe() to get the right design matrix
heterogeneous effects
yhet lt- ifelse(d == 1 y + rnorm(n 0 5) y)
mod lt- lm(yhet ~ d + X)
mod1 lt- lm(yhet ~ X subset = d == 1)
mod0 lt- lm(yhet ~ X subset = d == 0)
y1imps lt- predict(mod1 modelframe(mod))
y0imps lt- predict(mod0 modelframe(mod))
mean(y1imps - y0imps)
[1] 061
46 58
Notes on imputation estimators
bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE
bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS
bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods
such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)
47 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
48 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
49 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
50 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
51 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
52 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
53 58
Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible
estimatelibrary(mgcv)
mod0 lt- gam(y ~ s(x) subset = d == 0)
summary(mod0)
Family gaussian
Link function identity
Formula
y ~ s(x)
Parametric coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) -00225 00154 -146 016
Approximate significance of smooth terms
edf Refdf F p-value
s(x) 603 708 413 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
R-sq(adj) = 0909 Deviance explained = 928
GCV = 00093204 Scale est = 00071351 n = 30
54 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
55 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
56 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
57 58
Limited dependent variables
bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc
bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models
(stratified diff-in-means)
bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong
bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models
Could go wrong in small samples If using nonlinear models always get effects on the scale of the
outcome
58 58
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
-
What happens with nonlinearity
Suppose we can only observe the following covariates
z1 lt- exp(X[ 1]2)
z2 lt- X[ 2](1 + exp(X[ 1])) + 10
z3 lt- (X[ 1] X[ 3]25 + 06)^3
z4 lt- (X[ 2] + X[ 4] + 20)^2
Implies that 119884119894 and 119863119894 are functions of log(1198851198941113568) 1198851198941113569 119885111356911989411135681198851198941113569
1 log(1198851198941113568) 1198851198941113570 log(1198851198941113568) and 119883111356811135691198941113571
Regression is a nonlinear function of the observed covariates
33 58
When linearity goes wrong
summary(lm(y ~ d + z1 + z2 + z3 + z4))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 218728 300799 073 0469
d -69292 33854 -205 0043
z1 363110 26970 1346 lt2e-16
z2 -29033 36619 -079 0430
z3 862022 431030 200 0048
z4 04021 00329 1223 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 14 on 94 degrees of freedom
Multiple R-squared 0855 Adjusted R-squared 0847
F-statistic 111 on 5 and 94 DF p-value lt2e-16
34 58
3 Regression withHeterogeneousTreatment Effects
35 58
Heterogeneous effects binary treatment
bull Completely randomized experiment
119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894
= 1205831113567 + 120591119894119863119894 + (119884119894(0) minus 1205831113567)= 1205831113567 + 120591119863119894 + (119884119894(0) minus 1205831113567) + (120591119894 minus 120591) sdot 119863119894
= 1205831113567 + 120591119863119894 + 120576119894
bull Error term now includes two components1 ldquoBaselinerdquo variation in the outcome (119884119894(0) minus 1205831113567)2 Variation in the treatment effect (120591119894 minus 120591)
bull Easy to verify that under experiment 120124[120576119894|119863119894] = 0bull Thus OLS estimates the ATE with no covariates
36 58
Adding covariates
bull What happens with no unmeasured confounders Need tocondition on 119883119894 now
bull Remember identification of the ATEATT using iteratedexpectations
bull ATE is the weighted sum of CATEs
120591 = 1114012119909120591(119909) Pr[119883119894 = 119909]
bull ATEATT are weighted averages of CATEsbull What about the regression estimand 120591119877 How does it related
to the ATEATT
37 58
Heterogeneous effects and regression
bull Letrsquos investigate this under a saturated regression model
119884119894 = 1114012119909119861119909119894120572119909 + 120591119877119863119894 + 119890119894
bull Use a dummy variable for each unique combination of 119883119894119861119909119894 = 120128(119883119894 = 119909)
bull Linear in 119883119894 by construction
38 58
Investigating the regression coefficient
bull How can we investigate 120591119877 Well we can rely on theregression anatomy
120591119877 = Cov(119884119894 119863119894 minus 119864[119863119894|119883119894])120141(119863119894 minus 119864[119863119894|119883119894])
bull 119863119894 minus 120124[119863119894|119883119894] is the residual from a regression of 119863119894 on the fullset of dummies
bull With a little work we can show
120591119877 =120124 1114094120591(119883119894)(119863119894 minus 120124[119863119894|119883119894])11135691114097
120124[(119863119894 minus 119864[119863119894|119883119894])1113569]=
120124[120591(119883119894)1205901113569119889(119883119894)]120124[1205901113569119889(119883119894)]
bull 1205901113569119889(119909) = 120141[119863119894|119883119894 = 119909] is the conditional variance of treatmentassignment
39 58
ATE versus OLS
120591119877 = 120124[120591(119883119894)119882119894] = 1114012119909120591(119909)
1205901113569119889(119909)120124[1205901113569119889(119883119894)]
โ[119883119894 = 119909]
bull Compare to the ATE
120591 = 120124[120591(119883119894)] = 1114012119909120591(119909)โ[119883119894 = 119909]
bull Both weight strata relative to their size (โ[119883119894 = 119909])bull OLS weights strata higher if the treatment variance in those
strata (1205901113569119889(119909)) is higher in those strata relative to the averagevariance across strata (120124[1205901113569119889(119883119894)])
bull The ATE weights only by their size
40 58
Regression weighting
119882119894 =1205901113569119889(119883119894)
120124[1205901113569119889(119883119894)]
bull Why does OLS weight like thisbull OLS is a minimum-variance estimator more weight to
more precise within-strata estimatesbull Within-strata estimates are most precise when the treatment
is evenly spread and thus has the highest variancebull If 119863119894 is binary then we know the conditional variance will be
1205901113569119889(119909) = โ[119863119894 = 1|119883119894 = 119909] (1 minus โ[119863119894 = 1|119883119894 = 119909])= 119890(119909) (1 minus 119890(119909))
bull Maximum variance with โ[119863119894 = 1|119883119894 = 119909] = 12
41 58
OLS weighting examplebull Binary covariate
โ[119883119894 = 1] = 075 โ[119883119894 = 0] = 025โ[119863119894 = 1|119883119894 = 1] = 09 โ[119863119894 = 1|119883119894 = 0] = 05
1205901113569119889(1) = 009 1205901113569119889(0) = 025120591(1) = 1 120591(0) = minus1
bull Implies the ATE is 120591 = 05bull Average conditional variance 120124[1205901113569119889(119883119894)] = 013bull weights for 119883119894 = 1 are 009013 = 0692 for 119883119894 = 0
025013 = 192120591119877 = 120124[120591(119883119894)119882119894]
= 120591(1)119882(1)โ[119883119894 = 1] + 120591(0)119882(0)โ[119883119894 = 0]= 1 times 0692 times 075 + minus1 times 192 times 025= 0039
42 58
When will OLS estimate the ATE
bull When does 120591 = 120591119877bull Constant treatment effects 120591(119909) = 120591 = 120591119877bull Constant probability of treatment 119890(119909) = โ[119863119894 = 1|119883119894 = 119909] = 119890
Implies that the OLS weights are 1
bull Incorrect linearity assumption in 119883119894 will lead to more bias
43 58
Other ways to use regression
bull Whatrsquos the path forward Accept the bias (might be relatively small with saturated
models) Use a different regression approach
bull Let 120583119889(119909) = 120124[119884119894(119889)|119883119894 = 119909] be the CEF for the potentialoutcome under 119863119894 = 119889
bull By consistency and nuc we have 120583119889(119909) = 120124[119884119894|119863119894 = 119889119883119894 = 119909]bull Estimate a regression of 119884119894 on 119883119894 among the 119863119894 = 119889 groupbull Then 119889(119909) is just a predicted value from the regression for
119883119894 = 119909bull How can we use this
44 58
Imputation estimators
bull Impute the treated potential outcomes with 1114023119884119894(1) = 1113568(119883119894)bull Impute the control potential outcomes with 1114023119884119894(0) = 1113567(119883119894)bull Procedure
Regress 119884119894 on 119883119894 in the treated group and get predicted valuesfor all units (treated or control)
Regress 119884119894 on 119883119894 in the control group and get predicted valuesfor all units (treated or control)
Take the average difference between these predicted values
bull More mathematically look like this
120591119894119898119901 =1119873
11140121198941113568(119883119894) minus 1113567(119883119894)
bull Sometimes called an imputation estimator
45 58
Simple imputation estimator
bull Use predict() from the within-group models on the data fromthe entire sample
bull Useful trick use a model on the entire data andmodelframe() to get the right design matrix
heterogeneous effects
yhet lt- ifelse(d == 1 y + rnorm(n 0 5) y)
mod lt- lm(yhet ~ d + X)
mod1 lt- lm(yhet ~ X subset = d == 1)
mod0 lt- lm(yhet ~ X subset = d == 0)
y1imps lt- predict(mod1 modelframe(mod))
y0imps lt- predict(mod0 modelframe(mod))
mean(y1imps - y0imps)
[1] 061
46 58
Notes on imputation estimators
bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE
bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS
bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods
such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)
47 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
48 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
49 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
50 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
51 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
52 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
53 58
Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible
estimatelibrary(mgcv)
mod0 lt- gam(y ~ s(x) subset = d == 0)
summary(mod0)
Family gaussian
Link function identity
Formula
y ~ s(x)
Parametric coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) -00225 00154 -146 016
Approximate significance of smooth terms
edf Refdf F p-value
s(x) 603 708 413 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
R-sq(adj) = 0909 Deviance explained = 928
GCV = 00093204 Scale est = 00071351 n = 30
54 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
55 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
56 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
57 58
Limited dependent variables
bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc
bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models
(stratified diff-in-means)
bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong
bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models
Could go wrong in small samples If using nonlinear models always get effects on the scale of the
outcome
58 58
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
-
When linearity goes wrong
summary(lm(y ~ d + z1 + z2 + z3 + z4))
Coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) 218728 300799 073 0469
d -69292 33854 -205 0043
z1 363110 26970 1346 lt2e-16
z2 -29033 36619 -079 0430
z3 862022 431030 200 0048
z4 04021 00329 1223 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
Residual standard error 14 on 94 degrees of freedom
Multiple R-squared 0855 Adjusted R-squared 0847
F-statistic 111 on 5 and 94 DF p-value lt2e-16
34 58
3 Regression withHeterogeneousTreatment Effects
35 58
Heterogeneous effects binary treatment
bull Completely randomized experiment
119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894
= 1205831113567 + 120591119894119863119894 + (119884119894(0) minus 1205831113567)= 1205831113567 + 120591119863119894 + (119884119894(0) minus 1205831113567) + (120591119894 minus 120591) sdot 119863119894
= 1205831113567 + 120591119863119894 + 120576119894
bull Error term now includes two components1 ldquoBaselinerdquo variation in the outcome (119884119894(0) minus 1205831113567)2 Variation in the treatment effect (120591119894 minus 120591)
bull Easy to verify that under experiment 120124[120576119894|119863119894] = 0bull Thus OLS estimates the ATE with no covariates
36 58
Adding covariates
bull What happens with no unmeasured confounders Need tocondition on 119883119894 now
bull Remember identification of the ATEATT using iteratedexpectations
bull ATE is the weighted sum of CATEs
120591 = 1114012119909120591(119909) Pr[119883119894 = 119909]
bull ATEATT are weighted averages of CATEsbull What about the regression estimand 120591119877 How does it related
to the ATEATT
37 58
Heterogeneous effects and regression
bull Letrsquos investigate this under a saturated regression model
119884119894 = 1114012119909119861119909119894120572119909 + 120591119877119863119894 + 119890119894
bull Use a dummy variable for each unique combination of 119883119894119861119909119894 = 120128(119883119894 = 119909)
bull Linear in 119883119894 by construction
38 58
Investigating the regression coefficient
bull How can we investigate 120591119877 Well we can rely on theregression anatomy
120591119877 = Cov(119884119894 119863119894 minus 119864[119863119894|119883119894])120141(119863119894 minus 119864[119863119894|119883119894])
bull 119863119894 minus 120124[119863119894|119883119894] is the residual from a regression of 119863119894 on the fullset of dummies
bull With a little work we can show
120591119877 =120124 1114094120591(119883119894)(119863119894 minus 120124[119863119894|119883119894])11135691114097
120124[(119863119894 minus 119864[119863119894|119883119894])1113569]=
120124[120591(119883119894)1205901113569119889(119883119894)]120124[1205901113569119889(119883119894)]
bull 1205901113569119889(119909) = 120141[119863119894|119883119894 = 119909] is the conditional variance of treatmentassignment
39 58
ATE versus OLS

τ_R = E[τ(X_i) W_i] = Σ_x τ(x) (σ²_d(x) / E[σ²_d(X_i)]) Pr[X_i = x]

• Compare to the ATE:

  τ = E[τ(X_i)] = Σ_x τ(x) Pr[X_i = x]

• Both weight strata relative to their size (Pr[X_i = x])
• OLS weights a stratum more heavily when its treatment variance (σ²_d(x)) is high relative to the average variance across strata (E[σ²_d(X_i)])
• The ATE weights strata only by their size

40/58
Regression weighting

W_i = σ²_d(X_i) / E[σ²_d(X_i)]

• Why does OLS weight like this?
• OLS is a minimum-variance estimator: more weight goes to more precise within-strata estimates
• Within-strata estimates are most precise when the treatment is evenly spread and thus has the highest variance
• If D_i is binary, then we know the conditional variance will be

  σ²_d(x) = Pr[D_i = 1 | X_i = x] (1 − Pr[D_i = 1 | X_i = x]) = e(x)(1 − e(x))

• Maximum variance with Pr[D_i = 1 | X_i = x] = 1/2

41/58
OLS weighting example

• Binary covariate:

  Pr[X_i = 1] = 0.75,  Pr[X_i = 0] = 0.25
  Pr[D_i = 1 | X_i = 1] = 0.9,  Pr[D_i = 1 | X_i = 0] = 0.5
  σ²_d(1) = 0.09,  σ²_d(0) = 0.25
  τ(1) = 1,  τ(0) = −1

• Implies the ATE is τ = 0.5
• Average conditional variance: E[σ²_d(X_i)] = 0.13
• Weights: for X_i = 1, 0.09/0.13 ≈ 0.692; for X_i = 0, 0.25/0.13 ≈ 1.92

  τ_R = E[τ(X_i) W_i]
      = τ(1) W(1) Pr[X_i = 1] + τ(0) W(0) Pr[X_i = 0]
      = 1 × 0.692 × 0.75 + (−1) × 1.92 × 0.25
      ≈ 0.039

42/58
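The arithmetic in this example can be reproduced in a few lines (all numbers taken from the slide):

```r
# OLS weighting example: why tau_R differs from the ATE
p_x   <- c(x0 = 0.25, x1 = 0.75)       # Pr[X_i = x]
e_x   <- c(x0 = 0.5,  x1 = 0.9)        # Pr[D_i = 1 | X_i = x]
tau_x <- c(x0 = -1,   x1 = 1)          # CATEs
s2_d  <- e_x * (1 - e_x)               # sigma^2_d(x): 0.25 and 0.09
avg_v <- sum(s2_d * p_x)               # E[sigma^2_d(X_i)] = 0.13
w     <- s2_d / avg_v                  # OLS weights: 1.92 and 0.692
c(ATE   = sum(tau_x * p_x),            # 0.5
  tau_R = sum(tau_x * w * p_x))        # 0.0385, i.e. ~0.039 after rounding
```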
When will OLS estimate the ATE?

• When does τ = τ_R?
• Constant treatment effects: τ(x) = τ = τ_R
• Constant probability of treatment: e(x) = Pr[D_i = 1 | X_i = x] = e
  • Implies that the OLS weights are all 1
• An incorrect linearity assumption in X_i will lead to more bias

43/58
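The constant-propensity case can be checked in one line (stratum values are made up, but the propensity is held constant across strata):

```r
# With constant e(x), sigma^2_d(x) = e(1 - e) is the same in every stratum,
# so the weights sigma^2_d(x) / E[sigma^2_d(X_i)] all equal 1 and tau_R = ATE
p_x   <- c(0.25, 0.75)
e_x   <- c(0.6, 0.6)                       # constant probability of treatment
tau_x <- c(-1, 1)
s2_d  <- e_x * (1 - e_x)
w     <- s2_d / sum(s2_d * p_x)            # both weights are 1
all.equal(sum(tau_x * w * p_x), sum(tau_x * p_x))  # TRUE: tau_R = ATE
```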
Other ways to use regression

• What's the path forward?
  • Accept the bias (might be relatively small with saturated models)
  • Use a different regression approach
• Let μ_d(x) = E[Y_i(d) | X_i = x] be the CEF for the potential outcome under D_i = d
• By consistency and no unmeasured confounders, we have μ_d(x) = E[Y_i | D_i = d, X_i = x]
• Estimate a regression of Y_i on X_i among the D_i = d group
• Then μ̂_d(x) is just a predicted value from that regression at X_i = x
• How can we use this?

44/58
Imputation estimators

• Impute the treated potential outcomes with Ŷ_i(1) = μ̂_1(X_i)
• Impute the control potential outcomes with Ŷ_i(0) = μ̂_0(X_i)
• Procedure:
  1. Regress Y_i on X_i in the treated group and get predicted values for all units (treated or control)
  2. Regress Y_i on X_i in the control group and get predicted values for all units (treated or control)
  3. Take the average difference between these predicted values
• More formally:

  τ̂_imp = (1/N) Σ_i [ μ̂_1(X_i) − μ̂_0(X_i) ]

• Sometimes called an imputation estimator

45/58
Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample
• Useful trick: fit a model on the entire data and use model.frame() to get the right design matrix

## heterogeneous effects (y, d, X, and n come from the simulated data used earlier)
yhet <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(yhet ~ d + X)
mod1 <- lm(yhet ~ X, subset = d == 1)
mod0 <- lm(yhet ~ X, subset = d == 0)
y1imps <- predict(mod1, model.frame(mod))
y0imps <- predict(mod0, model.frame(mod))
mean(y1imps - y0imps)

[1] 0.61

46/58
Notes on imputation estimators

• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE
• Why don't people use this?
  • Most people don't know the results we've been talking about
  • Harder to implement than vanilla OLS
• Can use linear regression to estimate μ̂_d(x) = x′β_d
• Recent trend is to estimate μ̂_d(x) via non-parametric methods such as:
  • kernel regression, local linear regression, regression trees, etc.
  • Easiest is generalized additive models (GAMs)

47/58
Imputation estimator visualization

[Figure: scatterplot of y against x for treated and control units, with within-group regression fits used to impute potential outcomes; built up across slides 48–50/58]
Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i

[Figure: scatterplot of y against x with nonlinear treated and control CEFs; built up across slides 51–53/58]
Using semiparametric regression

• Here the CEFs are nonlinear, but we don't know their form
• We can use GAMs from the mgcv package for flexible estimation

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204  Scale est. = 0.0071351  n = 30

54/58
Using GAMs

[Figure: scatterplot of y against x with GAM fits for the treated and control groups; built up across slides 55–57/58]
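Putting the pieces together, the within-group GAM fits can be plugged into the same imputation estimator. The sketch below uses simulated data (the sine-shaped CEF and the constant effect of 0.5 are assumptions of this simulation, not the lecture's data):

```r
# GAM-based imputation estimator on simulated nonlinear data
library(mgcv)
set.seed(1)
n <- 200
x <- runif(n, -2, 2)
d <- rbinom(n, 1, plogis(x))                 # treatment probability depends on x
y <- sin(x) + 0.5 * d + rnorm(n, 0, 0.2)     # true constant effect = 0.5

mod1 <- gam(y ~ s(x), subset = d == 1)       # treated-group CEF
mod0 <- gam(y ~ s(x), subset = d == 0)       # control-group CEF
y1_imp <- predict(mod1, newdata = data.frame(x = x))  # impute Y_i(1) for everyone
y0_imp <- predict(mod0, newdata = data.frame(x = x))  # impute Y_i(0) for everyone
mean(y1_imp - y0_imp)                        # close to the true effect of 0.5
```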
Limited dependent variables

• Usual advice: model the data from first principles
  • Logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is a-ok with limited DVs when:
  • Binary treatment and no covariates (just diff-in-means)
  • Binary treatment, discrete covariates, and saturated models (stratified diff-in-means)
• Imposing a model on LDVs in these cases imposes a distributional assumption, which could be wrong
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models
  • Could go wrong in small samples
  • If using nonlinear models, always get effects on the scale of the outcome

58/58
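The first "a-ok" case is easy to confirm: with a binary outcome, a binary treatment, and no covariates, the OLS slope is algebraically identical to the difference in means (simulated data for illustration):

```r
# OLS slope = difference in means with a binary treatment and no covariates
set.seed(7)
n <- 500
d <- rbinom(n, 1, 0.5)                      # binary treatment
y <- rbinom(n, 1, 0.3 + 0.2 * d)            # binary (limited) outcome
ols_coef <- coef(lm(y ~ d))["d"]
diff_means <- mean(y[d == 1]) - mean(y[d == 0])
all.equal(unname(ols_coef), diff_means)     # TRUE
```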
Could go wrong in small samples If using nonlinear models always get effects on the scale of the
outcome
58 58
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
-
Heterogeneous effects and regression

• Let's investigate this under a saturated regression model:

  Y_i = Σ_x B_{x,i} α_x + τ_R D_i + e_i

• Use a dummy variable for each unique combination of X_i: B_{x,i} = 1(X_i = x)
• Linear in X_i by construction

38/58
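As a concrete sketch of the saturated model, here is a Python/numpy illustration (not the course's R code; the simulated data and all names are mine). It fits one dummy per stratum plus the treatment and recovers a single coefficient τ_R:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
x = rng.integers(0, 2, size=n)            # one binary covariate -> two strata
e_x = np.where(x == 1, 0.9, 0.5)          # P(D_i = 1 | X_i = x)
d = rng.binomial(1, e_x)
tau_x = np.where(x == 1, 1.0, -1.0)       # heterogeneous effects tau(x); ATE = 0
y = 0.3 * x + tau_x * d + rng.normal(0, 1, n)

# Saturated design: one dummy B_{x,i} = 1(X_i = x) per stratum, plus D_i
design = np.column_stack([(x == 0), (x == 1), d]).astype(float)
alpha0, alpha1, tau_R = np.linalg.lstsq(design, y, rcond=None)[0]
```

With these assumed parameters, τ_R should land near the variance-weighted value of about −0.47 rather than the ATE of 0, previewing the weighting result derived on the next slides.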
Investigating the regression coefficient

• How can we investigate τ_R? We can rely on the regression anatomy:

  τ_R = Cov(Y_i, D_i − E[D_i | X_i]) / V(D_i − E[D_i | X_i])

• D_i − E[D_i | X_i] is the residual from a regression of D_i on the full set of dummies
• With a little work, we can show:

  τ_R = E[τ(X_i)(D_i − E[D_i | X_i])²] / E[(D_i − E[D_i | X_i])²] = E[τ(X_i) σ²_d(X_i)] / E[σ²_d(X_i)]

• σ²_d(x) = V[D_i | X_i = x] is the conditional variance of treatment assignment

39/58
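The anatomy identity can be checked by simulation. A numpy sketch (simulated data and names are mine), using the true propensity in place of the estimated E[D_i | X_i]:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.binomial(1, 0.75, n)              # covariate strata
e_x = np.where(x == 1, 0.9, 0.5)          # true propensity E[D_i | X_i]
d = rng.binomial(1, e_x)
tau_x = np.where(x == 1, 1.0, -1.0)       # stratum effects tau(x)
y = 0.3 * x + tau_x * d + rng.normal(0, 1, n)

d_res = d - e_x                           # residual D_i - E[D_i | X_i]
tau_R_anatomy = np.cov(y, d_res)[0, 1] / np.var(d_res)

var_x = e_x * (1 - e_x)                   # sigma^2_d(X_i)
tau_R_weighted = np.mean(tau_x * var_x) / np.mean(var_x)
```

In large samples the two quantities agree, both giving the variance-weighted average of τ(x) rather than the ATE.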
ATE versus OLS

  τ_R = E[τ(X_i) W_i] = Σ_x τ(x) (σ²_d(x) / E[σ²_d(X_i)]) P[X_i = x]

• Compare to the ATE:

  τ = E[τ(X_i)] = Σ_x τ(x) P[X_i = x]

• Both weight strata relative to their size (P[X_i = x])
• OLS weights strata higher if their treatment variance (σ²_d(x)) is high relative to the average variance across strata (E[σ²_d(X_i)])
• The ATE weights strata only by their size

40/58
Regression weighting

  W_i = σ²_d(X_i) / E[σ²_d(X_i)]

• Why does OLS weight like this?
• OLS is a minimum-variance estimator: more weight to more precise within-strata estimates
• Within-strata estimates are most precise when the treatment is evenly spread and thus has the highest variance
• If D_i is binary, then we know the conditional variance will be

  σ²_d(x) = P[D_i = 1 | X_i = x](1 − P[D_i = 1 | X_i = x]) = e(x)(1 − e(x))

• Maximum variance with P[D_i = 1 | X_i = x] = 1/2

41/58
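The last bullet is a one-line check: e(1 − e) is a downward parabola maximized at e = 1/2, where it equals 1/4. A quick numerical confirmation:

```python
import numpy as np

# Conditional variance e(x)(1 - e(x)) as a function of the propensity e(x)
e = np.linspace(0.0, 1.0, 10_001)
cond_var = e * (1 - e)
e_star = e[np.argmax(cond_var)]   # propensity that maximizes the variance
```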
OLS weighting example

• Binary covariate:

  P[X_i = 1] = 0.75,  P[X_i = 0] = 0.25
  P[D_i = 1 | X_i = 1] = 0.9,  P[D_i = 1 | X_i = 0] = 0.5
  σ²_d(1) = 0.09,  σ²_d(0) = 0.25
  τ(1) = 1,  τ(0) = −1

• Implies the ATE is τ = 0.5
• Average conditional variance: E[σ²_d(X_i)] = 0.13
• Weights: for X_i = 1, 0.09/0.13 = 0.692; for X_i = 0, 0.25/0.13 = 1.92

  τ_R = E[τ(X_i) W_i]
      = τ(1) W(1) P[X_i = 1] + τ(0) W(0) P[X_i = 0]
      = 1 × 0.692 × 0.75 + (−1) × 1.92 × 0.25
      = 0.039

42/58
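The arithmetic on this slide can be reproduced directly (a small Python check; the numbers come from the slide):

```python
# Stratum shares, propensities, and effects from the example above
p = {1: 0.75, 0: 0.25}                   # P[X_i = x]
e = {1: 0.9, 0: 0.5}                     # P[D_i = 1 | X_i = x]
tau = {1: 1.0, 0: -1.0}                  # tau(x)

var = {x: e[x] * (1 - e[x]) for x in p}  # sigma^2_d(x): 0.09 and 0.25
avg_var = sum(p[x] * var[x] for x in p)  # E[sigma^2_d(X_i)] = 0.13
w = {x: var[x] / avg_var for x in p}     # OLS weights: about 0.692 and 1.92

ate = sum(p[x] * tau[x] for x in p)           # 0.5
tau_R = sum(p[x] * w[x] * tau[x] for x in p)  # about 0.038
```

The slide's 0.039 comes from rounding the weights before multiplying; the unrounded value is about 0.0385. Either way, OLS lands far from the ATE of 0.5.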
When will OLS estimate the ATE?

• When does τ = τ_R?
• Constant treatment effects: τ(x) = τ = τ_R
• Constant probability of treatment: e(x) = P[D_i = 1 | X_i = x] = e
  - Implies that the OLS weights are 1
• Incorrect linearity assumption in X_i will lead to more bias

43/58
Other ways to use regression

• What's the path forward?
  - Accept the bias (might be relatively small with saturated models)
  - Use a different regression approach
• Let μ_d(x) = E[Y_i(d) | X_i = x] be the CEF for the potential outcome under D_i = d
• By consistency and no unmeasured confounders, we have μ_d(x) = E[Y_i | D_i = d, X_i = x]
• Estimate a regression of Y_i on X_i among the D_i = d group
• Then μ̂_d(x) is just a predicted value from that regression at X_i = x
• How can we use this?

44/58
Imputation estimators

• Impute the treated potential outcomes with Ŷ_i(1) = μ̂_1(X_i)
• Impute the control potential outcomes with Ŷ_i(0) = μ̂_0(X_i)
• Procedure:
  1. Regress Y_i on X_i in the treated group and get predicted values for all units (treated or control)
  2. Regress Y_i on X_i in the control group and get predicted values for all units (treated or control)
  3. Take the average difference between these predicted values
• More mathematically, this looks like:

  τ̂_imp = (1/N) Σ_i [μ̂_1(X_i) − μ̂_0(X_i)]

• Sometimes called an imputation estimator

45/58
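The three-step procedure fits in a few lines. A Python/numpy illustration on simulated data (the data-generating process and all names are mine), with linear CEFs so within-group OLS is correctly specified:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
x = rng.normal(size=n)
d = rng.binomial(1, 1 / (1 + np.exp(-x)))                  # selection on x
y = 0.5 * x + d * (1.0 + 0.5 * x) + rng.normal(0, 1, n)    # true ATE = 1

def ols_predict(x_fit, y_fit, x_all):
    """Fit Y on [1, X] within one group; predict for all units."""
    A = np.column_stack([np.ones(len(x_fit)), x_fit])
    b = np.linalg.lstsq(A, y_fit, rcond=None)[0]
    return b[0] + b[1] * x_all

mu1_hat = ols_predict(x[d == 1], y[d == 1], x)   # imputed Y_i(1) for everyone
mu0_hat = ols_predict(x[d == 0], y[d == 0], x)   # imputed Y_i(0) for everyone
tau_imp = np.mean(mu1_hat - mu0_hat)             # imputation estimate of the ATE
```

Even though treated units differ systematically from controls in x, averaging the imputed differences over the whole sample recovers the ATE of 1.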
Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample
• Useful trick: fit a model on the entire data and use model.frame() to get the right design matrix

# heterogeneous effects
yhet <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(yhet ~ d + X)
mod1 <- lm(yhet ~ X, subset = d == 1)
mod0 <- lm(yhet ~ X, subset = d == 0)
y1imps <- predict(mod1, model.frame(mod))
y0imps <- predict(mod0, model.frame(mod))
mean(y1imps - y0imps)

[1] 0.61

46/58
Notes on imputation estimators

• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE
• Why don't people use this?
  - Most people don't know the results we've been talking about
  - Harder to implement than vanilla OLS
• Can use linear regression to estimate μ̂_d(x) = x′β̂_d
• Recent trend is to estimate μ̂_d(x) via nonparametric methods such as kernel regression, local linear regression, regression trees, etc. Easiest is generalized additive models (GAMs)

47/58
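Following the nonparametric suggestion, here is a sketch where the CEFs are nonlinear and a flexible polynomial-basis fit stands in for a GAM or kernel regression (a Python illustration; the data-generating process, the degree-5 basis, and all names are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4_000
x = rng.uniform(-2, 2, n)
d = rng.binomial(1, 0.5, n)
y = np.sin(x) + 0.7 * d + rng.normal(0, 0.3, n)   # nonlinear CEFs; true ATE = 0.7

def flex_predict(x_fit, y_fit, x_all, deg=5):
    """Degree-`deg` polynomial fit within one group, predicted for all units."""
    coefs = np.polyfit(x_fit, y_fit, deg)
    return np.polyval(coefs, x_all)

mu1_hat = flex_predict(x[d == 1], y[d == 1], x)   # imputed Y_i(1)
mu0_hat = flex_predict(x[d == 0], y[d == 0], x)   # imputed Y_i(0)
tau_imp = np.mean(mu1_hat - mu0_hat)              # recovers roughly 0.7
```

The same two-regressions-plus-average recipe goes through unchanged; only the within-group fitting step becomes flexible.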
Imputation estimator visualization

[Figure, slides 48-50: scatterplot of y against x for treated and control units, building up the within-group regression fits and the imputed values across three slides]

48-50/58
Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i

[Figure, slides 51-53: scatterplot of y against x for treated and control units with nonlinear CEFs, built up across three slides]

51-53/58
Using semiparametric regression

• Here the CEFs are nonlinear, but we don't know their form
• We can use GAMs from the mgcv package for a flexible estimate:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204   Scale est. = 0.0071351   n = 30

54/58
Using GAMs

[Figure, slides 55-57: the fitted GAM curves for treated and control groups overlaid on the y versus x scatterplot, built up across three slides]

55-57/58
Limited dependent variables

• Usual advice: model the data from first principles. Logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is a-ok with limited DVs when:
  - Binary treatment and no covariates (just diff-in-means)
  - Binary treatment, discrete covariates, and saturated models (stratified diff-in-means)
• Imposing a model on LDVs in these cases imposes a distributional assumption, which could be wrong
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models
  - Could go wrong in small samples
  - If using nonlinear models, always get effects on the scale of the outcome

58/58
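The first "OLS is a-ok" case can be verified directly: with a binary treatment and no covariates, the OLS slope on D equals the difference in means exactly, even for a binary outcome (a Python check; the simulated data is mine):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
d = rng.binomial(1, 0.4, n)
y = rng.binomial(1, 0.2 + 0.3 * d).astype(float)   # binary (limited) outcome

A = np.column_stack([np.ones(n), d])               # OLS of Y on [1, D]
intercept, slope = np.linalg.lstsq(A, y, rcond=None)[0]
diff_means = y[d == 1].mean() - y[d == 0].mean()   # stratified comparison
```

The slope and the difference in means agree to numerical precision, with no distributional assumption about the binary outcome needed.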
Investigating the regression coefficient
bull How can we investigate 120591119877 Well we can rely on theregression anatomy
120591119877 = Cov(119884119894 119863119894 minus 119864[119863119894|119883119894])120141(119863119894 minus 119864[119863119894|119883119894])
bull 119863119894 minus 120124[119863119894|119883119894] is the residual from a regression of 119863119894 on the fullset of dummies
bull With a little work we can show
120591119877 =120124 1114094120591(119883119894)(119863119894 minus 120124[119863119894|119883119894])11135691114097
120124[(119863119894 minus 119864[119863119894|119883119894])1113569]=
120124[120591(119883119894)1205901113569119889(119883119894)]120124[1205901113569119889(119883119894)]
bull 1205901113569119889(119909) = 120141[119863119894|119883119894 = 119909] is the conditional variance of treatmentassignment
39 58
ATE versus OLS
120591119877 = 120124[120591(119883119894)119882119894] = 1114012119909120591(119909)
1205901113569119889(119909)120124[1205901113569119889(119883119894)]
โ[119883119894 = 119909]
bull Compare to the ATE
120591 = 120124[120591(119883119894)] = 1114012119909120591(119909)โ[119883119894 = 119909]
bull Both weight strata relative to their size (โ[119883119894 = 119909])bull OLS weights strata higher if the treatment variance in those
strata (1205901113569119889(119909)) is higher in those strata relative to the averagevariance across strata (120124[1205901113569119889(119883119894)])
bull The ATE weights only by their size
40 58
Regression weighting
119882119894 =1205901113569119889(119883119894)
120124[1205901113569119889(119883119894)]
bull Why does OLS weight like thisbull OLS is a minimum-variance estimator more weight to
more precise within-strata estimatesbull Within-strata estimates are most precise when the treatment
is evenly spread and thus has the highest variancebull If 119863119894 is binary then we know the conditional variance will be
1205901113569119889(119909) = โ[119863119894 = 1|119883119894 = 119909] (1 minus โ[119863119894 = 1|119883119894 = 119909])= 119890(119909) (1 minus 119890(119909))
bull Maximum variance with โ[119863119894 = 1|119883119894 = 119909] = 12
41 58
OLS weighting examplebull Binary covariate
โ[119883119894 = 1] = 075 โ[119883119894 = 0] = 025โ[119863119894 = 1|119883119894 = 1] = 09 โ[119863119894 = 1|119883119894 = 0] = 05
1205901113569119889(1) = 009 1205901113569119889(0) = 025120591(1) = 1 120591(0) = minus1
bull Implies the ATE is 120591 = 05bull Average conditional variance 120124[1205901113569119889(119883119894)] = 013bull weights for 119883119894 = 1 are 009013 = 0692 for 119883119894 = 0
025013 = 192120591119877 = 120124[120591(119883119894)119882119894]
= 120591(1)119882(1)โ[119883119894 = 1] + 120591(0)119882(0)โ[119883119894 = 0]= 1 times 0692 times 075 + minus1 times 192 times 025= 0039
42 58
When will OLS estimate the ATE
bull When does 120591 = 120591119877bull Constant treatment effects 120591(119909) = 120591 = 120591119877bull Constant probability of treatment 119890(119909) = โ[119863119894 = 1|119883119894 = 119909] = 119890
Implies that the OLS weights are 1
bull Incorrect linearity assumption in 119883119894 will lead to more bias
43 58
Other ways to use regression
bull Whatrsquos the path forward Accept the bias (might be relatively small with saturated
models) Use a different regression approach
bull Let 120583119889(119909) = 120124[119884119894(119889)|119883119894 = 119909] be the CEF for the potentialoutcome under 119863119894 = 119889
bull By consistency and nuc we have 120583119889(119909) = 120124[119884119894|119863119894 = 119889119883119894 = 119909]bull Estimate a regression of 119884119894 on 119883119894 among the 119863119894 = 119889 groupbull Then 119889(119909) is just a predicted value from the regression for
119883119894 = 119909bull How can we use this
44 58
Imputation estimators
bull Impute the treated potential outcomes with 1114023119884119894(1) = 1113568(119883119894)bull Impute the control potential outcomes with 1114023119884119894(0) = 1113567(119883119894)bull Procedure
Regress 119884119894 on 119883119894 in the treated group and get predicted valuesfor all units (treated or control)
Regress 119884119894 on 119883119894 in the control group and get predicted valuesfor all units (treated or control)
Take the average difference between these predicted values
bull More mathematically look like this
120591119894119898119901 =1119873
11140121198941113568(119883119894) minus 1113567(119883119894)
bull Sometimes called an imputation estimator
45 58
Simple imputation estimator
bull Use predict() from the within-group models on the data fromthe entire sample
bull Useful trick use a model on the entire data andmodelframe() to get the right design matrix
heterogeneous effects
yhet lt- ifelse(d == 1 y + rnorm(n 0 5) y)
mod lt- lm(yhet ~ d + X)
mod1 lt- lm(yhet ~ X subset = d == 1)
mod0 lt- lm(yhet ~ X subset = d == 0)
y1imps lt- predict(mod1 modelframe(mod))
y0imps lt- predict(mod0 modelframe(mod))
mean(y1imps - y0imps)
[1] 061
46 58
Notes on imputation estimators
bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE
bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS
bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods
such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)
47 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
48 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
49 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
50 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
51 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
52 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
53 58
Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible
estimatelibrary(mgcv)
mod0 lt- gam(y ~ s(x) subset = d == 0)
summary(mod0)
Family gaussian
Link function identity
Formula
y ~ s(x)
Parametric coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) -00225 00154 -146 016
Approximate significance of smooth terms
edf Refdf F p-value
s(x) 603 708 413 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
R-sq(adj) = 0909 Deviance explained = 928
GCV = 00093204 Scale est = 00071351 n = 30
54 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
55 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
56 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
57 58
Limited dependent variables
bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc
bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models
(stratified diff-in-means)
bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong
bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models
Could go wrong in small samples If using nonlinear models always get effects on the scale of the
outcome
58 58
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
-
ATE versus OLS
120591119877 = 120124[120591(119883119894)119882119894] = 1114012119909120591(119909)
1205901113569119889(119909)120124[1205901113569119889(119883119894)]
โ[119883119894 = 119909]
bull Compare to the ATE
120591 = 120124[120591(119883119894)] = 1114012119909120591(119909)โ[119883119894 = 119909]
bull Both weight strata relative to their size (โ[119883119894 = 119909])bull OLS weights strata higher if the treatment variance in those
strata (1205901113569119889(119909)) is higher in those strata relative to the averagevariance across strata (120124[1205901113569119889(119883119894)])
bull The ATE weights only by their size
40 58
Regression weighting
119882119894 =1205901113569119889(119883119894)
120124[1205901113569119889(119883119894)]
bull Why does OLS weight like thisbull OLS is a minimum-variance estimator more weight to
more precise within-strata estimatesbull Within-strata estimates are most precise when the treatment
is evenly spread and thus has the highest variancebull If 119863119894 is binary then we know the conditional variance will be
1205901113569119889(119909) = โ[119863119894 = 1|119883119894 = 119909] (1 minus โ[119863119894 = 1|119883119894 = 119909])= 119890(119909) (1 minus 119890(119909))
bull Maximum variance with โ[119863119894 = 1|119883119894 = 119909] = 12
41 58
OLS weighting examplebull Binary covariate
โ[119883119894 = 1] = 075 โ[119883119894 = 0] = 025โ[119863119894 = 1|119883119894 = 1] = 09 โ[119863119894 = 1|119883119894 = 0] = 05
1205901113569119889(1) = 009 1205901113569119889(0) = 025120591(1) = 1 120591(0) = minus1
bull Implies the ATE is 120591 = 05bull Average conditional variance 120124[1205901113569119889(119883119894)] = 013bull weights for 119883119894 = 1 are 009013 = 0692 for 119883119894 = 0
025013 = 192120591119877 = 120124[120591(119883119894)119882119894]
= 120591(1)119882(1)โ[119883119894 = 1] + 120591(0)119882(0)โ[119883119894 = 0]= 1 times 0692 times 075 + minus1 times 192 times 025= 0039
42 58
When will OLS estimate the ATE
bull When does 120591 = 120591119877bull Constant treatment effects 120591(119909) = 120591 = 120591119877bull Constant probability of treatment 119890(119909) = โ[119863119894 = 1|119883119894 = 119909] = 119890
Implies that the OLS weights are 1
bull Incorrect linearity assumption in 119883119894 will lead to more bias
43 58
Other ways to use regression
bull Whatrsquos the path forward Accept the bias (might be relatively small with saturated
models) Use a different regression approach
bull Let 120583119889(119909) = 120124[119884119894(119889)|119883119894 = 119909] be the CEF for the potentialoutcome under 119863119894 = 119889
bull By consistency and nuc we have 120583119889(119909) = 120124[119884119894|119863119894 = 119889119883119894 = 119909]bull Estimate a regression of 119884119894 on 119883119894 among the 119863119894 = 119889 groupbull Then 119889(119909) is just a predicted value from the regression for
119883119894 = 119909bull How can we use this
44 58
Imputation estimators
bull Impute the treated potential outcomes with 1114023119884119894(1) = 1113568(119883119894)bull Impute the control potential outcomes with 1114023119884119894(0) = 1113567(119883119894)bull Procedure
Regress 119884119894 on 119883119894 in the treated group and get predicted valuesfor all units (treated or control)
Regress 119884119894 on 119883119894 in the control group and get predicted valuesfor all units (treated or control)
Take the average difference between these predicted values
bull More mathematically look like this
120591119894119898119901 =1119873
11140121198941113568(119883119894) minus 1113567(119883119894)
bull Sometimes called an imputation estimator
45 58
Simple imputation estimator
bull Use predict() from the within-group models on the data fromthe entire sample
bull Useful trick use a model on the entire data andmodelframe() to get the right design matrix
heterogeneous effects
yhet lt- ifelse(d == 1 y + rnorm(n 0 5) y)
mod lt- lm(yhet ~ d + X)
mod1 lt- lm(yhet ~ X subset = d == 1)
mod0 lt- lm(yhet ~ X subset = d == 0)
y1imps lt- predict(mod1 modelframe(mod))
y0imps lt- predict(mod0 modelframe(mod))
mean(y1imps - y0imps)
[1] 061
46 58
Notes on imputation estimators
bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE
bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS
bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods
such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)
47 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
48 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
49 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
50 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
51 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
52 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
53 58
Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible
estimatelibrary(mgcv)
mod0 lt- gam(y ~ s(x) subset = d == 0)
summary(mod0)
Family gaussian
Link function identity
Formula
y ~ s(x)
Parametric coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) -00225 00154 -146 016
Approximate significance of smooth terms
edf Refdf F p-value
s(x) 603 708 413 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
R-sq(adj) = 0909 Deviance explained = 928
GCV = 00093204 Scale est = 00071351 n = 30
54 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
55 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
56 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
57 58
Limited dependent variables
bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc
bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models
(stratified diff-in-means)
bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong
bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models
Could go wrong in small samples If using nonlinear models always get effects on the scale of the
outcome
58 58
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
-
Regression weighting
119882119894 =1205901113569119889(119883119894)
120124[1205901113569119889(119883119894)]
bull Why does OLS weight like thisbull OLS is a minimum-variance estimator more weight to
more precise within-strata estimatesbull Within-strata estimates are most precise when the treatment
is evenly spread and thus has the highest variancebull If 119863119894 is binary then we know the conditional variance will be
1205901113569119889(119909) = โ[119863119894 = 1|119883119894 = 119909] (1 minus โ[119863119894 = 1|119883119894 = 119909])= 119890(119909) (1 minus 119890(119909))
bull Maximum variance with โ[119863119894 = 1|119883119894 = 119909] = 12
41 58
OLS weighting examplebull Binary covariate
โ[119883119894 = 1] = 075 โ[119883119894 = 0] = 025โ[119863119894 = 1|119883119894 = 1] = 09 โ[119863119894 = 1|119883119894 = 0] = 05
1205901113569119889(1) = 009 1205901113569119889(0) = 025120591(1) = 1 120591(0) = minus1
bull Implies the ATE is 120591 = 05bull Average conditional variance 120124[1205901113569119889(119883119894)] = 013bull weights for 119883119894 = 1 are 009013 = 0692 for 119883119894 = 0
025013 = 192120591119877 = 120124[120591(119883119894)119882119894]
= 120591(1)119882(1)โ[119883119894 = 1] + 120591(0)119882(0)โ[119883119894 = 0]= 1 times 0692 times 075 + minus1 times 192 times 025= 0039
42 58
When will OLS estimate the ATE
bull When does 120591 = 120591119877bull Constant treatment effects 120591(119909) = 120591 = 120591119877bull Constant probability of treatment 119890(119909) = โ[119863119894 = 1|119883119894 = 119909] = 119890
Implies that the OLS weights are 1
bull Incorrect linearity assumption in 119883119894 will lead to more bias
43 58
Other ways to use regression
bull Whatrsquos the path forward Accept the bias (might be relatively small with saturated
models) Use a different regression approach
bull Let 120583119889(119909) = 120124[119884119894(119889)|119883119894 = 119909] be the CEF for the potentialoutcome under 119863119894 = 119889
bull By consistency and nuc we have 120583119889(119909) = 120124[119884119894|119863119894 = 119889119883119894 = 119909]bull Estimate a regression of 119884119894 on 119883119894 among the 119863119894 = 119889 groupbull Then 119889(119909) is just a predicted value from the regression for
119883119894 = 119909bull How can we use this
44 58
Imputation estimators
bull Impute the treated potential outcomes with 1114023119884119894(1) = 1113568(119883119894)bull Impute the control potential outcomes with 1114023119884119894(0) = 1113567(119883119894)bull Procedure
Regress 119884119894 on 119883119894 in the treated group and get predicted valuesfor all units (treated or control)
Regress 119884119894 on 119883119894 in the control group and get predicted valuesfor all units (treated or control)
Take the average difference between these predicted values
bull More mathematically look like this
120591119894119898119901 =1119873
11140121198941113568(119883119894) minus 1113567(119883119894)
bull Sometimes called an imputation estimator
45 58
Simple imputation estimator
bull Use predict() from the within-group models on the data fromthe entire sample
bull Useful trick use a model on the entire data andmodelframe() to get the right design matrix
heterogeneous effects
yhet lt- ifelse(d == 1 y + rnorm(n 0 5) y)
mod lt- lm(yhet ~ d + X)
mod1 lt- lm(yhet ~ X subset = d == 1)
mod0 lt- lm(yhet ~ X subset = d == 0)
y1imps lt- predict(mod1 modelframe(mod))
y0imps lt- predict(mod0 modelframe(mod))
mean(y1imps - y0imps)
[1] 061
46 58
Notes on imputation estimators
bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE
bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS
bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods
such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)
47 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
48 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
49 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
50 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
51 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
52 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
53 58
Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible
estimatelibrary(mgcv)
mod0 lt- gam(y ~ s(x) subset = d == 0)
summary(mod0)
Family gaussian
Link function identity
Formula
y ~ s(x)
Parametric coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) -00225 00154 -146 016
Approximate significance of smooth terms
edf Refdf F p-value
s(x) 603 708 413 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
R-sq(adj) = 0909 Deviance explained = 928
GCV = 00093204 Scale est = 00071351 n = 30
54 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
55 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
56 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
57 58
Limited dependent variables
bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc
bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models
(stratified diff-in-means)
bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong
bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models
Could go wrong in small samples If using nonlinear models always get effects on the scale of the
outcome
58 58
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
-
OLS weighting example

• Binary covariate:
  $\Pr[X_i = 1] = 0.75$, $\Pr[X_i = 0] = 0.25$
  $\Pr[D_i = 1 \mid X_i = 1] = 0.9$, $\Pr[D_i = 1 \mid X_i = 0] = 0.5$
  $\sigma^2_d(1) = 0.09$, $\sigma^2_d(0) = 0.25$
  $\tau(1) = 1$, $\tau(0) = -1$
• Implies the ATE is $\tau = 0.5$
• Average conditional variance: $\mathbb{E}[\sigma^2_d(X_i)] = 0.13$
• Weights: for $X_i = 1$, $0.09/0.13 = 0.692$; for $X_i = 0$, $0.25/0.13 = 1.92$

$$\begin{aligned}
\tau_R &= \mathbb{E}[\tau(X_i) W(X_i)] \\
&= \tau(1) W(1) \Pr[X_i = 1] + \tau(0) W(0) \Pr[X_i = 0] \\
&= 1 \times 0.692 \times 0.75 + (-1) \times 1.92 \times 0.25 \\
&= 0.039
\end{aligned}$$

42 / 58
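The arithmetic above is easy to verify numerically; a small Python check (not from the slides, which use R) plugging in the slide's values:

```python
# Numerical check of the OLS weighting example above (values from the slide).
p_x = {1: 0.75, 0: 0.25}             # P[X_i = x]
e = {1: 0.9, 0: 0.5}                 # P[D_i = 1 | X_i = x]
tau = {1: 1.0, 0: -1.0}              # conditional ATEs tau(x)

# Conditional variance of a binary treatment: sigma^2_d(x) = e(x) * (1 - e(x))
var_d = {x: e[x] * (1 - e[x]) for x in (0, 1)}
avg_var = sum(p_x[x] * var_d[x] for x in (0, 1))   # E[sigma^2_d(X_i)] = 0.13

# OLS weights and the regression estimand tau_R
w = {x: var_d[x] / avg_var for x in (0, 1)}
ate = sum(p_x[x] * tau[x] for x in (0, 1))         # 0.5
tau_r = sum(p_x[x] * w[x] * tau[x] for x in (0, 1))
print(round(ate, 3), round(tau_r, 3))              # 0.5 0.038 (the slide's 0.039 uses rounded weights)
```

The gap between 0.5 and 0.039 shows how badly the variance weighting can distort the estimand when effects are heterogeneous.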
When will OLS estimate the ATE?

• When does $\tau = \tau_R$?
• Constant treatment effects: $\tau(x) = \tau = \tau_R$
• Constant probability of treatment: $e(x) = \Pr[D_i = 1 \mid X_i = x] = e$
  - Implies that the OLS weights are 1
• An incorrect linearity assumption in $X_i$ will lead to more bias

43 / 58
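The constant-propensity case can be seen directly: if $e(x) = e$, then $\sigma^2_d(x) = e(1-e)$ is the same in every stratum, so each weight is 1. A quick Python illustration (with a hypothetical constant propensity of 0.7, not a value from the slides):

```python
# If e(x) = e is constant, sigma^2_d(x) = e(1 - e) is the same in every
# stratum, so each OLS weight sigma^2_d(x) / E[sigma^2_d(X_i)] equals 1.
e = 0.7                                    # hypothetical constant propensity
p_x = {1: 0.75, 0: 0.25}                   # same covariate distribution as the example
var_d = {x: e * (1 - e) for x in (0, 1)}
avg_var = sum(p_x[x] * var_d[x] for x in (0, 1))
w = {x: var_d[x] / avg_var for x in (0, 1)}
print(w)                                   # both weights are 1 (up to floating point)
```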
Other ways to use regression

• What's the path forward?
  - Accept the bias (it might be relatively small with saturated models)
  - Use a different regression approach
• Let $\mu_d(x) = \mathbb{E}[Y_i(d) \mid X_i = x]$ be the CEF for the potential outcome under $D_i = d$
• By consistency and no unmeasured confounders, we have $\mu_d(x) = \mathbb{E}[Y_i \mid D_i = d, X_i = x]$
• Estimate a regression of $Y_i$ on $X_i$ among the $D_i = d$ group
• Then $\hat{\mu}_d(x)$ is just a predicted value from the regression for $X_i = x$
• How can we use this?

44 / 58
Imputation estimators

• Impute the treated potential outcomes with $\hat{Y}_i(1) = \hat{\mu}_1(X_i)$
• Impute the control potential outcomes with $\hat{Y}_i(0) = \hat{\mu}_0(X_i)$
• Procedure:
  - Regress $Y_i$ on $X_i$ in the treated group and get predicted values for all units (treated or control)
  - Regress $Y_i$ on $X_i$ in the control group and get predicted values for all units (treated or control)
  - Take the average difference between these predicted values
• More mathematically, it looks like this:

$$\hat{\tau}_{\text{imp}} = \frac{1}{N} \sum_i \left[\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)\right]$$

• Sometimes called an imputation estimator

45 / 58
Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample
• Useful trick: fit a model on the entire data and use model.frame() to get the right design matrix

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61

46 / 58
Notes on imputation estimators

• If the $\hat{\mu}_d(x)$ are consistent estimators, then $\hat{\tau}_{\text{imp}}$ is consistent for the ATE
• Why don't people use this?
  - Most people don't know the results we've been talking about
  - Harder to implement than vanilla OLS
• Can use linear regression to estimate $\hat{\mu}_d(x) = x'\hat{\beta}_d$
• Recent trend is to estimate $\hat{\mu}_d(x)$ via non-parametric methods, such as:
  - Kernel regression, local linear regression, regression trees, etc.
  - Easiest is generalized additive models (GAMs)

47 / 58
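The consistency claim can be seen in a toy simulation. This Python sketch (simulated data and parameter values are my own, not from the slides) builds in heterogeneous effects $\tau(x) = 1 + x$, fits the two within-group regressions, and shows the imputation estimator landing near the true ATE of 1:

```python
import random

# Toy simulation: mu_0(x) = x, mu_1(x) = 1 + 2x, so tau(x) = 1 + x and the
# ATE is E[1 + X_i] = 1 when X_i is standard normal.
random.seed(42)
n = 2000
x = [random.gauss(0, 1) for _ in range(n)]
d = [1 if random.random() < 0.5 else 0 for _ in range(n)]
y = [(1 + 2 * xi if di else xi) + random.gauss(0, 0.1) for xi, di in zip(x, d)]

def ols_1d(xs, ys):
    """Intercept and slope of the least-squares line of ys on xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
             / sum((a - mx) ** 2 for a in xs))
    return my - slope * mx, slope

a1, b1 = ols_1d([xi for xi, di in zip(x, d) if di], [yi for yi, di in zip(y, d) if di])
a0, b0 = ols_1d([xi for xi, di in zip(x, d) if not di], [yi for yi, di in zip(y, d) if not di])

# Impute both potential outcomes for every unit and average the differences
tau_imp = sum((a1 + b1 * xi) - (a0 + b0 * xi) for xi in x) / n
print(round(tau_imp, 2))  # close to the true ATE, E[1 + X_i] = 1
```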
Imputation estimator visualization

[Figure, slides 48-50: scatterplot of y against x for treated and control units, with the within-group regression lines extended over the full sample to impute the missing potential outcomes]

48-50 / 58
Nonlinear relationships

• Same idea, but with a nonlinear relationship between $Y_i$ and $X_i$

[Figure, slides 51-53: treated and control points with nonlinear CEFs over x from -2 to 2]

51-53 / 58
Using semiparametric regression

• Here the CEFs are nonlinear, but we don't know their form
• We can use GAMs from the mgcv package for flexible estimation:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204  Scale est. = 0.0071351  n = 30

54 / 58
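mgcv's s(x) is a penalized spline smooth; as a rough illustration of the same idea in Python, here is a crude nonparametric stand-in that estimates each group's CEF by binned means and then imputes (a sketch of the flexible-CEF strategy, not the GAM machinery, with simulated data of my own):

```python
import math
import random

# Crude nonparametric imputation: binned means as a stand-in for a smooth.
# True CEFs: mu_0(x) = sin(x), mu_1(x) = sin(x) + 0.5, so the ATE is 0.5.
random.seed(1)
n = 4000
x = [random.uniform(-2, 2) for _ in range(n)]
d = [1 if random.random() < 0.5 else 0 for _ in range(n)]
y = [math.sin(xi) + 0.5 * di + random.gauss(0, 0.1) for xi, di in zip(x, d)]

def binned_cef(xs, ys, lo=-2.0, hi=2.0, k=20):
    """Within-bin means of ys over k equal-width bins of xs; returns a predictor."""
    sums, counts = [0.0] * k, [0] * k
    for xi, yi in zip(xs, ys):
        b = min(k - 1, int((xi - lo) / (hi - lo) * k))
        sums[b] += yi
        counts[b] += 1
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return lambda xi: means[min(k - 1, int((xi - lo) / (hi - lo) * k))]

mu1 = binned_cef([xi for xi, di in zip(x, d) if di], [yi for yi, di in zip(y, d) if di])
mu0 = binned_cef([xi for xi, di in zip(x, d) if not di], [yi for yi, di in zip(y, d) if not di])
tau_imp = sum(mu1(xi) - mu0(xi) for xi in x) / n
print(round(tau_imp, 2))  # close to the true ATE of 0.5
```

A GAM does this far more efficiently by borrowing strength across neighboring x values, but the imputation logic is identical.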
Using GAMs

[Figure, slides 55-57: GAM fits to the treated and control groups, tracking the nonlinear CEFs over x from -2 to 2]

55-57 / 58
Limited dependent variables

• Usual advice: model the data from first principles
  - Logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is a-ok with limited DVs when:
  - Binary treatment and no covariates (just diff-in-means)
  - Binary treatment, discrete covariates, and saturated models (stratified diff-in-means)
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models
  - Could go wrong in small samples
  - If using nonlinear models, always get effects on the scale of the outcome

58 / 58
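The first "a-ok" case is an exact algebraic identity: with a binary treatment and no covariates, the OLS slope is the difference in means. A Python check (simulated data of my own, not from the slides):

```python
import random

# With a binary treatment and no covariates, the OLS slope from regressing
# y on d is exactly the difference in means between the two groups.
random.seed(7)
d = [random.randint(0, 1) for _ in range(500)]
y = [2.0 * di + random.gauss(0, 1) for di in d]

# OLS slope with one regressor: cov(d, y) / var(d)
md, my = sum(d) / len(d), sum(y) / len(y)
slope = (sum((di - md) * (yi - my) for di, yi in zip(d, y))
         / sum((di - md) ** 2 for di in d))

n1 = sum(d)
diff_means = (sum(yi for yi, di in zip(y, d) if di) / n1
              - sum(yi for yi, di in zip(y, d) if not di) / (len(d) - n1))
print(abs(slope - diff_means) < 1e-10)  # True: the two coincide
```

This is why no distributional assumption is being imposed in that case: OLS is just computing the stratum means.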
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
-
When will OLS estimate the ATE
bull When does 120591 = 120591119877bull Constant treatment effects 120591(119909) = 120591 = 120591119877bull Constant probability of treatment 119890(119909) = โ[119863119894 = 1|119883119894 = 119909] = 119890
Implies that the OLS weights are 1
bull Incorrect linearity assumption in 119883119894 will lead to more bias
43 58
Other ways to use regression
bull Whatrsquos the path forward Accept the bias (might be relatively small with saturated
models) Use a different regression approach
bull Let 120583119889(119909) = 120124[119884119894(119889)|119883119894 = 119909] be the CEF for the potentialoutcome under 119863119894 = 119889
bull By consistency and nuc we have 120583119889(119909) = 120124[119884119894|119863119894 = 119889119883119894 = 119909]bull Estimate a regression of 119884119894 on 119883119894 among the 119863119894 = 119889 groupbull Then 119889(119909) is just a predicted value from the regression for
119883119894 = 119909bull How can we use this
44 58
Imputation estimators
bull Impute the treated potential outcomes with 1114023119884119894(1) = 1113568(119883119894)bull Impute the control potential outcomes with 1114023119884119894(0) = 1113567(119883119894)bull Procedure
Regress 119884119894 on 119883119894 in the treated group and get predicted valuesfor all units (treated or control)
Regress 119884119894 on 119883119894 in the control group and get predicted valuesfor all units (treated or control)
Take the average difference between these predicted values
bull More mathematically look like this
120591119894119898119901 =1119873
11140121198941113568(119883119894) minus 1113567(119883119894)
bull Sometimes called an imputation estimator
45 58
Simple imputation estimator
bull Use predict() from the within-group models on the data fromthe entire sample
bull Useful trick use a model on the entire data andmodelframe() to get the right design matrix
heterogeneous effects
yhet lt- ifelse(d == 1 y + rnorm(n 0 5) y)
mod lt- lm(yhet ~ d + X)
mod1 lt- lm(yhet ~ X subset = d == 1)
mod0 lt- lm(yhet ~ X subset = d == 0)
y1imps lt- predict(mod1 modelframe(mod))
y0imps lt- predict(mod0 modelframe(mod))
mean(y1imps - y0imps)
[1] 061
46 58
Notes on imputation estimators
bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE
bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS
bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods
such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)
47 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
48 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
49 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
50 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
51 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
52 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
53 58
Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible
estimatelibrary(mgcv)
mod0 lt- gam(y ~ s(x) subset = d == 0)
summary(mod0)
Family gaussian
Link function identity
Formula
y ~ s(x)
Parametric coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) -00225 00154 -146 016
Approximate significance of smooth terms
edf Refdf F p-value
s(x) 603 708 413 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
R-sq(adj) = 0909 Deviance explained = 928
GCV = 00093204 Scale est = 00071351 n = 30
54 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
55 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
56 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
57 58
Limited dependent variables
bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc
bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models
(stratified diff-in-means)
bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong
bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models
Could go wrong in small samples If using nonlinear models always get effects on the scale of the
outcome
58 58
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
-
Other ways to use regression
bull Whatrsquos the path forward Accept the bias (might be relatively small with saturated
models) Use a different regression approach
bull Let 120583119889(119909) = 120124[119884119894(119889)|119883119894 = 119909] be the CEF for the potentialoutcome under 119863119894 = 119889
bull By consistency and nuc we have 120583119889(119909) = 120124[119884119894|119863119894 = 119889119883119894 = 119909]bull Estimate a regression of 119884119894 on 119883119894 among the 119863119894 = 119889 groupbull Then 119889(119909) is just a predicted value from the regression for
119883119894 = 119909bull How can we use this
44 58
Imputation estimators
bull Impute the treated potential outcomes with 1114023119884119894(1) = 1113568(119883119894)bull Impute the control potential outcomes with 1114023119884119894(0) = 1113567(119883119894)bull Procedure
Regress 119884119894 on 119883119894 in the treated group and get predicted valuesfor all units (treated or control)
Regress 119884119894 on 119883119894 in the control group and get predicted valuesfor all units (treated or control)
Take the average difference between these predicted values
bull More mathematically look like this
120591119894119898119901 =1119873
11140121198941113568(119883119894) minus 1113567(119883119894)
bull Sometimes called an imputation estimator
45 58
Simple imputation estimator
bull Use predict() from the within-group models on the data fromthe entire sample
bull Useful trick use a model on the entire data andmodelframe() to get the right design matrix
heterogeneous effects
yhet lt- ifelse(d == 1 y + rnorm(n 0 5) y)
mod lt- lm(yhet ~ d + X)
mod1 lt- lm(yhet ~ X subset = d == 1)
mod0 lt- lm(yhet ~ X subset = d == 0)
y1imps lt- predict(mod1 modelframe(mod))
y0imps lt- predict(mod0 modelframe(mod))
mean(y1imps - y0imps)
[1] 061
46 58
Notes on imputation estimators
bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE
bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS
bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods
such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)
47 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
48 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
49 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
50 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
51 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
52 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
53 58
Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible
estimatelibrary(mgcv)
mod0 lt- gam(y ~ s(x) subset = d == 0)
summary(mod0)
Family gaussian
Link function identity
Formula
y ~ s(x)
Parametric coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) -00225 00154 -146 016
Approximate significance of smooth terms
edf Refdf F p-value
s(x) 603 708 413 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
R-sq(adj) = 0909 Deviance explained = 928
GCV = 00093204 Scale est = 00071351 n = 30
54 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
55 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
56 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
57 58
Limited dependent variables
bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc
bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models
(stratified diff-in-means)
bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong
bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models
Could go wrong in small samples If using nonlinear models always get effects on the scale of the
outcome
58 58
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
-
Imputation estimators
bull Impute the treated potential outcomes with 1114023119884119894(1) = 1113568(119883119894)bull Impute the control potential outcomes with 1114023119884119894(0) = 1113567(119883119894)bull Procedure
Regress 119884119894 on 119883119894 in the treated group and get predicted valuesfor all units (treated or control)
Regress 119884119894 on 119883119894 in the control group and get predicted valuesfor all units (treated or control)
Take the average difference between these predicted values
bull More mathematically look like this
120591119894119898119901 =1119873
11140121198941113568(119883119894) minus 1113567(119883119894)
bull Sometimes called an imputation estimator
45 58
Simple imputation estimator
bull Use predict() from the within-group models on the data fromthe entire sample
bull Useful trick use a model on the entire data andmodelframe() to get the right design matrix
heterogeneous effects
yhet lt- ifelse(d == 1 y + rnorm(n 0 5) y)
mod lt- lm(yhet ~ d + X)
mod1 lt- lm(yhet ~ X subset = d == 1)
mod0 lt- lm(yhet ~ X subset = d == 0)
y1imps lt- predict(mod1 modelframe(mod))
y0imps lt- predict(mod0 modelframe(mod))
mean(y1imps - y0imps)
[1] 061
46 58
Notes on imputation estimators
bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE
bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS
bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods
such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)
47 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
48 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
49 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
50 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
51 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
52 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
53 58
Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible
estimatelibrary(mgcv)
mod0 lt- gam(y ~ s(x) subset = d == 0)
summary(mod0)
Family gaussian
Link function identity
Formula
y ~ s(x)
Parametric coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) -00225 00154 -146 016
Approximate significance of smooth terms
edf Refdf F p-value
s(x) 603 708 413 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
R-sq(adj) = 0909 Deviance explained = 928
GCV = 00093204 Scale est = 00071351 n = 30
54 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
55 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
56 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
57 58
Limited dependent variables
bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc
bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models
(stratified diff-in-means)
bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong
bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models
Could go wrong in small samples If using nonlinear models always get effects on the scale of the
outcome
58 58
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
-
Simple imputation estimator
bull Use predict() from the within-group models on the data fromthe entire sample
bull Useful trick use a model on the entire data andmodelframe() to get the right design matrix
heterogeneous effects
yhet lt- ifelse(d == 1 y + rnorm(n 0 5) y)
mod lt- lm(yhet ~ d + X)
mod1 lt- lm(yhet ~ X subset = d == 1)
mod0 lt- lm(yhet ~ X subset = d == 0)
y1imps lt- predict(mod1 modelframe(mod))
y0imps lt- predict(mod0 modelframe(mod))
mean(y1imps - y0imps)
[1] 061
46 58
Notes on imputation estimators
bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE
bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS
bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods
such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)
47 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
48 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
49 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
50 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
51 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
52 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
53 58
Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible
estimatelibrary(mgcv)
mod0 lt- gam(y ~ s(x) subset = d == 0)
summary(mod0)
Family gaussian
Link function identity
Formula
y ~ s(x)
Parametric coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) -00225 00154 -146 016
Approximate significance of smooth terms
edf Refdf F p-value
s(x) 603 708 413 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
R-sq(adj) = 0909 Deviance explained = 928
GCV = 00093204 Scale est = 00071351 n = 30
54 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
55 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
56 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
57 58
Limited dependent variables
bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc
bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models
(stratified diff-in-means)
bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong
bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models
Could go wrong in small samples If using nonlinear models always get effects on the scale of the
outcome
58 58
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
-
Notes on imputation estimators
bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE
bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS
bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods
such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)
47 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
48 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
49 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
50 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
51 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
52 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
53 58
Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible
estimatelibrary(mgcv)
mod0 lt- gam(y ~ s(x) subset = d == 0)
summary(mod0)
Family gaussian
Link function identity
Formula
y ~ s(x)
Parametric coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) -00225 00154 -146 016
Approximate significance of smooth terms
edf Refdf F p-value
s(x) 603 708 413 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
R-sq(adj) = 0909 Deviance explained = 928
GCV = 00093204 Scale est = 00071351 n = 30
54 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
55 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
56 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
57 58
Limited dependent variables
bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc
bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models
(stratified diff-in-means)
bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong
bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models
Could go wrong in small samples If using nonlinear models always get effects on the scale of the
outcome
58 58
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
-
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
48 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
49 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
50 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
51 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
52 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
53 58
Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible
estimatelibrary(mgcv)
mod0 lt- gam(y ~ s(x) subset = d == 0)
summary(mod0)
Family gaussian
Link function identity
Formula
y ~ s(x)
Parametric coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) -00225 00154 -146 016
Approximate significance of smooth terms
edf Refdf F p-value
s(x) 603 708 413 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
R-sq(adj) = 0909 Deviance explained = 928
GCV = 00093204 Scale est = 00071351 n = 30
54 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
55 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
56 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
57 58
Limited dependent variables
bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc
bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models
(stratified diff-in-means)
bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong
bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models
Could go wrong in small samples If using nonlinear models always get effects on the scale of the
outcome
58 58
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
-
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
49 58
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
50 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
51 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
52 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
53 58
Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible
estimatelibrary(mgcv)
mod0 lt- gam(y ~ s(x) subset = d == 0)
summary(mod0)
Family gaussian
Link function identity
Formula
y ~ s(x)
Parametric coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) -00225 00154 -146 016
Approximate significance of smooth terms
edf Refdf F p-value
s(x) 603 708 413 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
R-sq(adj) = 0909 Deviance explained = 928
GCV = 00093204 Scale est = 00071351 n = 30
54 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
55 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
56 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
57 58
Limited dependent variables
bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc
bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models
(stratified diff-in-means)
bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong
bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models
Could go wrong in small samples If using nonlinear models always get effects on the scale of the
outcome
58 58
- Agnostic Regression
- Regression and Causality
- Regression with Heterogeneous Treatment Effects
-
Imputation estimator visualization
-2 -1 0 1 2 3
-05
00
05
10
15
x
y
TreatedControl
50 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
51 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
52 58
Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
53 58
Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible
estimatelibrary(mgcv)
mod0 lt- gam(y ~ s(x) subset = d == 0)
summary(mod0)
Family gaussian
Link function identity
Formula
y ~ s(x)
Parametric coefficients
Estimate Std Error t value Pr(gt|t|)
(Intercept) -00225 00154 -146 016
Approximate significance of smooth terms
edf Refdf F p-value
s(x) 603 708 413 lt2e-16
---
Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1
R-sq(adj) = 0909 Deviance explained = 928
GCV = 00093204 Scale est = 00071351 n = 30
54 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
55 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
56 58
Using GAMs
-2 -1 0 1 2
-10
-05
00
05
x
y
TreatedControl
57 58
Limited dependent variables

• Usual advice: model the data from first principles. Logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is a-ok with limited DVs when:
  - Binary treatment and no covariates (just diff-in-means)
  - Binary treatment, discrete covariates, and saturated models (stratified diff-in-means)
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong.
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models.
  - Could go wrong in small samples.
  - If using nonlinear models, you always get effects on the scale of the outcome.

58/58
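The saturated-model point can be checked numerically. A sketch (illustrative, not the course code): with a binary treatment, a discrete covariate, and a fully interacted OLS model, the regression-based estimate reproduces the covariate-stratified difference in means exactly, so no distributional assumption is being imposed.

```python
import numpy as np

rng = np.random.default_rng(2)

# Binary outcome y, binary treatment d, one binary covariate z
n = 2000
z = rng.binomial(1, 0.4, size=n)
d = rng.binomial(1, 0.5, size=n)
y = rng.binomial(1, 0.2 + 0.3 * d + 0.2 * z - 0.1 * d * z, size=n)

# Saturated OLS: intercept, d, z, and the d*z interaction
X = np.column_stack([np.ones(n), d, z, d * z])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

# Average effect implied by the saturated fit: predict everyone
# under treatment and under control, then average the difference
X1 = np.column_stack([np.ones(n), np.ones(n), z, z])
X0 = np.column_stack([np.ones(n), np.zeros(n), z, np.zeros(n)])
tau_ols = np.mean(X1 @ beta - X0 @ beta)

# Stratified difference in means, weighted by stratum share
tau_strat = 0.0
for zv in (0, 1):
    m = z == zv
    diff = y[m & (d == 1)].mean() - y[m & (d == 0)].mean()
    tau_strat += m.mean() * diff

print(np.isclose(tau_ols, tau_strat))  # True: the two estimators agree
```

Because the model is saturated, the OLS fitted values are just the cell means, so the two calculations are algebraically identical; a logit fit to the same binary y would add a functional-form assumption without buying anything here.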