
Gov 2002: 7. Regression and Causality

Matthew Blackwell

October 15, 2015

1 / 58

Agnostic Regression

Regression and Causality

Regression with Heterogeneous Treatment Effects

Where are we? Where are we going?

• Last few weeks: using matching/weighting for estimating causal effects
• This week: how to use regression to estimate causal effects
• Regression is so widely used, it's good to know what it's actually estimating
• Goal: salvage regression from the ashes of 1980's textbooks
• Next week: panel data

Reminder: Email me and Stephen a half-page description of your proposed research project.

1. Agnostic Regression

Regression as parametric modeling

• Gauss-Markov assumptions: linearity, iid sample, full rank X_i, zero conditional mean error, homoskedasticity
• OLS is BLUE; add normality of the errors and we get small-sample SEs
• What is the basic approach here? It is a model for the conditional distribution of Y_i given X_i:

  [Y_i | X_i] ~ N(X_i'β, σ²)

• MLE from this model is the usual OLS estimator:

  β̂_OLS = [ Σ_{i=1}^N X_i X_i' ]^{-1} Σ_{i=1}^N X_i Y_i

Agnostic views on regression

  [Y_i | X_i] ~ N(X_i'β, σ²)

• Strong distributional assumption on Y_i
• Properties like BLUE or MLE properties depend on these assumptions holding
• Alternative: take an agnostic view on regression
  - Use OLS without believing these assumptions
• Lose the distributional assumptions; focus on the conditional expectation function (CEF):

  μ(x) = E[Y_i | X_i = x] = Σ_y y · P[Y_i = y | X_i = x]

Justifying linear regression

• Define linear regression:

  β = argmin_b E[(Y_i − X_i'b)²]

• The solution to this is the following:

  β = E[X_i X_i']^{-1} E[X_i Y_i]

• Note that this β is the population coefficient vector, not the estimator yet

Regression anatomy

• Consider simple linear regression:

  (α, β) = argmin_{a,b} E[(Y_i − a − b X_i)²]

• In this case, we can write the population/true slope β as:

  β = E[X_i X_i']^{-1} E[X_i Y_i] = Cov(Y_i, X_i) / V[X_i]

• With more covariates, β is more complicated, but we can still write it like this
• Let X̃_ki be the residual from a regression of X_ki on all the other independent variables. Then β_k, the coefficient for X_ki, is:

  β_k = Cov(Y_i, X̃_ki) / V(X̃_ki)
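The slides use R; as a quick numerical sketch of the regression-anatomy result (variable names here are illustrative, not from the slides), the multiple-regression coefficient on a covariate equals the covariance of the outcome with that covariate's residual, divided by the residual's variance:

```python
# Regression anatomy check: beta_k = Cov(Y, X_k_resid) / V(X_k_resid),
# where X_k_resid comes from regressing X_k on the other covariates.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)          # correlated covariate
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

# Full regression of y on (1, x1, x2)
X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

# Residualize x1 with respect to (1, x2)
W = np.column_stack([np.ones(n), x2])
x1_resid = x1 - W @ np.linalg.lstsq(W, x1, rcond=None)[0]

anatomy = np.cov(y, x1_resid)[0, 1] / np.var(x1_resid, ddof=1)
print(beta[1], anatomy)  # the two should agree
```

Because the residualizing regression includes an intercept, the residual has mean zero, and the ratio reproduces the multiple-regression coefficient exactly (up to floating point).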

Justification 1: Linear CEFs

• Justification 1: if the CEF is linear, the population regression function is exactly that CEF. That is, if E[Y_i | X_i] = X_i'b, then b = β
• When would we expect the CEF to be linear? Two cases:
  1. Outcome and covariates are multivariate normal
  2. Linear regression model is saturated
• A model is saturated if there are as many parameters as there are possible combinations of the X_i variables

Saturated model example

• Two binary variables: X_1i for incumbency status and X_2i for party of the candidate
• Four possible values of X_i, four possible values of μ(X_i):

  E[Y_i | X_1i = 0, X_2i = 0] = α
  E[Y_i | X_1i = 1, X_2i = 0] = α + β
  E[Y_i | X_1i = 0, X_2i = 1] = α + γ
  E[Y_i | X_1i = 1, X_2i = 1] = α + β + γ + δ

• We can write the CEF as follows:

  E[Y_i | X_1i, X_2i] = α + β X_1i + γ X_2i + δ (X_1i X_2i)

Saturated models example

  E[Y_i | X_1i, X_2i] = α + β X_1i + γ X_2i + δ (X_1i X_2i)

• Basically, each value of μ(X_i) is being estimated separately: within-strata estimation. No borrowing of information from across values of X_i
• Requires a set of dummies for each categorical variable plus all interactions
• Or a series of dummies for each unique combination of X_i
• This makes linearity hold mechanically, and so linearity is not an assumption. Just a fact about saturated CEFs: saturated models for limited dependent variables = A-OK

Saturated model example

• Washington (AER) data on the effects of daughters
• We'll look at the relationship between voting and number of kids (causal?)

girls <- foreign::read.dta("girls.dta")
head(girls[, c("name", "totchi", "aauw")])

                name totchi aauw
1  ABERCROMBIE, NEIL      0  100
2   ACKERMAN, GARY L      3   88
3 ADERHOLT, ROBERT B      0    0
4    ALLEN, THOMAS H      2  100
5  ANDREWS, ROBERT E      2  100
6         ARCHER, WR      7    0

Linear model

summary(lm(aauw ~ totchi, data = girls))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    61.31       1.81   33.81   <2e-16
totchi         -5.33       0.62   -8.59   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 42 on 1733 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.0408, Adjusted R-squared: 0.0403
F-statistic: 73.8 on 1 and 1733 DF, p-value: <2e-16

Saturated model

summary(lm(aauw ~ as.factor(totchi), data = girls))

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)            56.41       2.76   20.42  < 2e-16
as.factor(totchi)1      5.45       4.11    1.33   0.1851
as.factor(totchi)2     -3.80       3.27   -1.16   0.2454
as.factor(totchi)3    -13.65       3.45   -3.95  8.1e-05
as.factor(totchi)4    -19.31       4.01   -4.82  1.6e-06
as.factor(totchi)5    -15.46       4.85   -3.19   0.0015
as.factor(totchi)6    -33.59      10.42   -3.22   0.0013
as.factor(totchi)7    -17.13      11.41   -1.50   0.1336
as.factor(totchi)8    -55.33      12.28   -4.51  7.0e-06
as.factor(totchi)9    -50.41      24.08   -2.09   0.0364
as.factor(totchi)10   -53.41      20.90   -2.56   0.0107
as.factor(totchi)12   -56.41      41.53   -1.36   0.1745
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41 on 1723 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.0506, Adjusted R-squared: 0.0446
F-statistic: 8.36 on 11 and 1723 DF, p-value: 1.84e-14

Saturated model minus the constant

summary(lm(aauw ~ as.factor(totchi) - 1, data = girls))

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
as.factor(totchi)0     56.41       2.76   20.42   <2e-16
as.factor(totchi)1     61.86       3.05   20.31   <2e-16
as.factor(totchi)2     52.62       1.75   30.13   <2e-16
as.factor(totchi)3     42.76       2.07   20.62   <2e-16
as.factor(totchi)4     37.11       2.90   12.79   <2e-16
as.factor(totchi)5     40.95       3.99   10.27   <2e-16
as.factor(totchi)6     22.82      10.05    2.27   0.0233
as.factor(totchi)7     39.29      11.07    3.55   0.0004
as.factor(totchi)8      1.08      11.96    0.09   0.9278
as.factor(totchi)9      6.00      23.92    0.25   0.8020
as.factor(totchi)10     3.00      20.72    0.14   0.8849
as.factor(totchi)12     0.00      41.43    0.00   1.0000
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41 on 1723 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.587, Adjusted R-squared: 0.584
F-statistic: 204 on 12 and 1723 DF, p-value: <2e-16

Compare to within-strata means

• The saturated model makes no assumptions about the between-strata relationships
• Just calculates within-strata means:

c1 <- coef(lm(aauw ~ as.factor(totchi) - 1, data = girls))
c2 <- with(girls, tapply(aauw, totchi, mean, na.rm = TRUE))
rbind(c1, c2)

    0  1  2  3  4  5  6  7  8 9 10 12
c1 56 62 53 43 37 41 23 39 11 6  3  0
c2 56 62 53 43 37 41 23 39 11 6  3  0

Other justifications for OLS

• Justification 2: X_i'β is the best linear predictor (in a mean-squared error sense) of Y_i. Why? β = argmin_b E[(Y_i − X_i'b)²]
• Justification 3: X_i'β provides the minimum mean squared error linear approximation to E[Y_i | X_i]
• Even if the CEF is not linear, a linear regression provides the best linear approximation to that CEF
• Don't need to believe the assumptions (linearity) in order to use regression as a good approximation to the CEF
• Warning: if the CEF is very nonlinear, then this approximation could be terrible

The error terms

• Let's define the error term e_i ≡ Y_i − X_i'β so that:

  Y_i = X_i'β + [Y_i − X_i'β] = X_i'β + e_i

• Note that the residual e_i is uncorrelated with X_i:

  E[X_i e_i] = E[X_i (Y_i − X_i'β)]
             = E[X_i Y_i] − E[X_i X_i' β]
             = E[X_i Y_i] − E[ X_i X_i' E[X_i X_i']^{-1} E[X_i Y_i] ]
             = E[X_i Y_i] − E[X_i X_i'] E[X_i X_i']^{-1} E[X_i Y_i]
             = E[X_i Y_i] − E[X_i Y_i] = 0

• No assumptions on the linearity of E[Y_i | X_i]
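The same orthogonality holds exactly in-sample for OLS residuals. A minimal numerical check (in Python rather than the slides' R; the simulated data are illustrative):

```python
# OLS residuals are orthogonal to every column of X by construction,
# with no distributional assumption on the errors.
import numpy as np

rng = np.random.default_rng(3)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=n) ** 3  # non-normal errors

beta = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta
print(X.T @ e)  # approximately the zero vector
```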

OLS estimator

• We know the population value of β is:

  β = E[X_i X_i']^{-1} E[X_i Y_i]

• How do we get an estimator of this?
• Plug-in principle: replace population expectations with sample versions:

  β̂ = [ (1/N) Σ_i X_i X_i' ]^{-1} (1/N) Σ_i X_i Y_i

• If you work through the matrix algebra, this turns out to be:

  β̂ = (X'X)^{-1} X'y
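A quick sketch of this equivalence (in Python; the data-generating values are illustrative): the sample-moment version of the plug-in formula and the familiar matrix-algebra version give the same coefficients.

```python
# Plug-in principle: [ (1/N) Σ X_i X_i' ]^{-1} (1/N) Σ X_i Y_i
# is algebraically identical to (X'X)^{-1} X'y.
import numpy as np

rng = np.random.default_rng(1)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(size=N)

# Sample-moment version
Sxx = (X[:, :, None] * X[:, None, :]).mean(axis=0)  # (1/N) Σ X_i X_i'
Sxy = (X * y[:, None]).mean(axis=0)                 # (1/N) Σ X_i Y_i
beta_mom = np.linalg.solve(Sxx, Sxy)

# Matrix-algebra version
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_mom, beta_ols)  # identical up to floating point
```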

Asymptotic OLS inference

• With this representation in hand, we can write the OLS estimator as follows:

  β̂ = β + [ Σ_i X_i X_i' ]^{-1} Σ_i X_i e_i

• Core idea: Σ_i X_i e_i is a sum of r.v.s, so the CLT applies
• That plus some simple asymptotic theory allows us to say:

  √N (β̂ − β) →d N(0, Ω)

• Converges in distribution to a Normal distribution with mean vector 0 and covariance matrix Ω:

  Ω = E[X_i X_i']^{-1} E[X_i X_i' e_i²] E[X_i X_i']^{-1}

• No linearity assumption needed

Estimating the variance

• In large samples, then:

  √N (β̂ − β) ∼ N(0, Ω)

• How to estimate Ω? Plug-in principle again:

  Ω̂ = [ Σ_i X_i X_i' ]^{-1} [ Σ_i X_i X_i' ê_i² ] [ Σ_i X_i X_i' ]^{-1}

• Replace e_i with its empirical counterpart, the residuals ê_i = Y_i − X_i'β̂
• Replace the population moments of X_i with their sample counterparts
• The square roots of the diagonal entries of this covariance matrix are the "robust" or Huber-White standard errors that Stata commonly reports
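The sandwich formula above can be computed directly. A sketch in Python (the slides use R; the heteroskedastic data-generating process here is illustrative):

```python
# Huber-White ("robust") variance: plug residuals into the sandwich
# (Σ X_i X_i')^{-1} (Σ X_i X_i' ê_i²) (Σ X_i X_i')^{-1}.
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
# heteroskedastic errors: variance grows with |x|
y = 1.0 + 2.0 * x + rng.normal(size=n) * (1 + np.abs(x))

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e_hat = y - X @ beta_hat

bread = np.linalg.inv(X.T @ X)
meat = (X * e_hat[:, None] ** 2).T @ X   # Σ X_i X_i' ê_i²
vcov_robust = bread @ meat @ bread       # sandwich estimator
se_robust = np.sqrt(np.diag(vcov_robust))
print(se_robust)
```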

Heteroskedasticity

• No assumptions of homoskedasticity
• Heteroskedasticity will definitely occur when:
  - the CEF is linear, but σ²(x) = V[Y_i | X_i = x] is not constant in x
  - E[Y_i | X_i] is not linear, but we use the linear regression to approximate it

2. Regression and Causality

Regression and causality

• Most econometrics textbooks: regression defined without respect to causality
• But then when is β̂ "biased"? The above derivations work for some E[Y_i | X_i]
• The question then is: when does knowing the CEF tell us something about causality?
• MHE argues that a regression is causal when the CEF it approximates is causal. Identification is king.
• We will show that under certain conditions, a regression of the outcome on the treatment and the covariates can recover a causal parameter, but perhaps not the one in which we are interested

Review

• Quick reminder: we have potential outcomes Y_i(1) and Y_i(0) and two parameters, the ATE and ATT:

  τ = E[Y_i(1) − Y_i(0)]
  τ_ATT = E[Y_i(1) − Y_i(0) | D_i = 1]

• We have shown in past weeks that these effects are identified when ignorability holds. MHE calls this the conditional independence assumption (CIA)

Linear constant effects model, binary treatment

• Experiment: with a simple experiment, we can rewrite the consistency assumption as a regression formula:

  Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
      = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
      = E[Y_i(0)] + τ D_i + (Y_i(0) − E[Y_i(0)])
      = μ_0 + τ D_i + v_0i

• Note that if ignorability holds (as in an experiment) for Y_i(0), then it will also hold for v_0i, since E[Y_i(0)] is constant. Thus this satisfies the usual assumptions for regression.

Now with covariates

• Now assume no unmeasured confounders: Y_i(d) ⊥⊥ D_i | X_i
• We will assume a linear model for the potential outcomes:

  Y_i(d) = α + τ·d + η_i

• Remember that linearity isn't an assumption if D_i is binary
• Effect of D_i is constant here; the η_i are the only source of individual variation, and we have E[η_i] = 0
• Consistency assumption allows us to write this as:

  Y_i = α + τ D_i + η_i

Covariates in the error

• Let's assume that η_i is linear in X_i: η_i = X_i'γ + ν_i
• New error is uncorrelated with X_i: E[ν_i | X_i] = 0
  - This is an assumption! It might be false!
• Plug into the above:

  E[Y_i(d) | X_i] = E[Y_i | D_i, X_i] = α + τ D_i + E[η_i | X_i]
                  = α + τ D_i + X_i'γ + E[ν_i | X_i]
                  = α + τ D_i + X_i'γ

Summing up regression with constant effects

• Reviewing the assumptions we've used:
  - no unmeasured confounders
  - constant treatment effects
  - linearity of the treatment/covariates
• Under these, we can run the following regression to estimate the ATE, τ:

  Y_i = α + τ D_i + X_i'γ + ν_i

• Works with continuous or ordinal D_i if the effect of these variables is truly linear

OLS constant effects simulation

Model with linear covariates and a constant 0 effect of treatment:

library(mvtnorm)
n <- 100
p <- 4
X <- rmvnorm(n = 100, mean = rep(0, p))
gamma <- c(2.74, 1.37, 1.37, 1.37)
y <- 21 + X %*% c(gamma) + rnorm(n)
alpha <- c(-1, 0.5, -0.5, -0.1)
d.probs <- boot::inv.logit(X %*% alpha)
d <- rbinom(n, size = 1, prob = d.probs)

OLS with no covariates

summary(lm(y ~ d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   22.036      0.458   48.13  < 2e-16
d             -2.869      0.654   -4.39 0.000029
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.3 on 98 degrees of freedom
Multiple R-squared: 0.164, Adjusted R-squared: 0.156
F-statistic: 19.2 on 1 and 98 DF, p-value: 0.000029

OLS with covariates

summary(lm(y ~ d + X))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  20.9952      0.159  132.07   <2e-16
d            -0.128       0.253   -0.50     0.62
X1            2.7368      0.126   21.75   <2e-16
X2            1.3677      0.114   12.01   <2e-16
X3            1.3673      0.130   10.51   <2e-16
X4            1.3570      0.106   12.80   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom
Multiple R-squared: 0.999, Adjusted R-squared: 0.999
F-statistic: 2.44e+04 on 5 and 94 DF, p-value: <2e-16

What happens with nonlinearity?

Suppose we can only observe the following covariates:

z1 <- exp(X[, 1]/2)
z2 <- X[, 2]/(1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3]/25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

Implies that Y_i and D_i are functions of nonlinear transformations of the observed covariates, such as log(Z_{i1}), Z_{i2}, Z_{i1}² Z_{i2}, Z_{i3}^{1/3}/log(Z_{i1}), and Z_{i4}^{1/2}

So the true regression is a nonlinear function of the observed covariates

When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.8728    30.0799    0.73    0.469
d            -6.9292     3.3854   -2.05    0.043
z1           36.3110     2.6970   13.46   <2e-16
z2           -2.9033     3.6619   -0.79    0.430
z3           86.2022    43.1030    2.00    0.048
z4            0.4021     0.0329   12.23   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.4 on 94 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF, p-value: <2e-16

3. Regression with Heterogeneous Treatment Effects

Heterogeneous effects, binary treatment

• Completely randomized experiment:

  Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
      = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
      = μ_0 + τ_i D_i + (Y_i(0) − μ_0)
      = μ_0 + τ D_i + (Y_i(0) − μ_0) + (τ_i − τ)·D_i
      = μ_0 + τ D_i + ε_i

• Error term now includes two components:
  1. "Baseline" variation in the outcome: (Y_i(0) − μ_0)
  2. Variation in the treatment effect: (τ_i − τ)
• Easy to verify that under the experiment, E[ε_i | D_i] = 0
• Thus, OLS estimates the ATE with no covariates

Adding covariates

• What happens with no unmeasured confounders? Need to condition on X_i now
• Remember identification of the ATE/ATT using iterated expectations
• ATE is the weighted sum of CATEs:

  τ = Σ_x τ(x) P[X_i = x]

• ATE/ATT are weighted averages of CATEs
• What about the regression estimand, τ_R? How does it relate to the ATE/ATT?

Heterogeneous effects and regression

• Let's investigate this under a saturated regression model:

  Y_i = Σ_x B_xi α_x + τ_R D_i + e_i

• Use a dummy variable for each unique combination of X_i: B_xi = 1(X_i = x)
• Linear in X_i by construction

Investigating the regression coefficient

• How can we investigate τ_R? Well, we can rely on the regression anatomy:

  τ_R = Cov(Y_i, D_i − E[D_i|X_i]) / V(D_i − E[D_i|X_i])

• D_i − E[D_i|X_i] is the residual from a regression of D_i on the full set of dummies
• With a little work, we can show:

  τ_R = E[τ(X_i) (D_i − E[D_i|X_i])²] / E[(D_i − E[D_i|X_i])²] = E[τ(X_i) σ²_d(X_i)] / E[σ²_d(X_i)]

• σ²_d(x) = V[D_i | X_i = x] is the conditional variance of treatment assignment

ATE versus OLS

  τ_R = E[τ(X_i) W_i] = Σ_x τ(x) · (σ²_d(x) / E[σ²_d(X_i)]) · P[X_i = x]

• Compare to the ATE:

  τ = E[τ(X_i)] = Σ_x τ(x) P[X_i = x]

• Both weight strata relative to their size (P[X_i = x])
• OLS additionally weights strata higher if the treatment variance in those strata (σ²_d(x)) is high relative to the average variance across strata (E[σ²_d(X_i)])
• The ATE weights only by their size

Regression weighting

  W_i = σ²_d(X_i) / E[σ²_d(X_i)]

• Why does OLS weight like this?
• OLS is a minimum-variance estimator: more weight to more precise within-strata estimates
• Within-strata estimates are most precise when the treatment is evenly spread and thus has the highest variance
• If D_i is binary, then we know the conditional variance will be:

  σ²_d(x) = P[D_i = 1 | X_i = x] (1 − P[D_i = 1 | X_i = x])
          = e(x) (1 − e(x))

• Maximum variance with P[D_i = 1 | X_i = x] = 1/2

OLS weighting example

• Binary covariate:

  P[X_i = 1] = 0.75              P[X_i = 0] = 0.25
  P[D_i = 1 | X_i = 1] = 0.9     P[D_i = 1 | X_i = 0] = 0.5
  σ²_d(1) = 0.09                 σ²_d(0) = 0.25
  τ(1) = 1                       τ(0) = −1

• Implies the ATE is τ = 0.5
• Average conditional variance: E[σ²_d(X_i)] = 0.13
• Weights: for X_i = 1, 0.09/0.13 = 0.692; for X_i = 0, 0.25/0.13 = 1.92

  τ_R = E[τ(X_i) W_i]
      = τ(1) W(1) P[X_i = 1] + τ(0) W(0) P[X_i = 0]
      = 1 × 0.692 × 0.75 + (−1) × 1.92 × 0.25
      = 0.039
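This arithmetic is easy to verify directly. A small check in Python using the slide's values (the unrounded τ_R is about 0.0385, which rounds to the slide's 0.039):

```python
# Numerical check of the OLS weighting example.
p_x = {1: 0.75, 0: 0.25}        # P[X_i = x]
e_x = {1: 0.9, 0: 0.5}          # P[D_i = 1 | X_i = x]
tau_x = {1: 1.0, 0: -1.0}       # conditional ATEs

var_d = {x: e_x[x] * (1 - e_x[x]) for x in p_x}   # σ²_d(x)
avg_var = sum(var_d[x] * p_x[x] for x in p_x)     # E[σ²_d(X_i)] = 0.13

ate = sum(tau_x[x] * p_x[x] for x in p_x)                              # 0.5
tau_R = sum(tau_x[x] * (var_d[x] / avg_var) * p_x[x] for x in p_x)     # ~0.0385
print(ate, tau_R)
```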

When will OLS estimate the ATE?

• When does τ = τ_R?
• Constant treatment effects: τ(x) = τ = τ_R
• Constant probability of treatment: e(x) = P[D_i = 1 | X_i = x] = e
  - Implies that the OLS weights are 1
• Incorrect linearity assumption in X_i will lead to more bias

Other ways to use regression

• What's the path forward?
  - Accept the bias (might be relatively small with saturated models)
  - Use a different regression approach
• Let μ_d(x) = E[Y_i(d) | X_i = x] be the CEF for the potential outcome under D_i = d
• By consistency and no unmeasured confounders, we have μ_d(x) = E[Y_i | D_i = d, X_i = x]
• Estimate a regression of Y_i on X_i among the D_i = d group
• Then μ̂_d(x) is just a predicted value from that regression at X_i = x
• How can we use this?

Imputation estimators

• Impute the treated potential outcomes with Ŷ_i(1) = μ̂_1(X_i)
• Impute the control potential outcomes with Ŷ_i(0) = μ̂_0(X_i)
• Procedure:
  - Regress Y_i on X_i in the treated group and get predicted values for all units (treated or control)
  - Regress Y_i on X_i in the control group and get predicted values for all units (treated or control)
  - Take the average difference between these predicted values
• More mathematically, it looks like this:

  τ̂_imp = (1/N) Σ_i [μ̂_1(X_i) − μ̂_0(X_i)]

• Sometimes called an imputation estimator
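The procedure above can be sketched in a few lines (Python rather than the slides' R; the data-generating process and names are illustrative):

```python
# Imputation estimator: fit separate linear regressions in the treated
# and control groups, predict both potential outcomes for every unit,
# and average the difference.
import numpy as np

def fit_ols(X, y):
    """Return OLS coefficients for design matrix X (with intercept)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(4)
n = 5000
x = rng.normal(size=n)
d = rng.binomial(1, 0.5, size=n)
y = 1.0 + 2.0 * d + 1.5 * x + rng.normal(size=n)  # true constant effect of 2

X = np.column_stack([np.ones(n), x])
beta1 = fit_ols(X[d == 1], y[d == 1])   # estimates μ₁(x)
beta0 = fit_ols(X[d == 0], y[d == 0])   # estimates μ₀(x)

tau_imp = np.mean(X @ beta1 - X @ beta0)
print(tau_imp)  # should be close to the true ATE of 2
```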

Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample
• Useful trick: use a model on the entire data and model.frame() to get the right design matrix

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

## [1] 0.61

Notes on imputation estimators

• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE
• Why don't people use this?
  - Most people don't know the results we've been talking about
  - Harder to implement than vanilla OLS
• Can use linear regression to estimate μ̂_d(x) = x'β̂_d
• Recent trend is to estimate μ̂_d(x) via non-parametric methods, such as:
  - Kernel regression, local linear regression, regression trees, etc.
  - Easiest is generalized additive models (GAMs)

Imputation estimator visualization

[Figure, built up across three slides: scatterplot of y against x (x roughly −2 to 3, y roughly −0.5 to 1.5) with treated and control points and the within-group regression fits; legend: Treated / Control]

Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i

[Figure, built up across three slides: y against x (x roughly −2 to 2, y roughly −1.0 to 0.5) with treated and control points and nonlinear fits; legend: Treated / Control]

Using semiparametric regression

• Here, the CEFs are nonlinear, but we don't know their form
• We can use GAMs from the mgcv package for flexible estimation:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204   Scale est. = 0.0071351   n = 30

Using GAMs

[Figure, built up across three slides: y against x (x roughly −2 to 2, y roughly −1.0 to 0.5) with GAM fits for the treated and control groups; legend: Treated / Control]

Limited dependent variables

• Usual advice: model the data from first principles: logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is A-OK with limited DVs when:
  - Binary treatment and no covariates (just diff-in-means)
  - Binary treatment, discrete covariates, and saturated models (stratified diff-in-means)
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models
  - Could go wrong in small samples
  - If using nonlinear models, always get effects on the scale of the outcome

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 2: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

Agnostic Regression

Regression and Causality

Regression with Heterogeneous Treatment Effects

2 58

Where are we Where are we going

bull Last few weeks using matching weighting for estimatingcausal effects

bull This week how to use regression to estimate causal effectsbull Regression is so widely used itrsquos good to know what itrsquos

actually estimatingbull Goal salvage regression from the ashes of 1980rsquos textbooksbull Next week panel data

Reminder Email me and Stephen a half-page description of yourproposed research project

3 58

1 AgnosticRegression

4 58

Regression as parametric modeling

bull Gauss-Markov assumptions linearity iid sample full rank 119883119894 zero conditional mean error

homoskedasticity

bull OLS is BLUE plus normality of the errors and we get smallsample SEs

bull What is the basic approach here It is a model for theconditional distribution of 119884119894 given 119883119894

[119884119894|119883119894] sim 119873(119883prime119894 120573 1205901113569)

bull MLE from this model is the usual OLS estimator OLS

OLS =

โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1198731114012119894=1113568

119883119894119883prime119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

minus11135681198731114012119894=1113568

119883119894119884119894

5 58

Agnostic views on regression

[119884119894|119883119894] sim 119873(119883prime119894 120573 1205901113569)

bull Strong distributional assumption on 119884119894bull Properties like BLUE or MLE properties depend on these

assumptions holdingbull Alternative take an agnostic view on regression

Use OLS without believing these assumptions

bull Lose the distributional assumptions focus on the conditionalexpectation function (CEF)

120583(119909) = 120124[119884119894|119883119894 = 119909] = 1114012119910119910 sdot โ„™[119884119894 = 119910|119883119894 = 119909]

6 58

Justifying linear regression

bull Define linear regression

120573 = argmin119887

120124[(119884119894 minus 119883prime119894 119887)1113569]

bull The solution to this is the following

120573 = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]

bull Note that the is the population coefficient vector not theestimator yet

7 58

Regression anatomy

bull Consider simple linear regression

(120572 120573) = argmin119886119887

120124 1114094(119884119894 minus 119886 minus 119887119883119894)11135691114097

bull In this case we can write the populationtrue slope 120573 as

120573 = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894] =

Cov(119884119894 119883119894)120141[119883119894]

bull With more covariates 120573 is more complicated but we can stillwrite it like this

bull Let 119896119894 be the residual from a regression of 119883119896119894 on all the otherindependent variables Then 120573119896 the coefficient for 119883119896119894 is

120573119896 =Cov(119884119894 119896119894)

119881(119896119894)

8 58

Justification 1: Linear CEFs

• Justification 1: if the CEF is linear, the population regression function is it. That is, if E[Yᵢ|Xᵢ] = Xᵢ′b, then b = β.
• When would we expect the CEF to be linear? Two cases:

  1. Outcome and covariates are multivariate normal.
  2. Linear regression model is saturated.

• A model is saturated if there are as many parameters as there are possible combinations of the Xᵢ variables.

9 / 58

Saturated model example

• Two binary variables: X₁ᵢ for incumbency status and X₂ᵢ for party of the candidate.
• Four possible values of Xᵢ, four possible values of μ(Xᵢ):

  E[Yᵢ | X₁ᵢ = 0, X₂ᵢ = 0] = α
  E[Yᵢ | X₁ᵢ = 1, X₂ᵢ = 0] = α + β
  E[Yᵢ | X₁ᵢ = 0, X₂ᵢ = 1] = α + γ
  E[Yᵢ | X₁ᵢ = 1, X₂ᵢ = 1] = α + β + γ + δ

• We can write the CEF as follows:

  E[Yᵢ | X₁ᵢ, X₂ᵢ] = α + βX₁ᵢ + γX₂ᵢ + δ(X₁ᵢX₂ᵢ)

10 / 58

Saturated models example

  E[Yᵢ | X₁ᵢ, X₂ᵢ] = α + βX₁ᵢ + γX₂ᵢ + δ(X₁ᵢX₂ᵢ)

• Basically, each value of μ(Xᵢ) is being estimated separately: within-strata estimation. No borrowing of information from across values of Xᵢ.
• Requires a set of dummies for each categorical variable plus all interactions.
• Or a series of dummies for each unique combination of Xᵢ.
• This makes linearity hold mechanically, and so linearity is not an assumption; it is just a fact about saturated CEFs. Saturated models for limited dependent variables = A-OK.

11 / 58

Saturated model example

• Washington (AER) data on the effects of daughters.
• We'll look at the relationship between voting and number of kids (causal?).

girls <- foreign::read.dta("girls.dta")
head(girls[, c("name", "totchi", "aauw")])

                name totchi aauw
1  ABERCROMBIE NEIL       0  100
2   ACKERMAN GARY L       3   88
3 ADERHOLT ROBERT B       0    0
4    ALLEN THOMAS H       2  100
5  ANDREWS ROBERT E       2  100
6         ARCHER WR       7    0

12 / 58

Linear model

summary(lm(aauw ~ totchi, data = girls))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    61.31       1.81   33.81   <2e-16 ***
totchi         -5.33       0.62   -8.59   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 42 on 1733 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.0408,  Adjusted R-squared: 0.0403
F-statistic: 73.8 on 1 and 1733 DF,  p-value: <2e-16

13 / 58

Saturated model

summary(lm(aauw ~ as.factor(totchi), data = girls))

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)            56.41       2.76   20.42  < 2e-16 ***
as.factor(totchi)1      5.45       4.11    1.33   0.1851
as.factor(totchi)2     -3.80       3.27   -1.16   0.2454
as.factor(totchi)3    -13.65       3.45   -3.95  8.1e-05 ***
as.factor(totchi)4    -19.31       4.01   -4.82  1.6e-06 ***
as.factor(totchi)5    -15.46       4.85   -3.19   0.0015 **
as.factor(totchi)6    -33.59      10.42   -3.22   0.0013 **
as.factor(totchi)7    -17.13      11.41   -1.50   0.1336
as.factor(totchi)8    -55.33      12.28   -4.51  7.0e-06 ***
as.factor(totchi)9    -50.41      24.08   -2.09   0.0364 *
as.factor(totchi)10   -53.41      20.90   -2.56   0.0107 *
as.factor(totchi)12   -56.41      41.53   -1.36   0.1745
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41 on 1723 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.0506,  Adjusted R-squared: 0.0446
F-statistic: 8.36 on 11 and 1723 DF,  p-value: 1.84e-14

14 / 58

Saturated model minus the constant

summary(lm(aauw ~ as.factor(totchi) - 1, data = girls))

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
as.factor(totchi)0     56.41       2.76   20.42   <2e-16 ***
as.factor(totchi)1     61.86       3.05   20.31   <2e-16 ***
as.factor(totchi)2     52.62       1.75   30.13   <2e-16 ***
as.factor(totchi)3     42.76       2.07   20.62   <2e-16 ***
as.factor(totchi)4     37.11       2.90   12.79   <2e-16 ***
as.factor(totchi)5     40.95       3.99   10.27   <2e-16 ***
as.factor(totchi)6     22.82      10.05    2.27   0.0233 *
as.factor(totchi)7     39.29      11.07    3.55   0.0004 ***
as.factor(totchi)8      1.08      11.96    0.09   0.9278
as.factor(totchi)9      6.00      23.92    0.25   0.8020
as.factor(totchi)10     3.00      20.72    0.14   0.8849
as.factor(totchi)12     0.00      41.43    0.00   1.0000
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41 on 1723 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.587,  Adjusted R-squared: 0.584
F-statistic: 204 on 12 and 1723 DF,  p-value: <2e-16

15 / 58

Compare to within-strata means

• The saturated model makes no assumptions about the between-strata relationships.
• Just calculates within-strata means:

c1 <- coef(lm(aauw ~ as.factor(totchi) - 1, data = girls))
c2 <- with(girls, tapply(aauw, totchi, mean, na.rm = TRUE))
rbind(c1, c2)

    0  1  2  3  4  5  6  7  8 9 10 12
c1 56 62 53 43 37 41 23 39 11 6  3  0
c2 56 62 53 43 37 41 23 39 11 6  3  0

16 / 58

Other justifications for OLS

• Justification 2: Xᵢ′β is the best linear predictor (in a mean-squared error sense) of Yᵢ. Why? β = argmin_b E[(Yᵢ − Xᵢ′b)²].
• Justification 3: Xᵢ′β provides the minimum mean squared error linear approximation to E[Yᵢ|Xᵢ].
• Even if the CEF is not linear, a linear regression provides the best linear approximation to that CEF.
• Don't need to believe the assumptions (linearity) in order to use regression as a good approximation to the CEF.
• Warning: if the CEF is very nonlinear, then this approximation could be terrible.

17 / 58

The error terms

• Let's define the error term eᵢ ≡ Yᵢ − Xᵢ′β, so that:

  Yᵢ = Xᵢ′β + [Yᵢ − Xᵢ′β] = Xᵢ′β + eᵢ

• Note that the residual eᵢ is uncorrelated with Xᵢ:

  E[Xᵢeᵢ] = E[Xᵢ(Yᵢ − Xᵢ′β)]
          = E[XᵢYᵢ] − E[XᵢXᵢ′β]
          = E[XᵢYᵢ] − E[XᵢXᵢ′ E[XᵢXᵢ′]⁻¹ E[XᵢYᵢ]]
          = E[XᵢYᵢ] − E[XᵢXᵢ′] E[XᵢXᵢ′]⁻¹ E[XᵢYᵢ]
          = E[XᵢYᵢ] − E[XᵢYᵢ] = 0

• No assumptions on the linearity of E[Yᵢ|Xᵢ].

18 / 58

OLS estimator

• We know the population value of β is:

  β = E[XᵢXᵢ′]⁻¹ E[XᵢYᵢ]

• How do we get an estimator of this?
• Plug-in principle: replace the population expectations with their sample versions:

  β̂ = [ (1/N) Σᵢ XᵢXᵢ′ ]⁻¹ (1/N) Σᵢ XᵢYᵢ

• If you work through the matrix algebra, this turns out to be:

  β̂ = (X′X)⁻¹ X′y

19 / 58
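A quick numeric sketch of the plug-in step (not from the slides; Python/NumPy with simulated data): the sample-moment form and the (X′X)⁻¹X′y form are identical because the two factors of 1/N cancel.

```python
# The plug-in estimator: [(1/N) sum x_i x_i']^{-1} (1/N) sum x_i y_i
# equals (X'X)^{-1} X'y since the 1/N factors cancel.
import numpy as np

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

# sample-moment (plug-in) form
Sxx = (X.T @ X) / n
Sxy = (X.T @ y) / n
beta_plugin = np.linalg.solve(Sxx, Sxy)

# usual matrix-algebra form
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(beta_plugin, beta_ols))  # True
```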

Asymptotic OLS inference

• With this representation in hand, we can write the OLS estimator as follows:

  β̂ = β + [ Σᵢ XᵢXᵢ′ ]⁻¹ Σᵢ Xᵢeᵢ

• Core idea: Σᵢ Xᵢeᵢ is a sum of random variables, so the CLT applies.
• That, plus some simple asymptotic theory, allows us to say:

  √N(β̂ − β) ⇝ N(0, Ω)

• Converges in distribution to a Normal distribution with mean vector 0 and covariance matrix Ω:

  Ω = E[XᵢXᵢ′]⁻¹ E[XᵢXᵢ′eᵢ²] E[XᵢXᵢ′]⁻¹

• No linearity assumption needed!

20 / 58

Estimating the variance

• In large samples, then:

  √N(β̂ − β) ∼ N(0, Ω)

• How to estimate Ω? Plug-in principle again!

  Ω̂ = [ Σᵢ XᵢXᵢ′ ]⁻¹ [ Σᵢ XᵢXᵢ′êᵢ² ] [ Σᵢ XᵢXᵢ′ ]⁻¹

• Replace eᵢ with its empirical counterpart, the residuals êᵢ = Yᵢ − Xᵢ′β̂.
• Replace the population moments of Xᵢ with their sample counterparts.
• The square roots of the diagonal of this covariance matrix are the "robust" or Huber-White standard errors that Stata commonly reports.

21 / 58
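The sandwich formula above is only a few lines of matrix code. A minimal sketch (not from the slides; Python/NumPy with simulated heteroskedastic data):

```python
# "Robust" (Huber-White / HC0) covariance via the plug-in sandwich:
# V = (X'X)^{-1} [ sum_i e_i^2 x_i x_i' ] (X'X)^{-1}.
import numpy as np

rng = np.random.default_rng(2)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
# heteroskedastic errors: noise scale grows with |X_1|
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n) * (1 + np.abs(X[:, 1]))

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e_hat = y - X @ beta_hat                  # residuals

bread = np.linalg.inv(X.T @ X)
meat = X.T @ (X * e_hat[:, None] ** 2)    # sum_i e_i^2 x_i x_i'
V_robust = bread @ meat @ bread

robust_se = np.sqrt(np.diag(V_robust))    # Huber-White standard errors
print(robust_se)
```

No homoskedasticity is assumed anywhere: the meat matrix lets each observation carry its own squared residual.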

Heteroskedasticity

• No assumptions of homoskedasticity!
• Heteroskedasticity will definitely occur when:
  - the CEF is linear, but σ²(x) = Var[Yᵢ|Xᵢ = x] is not constant in x;
  - E[Yᵢ|Xᵢ] is not linear, but we use linear regression to approximate it.

22 / 58

2. Regression and Causality

23 / 58

Regression and causality

• Most econometrics textbooks: regression defined without respect to causality.
• But then when is regression "biased"? The above derivations work for some E[Yᵢ|Xᵢ].
• The question then is: when does knowing the CEF tell us something about causality?
• MHE argues that a regression is causal when the CEF it approximates is causal. Identification is king.
• We will show that, under certain conditions, a regression of the outcome on the treatment and the covariates can recover a causal parameter, but perhaps not the one in which we are interested.

24 / 58

Review

• Quick reminder: we have potential outcomes Yᵢ(1) and Yᵢ(0) and two parameters, the ATE and the ATT:

  τ = E[Yᵢ(1) − Yᵢ(0)]
  τ_ATT = E[Yᵢ(1) − Yᵢ(0) | Dᵢ = 1]

• We have shown in past weeks that these effects are identified when ignorability holds. MHE calls this the conditional independence assumption (CIA).

25 / 58

Linear constant effects model, binary treatment

• Experiment: with a simple experiment, we can rewrite the consistency assumption to be a regression formula:

  Yᵢ = DᵢYᵢ(1) + (1 − Dᵢ)Yᵢ(0)
     = Yᵢ(0) + (Yᵢ(1) − Yᵢ(0))Dᵢ
     = E[Yᵢ(0)] + τDᵢ + (Yᵢ(0) − E[Yᵢ(0)])
     = μ₀ + τDᵢ + v₀ᵢ

• Note that if ignorability holds (as in an experiment) for Yᵢ(0), then it will also hold for v₀ᵢ, since E[Yᵢ(0)] is constant. Thus, this satisfies the usual assumptions for regression.

26 / 58
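One implication worth checking numerically: with a binary randomized treatment and no covariates, the OLS slope is exactly the difference in means. A small sketch (not from the slides; Python/NumPy with simulated data and an assumed constant effect of 1.5):

```python
# In a randomized experiment, the OLS slope from regressing Y on D
# equals the treated-minus-control difference in means.
import numpy as np

rng = np.random.default_rng(3)
n = 200
d = rng.integers(0, 2, size=n)        # randomized binary treatment
y0 = rng.normal(size=n)               # potential outcome under control
y = y0 + 1.5 * d                      # constant effect tau = 1.5 (assumed)

# OLS slope via regression anatomy: Cov(Y, D) / Var(D)
slope = np.cov(y, d, ddof=0)[0, 1] / np.var(d)
diff_means = y[d == 1].mean() - y[d == 0].mean()
print(np.allclose(slope, diff_means))  # True
```

The identity is algebraic (binary D makes Cov(Y, D)/Var(D) reduce to the mean difference), so it holds in every sample, not just on average.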

Now with covariates

• Now assume no unmeasured confounders: Yᵢ(d) ⫫ Dᵢ | Xᵢ.
• We will assume a linear model for the potential outcomes:

  Yᵢ(d) = α + τ·d + ηᵢ

• Remember that linearity isn't an assumption if Dᵢ is binary.
• The effect of Dᵢ is constant here; the ηᵢ are the only source of individual variation, and we have E[ηᵢ] = 0.
• The consistency assumption allows us to write this as:

  Yᵢ = α + τDᵢ + ηᵢ

27 / 58

Covariates in the error

• Let's assume that ηᵢ is linear in Xᵢ: ηᵢ = Xᵢ′γ + νᵢ.
• The new error is uncorrelated with Xᵢ: E[νᵢ|Xᵢ] = 0.
  - This is an assumption! It might be false!
• Plug into the above:

  E[Yᵢ(d)|Xᵢ] = E[Yᵢ|Dᵢ, Xᵢ] = α + τDᵢ + E[ηᵢ|Xᵢ]
              = α + τDᵢ + Xᵢ′γ + E[νᵢ|Xᵢ]
              = α + τDᵢ + Xᵢ′γ

28 / 58

Summing up regression with constant effects

• Reviewing the assumptions we've used:
  - no unmeasured confounders
  - constant treatment effects
  - linearity of the treatment/covariates
• Under these, we can run the following regression to estimate the ATE, τ:

  Yᵢ = α + τDᵢ + Xᵢ′γ + νᵢ

• Works with continuous or ordinal Dᵢ if the effect of these variables is truly linear.

29 / 58

OLS constant effects simulation

Model with linear covariates, constant 0 effect of treatment:

library(mvtnorm)
n <- 100
p <- 4
X <- rmvnorm(n = 100, mean = rep(0, p))
gamma <- c(27.4, 13.7, 13.7, 13.7)
y <- 210 + X %*% c(gamma) + rnorm(n)
alpha <- c(-1, 0.5, -0.5, -0.1)
d.probs <- boot::inv.logit(X %*% alpha)
d <- rbinom(n, size = 1, prob = d.probs)

30 / 58

OLS with no covariates

summary(lm(y ~ d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   220.36       4.58   48.13  < 2e-16 ***
d             -28.69       6.54   -4.39 0.000029 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 33 on 98 degrees of freedom
Multiple R-squared: 0.164,  Adjusted R-squared: 0.156
F-statistic: 19.2 on 1 and 98 DF,  p-value: 0.000029

31 / 58

OLS with covariates

summary(lm(y ~ d + X))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 209.952      0.159  1320.7   <2e-16 ***
d            -0.128      0.253    -0.5     0.62
X1           27.368      0.126   217.5   <2e-16 ***
X2           13.677      0.114   120.1   <2e-16 ***
X3           13.673      0.130   105.1   <2e-16 ***
X4           13.570      0.106   128.0   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom
Multiple R-squared: 0.999,  Adjusted R-squared: 0.999
F-statistic: 2.44e+04 on 5 and 94 DF,  p-value: <2e-16

32 / 58

What happens with nonlinearity?

Suppose we can only observe the following covariates:

z1 <- exp(X[, 1] / 2)
z2 <- X[, 2] / (1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3] / 25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

Implies that Yᵢ and Dᵢ are functions of log(Z₁ᵢ), Z₂ᵢ, Z₁ᵢ²Z₂ᵢ, 1/log(Z₁ᵢ), Z₃ᵢ^(1/3)/log(Z₁ᵢ), and Z₄ᵢ^(1/2).

Regression is now a nonlinear function of the observed covariates.

33 / 58

When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.8728    30.0799    0.73    0.469
d            -6.9292     3.3854   -2.05    0.043 *
z1           36.3110     2.6970   13.46   <2e-16 ***
z2           -2.9033     3.6619   -0.79    0.430
z3           86.2022    43.1030    2.00    0.048 *
z4            0.4021     0.0329   12.23   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14 on 94 degrees of freedom
Multiple R-squared: 0.855,  Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF,  p-value: <2e-16

34 / 58

3. Regression with Heterogeneous Treatment Effects

35 / 58

Heterogeneous effects, binary treatment

• Completely randomized experiment:

  Yᵢ = DᵢYᵢ(1) + (1 − Dᵢ)Yᵢ(0)
     = Yᵢ(0) + (Yᵢ(1) − Yᵢ(0))Dᵢ
     = μ₀ + τᵢDᵢ + (Yᵢ(0) − μ₀)
     = μ₀ + τDᵢ + (Yᵢ(0) − μ₀) + (τᵢ − τ)·Dᵢ
     = μ₀ + τDᵢ + εᵢ

• The error term now includes two components:

  1. "Baseline" variation in the outcome: (Yᵢ(0) − μ₀)
  2. Variation in the treatment effect: (τᵢ − τ)

• Easy to verify that, under the experiment, E[εᵢ|Dᵢ] = 0.
• Thus, OLS estimates the ATE with no covariates.

36 / 58

Adding covariates

• What happens with no unmeasured confounders? We need to condition on Xᵢ now.
• Remember identification of the ATE/ATT using iterated expectations.
• The ATE is the weighted sum of CATEs:

  τ = Σₓ τ(x) Pr[Xᵢ = x]

• The ATE/ATT are weighted averages of CATEs.
• What about the regression estimand, τ_R? How does it relate to the ATE/ATT?

37 / 58

Heterogeneous effects and regression

• Let's investigate this under a saturated regression model:

  Yᵢ = Σₓ B_xᵢ α_x + τ_R Dᵢ + eᵢ

• Use a dummy variable for each unique combination of Xᵢ: B_xᵢ = 1(Xᵢ = x).
• Linear in Xᵢ by construction.

38 / 58

Investigating the regression coefficient

• How can we investigate τ_R? Well, we can rely on the regression anatomy:

  τ_R = Cov(Yᵢ, Dᵢ − E[Dᵢ|Xᵢ]) / Var(Dᵢ − E[Dᵢ|Xᵢ])

• Dᵢ − E[Dᵢ|Xᵢ] is the residual from a regression of Dᵢ on the full set of dummies.
• With a little work, we can show:

  τ_R = E[τ(Xᵢ)(Dᵢ − E[Dᵢ|Xᵢ])²] / E[(Dᵢ − E[Dᵢ|Xᵢ])²] = E[τ(Xᵢ)σ²_d(Xᵢ)] / E[σ²_d(Xᵢ)]

• σ²_d(x) = Var[Dᵢ|Xᵢ = x] is the conditional variance of treatment assignment.

39 / 58

ATE versus OLS

  τ_R = E[τ(Xᵢ)Wᵢ] = Σₓ τ(x) · (σ²_d(x) / E[σ²_d(Xᵢ)]) · P[Xᵢ = x]

• Compare to the ATE:

  τ = E[τ(Xᵢ)] = Σₓ τ(x) P[Xᵢ = x]

• Both weight strata relative to their size (P[Xᵢ = x]).
• OLS weights strata higher if the treatment variance in those strata (σ²_d(x)) is high relative to the average variance across strata (E[σ²_d(Xᵢ)]).
• The ATE weights strata only by their size.

40 / 58

Regression weighting

  Wᵢ = σ²_d(Xᵢ) / E[σ²_d(Xᵢ)]

• Why does OLS weight like this?
• OLS is a minimum-variance estimator: it gives more weight to more precise within-strata estimates.
• Within-strata estimates are most precise when the treatment is evenly spread and thus has the highest variance.
• If Dᵢ is binary, then we know the conditional variance will be:

  σ²_d(x) = P[Dᵢ = 1|Xᵢ = x](1 − P[Dᵢ = 1|Xᵢ = x]) = e(x)(1 − e(x))

• Maximum variance with P[Dᵢ = 1|Xᵢ = x] = 1/2.

41 / 58

OLS weighting example

• Binary covariate:

  P[Xᵢ = 1] = 0.75    P[Xᵢ = 0] = 0.25
  P[Dᵢ = 1|Xᵢ = 1] = 0.9    P[Dᵢ = 1|Xᵢ = 0] = 0.5
  σ²_d(1) = 0.09    σ²_d(0) = 0.25
  τ(1) = 1    τ(0) = −1

• Implies the ATE is τ = 0.5.
• Average conditional variance: E[σ²_d(Xᵢ)] = 0.13.
• Weights: for Xᵢ = 1, 0.09/0.13 = 0.692; for Xᵢ = 0, 0.25/0.13 = 1.92.

  τ_R = E[τ(Xᵢ)Wᵢ]
      = τ(1)W(1)P[Xᵢ = 1] + τ(0)W(0)P[Xᵢ = 0]
      = 1 × 0.692 × 0.75 + (−1) × 1.92 × 0.25
      = 0.039

42 / 58
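These calculations are easy to replicate. A small sketch (not from the slides; plain Python): the exact value is 0.005/0.13 ≈ 0.0385, while the slide's 0.039 comes from rounding the weights first.

```python
# Numeric check of the weighting example: the OLS estimand tau_R
# down-weights the low-treatment-variance X = 1 stratum and lands
# far from the ATE of 0.5.
p_x = {1: 0.75, 0: 0.25}           # P[X = x]
e   = {1: 0.9,  0: 0.5}            # P[D = 1 | X = x]
tau = {1: 1.0,  0: -1.0}           # stratum effects tau(x)

var_d = {x: e[x] * (1 - e[x]) for x in p_x}      # sigma^2_d(x): 0.09, 0.25
avg_var = sum(var_d[x] * p_x[x] for x in p_x)    # E[sigma^2_d(X)] = 0.13

ate   = sum(tau[x] * p_x[x] for x in p_x)
tau_r = sum(tau[x] * (var_d[x] / avg_var) * p_x[x] for x in p_x)

print(round(ate, 3), round(tau_r, 3))   # 0.5 0.038
```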

When will OLS estimate the ATE?

• When does τ = τ_R?
• Constant treatment effects: τ(x) = τ = τ_R.
• Constant probability of treatment: e(x) = P[Dᵢ = 1|Xᵢ = x] = e.
  - Implies that the OLS weights are 1.
• An incorrect linearity assumption in Xᵢ will lead to more bias.

43 / 58

Other ways to use regression

• What's the path forward?
  - Accept the bias (might be relatively small with saturated models)
  - Use a different regression approach
• Let μ_d(x) = E[Yᵢ(d)|Xᵢ = x] be the CEF for the potential outcome under Dᵢ = d.
• By consistency and no unmeasured confounders, we have μ_d(x) = E[Yᵢ|Dᵢ = d, Xᵢ = x].
• Estimate a regression of Yᵢ on Xᵢ among the Dᵢ = d group.
• Then μ̂_d(x) is just a predicted value from that regression at Xᵢ = x.
• How can we use this?

44 / 58

Imputation estimators

• Impute the treated potential outcomes with Ŷᵢ(1) = μ̂₁(Xᵢ).
• Impute the control potential outcomes with Ŷᵢ(0) = μ̂₀(Xᵢ).
• Procedure:
  - Regress Yᵢ on Xᵢ in the treated group and get predicted values for all units (treated or control).
  - Regress Yᵢ on Xᵢ in the control group and get predicted values for all units (treated or control).
  - Take the average difference between these predicted values.
• More mathematically, it looks like this:

  τ̂_imp = (1/N) Σᵢ [μ̂₁(Xᵢ) − μ̂₀(Xᵢ)]

• Sometimes called an imputation estimator.

45 / 58

Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample.
• Useful trick: use a model on the entire data and model.frame() to get the right design matrix:

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61

46 / 58

Notes on imputation estimators

• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE.
• Why don't people use this?
  - Most people don't know the results we've been talking about.
  - Harder to implement than vanilla OLS.
• Can use linear regression to estimate μ̂_d(x) = x′β̂_d.
• A recent trend is to estimate μ̂_d(x) via non-parametric methods such as kernel regression, local linear regression, regression trees, etc. Easiest is generalized additive models (GAMs).

47 / 58

Imputation estimator visualization

[Figure: scatterplot of y against x with treated and control units marked separately; built up over three slides]

48–50 / 58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between Yᵢ and Xᵢ.

[Figure: scatterplot of y against x with treated and control units, where the true CEFs are nonlinear; built up over three slides]

51–53 / 58

Using semiparametric regression

• Here, the CEFs are nonlinear, but we don't know their form.
• We can use GAMs from the mgcv package for a flexible estimate:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204  Scale est. = 0.0071351  n = 30

54 / 58

Using GAMs

[Figure: scatterplot of y against x with fitted GAM curves for treated and control units; built up over three slides]

55–57 / 58

Limited dependent variables

• Usual advice: model the data from first principles. Logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is A-OK with limited DVs when:
  - binary treatment and no covariates (just diff-in-means)
  - binary treatment, discrete covariates, and saturated models (stratified diff-in-means)
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong.
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models.
  - Could go wrong in small samples.
  - If using nonlinear models, always get effects on the scale of the outcome.

58 / 58

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 3: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

Where are we Where are we going

bull Last few weeks using matching weighting for estimatingcausal effects

bull This week how to use regression to estimate causal effectsbull Regression is so widely used itrsquos good to know what itrsquos

actually estimatingbull Goal salvage regression from the ashes of 1980rsquos textbooksbull Next week panel data

Reminder Email me and Stephen a half-page description of yourproposed research project

3 58

1 AgnosticRegression

4 58

Regression as parametric modeling

bull Gauss-Markov assumptions linearity iid sample full rank 119883119894 zero conditional mean error

homoskedasticity

bull OLS is BLUE plus normality of the errors and we get smallsample SEs

bull What is the basic approach here It is a model for theconditional distribution of 119884119894 given 119883119894

[119884119894|119883119894] sim 119873(119883prime119894 120573 1205901113569)

bull MLE from this model is the usual OLS estimator OLS

OLS =

โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1198731114012119894=1113568

119883119894119883prime119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

minus11135681198731114012119894=1113568

119883119894119884119894

5 58

Agnostic views on regression

[119884119894|119883119894] sim 119873(119883prime119894 120573 1205901113569)

bull Strong distributional assumption on 119884119894bull Properties like BLUE or MLE properties depend on these

assumptions holdingbull Alternative take an agnostic view on regression

Use OLS without believing these assumptions

bull Lose the distributional assumptions focus on the conditionalexpectation function (CEF)

120583(119909) = 120124[119884119894|119883119894 = 119909] = 1114012119910119910 sdot โ„™[119884119894 = 119910|119883119894 = 119909]

6 58

Justifying linear regression

bull Define linear regression

120573 = argmin119887

120124[(119884119894 minus 119883prime119894 119887)1113569]

bull The solution to this is the following

120573 = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]

bull Note that the is the population coefficient vector not theestimator yet

7 58

Regression anatomy

bull Consider simple linear regression

(120572 120573) = argmin119886119887

120124 1114094(119884119894 minus 119886 minus 119887119883119894)11135691114097

bull In this case we can write the populationtrue slope 120573 as

120573 = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894] =

Cov(119884119894 119883119894)120141[119883119894]

bull With more covariates 120573 is more complicated but we can stillwrite it like this

bull Let 119896119894 be the residual from a regression of 119883119896119894 on all the otherindependent variables Then 120573119896 the coefficient for 119883119896119894 is

120573119896 =Cov(119884119894 119896119894)

119881(119896119894)

8 58

Justification 1 Linear CEFs

bull Justification 1 if the CEF is linear the population regressionfunction is it That is if 119864[119884119894|119883119894] = 119883prime

119894 119887 then 119887 = 120573bull When would we expect the CEF to be linear Two cases

1 Outcome and covariates are multivariate normal2 Linear regression model is saturated

bull A model is saturated if there are as many parameters asthere are possible combination of the 119883119894 variables

9 58

Saturated model example

bull Two binary variables 1198831113568119894 for incumbency status and 1198831113569119894 forparty of the candidate

bull Four possible values of 119883119894 four possible values of 120583(119883119894)

119864[119884119894|1198831113568119894 = 01198831113569119894 = 0] = 120572119864[119884119894|1198831113568119894 = 11198831113569119894 = 0] = 120572 + 120573119864[119884119894|1198831113568119894 = 01198831113569119894 = 1] = 120572 + 120574119864[119884119894|1198831113568119894 = 11198831113569119894 = 1] = 120572 + 120573 + 120574 + 120575

bull We can write the CEF as follows

119864[119884119894|1198831113568119894 1198831113569119894] = 120572 + 1205731198831113568119894 + 1205741198831113569119894 + 120575(11988311135681198941198831113569119894)

10 58

Saturated models example

119864[119884119894|1198831113568119894 1198831113569119894] = 120572 + 1205731198831113568119894 + 1205741198831113569119894 + 120575(11988311135681198941198831113569119894)

bull Basically each value of 120583(119883119894) is being estimated separately within-strata estimation No borrowing of information from across values of 119883119894

bull Requires a set of dummies for each categorical variable plusall interactions

bull Or a series of dummies for each unique combination of 119883119894bull This makes linearity hold mechanically and so linearity is

not an assumption Just a fact about saturated CEFs saturated models for limited dependent variables = A-OK

11 58

Saturated model example

bull Washington (AER) data on the effects of daughtersbull Wersquoll look at the relationship between voting and number of

kids (causal)

girls lt- foreignreaddta(rdquogirlsdtardquo)

head(girls[ c(rdquonamerdquo rdquototchirdquo rdquoaauwrdquo)])

name totchi aauw

1 ABERCROMBIE NEIL 0 100

2 ACKERMAN GARY L 3 88

3 ADERHOLT ROBERT B 0 0

4 ALLEN THOMAS H 2 100

5 ANDREWS ROBERT E 2 100

6 ARCHER WR 7 0

12 58

Linear model

summary(lm(aauw ~ totchi data = girls))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 6131 181 3381 lt2e-16

totchi -533 062 -859 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 42 on 1733 degrees of freedom

(5 observations deleted due to missingness)

Multiple R-squared 00408 Adjusted R-squared 00403

F-statistic 738 on 1 and 1733 DF p-value lt2e-16

13 58

Saturated modelsummary(lm(aauw ~ asfactor(totchi) data = girls))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 5641 276 2042 lt 2e-16

asfactor(totchi)1 545 411 133 01851

asfactor(totchi)2 -380 327 -116 02454

asfactor(totchi)3 -1365 345 -395 81e-05

asfactor(totchi)4 -1931 401 -482 16e-06

asfactor(totchi)5 -1546 485 -319 00015

asfactor(totchi)6 -3359 1042 -322 00013

asfactor(totchi)7 -1713 1141 -150 01336

asfactor(totchi)8 -5533 1228 -451 70e-06

asfactor(totchi)9 -5041 2408 -209 00364

asfactor(totchi)10 -5341 2090 -256 00107

asfactor(totchi)12 -5641 4153 -136 01745

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 41 on 1723 degrees of freedom

(5 observations deleted due to missingness)

Multiple R-squared 00506 Adjusted R-squared 00446

F-statistic 836 on 11 and 1723 DF p-value 184e-1414 58

Saturated model minus the constant

summary(lm(aauw ~ as.factor(totchi) - 1, data = girls))

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
as.factor(totchi)0     56.41       2.76   20.42   <2e-16
as.factor(totchi)1     61.86       3.05   20.31   <2e-16
as.factor(totchi)2     52.62       1.75   30.13   <2e-16
as.factor(totchi)3     42.76       2.07   20.62   <2e-16
as.factor(totchi)4     37.11       2.90   12.79   <2e-16
as.factor(totchi)5     40.95       3.99   10.27   <2e-16
as.factor(totchi)6     22.82      10.05    2.27   0.0233
as.factor(totchi)7     39.29      11.07    3.55   0.0004
as.factor(totchi)8      1.08      11.96    0.09   0.9278
as.factor(totchi)9      6.00      23.92    0.25   0.8020
as.factor(totchi)10     3.00      20.72    0.14   0.8849
as.factor(totchi)12     0.00      41.43    0.00   1.0000
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41 on 1723 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.587, Adjusted R-squared: 0.584
F-statistic: 204 on 12 and 1723 DF, p-value: <2e-16

15/58

Compare to within-strata means

• The saturated model makes no assumptions about the between-strata relationships.
• It just calculates within-strata means.

c1 <- coef(lm(aauw ~ as.factor(totchi) - 1, data = girls))
c2 <- with(girls, tapply(aauw, totchi, mean, na.rm = TRUE))
rbind(c1, c2)

    0  1  2  3  4  5  6  7  8 9 10 12
c1 56 62 53 43 37 41 23 39 11 6  3  0
c2 56 62 53 43 37 41 23 39 11 6  3  0

16/58
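This equivalence is easy to verify numerically. A minimal sketch (in Python/NumPy rather than the deck's R, with made-up data): regressing the outcome on a full set of strata dummies without an intercept returns exactly the within-strata means, which is what the rbind() comparison above shows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: a categorical covariate with 4 strata and a continuous outcome.
strata = rng.integers(0, 4, size=200)
y = 10.0 * strata + rng.normal(size=200)

# Design matrix: one dummy per stratum, no intercept (the "- 1" in the R formula).
B = np.zeros((200, 4))
B[np.arange(200), strata] = 1.0

# OLS coefficients on the dummies.
coefs, *_ = np.linalg.lstsq(B, y, rcond=None)

# Within-strata means computed directly (tapply() in the slides).
means = np.array([y[strata == k].mean() for k in range(4)])

print(np.allclose(coefs, means))  # the dummy coefficients ARE the strata means
```

Because the dummy columns are mutually orthogonal, the normal equations decouple stratum by stratum, so each coefficient is exactly its stratum's sample mean.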

Other justifications for OLS

• Justification 2: X_i'β is the best linear predictor (in a mean-squared-error sense) of Y_i. Why? β = argmin_b E[(Y_i - X_i'b)²].
• Justification 3: X_i'β provides the minimum mean-squared-error linear approximation to E[Y_i|X_i].
• Even if the CEF is not linear, a linear regression provides the best linear approximation to that CEF.
• You don't need to believe the assumptions (linearity) in order to use regression as a good approximation to the CEF.
• Warning: if the CEF is very nonlinear, then this approximation could be terrible.

17/58

The error terms

• Let's define the error term e_i ≡ Y_i - X_i'β, so that

    Y_i = X_i'β + [Y_i - X_i'β] = X_i'β + e_i

• Note: the residual e_i is uncorrelated with X_i:

    E[X_i e_i] = E[X_i (Y_i - X_i'β)]
               = E[X_i Y_i] - E[X_i X_i' β]
               = E[X_i Y_i] - E[ X_i X_i' E[X_i X_i']⁻¹ E[X_i Y_i] ]
               = E[X_i Y_i] - E[X_i X_i'] E[X_i X_i']⁻¹ E[X_i Y_i]
               = E[X_i Y_i] - E[X_i Y_i] = 0

• No assumptions on the linearity of E[Y_i|X_i].

18/58

OLS estimator

• We know the population value of β is

    β = E[X_i X_i']⁻¹ E[X_i Y_i]

• How do we get an estimator of this?
• Plug-in principle: replace population expectations with their sample versions:

    β̂ = [ (1/N) Σ_i X_i X_i' ]⁻¹ (1/N) Σ_i X_i Y_i

• If you work through the matrix algebra, this turns out to be

    β̂ = (X'X)⁻¹ X'y

19/58
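The moment form and the matrix form are the same estimator, and the in-sample analogue of E[X_i e_i] = 0 holds exactly. A small sketch (Python/NumPy rather than the deck's R, with made-up data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up regressors (first column is the constant) and outcome.
N, K = 500, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=N)

# Plug-in estimator: sample versions of E[X_i X_i'] and E[X_i Y_i].
Sxx = (X[:, :, None] * X[:, None, :]).mean(axis=0)  # (1/N) sum_i X_i X_i'
Sxy = (X * y[:, None]).mean(axis=0)                 # (1/N) sum_i X_i Y_i
beta_plugin = np.linalg.solve(Sxx, Sxy)

# Matrix form: (X'X)^{-1} X'y.
beta_matrix = np.linalg.solve(X.T @ X, X.T @ y)

# Identical estimates, and the residuals are orthogonal to X in-sample.
resid = y - X @ beta_matrix
print(np.allclose(beta_plugin, beta_matrix))   # True
print(np.allclose(X.T @ resid, 0, atol=1e-6))  # True: sum_i X_i e-hat_i = 0
```

The 1/N factors cancel in the ratio, which is why the two forms coincide exactly.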

Asymptotic OLS inference

• With this representation in hand, we can write the OLS estimator as follows:

    β̂ = β + [ Σ_i X_i X_i' ]⁻¹ Σ_i X_i e_i

• Core idea: Σ_i X_i e_i is a sum of r.v.s, so the CLT applies.
• That, plus some simple asymptotic theory, allows us to say:

    √N (β̂ - β) ⇝ N(0, Ω)

• Converges in distribution to a Normal distribution with mean vector 0 and covariance matrix Ω:

    Ω = E[X_i X_i']⁻¹ E[X_i X_i' e_i²] E[X_i X_i']⁻¹

• No linearity assumption needed!

20/58

Estimating the variance

• In large samples, then:

    √N (β̂ - β) ~ N(0, Ω)

• How to estimate Ω? Plug-in principle again:

    Ω̂ = [ Σ_i X_i X_i' ]⁻¹ [ Σ_i X_i X_i' ê_i² ] [ Σ_i X_i X_i' ]⁻¹

• Replace e_i with its empirical counterpart, the residual ê_i = Y_i - X_i'β̂.
• Replace the population moments of X_i with their sample counterparts.
• The square roots of the diagonals of this covariance matrix are the "robust" or Huber-White standard errors that Stata commonly reports.

21/58
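The sandwich formula above is short enough to compute by hand. A sketch (Python/NumPy rather than the deck's R; the data-generating process is made up so that the error variance grows with |x|, which makes the robust and classical slope SEs visibly diverge):

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up heteroskedastic data: error standard deviation equals |x|.
N = 1000
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
y = 1.0 + 2.0 * x + np.abs(x) * rng.normal(size=N)

beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta

# Plug-in sandwich: [sum XX']^{-1} [sum XX' e-hat^2] [sum XX']^{-1}.
bread = np.linalg.inv(X.T @ X)
meat = (X * resid[:, None] ** 2).T @ X     # sum_i X_i X_i' e-hat_i^2
vcov_robust = bread @ meat @ bread
se_robust = np.sqrt(np.diag(vcov_robust))  # HC0 "robust" standard errors

# Classical homoskedastic SEs, for comparison.
sigma2 = resid @ resid / (N - 2)
se_classical = np.sqrt(np.diag(sigma2 * bread))

print(se_robust, se_classical)  # robust slope SE exceeds the classical one here
```

This is the HC0 variant; finite-sample corrections (HC1-HC3) rescale the residuals but keep the same sandwich structure.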

Heteroskedasticity

• No assumption of homoskedasticity here.
• Heteroskedasticity will definitely occur when:
  - the CEF is linear, but σ²(x) = V[Y_i|X_i = x] is not constant in x;
  - E[Y_i|X_i] is not linear, but we use linear regression to approximate it.

22/58

2 Regression and Causality

23 58

Regression and causality

• In most econometrics textbooks, regression is defined without respect to causality.
• But then when is β̂ "biased"? The derivations above go through for any CEF E[Y_i|X_i].
• The question, then, is: when does knowing the CEF tell us something about causality?
• MHE argues that a regression is causal when the CEF it approximates is causal. Identification is king.
• We will show that, under certain conditions, a regression of the outcome on the treatment and the covariates can recover a causal parameter, but perhaps not the one in which we are interested.

24/58

Review

• Quick reminder: we have potential outcomes Y_i(1) and Y_i(0), and two parameters, the ATE and the ATT:

    τ = E[Y_i(1) - Y_i(0)]
    τ_ATT = E[Y_i(1) - Y_i(0) | D_i = 1]

• We have shown in past weeks that these effects are identified when ignorability holds. MHE calls this the conditional independence assumption (CIA).

25/58

Linear constant effects model, binary treatment

• Experiment: with a simple experiment, we can rewrite the consistency assumption as a regression formula:

    Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0)
        = Y_i(0) + (Y_i(1) - Y_i(0)) D_i
        = E[Y_i(0)] + τ D_i + (Y_i(0) - E[Y_i(0)])
        = μ₀ + τ D_i + v_{0i}

• Note that if ignorability holds (as in an experiment) for Y_i(0), then it will also hold for v_{0i}, since E[Y_i(0)] is constant. Thus this satisfies the usual assumptions for regression.

26/58

Now with covariates

• Now assume no unmeasured confounders: Y_i(d) ⊥⊥ D_i | X_i.
• We will assume a linear model for the potential outcomes:

    Y_i(d) = α + τ·d + η_i

• Remember that linearity isn't an assumption if D_i is binary.
• The effect of D_i is constant here; the η_i are the only source of individual variation, and we have E[η_i] = 0.
• The consistency assumption allows us to write this as

    Y_i = α + τ D_i + η_i

27/58

Covariates in the error

• Let's assume that η_i is linear in X_i: η_i = X_i'γ + ν_i.
• The new error is uncorrelated with X_i: E[ν_i|X_i] = 0.
• This is an assumption! It might be false!
• Plug into the above:

    E[Y_i(d)|X_i] = E[Y_i|D_i, X_i] = α + τ D_i + E[η_i|X_i]
                  = α + τ D_i + X_i'γ + E[ν_i|X_i]
                  = α + τ D_i + X_i'γ

28/58

Summing up regression with constant effects

• Reviewing the assumptions we've used:
  - no unmeasured confounders,
  - constant treatment effects,
  - linearity of the treatment/covariates.
• Under these, we can run the following regression to estimate the ATE, τ:

    Y_i = α + τ D_i + X_i'γ + ν_i

• Works with continuous or ordinal D_i if the effect of these variables is truly linear.

29/58

OLS, constant effects simulation

Model with linear covariates, constant 0 effect of treatment:

library(mvtnorm)
n <- 100
p <- 4
X <- rmvnorm(n = 100, mean = rep(0, p))
gamma <- c(2.74, 1.37, 1.37, 1.37)
y <- 2.10 + X %*% c(gamma) + rnorm(n)
alpha <- c(-1, 0.5, -0.5, -0.1)
d.probs <- boot::inv.logit(X %*% alpha)
d <- rbinom(n, size = 1, prob = d.probs)

30/58

OLS with no covariates

summary(lm(y ~ d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.2036      0.458    4.81  < 2e-16
d            -2.869       0.654   -4.39 0.000029
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.3 on 98 degrees of freedom
Multiple R-squared: 0.164, Adjusted R-squared: 0.156
F-statistic: 19.2 on 1 and 98 DF, p-value: 0.000029

31/58

OLS with covariates

summary(lm(y ~ d + X))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.09952     0.0159  132.07   <2e-16
d           -0.128       0.253    -0.50     0.62
X1           2.7368      0.0126   217.5   <2e-16
X2           1.3677      0.0114   120.1   <2e-16
X3           1.3673      0.0130   105.1   <2e-16
X4           1.3570      0.0106   128.0   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom
Multiple R-squared: 0.999, Adjusted R-squared: 0.999
F-statistic: 2.44e+04 on 5 and 94 DF, p-value: <2e-16

32/58

What happens with nonlinearity?

Suppose we can only observe the following covariates:

z1 <- exp(X[, 1]/2)
z2 <- X[, 2]/(1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3]/25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

Implies that Y_i and D_i are functions of log(Z_{i1}), Z_{i2}, Z_{i1}² Z_{i2}, Z_{i3}^{1/3}/log(Z_{i1}), and Z_{i4}^{1/2}.

Regression is a nonlinear function of the observed covariates.

33/58

When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.18728    3.00799    0.73    0.469
d           -0.69292    0.33854   -2.05    0.043
z1           3.63110    0.26970   13.46   <2e-16
z2          -0.29033    0.36619   -0.79    0.430
z3           8.62022    4.31030    2.00    0.048
z4           0.4021     0.0329    12.23   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.4 on 94 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF, p-value: <2e-16

34/58

3 Regression with Heterogeneous Treatment Effects

35 58

Heterogeneous effects, binary treatment

• Completely randomized experiment:

    Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0)
        = Y_i(0) + (Y_i(1) - Y_i(0)) D_i
        = μ₀ + τ_i D_i + (Y_i(0) - μ₀)
        = μ₀ + τ D_i + (Y_i(0) - μ₀) + (τ_i - τ)·D_i
        = μ₀ + τ D_i + ε_i

• The error term now includes two components:
  1. "Baseline" variation in the outcome: (Y_i(0) - μ₀).
  2. Variation in the treatment effect: (τ_i - τ).
• Easy to verify that, under an experiment, E[ε_i|D_i] = 0.
• Thus OLS estimates the ATE with no covariates.

36/58
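A quick simulation makes the last bullet concrete. The sketch below (Python/NumPy rather than the deck's R; all parameter values are made up) builds unit-level effects τ_i that vary across units, randomizes treatment, and checks that the difference in means — which is OLS of Y on D with an intercept — still lands on the ATE:

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up potential outcomes with heterogeneous unit-level effects tau_i.
N = 200_000
y0 = rng.normal(size=N)
tau_i = 1.0 + rng.normal(size=N)   # effects vary across units; ATE = 1
y1 = y0 + tau_i

# Complete randomization: D is independent of (Y(0), Y(1)).
d = rng.integers(0, 2, size=N)
y = np.where(d == 1, y1, y0)

# OLS of Y on D with an intercept is just the difference in means.
diff_means = y[d == 1].mean() - y[d == 0].mean()
print(diff_means)  # close to the ATE of 1, despite the heterogeneity
```

Randomization makes both error components on this slide mean-independent of D_i, so heterogeneity widens the residual variance but does not bias the estimate.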

Adding covariates

• What happens with no unmeasured confounders? We need to condition on X_i now.
• Remember the identification of the ATE/ATT using iterated expectations.
• The ATE is the weighted sum of CATEs:

    τ = Σ_x τ(x) P[X_i = x]

• The ATE/ATT are weighted averages of CATEs.
• What about the regression estimand, τ_R? How does it relate to the ATE/ATT?

37/58

Heterogeneous effects and regression

• Let's investigate this under a saturated regression model:

    Y_i = Σ_x B_{xi} α_x + τ_R D_i + e_i

• Use a dummy variable for each unique combination of X_i: B_{xi} = I(X_i = x).
• Linear in X_i by construction!

38/58

Investigating the regression coefficient

• How can we investigate τ_R? We can rely on regression anatomy:

    τ_R = Cov(Y_i, D_i - E[D_i|X_i]) / V(D_i - E[D_i|X_i])

• D_i - E[D_i|X_i] is the residual from a regression of D_i on the full set of dummies.
• With a little work, we can show:

    τ_R = E[τ(X_i) (D_i - E[D_i|X_i])²] / E[(D_i - E[D_i|X_i])²]
        = E[τ(X_i) σ²_d(X_i)] / E[σ²_d(X_i)]

• σ²_d(x) = V[D_i|X_i = x] is the conditional variance of treatment assignment.

39/58
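This variance-weighting result is not just asymptotic: with discrete strata it holds exactly in-sample. The sketch below (Python/NumPy rather than the deck's R, with made-up strata, propensities, and CATEs) computes τ̂_R two ways — via the regression-anatomy residual formula and via the variance-weighted average of within-strata differences in means — and they coincide:

```python
import numpy as np

rng = np.random.default_rng(4)

# Two strata with different treatment probabilities and different CATEs.
N = 10_000
x = rng.integers(0, 2, size=N)
p = np.where(x == 1, 0.9, 0.5)        # e(x): propensity differs by stratum
d = rng.binomial(1, p)
tau_x = np.where(x == 1, 1.0, -1.0)   # CATEs: tau(1) = 1, tau(0) = -1
y = tau_x * d + rng.normal(size=N)

# (a) Regression anatomy / FWL: partial out E[D|X] within strata,
# then regress Y on the residualized D.
d_resid = d - np.array([d[x == k].mean() for k in (0, 1)])[x]
tau_R_ols = (d_resid @ y) / (d_resid @ d_resid)

# (b) Variance-weighted average of within-strata differences in means.
num = den = 0.0
for k in (0, 1):
    m = x == k
    var_d = d[m].var()  # sigma^2_d(x), in sample
    cate = y[m][d[m] == 1].mean() - y[m][d[m] == 0].mean()
    num += cate * var_d * m.mean()
    den += var_d * m.mean()

print(tau_R_ols, num / den)  # identical: an exact finite-sample identity
```

The match is exact (up to floating point) because within each stratum the residualized treatment has sum of squares n_x p̂_x(1 - p̂_x), which is precisely the weight the formula assigns.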

ATE versus OLS

    τ_R = E[τ(X_i) W_i] = Σ_x τ(x) (σ²_d(x) / E[σ²_d(X_i)]) P[X_i = x]

• Compare to the ATE:

    τ = E[τ(X_i)] = Σ_x τ(x) P[X_i = x]

• Both weight strata relative to their size, P[X_i = x].
• OLS weights strata more heavily if the treatment variance in those strata, σ²_d(x), is higher relative to the average variance across strata, E[σ²_d(X_i)].
• The ATE weights strata only by their size.

40/58

Regression weighting

    W_i = σ²_d(X_i) / E[σ²_d(X_i)]

• Why does OLS weight like this?
• OLS is a minimum-variance estimator: it gives more weight to more precise within-strata estimates.
• Within-strata estimates are most precise when the treatment is evenly spread and thus has the highest variance.
• If D_i is binary, then we know the conditional variance will be

    σ²_d(x) = P[D_i = 1|X_i = x] (1 - P[D_i = 1|X_i = x])
            = e(x) (1 - e(x))

• Maximum variance with P[D_i = 1|X_i = x] = 1/2.

41/58

OLS weighting example

• Binary covariate:

    P[X_i = 1] = 0.75            P[X_i = 0] = 0.25
    P[D_i = 1|X_i = 1] = 0.9     P[D_i = 1|X_i = 0] = 0.5
    σ²_d(1) = 0.09               σ²_d(0) = 0.25
    τ(1) = 1                     τ(0) = -1

• Implies the ATE is τ = 0.5.
• Average conditional variance: E[σ²_d(X_i)] = 0.13.
• Weights: for X_i = 1, 0.09/0.13 = 0.692; for X_i = 0, 0.25/0.13 = 1.92.

    τ_R = E[τ(X_i) W_i]
        = τ(1) W(1) P[X_i = 1] + τ(0) W(0) P[X_i = 0]
        = 1 × 0.692 × 0.75 + -1 × 1.92 × 0.25
        = 0.039

42/58
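The arithmetic on this slide is worth checking exactly, since the gap between τ and τ_R is the whole point. A small sketch (plain Python; the slide's 0.039 comes from rounding the weights first — exact arithmetic gives 1/26 ≈ 0.038):

```python
# Exact arithmetic for the weighting example on this slide.
p_x = {1: 0.75, 0: 0.25}   # P[X = x]
e_x = {1: 0.9, 0: 0.5}     # e(x) = P[D = 1 | X = x]
tau = {1: 1.0, 0: -1.0}    # CATEs

var_d = {x: e_x[x] * (1 - e_x[x]) for x in p_x}  # sigma^2_d(x): 0.09 and 0.25
avg_var = sum(var_d[x] * p_x[x] for x in p_x)    # E[sigma^2_d(X)] = 0.13

ate = sum(tau[x] * p_x[x] for x in p_x)                          # 0.5
tau_R = sum(tau[x] * (var_d[x] / avg_var) * p_x[x] for x in p_x)

print(ate, round(tau_R, 3))  # tau_R is near zero, far from the ATE of 0.5
```

The minority stratum (X = 0) has treatment split 50/50, so its large weight nearly cancels the majority stratum's positive effect.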

When will OLS estimate the ATE?

• When does τ = τ_R?
  - Constant treatment effects: τ(x) = τ = τ_R.
  - Constant probability of treatment: e(x) = P[D_i = 1|X_i = x] = e, which implies that the OLS weights are all 1.
• An incorrect linearity assumption in X_i will lead to even more bias.

43/58

Other ways to use regression

• What's the path forward?
  - Accept the bias (it might be relatively small with saturated models).
  - Use a different regression approach.
• Let μ_d(x) = E[Y_i(d)|X_i = x] be the CEF for the potential outcome under D_i = d.
• By consistency and no unmeasured confounders, we have μ_d(x) = E[Y_i|D_i = d, X_i = x].
• Estimate a regression of Y_i on X_i among the D_i = d group.
• Then μ̂_d(x) is just a predicted value from that regression for X_i = x.
• How can we use this?

44/58

Imputation estimators

• Impute the treated potential outcomes with Ŷ_i(1) = μ̂₁(X_i).
• Impute the control potential outcomes with Ŷ_i(0) = μ̂₀(X_i).
• Procedure:
  - Regress Y_i on X_i in the treated group and get predicted values for all units (treated or control).
  - Regress Y_i on X_i in the control group and get predicted values for all units (treated or control).
  - Take the average difference between these predicted values.
• More mathematically, it looks like this:

    τ̂_imp = (1/N) Σ_i [μ̂₁(X_i) - μ̂₀(X_i)]

• Sometimes called an imputation estimator.

45/58
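The procedure above can be sketched end-to-end in a few lines (Python/NumPy rather than the deck's R; the confounded data-generating process is made up, with a true ATE of 1). The naive difference in means is badly biased here, while the imputation estimator is not:

```python
import numpy as np

rng = np.random.default_rng(5)

# Made-up data with a confounder x and heterogeneous effects: ATE = 1.
N = 50_000
x = rng.normal(size=N)
d = rng.binomial(1, 1 / (1 + np.exp(-x)))         # treatment depends on x
y = x + d * (1.0 + 0.5 * x) + rng.normal(size=N)  # tau(x) = 1 + 0.5x

def fit_ols(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

X_all = np.column_stack([np.ones(N), x])

# Fit separate regressions within each treatment group...
b1 = fit_ols(X_all[d == 1], y[d == 1])
b0 = fit_ols(X_all[d == 0], y[d == 0])

# ...then predict BOTH potential outcomes for every unit and average.
y1_imp = X_all @ b1  # mu-hat_1(X_i) for all units
y0_imp = X_all @ b0  # mu-hat_0(X_i) for all units
tau_imp = (y1_imp - y0_imp).mean()

naive = y[d == 1].mean() - y[d == 0].mean()  # ignores the confounder
print(tau_imp, naive)  # tau_imp near 1; naive is roughly double that
```

Both within-group CEFs are linear here, so linear μ̂_d is correctly specified; with nonlinear CEFs the same recipe works with flexible fits (the GAM slides below).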

Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample.
• Useful trick: use a model on the entire data and model.frame() to get the right design matrix.

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61

46/58

Notes on imputation estimators

• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE.
• Why don't people use this?
  - Most people don't know the results we've been talking about.
  - Harder to implement than vanilla OLS.
• Can use linear regression to estimate μ̂_d(x) = x'β̂_d.
• Recent trend is to estimate μ̂_d(x) via non-parametric methods such as kernel regression, local linear regression, regression trees, etc.
• Easiest: generalized additive models (GAMs).

47/58

Imputation estimator visualization

[Figure, slides 48-50: scatterplot of y against x for treated and control units, with the within-group regression fits built up across the three slides.]

48-50/58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i.

[Figure, slides 51-53: scatterplot of y against x for treated and control units with a nonlinear CEF, built up across the three slides.]

51-53/58

Using semiparametric regression

• Here, the CEFs are nonlinear, but we don't know their form.
• We can use GAMs from the mgcv package for flexible estimates:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204   Scale est. = 0.0071351   n = 30

54/58

Using GAMs

[Figure, slides 55-57: the GAM fits for the treated and control groups overlaid on the scatterplot of y against x, built up across the three slides.]

55-57/58

Limited dependent variables

• Usual advice: model the data from first principles. Logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is A-OK with limited DVs when:
  - binary treatment and no covariates (just the difference in means);
  - binary treatment, discrete covariates, and saturated models (stratified differences in means).
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong.
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models.
  - Could go wrong in small samples.
  - If using nonlinear models, always get effects on the scale of the outcome.

58/58
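The first "A-OK" case is an exact algebraic fact, not an approximation, and it holds with a binary outcome too. A minimal sketch (Python/NumPy rather than the deck's R, with made-up response rates):

```python
import numpy as np

rng = np.random.default_rng(6)

# Binary outcome, binary treatment: no distributional model is needed for
# OLS to recover the difference in means.
N = 5_000
d = rng.integers(0, 2, size=N)
y = rng.binomial(1, np.where(d == 1, 0.7, 0.4))  # made-up response rates

X = np.column_stack([np.ones(N), d])
beta = np.linalg.solve(X.T @ X, X.T @ y)

diff_means = y[d == 1].mean() - y[d == 0].mean()
print(beta[1], diff_means)  # identical: OLS on a dummy IS the diff-in-means
```

With discrete covariates, the same identity holds stratum by stratum in a saturated model, which is why the distributional worry does not bite in those two cases.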

Linear model

summary(lm(aauw ~ totchi data = girls))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 6131 181 3381 lt2e-16

totchi -533 062 -859 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 42 on 1733 degrees of freedom

(5 observations deleted due to missingness)

Multiple R-squared 00408 Adjusted R-squared 00403

F-statistic 738 on 1 and 1733 DF p-value lt2e-16

13 58

Saturated modelsummary(lm(aauw ~ asfactor(totchi) data = girls))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 5641 276 2042 lt 2e-16

asfactor(totchi)1 545 411 133 01851

asfactor(totchi)2 -380 327 -116 02454

asfactor(totchi)3 -1365 345 -395 81e-05

asfactor(totchi)4 -1931 401 -482 16e-06

asfactor(totchi)5 -1546 485 -319 00015

asfactor(totchi)6 -3359 1042 -322 00013

asfactor(totchi)7 -1713 1141 -150 01336

asfactor(totchi)8 -5533 1228 -451 70e-06

asfactor(totchi)9 -5041 2408 -209 00364

asfactor(totchi)10 -5341 2090 -256 00107

asfactor(totchi)12 -5641 4153 -136 01745

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 41 on 1723 degrees of freedom

(5 observations deleted due to missingness)

Multiple R-squared 00506 Adjusted R-squared 00446

F-statistic 836 on 11 and 1723 DF p-value 184e-1414 58

Saturated model minus the constantsummary(lm(aauw ~ asfactor(totchi) - 1 data = girls))

Coefficients

Estimate Std Error t value Pr(gt|t|)

asfactor(totchi)0 5641 276 2042 lt2e-16

asfactor(totchi)1 6186 305 2031 lt2e-16

asfactor(totchi)2 5262 175 3013 lt2e-16

asfactor(totchi)3 4276 207 2062 lt2e-16

asfactor(totchi)4 3711 290 1279 lt2e-16

asfactor(totchi)5 4095 399 1027 lt2e-16

asfactor(totchi)6 2282 1005 227 00233

asfactor(totchi)7 3929 1107 355 00004

asfactor(totchi)8 108 1196 009 09278

asfactor(totchi)9 600 2392 025 08020

asfactor(totchi)10 300 2072 014 08849

asfactor(totchi)12 000 4143 000 10000

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 41 on 1723 degrees of freedom

(5 observations deleted due to missingness)

Multiple R-squared 0587 Adjusted R-squared 0584

F-statistic 204 on 12 and 1723 DF p-value lt2e-1615 58

Compare to within-strata means

bull The saturated model makes no assumptions about thebetween-strata relationships

bull Just calculates within-strata means

c1 lt- coef(lm(aauw ~ asfactor(totchi) - 1 data = girls))

c2 lt- with(girls tapply(aauw totchi mean narm = TRUE))

rbind(c1 c2)

0 1 2 3 4 5 6 7 8 9 10 12

c1 56 62 53 43 37 41 23 39 11 6 3 0

c2 56 62 53 43 37 41 23 39 11 6 3 0

16 58

Other justifications for OLS

bull Justification 2 119883prime119894 120573 is the best linear predictor (in a

mean-squared error sense) of 119884119894 Why 120573 = argmin119887 120124[(119884119894 minus 119883prime

119894 119887)1113569]

bull Justification 3 119883prime119894 120573 provides the minimum mean squared

error linear approxmiation to 119864[119884119894|119883119894]bull Even if the CEF is not linear a linear regression provides the

best linear approximation to that CEFbull Donrsquot need to believe the assumptions (linearity) in order to

use regression as a good approximation to the CEFbull Warning if the CEF is very nonlinear then this approximation

could be terrible

17 58

The error terms

bull Letrsquos define the error term 119890119894 equiv 119884119894 minus 119883prime119894 120573 so that

119884119894 = 119883prime119894 120573 + [119884119894 minus 119883prime

119894 120573] = 119883prime119894 120573 + 119890119894

bull Note the residual 119890119894 is uncorrelated with 119883119894

120124[119883119894119890119894] = 120124[119883119894(119884119894 minus 119883prime119894 120573)]

= 120124[119883119894119884119894] minus 120124[119883119894119883prime119894 120573]

= 120124[119883119894119884119894] minus 120124 1114094119883119894119883prime119894120124[119883119894119883prime

119894 ]minus1113568120124[119883119894119884119894]1114097= 120124[119883119894119884119894] minus 120124[119883119894119883prime

119894 ]120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]

= 120124[119883119894119884119894] minus 120124[119883119894119884119894] = 0

bull No assumptions on the linearity of 120124[119884119894|119883119894]

18 58

OLS estimator

bull We know the population value of 120573 is

120573 = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]

bull How do we get an estimator of thisbull Plug-in principle replace population expectation with

sample versions

=

โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1119873

1114012119894119883119894119883prime

119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

minus11135681119873

1114012119894119883119894119884119894

bull If you work through the matrix algebra this turns out to be

= (119831prime119831)minus1113568 119831prime119858

19 58

Asymptotic OLS inference

bull With this representation in hand we can write the OLSestimator as follows

= 120573 +

โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1114012

119894119883119894119883prime

119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

minus1113568

1114012119894119883119894119890119894

bull Core idea sum119894119883119894119890119894 is the sum of rvs so the CLT applies

bull That plus some simple asymptotic theory allows us to say

radic119873( minus 120573) 119873(0ฮฉ)

bull Converges in distribution to a Normal distribution with meanvector 0 and covariance matrix ฮฉ

ฮฉ = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119883prime

119894 1198901113569119894 ]120124[119883119894119883prime119894 ]minus1113568

bull No linearity assumption needed20 58

Estimating the variance

• In large samples, then, $\sqrt{N}(\hat{\beta} - \beta)$ is approximately $N(0, \Omega)$.
• How to estimate $\Omega$? Plug-in principle again!

  $\hat{\Omega} = \left[\sum_i X_iX_i'\right]^{-1} \left[\sum_i X_iX_i'\hat{e}_i^2\right] \left[\sum_i X_iX_i'\right]^{-1}$

• Replace $e_i$ with its empirical counterpart, the residuals $\hat{e}_i = Y_i - X_i'\hat{\beta}$.
• Replace the population moments of $X_i$ with their sample counterparts.
• The square roots of the diagonal entries of this covariance matrix are the "robust" or Huber-White standard errors that Stata commonly reports.

21/58
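The plug-in recipe above takes only a few lines of base R. A sketch on simulated heteroskedastic data (my own illustration, not the slides' code), computing the HC0 sandwich by hand:

```r
# Sketch: the robust (HC0) sandwich variance computed by hand.
set.seed(7)
N <- 500
x <- rnorm(N)
y <- 1 + 2 * x + rnorm(N, sd = abs(x))  # heteroskedastic errors
X <- cbind(1, x)                        # design matrix
fit <- lm(y ~ x)
e.hat <- resid(fit)                     # empirical counterpart of e_i
bread <- solve(crossprod(X))            # [sum_i X_i X_i']^{-1}
meat <- crossprod(X * e.hat)            # sum_i X_i X_i' e.hat_i^2
Omega.hat <- bread %*% meat %*% bread   # the sandwich
robust.se <- sqrt(diag(Omega.hat))      # Huber-White standard errors
```

With heteroskedastic errors these will differ from the classical standard errors in `summary(fit)`; packages like `sandwich` automate the same computation.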

Heteroskedasticity

• No assumption of homoskedasticity here.
• Heteroskedasticity will definitely occur when:
  - the CEF is linear, but $\sigma^2(x) = \mathbb{V}[Y_i \mid X_i = x]$ is not constant in $x$;
  - $\mathbb{E}[Y_i \mid X_i]$ is not linear, but we use linear regression to approximate it.

22/58

2. Regression and Causality

23/58

Regression and causality

• Most econometrics textbooks define regression without respect to causality.
• But then when is $\hat{\beta}$ "biased"? The derivations above work for some CEF $\mathbb{E}[Y_i \mid X_i]$.
• The question, then, is: when does knowing the CEF tell us something about causality?
• MHE argues that a regression is causal when the CEF it approximates is causal. Identification is king.
• We will show that, under certain conditions, a regression of the outcome on the treatment and the covariates can recover a causal parameter, but perhaps not the one in which we are interested.

24/58

Review

• Quick reminder: we have potential outcomes $Y_i(1)$ and $Y_i(0)$ and two parameters, the ATE and the ATT:

  $\tau = \mathbb{E}[Y_i(1) - Y_i(0)]$
  $\tau_{\text{ATT}} = \mathbb{E}[Y_i(1) - Y_i(0) \mid D_i = 1]$

• We have shown in past weeks that these effects are identified when ignorability holds. MHE calls this the conditional independence assumption (CIA).

25/58

Linear constant-effects model, binary treatment

• Start with a simple experiment; we can rewrite the consistency assumption as a regression formula:

  $Y_i = D_iY_i(1) + (1 - D_i)Y_i(0)$
  $\;\;\;= Y_i(0) + (Y_i(1) - Y_i(0))D_i$
  $\;\;\;= \mathbb{E}[Y_i(0)] + \tau D_i + (Y_i(0) - \mathbb{E}[Y_i(0)])$
  $\;\;\;= \mu_0 + \tau D_i + v_{0i}$

• Note that if ignorability holds for $Y_i(0)$ (as in an experiment), then it will also hold for $v_{0i}$, since $\mathbb{E}[Y_i(0)]$ is constant. Thus this satisfies the usual assumptions for regression.

26/58
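One consequence worth seeing concretely: with a binary treatment and no covariates, the OLS coefficient on $D_i$ is exactly the difference in means. A minimal sketch (simulated data, my own illustration):

```r
# Sketch: OLS slope on a binary treatment equals the difference in means.
set.seed(1)
d <- rbinom(100, 1, 0.5)
y <- 1 + 0.5 * d + rnorm(100)
diff.means <- mean(y[d == 1]) - mean(y[d == 0])
ols.slope <- unname(coef(lm(y ~ d))["d"])
all.equal(diff.means, ols.slope)  # TRUE
```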

Now with covariates

• Now assume no unmeasured confounders: $Y_i(d) \perp\!\!\!\perp D_i \mid X_i$.
• We will assume a linear model for the potential outcomes:

  $Y_i(d) = \alpha + \tau \cdot d + \eta_i$

• Remember that linearity isn't an assumption if $D_i$ is binary.
• The effect of $D_i$ is constant here; the $\eta_i$ are the only source of individual variation, and $\mathbb{E}[\eta_i] = 0$.
• The consistency assumption allows us to write this as

  $Y_i = \alpha + \tau D_i + \eta_i$

27/58

Covariates in the error

• Let's assume that $\eta_i$ is linear in $X_i$: $\eta_i = X_i'\gamma + \nu_i$.
• The new error is mean-independent of (and hence uncorrelated with) $X_i$: $\mathbb{E}[\nu_i \mid X_i] = 0$.
• This is an assumption! It might be false!
• Plug into the above:

  $\mathbb{E}[Y_i(d) \mid X_i] = \mathbb{E}[Y_i \mid D_i, X_i] = \alpha + \tau D_i + \mathbb{E}[\eta_i \mid X_i]$
  $\;\;\;= \alpha + \tau D_i + X_i'\gamma + \mathbb{E}[\nu_i \mid X_i]$
  $\;\;\;= \alpha + \tau D_i + X_i'\gamma$

28/58

Summing up: regression with constant effects

• Reviewing the assumptions we've used:
  - no unmeasured confounders,
  - constant treatment effects,
  - linearity of the treatment/covariates.
• Under these, we can run the following regression to estimate the ATE, $\tau$:

  $Y_i = \alpha + \tau D_i + X_i'\gamma + \nu_i$

• Works with continuous or ordinal $D_i$ if the effect of these variables is truly linear.

29/58

OLS constant-effects simulation

Model with linear covariates and a constant zero effect of the treatment:

library(mvtnorm)
n <- 100
p <- 4
X <- rmvnorm(n = 100, mean = rep(0, p))
gamma <- c(2.74, 1.37, 1.37, 1.37)
y <- 21.0 + X %*% c(gamma) + rnorm(n)
alpha <- c(-1, 0.5, -0.5, -0.1)
d.probs <- boot::inv.logit(X %*% alpha)
d <- rbinom(n, size = 1, prob = d.probs)

30/58

OLS with no covariates

summary(lm(y ~ d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   22.036      0.458   48.13  < 2e-16
d             -2.869      0.654   -4.39 0.000029
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.3 on 98 degrees of freedom
Multiple R-squared: 0.164,  Adjusted R-squared: 0.156
F-statistic: 19.2 on 1 and 98 DF,  p-value: 0.000029

31/58

OLS with covariates

summary(lm(y ~ d + X))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  20.9952      0.159  132.07   <2e-16
d            -0.128       0.253   -0.50     0.62
X1            2.7368      0.126   21.75   <2e-16
X2            1.3677      0.114   12.01   <2e-16
X3            1.3673      0.130   10.51   <2e-16
X4            1.3570      0.106   12.80   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom
Multiple R-squared: 0.999,  Adjusted R-squared: 0.999
F-statistic: 2.44e+04 on 5 and 94 DF,  p-value: <2e-16

32/58

What happens with nonlinearity?

Suppose we can only observe the following covariates:

z1 <- exp(X[, 1]/2)
z2 <- X[, 2]/(1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3]/25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

Implies that $Y_i$ and $D_i$ are functions of $\log(Z_{i1})$, $Z_{i2}$, $Z_{i1}^2 Z_{i2}$, $1/\log(Z_{i1})$, $Z_{i3}^{1/3}/\log(Z_{i1})$, and $Z_{i4}^{1/2}$.

⇝ the true regression function is a nonlinear function of the observed covariates.

33/58

When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.8728    30.0799    0.73    0.469
d            -6.9292     3.3854   -2.05    0.043
z1           36.3110     2.6970   13.46   <2e-16
z2           -2.9033     3.6619   -0.79    0.430
z3           86.2022    43.1030    2.00    0.048
z4            0.4021     0.0329   12.23   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14 on 94 degrees of freedom
Multiple R-squared: 0.855,  Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF,  p-value: <2e-16

34/58

3. Regression with Heterogeneous Treatment Effects

35/58

Heterogeneous effects, binary treatment

• Completely randomized experiment:

  $Y_i = D_iY_i(1) + (1 - D_i)Y_i(0)$
  $\;\;\;= Y_i(0) + (Y_i(1) - Y_i(0))D_i$
  $\;\;\;= \mu_0 + \tau_i D_i + (Y_i(0) - \mu_0)$
  $\;\;\;= \mu_0 + \tau D_i + (Y_i(0) - \mu_0) + (\tau_i - \tau) \cdot D_i$
  $\;\;\;= \mu_0 + \tau D_i + \varepsilon_i$

• The error term now includes two components:
  1. "baseline" variation in the outcome: $(Y_i(0) - \mu_0)$;
  2. variation in the treatment effect: $(\tau_i - \tau)$.
• Easy to verify that under an experiment, $\mathbb{E}[\varepsilon_i \mid D_i] = 0$.
• Thus OLS estimates the ATE with no covariates.

36/58

Adding covariates

• What happens with no unmeasured confounders? We need to condition on $X_i$ now.
• Remember identification of the ATE/ATT using iterated expectations.
• The ATE is the weighted sum of conditional average treatment effects (CATEs):

  $\tau = \sum_x \tau(x) \Pr[X_i = x]$

• The ATE/ATT are weighted averages of CATEs.
• What about the regression estimand, $\tau_R$? How does it relate to the ATE/ATT?

37/58

Heterogeneous effects and regression

• Let's investigate this under a saturated regression model:

  $Y_i = \sum_x B_{xi}\alpha_x + \tau_R D_i + e_i$

• Use a dummy variable for each unique combination of $X_i$: $B_{xi} = \mathbb{I}(X_i = x)$.
• Linear in $X_i$ by construction!

38/58

Investigating the regression coefficient

• How can we investigate $\tau_R$? We can rely on regression anatomy:

  $\tau_R = \dfrac{\mathrm{Cov}(Y_i,\, D_i - \mathbb{E}[D_i \mid X_i])}{\mathbb{V}(D_i - \mathbb{E}[D_i \mid X_i])}$

• $D_i - \mathbb{E}[D_i \mid X_i]$ is the residual from a regression of $D_i$ on the full set of dummies.
• With a little work, we can show:

  $\tau_R = \dfrac{\mathbb{E}\!\left[\tau(X_i)\,(D_i - \mathbb{E}[D_i \mid X_i])^2\right]}{\mathbb{E}\!\left[(D_i - \mathbb{E}[D_i \mid X_i])^2\right]} = \dfrac{\mathbb{E}[\tau(X_i)\,\sigma_d^2(X_i)]}{\mathbb{E}[\sigma_d^2(X_i)]}$

• $\sigma_d^2(x) = \mathbb{V}[D_i \mid X_i = x]$ is the conditional variance of treatment assignment.

39/58
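This variance-weighting result is easy to see in a quick simulation with one binary covariate (my own illustration; the numbers are chosen so the CATEs and propensities are simple):

```r
# Sketch: with stratum dummies, OLS recovers the variance-weighted
# average of the CATEs, not the ATE.
set.seed(2)
N <- 1e5
x <- rbinom(N, 1, 0.75)            # P[X = 1] = 0.75
e.x <- ifelse(x == 1, 0.9, 0.5)    # propensity scores by stratum
d <- rbinom(N, 1, e.x)
tau.i <- ifelse(x == 1, 1, -1)     # CATEs: tau(1) = 1, tau(0) = -1
y <- 2 + tau.i * d + rnorm(N)
tau.R.hat <- unname(coef(lm(y ~ d + x))["d"])
tau.R.hat  # near 0.039, far from the ATE of 0.5
```

The regression with the stratum dummy is saturated here (a single binary $X$), so the gap from the ATE is pure variance weighting, not misspecification.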

ATE versus OLS

$\tau_R = \mathbb{E}[\tau(X_i)W_i] = \sum_x \tau(x)\, \dfrac{\sigma_d^2(x)}{\mathbb{E}[\sigma_d^2(X_i)]}\, \mathbb{P}[X_i = x]$

• Compare this to the ATE:

  $\tau = \mathbb{E}[\tau(X_i)] = \sum_x \tau(x)\, \mathbb{P}[X_i = x]$

• Both weight strata relative to their size, $\mathbb{P}[X_i = x]$.
• OLS weights a stratum more heavily if the treatment variance $\sigma_d^2(x)$ in that stratum is high relative to the average variance across strata, $\mathbb{E}[\sigma_d^2(X_i)]$.
• The ATE weights strata only by their size.

40/58

Regression weighting

$W_i = \dfrac{\sigma_d^2(X_i)}{\mathbb{E}[\sigma_d^2(X_i)]}$

• Why does OLS weight like this?
• OLS is a minimum-variance estimator ⇝ it gives more weight to more precise within-stratum estimates.
• Within-stratum estimates are most precise when the treatment is evenly spread, and thus has the highest variance.
• If $D_i$ is binary, then we know the conditional variance will be

  $\sigma_d^2(x) = \mathbb{P}[D_i = 1 \mid X_i = x]\,\big(1 - \mathbb{P}[D_i = 1 \mid X_i = x]\big) = e(x)\big(1 - e(x)\big)$

• Maximum variance occurs when $\mathbb{P}[D_i = 1 \mid X_i = x] = 1/2$.

41/58

OLS weighting example

• Binary covariate:

  $\mathbb{P}[X_i = 1] = 0.75 \qquad \mathbb{P}[X_i = 0] = 0.25$
  $\mathbb{P}[D_i = 1 \mid X_i = 1] = 0.9 \qquad \mathbb{P}[D_i = 1 \mid X_i = 0] = 0.5$
  $\sigma_d^2(1) = 0.09 \qquad \sigma_d^2(0) = 0.25$
  $\tau(1) = 1 \qquad \tau(0) = -1$

• Implies the ATE is $\tau = 0.5$.
• Average conditional variance: $\mathbb{E}[\sigma_d^2(X_i)] = 0.75 \times 0.09 + 0.25 \times 0.25 = 0.13$.
• ⇝ weights: for $X_i = 1$, $0.09/0.13 = 0.692$; for $X_i = 0$, $0.25/0.13 = 1.92$.

  $\tau_R = \mathbb{E}[\tau(X_i)W_i]$
  $\;\;\;= \tau(1)W(1)\mathbb{P}[X_i = 1] + \tau(0)W(0)\mathbb{P}[X_i = 0]$
  $\;\;\;= 1 \times 0.692 \times 0.75 + (-1) \times 1.92 \times 0.25$
  $\;\;\;= 0.039$

42/58
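The same arithmetic in a few lines of R (a sketch reproducing the numbers above, vectors ordered as $x = 0, 1$):

```r
# The weighting arithmetic, step by step.
p.x   <- c(0.25, 0.75)        # P[X = 0], P[X = 1]
e.x   <- c(0.5, 0.9)          # P[D = 1 | X = x]
tau.x <- c(-1, 1)             # CATEs
var.d <- e.x * (1 - e.x)      # sigma^2_d(x): 0.25 and 0.09
avg.var <- sum(var.d * p.x)   # E[sigma^2_d(X)] = 0.13
w <- var.d / avg.var          # OLS weights: 1.92 and 0.692
ate   <- sum(tau.x * p.x)     # 0.5
tau.R <- sum(tau.x * w * p.x) # 0.039
```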

When will OLS estimate the ATE?

• When does $\tau = \tau_R$?
  - Constant treatment effects: $\tau(x) = \tau = \tau_R$.
  - Constant probability of treatment: $e(x) = \mathbb{P}[D_i = 1 \mid X_i = x] = e$. This implies that the OLS weights are all 1.
• An incorrect linearity assumption in $X_i$ will lead to more bias.

43/58

Other ways to use regression

• What's the path forward?
  - Accept the bias (it might be relatively small with saturated models).
  - Use a different regression approach.
• Let $\mu_d(x) = \mathbb{E}[Y_i(d) \mid X_i = x]$ be the CEF for the potential outcome under $D_i = d$.
• By consistency and no unmeasured confounders, we have $\mu_d(x) = \mathbb{E}[Y_i \mid D_i = d, X_i = x]$.
• Estimate a regression of $Y_i$ on $X_i$ among the $D_i = d$ group.
• Then $\hat{\mu}_d(x)$ is just a predicted value from that regression at $X_i = x$.
• How can we use this?

44/58

Imputation estimators

• Impute the treated potential outcomes with $\hat{Y}_i(1) = \hat{\mu}_1(X_i)$.
• Impute the control potential outcomes with $\hat{Y}_i(0) = \hat{\mu}_0(X_i)$.
• Procedure:
  - Regress $Y_i$ on $X_i$ in the treated group and get predicted values for all units (treated or control).
  - Regress $Y_i$ on $X_i$ in the control group and get predicted values for all units (treated or control).
  - Take the average difference between these predicted values.
• More mathematically, it looks like this:

  $\hat{\tau}_{\text{imp}} = \dfrac{1}{N}\sum_i \left[\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)\right]$

• Sometimes called an imputation estimator.

45/58

Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample.
• Useful trick: fit a model on the entire data and use model.frame() to get the right design matrix.

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61

46/58

Notes on imputation estimators

• If the $\hat{\mu}_d(x)$ are consistent estimators, then $\hat{\tau}_{\text{imp}}$ is consistent for the ATE.
• Why don't people use this?
  - Most people don't know the results we've been talking about.
  - It's harder to implement than vanilla OLS.
• We can use linear regression to estimate $\mu_d(x) = x'\beta_d$.
• A recent trend is to estimate $\mu_d(x)$ via nonparametric methods such as:
  - kernel regression, local linear regression, regression trees, etc.
  - Easiest is generalized additive models (GAMs).

47/58

Imputation estimator visualization

[Figure: scatterplot of y against x for treated and control units, building up the two within-group regression fits over three slides.]

48–50/58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between $Y_i$ and $X_i$.

[Figure: scatterplot of y against x for treated and control units with a nonlinear CEF, built up over three slides.]

51–53/58

Using semiparametric regression

• Here the CEFs are nonlinear, but we don't know their form.
• We can use GAMs from the mgcv package for a flexible estimate:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204  Scale est. = 0.0071351  n = 30

54/58

Using GAMs

[Figure: GAM fits for the treated and control groups overlaid on the y-vs-x scatterplot, built up over three slides.]

55–57/58

Limited dependent variables

• Usual advice: model the data from first principles! Logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is A-OK with limited DVs when:
  - binary treatment and no covariates (just a difference in means);
  - binary treatment, discrete covariates, and a saturated model (stratified differences in means).
• Imposing a model on LDVs in these cases imposes a distributional assumption, which could be wrong.
• Even in unsaturated models, the marginal effects from OLS are often decent compared to nonlinear models.
  - Could go wrong in small samples.
  - If using nonlinear models, always get effects on the scale of the outcome!

58/58

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
OLS estimator

bull We know the population value of 120573 is

120573 = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]

bull How do we get an estimator of thisbull Plug-in principle replace population expectation with

sample versions

=

โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1119873

1114012119894119883119894119883prime

119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

minus11135681119873

1114012119894119883119894119884119894

bull If you work through the matrix algebra this turns out to be

= (119831prime119831)minus1113568 119831prime119858

19 58

Asymptotic OLS inference

bull With this representation in hand we can write the OLSestimator as follows

= 120573 +

โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1114012

119894119883119894119883prime

119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

minus1113568

1114012119894119883119894119890119894

bull Core idea sum119894119883119894119890119894 is the sum of rvs so the CLT applies

bull That plus some simple asymptotic theory allows us to say

radic119873( minus 120573) 119873(0ฮฉ)

bull Converges in distribution to a Normal distribution with meanvector 0 and covariance matrix ฮฉ

ฮฉ = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119883prime

119894 1198901113569119894 ]120124[119883119894119883prime119894 ]minus1113568

bull No linearity assumption needed20 58

Estimating the variance

bull In large samples then

radic119873( minus 120573) sim 119873(0ฮฉ)

bull How to estimate ฮฉ Plug-in principle again

1114023ฮฉ =

โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1114012

119894119883119894119883prime

119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

minus1113568 โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1114012119894119883119894119883prime

119894 1113569119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1114012

119894119883119894119883prime

119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

minus1113568

bull Replace 119890119894 with its emprical counterpart (residuals)119894 = 119884119894 minus 119883prime

119894 bull Replace the population moments of 119883119894 with their sample

counterpartsbull The square root of the diagonals of this covariance matrix are

the ldquorobustrdquo or Huber-White standard errors that Statacommonly report

21 58

Heteroskedasticity

bull No assumptions of homoskedasticitybull Heteroskedaticity will definitely occur when

CEF is linear but the 1205901113569(119909) = 120141[119884119894|119883119894 = 119909] is not constant in 119909 119864[119884119894|119883119894] is not linear but we use the linear regression to

approxmiate it

22 58

2 Regression andCausality

23 58

Regression and causality

bull Most econometrics textbooks regression defined withoutrespect to causality

bull But then when is ldquobiasedrdquo The above derivations work forsome 120124[119884119894|119883119894]

bull The question then is when does knowing the CEF tell ussomething about causality

bull MHE argues that a regression is causal when the CEF itapproximates is causal Identification is king

bull We will show that under certain conditions a regression of theoutcome on the treatment and the covariates can recover acausal parameter but perhaps not the one in which we areinterested

24 58

Review

bull Quick reminder we have potential outcomes 119884119894(1) and 119884119894(0)and two parameters the ATE and ATT

120591 = 119864[119884119894(1) minus 119884119894(0)]120591ATT = 119864[119884119894(1) minus 119884119894(0)|119863119894 = 1]

bull We have shown in past weeks that these effects are identifiedwhen ignorability holds MHE calls this the conditionalindependence assumption (CIA)

25 58

Linear constant effects model binarytreatment

bull Experiment with a simple experiment we can rewrite theconsistency assumption to be a regression formula

119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894

= 120124[119884119894(0)] + 120591119863119894 + (119884119894(0) minus 120124[119884119894(0)])= 1205831113567 + 120591119863119894 + 1199071113567119894

bull Note that if ignorability holds (as in an experiment) for 119884119894(0)then it will also hold for 1199071113567119894 since 120124[119884119894(0)] is constant Thusthis satifies the usual assumptions for regression

26 58

Now with covariates

bull Now assume no unmeasured confounders 119884119894(119889) โŸ‚โŸ‚ 119863119894|119883119894bull We will assume a linear model for the potential outcomes

119884119894(119889) = 120572 + 120591 sdot 119889 + 120578119894

bull Remember that linearity isnrsquot an assumption if 119863119894 is binarybull Effect of 119863119894 is constant here the 120578119894 are the only source of

individual variation and we have 119864[120578119894] = 0bull Consistency assumption allows us to write this as

119884119894 = 120572 + 120591119863119894 + 120578119894

27 58

Covariates in the error

bull Letrsquos assume that 120578119894 is linear in 119883119894 120578119894 = 119883prime119894 120574 + 120584119894

bull New error is uncorrelated with 119883119894 120124[120584119894|119883119894] = 0bull This is an assumption Might be falsebull Plug into the above

120124[119884119894(119889)|119883119894] = 119864[119884119894|119863119894 119883119894] = 120572 + 120591119863119894 + 119864[120578119894|119883119894]= 120572 + 120591119863119894 + 119883prime

119894 120574 + 119864[120584119894|119883119894]= 120572 + 120591119863119894 + 119883prime

119894 120574

28 58

Summing up regression with constanteffects

bull Reviewing the assumptions wersquove used no unmeasured confounders constant treatment effects linearity of the treatmentcovariates

bull Under these we can run the following regression to estimatethe ATE 120591

119884119894 = 120572 + 120591119863119894 + 119883prime119894 120574 + 120584119894

bull Works with continuous or ordinal 119863119894 if linearity in the effect ofthese variables is truly linear

29 58

OLS constant effects simulation

Model with linear covariates constant 0 effect of treatment

library(mvtnorm)

n lt- 100

p lt- 4

X lt- rmvnorm(n = 100 mean = rep(0 p))

gamma lt- c(274 137 137 137)

y lt- 210 + X c(gamma) + rnorm(n)

alpha lt- c(-1 05 -05 -01)

dprobs lt- bootinvlogit(X alpha)

d lt- rbinom(n size = 1 prob = dprobs)

30 58

OLS with no covariates

summary(lm(y ~ d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   220.36       4.58   48.13  < 2e-16
d             -28.69       6.54   -4.39  0.000029
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 33 on 98 degrees of freedom
Multiple R-squared: 0.164, Adjusted R-squared: 0.156
F-statistic: 19.2 on 1 and 98 DF, p-value: 0.000029

31 / 58

OLS with covariates

summary(lm(y ~ d + X))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  209.952      0.159  1320.7   <2e-16
d             -0.128      0.253    -0.5     0.62
X1            27.368      0.126   217.5   <2e-16
X2            13.677      0.114   120.1   <2e-16
X3            13.673      0.130   105.1   <2e-16
X4            13.570      0.106   128.0   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom
Multiple R-squared: 0.999, Adjusted R-squared: 0.999
F-statistic: 2.44e+04 on 5 and 94 DF, p-value: <2e-16

32 / 58

What happens with nonlinearity?

Suppose we can only observe the following covariates:

z1 <- exp(X[, 1]/2)
z2 <- X[, 2]/(1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3]/25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

Implies that Y_i and D_i are functions of log(Z_i1), Z_i2, Z_i2·Z²_i1, 1/log(Z_i1), Z_i3^(1/3)/log(Z_i1), and Z_i4^(1/2).

Regression is a nonlinear function of the observed covariates.

33 / 58

When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.8728    30.0799    0.73    0.469
d            -6.9292     3.3854   -2.05    0.043
z1           36.3110     2.6970   13.46   <2e-16
z2           -2.9033     3.6619   -0.79    0.430
z3           86.2022    43.1030    2.00    0.048
z4            0.4021     0.0329   12.23   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14 on 94 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF, p-value: <2e-16

34 / 58

3. Regression with Heterogeneous Treatment Effects

35 / 58

Heterogeneous effects, binary treatment

• Completely randomized experiment:

Y_i = D_iY_i(1) + (1 − D_i)Y_i(0)
    = Y_i(0) + (Y_i(1) − Y_i(0))D_i
    = μ₀ + τ_iD_i + (Y_i(0) − μ₀)
    = μ₀ + τD_i + (Y_i(0) − μ₀) + (τ_i − τ)·D_i
    = μ₀ + τD_i + ε_i

• Error term now includes two components:
  1. "Baseline" variation in the outcome: (Y_i(0) − μ₀)
  2. Variation in the treatment effect: (τ_i − τ)D_i
• Easy to verify that under an experiment, E[ε_i | D_i] = 0.
• Thus, OLS estimates the ATE with no covariates.

36 / 58

Adding covariates

• What happens with no unmeasured confounders? Need to condition on X_i now.
• Remember identification of the ATE/ATT using iterated expectations.
• ATE is the weighted sum of CATEs:

τ = Σ_x τ(x) P[X_i = x]

• ATE/ATT are weighted averages of CATEs.
• What about the regression estimand, τ_R? How does it relate to the ATE/ATT?

37 / 58

Heterogeneous effects and regression

• Let's investigate this under a saturated regression model:

Y_i = Σ_x B_xi·α_x + τ_R·D_i + e_i

• Use a dummy variable for each unique combination of X_i: B_xi = 1(X_i = x)
• Linear in X_i by construction.

38 / 58

Investigating the regression coefficient

• How can we investigate τ_R? Well, we can rely on the regression anatomy:

τ_R = Cov(Y_i, D_i − E[D_i | X_i]) / V(D_i − E[D_i | X_i])

• D_i − E[D_i | X_i] is the residual from a regression of D_i on the full set of dummies.
• With a little work, we can show:

τ_R = E[τ(X_i)(D_i − E[D_i | X_i])²] / E[(D_i − E[D_i | X_i])²]
    = E[τ(X_i)σ²_d(X_i)] / E[σ²_d(X_i)]

• σ²_d(x) = V[D_i | X_i = x] is the conditional variance of treatment assignment.

39 / 58
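The anatomy formula is easy to check numerically. A minimal Python sketch (the slides use R; the data-generating process here is my own illustration), using the Frisch–Waugh–Lovell logic: the coefficient on D_i from the full regression equals the coefficient from regressing Y_i on the residualized treatment:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
x = rng.binomial(1, 0.6, n)                      # one binary covariate -> model is saturated
d = rng.binomial(1, np.where(x == 1, 0.8, 0.3))  # treatment depends on x
y = 1.0 + 2.0 * x + 0.7 * d + rng.normal(0, 1, n)

# Full regression of y on (1, x, d): tau_R is the coefficient on d
Z = np.column_stack([np.ones(n), x, d]).astype(float)
coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
tau_R = coef[2]

# Regression anatomy: residualize d on the covariates, then
# tau_R = Cov(Y, d_tilde) / V(d_tilde), computed here as (d_tilde'y) / (d_tilde'd_tilde)
W = Z[:, :2]
g, *_ = np.linalg.lstsq(W, d.astype(float), rcond=None)
d_tilde = d - W @ g
tau_anatomy = (d_tilde @ y) / (d_tilde @ d_tilde)

print(tau_R, tau_anatomy)   # identical up to floating-point error
```

The two quantities agree exactly in-sample, not just asymptotically: that is the Frisch–Waugh–Lovell theorem.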

ATE versus OLS

τ_R = E[τ(X_i)W_i] = Σ_x τ(x) · (σ²_d(x) / E[σ²_d(X_i)]) · P[X_i = x]

• Compare to the ATE:

τ = E[τ(X_i)] = Σ_x τ(x) · P[X_i = x]

• Both weight strata relative to their size, P[X_i = x].
• OLS weights strata higher if the treatment variance in those strata, σ²_d(x), is high relative to the average variance across strata, E[σ²_d(X_i)].
• The ATE weights strata only by their size.

40 / 58

Regression weighting

W_i = σ²_d(X_i) / E[σ²_d(X_i)]

• Why does OLS weight like this?
• OLS is a minimum-variance estimator: more weight to more precise within-strata estimates.
• Within-strata estimates are most precise when the treatment is evenly spread, and thus has the highest variance.
• If D_i is binary, then we know the conditional variance will be:

σ²_d(x) = P[D_i = 1 | X_i = x](1 − P[D_i = 1 | X_i = x]) = e(x)(1 − e(x))

• Maximum variance with P[D_i = 1 | X_i = x] = 1/2.

41 / 58

OLS weighting example

• Binary covariate:

P[X_i = 1] = 0.75,  P[X_i = 0] = 0.25
P[D_i = 1 | X_i = 1] = 0.9,  P[D_i = 1 | X_i = 0] = 0.5
σ²_d(1) = 0.09,  σ²_d(0) = 0.25
τ(1) = 1,  τ(0) = −1

• Implies the ATE is τ = 0.5.
• Average conditional variance: E[σ²_d(X_i)] = 0.75 × 0.09 + 0.25 × 0.25 = 0.13.
• Weights: for X_i = 1, 0.09/0.13 = 0.692; for X_i = 0, 0.25/0.13 = 1.92.

τ_R = E[τ(X_i)W_i]
    = τ(1)W(1)P[X_i = 1] + τ(0)W(0)P[X_i = 0]
    = 1 × 0.692 × 0.75 + (−1) × 1.92 × 0.25
    = 0.039

42 / 58
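The example's arithmetic can be checked numerically, and a simulation shows that a saturated OLS regression actually lands on the weighted estimand. A Python sketch (the slides use R; the simulation design simply mimics the example above). Note the exact weighted value is 0.005/0.13 ≈ 0.038; the slide's 0.039 comes from rounding the weights first.

```python
import numpy as np

# Closed-form version of the slide's example
p_x = {1: 0.75, 0: 0.25}                    # P[X = x]
e = {1: 0.9, 0: 0.5}                        # P[D = 1 | X = x]
tau = {1: 1.0, 0: -1.0}                     # conditional ATEs tau(x)

var_d = {x: e[x] * (1 - e[x]) for x in p_x}            # sigma^2_d(x)
avg_var = sum(p_x[x] * var_d[x] for x in p_x)          # E[sigma^2_d(X)] = 0.13
ate = sum(p_x[x] * tau[x] for x in p_x)                # 0.5
tau_R = sum(p_x[x] * tau[x] * var_d[x] for x in p_x) / avg_var   # ~0.038

# Simulation: saturated OLS recovers tau_R, not the ATE
rng = np.random.default_rng(0)
n = 500_000
X = rng.binomial(1, 0.75, n)
D = rng.binomial(1, np.where(X == 1, 0.9, 0.5))
Y = np.where(X == 1, 1.0, -1.0) * D + rng.normal(0, 1, n)
Z = np.column_stack([np.ones(n), X, D]).astype(float)
coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)

print(ate, tau_R, coef[2])   # ATE is 0.5; both regression quantities are near 0.038
```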

When will OLS estimate the ATE?

• When does τ = τ_R?
  ◦ Constant treatment effects: τ(x) = τ = τ_R
  ◦ Constant probability of treatment: e(x) = P[D_i = 1 | X_i = x] = e. Implies that the OLS weights are 1.
• Incorrect linearity assumption in X_i will lead to more bias.

43 / 58

Other ways to use regression

• What's the path forward?
  ◦ Accept the bias (might be relatively small with saturated models)
  ◦ Use a different regression approach
• Let μ_d(x) = E[Y_i(d) | X_i = x] be the CEF for the potential outcome under D_i = d.
• By consistency and no unmeasured confounders, we have μ_d(x) = E[Y_i | D_i = d, X_i = x].
• Estimate a regression of Y_i on X_i among the D_i = d group.
• Then μ̂_d(x) is just a predicted value from the regression for X_i = x.
• How can we use this?

44 / 58

Imputation estimators

• Impute the treated potential outcomes with Ŷ_i(1) = μ̂_1(X_i).
• Impute the control potential outcomes with Ŷ_i(0) = μ̂_0(X_i).
• Procedure:
  ◦ Regress Y_i on X_i in the treated group and get predicted values for all units (treated or control).
  ◦ Regress Y_i on X_i in the control group and get predicted values for all units (treated or control).
  ◦ Take the average difference between these predicted values.
• More mathematically, this looks like:

τ̂_imp = (1/N) Σ_i [μ̂_1(X_i) − μ̂_0(X_i)]

• Sometimes called an imputation estimator.

45 / 58

Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample.
• Useful trick: use a model on the entire data and model.frame() to get the right design matrix.

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

## [1] 0.61

46 / 58

Notes on imputation estimators

• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE.
• Why don't people use this?
  ◦ Most people don't know the results we've been talking about.
  ◦ Harder to implement than vanilla OLS.
• Can use linear regression to estimate μ̂_d(x) = x'β̂_d.
• Recent trend is to estimate μ̂_d(x) via nonparametric methods, such as:
  ◦ kernel regression, local linear regression, regression trees, etc.
  ◦ Easiest is generalized additive models (GAMs).

47 / 58
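The steps above can be sketched end-to-end with a flexible fit. This is Python rather than the slides' R, a polynomial fit stands in for a GAM or kernel smoother, and the data-generating process is my own illustration (constant effect of 0.5, nonlinear baseline CEF, confounded treatment):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
x = rng.uniform(-2, 2, n)
d = rng.binomial(1, 1 / (1 + np.exp(-x)))              # treatment confounded by x
tau_true = 0.5                                         # constant effect, for a clean check
y = np.sin(x) + tau_true * d + rng.normal(0, 0.2, n)   # nonlinear control CEF

# Fit a flexible curve within each treatment group
# (a degree-5 polynomial here; GAMs, kernels, or trees play the same role)
fit1 = np.polyfit(x[d == 1], y[d == 1], deg=5)
fit0 = np.polyfit(x[d == 0], y[d == 0], deg=5)

# Impute both potential outcomes for every unit, then average the gap
tau_imp = np.mean(np.polyval(fit1, x) - np.polyval(fit0, x))

# The naive difference in means is badly confounded
naive = y[d == 1].mean() - y[d == 0].mean()

print(tau_imp, naive)   # tau_imp near 0.5; naive well above it
```

Because both groups have support over the whole range of x, each group-specific fit can be evaluated for every unit, which is exactly what the imputation estimator needs.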

Imputation estimator visualization

[Figure: scatterplot of y against x, treated and control units]

48 / 58

Imputation estimator visualization

[Figure: scatterplot of y against x, treated and control units, with within-group regression fits]

49 / 58

Imputation estimator visualization

[Figure: scatterplot of y against x, treated and control units, with fitted lines used to impute both potential outcomes for all units]

50 / 58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i:

[Figure: scatterplot of y against x with a nonlinear relationship, treated and control units]

51 / 58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i:

[Figure: scatterplot of y against x with a nonlinear relationship, treated and control units]

52 / 58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i:

[Figure: scatterplot of y against x with a nonlinear relationship, treated and control units]

53 / 58

Using semiparametric regression

• Here, the CEFs are nonlinear, but we don't know their form.
• We can use GAMs from the mgcv package for flexible estimation:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204  Scale est. = 0.0071351  n = 30

54 / 58

Using GAMs

[Figure: scatterplot of y against x with GAM fits, treated and control units]

55 / 58

Using GAMs

[Figure: scatterplot of y against x with GAM fits, treated and control units]

56 / 58

Using GAMs

[Figure: scatterplot of y against x with GAM fits used to impute both potential outcomes, treated and control units]

57 / 58

Limited dependent variables

• Usual advice: model the data from first principles.
  ◦ Logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is a-ok with limited DVs when:
  ◦ Binary treatment and no covariates (just diff-in-means)
  ◦ Binary treatment, discrete covariates, and saturated models (stratified diff-in-means)
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong.
• Even in unsaturated models, the marginal effects from OLS are often decent compared to nonlinear models.
  ◦ Could go wrong in small samples.
  ◦ If using nonlinear models, always get effects on the scale of the outcome.

58 / 58
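The first "a-ok" case is easy to verify: with a binary treatment and no covariates, the OLS slope is exactly the difference in means, even when the outcome is binary. A quick Python check on made-up data (the slides use R):

```python
import numpy as np

rng = np.random.default_rng(2)
d = rng.binomial(1, 0.4, 1_000)                    # binary treatment
y = rng.binomial(1, np.where(d == 1, 0.7, 0.3))    # binary (limited) outcome

# OLS of y on (1, d): the slope is the treatment coefficient
Z = np.column_stack([np.ones_like(d), d]).astype(float)
(alpha, beta), *_ = np.linalg.lstsq(Z, y.astype(float), rcond=None)

diff_means = y[d == 1].mean() - y[d == 0].mean()
print(beta, diff_means)   # identical up to floating-point error
```

The intercept is likewise the control-group mean, so no distributional model is being imposed here at all.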

• Agnostic Regression
• Regression and Causality
• Regression with Heterogeneous Treatment Effects
Covariates in the error

bull Letrsquos assume that 120578119894 is linear in 119883119894 120578119894 = 119883prime119894 120574 + 120584119894

bull New error is uncorrelated with 119883119894 120124[120584119894|119883119894] = 0bull This is an assumption Might be falsebull Plug into the above

120124[119884119894(119889)|119883119894] = 119864[119884119894|119863119894 119883119894] = 120572 + 120591119863119894 + 119864[120578119894|119883119894]= 120572 + 120591119863119894 + 119883prime

119894 120574 + 119864[120584119894|119883119894]= 120572 + 120591119863119894 + 119883prime

119894 120574

28 58

Summing up regression with constanteffects

bull Reviewing the assumptions wersquove used no unmeasured confounders constant treatment effects linearity of the treatmentcovariates

bull Under these we can run the following regression to estimatethe ATE 120591

119884119894 = 120572 + 120591119863119894 + 119883prime119894 120574 + 120584119894

bull Works with continuous or ordinal 119863119894 if linearity in the effect ofthese variables is truly linear

29 58

OLS constant effects simulation

Model with linear covariates constant 0 effect of treatment

library(mvtnorm)

n lt- 100

p lt- 4

X lt- rmvnorm(n = 100 mean = rep(0 p))

gamma lt- c(274 137 137 137)

y lt- 210 + X c(gamma) + rnorm(n)

alpha lt- c(-1 05 -05 -01)

dprobs lt- bootinvlogit(X alpha)

d lt- rbinom(n size = 1 prob = dprobs)

30 58

OLS with no covariates

summary(lm(y ~ d))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 22036 458 4813 lt 2e-16

d -2869 654 -439 0000029

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 33 on 98 degrees of freedom

Multiple R-squared 0164 Adjusted R-squared 0156

F-statistic 192 on 1 and 98 DF p-value 0000029

31 58

OLS with covariates

summary(lm(y ~ d + X))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 209952 0159 13207 lt2e-16

d -0128 0253 -05 062

X1 27368 0126 2175 lt2e-16

X2 13677 0114 1201 lt2e-16

X3 13673 0130 1051 lt2e-16

X4 13570 0106 1280 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 1 on 94 degrees of freedom

Multiple R-squared 0999 Adjusted R-squared 0999

F-statistic 244e+04 on 5 and 94 DF p-value lt2e-16

32 58

What happens with nonlinearity

Suppose we can only observe the following covariates

z1 lt- exp(X[ 1]2)

z2 lt- X[ 2](1 + exp(X[ 1])) + 10

z3 lt- (X[ 1] X[ 3]25 + 06)^3

z4 lt- (X[ 2] + X[ 4] + 20)^2

Implies that 119884119894 and 119863119894 are functions of log(1198851198941113568) 1198851198941113569 119885111356911989411135681198851198941113569

1 log(1198851198941113568) 1198851198941113570 log(1198851198941113568) and 119883111356811135691198941113571

Regression is a nonlinear function of the observed covariates

33 58

When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 218728 300799 073 0469

d -69292 33854 -205 0043

z1 363110 26970 1346 lt2e-16

z2 -29033 36619 -079 0430

z3 862022 431030 200 0048

z4 04021 00329 1223 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 14 on 94 degrees of freedom

Multiple R-squared 0855 Adjusted R-squared 0847

F-statistic 111 on 5 and 94 DF p-value lt2e-16

34 58

3 Regression withHeterogeneousTreatment Effects

35 58

Heterogeneous effects binary treatment

bull Completely randomized experiment

119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894

= 1205831113567 + 120591119894119863119894 + (119884119894(0) minus 1205831113567)= 1205831113567 + 120591119863119894 + (119884119894(0) minus 1205831113567) + (120591119894 minus 120591) sdot 119863119894

= 1205831113567 + 120591119863119894 + 120576119894

bull Error term now includes two components1 ldquoBaselinerdquo variation in the outcome (119884119894(0) minus 1205831113567)2 Variation in the treatment effect (120591119894 minus 120591)

bull Easy to verify that under experiment 120124[120576119894|119863119894] = 0bull Thus OLS estimates the ATE with no covariates

36 58

Adding covariates

bull What happens with no unmeasured confounders Need tocondition on 119883119894 now

bull Remember identification of the ATEATT using iteratedexpectations

bull ATE is the weighted sum of CATEs

120591 = 1114012119909120591(119909) Pr[119883119894 = 119909]

bull ATEATT are weighted averages of CATEsbull What about the regression estimand 120591119877 How does it related

to the ATEATT

37 58

Heterogeneous effects and regression

bull Letrsquos investigate this under a saturated regression model

119884119894 = 1114012119909119861119909119894120572119909 + 120591119877119863119894 + 119890119894

bull Use a dummy variable for each unique combination of 119883119894119861119909119894 = 120128(119883119894 = 119909)

bull Linear in 119883119894 by construction

38 58

Investigating the regression coefficient

bull How can we investigate 120591119877 Well we can rely on theregression anatomy

120591119877 = Cov(119884119894 119863119894 minus 119864[119863119894|119883119894])120141(119863119894 minus 119864[119863119894|119883119894])

bull 119863119894 minus 120124[119863119894|119883119894] is the residual from a regression of 119863119894 on the fullset of dummies

bull With a little work we can show

120591119877 =120124 1114094120591(119883119894)(119863119894 minus 120124[119863119894|119883119894])11135691114097

120124[(119863119894 minus 119864[119863119894|119883119894])1113569]=

120124[120591(119883119894)1205901113569119889(119883119894)]120124[1205901113569119889(119883119894)]

bull 1205901113569119889(119909) = 120141[119863119894|119883119894 = 119909] is the conditional variance of treatmentassignment

39 58

ATE versus OLS

120591119877 = 120124[120591(119883119894)119882119894] = 1114012119909120591(119909)

1205901113569119889(119909)120124[1205901113569119889(119883119894)]

โ„™[119883119894 = 119909]

bull Compare to the ATE

120591 = 120124[120591(119883119894)] = 1114012119909120591(119909)โ„™[119883119894 = 119909]

bull Both weight strata relative to their size (โ„™[119883119894 = 119909])bull OLS weights strata higher if the treatment variance in those

strata (1205901113569119889(119909)) is higher in those strata relative to the averagevariance across strata (120124[1205901113569119889(119883119894)])

bull The ATE weights only by their size

40 58

Regression weighting

W_i = σ²_d(X_i) / E[σ²_d(X_i)]

• Why does OLS weight like this?
• OLS is a minimum-variance estimator: it gives more weight to more precise within-strata estimates.
• Within-strata estimates are most precise when the treatment is evenly spread and thus has the highest variance.
• If D_i is binary, then the conditional variance is

σ²_d(x) = P[D_i = 1 | X_i = x] (1 − P[D_i = 1 | X_i = x]) = e(x)(1 − e(x))

• Maximum variance when P[D_i = 1 | X_i = x] = 1/2.

41 58

OLS weighting example

• Binary covariate:

P[X_i = 1] = 0.75, P[X_i = 0] = 0.25
P[D_i = 1 | X_i = 1] = 0.9, P[D_i = 1 | X_i = 0] = 0.5
σ²_d(1) = 0.09, σ²_d(0) = 0.25
τ(1) = 1, τ(0) = −1

• Implies the ATE is τ = 0.5.
• Average conditional variance: E[σ²_d(X_i)] = 0.13.
• Weights: for X_i = 1, 0.09/0.13 = 0.692; for X_i = 0, 0.25/0.13 = 1.92.

τ_R = E[τ(X_i) W_i]
    = τ(1) W(1) P[X_i = 1] + τ(0) W(0) P[X_i = 0]
    = 1 × 0.692 × 0.75 + (−1) × 1.92 × 0.25
    = 0.039

42 58
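The weighting arithmetic above is easy to verify by direct computation. A minimal sketch in Python (the course's own code is in R; the numbers are the slide's), computing both the ATE and the regression estimand:

```python
# OLS vs. ATE weighting for the binary-covariate example above.
p_x   = {1: 0.75, 0: 0.25}   # P[X_i = x]
e_x   = {1: 0.90, 0: 0.50}   # propensity e(x) = P[D_i = 1 | X_i = x]
tau_x = {1: 1.0,  0: -1.0}   # CATEs tau(x)

# Conditional treatment variance and its average across strata
var_d   = {x: e * (1 - e) for x, e in e_x.items()}   # sigma^2_d(x)
avg_var = sum(var_d[x] * p_x[x] for x in p_x)        # E[sigma^2_d(X_i)] = 0.13

# ATE weights strata by size only; the regression estimand also
# reweights by relative treatment variance.
tau_ate = sum(tau_x[x] * p_x[x] for x in p_x)
tau_reg = sum(tau_x[x] * (var_d[x] / avg_var) * p_x[x] for x in p_x)
```

Here `tau_ate` comes out to 0.5 while `tau_reg` is about 0.038, matching the slide's 0.039 up to the rounding of the weights.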

When will OLS estimate the ATE

• When does τ = τ_R?
• Constant treatment effects: τ(x) = τ = τ_R.
• Constant probability of treatment: e(x) = P[D_i = 1 | X_i = x] = e, which implies that the OLS weights are all 1.
• An incorrect linearity assumption in X_i will add further bias on top of this.

43 58

Other ways to use regression

• What's the path forward?
  - Accept the bias (might be relatively small with saturated models).
  - Use a different regression approach.
• Let μ_d(x) = E[Y_i(d) | X_i = x] be the CEF for the potential outcome under D_i = d.
• By consistency and no unmeasured confounders, we have μ_d(x) = E[Y_i | D_i = d, X_i = x].
• Estimate a regression of Y_i on X_i among the D_i = d group.
• Then μ̂_d(x) is just a predicted value from that regression at X_i = x.
• How can we use this?

44 58

Imputation estimators

• Impute the treated potential outcomes with Ŷ_i(1) = μ̂_1(X_i).
• Impute the control potential outcomes with Ŷ_i(0) = μ̂_0(X_i).
• Procedure:
  - Regress Y_i on X_i in the treated group and get predicted values for all units (treated or control).
  - Regress Y_i on X_i in the control group and get predicted values for all units (treated or control).
  - Take the average difference between these predicted values.
• More mathematically, it looks like this:

τ̂_imp = (1/N) Σ_i [μ̂_1(X_i) − μ̂_0(X_i)]

• Sometimes called an imputation estimator.

45 58
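With saturated (stratum-by-stratum) fits, the imputation estimator reduces to a stratified difference in means, which a toy example can illustrate. A sketch in Python (the data are hypothetical; the lecture's implementation is in R), where μ̂_d(x) is simply the within-group, within-stratum mean:

```python
# Toy data: tuples of (x, d, y); mu_hat(d, x) is the mean of y among
# units with D = d and X = x (a saturated within-group fit).
data = [(0, 0, 1.0), (0, 0, 2.0), (0, 1, 3.0),
        (1, 0, 2.0), (1, 1, 5.0), (1, 1, 6.0)]

def mu_hat(d, x):
    ys = [y for (xx, dd, y) in data if dd == d and xx == x]
    return sum(ys) / len(ys)

# Impute both potential outcomes for every unit, then average the difference
tau_imp = sum(mu_hat(1, x) - mu_hat(0, x) for (x, d, y) in data) / len(data)
```

Here mu_hat(1, 0) − mu_hat(0, 0) = 1.5 and mu_hat(1, 1) − mu_hat(0, 1) = 3.5, so with equal stratum sizes tau_imp is 2.5, exactly the stratified difference in means.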

Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample.
• Useful trick: use a model on the entire data and model.frame() to get the right design matrix.

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61

46 58

Notes on imputation estimators

• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE.
• Why don't people use this?
  - Most people don't know the results we've been talking about.
  - Harder to implement than vanilla OLS.
• Can use linear regression to estimate μ̂_d(x) = x′β̂_d.
• Recent trend is to estimate μ̂_d(x) via non-parametric methods such as kernel regression, local linear regression, regression trees, etc. Easiest is generalized additive models (GAMs).

47 58

Imputation estimator visualization

[Figure: scatterplot of y against x, with separate treated and control regression fits used to impute the missing potential outcomes]

48 58


Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i:

[Figure: scatterplot of y against x with nonlinear treated and control CEFs]

51 58


Using semiparametric regression

• Here, the CEFs are nonlinear, but we don't know their form.
• We can use GAMs from the mgcv package for a flexible estimate:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204  Scale est. = 0.0071351  n = 30

54 58

Using GAMs

[Figure: scatterplot of y against x with GAM fits for the treated and control groups]

55 58


Limited dependent variables

• Usual advice: model the data from first principles. Logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is a-ok with limited DVs when:
  - binary treatment and no covariates (just the difference in means);
  - binary treatment, discrete covariates, and a saturated model (stratified difference in means).
• Imposing a model on LDVs in these cases imposes a distributional assumption, which could be wrong.
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models.
  - Could go wrong in small samples.
  - If using nonlinear models, always get effects on the scale of the outcome.

58 58

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 7: Gov 2002: 7. Regression and Causality

Justifying linear regression

• Define linear regression:

β = argmin_b E[(Y_i − X_i′b)²]

• The solution to this is the following:

β = E[X_i X_i′]^{-1} E[X_i Y_i]

• Note that this is the population coefficient vector, not the estimator yet.

7 58

Regression anatomy

• Consider simple linear regression:

(α, β) = argmin_{a,b} E[(Y_i − a − b X_i)²]

• In this case, we can write the population/true slope β as

β = E[X_i X_i′]^{-1} E[X_i Y_i] = Cov(Y_i, X_i) / V[X_i]

• With more covariates, β is more complicated, but we can still write it like this.
• Let X̃_ki be the residual from a regression of X_ki on all the other independent variables. Then β_k, the coefficient for X_ki, is

β_k = Cov(Y_i, X̃_ki) / V(X̃_ki)

8 58
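The anatomy formula can be checked numerically: the multiple-regression slope on one regressor equals Cov/Var computed on its residualized version (the Frisch-Waugh result). A sketch in Python with simulated data (the variable names and data-generating process are mine; the course's own code is in R):

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

random.seed(1)
n = 500
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.5 * a + random.gauss(0, 1) for a in x1]           # correlated covariates
y  = [1 + 2 * a + 3 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

# Direct two-regressor OLS slope on x1, from the (centered) normal equations
det = cov(x1, x1) * cov(x2, x2) - cov(x1, x2) ** 2
beta1 = (cov(y, x1) * cov(x2, x2) - cov(y, x2) * cov(x1, x2)) / det

# Regression anatomy: residualize x1 on x2 (with intercept), then Cov/Var
b = cov(x1, x2) / cov(x2, x2)
a = mean(x1) - b * mean(x2)
x1_tilde = [u - (a + b * v) for u, v in zip(x1, x2)]
beta1_anatomy = cov(y, x1_tilde) / cov(x1_tilde, x1_tilde)
```

The two quantities agree exactly in-sample, and both sit near the true coefficient of 2.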

Justification 1: Linear CEFs

• Justification 1: if the CEF is linear, the population regression function is it. That is, if E[Y_i | X_i] = X_i′b, then b = β.
• When would we expect the CEF to be linear? Two cases:
  1. Outcome and covariates are multivariate normal.
  2. The linear regression model is saturated.
• A model is saturated if there are as many parameters as there are possible combinations of the X_i variables.

9 58

Saturated model example

• Two binary variables: X_1i for incumbency status and X_2i for party of the candidate.
• Four possible values of X_i, four possible values of μ(X_i):

E[Y_i | X_1i = 0, X_2i = 0] = α
E[Y_i | X_1i = 1, X_2i = 0] = α + β
E[Y_i | X_1i = 0, X_2i = 1] = α + γ
E[Y_i | X_1i = 1, X_2i = 1] = α + β + γ + δ

• We can write the CEF as follows:

E[Y_i | X_1i, X_2i] = α + β X_1i + γ X_2i + δ (X_1i X_2i)

10 58
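The four equations above can be inverted: given any four cell means, the saturated coefficients reproduce them exactly, so linearity costs nothing. A small Python sketch with made-up cell means (the values are hypothetical):

```python
# Hypothetical strata means for the 2x2 saturated CEF
cell = {(0, 0): 10.0, (1, 0): 14.0, (0, 1): 11.0, (1, 1): 19.0}

# Solve the four CEF equations for the coefficients
alpha = cell[(0, 0)]
beta  = cell[(1, 0)] - alpha
gamma = cell[(0, 1)] - alpha
delta = cell[(1, 1)] - alpha - beta - gamma

def cef(x1, x2):
    # the saturated linear-in-parameters CEF
    return alpha + beta * x1 + gamma * x2 + delta * x1 * x2

# The model matches every cell mean exactly: linearity holds mechanically
errors = [abs(cef(x1, x2) - m) for (x1, x2), m in cell.items()]
```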

Saturated models example

E[Y_i | X_1i, X_2i] = α + β X_1i + γ X_2i + δ (X_1i X_2i)

• Basically, each value of μ(X_i) is being estimated separately:
  - within-strata estimation;
  - no borrowing of information from across values of X_i.
• Requires a set of dummies for each categorical variable plus all interactions.
• Or a series of dummies for each unique combination of X_i.
• This makes linearity hold mechanically, and so linearity is not an assumption. Just a fact about saturated CEFs.
• Saturated models for limited dependent variables = A-OK.

11 58

Saturated model example

• Washington (AER) data on the effects of daughters.
• We'll look at the relationship between voting and number of kids (causal).

girls <- foreign::read.dta("girls.dta")
head(girls[, c("name", "totchi", "aauw")])

               name totchi aauw
1  ABERCROMBIE NEIL      0  100
2   ACKERMAN GARY L      3   88
3 ADERHOLT ROBERT B      0    0
4    ALLEN THOMAS H      2  100
5  ANDREWS ROBERT E      2  100
6         ARCHER WR      7    0

12 58

Linear model

summary(lm(aauw ~ totchi, data = girls))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    61.31       1.81   33.81   <2e-16
totchi         -5.33       0.62   -8.59   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 42 on 1733 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.0408, Adjusted R-squared: 0.0403
F-statistic: 73.8 on 1 and 1733 DF, p-value: <2e-16

13 58

Saturated model

summary(lm(aauw ~ as.factor(totchi), data = girls))

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)             56.41       2.76   20.42  < 2e-16
as.factor(totchi)1       5.45       4.11    1.33   0.1851
as.factor(totchi)2      -3.80       3.27   -1.16   0.2454
as.factor(totchi)3     -13.65       3.45   -3.95  8.1e-05
as.factor(totchi)4     -19.31       4.01   -4.82  1.6e-06
as.factor(totchi)5     -15.46       4.85   -3.19   0.0015
as.factor(totchi)6     -33.59      10.42   -3.22   0.0013
as.factor(totchi)7     -17.13      11.41   -1.50   0.1336
as.factor(totchi)8     -55.33      12.28   -4.51  7.0e-06
as.factor(totchi)9     -50.41      24.08   -2.09   0.0364
as.factor(totchi)10    -53.41      20.90   -2.56   0.0107
as.factor(totchi)12    -56.41      41.53   -1.36   0.1745
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41 on 1723 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.0506, Adjusted R-squared: 0.0446
F-statistic: 8.36 on 11 and 1723 DF, p-value: 1.84e-14

14 58

Saturated model minus the constant

summary(lm(aauw ~ as.factor(totchi) - 1, data = girls))

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
as.factor(totchi)0     56.41       2.76   20.42   <2e-16
as.factor(totchi)1     61.86       3.05   20.31   <2e-16
as.factor(totchi)2     52.62       1.75   30.13   <2e-16
as.factor(totchi)3     42.76       2.07   20.62   <2e-16
as.factor(totchi)4     37.11       2.90   12.79   <2e-16
as.factor(totchi)5     40.95       3.99   10.27   <2e-16
as.factor(totchi)6     22.82      10.05    2.27   0.0233
as.factor(totchi)7     39.29      11.07    3.55   0.0004
as.factor(totchi)8      1.08      11.96    0.09   0.9278
as.factor(totchi)9      6.00      23.92    0.25   0.8020
as.factor(totchi)10     3.00      20.72    0.14   0.8849
as.factor(totchi)12     0.00      41.43    0.00   1.0000
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41 on 1723 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.587, Adjusted R-squared: 0.584
F-statistic: 204 on 12 and 1723 DF, p-value: <2e-16

15 58

Compare to within-strata means

• The saturated model makes no assumptions about the between-strata relationships.
• It just calculates within-strata means.

c1 <- coef(lm(aauw ~ as.factor(totchi) - 1, data = girls))
c2 <- with(girls, tapply(aauw, totchi, mean, na.rm = TRUE))
rbind(c1, c2)

    0  1  2  3  4  5  6  7  8 9 10 12
c1 56 62 53 43 37 41 23 39 11 6  3  0
c2 56 62 53 43 37 41 23 39 11 6  3  0

16 58

Other justifications for OLS

• Justification 2: X_i′β is the best linear predictor (in a mean-squared error sense) of Y_i. Why? β = argmin_b E[(Y_i − X_i′b)²].
• Justification 3: X_i′β provides the minimum mean squared error linear approximation to E[Y_i | X_i].
• Even if the CEF is not linear, a linear regression provides the best linear approximation to that CEF.
• You don't need to believe the assumptions (linearity) in order to use regression as a good approximation to the CEF.
• Warning: if the CEF is very nonlinear, then this approximation could be terrible.

17 58

The error terms

• Let's define the error term e_i ≡ Y_i − X_i′β, so that

Y_i = X_i′β + [Y_i − X_i′β] = X_i′β + e_i

• Note that the residual e_i is uncorrelated with X_i:

E[X_i e_i] = E[X_i (Y_i − X_i′β)]
           = E[X_i Y_i] − E[X_i X_i′ β]
           = E[X_i Y_i] − E[X_i X_i′ E[X_i X_i′]^{-1} E[X_i Y_i]]
           = E[X_i Y_i] − E[X_i X_i′] E[X_i X_i′]^{-1} E[X_i Y_i]
           = E[X_i Y_i] − E[X_i Y_i] = 0

• No assumptions on the linearity of E[Y_i | X_i].

18 58
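The same orthogonality holds in-sample for OLS residuals, again with no linearity of the CEF. A sketch in Python (simulated data with a deliberately nonlinear CEF; the setup is mine) fits a bivariate least-squares line and checks that the residuals are uncorrelated with the regressors:

```python
import random

def mean(v):
    return sum(v) / len(v)

random.seed(2)
n = 400
x = [random.gauss(0, 1) for _ in range(n)]
# The CEF is quadratic, not linear -- orthogonality still holds by construction
y = [u * u + random.gauss(0, 1) for u in x]

# Bivariate least squares: intercept a and slope b
mx, my = mean(x), mean(y)
b = sum((u - mx) * (v - my) for u, v in zip(x, y)) / sum((u - mx) ** 2 for u in x)
a = my - b * mx

e = [v - (a + b * u) for u, v in zip(x, y)]
sum_e  = sum(e)                              # ~0: orthogonal to the constant
sum_xe = sum(u * r for u, r in zip(x, e))    # ~0: orthogonal to x
```

Both sums are zero up to floating-point error, even though the fitted line badly misspecifies the quadratic CEF.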

OLS estimator

• We know the population value of β is

β = E[X_i X_i′]^{-1} E[X_i Y_i]

• How do we get an estimator of this?
• Plug-in principle: replace the population expectations with their sample versions:

β̂ = [ (1/N) Σ_i X_i X_i′ ]^{-1} (1/N) Σ_i X_i Y_i

• If you work through the matrix algebra, this turns out to be

β̂ = (X′X)^{-1} X′y

19 58
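For the bivariate case X_i = (1, x_i)′, the plug-in formula is a 2×2 linear system that can be solved by hand. A sketch in Python with made-up data (hypothetical numbers, chosen only for illustration):

```python
# Plug-in OLS for X_i = (1, x_i)': solve (sum_i X_i X_i') beta = sum_i X_i y_i
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1]
n = len(x)

sx, sy = sum(x), sum(y)
sxx = sum(u * u for u in x)
sxy = sum(u * v for u, v in zip(x, y))

det = n * sxx - sx * sx                   # determinant of sum_i X_i X_i'
intercept = (sxx * sy - sx * sxy) / det
slope     = (n * sxy - sx * sy) / det     # comes out near 0.99 here
```

The slope agrees exactly with the Cov/Var formula from the regression-anatomy slide, since both are the same normal-equations solution.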

Asymptotic OLS inference

• With this representation in hand, we can write the OLS estimator as follows:

β̂ = β + [ Σ_i X_i X_i′ ]^{-1} Σ_i X_i e_i

• Core idea: Σ_i X_i e_i is a sum of random variables, so the CLT applies.
• That, plus some simple asymptotic theory, allows us to say:

√N (β̂ − β) →d N(0, Ω)

• Converges in distribution to a Normal distribution with mean vector 0 and covariance matrix Ω:

Ω = E[X_i X_i′]^{-1} E[X_i X_i′ e_i²] E[X_i X_i′]^{-1}

• No linearity assumption needed.

20 58

Estimating the variance

• In large samples, then,

√N (β̂ − β) ∼ N(0, Ω)

• How to estimate Ω? Plug-in principle again:

Ω̂ = [ Σ_i X_i X_i′ ]^{-1} [ Σ_i X_i X_i′ ê_i² ] [ Σ_i X_i X_i′ ]^{-1}

• Replace e_i with its empirical counterpart, the residual ê_i = Y_i − X_i′β̂.
• Replace the population moments of X_i with their sample counterparts.
• The square roots of the diagonal of this covariance matrix are the "robust" or Huber-White standard errors that Stata commonly reports.

21 58
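With one regressor plus an intercept, Ω̂ is a product of 2×2 matrices that can be written out by hand. A Python sketch with toy numbers (the x values and squared residuals are hypothetical); as a sanity check, when every squared residual equals the same constant c, the "meat" becomes c·Σ_i X_i X_i′ and the sandwich collapses to c·(Σ_i X_i X_i′)^{-1}:

```python
def inv2(m):
    # inverse of a 2x2 matrix
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mul2(p, q):
    # product of two 2x2 matrices
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

x  = [0.0, 1.0, 2.0, 3.0]
e2 = [0.5, 0.5, 0.5, 0.5]   # squared residuals e_hat_i^2 (constant toy values)

# bread: (sum_i X_i X_i')^{-1} for X_i = (1, x_i)'
xtx = [[len(x), sum(x)], [sum(x), sum(u * u for u in x)]]
bread = inv2(xtx)
# meat: sum_i X_i X_i' e_hat_i^2
meat = [[sum(e2), sum(e * u for e, u in zip(e2, x))],
        [sum(e * u for e, u in zip(e2, x)), sum(e * u * u for e, u in zip(e2, x))]]

omega_hat = mul2(mul2(bread, meat), bread)   # the sandwich
```

With heteroskedastic ê_i² the same code gives the genuine robust covariance; the constant-residual case only serves to make the algebra checkable by eye.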

Heteroskedasticity

• No assumptions of homoskedasticity.
• Heteroskedasticity will definitely occur when:
  - the CEF is linear, but σ²(x) = V[Y_i | X_i = x] is not constant in x;
  - E[Y_i | X_i] is not linear, but we use linear regression to approximate it.

22 58

2. Regression and Causality

23 58

Regression and causality

• Most econometrics textbooks: regression is defined without respect to causality.
• But then when is β̂ "biased"? The above derivations work for some CEF E[Y_i | X_i].
• The question, then, is: when does knowing the CEF tell us something about causality?
• MHE argues that a regression is causal when the CEF it approximates is causal. Identification is king.
• We will show that, under certain conditions, a regression of the outcome on the treatment and the covariates can recover a causal parameter, but perhaps not the one in which we are interested.

24 58

Review

• Quick reminder: we have potential outcomes Y_i(1) and Y_i(0) and two parameters, the ATE and the ATT:

τ = E[Y_i(1) − Y_i(0)]
τ_ATT = E[Y_i(1) − Y_i(0) | D_i = 1]

• We have shown in past weeks that these effects are identified when ignorability holds. MHE calls this the conditional independence assumption (CIA).

25 58

Linear constant effects model, binary treatment

• Start with a simple experiment; we can rewrite the consistency assumption as a regression formula:

Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
    = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
    = E[Y_i(0)] + τ D_i + (Y_i(0) − E[Y_i(0)])
    = μ_0 + τ D_i + v_0i

• Note that if ignorability holds (as in an experiment) for Y_i(0), then it will also hold for v_0i, since E[Y_i(0)] is constant. Thus this satisfies the usual assumptions for regression.

26 58

Now with covariates

• Now assume no unmeasured confounders: Y_i(d) ⊥ D_i | X_i.
• We will assume a linear model for the potential outcomes:

Y_i(d) = α + τ·d + η_i

• Remember that linearity isn't an assumption if D_i is binary.
• The effect of D_i is constant here; the η_i are the only source of individual variation, and we have E[η_i] = 0.
• The consistency assumption allows us to write this as

Y_i = α + τ D_i + η_i

27 58

Covariates in the error

• Let's assume that η_i is linear in X_i: η_i = X_i′γ + ν_i.
• The new error is uncorrelated with X_i: E[ν_i | X_i] = 0.
• This is an assumption. It might be false!
• Plug into the above:

E[Y_i(d) | X_i] = E[Y_i | D_i, X_i] = α + τ D_i + E[η_i | X_i]
                = α + τ D_i + X_i′γ + E[ν_i | X_i]
                = α + τ D_i + X_i′γ

28 58

Summing up: regression with constant effects

• Reviewing the assumptions we've used:
  - no unmeasured confounders;
  - constant treatment effects;
  - linearity of the treatment/covariates.
• Under these, we can run the following regression to estimate the ATE, τ:

Y_i = α + τ D_i + X_i′γ + ν_i

• Works with continuous or ordinal D_i if the effect of these variables is truly linear.

29 58

OLS constant effects simulation

## model with linear covariates, constant 0 effect of treatment
library(mvtnorm)
n <- 100
p <- 4
X <- rmvnorm(n = 100, mean = rep(0, p))
gamma <- c(27.4, 13.7, 13.7, 13.7)
y <- 210 + X %*% c(gamma) + rnorm(n)
alpha <- c(-1, 0.5, -0.5, -0.1)
d.probs <- boot::inv.logit(X %*% alpha)
d <- rbinom(n, size = 1, prob = d.probs)

30 58

OLS with no covariates

summary(lm(y ~ d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   220.36       4.58   48.13  < 2e-16
d             -28.69       6.54   -4.39 0.000029
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 33 on 98 degrees of freedom
Multiple R-squared: 0.164, Adjusted R-squared: 0.156
F-statistic: 19.2 on 1 and 98 DF, p-value: 0.000029

31 58

OLS with covariates

summary(lm(y ~ d + X))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 209.952      0.159  1320.7   <2e-16
d            -0.128      0.253    -0.5     0.62
X1           27.368      0.126   217.5   <2e-16
X2           13.677      0.114   120.1   <2e-16
X3           13.673      0.130   105.1   <2e-16
X4           13.570      0.106   128.0   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom
Multiple R-squared: 0.999, Adjusted R-squared: 0.999
F-statistic: 2.44e+04 on 5 and 94 DF, p-value: <2e-16

32 58

What happens with nonlinearity

Suppose we can only observe the following covariates:

z1 <- exp(X[, 1]/2)
z2 <- X[, 2]/(1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3]/25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

Implies that Y_i and D_i are functions of log(Z_i1), Z_i2, Z_i1² Z_i2, Z_i3^{1/3}/log(Z_i1), and Z_i4^{1/2}.

Regression is a nonlinear function of the observed covariates.

33 58

When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 218.728     300.799    0.73    0.469
d           -69.292      33.854   -2.05    0.043
z1          363.110      26.970   13.46   <2e-16
z2          -29.033      36.619   -0.79    0.430
z3          862.022     431.030    2.00    0.048
z4            0.4021      0.0329  12.23   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14 on 94 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF, p-value: <2e-16

34 58

3. Regression with Heterogeneous Treatment Effects

35 58

Heterogeneous effects binary treatment

• Completely randomized experiment:

Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
    = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
    = μ_0 + τ_i D_i + (Y_i(0) − μ_0)
    = μ_0 + τ D_i + (Y_i(0) − μ_0) + (τ_i − τ)·D_i
    = μ_0 + τ D_i + ε_i

• The error term now includes two components:
  1. "Baseline" variation in the outcome: (Y_i(0) − μ_0).
  2. Variation in the treatment effect: (τ_i − τ) D_i.
• Easy to verify that under an experiment, E[ε_i | D_i] = 0.
• Thus OLS estimates the ATE with no covariates.

36 58
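The claim that E[ε_i | D_i] = 0 under randomization can be illustrated by brute force over a small hypothetical population: since D_i is independent of the potential outcomes, the conditional mean of ε_i given either treatment status is just a population average, which is zero by construction. A Python sketch (the potential outcomes are made up):

```python
# Small population with heterogeneous effects tau_i = y1[i] - y0[i]
y0 = [1.0, 2.0, 3.0, 4.0]
y1 = [2.0, 2.5, 6.0, 5.5]          # tau_i = 1.0, 0.5, 3.0, 1.5

mu0 = sum(y0) / len(y0)
tau = sum(a - b for a, b in zip(y1, y0)) / len(y0)   # ATE = 1.5

def eps(i, d):
    # eps_i = (Y_i(0) - mu0) + (tau_i - tau) * D_i, evaluated at D_i = d
    tau_i = y1[i] - y0[i]
    return (y0[i] - mu0) + (tau_i - tau) * d

# Under independence of D and the potential outcomes, these population
# averages are E[eps | D = 1] and E[eps | D = 0]; both are zero.
mean_eps_treated = sum(eps(i, 1) for i in range(4)) / 4
mean_eps_control = sum(eps(i, 0) for i in range(4)) / 4
```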

Adding covariates

• What happens with no unmeasured confounders? We need to condition on X_i now.
• Remember identification of the ATE/ATT using iterated expectations.
• The ATE is the weighted sum of the CATEs:

τ = Σ_x τ(x) P[X_i = x]

• The ATE and ATT are weighted averages of CATEs.
• What about the regression estimand, τ_R? How does it relate to the ATE/ATT?

37 58

Heterogeneous effects and regression

• Let's investigate this under a saturated regression model:

Y_i = Σ_x B_xi α_x + τ_R D_i + e_i

• Use a dummy variable for each unique combination of X_i: B_xi = 1(X_i = x).
• Linear in X_i by construction.

38 58

Investigating the regression coefficient

bull How can we investigate 120591119877 Well we can rely on theregression anatomy

120591119877 = Cov(119884119894 119863119894 minus 119864[119863119894|119883119894])120141(119863119894 minus 119864[119863119894|119883119894])

bull 119863119894 minus 120124[119863119894|119883119894] is the residual from a regression of 119863119894 on the fullset of dummies

bull With a little work we can show

120591119877 =120124 1114094120591(119883119894)(119863119894 minus 120124[119863119894|119883119894])11135691114097

120124[(119863119894 minus 119864[119863119894|119883119894])1113569]=

120124[120591(119883119894)1205901113569119889(119883119894)]120124[1205901113569119889(119883119894)]

bull 1205901113569119889(119909) = 120141[119863119894|119883119894 = 119909] is the conditional variance of treatmentassignment

39 58

ATE versus OLS

120591119877 = 120124[120591(119883119894)119882119894] = 1114012119909120591(119909)

1205901113569119889(119909)120124[1205901113569119889(119883119894)]

โ„™[119883119894 = 119909]

bull Compare to the ATE

120591 = 120124[120591(119883119894)] = 1114012119909120591(119909)โ„™[119883119894 = 119909]

bull Both weight strata relative to their size (โ„™[119883119894 = 119909])bull OLS weights strata higher if the treatment variance in those

strata (1205901113569119889(119909)) is higher in those strata relative to the averagevariance across strata (120124[1205901113569119889(119883119894)])

bull The ATE weights only by their size

40 58

Regression weighting

119882119894 =1205901113569119889(119883119894)

120124[1205901113569119889(119883119894)]

bull Why does OLS weight like thisbull OLS is a minimum-variance estimator more weight to

more precise within-strata estimatesbull Within-strata estimates are most precise when the treatment

is evenly spread and thus has the highest variancebull If 119863119894 is binary then we know the conditional variance will be

1205901113569119889(119909) = โ„™[119863119894 = 1|119883119894 = 119909] (1 minus โ„™[119863119894 = 1|119883119894 = 119909])= 119890(119909) (1 minus 119890(119909))

bull Maximum variance with โ„™[119863119894 = 1|119883119894 = 119909] = 12

41 58

OLS weighting examplebull Binary covariate

โ„™[119883119894 = 1] = 075 โ„™[119883119894 = 0] = 025โ„™[119863119894 = 1|119883119894 = 1] = 09 โ„™[119863119894 = 1|119883119894 = 0] = 05

1205901113569119889(1) = 009 1205901113569119889(0) = 025120591(1) = 1 120591(0) = minus1

bull Implies the ATE is 120591 = 05bull Average conditional variance 120124[1205901113569119889(119883119894)] = 013bull weights for 119883119894 = 1 are 009013 = 0692 for 119883119894 = 0

025013 = 192120591119877 = 120124[120591(119883119894)119882119894]

= 120591(1)119882(1)โ„™[119883119894 = 1] + 120591(0)119882(0)โ„™[119883119894 = 0]= 1 times 0692 times 075 + minus1 times 192 times 025= 0039

42 58

When will OLS estimate the ATE

bull When does 120591 = 120591119877bull Constant treatment effects 120591(119909) = 120591 = 120591119877bull Constant probability of treatment 119890(119909) = โ„™[119863119894 = 1|119883119894 = 119909] = 119890

Implies that the OLS weights are 1

bull Incorrect linearity assumption in 119883119894 will lead to more bias

43 58

Other ways to use regression

bull Whatrsquos the path forward Accept the bias (might be relatively small with saturated

models) Use a different regression approach

bull Let 120583119889(119909) = 120124[119884119894(119889)|119883119894 = 119909] be the CEF for the potentialoutcome under 119863119894 = 119889

bull By consistency and nuc we have 120583119889(119909) = 120124[119884119894|119863119894 = 119889119883119894 = 119909]bull Estimate a regression of 119884119894 on 119883119894 among the 119863119894 = 119889 groupbull Then 119889(119909) is just a predicted value from the regression for

119883119894 = 119909bull How can we use this

44 58

Imputation estimators

bull Impute the treated potential outcomes with 1114023119884119894(1) = 1113568(119883119894)bull Impute the control potential outcomes with 1114023119884119894(0) = 1113567(119883119894)bull Procedure

Regress 119884119894 on 119883119894 in the treated group and get predicted valuesfor all units (treated or control)

Regress 119884119894 on 119883119894 in the control group and get predicted valuesfor all units (treated or control)

Take the average difference between these predicted values

bull More mathematically look like this

120591119894119898119901 =1119873

11140121198941113568(119883119894) minus 1113567(119883119894)

bull Sometimes called an imputation estimator

45 58

Simple imputation estimator

bull Use predict() from the within-group models on the data fromthe entire sample

bull Useful trick use a model on the entire data andmodelframe() to get the right design matrix

heterogeneous effects

yhet lt- ifelse(d == 1 y + rnorm(n 0 5) y)

mod lt- lm(yhet ~ d + X)

mod1 lt- lm(yhet ~ X subset = d == 1)

mod0 lt- lm(yhet ~ X subset = d == 0)

y1imps lt- predict(mod1 modelframe(mod))

y0imps lt- predict(mod0 modelframe(mod))

mean(y1imps - y0imps)

[1] 061

46 58

Notes on imputation estimators

bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE

bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS

bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods

such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)

47 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

48 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

49 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

50 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

51 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

52 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

53 58

Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible

estimatelibrary(mgcv)

mod0 lt- gam(y ~ s(x) subset = d == 0)

summary(mod0)

Family gaussian

Link function identity

Formula

y ~ s(x)

Parametric coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) -00225 00154 -146 016

Approximate significance of smooth terms

edf Refdf F p-value

s(x) 603 708 413 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

R-sq(adj) = 0909 Deviance explained = 928

GCV = 00093204 Scale est = 00071351 n = 30

54 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

55 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

56 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

57 58

Limited dependent variables

• Usual advice: model the data from first principles: logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is a-ok with limited DVs when:
  ▶ Binary treatment and no covariates (just diff-in-means)
  ▶ Binary treatment, discrete covariates, and saturated models (stratified diff-in-means)
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models
  ▶ Could go wrong in small samples
  ▶ If using nonlinear models, always get effects on the scale of the outcome

58/58
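The stratified diff-in-means mentioned above can be computed directly: estimate the difference in means within each stratum of the covariate, then average, weighting by stratum size. A pure-Python sketch (the slides use R; the data here are made up for illustration):

```python
# Stratified difference in means: what a saturated OLS model with a
# binary treatment and one discrete covariate estimates, computed
# directly. Data are made up for illustration.

# (x, d, y) triples: covariate stratum, treatment, outcome
data = [
    (0, 1, 5.0), (0, 1, 6.0), (0, 0, 2.0), (0, 0, 3.0),
    (1, 1, 9.0), (1, 1, 8.0), (1, 1, 10.0), (1, 0, 4.0), (1, 0, 5.0),
]

strata = sorted({x for x, _, _ in data})
n = len(data)

tau_strat = 0.0
for s in strata:
    y1 = [y for x, d, y in data if x == s and d == 1]
    y0 = [y for x, d, y in data if x == s and d == 0]
    n_s = sum(1 for x, _, _ in data if x == s)
    cate = sum(y1) / len(y1) - sum(y0) / len(y0)  # within-stratum diff
    tau_strat += cate * (n_s / n)                 # weight by stratum share

print(tau_strat)
```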

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Regression anatomy

• Consider simple linear regression:

  (α, β) = argmin_{a,b} E[(Y_i − a − b·X_i)²]

• In this case, we can write the population/true slope β as:

  β = E[X_i X_i']⁻¹ E[X_i Y_i] = Cov(Y_i, X_i) / V[X_i]

• With more covariates, β is more complicated, but we can still write it like this
• Let X̃_ki be the residual from a regression of X_ki on all the other independent variables. Then β_k, the coefficient for X_ki, is:

  β_k = Cov(Y_i, X̃_ki) / V(X̃_ki)

8/58
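As a quick check of the anatomy formula, here is a pure-Python sketch (the slides use R; the data below are made up): the multiple-regression coefficient on X_1 equals the simple-regression slope of Y on the residual of X_1 after partialling out X_2.

```python
# Regression anatomy (Frisch-Waugh-Lovell) check with two regressors.
# All quantities computed from scratch; data are arbitrary.

def mean(v):
    return sum(v) / len(v)

def s_xy(x, y):
    """Sum of demeaned cross-products (unnormalized covariance)."""
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
y  = [3.1, 3.9, 8.2, 8.8, 13.0]

# Multiple regression slope on x1: solve the 2x2 normal equations
# [S11 S12][b1]   [S1y]
# [S12 S22][b2] = [S2y]
S11, S22, S12 = s_xy(x1, x1), s_xy(x2, x2), s_xy(x1, x2)
S1y, S2y = s_xy(x1, y), s_xy(x2, y)
det = S11 * S22 - S12 ** 2
b1 = (S22 * S1y - S12 * S2y) / det

# Regression anatomy: residualize x1 on x2 (with intercept), then
# b1 equals Cov(y, x1_tilde) / V(x1_tilde).
g = S12 / S22                      # slope of x1 on x2
m1, m2 = mean(x1), mean(x2)
x1_tilde = [(a - m1) - g * (b - m2) for a, b in zip(x1, x2)]
b1_anatomy = s_xy(x1_tilde, y) / s_xy(x1_tilde, x1_tilde)

print(b1, b1_anatomy)  # the two agree to machine precision
```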

Justification 1: Linear CEFs

• Justification 1: if the CEF is linear, the population regression function is it. That is, if E[Y_i|X_i] = X_i'b, then b = β
• When would we expect the CEF to be linear? Two cases:
  1. Outcome and covariates are multivariate normal
  2. Linear regression model is saturated
• A model is saturated if there are as many parameters as there are possible combinations of the X_i variables

9/58

Saturated model example

• Two binary variables: X_1i for incumbency status and X_2i for party of the candidate
• Four possible values of X_i, four possible values of μ(X_i):

  E[Y_i | X_1i = 0, X_2i = 0] = α
  E[Y_i | X_1i = 1, X_2i = 0] = α + β
  E[Y_i | X_1i = 0, X_2i = 1] = α + γ
  E[Y_i | X_1i = 1, X_2i = 1] = α + β + γ + δ

• We can write the CEF as follows:

  E[Y_i | X_1i, X_2i] = α + β·X_1i + γ·X_2i + δ·(X_1i X_2i)

10/58

Saturated models example

  E[Y_i | X_1i, X_2i] = α + β·X_1i + γ·X_2i + δ·(X_1i X_2i)

• Basically, each value of μ(X_i) is being estimated separately:
  ▶ within-strata estimation
  ▶ No borrowing of information from across values of X_i
• Requires a set of dummies for each categorical variable plus all interactions
• Or a series of dummies for each unique combination of X_i
• This makes linearity hold mechanically, and so linearity is not an assumption
  ▶ Just a fact about saturated CEFs
  ▶ saturated models for limited dependent variables = A-OK

11/58

Saturated model example

• Washington (AER) data on the effects of daughters
• We'll look at the relationship between voting and number of kids (causal)

girls <- foreign::read.dta("girls.dta")
head(girls[, c("name", "totchi", "aauw")])

               name totchi aauw
1  ABERCROMBIE NEIL      0  100
2   ACKERMAN GARY L      3   88
3 ADERHOLT ROBERT B      0    0
4    ALLEN THOMAS H      2  100
5  ANDREWS ROBERT E      2  100
6         ARCHER WR      7    0

12/58

Linear model

summary(lm(aauw ~ totchi, data = girls))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    61.31       1.81   33.81   <2e-16
totchi         -5.33       0.62   -8.59   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 42 on 1733 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.0408, Adjusted R-squared: 0.0403
F-statistic: 73.8 on 1 and 1733 DF, p-value: <2e-16

13/58

Saturated model

summary(lm(aauw ~ as.factor(totchi), data = girls))

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)            56.41       2.76   20.42  < 2e-16
as.factor(totchi)1      5.45       4.11    1.33   0.1851
as.factor(totchi)2     -3.80       3.27   -1.16   0.2454
as.factor(totchi)3    -13.65       3.45   -3.95  8.1e-05
as.factor(totchi)4    -19.31       4.01   -4.82  1.6e-06
as.factor(totchi)5    -15.46       4.85   -3.19   0.0015
as.factor(totchi)6    -33.59      10.42   -3.22   0.0013
as.factor(totchi)7    -17.13      11.41   -1.50   0.1336
as.factor(totchi)8    -55.33      12.28   -4.51  7.0e-06
as.factor(totchi)9    -50.41      24.08   -2.09   0.0364
as.factor(totchi)10   -53.41      20.90   -2.56   0.0107
as.factor(totchi)12   -56.41      41.53   -1.36   0.1745
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41 on 1723 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.0506, Adjusted R-squared: 0.0446
F-statistic: 8.36 on 11 and 1723 DF, p-value: 1.84e-14

14/58

Saturated model minus the constant

summary(lm(aauw ~ as.factor(totchi) - 1, data = girls))

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
as.factor(totchi)0     56.41       2.76   20.42   <2e-16
as.factor(totchi)1     61.86       3.05   20.31   <2e-16
as.factor(totchi)2     52.62       1.75   30.13   <2e-16
as.factor(totchi)3     42.76       2.07   20.62   <2e-16
as.factor(totchi)4     37.11       2.90   12.79   <2e-16
as.factor(totchi)5     40.95       3.99   10.27   <2e-16
as.factor(totchi)6     22.82      10.05    2.27   0.0233
as.factor(totchi)7     39.29      11.07    3.55   0.0004
as.factor(totchi)8      1.08      11.96    0.09   0.9278
as.factor(totchi)9      6.00      23.92    0.25   0.8020
as.factor(totchi)10     3.00      20.72    0.14   0.8849
as.factor(totchi)12     0.00      41.43    0.00   1.0000
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41 on 1723 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.587, Adjusted R-squared: 0.584
F-statistic: 204 on 12 and 1723 DF, p-value: <2e-16

15/58

Compare to within-strata means

• The saturated model makes no assumptions about the between-strata relationships
• Just calculates within-strata means:

c1 <- coef(lm(aauw ~ as.factor(totchi) - 1, data = girls))
c2 <- with(girls, tapply(aauw, totchi, mean, na.rm = TRUE))
rbind(c1, c2)

    0  1  2  3  4  5  6  7  8 9 10 12
c1 56 62 53 43 37 41 23 39 11 6  3  0
c2 56 62 53 43 37 41 23 39 11 6  3  0

16/58

Other justifications for OLS

• Justification 2: X_i'β is the best linear predictor (in a mean-squared error sense) of Y_i
  ▶ Why? β = argmin_b E[(Y_i − X_i'b)²]
• Justification 3: X_i'β provides the minimum mean squared error linear approximation to E[Y_i|X_i]
• Even if the CEF is not linear, a linear regression provides the best linear approximation to that CEF
• Don't need to believe the assumptions (linearity) in order to use regression as a good approximation to the CEF
• Warning: if the CEF is very nonlinear, then this approximation could be terrible

17/58

The error terms

• Let's define the error term e_i ≡ Y_i − X_i'β, so that:

  Y_i = X_i'β + [Y_i − X_i'β] = X_i'β + e_i

• Note the residual e_i is uncorrelated with X_i:

  E[X_i e_i] = E[X_i (Y_i − X_i'β)]
             = E[X_i Y_i] − E[X_i X_i' β]
             = E[X_i Y_i] − E[ X_i X_i' E[X_i X_i']⁻¹ E[X_i Y_i] ]
             = E[X_i Y_i] − E[X_i X_i'] E[X_i X_i']⁻¹ E[X_i Y_i]
             = E[X_i Y_i] − E[X_i Y_i] = 0

• No assumptions on the linearity of E[Y_i|X_i]

18/58
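The sample analogue of this orthogonality is exact in any OLS fit. A small pure-Python check with made-up data (the slides use R):

```python
# Sample analogue of E[X_i e_i] = 0: OLS residuals are orthogonal to
# the regressors (including the constant) by construction.

x = [0.5, 1.2, 2.0, 3.3, 4.1, 5.0]
y = [1.1, 2.0, 2.2, 4.5, 4.0, 6.1]
n = len(x)

mx = sum(x) / n
my = sum(y) / n
beta = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
       sum((xi - mx) ** 2 for xi in x)
alpha = my - beta * mx

resid = [yi - alpha - beta * xi for xi, yi in zip(x, y)]

sum_e  = sum(resid)                                 # orthogonal to constant
sum_xe = sum(xi * ei for xi, ei in zip(x, resid))   # orthogonal to x
print(sum_e, sum_xe)  # both ~0 up to floating-point error
```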

OLS estimator

• We know the population value of β is:

  β = E[X_i X_i']⁻¹ E[X_i Y_i]

• How do we get an estimator of this?
• Plug-in principle: replace population expectations with sample versions:

  β̂ = [ (1/N) Σ_i X_i X_i' ]⁻¹ (1/N) Σ_i X_i Y_i

• If you work through the matrix algebra, this turns out to be:

  β̂ = (X'X)⁻¹ X'y

19/58

Asymptotic OLS inference

• With this representation in hand, we can write the OLS estimator as follows:

  β̂ = β + [ Σ_i X_i X_i' ]⁻¹ Σ_i X_i e_i

• Core idea: Σ_i X_i e_i is the sum of r.v.s, so the CLT applies
• That, plus some simple asymptotic theory, allows us to say:

  √N (β̂ − β) ⇝ N(0, Ω)

• Converges in distribution to a Normal distribution with mean vector 0 and covariance matrix Ω:

  Ω = E[X_i X_i']⁻¹ E[X_i X_i' e_i²] E[X_i X_i']⁻¹

• No linearity assumption needed!

20/58
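A small Monte Carlo sketch of this result (pure Python; the data-generating process is our own choice, just for illustration): the slope estimates center on the truth and their spread shrinks as N grows.

```python
import random
import statistics

# Monte Carlo check of the CLT for the OLS slope. Assumed DGP:
# y = 1 + 2x + e, x ~ Uniform(-1, 1), e ~ N(0, 1).
random.seed(42)

def ols_slope(n):
    x = [random.uniform(-1, 1) for _ in range(n)]
    e = [random.gauss(0, 1) for _ in range(n)]
    y = [1 + 2 * xi + ei for xi, ei in zip(x, e)]
    mx, my = statistics.mean(x), statistics.mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
           sum((xi - mx) ** 2 for xi in x)

reps = 400
small = [ols_slope(50) for _ in range(reps)]    # N = 50
large = [ols_slope(500) for _ in range(reps)]   # N = 500

mean_large = statistics.mean(large)
sd_small = statistics.stdev(small)
sd_large = statistics.stdev(large)
print(mean_large)          # close to the true slope, 2
print(sd_small, sd_large)  # spread shrinks roughly like 1/sqrt(N)
```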

Estimating the variance

• In large samples, then:

  √N (β̂ − β) ∼ N(0, Ω)

• How to estimate Ω? Plug-in principle again:

  Ω̂ = [ Σ_i X_i X_i' ]⁻¹ [ Σ_i X_i X_i' ê_i² ] [ Σ_i X_i X_i' ]⁻¹

• Replace e_i with its empirical counterpart (the residuals): ê_i = Y_i − X_i'β̂
• Replace the population moments of X_i with their sample counterparts
• The square roots of the diagonal of this covariance matrix are the "robust" or Huber-White standard errors that Stata commonly reports

21/58
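The plug-in sandwich formula can be computed by hand. A pure-Python sketch for a simple regression, with a made-up heteroskedastic DGP (in R, the sandwich package's vcovHC() does this for general designs):

```python
import random

# Hand-rolled HC0 "sandwich" variance for a simple regression,
# matching the plug-in formula above. Simulated data with
# heteroskedastic errors (our own choice of DGP).
random.seed(7)
n = 200
x = [random.uniform(0, 2) for _ in range(n)]
# error standard deviation grows with x -> heteroskedasticity
y = [1 + 2 * xi + random.gauss(0, 0.5 + xi) for xi in x]

def mat2_inv(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mat2_mul(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

# X_i = (1, x_i); build sum X_i X_i' and sum X_i y_i
XtX = [[n, sum(x)], [sum(x), sum(xi * xi for xi in x)]]
Xty = [sum(y), sum(xi * yi for xi, yi in zip(x, y))]
XtX_inv = mat2_inv(XtX)
beta = [XtX_inv[0][0] * Xty[0] + XtX_inv[0][1] * Xty[1],
        XtX_inv[1][0] * Xty[0] + XtX_inv[1][1] * Xty[1]]

resid = [yi - beta[0] - beta[1] * xi for xi, yi in zip(x, y)]

# "Meat": sum X_i X_i' ehat_i^2
sxe = sum(xi * e * e for xi, e in zip(x, resid))
meat = [[sum(e * e for e in resid), sxe],
        [sxe, sum(xi * xi * e * e for xi, e in zip(x, resid))]]

omega = mat2_mul(mat2_mul(XtX_inv, meat), XtX_inv)
se_robust = omega[1][1] ** 0.5

# Classical (homoskedastic) SE for comparison
s2 = sum(e * e for e in resid) / (n - 2)
se_classical = (s2 * XtX_inv[1][1]) ** 0.5
print(se_robust, se_classical)
```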

Heteroskedasticity

• No assumptions of homoskedasticity
• Heteroskedasticity will definitely occur when:
  ▶ the CEF is linear, but σ²(x) = V[Y_i | X_i = x] is not constant in x
  ▶ E[Y_i|X_i] is not linear, but we use the linear regression to approximate it

22/58

2. Regression and Causality

23/58

Regression and causality

• Most econometrics textbooks: regression defined without respect to causality
• But then, when is the OLS estimator "biased"? The above derivations work for some CEF, E[Y_i|X_i]
• The question then is: when does knowing the CEF tell us something about causality?
• MHE argues that a regression is causal when the CEF it approximates is causal. Identification is king.
• We will show that, under certain conditions, a regression of the outcome on the treatment and the covariates can recover a causal parameter, but perhaps not the one in which we are interested

24/58

Review

• Quick reminder: we have potential outcomes Y_i(1) and Y_i(0) and two parameters, the ATE and ATT:

  τ = E[Y_i(1) − Y_i(0)]
  τ_ATT = E[Y_i(1) − Y_i(0) | D_i = 1]

• We have shown in past weeks that these effects are identified when ignorability holds. MHE calls this the conditional independence assumption (CIA)

25/58

Linear constant effects model, binary treatment

• With a simple experiment, we can rewrite the consistency assumption to be a regression formula:

  Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
      = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
      = E[Y_i(0)] + τ D_i + (Y_i(0) − E[Y_i(0)])
      = μ_0 + τ D_i + v_0i

• Note that if ignorability holds (as in an experiment) for Y_i(0), then it will also hold for v_0i, since E[Y_i(0)] is constant. Thus this satisfies the usual assumptions for regression

26/58
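One implication worth seeing numerically: with a binary D_i and no covariates, the OLS coefficient on D_i is exactly the difference in means. A tiny pure-Python check with made-up numbers (the slides use R):

```python
# With a binary treatment and no covariates, the OLS slope on D
# equals the difference in means exactly. Toy data for illustration.

d = [1, 1, 1, 0, 0, 0, 0]
y = [5.0, 7.0, 6.0, 2.0, 3.0, 4.0, 3.0]

y1 = [yi for di, yi in zip(d, y) if di == 1]
y0 = [yi for di, yi in zip(d, y) if di == 0]
diff_in_means = sum(y1) / len(y1) - sum(y0) / len(y0)

# OLS slope: Cov(y, d) / V(d), the sample analogue of the
# population slope formula.
n = len(d)
md, my = sum(d) / n, sum(y) / n
ols_slope = sum((di - md) * (yi - my) for di, yi in zip(d, y)) / \
            sum((di - md) ** 2 for di in d)
print(diff_in_means, ols_slope)
```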

Now with covariates

• Now assume no unmeasured confounders: Y_i(d) ⊥⊥ D_i | X_i
• We will assume a linear model for the potential outcomes:

  Y_i(d) = α + τ·d + η_i

• Remember that linearity isn't an assumption if D_i is binary
• Effect of D_i is constant here; the η_i are the only source of individual variation, and we have E[η_i] = 0
• Consistency assumption allows us to write this as:

  Y_i = α + τ D_i + η_i

27/58

Covariates in the error

• Let's assume that η_i is linear in X_i: η_i = X_i'γ + ν_i
• New error is uncorrelated with X_i: E[ν_i | X_i] = 0
  ▶ This is an assumption! Might be false!
• Plug into the above:

  E[Y_i(d) | X_i] = E[Y_i | D_i, X_i] = α + τ D_i + E[η_i | X_i]
                  = α + τ D_i + X_i'γ + E[ν_i | X_i]
                  = α + τ D_i + X_i'γ

28/58

Summing up regression with constant effects

• Reviewing the assumptions we've used:
  ▶ no unmeasured confounders
  ▶ constant treatment effects
  ▶ linearity of the treatment/covariates
• Under these, we can run the following regression to estimate the ATE, τ:

  Y_i = α + τ D_i + X_i'γ + ν_i

• Works with continuous or ordinal D_i if the effect of these variables is truly linear

29/58

OLS, constant effects simulation

Model with linear covariates, constant 0 effect of treatment:

library(mvtnorm)
n <- 100
p <- 4
X <- rmvnorm(n = 100, mean = rep(0, p))
gamma <- c(27.4, 13.7, 13.7, 13.7)
y <- 210 + X %*% c(gamma) + rnorm(n)
alpha <- c(-1, 0.5, -0.5, -0.1)
d.probs <- boot::inv.logit(X %*% alpha)
d <- rbinom(n, size = 1, prob = d.probs)

30/58

OLS with no covariates

summary(lm(y ~ d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   220.36       4.58   48.13  < 2e-16
d             -28.69       6.54   -4.39 0.000029
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 33 on 98 degrees of freedom
Multiple R-squared: 0.164, Adjusted R-squared: 0.156
F-statistic: 19.2 on 1 and 98 DF, p-value: 0.000029

31/58

OLS with covariates

summary(lm(y ~ d + X))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 209.952      0.159  1320.7   <2e-16
d            -0.128      0.253    -0.5     0.62
X1           27.368      0.126   217.5   <2e-16
X2           13.677      0.114   120.1   <2e-16
X3           13.673      0.130   105.1   <2e-16
X4           13.570      0.106   128.0   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom
Multiple R-squared: 0.999, Adjusted R-squared: 0.999
F-statistic: 2.44e+04 on 5 and 94 DF, p-value: <2e-16

32/58

What happens with nonlinearity?

Suppose we can only observe the following covariates:

z1 <- exp(X[, 1]/2)
z2 <- X[, 2]/(1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3]/25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

Implies that Y_i and D_i are functions of log(Z_i1), Z_i2, and other nonlinear transformations of the observed covariates.

⇝ The true regression is a nonlinear function of the observed covariates

33/58

When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.8728    30.0799    0.73    0.469
d            -6.9292     3.3854   -2.05    0.043
z1           36.3110     2.6970   13.46   <2e-16
z2           -2.9033     3.6619   -0.79    0.430
z3           86.2022    43.1030    2.00    0.048
z4            0.4021     0.0329   12.23   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14 on 94 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF, p-value: <2e-16

34/58

3. Regression with Heterogeneous Treatment Effects

35/58

Heterogeneous effects, binary treatment

• Completely randomized experiment:

  Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
      = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
      = μ_0 + τ_i D_i + (Y_i(0) − μ_0)
      = μ_0 + τ D_i + (Y_i(0) − μ_0) + (τ_i − τ)·D_i
      = μ_0 + τ D_i + ε_i

• Error term now includes two components:
  1. "Baseline" variation in the outcome: (Y_i(0) − μ_0)
  2. Variation in the treatment effect: (τ_i − τ)
• Easy to verify that under an experiment, E[ε_i | D_i] = 0
• Thus, OLS estimates the ATE with no covariates

36/58
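A quick simulation sketch of that last claim (pure Python; the DGP and its parameters are our own, chosen for illustration): with randomized treatment and heterogeneous τ_i, the difference in means still centers on the ATE.

```python
import random
import statistics

# Heterogeneous effects under randomization: the difference in means
# (the no-covariate OLS coefficient) still recovers the ATE.
# Assumed DGP: tau_i varies across units with mean 1.5.
random.seed(3)

def one_experiment(n=2000):
    y0  = [random.gauss(0, 1) for _ in range(n)]
    tau = [random.gauss(1.5, 1) for _ in range(n)]    # heterogeneous effects
    d   = [random.random() < 0.5 for _ in range(n)]   # randomized treatment
    y   = [a + t * di for a, t, di in zip(y0, tau, d)]
    y1bar = statistics.mean(yi for yi, di in zip(y, d) if di)
    y0bar = statistics.mean(yi for yi, di in zip(y, d) if not di)
    return y1bar - y0bar

estimates = [one_experiment() for _ in range(200)]
avg_estimate = statistics.mean(estimates)
print(avg_estimate)  # close to the true ATE of 1.5
```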

Adding covariates

• What happens with no unmeasured confounders? Need to condition on X_i now
• Remember identification of the ATE/ATT using iterated expectations
• ATE is the weighted sum of CATEs:

  τ = Σ_x τ(x) Pr[X_i = x]

• ATE/ATT are weighted averages of CATEs
• What about the regression estimand, τ_R? How does it relate to the ATE/ATT?

37/58

Heterogeneous effects and regression

• Let's investigate this under a saturated regression model:

  Y_i = Σ_x B_xi α_x + τ_R D_i + e_i

• Use a dummy variable for each unique combination of X_i: B_xi = 1(X_i = x)
• Linear in X_i by construction!

38/58

Investigating the regression coefficient

• How can we investigate τ_R? Well, we can rely on the regression anatomy:

  τ_R = Cov(Y_i, D_i − E[D_i|X_i]) / V(D_i − E[D_i|X_i])

• D_i − E[D_i|X_i] is the residual from a regression of D_i on the full set of dummies
• With a little work, we can show:

  τ_R = E[ τ(X_i) (D_i − E[D_i|X_i])² ] / E[ (D_i − E[D_i|X_i])² ]
      = E[ τ(X_i) σ²_d(X_i) ] / E[ σ²_d(X_i) ]

• σ²_d(x) = V[D_i | X_i = x] is the conditional variance of treatment assignment

39/58

ATE versus OLS

  τ_R = E[τ(X_i) W_i] = Σ_x τ(x) · ( σ²_d(x) / E[σ²_d(X_i)] ) · P[X_i = x]

• Compare to the ATE:

  τ = E[τ(X_i)] = Σ_x τ(x) P[X_i = x]

• Both weight strata relative to their size (P[X_i = x])
• OLS weights strata higher if the treatment variance in those strata (σ²_d(x)) is high relative to the average variance across strata (E[σ²_d(X_i)])
• The ATE weights strata only by their size

40/58

Regression weighting

  W_i = σ²_d(X_i) / E[σ²_d(X_i)]

• Why does OLS weight like this?
• OLS is a minimum-variance estimator ⇝ more weight to more precise within-strata estimates
• Within-strata estimates are most precise when the treatment is evenly spread and thus has the highest variance
• If D_i is binary, then we know the conditional variance will be:

  σ²_d(x) = P[D_i = 1 | X_i = x] (1 − P[D_i = 1 | X_i = x])
          = e(x) (1 − e(x))

• ⇝ Maximum variance with P[D_i = 1 | X_i = x] = 1/2

41/58

OLS weighting example

• Binary covariate:

  P[X_i = 1] = 0.75            P[X_i = 0] = 0.25
  P[D_i = 1 | X_i = 1] = 0.9   P[D_i = 1 | X_i = 0] = 0.5
  σ²_d(1) = 0.09               σ²_d(0) = 0.25
  τ(1) = 1                     τ(0) = −1

• Implies the ATE is τ = 0.5
• Average conditional variance: E[σ²_d(X_i)] = 0.13
• ⇝ weights for X_i = 1 are 0.09/0.13 = 0.692; for X_i = 0, 0.25/0.13 = 1.92

  τ_R = E[τ(X_i) W_i]
      = τ(1) W(1) P[X_i = 1] + τ(0) W(0) P[X_i = 0]
      = 1 × 0.692 × 0.75 + (−1) × 1.92 × 0.25
      = 0.039

42/58
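The slide's arithmetic, reproduced in a few lines of Python (the 0.039 above uses the rounded weights; the exact value is 0.005/0.13 ≈ 0.0385):

```python
# Reproduce the binary-covariate weighting example, keeping exact
# values throughout (no rounding until the end).

p_x   = {1: 0.75, 0: 0.25}   # P[X = x]
e_x   = {1: 0.90, 0: 0.50}   # P[D = 1 | X = x]
tau_x = {1: 1.0,  0: -1.0}   # CATEs

var_d = {x: e_x[x] * (1 - e_x[x]) for x in p_x}   # sigma^2_d(x)
avg_var = sum(var_d[x] * p_x[x] for x in p_x)     # E[sigma^2_d(X)]

ate   = sum(tau_x[x] * p_x[x] for x in p_x)
tau_R = sum(tau_x[x] * (var_d[x] / avg_var) * p_x[x] for x in p_x)

print(ate)    # 0.5
print(tau_R)  # ~0.0385, matching the slide's 0.039 up to rounding
```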

When will OLS estimate the ATE?

• When does τ = τ_R?
• Constant treatment effects: τ(x) = τ = τ_R
• Constant probability of treatment: e(x) = P[D_i = 1 | X_i = x] = e
  ▶ Implies that the OLS weights are 1
• Incorrect linearity assumption in X_i will lead to more bias

43/58

Other ways to use regression

• What's the path forward?
  ▶ Accept the bias (might be relatively small with saturated models)
  ▶ Use a different regression approach
• Let μ_d(x) = E[Y_i(d) | X_i = x] be the CEF for the potential outcome under D_i = d
• By consistency and no unmeasured confounders, we have μ_d(x) = E[Y_i | D_i = d, X_i = x]
• Estimate a regression of Y_i on X_i among the D_i = d group
• Then μ̂_d(x) is just a predicted value from the regression for X_i = x
• How can we use this?

44/58

Imputation estimators

• Impute the treated potential outcomes with Ŷ_i(1) = μ̂_1(X_i)
• Impute the control potential outcomes with Ŷ_i(0) = μ̂_0(X_i)
• Procedure:
  ▶ Regress Y_i on X_i in the treated group and get predicted values for all units (treated or control)
  ▶ Regress Y_i on X_i in the control group and get predicted values for all units (treated or control)
  ▶ Take the average difference between these predicted values
• More mathematically, it looks like this:

  τ̂_imp = (1/N) Σ_i [ μ̂_1(X_i) − μ̂_0(X_i) ]

• Sometimes called an imputation estimator

45/58

Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample
• Useful trick: use a model on the entire data and model.frame() to get the right design matrix:

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61

46/58

Notes on imputation estimators

• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE
• Why don't people use this?
  ▶ Most people don't know the results we've been talking about
  ▶ Harder to implement than vanilla OLS
• Can use linear regression to estimate μ̂_d(x) = x'β̂_d
• Recent trend is to estimate μ̂_d(x) via non-parametric methods such as:
  ▶ Kernel regression, local linear regression, regression trees, etc.
  ▶ Easiest is generalized additive models (GAMs)

47/58

Page 9: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

Justification 1 Linear CEFs

bull Justification 1 if the CEF is linear the population regressionfunction is it That is if 119864[119884119894|119883119894] = 119883prime

119894 119887 then 119887 = 120573bull When would we expect the CEF to be linear Two cases

1 Outcome and covariates are multivariate normal2 Linear regression model is saturated

bull A model is saturated if there are as many parameters asthere are possible combination of the 119883119894 variables

9 58

Saturated model example

bull Two binary variables 1198831113568119894 for incumbency status and 1198831113569119894 forparty of the candidate

bull Four possible values of 119883119894 four possible values of 120583(119883119894)

119864[119884119894|1198831113568119894 = 01198831113569119894 = 0] = 120572119864[119884119894|1198831113568119894 = 11198831113569119894 = 0] = 120572 + 120573119864[119884119894|1198831113568119894 = 01198831113569119894 = 1] = 120572 + 120574119864[119884119894|1198831113568119894 = 11198831113569119894 = 1] = 120572 + 120573 + 120574 + 120575

bull We can write the CEF as follows

119864[119884119894|1198831113568119894 1198831113569119894] = 120572 + 1205731198831113568119894 + 1205741198831113569119894 + 120575(11988311135681198941198831113569119894)

10 58

Saturated models example

119864[119884119894|1198831113568119894 1198831113569119894] = 120572 + 1205731198831113568119894 + 1205741198831113569119894 + 120575(11988311135681198941198831113569119894)

bull Basically each value of 120583(119883119894) is being estimated separately within-strata estimation No borrowing of information from across values of 119883119894

bull Requires a set of dummies for each categorical variable plusall interactions

bull Or a series of dummies for each unique combination of 119883119894bull This makes linearity hold mechanically and so linearity is

not an assumption Just a fact about saturated CEFs saturated models for limited dependent variables = A-OK

11 58

Saturated model example

bull Washington (AER) data on the effects of daughtersbull Wersquoll look at the relationship between voting and number of

kids (causal)

girls lt- foreignreaddta(rdquogirlsdtardquo)

head(girls[ c(rdquonamerdquo rdquototchirdquo rdquoaauwrdquo)])

name totchi aauw

1 ABERCROMBIE NEIL 0 100

2 ACKERMAN GARY L 3 88

3 ADERHOLT ROBERT B 0 0

4 ALLEN THOMAS H 2 100

5 ANDREWS ROBERT E 2 100

6 ARCHER WR 7 0

12 58

Linear model

summary(lm(aauw ~ totchi data = girls))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 6131 181 3381 lt2e-16

totchi -533 062 -859 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 42 on 1733 degrees of freedom

(5 observations deleted due to missingness)

Multiple R-squared 00408 Adjusted R-squared 00403

F-statistic 738 on 1 and 1733 DF p-value lt2e-16

13 58

Saturated modelsummary(lm(aauw ~ asfactor(totchi) data = girls))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 5641 276 2042 lt 2e-16

asfactor(totchi)1 545 411 133 01851

asfactor(totchi)2 -380 327 -116 02454

asfactor(totchi)3 -1365 345 -395 81e-05

asfactor(totchi)4 -1931 401 -482 16e-06

asfactor(totchi)5 -1546 485 -319 00015

asfactor(totchi)6 -3359 1042 -322 00013

asfactor(totchi)7 -1713 1141 -150 01336

asfactor(totchi)8 -5533 1228 -451 70e-06

asfactor(totchi)9 -5041 2408 -209 00364

asfactor(totchi)10 -5341 2090 -256 00107

asfactor(totchi)12 -5641 4153 -136 01745

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 41 on 1723 degrees of freedom

(5 observations deleted due to missingness)

Multiple R-squared 00506 Adjusted R-squared 00446

F-statistic 836 on 11 and 1723 DF p-value 184e-1414 58

Saturated model minus the constantsummary(lm(aauw ~ asfactor(totchi) - 1 data = girls))

Coefficients

Estimate Std Error t value Pr(gt|t|)

asfactor(totchi)0 5641 276 2042 lt2e-16

asfactor(totchi)1 6186 305 2031 lt2e-16

asfactor(totchi)2 5262 175 3013 lt2e-16

asfactor(totchi)3 4276 207 2062 lt2e-16

asfactor(totchi)4 3711 290 1279 lt2e-16

asfactor(totchi)5 4095 399 1027 lt2e-16

asfactor(totchi)6 2282 1005 227 00233

asfactor(totchi)7 3929 1107 355 00004

asfactor(totchi)8 108 1196 009 09278

asfactor(totchi)9 600 2392 025 08020

asfactor(totchi)10 300 2072 014 08849

asfactor(totchi)12 000 4143 000 10000

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 41 on 1723 degrees of freedom

(5 observations deleted due to missingness)

Multiple R-squared 0587 Adjusted R-squared 0584

F-statistic 204 on 12 and 1723 DF p-value lt2e-1615 58

Compare to within-strata means

bull The saturated model makes no assumptions about thebetween-strata relationships

bull Just calculates within-strata means

c1 lt- coef(lm(aauw ~ asfactor(totchi) - 1 data = girls))

c2 lt- with(girls tapply(aauw totchi mean narm = TRUE))

rbind(c1 c2)

0 1 2 3 4 5 6 7 8 9 10 12

c1 56 62 53 43 37 41 23 39 11 6 3 0

c2 56 62 53 43 37 41 23 39 11 6 3 0

16 58

Other justifications for OLS

• Justification 2: X_i'β is the best linear predictor (in a mean-squared error sense) of Y_i. Why? β = argmin_b E[(Y_i - X_i'b)²]
• Justification 3: X_i'β provides the minimum mean squared error linear approximation to E[Y_i|X_i]
• Even if the CEF is not linear, a linear regression provides the best linear approximation to that CEF
• Don't need to believe the assumptions (linearity) in order to use regression as a good approximation to the CEF
• Warning: if the CEF is very nonlinear, then this approximation could be terrible

17 58

The error terms

• Let's define the error term e_i ≡ Y_i - X_i'β, so that

  Y_i = X_i'β + [Y_i - X_i'β] = X_i'β + e_i

• Note the residual e_i is uncorrelated with X_i:

  E[X_i e_i] = E[X_i (Y_i - X_i'β)]
             = E[X_i Y_i] - E[X_i X_i' β]
             = E[X_i Y_i] - E[X_i X_i' E[X_i X_i']^{-1} E[X_i Y_i]]
             = E[X_i Y_i] - E[X_i X_i'] E[X_i X_i']^{-1} E[X_i Y_i]
             = E[X_i Y_i] - E[X_i Y_i] = 0

• No assumptions on the linearity of E[Y_i|X_i]

18 58

OLS estimator

• We know the population value of β is

  β = E[X_i X_i']^{-1} E[X_i Y_i]

• How do we get an estimator of this?
• Plug-in principle: replace the population expectations with their sample versions:

  β̂ = [(1/N) Σ_i X_i X_i']^{-1} (1/N) Σ_i X_i Y_i

• If you work through the matrix algebra, this turns out to be

  β̂ = (X'X)^{-1} X'y

19 58
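As a sanity check, the plug-in formula can be computed directly and compared against lm(). A minimal sketch with simulated data (not from the slides; the intercept 1 and slope 2 are arbitrary choices):

```r
## Sketch: the plug-in estimator (X'X)^{-1} X'y, checked against lm().
set.seed(2002)
n <- 500
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
X <- cbind(1, x)                                  # design matrix with intercept
beta.hat <- solve(crossprod(X), crossprod(X, y))  # solves (X'X) b = X'y
all.equal(as.vector(beta.hat), unname(coef(lm(y ~ x))))
```

crossprod(X) forms X'X without explicitly transposing, and solve(A, b) is numerically preferable to solve(A) %*% b.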

Asymptotic OLS inference

• With this representation in hand, we can write the OLS estimator as follows:

  β̂ = β + [Σ_i X_i X_i']^{-1} Σ_i X_i e_i

• Core idea: Σ_i X_i e_i is a sum of random variables, so the CLT applies
• That, plus some simple asymptotic theory, allows us to say:

  √N (β̂ - β) ⇝ N(0, Ω)

• Converges in distribution to a Normal distribution with mean vector 0 and covariance matrix Ω:

  Ω = E[X_i X_i']^{-1} E[X_i X_i' e_i²] E[X_i X_i']^{-1}

• No linearity assumption needed

20 58

Estimating the variance

• In large samples, then,

  √N (β̂ - β) ∼ N(0, Ω)

• How to estimate Ω? Plug-in principle again:

  Ω̂ = [Σ_i X_i X_i']^{-1} [Σ_i X_i X_i' ê_i²] [Σ_i X_i X_i']^{-1}

• Replace e_i with its empirical counterpart, the residuals ê_i = Y_i - X_i'β̂
• Replace the population moments of X_i with their sample counterparts
• The square roots of the diagonal of this covariance matrix are the "robust" or Huber-White standard errors that Stata commonly reports

21 58
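The plug-in sandwich Ω̂ takes only a few lines to assemble by hand. A sketch with simulated heteroskedastic data (not from the slides; this reproduces the HC0 flavor, which sandwich::vcovHC() implements along with several small-sample refinements):

```r
## Sketch: "robust" (HC0) sandwich variance from the plug-in formula.
set.seed(2002)
n <- 500
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = abs(x))     # error variance depends on x
X <- cbind(1, x)
beta.hat <- solve(crossprod(X), crossprod(X, y))
e.hat <- as.vector(y - X %*% beta.hat)     # residuals
bread <- solve(crossprod(X))               # [sum_i X_i X_i']^{-1}
meat <- crossprod(X * e.hat)               # sum_i X_i X_i' e.hat_i^2
Omega.hat <- bread %*% meat %*% bread
sqrt(diag(Omega.hat))                      # robust standard errors
```

Note X * e.hat scales row i of the design matrix by ê_i, so crossprod of it gives the middle "meat" term directly.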

Heteroskedasticity

• No assumptions of homoskedasticity
• Heteroskedasticity will definitely occur when:
  - the CEF is linear, but σ²(x) = V[Y_i|X_i = x] is not constant in x
  - E[Y_i|X_i] is not linear, but we use the linear regression to approximate it

22 58

2. Regression and Causality

23 58

Regression and causality

• Most econometrics textbooks: regression defined without respect to causality
• But then when is β̂ "biased"? The above derivations work for some E[Y_i|X_i]
• The question then is: when does knowing the CEF tell us something about causality?
• MHE argues that a regression is causal when the CEF it approximates is causal. Identification is king.
• We will show that under certain conditions, a regression of the outcome on the treatment and the covariates can recover a causal parameter, but perhaps not the one in which we are interested

24 58

Review

• Quick reminder: we have potential outcomes Y_i(1) and Y_i(0) and two parameters, the ATE and ATT:

  τ = E[Y_i(1) - Y_i(0)]
  τ_ATT = E[Y_i(1) - Y_i(0) | D_i = 1]

• We have shown in past weeks that these effects are identified when ignorability holds. MHE calls this the conditional independence assumption (CIA)

25 58

Linear constant effects model, binary treatment

• Experiment: with a simple experiment, we can rewrite the consistency assumption as a regression formula:

  Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0)
      = Y_i(0) + (Y_i(1) - Y_i(0)) D_i
      = E[Y_i(0)] + τ D_i + (Y_i(0) - E[Y_i(0)])
      = μ_0 + τ D_i + v_{0i}

• Note that if ignorability holds (as in an experiment) for Y_i(0), then it will also hold for v_{0i}, since E[Y_i(0)] is constant. Thus this satisfies the usual assumptions for regression.

26 58

Now with covariates

• Now assume no unmeasured confounders: Y_i(d) ⊥⊥ D_i | X_i
• We will assume a linear model for the potential outcomes:

  Y_i(d) = α + τ·d + η_i

• Remember that linearity isn't an assumption if D_i is binary
• Effect of D_i is constant; here the η_i are the only source of individual variation, and we have E[η_i] = 0
• The consistency assumption allows us to write this as:

  Y_i = α + τ D_i + η_i

27 58

Covariates in the error

• Let's assume that η_i is linear in X_i: η_i = X_i'γ + ν_i
• New error is uncorrelated with X_i: E[ν_i | X_i] = 0
• This is an assumption! It might be false!
• Plug into the above:

  E[Y_i(d) | X_i] = E[Y_i | D_i, X_i] = α + τ D_i + E[η_i | X_i]
                  = α + τ D_i + X_i'γ + E[ν_i | X_i]
                  = α + τ D_i + X_i'γ

28 58

Summing up regression with constant effects

• Reviewing the assumptions we've used:
  - no unmeasured confounders
  - constant treatment effects
  - linearity of the treatment/covariates
• Under these, we can run the following regression to estimate the ATE, τ:

  Y_i = α + τ D_i + X_i'γ + ν_i

• Works with continuous or ordinal D_i if the effect of these variables is truly linear

29 58

OLS constant effects simulation

Model with linear covariates, constant 0 effect of treatment:

library(mvtnorm)

n <- 100
p <- 4
X <- rmvnorm(n = 100, mean = rep(0, p))
gamma <- c(2.74, 1.37, 1.37, 1.37)
y <- 21.0 + X %*% c(gamma) + rnorm(n)
alpha <- c(-1, 0.5, -0.5, -0.1)
dprobs <- boot::inv.logit(X %*% alpha)
d <- rbinom(n, size = 1, prob = dprobs)

30 58

OLS with no covariates

summary(lm(y ~ d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   22.036      0.458   48.13  < 2e-16
d             -2.869      0.654   -4.39 0.000029
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.3 on 98 degrees of freedom

Multiple R-squared: 0.164, Adjusted R-squared: 0.156

F-statistic: 19.2 on 1 and 98 DF, p-value: 0.000029

31 58

OLS with covariates

summary(lm(y ~ d + X))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  20.9952      0.159  132.07   <2e-16
d            -0.128       0.253    -0.5     0.62
X1            2.7368      0.126   21.75   <2e-16
X2            1.3677      0.114   12.01   <2e-16
X3            1.3673      0.130   10.51   <2e-16
X4            1.3570      0.106   12.80   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom

Multiple R-squared: 0.999, Adjusted R-squared: 0.999

F-statistic: 2.44e+04 on 5 and 94 DF, p-value: <2e-16

32 58

What happens with nonlinearity?

Suppose we can only observe the following covariates:

z1 <- exp(X[, 1]/2)
z2 <- X[, 2]/(1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3]/25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

Implies that Y_i and D_i are functions of log(Z_{i1}), Z_{i2}, Z²_{i1} Z_{i2}, 1/log(Z_{i1}), Z_{i3}/log(Z_{i1}), and Z^{1/2}_{i4}

Regression is a nonlinear function of the observed covariates

33 58

When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.8728    30.0799    0.73    0.469
d            -6.9292     3.3854   -2.05    0.043
z1           36.3110     2.6970   13.46   <2e-16
z2           -2.9033     3.6619   -0.79    0.430
z3          862.022    431.030     2.00    0.048
z4            0.4021     0.0329   12.23   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.4 on 94 degrees of freedom

Multiple R-squared: 0.855, Adjusted R-squared: 0.847

F-statistic: 111 on 5 and 94 DF, p-value: <2e-16

34 58

3. Regression with Heterogeneous Treatment Effects

35 58

Heterogeneous effects, binary treatment

• Completely randomized experiment:

  Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0)
      = Y_i(0) + (Y_i(1) - Y_i(0)) D_i
      = μ_0 + τ_i D_i + (Y_i(0) - μ_0)
      = μ_0 + τ D_i + (Y_i(0) - μ_0) + (τ_i - τ)·D_i
      = μ_0 + τ D_i + ε_i

• Error term now includes two components:
  1. "Baseline" variation in the outcome: (Y_i(0) - μ_0)
  2. Variation in the treatment effect: (τ_i - τ)
• Easy to verify that under an experiment, E[ε_i | D_i] = 0
• Thus, OLS estimates the ATE with no covariates

36 58

Adding covariates

• What happens with no unmeasured confounders? Need to condition on X_i now
• Remember identification of the ATE/ATT using iterated expectations
• ATE is the weighted sum of CATEs:

  τ = Σ_x τ(x) Pr[X_i = x]

• ATE/ATT are weighted averages of CATEs
• What about the regression estimand, τ_R? How does it relate to the ATE/ATT?

37 58

Heterogeneous effects and regression

• Let's investigate this under a saturated regression model:

  Y_i = Σ_x B_{xi} α_x + τ_R D_i + e_i

• Use a dummy variable for each unique combination of X_i: B_{xi} = 1(X_i = x)
• Linear in X_i by construction

38 58

Investigating the regression coefficient

• How can we investigate τ_R? Well, we can rely on the regression anatomy:

  τ_R = Cov(Y_i, D_i - E[D_i|X_i]) / V(D_i - E[D_i|X_i])

• D_i - E[D_i|X_i] is the residual from a regression of D_i on the full set of dummies
• With a little work, we can show:

  τ_R = E[τ(X_i)(D_i - E[D_i|X_i])²] / E[(D_i - E[D_i|X_i])²] = E[τ(X_i) σ²_d(X_i)] / E[σ²_d(X_i)]

• σ²_d(x) = V[D_i|X_i = x] is the conditional variance of treatment assignment

39 58

ATE versus OLS

  τ_R = E[τ(X_i) W_i] = Σ_x τ(x) (σ²_d(x) / E[σ²_d(X_i)]) P[X_i = x]

• Compare to the ATE:

  τ = E[τ(X_i)] = Σ_x τ(x) P[X_i = x]

• Both weight strata relative to their size, P[X_i = x]
• OLS weights strata higher if the treatment variance in those strata, σ²_d(x), is high relative to the average variance across strata, E[σ²_d(X_i)]
• The ATE weights strata only by their size

40 58

Regression weighting

  W_i = σ²_d(X_i) / E[σ²_d(X_i)]

• Why does OLS weight like this?
• OLS is a minimum-variance estimator: more weight to more precise within-strata estimates
• Within-strata estimates are most precise when the treatment is evenly spread and thus has the highest variance
• If D_i is binary, then we know the conditional variance will be:

  σ²_d(x) = P[D_i = 1 | X_i = x] (1 - P[D_i = 1 | X_i = x]) = e(x)(1 - e(x))

• Maximum variance with P[D_i = 1 | X_i = x] = 1/2

41 58

OLS weighting example

• Binary covariate:

  P[X_i = 1] = 0.75,  P[X_i = 0] = 0.25
  P[D_i = 1 | X_i = 1] = 0.9,  P[D_i = 1 | X_i = 0] = 0.5
  σ²_d(1) = 0.09,  σ²_d(0) = 0.25
  τ(1) = 1,  τ(0) = -1

• Implies the ATE is τ = 0.5
• Average conditional variance: E[σ²_d(X_i)] = 0.13
• Weights: for X_i = 1, 0.09/0.13 = 0.692; for X_i = 0, 0.25/0.13 = 1.92

  τ_R = E[τ(X_i) W_i]
      = τ(1) W(1) P[X_i = 1] + τ(0) W(0) P[X_i = 0]
      = 1 × 0.692 × 0.75 + (-1) × 1.92 × 0.25
      = 0.039

42 58
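The arithmetic on this slide is easy to reproduce in a few lines; a short script (the 0/1 positions in each vector index the value of X_i):

```r
## Reproducing the weighting example: the regression estimand tau_R
## weights CATEs by the conditional treatment variance, the ATE by size.
p.x <- c(0.25, 0.75)            # P[X_i = 0], P[X_i = 1]
e.x <- c(0.5, 0.9)              # propensity scores e(0), e(1)
tau.x <- c(-1, 1)               # CATEs tau(0), tau(1)
s2.d <- e.x * (1 - e.x)         # sigma^2_d(x) = e(x)(1 - e(x)) = 0.25, 0.09
w.x <- s2.d / sum(s2.d * p.x)   # OLS weights: 1.92 and 0.692
tau <- sum(tau.x * p.x)         # ATE = 0.5
tau.R <- sum(tau.x * w.x * p.x) # about 0.038; the slide's 0.039 rounds the weights first
c(ATE = tau, tau_R = tau.R)
```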

When will OLS estimate the ATE?

• When does τ = τ_R?
• Constant treatment effects: τ(x) = τ = τ_R
• Constant probability of treatment: e(x) = P[D_i = 1 | X_i = x] = e
  - Implies that the OLS weights are all 1
• An incorrect linearity assumption in X_i will lead to more bias

43 58

Other ways to use regression

• What's the path forward?
  - Accept the bias (might be relatively small with saturated models)
  - Use a different regression approach
• Let μ_d(x) = E[Y_i(d) | X_i = x] be the CEF for the potential outcome under D_i = d
• By consistency and no unmeasured confounders, we have μ_d(x) = E[Y_i | D_i = d, X_i = x]
• Estimate a regression of Y_i on X_i among the D_i = d group
• Then μ̂_d(x) is just a predicted value from the regression for X_i = x
• How can we use this?

44 58

Imputation estimators

• Impute the treated potential outcomes with Ŷ_i(1) = μ̂_1(X_i)
• Impute the control potential outcomes with Ŷ_i(0) = μ̂_0(X_i)
• Procedure:
  - Regress Y_i on X_i in the treated group and get predicted values for all units (treated or control)
  - Regress Y_i on X_i in the control group and get predicted values for all units (treated or control)
  - Take the average difference between these predicted values
• More mathematically, it looks like this:

  τ̂_imp = (1/N) Σ_i [μ̂_1(X_i) - μ̂_0(X_i)]

• Sometimes called an imputation estimator

45 58

Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample
• Useful trick: use a model on the entire data and model.frame() to get the right design matrix

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61

46 58

Notes on imputation estimators

• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE
• Why don't people use this?
  - Most people don't know the results we've been talking about
  - Harder to implement than vanilla OLS
• Can use linear regression to estimate μ̂_d(x) = x'β̂_d
• Recent trend is to estimate μ̂_d(x) via non-parametric methods, such as:
  - Kernel regression, local linear regression, regression trees, etc.
  - Easiest is generalized additive models (GAMs)

47 58

Imputation estimator visualization

[figure: y plotted against x, with treated and control units distinguished]

48 58

Imputation estimator visualization

[figure: y plotted against x, with treated and control units distinguished]

49 58

Imputation estimator visualization

[figure: y plotted against x, with treated and control units distinguished]

50 58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i

[figure: y plotted against x, with treated and control units distinguished]

51 58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i

[figure: y plotted against x, with treated and control units distinguished]

52 58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i

[figure: y plotted against x, with treated and control units distinguished]

53 58

Using semiparametric regression

• Here the CEFs are nonlinear, but we don't know their form
• We can use GAMs from the mgcv package to get a flexible estimate

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204  Scale est. = 0.0071351  n = 30

54 58

Using GAMs

[figure: y plotted against x, with treated and control units distinguished]

55 58

Using GAMs

[figure: y plotted against x, with treated and control units distinguished]

56 58

Using GAMs

[figure: y plotted against x, with treated and control units distinguished]

57 58

Limited dependent variables

• Usual advice: model the data from first principles! Logit/probit for binary, Poisson for counts, etc.
• OLS is a-ok with limited DVs when:
  - Binary treatment and no covariates (just diff-in-means)
  - Binary treatment, discrete covariates, and saturated models (stratified diff-in-means)
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models
  - Could go wrong in small samples
  - If using nonlinear models, always get effects on the scale of the outcome

58 58

Page 10: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

Saturated model example

bull Two binary variables 1198831113568119894 for incumbency status and 1198831113569119894 forparty of the candidate

bull Four possible values of 119883119894 four possible values of 120583(119883119894)

119864[119884119894|1198831113568119894 = 01198831113569119894 = 0] = 120572119864[119884119894|1198831113568119894 = 11198831113569119894 = 0] = 120572 + 120573119864[119884119894|1198831113568119894 = 01198831113569119894 = 1] = 120572 + 120574119864[119884119894|1198831113568119894 = 11198831113569119894 = 1] = 120572 + 120573 + 120574 + 120575

bull We can write the CEF as follows

119864[119884119894|1198831113568119894 1198831113569119894] = 120572 + 1205731198831113568119894 + 1205741198831113569119894 + 120575(11988311135681198941198831113569119894)

10 58

Saturated models example

119864[119884119894|1198831113568119894 1198831113569119894] = 120572 + 1205731198831113568119894 + 1205741198831113569119894 + 120575(11988311135681198941198831113569119894)

bull Basically each value of 120583(119883119894) is being estimated separately within-strata estimation No borrowing of information from across values of 119883119894

bull Requires a set of dummies for each categorical variable plusall interactions

bull Or a series of dummies for each unique combination of 119883119894bull This makes linearity hold mechanically and so linearity is

not an assumption Just a fact about saturated CEFs saturated models for limited dependent variables = A-OK

11 58

Saturated model example

bull Washington (AER) data on the effects of daughtersbull Wersquoll look at the relationship between voting and number of

kids (causal)

girls lt- foreignreaddta(rdquogirlsdtardquo)

head(girls[ c(rdquonamerdquo rdquototchirdquo rdquoaauwrdquo)])

name totchi aauw

1 ABERCROMBIE NEIL 0 100

2 ACKERMAN GARY L 3 88

3 ADERHOLT ROBERT B 0 0

4 ALLEN THOMAS H 2 100

5 ANDREWS ROBERT E 2 100

6 ARCHER WR 7 0

12 58

Linear model

summary(lm(aauw ~ totchi data = girls))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 6131 181 3381 lt2e-16

totchi -533 062 -859 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 42 on 1733 degrees of freedom

(5 observations deleted due to missingness)

Multiple R-squared 00408 Adjusted R-squared 00403

F-statistic 738 on 1 and 1733 DF p-value lt2e-16

13 58

Saturated modelsummary(lm(aauw ~ asfactor(totchi) data = girls))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 5641 276 2042 lt 2e-16

asfactor(totchi)1 545 411 133 01851

asfactor(totchi)2 -380 327 -116 02454

asfactor(totchi)3 -1365 345 -395 81e-05

asfactor(totchi)4 -1931 401 -482 16e-06

asfactor(totchi)5 -1546 485 -319 00015

asfactor(totchi)6 -3359 1042 -322 00013

asfactor(totchi)7 -1713 1141 -150 01336

asfactor(totchi)8 -5533 1228 -451 70e-06

asfactor(totchi)9 -5041 2408 -209 00364

asfactor(totchi)10 -5341 2090 -256 00107

asfactor(totchi)12 -5641 4153 -136 01745

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 41 on 1723 degrees of freedom

(5 observations deleted due to missingness)

Multiple R-squared 00506 Adjusted R-squared 00446

F-statistic 836 on 11 and 1723 DF p-value 184e-1414 58

Saturated model minus the constantsummary(lm(aauw ~ asfactor(totchi) - 1 data = girls))

Coefficients

Estimate Std Error t value Pr(gt|t|)

asfactor(totchi)0 5641 276 2042 lt2e-16

asfactor(totchi)1 6186 305 2031 lt2e-16

asfactor(totchi)2 5262 175 3013 lt2e-16

asfactor(totchi)3 4276 207 2062 lt2e-16

asfactor(totchi)4 3711 290 1279 lt2e-16

asfactor(totchi)5 4095 399 1027 lt2e-16

asfactor(totchi)6 2282 1005 227 00233

asfactor(totchi)7 3929 1107 355 00004

asfactor(totchi)8 108 1196 009 09278

asfactor(totchi)9 600 2392 025 08020

asfactor(totchi)10 300 2072 014 08849

asfactor(totchi)12 000 4143 000 10000

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 41 on 1723 degrees of freedom

(5 observations deleted due to missingness)

Multiple R-squared 0587 Adjusted R-squared 0584

F-statistic 204 on 12 and 1723 DF p-value lt2e-1615 58

Compare to within-strata means

bull The saturated model makes no assumptions about thebetween-strata relationships

bull Just calculates within-strata means

c1 lt- coef(lm(aauw ~ asfactor(totchi) - 1 data = girls))

c2 lt- with(girls tapply(aauw totchi mean narm = TRUE))

rbind(c1 c2)

0 1 2 3 4 5 6 7 8 9 10 12

c1 56 62 53 43 37 41 23 39 11 6 3 0

c2 56 62 53 43 37 41 23 39 11 6 3 0

16 58

Other justifications for OLS

bull Justification 2 119883prime119894 120573 is the best linear predictor (in a

mean-squared error sense) of 119884119894 Why 120573 = argmin119887 120124[(119884119894 minus 119883prime

119894 119887)1113569]

bull Justification 3 119883prime119894 120573 provides the minimum mean squared

error linear approxmiation to 119864[119884119894|119883119894]bull Even if the CEF is not linear a linear regression provides the

best linear approximation to that CEFbull Donrsquot need to believe the assumptions (linearity) in order to

use regression as a good approximation to the CEFbull Warning if the CEF is very nonlinear then this approximation

could be terrible

17 58

The error terms

bull Letrsquos define the error term 119890119894 equiv 119884119894 minus 119883prime119894 120573 so that

119884119894 = 119883prime119894 120573 + [119884119894 minus 119883prime

119894 120573] = 119883prime119894 120573 + 119890119894

bull Note the residual 119890119894 is uncorrelated with 119883119894

120124[119883119894119890119894] = 120124[119883119894(119884119894 minus 119883prime119894 120573)]

= 120124[119883119894119884119894] minus 120124[119883119894119883prime119894 120573]

= 120124[119883119894119884119894] minus 120124 1114094119883119894119883prime119894120124[119883119894119883prime

119894 ]minus1113568120124[119883119894119884119894]1114097= 120124[119883119894119884119894] minus 120124[119883119894119883prime

119894 ]120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]

= 120124[119883119894119884119894] minus 120124[119883119894119884119894] = 0

bull No assumptions on the linearity of 120124[119884119894|119883119894]

18 58

OLS estimator

bull We know the population value of 120573 is

120573 = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]

bull How do we get an estimator of thisbull Plug-in principle replace population expectation with

sample versions

=

โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1119873

1114012119894119883119894119883prime

119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

minus11135681119873

1114012119894119883119894119884119894

bull If you work through the matrix algebra this turns out to be

= (119831prime119831)minus1113568 119831prime119858

19 58

Asymptotic OLS inference

bull With this representation in hand we can write the OLSestimator as follows

= 120573 +

โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1114012

119894119883119894119883prime

119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

minus1113568

1114012119894119883119894119890119894

bull Core idea sum119894119883119894119890119894 is the sum of rvs so the CLT applies

bull That plus some simple asymptotic theory allows us to say

radic119873( minus 120573) 119873(0ฮฉ)

bull Converges in distribution to a Normal distribution with meanvector 0 and covariance matrix ฮฉ

ฮฉ = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119883prime

119894 1198901113569119894 ]120124[119883119894119883prime119894 ]minus1113568

bull No linearity assumption needed20 58

Estimating the variance

bull In large samples then

radic119873( minus 120573) sim 119873(0ฮฉ)

bull How to estimate ฮฉ Plug-in principle again

1114023ฮฉ =

โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1114012

119894119883119894119883prime

119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

minus1113568 โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1114012119894119883119894119883prime

119894 1113569119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1114012

119894119883119894119883prime

119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

minus1113568

bull Replace 119890119894 with its emprical counterpart (residuals)119894 = 119884119894 minus 119883prime

119894 bull Replace the population moments of 119883119894 with their sample

counterpartsbull The square root of the diagonals of this covariance matrix are

the ldquorobustrdquo or Huber-White standard errors that Statacommonly report

21 58

Heteroskedasticity

bull No assumptions of homoskedasticitybull Heteroskedaticity will definitely occur when

CEF is linear but the 1205901113569(119909) = 120141[119884119894|119883119894 = 119909] is not constant in 119909 119864[119884119894|119883119894] is not linear but we use the linear regression to

approxmiate it

22 58

2 Regression andCausality

23 58

Regression and causality

bull Most econometrics textbooks regression defined withoutrespect to causality

bull But then when is ldquobiasedrdquo The above derivations work forsome 120124[119884119894|119883119894]

bull The question then is when does knowing the CEF tell ussomething about causality

bull MHE argues that a regression is causal when the CEF itapproximates is causal Identification is king

bull We will show that under certain conditions a regression of theoutcome on the treatment and the covariates can recover acausal parameter but perhaps not the one in which we areinterested

24 58

Review

bull Quick reminder we have potential outcomes 119884119894(1) and 119884119894(0)and two parameters the ATE and ATT

120591 = 119864[119884119894(1) minus 119884119894(0)]120591ATT = 119864[119884119894(1) minus 119884119894(0)|119863119894 = 1]

bull We have shown in past weeks that these effects are identifiedwhen ignorability holds MHE calls this the conditionalindependence assumption (CIA)

25 58

Linear constant effects model binarytreatment

bull Experiment with a simple experiment we can rewrite theconsistency assumption to be a regression formula

119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894

= 120124[119884119894(0)] + 120591119863119894 + (119884119894(0) minus 120124[119884119894(0)])= 1205831113567 + 120591119863119894 + 1199071113567119894

bull Note that if ignorability holds (as in an experiment) for 119884119894(0)then it will also hold for 1199071113567119894 since 120124[119884119894(0)] is constant Thusthis satifies the usual assumptions for regression

26 58

Now with covariates

bull Now assume no unmeasured confounders 119884119894(119889) โŸ‚โŸ‚ 119863119894|119883119894bull We will assume a linear model for the potential outcomes

119884119894(119889) = 120572 + 120591 sdot 119889 + 120578119894

bull Remember that linearity isnrsquot an assumption if 119863119894 is binarybull Effect of 119863119894 is constant here the 120578119894 are the only source of

individual variation and we have 119864[120578119894] = 0bull Consistency assumption allows us to write this as

119884119894 = 120572 + 120591119863119894 + 120578119894

27 58

Covariates in the error

bull Letrsquos assume that 120578119894 is linear in 119883119894 120578119894 = 119883prime119894 120574 + 120584119894

bull New error is uncorrelated with 119883119894 120124[120584119894|119883119894] = 0bull This is an assumption Might be falsebull Plug into the above

120124[119884119894(119889)|119883119894] = 119864[119884119894|119863119894 119883119894] = 120572 + 120591119863119894 + 119864[120578119894|119883119894]= 120572 + 120591119863119894 + 119883prime

119894 120574 + 119864[120584119894|119883119894]= 120572 + 120591119863119894 + 119883prime

119894 120574

28 58

Summing up regression with constanteffects

bull Reviewing the assumptions wersquove used no unmeasured confounders constant treatment effects linearity of the treatmentcovariates

bull Under these we can run the following regression to estimatethe ATE 120591

119884119894 = 120572 + 120591119863119894 + 119883prime119894 120574 + 120584119894

bull Works with continuous or ordinal 119863119894 if linearity in the effect ofthese variables is truly linear

29 58

OLS, constant effects simulation

Model with linear covariates, constant 0 effect of treatment:

library(mvtnorm)
n <- 100
p <- 4
X <- rmvnorm(n = 100, mean = rep(0, p))
gamma <- c(2.74, 1.37, 1.37, 1.37)
y <- 21.0 + X %*% c(gamma) + rnorm(n)
alpha <- c(-1, 0.5, -0.5, -0.1)
d.probs <- boot::inv.logit(X %*% alpha)
d <- rbinom(n, size = 1, prob = d.probs)

30 / 58

OLS with no covariates

summary(lm(y ~ d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   22.036      0.458   48.13  < 2e-16
d             -2.869      0.654   -4.39 0.000029
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.3 on 98 degrees of freedom
Multiple R-squared: 0.164, Adjusted R-squared: 0.156
F-statistic: 19.2 on 1 and 98 DF, p-value: 0.000029

31 / 58

OLS with covariates

summary(lm(y ~ d + X))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  20.9952      0.159  132.07   <2e-16
d            -0.128       0.253   -0.50     0.62
X1            2.7368      0.126   21.75   <2e-16
X2            1.3677      0.114   12.01   <2e-16
X3            1.3673      0.130   10.51   <2e-16
X4            1.3570      0.106   12.80   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom
Multiple R-squared: 0.999, Adjusted R-squared: 0.999
F-statistic: 2.44e+04 on 5 and 94 DF, p-value: <2e-16

32 / 58

What happens with nonlinearity?

Suppose we can only observe the following covariates:

z1 <- exp(X[, 1] / 2)
z2 <- X[, 2] / (1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3] / 25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

Implies that Y_i and D_i are functions of log(Z_{i1}), Z_{i2}, Z²_{i1}Z_{i2}, Z^{1/3}_{i3} log(Z_{i1}), and Z^{1/2}_{i4}.

Regression is a nonlinear function of the observed covariates.

33 / 58

When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.8728    30.0799    0.73    0.469
d            -6.9292     3.3854   -2.05    0.043
z1           36.3110     2.6970   13.46   <2e-16
z2           -2.9033     3.6619   -0.79    0.430
z3           86.2022    43.1030    2.00    0.048
z4            0.4021     0.0329   12.23   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.4 on 94 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF, p-value: <2e-16

34 / 58

3/ Regression with Heterogeneous Treatment Effects

35 / 58

Heterogeneous effects, binary treatment

• Completely randomized experiment:

  Y_i = D_iY_i(1) + (1 - D_i)Y_i(0)
      = Y_i(0) + (Y_i(1) - Y_i(0))D_i
      = μ₀ + τ_iD_i + (Y_i(0) - μ₀)
      = μ₀ + τD_i + (Y_i(0) - μ₀) + (τ_i - τ)·D_i
      = μ₀ + τD_i + ε_i

• Error term now includes two components:
  1. "Baseline" variation in the outcome: (Y_i(0) - μ₀)
  2. Variation in the treatment effect: (τ_i - τ)
• Easy to verify that under an experiment, E[ε_i | D_i] = 0.
• Thus, OLS estimates the ATE with no covariates.

36 / 58

Adding covariates

• What happens with no unmeasured confounders? Need to condition on X_i now.
• Remember identification of the ATE/ATT using iterated expectations.
• ATE is the weighted sum of CATEs:

  τ = Σ_x τ(x) Pr[X_i = x]

• ATE/ATT are weighted averages of CATEs.
• What about the regression estimand, τ_R? How does it relate to the ATE/ATT?

37 / 58

Heterogeneous effects and regression

• Let's investigate this under a saturated regression model:

  Y_i = Σ_x B_{xi}α_x + τ_R D_i + e_i

• Use a dummy variable for each unique combination of X_i: B_{xi} = 1(X_i = x)
• Linear in X_i by construction.

38 / 58

Investigating the regression coefficient

• How can we investigate τ_R? Well, we can rely on the regression anatomy:

  τ_R = Cov(Y_i, D_i - E[D_i | X_i]) / V(D_i - E[D_i | X_i])

• D_i - E[D_i | X_i] is the residual from a regression of D_i on the full set of dummies.
• With a little work, we can show:

  τ_R = E[τ(X_i)(D_i - E[D_i | X_i])²] / E[(D_i - E[D_i | X_i])²]
      = E[τ(X_i)σ²_d(X_i)] / E[σ²_d(X_i)]

• σ²_d(x) = V[D_i | X_i = x] is the conditional variance of treatment assignment.

39 / 58
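The "little work" here is mostly iterated expectations; a sketch of the missing step, in the notation above:

```latex
\begin{aligned}
\tilde{D}_i &= D_i - E[D_i \mid X_i]
  &&\text{(treatment residual)}\\
E[\tilde{D}_i^2 \mid X_i] &= V[D_i \mid X_i] = \sigma^2_d(X_i)
  &&\text{(conditional variance)}\\
E\big[\tau(X_i)\,\tilde{D}_i^2\big]
  &= E\big[\tau(X_i)\,E[\tilde{D}_i^2 \mid X_i]\big]
   = E\big[\tau(X_i)\,\sigma^2_d(X_i)\big]
  &&\text{(iterated expectations)}\\
\tau_R &= \frac{E[\tau(X_i)\,\sigma^2_d(X_i)]}{E[\sigma^2_d(X_i)]}
  &&\text{(ratio of the two)}
\end{aligned}
```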

ATE versus OLS

  τ_R = E[τ(X_i)W_i] = Σ_x τ(x) · (σ²_d(x) / E[σ²_d(X_i)]) · P[X_i = x]

• Compare to the ATE:

  τ = E[τ(X_i)] = Σ_x τ(x) P[X_i = x]

• Both weight strata relative to their size (P[X_i = x]).
• OLS weights strata higher if the treatment variance in those strata (σ²_d(x)) is high relative to the average variance across strata (E[σ²_d(X_i)]).
• The ATE weights strata only by their size.

40 / 58

Regression weighting

  W_i = σ²_d(X_i) / E[σ²_d(X_i)]

• Why does OLS weight like this?
• OLS is a minimum-variance estimator: more weight to more precise within-strata estimates.
• Within-strata estimates are most precise when the treatment is evenly spread, and thus has the highest variance.
• If D_i is binary, then we know the conditional variance will be:

  σ²_d(x) = P[D_i = 1 | X_i = x] (1 - P[D_i = 1 | X_i = x])
          = e(x) (1 - e(x))

• Maximum variance with P[D_i = 1 | X_i = x] = 1/2.

41 / 58

OLS weighting example

• Binary covariate:

  P[X_i = 1] = 0.75,  P[X_i = 0] = 0.25
  P[D_i = 1 | X_i = 1] = 0.9,  P[D_i = 1 | X_i = 0] = 0.5
  σ²_d(1) = 0.09,  σ²_d(0) = 0.25
  τ(1) = 1,  τ(0) = -1

• Implies the ATE is τ = 0.5.
• Average conditional variance: E[σ²_d(X_i)] = 0.13.
• Weights: for X_i = 1, 0.09/0.13 = 0.692; for X_i = 0, 0.25/0.13 = 1.92.

  τ_R = E[τ(X_i)W_i]
      = τ(1)W(1)P[X_i = 1] + τ(0)W(0)P[X_i = 0]
      = 1 × 0.692 × 0.75 + -1 × 1.92 × 0.25
      = 0.039

42 / 58
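This arithmetic is easy to check directly; a sketch in Python (values from the slide; the unrounded answer is 0.0385, which matches the slide's 0.039 once the weights are rounded first):

```python
# Variance-weighting example: tau_R = sum_x tau(x) * W(x) * P[X = x],
# with W(x) = sigma^2_d(x) / E[sigma^2_d(X)]
p_x = {1: 0.75, 0: 0.25}     # P[X_i = x]
e = {1: 0.9, 0: 0.5}         # propensity scores P[D_i = 1 | X_i = x]
tau = {1: 1.0, 0: -1.0}      # conditional ATEs tau(x)

var_d = {x: e[x] * (1 - e[x]) for x in p_x}    # sigma^2_d(x) = e(x)(1 - e(x))
avg_var = sum(var_d[x] * p_x[x] for x in p_x)  # E[sigma^2_d(X_i)] = 0.13

ate = sum(tau[x] * p_x[x] for x in p_x)        # tau = 0.5
tau_R = sum(tau[x] * (var_d[x] / avg_var) * p_x[x] for x in p_x)  # about 0.038

print(ate, tau_R)
```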

When will OLS estimate the ATE?

• When does τ = τ_R?
• Constant treatment effects: τ(x) = τ = τ_R
• Constant probability of treatment: e(x) = P[D_i = 1 | X_i = x] = e
  - Implies that the OLS weights are 1.
• Incorrect linearity assumption in X_i will lead to more bias.

43 / 58

Other ways to use regression

• What's the path forward?
  - Accept the bias (might be relatively small with saturated models)
  - Use a different regression approach
• Let μ_d(x) = E[Y_i(d) | X_i = x] be the CEF for the potential outcome under D_i = d.
• By consistency and no unmeasured confounders, we have μ_d(x) = E[Y_i | D_i = d, X_i = x].
• Estimate a regression of Y_i on X_i among the D_i = d group.
• Then μ̂_d(x) is just a predicted value from the regression for X_i = x.
• How can we use this?

44 / 58

Imputation estimators

• Impute the treated potential outcomes with Ŷ_i(1) = μ̂_1(X_i).
• Impute the control potential outcomes with Ŷ_i(0) = μ̂_0(X_i).
• Procedure:
  - Regress Y_i on X_i in the treated group and get predicted values for all units (treated or control).
  - Regress Y_i on X_i in the control group and get predicted values for all units (treated or control).
  - Take the average difference between these predicted values.
• More mathematically, it looks like this:

  τ̂_imp = (1/N) Σ_i [μ̂_1(X_i) - μ̂_0(X_i)]

• Sometimes called an imputation estimator.

45 / 58
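A minimal numpy sketch of this procedure under the stated assumptions (a simulated randomized design with a known constant effect of 1.5; all names and numbers are illustrative, not from the lecture code):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
d = rng.binomial(1, 0.5, size=n)                    # randomized, so ignorability holds
y = 1.0 + 2.0 * x + 1.5 * d + rng.normal(size=n)    # true treatment effect is 1.5

X = np.column_stack([np.ones(n), x])                # design matrix with intercept

# Step 1 and 2: fit separate regressions within each treatment arm
beta1, *_ = np.linalg.lstsq(X[d == 1], y[d == 1], rcond=None)
beta0, *_ = np.linalg.lstsq(X[d == 0], y[d == 0], rcond=None)

# Step 3: impute both potential outcomes for every unit, average the difference
tau_imp = np.mean(X @ beta1 - X @ beta0)
print(tau_imp)   # close to the true effect of 1.5
```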

Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample.
• Useful trick: fit a model on the entire data and use model.frame() to get the right design matrix.

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61

46 / 58

Notes on imputation estimators

• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE.
• Why don't people use this?
  - Most people don't know the results we've been talking about.
  - Harder to implement than vanilla OLS.
• Can use linear regression to estimate μ̂_d(x) = x'β̂_d.
• Recent trend is to estimate μ̂_d(x) via non-parametric methods such as:
  - kernel regression, local linear regression, regression trees, etc.
  - Easiest is generalized additive models (GAMs).

47 / 58

Imputation estimator visualization

[Figure, slides 48-50: scatterplot of y against x, with separate fitted regression lines for the treated and control groups used to impute each unit's missing potential outcome]

48-50 / 58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i.

[Figure, slides 51-53: scatterplot of y against x with a nonlinear CEF, treated and control groups shown separately]

51-53 / 58

Using semiparametric regression

• Here, the CEFs are nonlinear, but we don't know their form.
• We can use GAMs from the mgcv package for flexible estimates:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204  Scale est. = 0.0071351  n = 30

54 / 58

Using GAMs

[Figure, slides 55-57: GAM fits for the treated and control groups, tracking the nonlinear CEFs of y given x]

55-57 / 58

Limited dependent variables

• Usual advice: model the data from first principles. Logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is a-ok with limited DVs when:
  - binary treatment and no covariates (just diff-in-means)
  - binary treatment, discrete covariates, and saturated models (stratified diff-in-means)
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong.
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models.
  - Could go wrong in small samples.
  - If using nonlinear models, always get effects on the scale of the outcome.

58 / 58
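The first "a-ok" case is easy to verify numerically: with a binary treatment and no covariates, the OLS slope equals the difference in means even when the outcome is binary. A sketch (illustrative simulated data):

```python
import numpy as np

rng = np.random.default_rng(1)
d = rng.binomial(1, 0.4, size=500)                # binary treatment
y = rng.binomial(1, np.where(d == 1, 0.7, 0.5))   # binary (limited) outcome

# OLS of y on d with an intercept
X = np.column_stack([np.ones_like(d), d]).astype(float)
beta, *_ = np.linalg.lstsq(X, y.astype(float), rcond=None)

diff_in_means = y[d == 1].mean() - y[d == 0].mean()
print(beta[1], diff_in_means)   # the slope is exactly the difference in means
```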

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects

Saturated models example

  E[Y_i | X_{1i}, X_{2i}] = α + βX_{1i} + γX_{2i} + δ(X_{1i}X_{2i})

• Basically, each value of μ(X_i) is being estimated separately:
  - within-strata estimation
  - No borrowing of information from across values of X_i
• Requires a set of dummies for each categorical variable plus all interactions.
• Or a series of dummies for each unique combination of X_i.
• This makes linearity hold mechanically, and so linearity is not an assumption.
  - Just a fact about saturated CEFs.
  - Saturated models for limited dependent variables = A-OK.

11 / 58

Saturated model example

• Washington (AER) data on the effects of daughters.
• We'll look at the relationship between voting and number of kids (causal?).

girls <- foreign::read.dta("girls.dta")
head(girls[, c("name", "totchi", "aauw")])

               name totchi aauw
1 ABERCROMBIE NEIL       0  100
2  ACKERMAN GARY L       3   88
3 ADERHOLT ROBERT B      0    0
4   ALLEN THOMAS H       2  100
5 ANDREWS ROBERT E       2  100
6         ARCHER WR      7    0

12 / 58

Linear model

summary(lm(aauw ~ totchi, data = girls))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    61.31       1.81   33.81   <2e-16
totchi         -5.33       0.62   -8.59   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 42 on 1733 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.0408, Adjusted R-squared: 0.0403
F-statistic: 73.8 on 1 and 1733 DF, p-value: <2e-16

13 / 58

Saturated model

summary(lm(aauw ~ as.factor(totchi), data = girls))

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)            56.41       2.76   20.42  < 2e-16
as.factor(totchi)1      5.45       4.11    1.33   0.1851
as.factor(totchi)2     -3.80       3.27   -1.16   0.2454
as.factor(totchi)3    -13.65       3.45   -3.95  8.1e-05
as.factor(totchi)4    -19.31       4.01   -4.82  1.6e-06
as.factor(totchi)5    -15.46       4.85   -3.19   0.0015
as.factor(totchi)6    -33.59      10.42   -3.22   0.0013
as.factor(totchi)7    -17.13      11.41   -1.50   0.1336
as.factor(totchi)8    -55.33      12.28   -4.51  7.0e-06
as.factor(totchi)9    -50.41      24.08   -2.09   0.0364
as.factor(totchi)10   -53.41      20.90   -2.56   0.0107
as.factor(totchi)12   -56.41      41.53   -1.36   0.1745
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41 on 1723 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.0506, Adjusted R-squared: 0.0446
F-statistic: 8.36 on 11 and 1723 DF, p-value: 1.84e-14

14 / 58

Saturated model, minus the constant

summary(lm(aauw ~ as.factor(totchi) - 1, data = girls))

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
as.factor(totchi)0     56.41       2.76   20.42   <2e-16
as.factor(totchi)1     61.86       3.05   20.31   <2e-16
as.factor(totchi)2     52.62       1.75   30.13   <2e-16
as.factor(totchi)3     42.76       2.07   20.62   <2e-16
as.factor(totchi)4     37.11       2.90   12.79   <2e-16
as.factor(totchi)5     40.95       3.99   10.27   <2e-16
as.factor(totchi)6     22.82      10.05    2.27   0.0233
as.factor(totchi)7     39.29      11.07    3.55   0.0004
as.factor(totchi)8      1.08      11.96    0.09   0.9278
as.factor(totchi)9      6.00      23.92    0.25   0.8020
as.factor(totchi)10     3.00      20.72    0.14   0.8849
as.factor(totchi)12     0.00      41.43    0.00   1.0000
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41 on 1723 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.587, Adjusted R-squared: 0.584
F-statistic: 204 on 12 and 1723 DF, p-value: <2e-16

15 / 58

Compare to within-strata means

• The saturated model makes no assumptions about the between-strata relationships.
• It just calculates within-strata means:

c1 <- coef(lm(aauw ~ as.factor(totchi) - 1, data = girls))
c2 <- with(girls, tapply(aauw, totchi, mean, na.rm = TRUE))
rbind(c1, c2)

    0  1  2  3  4  5  6  7  8 9 10 12
c1 56 62 53 43 37 41 23 39 11 6  3  0
c2 56 62 53 43 37 41 23 39 11 6  3  0

16 / 58
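The same equivalence is easy to reproduce with any least-squares routine; a sketch in Python with a one-hot design matrix (simulated data, not the girls data):

```python
import numpy as np

rng = np.random.default_rng(2)
strata = rng.integers(0, 4, size=300)       # categorical covariate with 4 levels
y = 10.0 * strata + rng.normal(size=300)    # outcome varies by stratum

# Saturated, no-intercept design: one dummy per stratum
B = (strata[:, None] == np.arange(4)[None, :]).astype(float)
coefs, *_ = np.linalg.lstsq(B, y, rcond=None)

# Within-stratum means
means = np.array([y[strata == k].mean() for k in range(4)])

print(np.round(coefs, 3), np.round(means, 3))   # the two agree
```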

Other justifications for OLS

• Justification 2: X_i'β is the best linear predictor (in a mean-squared error sense) of Y_i.
  - Why? β = argmin_b E[(Y_i - X_i'b)²]
• Justification 3: X_i'β provides the minimum mean squared error linear approximation to E[Y_i | X_i].
• Even if the CEF is not linear, a linear regression provides the best linear approximation to that CEF.
• Don't need to believe the assumptions (linearity) in order to use regression as a good approximation to the CEF.
• Warning: if the CEF is very nonlinear, then this approximation could be terrible!

17 / 58

The error terms

• Let's define the error term e_i ≡ Y_i - X_i'β, so that:

  Y_i = X_i'β + [Y_i - X_i'β] = X_i'β + e_i

• Note that the error e_i is uncorrelated with X_i:

  E[X_ie_i] = E[X_i(Y_i - X_i'β)]
            = E[X_iY_i] - E[X_iX_i'β]
            = E[X_iY_i] - E[X_iX_i' E[X_iX_i']⁻¹ E[X_iY_i]]
            = E[X_iY_i] - E[X_iX_i'] E[X_iX_i']⁻¹ E[X_iY_i]
            = E[X_iY_i] - E[X_iY_i] = 0

• No assumptions on the linearity of E[Y_i | X_i].

18 / 58

OLS estimator

• We know the population value of β is:

  β = E[X_iX_i']⁻¹ E[X_iY_i]

• How do we get an estimator of this?
• Plug-in principle! Replace population expectations with their sample versions:

  β̂ = [ (1/N) Σ_i X_iX_i' ]⁻¹ (1/N) Σ_i X_iY_i

• If you work through the matrix algebra, this turns out to be:

  β̂ = (X'X)⁻¹ X'y

19 / 58
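A quick numerical check that the plug-in moment formula and the matrix form agree; a sketch in Python (simulated data, names illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=N)

# Plug-in estimator: sample versions of E[X X'] and E[X Y]
Sxx = X.T @ X / N
Sxy = X.T @ y / N
beta_plugin = np.linalg.solve(Sxx, Sxy)

# Equivalent matrix form (X'X)^{-1} X'y via a standard solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_plugin, beta_lstsq)   # the two agree
```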

Asymptotic OLS inference

• With this representation in hand, we can write the OLS estimator as follows:

  β̂ = β + [ Σ_i X_iX_i' ]⁻¹ Σ_i X_ie_i

• Core idea: Σ_i X_ie_i is a sum of r.v.s, so the CLT applies.
• That, plus some simple asymptotic theory, allows us to say:

  √N(β̂ - β) ⇝ N(0, Ω)

• Converges in distribution to a Normal distribution with mean vector 0 and covariance matrix Ω:

  Ω = E[X_iX_i']⁻¹ E[X_iX_i'e_i²] E[X_iX_i']⁻¹

• No linearity assumption needed!

20 / 58

Estimating the variance

• In large samples, then:

  √N(β̂ - β) ∼ N(0, Ω)

• How to estimate Ω? Plug-in principle again!

  Ω̂ = [ Σ_i X_iX_i' ]⁻¹ [ Σ_i X_iX_i'ê_i² ] [ Σ_i X_iX_i' ]⁻¹

• Replace e_i with its empirical counterpart, the residuals ê_i = Y_i - X_i'β̂.
• Replace the population moments of X_i with their sample counterparts.
• The square roots of the diagonal of this covariance matrix are the "robust" or Huber-White standard errors that Stata commonly reports.

21 / 58
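The Ω̂ sandwich above, in its summed finite-sample form, maps directly to code; a sketch of HC0 robust standard errors in Python (numpy only; data and names are illustrative, and this is the plain HC0 variant with no small-sample corrections):

```python
import numpy as np

def hc0_se(X, y):
    """HC0 (Huber-White) standard errors for OLS via the sandwich formula."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    bread = np.linalg.inv(X.T @ X)
    meat = X.T @ (X * resid[:, None] ** 2)   # sum_i x_i x_i' ehat_i^2
    V = bread @ meat @ bread                 # sandwich estimate of Var(beta_hat)
    return beta, np.sqrt(np.diag(V))

rng = np.random.default_rng(4)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=N)])
# Heteroskedastic errors: the error variance grows with |x|
y = X @ np.array([1.0, 2.0]) + rng.normal(size=N) * (1 + np.abs(X[:, 1]))

beta, se = hc0_se(X, y)
print(beta, se)
```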

Heteroskedasticity

• No assumption of homoskedasticity here!
• Heteroskedasticity will definitely occur when:
  - the CEF is linear, but σ²(x) = V[Y_i | X_i = x] is not constant in x
  - E[Y_i | X_i] is not linear, but we use linear regression to approximate it

22 / 58

2/ Regression and Causality

23 / 58

Regression and causality

• In most econometrics textbooks, regression is defined without respect to causality.
• But then, when is β̂ "biased"? The above derivations work for some CEF, E[Y_i | X_i].
• The question, then, is: when does knowing the CEF tell us something about causality?
• MHE argues that a regression is causal when the CEF it approximates is causal. Identification is king.
• We will show that, under certain conditions, a regression of the outcome on the treatment and the covariates can recover a causal parameter, but perhaps not the one in which we are interested.

24 / 58

Review

• Quick reminder: we have potential outcomes Y_i(1) and Y_i(0) and two parameters, the ATE and ATT:

  τ = E[Y_i(1) - Y_i(0)]
  τ_ATT = E[Y_i(1) - Y_i(0) | D_i = 1]

• We have shown in past weeks that these effects are identified when ignorability holds. MHE calls this the conditional independence assumption (CIA).

25 / 58

Linear constant effects model, binary treatment

• Experiment: with a simple experiment, we can rewrite the consistency assumption as a regression formula:

  Y_i = D_iY_i(1) + (1 - D_i)Y_i(0)
      = Y_i(0) + (Y_i(1) - Y_i(0))D_i
      = E[Y_i(0)] + τD_i + (Y_i(0) - E[Y_i(0)])
      = μ₀ + τD_i + v_{0i}

• Note that if ignorability holds (as in an experiment) for Y_i(0), then it will also hold for v_{0i}, since E[Y_i(0)] is a constant. Thus this satisfies the usual assumptions for regression.

26 / 58

Now with covariates

bull Now assume no unmeasured confounders 119884119894(119889) โŸ‚โŸ‚ 119863119894|119883119894bull We will assume a linear model for the potential outcomes

119884119894(119889) = 120572 + 120591 sdot 119889 + 120578119894

bull Remember that linearity isnrsquot an assumption if 119863119894 is binarybull Effect of 119863119894 is constant here the 120578119894 are the only source of

individual variation and we have 119864[120578119894] = 0bull Consistency assumption allows us to write this as

119884119894 = 120572 + 120591119863119894 + 120578119894

27 58

Covariates in the error

bull Letrsquos assume that 120578119894 is linear in 119883119894 120578119894 = 119883prime119894 120574 + 120584119894

bull New error is uncorrelated with 119883119894 120124[120584119894|119883119894] = 0bull This is an assumption Might be falsebull Plug into the above

120124[119884119894(119889)|119883119894] = 119864[119884119894|119863119894 119883119894] = 120572 + 120591119863119894 + 119864[120578119894|119883119894]= 120572 + 120591119863119894 + 119883prime

119894 120574 + 119864[120584119894|119883119894]= 120572 + 120591119863119894 + 119883prime

119894 120574

28 58

Summing up regression with constanteffects

bull Reviewing the assumptions wersquove used no unmeasured confounders constant treatment effects linearity of the treatmentcovariates

bull Under these we can run the following regression to estimatethe ATE 120591

119884119894 = 120572 + 120591119863119894 + 119883prime119894 120574 + 120584119894

bull Works with continuous or ordinal 119863119894 if linearity in the effect ofthese variables is truly linear

29 58

OLS constant effects simulation

Model with linear covariates constant 0 effect of treatment

library(mvtnorm)

n lt- 100

p lt- 4

X lt- rmvnorm(n = 100 mean = rep(0 p))

gamma lt- c(274 137 137 137)

y lt- 210 + X c(gamma) + rnorm(n)

alpha lt- c(-1 05 -05 -01)

dprobs lt- bootinvlogit(X alpha)

d lt- rbinom(n size = 1 prob = dprobs)

30 58

OLS with no covariates

summary(lm(y ~ d))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 22036 458 4813 lt 2e-16

d -2869 654 -439 0000029

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 33 on 98 degrees of freedom

Multiple R-squared 0164 Adjusted R-squared 0156

F-statistic 192 on 1 and 98 DF p-value 0000029

31 58

OLS with covariates

summary(lm(y ~ d + X))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 209952 0159 13207 lt2e-16

d -0128 0253 -05 062

X1 27368 0126 2175 lt2e-16

X2 13677 0114 1201 lt2e-16

X3 13673 0130 1051 lt2e-16

X4 13570 0106 1280 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 1 on 94 degrees of freedom

Multiple R-squared 0999 Adjusted R-squared 0999

F-statistic 244e+04 on 5 and 94 DF p-value lt2e-16

32 58

What happens with nonlinearity

Suppose we can only observe the following covariates

z1 lt- exp(X[ 1]2)

z2 lt- X[ 2](1 + exp(X[ 1])) + 10

z3 lt- (X[ 1] X[ 3]25 + 06)^3

z4 lt- (X[ 2] + X[ 4] + 20)^2

Implies that 119884119894 and 119863119894 are functions of log(1198851198941113568) 1198851198941113569 119885111356911989411135681198851198941113569

1 log(1198851198941113568) 1198851198941113570 log(1198851198941113568) and 119883111356811135691198941113571

Regression is a nonlinear function of the observed covariates

33 58

When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 218728 300799 073 0469

d -69292 33854 -205 0043

z1 363110 26970 1346 lt2e-16

z2 -29033 36619 -079 0430

z3 862022 431030 200 0048

z4 04021 00329 1223 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 14 on 94 degrees of freedom

Multiple R-squared 0855 Adjusted R-squared 0847

F-statistic 111 on 5 and 94 DF p-value lt2e-16

34 58

3 Regression withHeterogeneousTreatment Effects

35 58

Heterogeneous effects binary treatment

bull Completely randomized experiment

119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894

= 1205831113567 + 120591119894119863119894 + (119884119894(0) minus 1205831113567)= 1205831113567 + 120591119863119894 + (119884119894(0) minus 1205831113567) + (120591119894 minus 120591) sdot 119863119894

= 1205831113567 + 120591119863119894 + 120576119894

bull Error term now includes two components1 ldquoBaselinerdquo variation in the outcome (119884119894(0) minus 1205831113567)2 Variation in the treatment effect (120591119894 minus 120591)

bull Easy to verify that under experiment 120124[120576119894|119863119894] = 0bull Thus OLS estimates the ATE with no covariates

36 58

Adding covariates

bull What happens with no unmeasured confounders Need tocondition on 119883119894 now

bull Remember identification of the ATEATT using iteratedexpectations

bull ATE is the weighted sum of CATEs

120591 = 1114012119909120591(119909) Pr[119883119894 = 119909]

bull ATEATT are weighted averages of CATEsbull What about the regression estimand 120591119877 How does it related

to the ATEATT

37 58

Heterogeneous effects and regression

bull Letrsquos investigate this under a saturated regression model

119884119894 = 1114012119909119861119909119894120572119909 + 120591119877119863119894 + 119890119894

bull Use a dummy variable for each unique combination of 119883119894119861119909119894 = 120128(119883119894 = 119909)

bull Linear in 119883119894 by construction

38 58

Investigating the regression coefficient

bull How can we investigate 120591119877 Well we can rely on theregression anatomy

120591119877 = Cov(119884119894 119863119894 minus 119864[119863119894|119883119894])120141(119863119894 minus 119864[119863119894|119883119894])

bull 119863119894 minus 120124[119863119894|119883119894] is the residual from a regression of 119863119894 on the fullset of dummies

bull With a little work we can show

120591119877 =120124 1114094120591(119883119894)(119863119894 minus 120124[119863119894|119883119894])11135691114097

120124[(119863119894 minus 119864[119863119894|119883119894])1113569]=

120124[120591(119883119894)1205901113569119889(119883119894)]120124[1205901113569119889(119883119894)]

bull 1205901113569119889(119909) = 120141[119863119894|119883119894 = 119909] is the conditional variance of treatmentassignment

39 58

ATE versus OLS

120591119877 = 120124[120591(119883119894)119882119894] = 1114012119909120591(119909)

1205901113569119889(119909)120124[1205901113569119889(119883119894)]

โ„™[119883119894 = 119909]

bull Compare to the ATE

120591 = 120124[120591(119883119894)] = 1114012119909120591(119909)โ„™[119883119894 = 119909]

bull Both weight strata relative to their size (โ„™[119883119894 = 119909])bull OLS weights strata higher if the treatment variance in those

strata (1205901113569119889(119909)) is higher in those strata relative to the averagevariance across strata (120124[1205901113569119889(119883119894)])

bull The ATE weights only by their size

40 58

Regression weighting

119882119894 =1205901113569119889(119883119894)

120124[1205901113569119889(119883119894)]

bull Why does OLS weight like thisbull OLS is a minimum-variance estimator more weight to

more precise within-strata estimatesbull Within-strata estimates are most precise when the treatment

is evenly spread and thus has the highest variancebull If 119863119894 is binary then we know the conditional variance will be

1205901113569119889(119909) = โ„™[119863119894 = 1|119883119894 = 119909] (1 minus โ„™[119863119894 = 1|119883119894 = 119909])= 119890(119909) (1 minus 119890(119909))

bull Maximum variance with โ„™[119863119894 = 1|119883119894 = 119909] = 12

41 58

OLS weighting examplebull Binary covariate

โ„™[119883119894 = 1] = 075 โ„™[119883119894 = 0] = 025โ„™[119863119894 = 1|119883119894 = 1] = 09 โ„™[119863119894 = 1|119883119894 = 0] = 05

1205901113569119889(1) = 009 1205901113569119889(0) = 025120591(1) = 1 120591(0) = minus1

bull Implies the ATE is 120591 = 05bull Average conditional variance 120124[1205901113569119889(119883119894)] = 013bull weights for 119883119894 = 1 are 009013 = 0692 for 119883119894 = 0

025013 = 192120591119877 = 120124[120591(119883119894)119882119894]

= 120591(1)119882(1)โ„™[119883119894 = 1] + 120591(0)119882(0)โ„™[119883119894 = 0]= 1 times 0692 times 075 + minus1 times 192 times 025= 0039

42 58

When will OLS estimate the ATE

bull When does 120591 = 120591119877bull Constant treatment effects 120591(119909) = 120591 = 120591119877bull Constant probability of treatment 119890(119909) = โ„™[119863119894 = 1|119883119894 = 119909] = 119890

Implies that the OLS weights are 1

bull Incorrect linearity assumption in 119883119894 will lead to more bias

43 58

Other ways to use regression

bull Whatrsquos the path forward Accept the bias (might be relatively small with saturated

models) Use a different regression approach

bull Let 120583119889(119909) = 120124[119884119894(119889)|119883119894 = 119909] be the CEF for the potentialoutcome under 119863119894 = 119889

bull By consistency and nuc we have 120583119889(119909) = 120124[119884119894|119863119894 = 119889119883119894 = 119909]bull Estimate a regression of 119884119894 on 119883119894 among the 119863119894 = 119889 groupbull Then 119889(119909) is just a predicted value from the regression for

119883119894 = 119909bull How can we use this

44 58

Imputation estimators

• Impute the treated potential outcomes with Ŷ_i(1) = μ̂_1(X_i)
• Impute the control potential outcomes with Ŷ_i(0) = μ̂_0(X_i)
• Procedure:
  1. Regress Y_i on X_i in the treated group and get predicted values for all units (treated or control)
  2. Regress Y_i on X_i in the control group and get predicted values for all units (treated or control)
  3. Take the average difference between these predicted values
• More mathematically, it looks like this:

τ̂_imp = (1/N) Σ_i [μ̂_1(X_i) − μ̂_0(X_i)]

• Sometimes called an imputation estimator

45 / 58

Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample
• Useful trick: fit a model on the entire data and use model.frame() to get the right design matrix

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61

46 / 58

Notes on imputation estimators

• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE
• Why don't people use this?
  • Most people don't know the results we've been talking about
  • Harder to implement than vanilla OLS
• Can use linear regression to estimate μ̂_d(x) = x′β̂_d
• Recent trend is to estimate μ̂_d(x) via non-parametric methods such as:
  • Kernel regression, local linear regression, regression trees, etc.
  • Easiest is generalized additive models (GAMs)

47 / 58
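As a minimal illustration of the procedure, here is a sketch with simulated data (in Python for concreteness; the data-generating values are hypothetical, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
d = rng.binomial(1, 0.5, size=n)
y = 1.0 + 2.0 * d + 0.5 * x + rng.normal(size=n)  # true ATE = 2 by construction

# Fit mu_hat_d(x) by linear regression within each treatment group
X = np.column_stack([np.ones(n), x])
beta1 = np.linalg.lstsq(X[d == 1], y[d == 1], rcond=None)[0]  # treated-group fit
beta0 = np.linalg.lstsq(X[d == 0], y[d == 0], rcond=None)[0]  # control-group fit

# Impute both potential outcomes for every unit, then average the difference
tau_imp = np.mean(X @ beta1 - X @ beta0)
```

Swapping the within-group linear fits for any other consistent regression (a GAM, a tree ensemble) leaves the rest of the procedure unchanged.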

Imputation estimator visualization

[Figure: scatterplot of y against x, treated and control units shown separately]

48 / 58

Imputation estimator visualization

[Figure: scatterplot of y against x, treated and control units shown separately]

49 / 58

Imputation estimator visualization

[Figure: scatterplot of y against x, treated and control units shown separately]

50 / 58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i

[Figure: y against x with nonlinear CEFs, treated and control groups]

51 / 58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i

[Figure: y against x with nonlinear CEFs, treated and control groups]

52 / 58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i

[Figure: y against x with nonlinear CEFs, treated and control groups]

53 / 58

Using semiparametric regression

• Here the CEFs are nonlinear, but we don't know their form
• We can use GAMs from the mgcv package for flexible estimation:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204  Scale est. = 0.0071351  n = 30

54 / 58

Using GAMs

[Figure: GAM fits of y against x for treated and control groups]

55 / 58

Using GAMs

[Figure: GAM fits of y against x for treated and control groups]

56 / 58

Using GAMs

[Figure: GAM fits of y against x for treated and control groups]

57 / 58

Limited dependent variables

• Usual advice: model the data from first principles
  • Logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is a-ok with limited DVs when:
  • Binary treatment and no covariates (just diff-in-means)
  • Binary treatment, discrete covariates, and saturated models (stratified diff-in-means)
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models
  • Could go wrong in small samples
  • If using nonlinear models, always get effects on the scale of the outcome

58 / 58
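The first "a-ok" case is easy to verify numerically: with a binary outcome and a lone binary treatment, the OLS slope is exactly the difference in means. A sketch with simulated data (the probabilities here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
d = rng.binomial(1, 0.4, size=n)                 # binary treatment
y = rng.binomial(1, np.where(d == 1, 0.7, 0.5))  # binary (limited) outcome

# OLS of y on an intercept and d
X = np.column_stack([np.ones(n), d])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

# Difference in means between treated and control
diff_means = y[d == 1].mean() - y[d == 0].mean()
```

The two agree up to floating-point error, with no distributional assumption about the binary outcome.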

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects

Saturated model example

• Washington (AER) data on the effects of daughters
• We'll look at the relationship between voting and number of kids (causal?)

girls <- foreign::read.dta("girls.dta")
head(girls[, c("name", "totchi", "aauw")])

               name totchi aauw
1  ABERCROMBIE NEIL      0  100
2   ACKERMAN GARY L      3   88
3 ADERHOLT ROBERT B      0    0
4    ALLEN THOMAS H      2  100
5  ANDREWS ROBERT E      2  100
6         ARCHER WR      7    0

12 / 58

Linear model

summary(lm(aauw ~ totchi, data = girls))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    61.31       1.81   33.81   <2e-16
totchi         -5.33       0.62   -8.59   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 42 on 1733 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.0408, Adjusted R-squared: 0.0403
F-statistic: 73.8 on 1 and 1733 DF, p-value: <2e-16

13 / 58

Saturated model

summary(lm(aauw ~ as.factor(totchi), data = girls))

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)            56.41       2.76   20.42  < 2e-16
as.factor(totchi)1      5.45       4.11    1.33   0.1851
as.factor(totchi)2     -3.80       3.27   -1.16   0.2454
as.factor(totchi)3    -13.65       3.45   -3.95  8.1e-05
as.factor(totchi)4    -19.31       4.01   -4.82  1.6e-06
as.factor(totchi)5    -15.46       4.85   -3.19   0.0015
as.factor(totchi)6    -33.59      10.42   -3.22   0.0013
as.factor(totchi)7    -17.13      11.41   -1.50   0.1336
as.factor(totchi)8    -55.33      12.28   -4.51  7.0e-06
as.factor(totchi)9    -50.41      24.08   -2.09   0.0364
as.factor(totchi)10   -53.41      20.90   -2.56   0.0107
as.factor(totchi)12   -56.41      41.53   -1.36   0.1745
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41 on 1723 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.0506, Adjusted R-squared: 0.0446
F-statistic: 8.36 on 11 and 1723 DF, p-value: 1.84e-14

14 / 58

Saturated model minus the constant

summary(lm(aauw ~ as.factor(totchi) - 1, data = girls))

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
as.factor(totchi)0     56.41       2.76   20.42   <2e-16
as.factor(totchi)1     61.86       3.05   20.31   <2e-16
as.factor(totchi)2     52.62       1.75   30.13   <2e-16
as.factor(totchi)3     42.76       2.07   20.62   <2e-16
as.factor(totchi)4     37.11       2.90   12.79   <2e-16
as.factor(totchi)5     40.95       3.99   10.27   <2e-16
as.factor(totchi)6     22.82      10.05    2.27   0.0233
as.factor(totchi)7     39.29      11.07    3.55   0.0004
as.factor(totchi)8      1.08      11.96    0.09   0.9278
as.factor(totchi)9      6.00      23.92    0.25   0.8020
as.factor(totchi)10     3.00      20.72    0.14   0.8849
as.factor(totchi)12     0.00      41.43    0.00   1.0000
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41 on 1723 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.587, Adjusted R-squared: 0.584
F-statistic: 204 on 12 and 1723 DF, p-value: <2e-16

15 / 58

Compare to within-strata means

• The saturated model makes no assumptions about the between-strata relationships
• Just calculates within-strata means:

c1 <- coef(lm(aauw ~ as.factor(totchi) - 1, data = girls))
c2 <- with(girls, tapply(aauw, totchi, mean, na.rm = TRUE))
rbind(c1, c2)

    0  1  2  3  4  5  6  7  8 9 10 12
c1 56 62 53 43 37 41 23 39 11 6  3  0
c2 56 62 53 43 37 41 23 39 11 6  3  0

16 / 58

Other justifications for OLS

• Justification 2: X_i′β is the best linear predictor (in a mean-squared error sense) of Y_i
  • Why? β = argmin_b E[(Y_i − X_i′b)²]
• Justification 3: X_i′β provides the minimum mean squared error linear approximation to E[Y_i | X_i]
• Even if the CEF is not linear, a linear regression provides the best linear approximation to that CEF
• Don't need to believe the assumptions (linearity) in order to use regression as a good approximation to the CEF
• Warning: if the CEF is very nonlinear, then this approximation could be terrible

17 / 58

The error terms

• Let's define the error term e_i ≡ Y_i − X_i′β, so that

Y_i = X_i′β + [Y_i − X_i′β] = X_i′β + e_i

• Note that the residual e_i is uncorrelated with X_i:

E[X_i e_i] = E[X_i (Y_i − X_i′β)]
           = E[X_i Y_i] − E[X_i X_i′ β]
           = E[X_i Y_i] − E[X_i X_i′ E[X_i X_i′]⁻¹ E[X_i Y_i]]
           = E[X_i Y_i] − E[X_i X_i′] E[X_i X_i′]⁻¹ E[X_i Y_i]
           = E[X_i Y_i] − E[X_i Y_i] = 0

• No assumptions on the linearity of E[Y_i | X_i]

18 / 58
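The population identity E[X_i e_i] = 0 has an exact sample analogue: the normal equations force the residuals to be orthogonal to the regressors, even when the CEF is deliberately nonlinear. A quick numerical sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 300
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y = np.sin(X[:, 1]) + rng.normal(size=N)  # nonlinear CEF on purpose

beta = np.linalg.solve(X.T @ X, X.T @ y)  # best linear predictor coefficients
e = y - X @ beta                          # residuals
moment = X.T @ e / N                      # sample analogue of E[X_i e_i]
```

Every entry of `moment` is zero up to floating-point error — by construction, not because the linear model is right.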

OLS estimator

• We know the population value of β is

β = E[X_i X_i′]⁻¹ E[X_i Y_i]

• How do we get an estimator of this?
• Plug-in principle: replace the population expectations with their sample versions:

β̂ = [ (1/N) Σ_i X_i X_i′ ]⁻¹ (1/N) Σ_i X_i Y_i

• If you work through the matrix algebra, this turns out to be

β̂ = (X′X)⁻¹ X′y

19 / 58
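The two expressions are algebraically identical (the 1/N factors cancel), which a quick check confirms (a Python sketch with simulated data):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=N)

# Plug-in form: sample moments standing in for E[X X'] and E[X Y]
beta_moments = np.linalg.solve(X.T @ X / N, X.T @ y / N)
# Matrix-algebra form: (X'X)^{-1} X'y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
```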

Asymptotic OLS inference

• With this representation in hand, we can write the OLS estimator as follows:

β̂ = β + [ Σ_i X_i X_i′ ]⁻¹ Σ_i X_i e_i

• Core idea: Σ_i X_i e_i is a sum of r.v.s, so the CLT applies
• That plus some simple asymptotic theory allows us to say

√N (β̂ − β) ⇝ N(0, Ω)

• Converges in distribution to a Normal distribution with mean vector 0 and covariance matrix Ω:

Ω = E[X_i X_i′]⁻¹ E[X_i X_i′ e_i²] E[X_i X_i′]⁻¹

• No linearity assumption needed

20 / 58

Estimating the variance

• In large samples, then:

√N (β̂ − β) ∼ N(0, Ω)

• How to estimate Ω? Plug-in principle again:

Ω̂ = [ Σ_i X_i X_i′ ]⁻¹ [ Σ_i X_i X_i′ ê_i² ] [ Σ_i X_i X_i′ ]⁻¹

• Replace e_i with its empirical counterpart (the residuals): ê_i = Y_i − X_i′β̂
• Replace the population moments of X_i with their sample counterparts
• The square roots of the diagonals of this covariance matrix are the "robust" or Huber-White standard errors that Stata commonly reports

21 / 58
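A sketch of this plug-in sandwich computed by hand on simulated heteroskedastic data (variable names here are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=N)])
# heteroskedastic errors: noise scale depends on the regressor
y = X @ np.array([1.0, 2.0]) + rng.normal(size=N) * (1 + np.abs(X[:, 1]))

beta = np.linalg.solve(X.T @ X, X.T @ y)
e_hat = y - X @ beta                       # residuals
bread = np.linalg.inv(X.T @ X)             # (sum_i X_i X_i')^{-1}
meat = (X * e_hat[:, None] ** 2).T @ X     # sum_i X_i X_i' e_hat_i^2
vcov = bread @ meat @ bread                # sandwich estimate of Var(beta_hat)
robust_se = np.sqrt(np.diag(vcov))         # Huber-White standard errors
```

This is the HC0 flavor; in R, `sandwich::vcovHC()` computes the same quantity (plus small-sample corrections).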

Heteroskedasticity

• No assumptions of homoskedasticity here
• Heteroskedasticity will definitely occur when:
  • The CEF is linear, but σ²(x) = V[Y_i | X_i = x] is not constant in x
  • E[Y_i | X_i] is not linear, but we use linear regression to approximate it

22 / 58

2. Regression and Causality

23 / 58

Regression and causality

• In most econometrics textbooks, regression is defined without respect to causality
• But then when is β̂ "biased"? The above derivations work for some CEF E[Y_i | X_i]
• The question then is: when does knowing the CEF tell us something about causality?
• MHE argues that a regression is causal when the CEF it approximates is causal. Identification is king!
• We will show that under certain conditions, a regression of the outcome on the treatment and the covariates can recover a causal parameter — but perhaps not the one in which we are interested

24 / 58

Review

• Quick reminder: we have potential outcomes Y_i(1) and Y_i(0) and two parameters, the ATE and the ATT:

τ = E[Y_i(1) − Y_i(0)]
τ_ATT = E[Y_i(1) − Y_i(0) | D_i = 1]

• We have shown in past weeks that these effects are identified when ignorability holds. MHE calls this the conditional independence assumption (CIA)

25 / 58

Linear constant effects model, binary treatment

• With a simple experiment, we can rewrite the consistency assumption as a regression formula:

Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
    = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
    = E[Y_i(0)] + τ D_i + (Y_i(0) − E[Y_i(0)])
    = μ_0 + τ D_i + v_{0i}

• Note that if ignorability holds (as in an experiment) for Y_i(0), then it will also hold for v_{0i}, since E[Y_i(0)] is constant. Thus this satisfies the usual assumptions for regression

26 / 58

Now with covariates

• Now assume no unmeasured confounders: Y_i(d) ⊥⊥ D_i | X_i
• We will assume a linear model for the potential outcomes:

Y_i(d) = α + τ·d + η_i

• Remember that linearity isn't an assumption if D_i is binary
• The effect of D_i is constant here; the η_i are the only source of individual variation, and we have E[η_i] = 0
• The consistency assumption allows us to write this as

Y_i = α + τ D_i + η_i

27 / 58

Covariates in the error

• Let's assume that η_i is linear in X_i: η_i = X_i′γ + ν_i
• The new error is uncorrelated with X_i: E[ν_i | X_i] = 0
  • This is an assumption! Might be false!
• Plug into the above:

E[Y_i(d) | X_i] = E[Y_i | D_i, X_i] = α + τ D_i + E[η_i | X_i]
                = α + τ D_i + X_i′γ + E[ν_i | X_i]
                = α + τ D_i + X_i′γ

28 / 58

Summing up regression with constant effects

• Reviewing the assumptions we've used:
  • no unmeasured confounders
  • constant treatment effects
  • linearity of the treatment/covariates
• Under these, we can run the following regression to estimate the ATE, τ:

Y_i = α + τ D_i + X_i′γ + ν_i

• Works with continuous or ordinal D_i if the effect of these variables is truly linear

29 / 58

OLS constant effects simulation

## Model with linear covariates, constant 0 effect of treatment
library(mvtnorm)
n <- 100
p <- 4
X <- rmvnorm(n = 100, mean = rep(0, p))
gamma <- c(2.74, 1.37, 1.37, 1.37)
y <- 21.0 + X %*% c(gamma) + rnorm(n)
alpha <- c(-1, 0.5, -0.5, -0.1)
d.probs <- boot::inv.logit(X %*% alpha)
d <- rbinom(n, size = 1, prob = d.probs)

30 / 58

OLS with no covariates

summary(lm(y ~ d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   22.036      0.458   48.13  < 2e-16
d             -2.869      0.654   -4.39 0.000029
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.3 on 98 degrees of freedom
Multiple R-squared: 0.164, Adjusted R-squared: 0.156
F-statistic: 19.2 on 1 and 98 DF, p-value: 0.000029

31 / 58

OLS with covariates

summary(lm(y ~ d + X))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  20.9952      0.159  132.07   <2e-16
d            -0.128       0.253   -0.50     0.62
X1            2.7368      0.126   21.75   <2e-16
X2            1.3677      0.114   12.01   <2e-16
X3            1.3673      0.130   10.51   <2e-16
X4            1.3570      0.106   12.80   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom
Multiple R-squared: 0.999, Adjusted R-squared: 0.999
F-statistic: 2.44e+04 on 5 and 94 DF, p-value: <2e-16

32 / 58

What happens with nonlinearity?

Suppose we can only observe the following covariates:

z1 <- exp(X[, 1] / 2)
z2 <- X[, 2] / (1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3] / 25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

Implies that Y_i and D_i are nonlinear functions of the observed covariates — inverting the transformations involves log(Z_{i1}), Z_{i2}(1 + Z_{i1}²), Z_{i3}^{1/3}/log(Z_{i1}), and Z_{i4}^{1/2}

The true regression is a nonlinear function of the observed covariates

33 / 58

When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.18728    3.00799    0.73    0.469
d           -0.69292    0.33854   -2.05    0.043
z1           3.63110    0.26970   13.46   <2e-16
z2          -0.29033    0.36619   -0.79    0.430
z3           8.62022    4.31030    2.00    0.048
z4           0.04021    0.00329   12.23   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.4 on 94 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF, p-value: <2e-16

34 / 58

3. Regression with Heterogeneous Treatment Effects

35 / 58

Heterogeneous effects, binary treatment

• Completely randomized experiment:

Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
    = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
    = μ_0 + τ_i D_i + (Y_i(0) − μ_0)
    = μ_0 + τ D_i + (Y_i(0) − μ_0) + (τ_i − τ)·D_i
    = μ_0 + τ D_i + ε_i

• The error term now includes two components:
  1. "Baseline" variation in the outcome: (Y_i(0) − μ_0)
  2. Variation in the treatment effect: (τ_i − τ)
• Easy to verify that under an experiment, E[ε_i | D_i] = 0
• Thus, OLS estimates the ATE with no covariates

36 / 58

Adding covariates

• What happens with no unmeasured confounders? We need to condition on X_i now
• Remember identification of the ATE/ATT using iterated expectations
• The ATE is the weighted sum of CATEs:

τ = Σ_x τ(x) P[X_i = x]

• ATE/ATT are weighted averages of CATEs
• What about the regression estimand, τ_R? How does it relate to the ATE/ATT?

37 / 58

Heterogeneous effects and regression

• Let's investigate this under a saturated regression model:

Y_i = Σ_x B_{xi} α_x + τ_R D_i + e_i

• Use a dummy variable for each unique combination of X_i: B_{xi} = 1(X_i = x)
• Linear in X_i by construction

38 / 58

Investigating the regression coefficient

• How can we investigate τ_R? Well, we can rely on the regression anatomy:

τ_R = Cov(Y_i, D_i − E[D_i | X_i]) / V(D_i − E[D_i | X_i])

• D_i − E[D_i | X_i] is the residual from a regression of D_i on the full set of dummies
• With a little work, we can show:

τ_R = E[τ(X_i) (D_i − E[D_i | X_i])²] / E[(D_i − E[D_i | X_i])²]
    = E[τ(X_i) σ²_d(X_i)] / E[σ²_d(X_i)]

• σ²_d(x) = V[D_i | X_i = x] is the conditional variance of treatment assignment

39 / 58

ATE versus OLS

τ_R = E[τ(X_i) W_i] = Σ_x τ(x) (σ²_d(x) / E[σ²_d(X_i)]) P[X_i = x]

• Compare to the ATE:

τ = E[τ(X_i)] = Σ_x τ(x) P[X_i = x]

• Both weight strata relative to their size, P[X_i = x]
• OLS weights strata higher if the treatment variance in those strata, σ²_d(x), is higher relative to the average variance across strata, E[σ²_d(X_i)]
• The ATE weights only by their size

40 / 58
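A simulation sketch checking that a saturated OLS fit really converges to this variance-weighted estimand rather than the ATE; the numbers reuse the binary-covariate example from these slides (Python for concreteness):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
x = rng.binomial(1, 0.75, size=n)    # P[X_i = 1] = 0.75
e = np.where(x == 1, 0.9, 0.5)       # P[D_i = 1 | X_i = x]
d = rng.binomial(1, e)
tau_x = np.where(x == 1, 1.0, -1.0)  # CATEs: tau(1) = 1, tau(0) = -1
y = tau_x * d + rng.normal(size=n, scale=0.1)

# Saturated regression: intercept, treatment, and the X dummy
M = np.column_stack([np.ones(n), d, x])
tau_r_hat = np.linalg.lstsq(M, y, rcond=None)[0][1]
# tau_r_hat lands near 0.04, far from the ATE of 0.5
```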

Regression weighting

W_i = σ²_d(X_i) / E[σ²_d(X_i)]

• Why does OLS weight like this?
• OLS is a minimum-variance estimator: more weight to more precise within-strata estimates
• Within-strata estimates are most precise when the treatment is evenly spread and thus has the highest variance
• If D_i is binary, then we know the conditional variance will be

σ²_d(x) = P[D_i = 1 | X_i = x] (1 − P[D_i = 1 | X_i = x])
        = e(x) (1 − e(x))

• Maximum variance with P[D_i = 1 | X_i = x] = 1/2

41 / 58
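That last bullet is a one-line calculus fact — e(1 − e) peaks at e = 1/2 — which a grid check confirms:

```python
# Bernoulli treatment variance e(1 - e) over a grid of propensity scores
grid = [i / 1000 for i in range(1, 1000)]
variances = [e * (1 - e) for e in grid]
e_star = grid[variances.index(max(variances))]  # argmax over the grid
```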

OLS weighting examplebull Binary covariate

โ„™[119883119894 = 1] = 075 โ„™[119883119894 = 0] = 025โ„™[119863119894 = 1|119883119894 = 1] = 09 โ„™[119863119894 = 1|119883119894 = 0] = 05

1205901113569119889(1) = 009 1205901113569119889(0) = 025120591(1) = 1 120591(0) = minus1

bull Implies the ATE is 120591 = 05bull Average conditional variance 120124[1205901113569119889(119883119894)] = 013bull weights for 119883119894 = 1 are 009013 = 0692 for 119883119894 = 0

025013 = 192120591119877 = 120124[120591(119883119894)119882119894]

= 120591(1)119882(1)โ„™[119883119894 = 1] + 120591(0)119882(0)โ„™[119883119894 = 0]= 1 times 0692 times 075 + minus1 times 192 times 025= 0039

42 58

When will OLS estimate the ATE

bull When does 120591 = 120591119877bull Constant treatment effects 120591(119909) = 120591 = 120591119877bull Constant probability of treatment 119890(119909) = โ„™[119863119894 = 1|119883119894 = 119909] = 119890

Implies that the OLS weights are 1

bull Incorrect linearity assumption in 119883119894 will lead to more bias

43 58

Other ways to use regression

bull Whatrsquos the path forward Accept the bias (might be relatively small with saturated

models) Use a different regression approach

bull Let 120583119889(119909) = 120124[119884119894(119889)|119883119894 = 119909] be the CEF for the potentialoutcome under 119863119894 = 119889

bull By consistency and nuc we have 120583119889(119909) = 120124[119884119894|119863119894 = 119889119883119894 = 119909]bull Estimate a regression of 119884119894 on 119883119894 among the 119863119894 = 119889 groupbull Then 119889(119909) is just a predicted value from the regression for

119883119894 = 119909bull How can we use this

44 58

Imputation estimators

bull Impute the treated potential outcomes with 1114023119884119894(1) = 1113568(119883119894)bull Impute the control potential outcomes with 1114023119884119894(0) = 1113567(119883119894)bull Procedure

Regress 119884119894 on 119883119894 in the treated group and get predicted valuesfor all units (treated or control)

Regress 119884119894 on 119883119894 in the control group and get predicted valuesfor all units (treated or control)

Take the average difference between these predicted values

bull More mathematically look like this

120591119894119898119901 =1119873

11140121198941113568(119883119894) minus 1113567(119883119894)

bull Sometimes called an imputation estimator

45 58

Simple imputation estimator

bull Use predict() from the within-group models on the data fromthe entire sample

bull Useful trick use a model on the entire data andmodelframe() to get the right design matrix

heterogeneous effects

yhet lt- ifelse(d == 1 y + rnorm(n 0 5) y)

mod lt- lm(yhet ~ d + X)

mod1 lt- lm(yhet ~ X subset = d == 1)

mod0 lt- lm(yhet ~ X subset = d == 0)

y1imps lt- predict(mod1 modelframe(mod))

y0imps lt- predict(mod0 modelframe(mod))

mean(y1imps - y0imps)

[1] 061

46 58

Notes on imputation estimators

bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE

bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS

bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods

such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)

47 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

48 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

49 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

50 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

51 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

52 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

53 58

Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible

estimatelibrary(mgcv)

mod0 lt- gam(y ~ s(x) subset = d == 0)

summary(mod0)

Family gaussian

Link function identity

Formula

y ~ s(x)

Parametric coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) -00225 00154 -146 016

Approximate significance of smooth terms

edf Refdf F p-value

s(x) 603 708 413 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

R-sq(adj) = 0909 Deviance explained = 928

GCV = 00093204 Scale est = 00071351 n = 30

54 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

55 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

56 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

57 58

Limited dependent variables

bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc

bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models

(stratified diff-in-means)

bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong

bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models

Could go wrong in small samples If using nonlinear models always get effects on the scale of the

outcome

58 58

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 13: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

Linear model

summary(lm(aauw ~ totchi data = girls))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 6131 181 3381 lt2e-16

totchi -533 062 -859 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 42 on 1733 degrees of freedom

(5 observations deleted due to missingness)

Multiple R-squared 00408 Adjusted R-squared 00403

F-statistic 738 on 1 and 1733 DF p-value lt2e-16

13 58

Saturated modelsummary(lm(aauw ~ asfactor(totchi) data = girls))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 5641 276 2042 lt 2e-16

asfactor(totchi)1 545 411 133 01851

asfactor(totchi)2 -380 327 -116 02454

asfactor(totchi)3 -1365 345 -395 81e-05

asfactor(totchi)4 -1931 401 -482 16e-06

asfactor(totchi)5 -1546 485 -319 00015

asfactor(totchi)6 -3359 1042 -322 00013

asfactor(totchi)7 -1713 1141 -150 01336

asfactor(totchi)8 -5533 1228 -451 70e-06

asfactor(totchi)9 -5041 2408 -209 00364

asfactor(totchi)10 -5341 2090 -256 00107

asfactor(totchi)12 -5641 4153 -136 01745

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 41 on 1723 degrees of freedom

(5 observations deleted due to missingness)

Multiple R-squared 00506 Adjusted R-squared 00446

F-statistic 836 on 11 and 1723 DF p-value 184e-1414 58

Saturated model minus the constantsummary(lm(aauw ~ asfactor(totchi) - 1 data = girls))

Coefficients

Estimate Std Error t value Pr(gt|t|)

asfactor(totchi)0 5641 276 2042 lt2e-16

asfactor(totchi)1 6186 305 2031 lt2e-16

asfactor(totchi)2 5262 175 3013 lt2e-16

asfactor(totchi)3 4276 207 2062 lt2e-16

asfactor(totchi)4 3711 290 1279 lt2e-16

asfactor(totchi)5 4095 399 1027 lt2e-16

asfactor(totchi)6 2282 1005 227 00233

asfactor(totchi)7 3929 1107 355 00004

asfactor(totchi)8 108 1196 009 09278

asfactor(totchi)9 600 2392 025 08020

asfactor(totchi)10 300 2072 014 08849

asfactor(totchi)12 000 4143 000 10000

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 41 on 1723 degrees of freedom

(5 observations deleted due to missingness)

Multiple R-squared 0587 Adjusted R-squared 0584

F-statistic 204 on 12 and 1723 DF p-value lt2e-1615 58

Compare to within-strata means

bull The saturated model makes no assumptions about thebetween-strata relationships

bull Just calculates within-strata means

c1 lt- coef(lm(aauw ~ asfactor(totchi) - 1 data = girls))

c2 lt- with(girls tapply(aauw totchi mean narm = TRUE))

rbind(c1 c2)

0 1 2 3 4 5 6 7 8 9 10 12

c1 56 62 53 43 37 41 23 39 11 6 3 0

c2 56 62 53 43 37 41 23 39 11 6 3 0

16 58

Other justifications for OLS

bull Justification 2 119883prime119894 120573 is the best linear predictor (in a

mean-squared error sense) of 119884119894 Why 120573 = argmin119887 120124[(119884119894 minus 119883prime

119894 119887)1113569]

bull Justification 3 119883prime119894 120573 provides the minimum mean squared

error linear approxmiation to 119864[119884119894|119883119894]bull Even if the CEF is not linear a linear regression provides the

best linear approximation to that CEFbull Donrsquot need to believe the assumptions (linearity) in order to

use regression as a good approximation to the CEFbull Warning if the CEF is very nonlinear then this approximation

could be terrible

17 58

The error terms

bull Letrsquos define the error term 119890119894 equiv 119884119894 minus 119883prime119894 120573 so that

119884119894 = 119883prime119894 120573 + [119884119894 minus 119883prime

119894 120573] = 119883prime119894 120573 + 119890119894

bull Note the residual 119890119894 is uncorrelated with 119883119894

120124[119883119894119890119894] = 120124[119883119894(119884119894 minus 119883prime119894 120573)]

= 120124[119883119894119884119894] minus 120124[119883119894119883prime119894 120573]

= 120124[119883119894119884119894] minus 120124 1114094119883119894119883prime119894120124[119883119894119883prime

119894 ]minus1113568120124[119883119894119884119894]1114097= 120124[119883119894119884119894] minus 120124[119883119894119883prime

119894 ]120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]

= 120124[119883119894119884119894] minus 120124[119883119894119884119894] = 0

bull No assumptions on the linearity of 120124[119884119894|119883119894]

18 58

OLS estimator

• We know the population value of β is

  β = E[X_i X_i']⁻¹ E[X_i Y_i]

• How do we get an estimator of this?
• Plug-in principle: replace population expectations with sample versions:

  β̂ = [ (1/N) Σ_i X_i X_i' ]⁻¹ (1/N) Σ_i X_i Y_i

• If you work through the matrix algebra, this turns out to be

  β̂ = (X'X)⁻¹ X'y

19 / 58
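The plug-in recipe is easy to verify numerically: the sample-moment formula and lm() return identical coefficients. A sketch with made-up data (not from the slides):

```r
# Plug-in estimator [(1/N) Σ X_i X_i']⁻¹ (1/N) Σ X_i Y_i versus lm().
set.seed(1)
N <- 200
X <- cbind(1, rnorm(N), runif(N))
y <- c(X %*% c(0.5, 1, -2)) + rnorm(N)
Sxx <- crossprod(X) / N            # (1/N) Σ X_i X_i'
Sxy <- crossprod(X, y) / N         # (1/N) Σ X_i Y_i
beta.plugin <- c(solve(Sxx, Sxy))
beta.lm <- unname(coef(lm(y ~ X[, 2] + X[, 3])))
max(abs(beta.plugin - beta.lm))    # numerically zero
```

The 1/N factors cancel, which is why the sample-moment version and (X'X)⁻¹X'y agree exactly.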

Asymptotic OLS inference

• With this representation in hand, we can write the OLS estimator as follows:

  β̂ = β + [ Σ_i X_i X_i' ]⁻¹ Σ_i X_i e_i

• Core idea: Σ_i X_i e_i is the sum of r.v.s, so the CLT applies
• That, plus some simple asymptotic theory, allows us to say:

  √N (β̂ − β) ⇝ N(0, Ω)

• Converges in distribution to a Normal distribution with mean vector 0 and covariance matrix Ω:

  Ω = E[X_i X_i']⁻¹ E[X_i X_i' e_i²] E[X_i X_i']⁻¹

• No linearity assumption needed!

20 / 58

Estimating the variance

• In large samples, then,

  √N (β̂ − β) ∼ N(0, Ω)

• How to estimate Ω? Plug-in principle again!

  Ω̂ = [ Σ_i X_i X_i' ]⁻¹ [ Σ_i X_i X_i' ê_i² ] [ Σ_i X_i X_i' ]⁻¹

• Replace e_i with its empirical counterpart (the residuals), ê_i = Y_i − X_i'β̂
• Replace the population moments of X_i with their sample counterparts
• The square roots of the diagonals of this covariance matrix are the "robust" or Huber-White standard errors that Stata commonly reports

21 / 58
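The sandwich can be computed by hand in a few lines; its diagonal square roots are the HC0 robust standard errors. A sketch on simulated heteroskedastic data (illustrative, not from the slides):

```r
# Hand-rolled sandwich: bread = (Σ X_i X_i')⁻¹, meat = Σ X_i X_i' ê_i².
set.seed(2)
n <- 300
X <- cbind(1, rnorm(n))
y <- c(X %*% c(1, 2)) + rnorm(n, sd = abs(X[, 2]) + 0.5)  # heteroskedastic
fit <- lm(y ~ X[, 2])
e.hat <- residuals(fit)
bread <- solve(crossprod(X))
meat <- crossprod(X * e.hat)          # Σ X_i X_i' ê_i²
V.hc0 <- bread %*% meat %*% bread
sqrt(diag(V.hc0))                     # robust SEs for intercept and slope
```

This matches the slide's Ω̂ up to the √N normalization: using raw sums rather than 1/N-scaled moments delivers the variance of β̂ directly.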

Heteroskedasticity

• No assumptions of homoskedasticity
• Heteroskedasticity will definitely occur when:
  - the CEF is linear, but σ²(x) = V[Y_i | X_i = x] is not constant in x
  - E[Y_i | X_i] is not linear, but we use linear regression to approximate it

22 / 58

2. Regression and Causality

23 / 58

Regression and causality

• Most econometrics textbooks: regression defined without respect to causality
• But then, when is β̂ "biased"? The above derivations work for some CEF E[Y_i | X_i]
• The question, then, is: when does knowing the CEF tell us something about causality?
• MHE argues that a regression is causal when the CEF it approximates is causal. Identification is king.
• We will show that under certain conditions, a regression of the outcome on the treatment and the covariates can recover a causal parameter, but perhaps not the one in which we are interested

24 / 58

Review

• Quick reminder: we have potential outcomes Y_i(1) and Y_i(0) and two parameters, the ATE and ATT:

  τ = E[Y_i(1) − Y_i(0)]
  τ_ATT = E[Y_i(1) − Y_i(0) | D_i = 1]

• We have shown in past weeks that these effects are identified when ignorability holds. MHE calls this the conditional independence assumption (CIA).

25 / 58

Linear constant effects model, binary treatment

• Experiment: with a simple experiment, we can rewrite the consistency assumption as a regression formula:

  Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
      = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
      = E[Y_i(0)] + τ D_i + (Y_i(0) − E[Y_i(0)])
      = μ₀ + τ D_i + v_{0i}

• Note that if ignorability holds (as in an experiment) for Y_i(0), then it will also hold for v_{0i}, since E[Y_i(0)] is constant. Thus this satisfies the usual assumptions for regression.

26 / 58

Now with covariates

• Now assume no unmeasured confounders: Y_i(d) ⊥⊥ D_i | X_i
• We will assume a linear model for the potential outcomes:

  Y_i(d) = α + τ·d + η_i

• Remember that linearity isn't an assumption if D_i is binary
• Effect of D_i is constant here; the η_i are the only source of individual variation, and we have E[η_i] = 0
• Consistency assumption allows us to write this as:

  Y_i = α + τ D_i + η_i

27 / 58

Covariates in the error

• Let's assume that η_i is linear in X_i: η_i = X_i'γ + ν_i
• New error is uncorrelated with X_i: E[ν_i | X_i] = 0
• This is an assumption! Might be false!
• Plug into the above:

  E[Y_i(d) | X_i] = E[Y_i | D_i, X_i] = α + τ D_i + E[η_i | X_i]
                  = α + τ D_i + X_i'γ + E[ν_i | X_i]
                  = α + τ D_i + X_i'γ

28 / 58

Summing up regression with constant effects

• Reviewing the assumptions we've used:
  - no unmeasured confounders
  - constant treatment effects
  - linearity of the treatment/covariates
• Under these, we can run the following regression to estimate the ATE, τ:

  Y_i = α + τ D_i + X_i'γ + ν_i

• Works with continuous or ordinal D_i if the effect of these variables is truly linear

29 / 58

OLS constant effects simulation

Model with linear covariates, constant 0 effect of treatment:

library(mvtnorm)
n <- 100
p <- 4
X <- rmvnorm(n = 100, mean = rep(0, p))
gamma <- c(2.74, 1.37, 1.37, 1.37)
y <- 21 + X %*% c(gamma) + rnorm(n)
alpha <- c(-1, 0.5, -0.5, -0.1)
d.probs <- boot::inv.logit(X %*% alpha)
d <- rbinom(n, size = 1, prob = d.probs)

30 / 58

OLS with no covariates

summary(lm(y ~ d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   22.036      0.458   48.13  < 2e-16
d             -2.869      0.654   -4.39 0.000029
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.3 on 98 degrees of freedom
Multiple R-squared: 0.164, Adjusted R-squared: 0.156
F-statistic: 19.2 on 1 and 98 DF, p-value: 0.000029

31 / 58

OLS with covariates

summary(lm(y ~ d + X))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  20.9952      0.159  132.07   <2e-16
d            -0.1280      0.253   -0.50     0.62
X1            2.7368      0.126   21.75   <2e-16
X2            1.3677      0.114   12.01   <2e-16
X3            1.3673      0.130   10.51   <2e-16
X4            1.3570      0.106   12.80   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom
Multiple R-squared: 0.999, Adjusted R-squared: 0.999
F-statistic: 2.44e+04 on 5 and 94 DF, p-value: <2e-16

32 / 58

What happens with nonlinearity?

Suppose we can only observe the following covariates:

z1 <- exp(X[, 1] / 2)
z2 <- X[, 2] / (1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3] / 25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

Implies that Y_i and D_i are functions of log(Z_{i1}), Z_{i2}, Z²_{i1} Z_{i2}, 1/log(Z_{i1}), Z^{1/3}_{i3}/log(Z_{i1}), and Z^{1/2}_{i4}

Regression is a nonlinear function of the observed covariates.

33 / 58

When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.8728    30.0799    0.73    0.469
d            -6.9292     3.3854   -2.05    0.043
z1           36.3110     2.6970   13.46   <2e-16
z2           -2.9033     3.6619   -0.79    0.430
z3          862.0220   431.0300    2.00    0.048
z4            0.4021     0.0329   12.23   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.4 on 94 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF, p-value: <2e-16

34 / 58

3. Regression with Heterogeneous Treatment Effects

35 / 58

Heterogeneous effects, binary treatment

• Completely randomized experiment:

  Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
      = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
      = μ₀ + τ_i D_i + (Y_i(0) − μ₀)
      = μ₀ + τ D_i + (Y_i(0) − μ₀) + (τ_i − τ)·D_i
      = μ₀ + τ D_i + ε_i

• Error term now includes two components:
  1. "Baseline" variation in the outcome: (Y_i(0) − μ₀)
  2. Variation in the treatment effect: (τ_i − τ)
• Easy to verify that under an experiment, E[ε_i | D_i] = 0
• Thus, OLS estimates the ATE with no covariates

36 / 58
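A small simulation illustrates the claim (hypothetical data-generating process): individual effects τ_i vary, treatment is randomized, and the OLS slope lands on the ATE.

```r
# Randomized treatment + heterogeneous effects: OLS slope ≈ ATE.
set.seed(3)
n <- 100000
tau.i <- rnorm(n, mean = 2, sd = 3)   # heterogeneous treatment effects
y0 <- rnorm(n)                        # baseline outcomes Y_i(0)
d <- rbinom(n, 1, 0.5)                # randomized treatment
y <- y0 + tau.i * d                   # consistency
coef(lm(y ~ d))["d"]                  # close to the ATE of 2
```

With no covariates this coefficient is just the difference in means, so randomization alone makes it unbiased for τ despite the heterogeneity.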

Adding covariates

• What happens with no unmeasured confounders? Need to condition on X_i now
• Remember identification of the ATE/ATT using iterated expectations
• ATE is the weighted sum of CATEs:

  τ = Σ_x τ(x) Pr[X_i = x]

• ATE/ATT are weighted averages of CATEs
• What about the regression estimand, τ_R? How does it relate to the ATE/ATT?

37 / 58

Heterogeneous effects and regression

• Let's investigate this under a saturated regression model:

  Y_i = Σ_x B_{xi} α_x + τ_R D_i + e_i

• Use a dummy variable for each unique combination of X_i: B_{xi} = 1(X_i = x)
• Linear in X_i by construction

38 / 58

Investigating the regression coefficient

• How can we investigate τ_R? Well, we can rely on the regression anatomy:

  τ_R = Cov(Y_i, D_i − E[D_i | X_i]) / V(D_i − E[D_i | X_i])

• D_i − E[D_i | X_i] is the residual from a regression of D_i on the full set of dummies
• With a little work, we can show:

  τ_R = E[τ(X_i)(D_i − E[D_i | X_i])²] / E[(D_i − E[D_i | X_i])²]
      = E[τ(X_i) σ²_d(X_i)] / E[σ²_d(X_i)]

• σ²_d(x) = V[D_i | X_i = x] is the conditional variance of treatment assignment

39 / 58
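The regression-anatomy identity is easy to check numerically: residualize D on the dummies, and the bivariate slope of Y on the residualized D reproduces the coefficient from the full regression exactly. A sketch with a hypothetical discrete covariate:

```r
# Regression anatomy: τ_R from the saturated regression equals
# cov(Y, D − Ê[D|X]) / var(D − Ê[D|X]).
set.seed(4)
n <- 2000
x <- sample(0:2, n, replace = TRUE)
d <- rbinom(n, 1, c(0.2, 0.5, 0.8)[x + 1])   # propensity varies with x
y <- x + (1 + x) * d + rnorm(n)              # heterogeneous τ(x) = 1 + x
tau.R <- unname(coef(lm(y ~ factor(x) + d))["d"])
d.tilde <- residuals(lm(d ~ factor(x)))      # D − Ê[D | X]
cov(y, d.tilde) / var(d.tilde)               # same number as tau.R
```

This is the Frisch-Waugh-Lovell result in sample form; the (n − 1) factors in cov and var cancel, so the two numbers match to machine precision.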

ATE versus OLS

  τ_R = E[τ(X_i) W_i] = Σ_x τ(x) (σ²_d(x) / E[σ²_d(X_i)]) P[X_i = x]

• Compare to the ATE:

  τ = E[τ(X_i)] = Σ_x τ(x) P[X_i = x]

• Both weight strata relative to their size (P[X_i = x])
• OLS weights a stratum higher if the treatment variance in that stratum (σ²_d(x)) is high relative to the average variance across strata (E[σ²_d(X_i)])
• The ATE weights strata only by their size

40 / 58

Regression weighting

  W_i = σ²_d(X_i) / E[σ²_d(X_i)]

• Why does OLS weight like this?
• OLS is a minimum-variance estimator: more weight to more precise within-strata estimates
• Within-strata estimates are most precise when the treatment is evenly spread and thus has the highest variance
• If D_i is binary, then we know the conditional variance will be

  σ²_d(x) = P[D_i = 1 | X_i = x](1 − P[D_i = 1 | X_i = x])
          = e(x)(1 − e(x))

• Maximum variance with P[D_i = 1 | X_i = x] = 1/2

41 / 58

OLS weighting example

• Binary covariate:

  P[X_i = 1] = 0.75    P[X_i = 0] = 0.25
  P[D_i = 1 | X_i = 1] = 0.9    P[D_i = 1 | X_i = 0] = 0.5
  σ²_d(1) = 0.09    σ²_d(0) = 0.25
  τ(1) = 1    τ(0) = −1

• Implies the ATE is τ = 0.5
• Average conditional variance: E[σ²_d(X_i)] = 0.13
• Weights: for X_i = 1, 0.09/0.13 = 0.692; for X_i = 0, 0.25/0.13 = 1.92

  τ_R = E[τ(X_i) W_i]
      = τ(1) W(1) P[X_i = 1] + τ(0) W(0) P[X_i = 0]
      = 1 × 0.692 × 0.75 + −1 × 1.92 × 0.25
      = 0.039

42 / 58
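The arithmetic on this slide is mechanical enough to script. As a sketch, carrying the weights at full precision gives roughly 0.038; the slide's 0.039 comes from rounding the weights first:

```r
# OLS weighting example: variance-weighted estimand versus the ATE.
p.x   <- c(0.25, 0.75)        # P[X = 0], P[X = 1]
e.x   <- c(0.50, 0.90)        # P[D = 1 | X = x]
tau.x <- c(-1, 1)             # CATEs τ(0), τ(1)
var.d <- e.x * (1 - e.x)      # σ²_d(x) = 0.25, 0.09
avg.var <- sum(var.d * p.x)   # E[σ²_d(X)] = 0.13
w.x <- var.d / avg.var        # OLS weights: 1.92, 0.692
tau   <- sum(tau.x * p.x)           # ATE = 0.5
tau.R <- sum(tau.x * w.x * p.x)     # ≈ 0.038
c(ATE = tau, OLS = tau.R)
```

Either way, the point survives: with these strata, OLS's variance weighting drags the estimand from 0.5 down to nearly zero.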

When will OLS estimate the ATE?

• When does τ = τ_R?
• Constant treatment effects: τ(x) = τ = τ_R
• Constant probability of treatment: e(x) = P[D_i = 1 | X_i = x] = e
  - Implies that the OLS weights are 1
• Incorrect linearity assumption in X_i will lead to more bias

43 / 58

Other ways to use regression

• What's the path forward?
  - Accept the bias (might be relatively small with saturated models)
  - Use a different regression approach
• Let μ_d(x) = E[Y_i(d) | X_i = x] be the CEF for the potential outcome under D_i = d
• By consistency and no unmeasured confounders, we have μ_d(x) = E[Y_i | D_i = d, X_i = x]
• Estimate a regression of Y_i on X_i among the D_i = d group
• Then μ̂_d(x) is just a predicted value from that regression at X_i = x
• How can we use this?

44 / 58

Imputation estimators

• Impute the treated potential outcomes with Ŷ_i(1) = μ̂₁(X_i)
• Impute the control potential outcomes with Ŷ_i(0) = μ̂₀(X_i)
• Procedure:
  - Regress Y_i on X_i in the treated group and get predicted values for all units (treated or control)
  - Regress Y_i on X_i in the control group and get predicted values for all units (treated or control)
  - Take the average difference between these predicted values
• More mathematically, it looks like this:

  τ̂_imp = (1/N) Σ_i [μ̂₁(X_i) − μ̂₀(X_i)]

• Sometimes called an imputation estimator

45 / 58

Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample
• Useful trick: use a model on the entire data and model.frame() to get the right design matrix

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61

46 / 58

Notes on imputation estimators

• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE
• Why don't people use this?
  - Most people don't know the results we've been talking about
  - Harder to implement than vanilla OLS
• Can use linear regression to estimate μ̂_d(x) = x'β̂_d
• Recent trend is to estimate μ̂_d(x) via non-parametric methods such as:
  - Kernel regression, local linear regression, regression trees, etc.
  - Easiest is generalized additive models (GAMs)

47 / 58

Imputation estimator visualization

[Figure, built up over three slides: scatterplot of y against x for Treated and Control units, illustrating the imputation estimator]

48-50 / 58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i:

[Figure, built up over three slides: scatterplot of y against x for Treated and Control units with nonlinear CEFs]

51-53 / 58

Using semiparametric regression

• Here, CEFs are nonlinear, but we don't know their form
• We can use GAMs from the mgcv package for a flexible estimate:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204  Scale est. = 0.0071351  n = 30

54 / 58

Using GAMs

[Figure, built up over three slides: GAM fits of y on x for the Treated and Control groups]

55-57 / 58

Limited dependent variables

• Usual advice: model the data from first principles (logit/probit for binary, Poisson for counts, etc.)
• OLS is a-ok with limited DVs when:
  - Binary treatment and no covariates (just diff-in-means)
  - Binary treatment, discrete covariates, and saturated models (stratified diff-in-means)
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models
  - Could go wrong in small samples
  - If using nonlinear models, always get effects on the scale of the outcome

58 / 58
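The first "a-ok" case is easy to confirm: with a binary outcome, a binary treatment, and no covariates, the OLS slope is algebraically identical to the difference in proportions. A sketch on simulated data (not from the slides):

```r
# Binary ("limited") outcome, binary treatment: OLS slope = diff-in-means.
set.seed(5)
n <- 1000
d <- rbinom(n, 1, 0.5)
y <- rbinom(n, 1, 0.3 + 0.2 * d)     # binary outcome
ols <- unname(coef(lm(y ~ d))["d"])
dim.est <- mean(y[d == 1]) - mean(y[d == 0])
all.equal(ols, dim.est)              # TRUE
```

No distributional model is needed for this equality; it is a property of least squares with a single binary regressor.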

• Agnostic Regression
• Regression and Causality
• Regression with Heterogeneous Treatment Effects

Saturated model

summary(lm(aauw ~ as.factor(totchi), data = girls))

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)            56.41       2.76   20.42  < 2e-16
as.factor(totchi)1      5.45       4.11    1.33   0.1851
as.factor(totchi)2     -3.80       3.27   -1.16   0.2454
as.factor(totchi)3    -13.65       3.45   -3.95  8.1e-05
as.factor(totchi)4    -19.31       4.01   -4.82  1.6e-06
as.factor(totchi)5    -15.46       4.85   -3.19   0.0015
as.factor(totchi)6    -33.59      10.42   -3.22   0.0013
as.factor(totchi)7    -17.13      11.41   -1.50   0.1336
as.factor(totchi)8    -55.33      12.28   -4.51  7.0e-06
as.factor(totchi)9    -50.41      24.08   -2.09   0.0364
as.factor(totchi)10   -53.41      20.90   -2.56   0.0107
as.factor(totchi)12   -56.41      41.53   -1.36   0.1745
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41 on 1723 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.0506, Adjusted R-squared: 0.0446
F-statistic: 8.36 on 11 and 1723 DF, p-value: 1.84e-14

14 / 58

Saturated model minus the constant

summary(lm(aauw ~ as.factor(totchi) - 1, data = girls))

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
as.factor(totchi)0     56.41       2.76   20.42   <2e-16
as.factor(totchi)1     61.86       3.05   20.31   <2e-16
as.factor(totchi)2     52.62       1.75   30.13   <2e-16
as.factor(totchi)3     42.76       2.07   20.62   <2e-16
as.factor(totchi)4     37.11       2.90   12.79   <2e-16
as.factor(totchi)5     40.95       3.99   10.27   <2e-16
as.factor(totchi)6     22.82      10.05    2.27   0.0233
as.factor(totchi)7     39.29      11.07    3.55   0.0004
as.factor(totchi)8      1.08      11.96    0.09   0.9278
as.factor(totchi)9      6.00      23.92    0.25   0.8020
as.factor(totchi)10     3.00      20.72    0.14   0.8849
as.factor(totchi)12     0.00      41.43    0.00   1.0000
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41 on 1723 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.587, Adjusted R-squared: 0.584
F-statistic: 204 on 12 and 1723 DF, p-value: <2e-16

15 / 58

Compare to within-strata means

• The saturated model makes no assumptions about the between-strata relationships
• Just calculates within-strata means!

c1 <- coef(lm(aauw ~ as.factor(totchi) - 1, data = girls))
c2 <- with(girls, tapply(aauw, totchi, mean, na.rm = TRUE))
rbind(c1, c2)

    0  1  2  3  4  5  6  7  8  9 10 12
c1 56 62 53 43 37 41 23 39 11  6  3  0
c2 56 62 53 43 37 41 23 39 11  6  3  0

16 / 58

Other justifications for OLS

bull Justification 2 119883prime119894 120573 is the best linear predictor (in a

mean-squared error sense) of 119884119894 Why 120573 = argmin119887 120124[(119884119894 minus 119883prime

119894 119887)1113569]

bull Justification 3 119883prime119894 120573 provides the minimum mean squared

error linear approxmiation to 119864[119884119894|119883119894]bull Even if the CEF is not linear a linear regression provides the

best linear approximation to that CEFbull Donrsquot need to believe the assumptions (linearity) in order to

use regression as a good approximation to the CEFbull Warning if the CEF is very nonlinear then this approximation

could be terrible

17 58

The error terms

bull Letrsquos define the error term 119890119894 equiv 119884119894 minus 119883prime119894 120573 so that

119884119894 = 119883prime119894 120573 + [119884119894 minus 119883prime

119894 120573] = 119883prime119894 120573 + 119890119894

bull Note the residual 119890119894 is uncorrelated with 119883119894

120124[119883119894119890119894] = 120124[119883119894(119884119894 minus 119883prime119894 120573)]

= 120124[119883119894119884119894] minus 120124[119883119894119883prime119894 120573]

= 120124[119883119894119884119894] minus 120124 1114094119883119894119883prime119894120124[119883119894119883prime

119894 ]minus1113568120124[119883119894119884119894]1114097= 120124[119883119894119884119894] minus 120124[119883119894119883prime

119894 ]120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]

= 120124[119883119894119884119894] minus 120124[119883119894119884119894] = 0

bull No assumptions on the linearity of 120124[119884119894|119883119894]

18 58

OLS estimator

bull We know the population value of 120573 is

120573 = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]

bull How do we get an estimator of thisbull Plug-in principle replace population expectation with

sample versions

=

โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1119873

1114012119894119883119894119883prime

119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

minus11135681119873

1114012119894119883119894119884119894

bull If you work through the matrix algebra this turns out to be

= (119831prime119831)minus1113568 119831prime119858

19 58

Asymptotic OLS inference

bull With this representation in hand we can write the OLSestimator as follows

= 120573 +

โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1114012

119894119883119894119883prime

119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

minus1113568

1114012119894119883119894119890119894

bull Core idea sum119894119883119894119890119894 is the sum of rvs so the CLT applies

bull That plus some simple asymptotic theory allows us to say

radic119873( minus 120573) 119873(0ฮฉ)

bull Converges in distribution to a Normal distribution with meanvector 0 and covariance matrix ฮฉ

ฮฉ = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119883prime

119894 1198901113569119894 ]120124[119883119894119883prime119894 ]minus1113568

bull No linearity assumption needed20 58

Estimating the variance

bull In large samples then

radic119873( minus 120573) sim 119873(0ฮฉ)

bull How to estimate ฮฉ Plug-in principle again

1114023ฮฉ =

โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1114012

119894119883119894119883prime

119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

minus1113568 โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1114012119894119883119894119883prime

119894 1113569119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1114012

119894119883119894119883prime

119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

minus1113568

bull Replace 119890119894 with its emprical counterpart (residuals)119894 = 119884119894 minus 119883prime

119894 bull Replace the population moments of 119883119894 with their sample

counterpartsbull The square root of the diagonals of this covariance matrix are

the ldquorobustrdquo or Huber-White standard errors that Statacommonly report

21 58

Heteroskedasticity

bull No assumptions of homoskedasticitybull Heteroskedaticity will definitely occur when

CEF is linear but the 1205901113569(119909) = 120141[119884119894|119883119894 = 119909] is not constant in 119909 119864[119884119894|119883119894] is not linear but we use the linear regression to

approxmiate it

22 58

2 Regression andCausality

23 58

Regression and causality

bull Most econometrics textbooks regression defined withoutrespect to causality

bull But then when is ldquobiasedrdquo The above derivations work forsome 120124[119884119894|119883119894]

bull The question then is when does knowing the CEF tell ussomething about causality

bull MHE argues that a regression is causal when the CEF itapproximates is causal Identification is king

bull We will show that under certain conditions a regression of theoutcome on the treatment and the covariates can recover acausal parameter but perhaps not the one in which we areinterested

24 58

Review

bull Quick reminder we have potential outcomes 119884119894(1) and 119884119894(0)and two parameters the ATE and ATT

120591 = 119864[119884119894(1) minus 119884119894(0)]120591ATT = 119864[119884119894(1) minus 119884119894(0)|119863119894 = 1]

bull We have shown in past weeks that these effects are identifiedwhen ignorability holds MHE calls this the conditionalindependence assumption (CIA)

25 58

Linear constant effects model binarytreatment

bull Experiment with a simple experiment we can rewrite theconsistency assumption to be a regression formula

119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894

= 120124[119884119894(0)] + 120591119863119894 + (119884119894(0) minus 120124[119884119894(0)])= 1205831113567 + 120591119863119894 + 1199071113567119894

bull Note that if ignorability holds (as in an experiment) for 119884119894(0)then it will also hold for 1199071113567119894 since 120124[119884119894(0)] is constant Thusthis satifies the usual assumptions for regression

26 58

Now with covariates

bull Now assume no unmeasured confounders 119884119894(119889) โŸ‚โŸ‚ 119863119894|119883119894bull We will assume a linear model for the potential outcomes

119884119894(119889) = 120572 + 120591 sdot 119889 + 120578119894

bull Remember that linearity isnrsquot an assumption if 119863119894 is binarybull Effect of 119863119894 is constant here the 120578119894 are the only source of

individual variation and we have 119864[120578119894] = 0bull Consistency assumption allows us to write this as

119884119894 = 120572 + 120591119863119894 + 120578119894

27 58

Covariates in the error

bull Letrsquos assume that 120578119894 is linear in 119883119894 120578119894 = 119883prime119894 120574 + 120584119894

bull New error is uncorrelated with 119883119894 120124[120584119894|119883119894] = 0bull This is an assumption Might be falsebull Plug into the above

120124[119884119894(119889)|119883119894] = 119864[119884119894|119863119894 119883119894] = 120572 + 120591119863119894 + 119864[120578119894|119883119894]= 120572 + 120591119863119894 + 119883prime

119894 120574 + 119864[120584119894|119883119894]= 120572 + 120591119863119894 + 119883prime

119894 120574

28 58

Summing up regression with constanteffects

bull Reviewing the assumptions wersquove used no unmeasured confounders constant treatment effects linearity of the treatmentcovariates

bull Under these we can run the following regression to estimatethe ATE 120591

119884119894 = 120572 + 120591119863119894 + 119883prime119894 120574 + 120584119894

bull Works with continuous or ordinal 119863119894 if linearity in the effect ofthese variables is truly linear

29 58

OLS constant effects simulation

Model with linear covariates constant 0 effect of treatment

library(mvtnorm)

n lt- 100

p lt- 4

X lt- rmvnorm(n = 100 mean = rep(0 p))

gamma lt- c(274 137 137 137)

y lt- 210 + X c(gamma) + rnorm(n)

alpha lt- c(-1 05 -05 -01)

dprobs lt- bootinvlogit(X alpha)

d lt- rbinom(n size = 1 prob = dprobs)

30 58

OLS with no covariates

summary(lm(y ~ d))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 22036 458 4813 lt 2e-16

d -2869 654 -439 0000029

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 33 on 98 degrees of freedom

Multiple R-squared 0164 Adjusted R-squared 0156

F-statistic 192 on 1 and 98 DF p-value 0000029

31 58

OLS with covariates

summary(lm(y ~ d + X))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 209952 0159 13207 lt2e-16

d -0128 0253 -05 062

X1 27368 0126 2175 lt2e-16

X2 13677 0114 1201 lt2e-16

X3 13673 0130 1051 lt2e-16

X4 13570 0106 1280 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 1 on 94 degrees of freedom

Multiple R-squared 0999 Adjusted R-squared 0999

F-statistic 244e+04 on 5 and 94 DF p-value lt2e-16

32 58

What happens with nonlinearity

Suppose we can only observe the following covariates

z1 lt- exp(X[ 1]2)

z2 lt- X[ 2](1 + exp(X[ 1])) + 10

z3 lt- (X[ 1] X[ 3]25 + 06)^3

z4 lt- (X[ 2] + X[ 4] + 20)^2

Implies that 119884119894 and 119863119894 are functions of log(1198851198941113568) 1198851198941113569 119885111356911989411135681198851198941113569

1 log(1198851198941113568) 1198851198941113570 log(1198851198941113568) and 119883111356811135691198941113571

Regression is a nonlinear function of the observed covariates

33 58

When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 218728 300799 073 0469

d -69292 33854 -205 0043

z1 363110 26970 1346 lt2e-16

z2 -29033 36619 -079 0430

z3 862022 431030 200 0048

z4 04021 00329 1223 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 14 on 94 degrees of freedom

Multiple R-squared 0855 Adjusted R-squared 0847

F-statistic 111 on 5 and 94 DF p-value lt2e-16

34 58

3 Regression withHeterogeneousTreatment Effects

35 58

Heterogeneous effects binary treatment

bull Completely randomized experiment

119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894

= 1205831113567 + 120591119894119863119894 + (119884119894(0) minus 1205831113567)= 1205831113567 + 120591119863119894 + (119884119894(0) minus 1205831113567) + (120591119894 minus 120591) sdot 119863119894

= 1205831113567 + 120591119863119894 + 120576119894

bull Error term now includes two components1 ldquoBaselinerdquo variation in the outcome (119884119894(0) minus 1205831113567)2 Variation in the treatment effect (120591119894 minus 120591)

bull Easy to verify that under experiment 120124[120576119894|119863119894] = 0bull Thus OLS estimates the ATE with no covariates

36 58

Adding covariates

bull What happens with no unmeasured confounders Need tocondition on 119883119894 now

bull Remember identification of the ATEATT using iteratedexpectations

bull ATE is the weighted sum of CATEs

120591 = 1114012119909120591(119909) Pr[119883119894 = 119909]

bull ATEATT are weighted averages of CATEsbull What about the regression estimand 120591119877 How does it related

to the ATEATT

37 58

Heterogeneous effects and regression

bull Letrsquos investigate this under a saturated regression model

119884119894 = 1114012119909119861119909119894120572119909 + 120591119877119863119894 + 119890119894

bull Use a dummy variable for each unique combination of 119883119894119861119909119894 = 120128(119883119894 = 119909)

bull Linear in 119883119894 by construction

38 58

Investigating the regression coefficient

• How can we investigate $\tau_R$? We can rely on regression anatomy:

\[
\tau_R = \frac{\operatorname{Cov}(Y_i,\, D_i - E[D_i \mid X_i])}{\mathbb{V}(D_i - E[D_i \mid X_i])}
\]

• $D_i - \mathbb{E}[D_i \mid X_i]$ is the residual from a regression of $D_i$ on the full set of dummies.
• With a little work, we can show:

\[
\tau_R = \frac{\mathbb{E}\left[\tau(X_i)\,(D_i - \mathbb{E}[D_i \mid X_i])^2\right]}{\mathbb{E}\left[(D_i - E[D_i \mid X_i])^2\right]} = \frac{\mathbb{E}[\tau(X_i)\,\sigma^2_d(X_i)]}{\mathbb{E}[\sigma^2_d(X_i)]}
\]

• $\sigma^2_d(x) = \mathbb{V}[D_i \mid X_i = x]$ is the conditional variance of treatment assignment.

39 58

ATE versus OLS

\[
\tau_R = \mathbb{E}[\tau(X_i) W_i] = \sum_x \tau(x)\,\frac{\sigma^2_d(x)}{\mathbb{E}[\sigma^2_d(X_i)]}\,\mathbb{P}[X_i = x]
\]

• Compare to the ATE:

\[
\tau = \mathbb{E}[\tau(X_i)] = \sum_x \tau(x)\,\mathbb{P}[X_i = x]
\]

• Both weight strata relative to their size, $\mathbb{P}[X_i = x]$.
• OLS weights a stratum more heavily if the treatment variance in that stratum, $\sigma^2_d(x)$, is high relative to the average variance across strata, $\mathbb{E}[\sigma^2_d(X_i)]$.
• The ATE weights strata only by their size.

40 58
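This variance-weighting result can be checked by simulation. Below is a minimal sketch in Python (rather than the lecture's R), with made-up strata: $\Pr[X_i=1]=0.5$, $e(1)=0.8$, $e(0)=0.4$, $\tau(1)=2$, $\tau(0)=0$, so the ATE is 1 while the variance-weighted estimand is 0.8.

```python
import numpy as np

# Hypothetical strata: P[X=1]=0.5, e(1)=0.8, e(0)=0.4, tau(1)=2, tau(0)=0.
rng = np.random.default_rng(0)
n = 200_000
x = rng.binomial(1, 0.5, size=n)
e = np.where(x == 1, 0.8, 0.4)        # propensity score e(x)
d = rng.binomial(1, e)
tau_x = np.where(x == 1, 2.0, 0.0)    # CATEs tau(x)
y = tau_x * d + rng.normal(size=n)

# Saturated OLS: intercept, dummy for x, and the treatment d
X = np.column_stack([np.ones(n), x, d])
tau_r_hat = np.linalg.lstsq(X, y, rcond=None)[0][2]

# Variance-weighted formula: E[tau(X) sigma^2_d(X)] / E[sigma^2_d(X)]
var_d = e * (1 - e)
tau_r_formula = np.mean(tau_x * var_d) / np.mean(var_d)

print(tau_r_hat, tau_r_formula)  # both close to 0.8, while the ATE is 1.0
```

The regression coefficient tracks the variance-weighted formula, not the ATE, exactly as the slide's algebra says.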

Regression weighting

\[
W_i = \frac{\sigma^2_d(X_i)}{\mathbb{E}[\sigma^2_d(X_i)]}
\]

• Why does OLS weight like this?
• OLS is a minimum-variance estimator: more weight goes to more precise within-strata estimates.
• Within-strata estimates are most precise when the treatment is evenly spread and thus has the highest variance.
• If $D_i$ is binary, then we know the conditional variance will be

\[
\sigma^2_d(x) = \mathbb{P}[D_i = 1 \mid X_i = x]\,(1 - \mathbb{P}[D_i = 1 \mid X_i = x]) = e(x)(1 - e(x))
\]

• Maximum variance when $\mathbb{P}[D_i = 1 \mid X_i = x] = 1/2$.

41 58

OLS weighting example

• Binary covariate:

\[
\begin{aligned}
\mathbb{P}[X_i = 1] &= 0.75 & \mathbb{P}[X_i = 0] &= 0.25\\
\mathbb{P}[D_i = 1 \mid X_i = 1] &= 0.9 & \mathbb{P}[D_i = 1 \mid X_i = 0] &= 0.5\\
\sigma^2_d(1) &= 0.09 & \sigma^2_d(0) &= 0.25\\
\tau(1) &= 1 & \tau(0) &= -1
\end{aligned}
\]

• Implies the ATE is $\tau = 0.5$.
• Average conditional variance: $\mathbb{E}[\sigma^2_d(X_i)] = 0.13$.
• Weights: for $X_i = 1$, $0.09/0.13 = 0.692$; for $X_i = 0$, $0.25/0.13 = 1.92$.

\[
\begin{aligned}
\tau_R &= \mathbb{E}[\tau(X_i) W_i]\\
&= \tau(1)W(1)\mathbb{P}[X_i = 1] + \tau(0)W(0)\mathbb{P}[X_i = 0]\\
&= 1 \times 0.692 \times 0.75 + -1 \times 1.92 \times 0.25\\
&= 0.039
\end{aligned}
\]

42 58
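The arithmetic on this slide is easy to verify directly; a quick check in Python for illustration (the unrounded value is 0.0385, which matches the slide's 0.039 up to rounding the weights first):

```python
p_x = {1: 0.75, 0: 0.25}   # P[X_i = x]
e   = {1: 0.90, 0: 0.50}   # P[D_i = 1 | X_i = x]
tau = {1: 1.0,  0: -1.0}   # CATEs tau(x)

var_d   = {x: e[x] * (1 - e[x]) for x in p_x}    # sigma^2_d(x) = e(x)(1 - e(x))
avg_var = sum(var_d[x] * p_x[x] for x in p_x)    # E[sigma^2_d(X_i)] = 0.13

ate   = sum(tau[x] * p_x[x] for x in p_x)        # the ATE
tau_r = sum(tau[x] * (var_d[x] / avg_var) * p_x[x] for x in p_x)

print(ate, round(tau_r, 4))  # 0.5 0.0385
```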

When will OLS estimate the ATE?

• When does $\tau = \tau_R$?
• Constant treatment effects: $\tau(x) = \tau = \tau_R$.
• Constant probability of treatment: $e(x) = \mathbb{P}[D_i = 1 \mid X_i = x] = e$. This implies that the OLS weights are all 1.
• An incorrect linearity assumption in $X_i$ will lead to additional bias.

43 58

Other ways to use regression

• What's the path forward?
  • Accept the bias (it might be relatively small with saturated models).
  • Use a different regression approach.
• Let $\mu_d(x) = \mathbb{E}[Y_i(d) \mid X_i = x]$ be the CEF for the potential outcome under $D_i = d$.
• By consistency and no unmeasured confounders, we have $\mu_d(x) = \mathbb{E}[Y_i \mid D_i = d, X_i = x]$.
• Estimate a regression of $Y_i$ on $X_i$ among the $D_i = d$ group.
• Then $\hat{\mu}_d(x)$ is just a predicted value from that regression at $X_i = x$.
• How can we use this?

44 58

Imputation estimators

• Impute the treated potential outcomes with $\widehat{Y}_i(1) = \hat{\mu}_1(X_i)$.
• Impute the control potential outcomes with $\widehat{Y}_i(0) = \hat{\mu}_0(X_i)$.
• Procedure:
  • Regress $Y_i$ on $X_i$ in the treated group and get predicted values for all units (treated or control).
  • Regress $Y_i$ on $X_i$ in the control group and get predicted values for all units (treated or control).
  • Take the average difference between these predicted values.
• More mathematically, it looks like this:

\[
\hat{\tau}_{imp} = \frac{1}{N}\sum_i \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)
\]

• Sometimes called an imputation estimator.

45 58

Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample.
• Useful trick: fit a model on the entire data and use model.frame() to get the right design matrix.

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61

46 58

Notes on imputation estimators

• If the $\hat{\mu}_d(x)$ are consistent estimators, then $\hat{\tau}_{imp}$ is consistent for the ATE.
• Why don't people use this?
  • Most people don't know the results we've been talking about.
  • Harder to implement than vanilla OLS.
• Can use linear regression to estimate $\hat{\mu}_d(x) = x'\hat{\beta}_d$.
• Recent trend is to estimate $\hat{\mu}_d(x)$ via non-parametric methods, such as:
  • kernel regression, local linear regression, regression trees, etc.
  • Easiest is generalized additive models (GAMs).

47 58

Imputation estimator visualization

[Figure: scatterplot of y (−0.5 to 1.5) against x (−2 to 3) for treated and control units, with within-group regression fits]

48 58

Imputation estimator visualization

[Figure: scatterplot of y (−0.5 to 1.5) against x (−2 to 3) for treated and control units, with within-group regression fits]

49 58

Imputation estimator visualization

[Figure: scatterplot of y (−0.5 to 1.5) against x (−2 to 3) for treated and control units, with within-group regression fits]

50 58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between $Y_i$ and $X_i$:

[Figure: scatterplot of y (−1.0 to 0.5) against x (−2 to 2) for treated and control units, with nonlinear CEFs]

51 58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between $Y_i$ and $X_i$:

[Figure: scatterplot of y (−1.0 to 0.5) against x (−2 to 2) for treated and control units, with nonlinear CEFs]

52 58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between $Y_i$ and $X_i$:

[Figure: scatterplot of y (−1.0 to 0.5) against x (−2 to 2) for treated and control units, with nonlinear CEFs]

53 58

Using semiparametric regression

• Here the CEFs are nonlinear, but we don't know their form.
• We can use GAMs from the mgcv package for a flexible estimate:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204  Scale est. = 0.0071351  n = 30

54 58

Using GAMs

[Figure: scatterplot of y (−1.0 to 0.5) against x (−2 to 2) for treated and control units, with fitted GAM curves]

55 58

Using GAMs

[Figure: scatterplot of y (−1.0 to 0.5) against x (−2 to 2) for treated and control units, with fitted GAM curves]

56 58

Using GAMs

[Figure: scatterplot of y (−1.0 to 0.5) against x (−2 to 2) for treated and control units, with fitted GAM curves]

57 58

Limited dependent variables

• Usual advice: model the data from first principles. Logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is a-ok with limited DVs when:
  • Binary treatment and no covariates (just diff-in-means)
  • Binary treatment, discrete covariates, and saturated models (stratified diff-in-means)
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong.
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models.
  • Could go wrong in small samples.
  • If using nonlinear models, always get effects on the scale of the outcome.

58 58
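The first of these claims — that with a binary treatment and no covariates, OLS just reproduces the difference in means even when the outcome is binary — is easy to demonstrate. A sketch in Python for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = rng.integers(0, 2, size=1000)      # binary treatment
y = rng.binomial(1, 0.3 + 0.2 * d)     # binary (limited) outcome

diff_in_means = y[d == 1].mean() - y[d == 0].mean()

# OLS of y on an intercept and d: the slope is the diff-in-means
X = np.column_stack([np.ones_like(d, dtype=float), d])
ols_slope = np.linalg.lstsq(X, y, rcond=None)[0][1]

assert np.isclose(ols_slope, diff_in_means)
```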

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects

Saturated model minus the constant

summary(lm(aauw ~ as.factor(totchi) - 1, data = girls))

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
as.factor(totchi)0     56.41       2.76   20.42   <2e-16
as.factor(totchi)1     61.86       3.05   20.31   <2e-16
as.factor(totchi)2     52.62       1.75   30.13   <2e-16
as.factor(totchi)3     42.76       2.07   20.62   <2e-16
as.factor(totchi)4     37.11       2.90   12.79   <2e-16
as.factor(totchi)5     40.95       3.99   10.27   <2e-16
as.factor(totchi)6     22.82      10.05    2.27   0.0233
as.factor(totchi)7     39.29      11.07    3.55   0.0004
as.factor(totchi)8      1.08      11.96    0.09   0.9278
as.factor(totchi)9      6.00      23.92    0.25   0.8020
as.factor(totchi)10     3.00      20.72    0.14   0.8849
as.factor(totchi)12     0.00      41.43    0.00   1.0000
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41 on 1723 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.587, Adjusted R-squared: 0.584
F-statistic: 204 on 12 and 1723 DF, p-value: <2e-16

15 58

Compare to within-strata means

• The saturated model makes no assumptions about the between-strata relationships.
• It just calculates within-strata means:

c1 <- coef(lm(aauw ~ as.factor(totchi) - 1, data = girls))
c2 <- with(girls, tapply(aauw, totchi, mean, na.rm = TRUE))
rbind(c1, c2)

    0  1  2  3  4  5  6  7  8 9 10 12
c1 56 62 53 43 37 41 23 39 11 6  3  0
c2 56 62 53 43 37 41 23 39 11 6  3  0

16 58

Other justifications for OLS

• Justification 2: $X_i'\beta$ is the best linear predictor (in a mean-squared error sense) of $Y_i$. Why? $\beta = \arg\min_b \mathbb{E}[(Y_i - X_i'b)^2]$
• Justification 3: $X_i'\beta$ provides the minimum mean squared error linear approximation to $E[Y_i \mid X_i]$.
• Even if the CEF is not linear, a linear regression provides the best linear approximation to that CEF.
• You don't need to believe the assumptions (linearity) in order to use regression as a good approximation to the CEF.
• Warning: if the CEF is very nonlinear, then this approximation could be terrible.

17 58

The error terms

• Let's define the error term $e_i \equiv Y_i - X_i'\beta$, so that

\[
Y_i = X_i'\beta + [Y_i - X_i'\beta] = X_i'\beta + e_i
\]

• Note that the residual $e_i$ is uncorrelated with $X_i$:

\[
\begin{aligned}
\mathbb{E}[X_i e_i] &= \mathbb{E}[X_i(Y_i - X_i'\beta)]\\
&= \mathbb{E}[X_i Y_i] - \mathbb{E}[X_i X_i'\beta]\\
&= \mathbb{E}[X_i Y_i] - \mathbb{E}\left[X_i X_i'\,\mathbb{E}[X_i X_i']^{-1}\mathbb{E}[X_i Y_i]\right]\\
&= \mathbb{E}[X_i Y_i] - \mathbb{E}[X_i X_i']\,\mathbb{E}[X_i X_i']^{-1}\mathbb{E}[X_i Y_i]\\
&= \mathbb{E}[X_i Y_i] - \mathbb{E}[X_i Y_i] = 0
\end{aligned}
\]

• No assumptions on the linearity of $\mathbb{E}[Y_i \mid X_i]$.

18 58
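This orthogonality has an exact finite-sample analogue: the OLS residuals satisfy $\mathbf{X}'\hat{e} = 0$ by construction, whatever the true CEF looks like. A quick numerical check, in Python for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = np.sin(X[:, 1]) + rng.normal(size=100)   # CEF deliberately nonlinear

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
e_hat = y - X @ beta_hat                     # OLS residuals

# Sample analogue of E[X_i e_i] = 0, despite the nonlinear CEF
assert np.allclose(X.T @ e_hat, 0.0, atol=1e-8)
```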

OLS estimator

• We know the population value of $\beta$ is

\[
\beta = \mathbb{E}[X_i X_i']^{-1}\mathbb{E}[X_i Y_i]
\]

• How do we get an estimator of this?
• Plug-in principle: replace the population expectations with sample versions:

\[
\hat{\beta} = \left[\frac{1}{N}\sum_i X_i X_i'\right]^{-1} \frac{1}{N}\sum_i X_i Y_i
\]

• If you work through the matrix algebra, this turns out to be

\[
\hat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\,\mathbf{X}'\mathbf{y}
\]

19 58
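The sample plug-in formula and the matrix form are the same estimator (the $1/N$ factors cancel); a minimal sketch in Python:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

# Plug-in estimator: [ (1/N) sum X_i X_i' ]^{-1} (1/N) sum X_i Y_i
beta_hat = np.linalg.solve((X.T @ X) / n, (X.T @ y) / n)

# Identical to the usual matrix form (X'X)^{-1} X'y
assert np.allclose(beta_hat, np.linalg.lstsq(X, y, rcond=None)[0])
```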

Asymptotic OLS inference

• With this representation in hand, we can write the OLS estimator as follows:

\[
\hat{\beta} = \beta + \left[\sum_i X_i X_i'\right]^{-1}\sum_i X_i e_i
\]

• Core idea: $\sum_i X_i e_i$ is a sum of r.v.s, so the CLT applies.
• That plus some simple asymptotic theory allows us to say

\[
\sqrt{N}(\hat{\beta} - \beta) \rightsquigarrow N(0, \Omega)
\]

• It converges in distribution to a Normal distribution with mean vector 0 and covariance matrix $\Omega$:

\[
\Omega = \mathbb{E}[X_i X_i']^{-1}\,\mathbb{E}[X_i X_i' e_i^2]\,\mathbb{E}[X_i X_i']^{-1}
\]

• No linearity assumption needed.

20 58

Estimating the variance

• In large samples, then,

\[
\sqrt{N}(\hat{\beta} - \beta) \sim N(0, \Omega)
\]

• How to estimate $\Omega$? Plug-in principle again:

\[
\widehat{\Omega} = \left[\sum_i X_i X_i'\right]^{-1}\left[\sum_i X_i X_i'\,\hat{e}_i^2\right]\left[\sum_i X_i X_i'\right]^{-1}
\]

• Replace $e_i$ with its empirical counterpart, the residuals $\hat{e}_i = Y_i - X_i'\hat{\beta}$.
• Replace the population moments of $X_i$ with their sample counterparts.
• The square roots of the diagonal of this covariance matrix are the "robust" or Huber-White standard errors that Stata commonly reports.

21 58
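The sandwich estimator translates directly into a few lines of linear algebra. A sketch in Python for illustration (the HC0 version, using raw squared residuals):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
# Heteroskedastic errors: the error variance grows with |x|
y = X @ np.array([0.5, 1.0]) + rng.normal(size=n) * (1 + np.abs(X[:, 1]))

bread = np.linalg.inv(X.T @ X)           # [sum X_i X_i']^{-1}
beta_hat = bread @ X.T @ y
e_hat = y - X @ beta_hat                 # residuals replace e_i

meat = (X * e_hat[:, None] ** 2).T @ X   # sum X_i X_i' e_hat_i^2
omega_hat = bread @ meat @ bread         # the sandwich
robust_se = np.sqrt(np.diag(omega_hat))  # Huber-White standard errors

print(robust_se)
```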

Heteroskedasticity

• No assumptions of homoskedasticity here.
• Heteroskedasticity will definitely occur when:
  • the CEF is linear, but $\sigma^2(x) = \mathbb{V}[Y_i \mid X_i = x]$ is not constant in $x$;
  • $E[Y_i \mid X_i]$ is not linear, but we use linear regression to approximate it.

22 58

2 Regression and Causality

23 58

Regression and causality

• In most econometrics textbooks, regression is defined without respect to causality.
• But then when is $\hat{\beta}$ "biased"? The above derivations work for some CEF $\mathbb{E}[Y_i \mid X_i]$.
• The question, then, is when does knowing the CEF tell us something about causality?
• MHE argues that a regression is causal when the CEF it approximates is causal. Identification is king.
• We will show that, under certain conditions, a regression of the outcome on the treatment and the covariates can recover a causal parameter, but perhaps not the one in which we are interested.

24 58

Review

• Quick reminder: we have potential outcomes $Y_i(1)$ and $Y_i(0)$ and two parameters, the ATE and ATT:

\[
\begin{aligned}
\tau &= E[Y_i(1) - Y_i(0)]\\
\tau_{\text{ATT}} &= E[Y_i(1) - Y_i(0) \mid D_i = 1]
\end{aligned}
\]

• We have shown in past weeks that these effects are identified when ignorability holds. MHE calls this the conditional independence assumption (CIA).

25 58

Linear constant effects model, binary treatment

• Experiment: with a simple experiment, we can rewrite the consistency assumption as a regression formula:

\[
\begin{aligned}
Y_i &= D_i Y_i(1) + (1 - D_i)Y_i(0)\\
&= Y_i(0) + (Y_i(1) - Y_i(0))D_i\\
&= \mathbb{E}[Y_i(0)] + \tau D_i + (Y_i(0) - \mathbb{E}[Y_i(0)])\\
&= \mu_0 + \tau D_i + v_{0i}
\end{aligned}
\]

• Note that if ignorability holds (as in an experiment) for $Y_i(0)$, then it will also hold for $v_{0i}$, since $\mathbb{E}[Y_i(0)]$ is constant. Thus this satisfies the usual assumptions for regression.

26 58

Now with covariates

• Now assume no unmeasured confounders: $Y_i(d) \perp\!\!\!\perp D_i \mid X_i$.
• We will assume a linear model for the potential outcomes:

\[
Y_i(d) = \alpha + \tau \cdot d + \eta_i
\]

• Remember that linearity isn't an assumption if $D_i$ is binary.
• The effect of $D_i$ is constant here; the $\eta_i$ are the only source of individual variation, and we have $E[\eta_i] = 0$.
• The consistency assumption allows us to write this as

\[
Y_i = \alpha + \tau D_i + \eta_i
\]

27 58

Covariates in the error

• Let's assume that $\eta_i$ is linear in $X_i$: $\eta_i = X_i'\gamma + \nu_i$.
• The new error is uncorrelated with $X_i$: $\mathbb{E}[\nu_i \mid X_i] = 0$.
• This is an assumption! It might be false!
• Plug into the above:

\[
\begin{aligned}
\mathbb{E}[Y_i(d) \mid X_i] = E[Y_i \mid D_i, X_i] &= \alpha + \tau D_i + E[\eta_i \mid X_i]\\
&= \alpha + \tau D_i + X_i'\gamma + E[\nu_i \mid X_i]\\
&= \alpha + \tau D_i + X_i'\gamma
\end{aligned}
\]

28 58

Summing up regression with constant effects

• Reviewing the assumptions we've used:
  • no unmeasured confounders
  • constant treatment effects
  • linearity of the treatment/covariates
• Under these, we can run the following regression to estimate the ATE, $\tau$:

\[
Y_i = \alpha + \tau D_i + X_i'\gamma + \nu_i
\]

• This also works with continuous or ordinal $D_i$, if the effect of these variables truly is linear.

29 58

OLS, constant effects simulation

## Model with linear covariates, constant 0 effect of treatment
library(mvtnorm)
n <- 100
p <- 4
X <- rmvnorm(n = 100, mean = rep(0, p))
gamma <- c(2.74, 1.37, 1.37, 1.37)
y <- 210 + X %*% c(gamma) + rnorm(n)
alpha <- c(-1, 0.5, -0.5, -0.1)
d.probs <- boot::inv.logit(X %*% alpha)
d <- rbinom(n, size = 1, prob = d.probs)

30 58

OLS with no covariates

summary(lm(y ~ d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   220.36       4.58   48.13  < 2e-16
d             -28.69       6.54   -4.39 0.000029
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 33 on 98 degrees of freedom
Multiple R-squared: 0.164, Adjusted R-squared: 0.156
F-statistic: 19.2 on 1 and 98 DF, p-value: 0.000029

31 58

OLS with covariates

summary(lm(y ~ d + X))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 209.952      0.159  1320.7   <2e-16
d            -0.128      0.253    -0.5     0.62
X1            2.7368     0.126   21.75   <2e-16
X2            1.3677     0.114   12.01   <2e-16
X3            1.3673     0.130   10.51   <2e-16
X4            1.3570     0.106   12.80   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom
Multiple R-squared: 0.999, Adjusted R-squared: 0.999
F-statistic: 2.44e+04 on 5 and 94 DF, p-value: <2e-16

32 58

What happens with nonlinearity?

## Suppose we can only observe the following covariates
z1 <- exp(X[, 1]/2)
z2 <- X[, 2]/(1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3]/25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

• This implies that $Y_i$ and $D_i$ are functions of transformations such as $\log(Z_{i1})$, $Z_{i2}$, $Z_{i1}^2 Z_{i2}$, $Z_{i3}^{1/3}$, and $Z_{i4}^{1/2}$.
• So the regression is a nonlinear function of the observed covariates.

33 58

When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  218.728    300.799    0.73    0.469
d             -6.9292     3.3854   -2.05    0.043
z1           363.110     26.970   13.46   <2e-16
z2            -2.9033     3.6619   -0.79    0.430
z3           862.022    431.030    2.00    0.048
z4             0.4021     0.0329   12.23   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14 on 94 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF, p-value: <2e-16

34 58

3 Regression withHeterogeneousTreatment Effects

35 58

Heterogeneous effects binary treatment

bull Completely randomized experiment

119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894

= 1205831113567 + 120591119894119863119894 + (119884119894(0) minus 1205831113567)= 1205831113567 + 120591119863119894 + (119884119894(0) minus 1205831113567) + (120591119894 minus 120591) sdot 119863119894

= 1205831113567 + 120591119863119894 + 120576119894

bull Error term now includes two components1 ldquoBaselinerdquo variation in the outcome (119884119894(0) minus 1205831113567)2 Variation in the treatment effect (120591119894 minus 120591)

bull Easy to verify that under experiment 120124[120576119894|119863119894] = 0bull Thus OLS estimates the ATE with no covariates

36 58

Adding covariates

bull What happens with no unmeasured confounders Need tocondition on 119883119894 now

bull Remember identification of the ATEATT using iteratedexpectations

bull ATE is the weighted sum of CATEs

120591 = 1114012119909120591(119909) Pr[119883119894 = 119909]

bull ATEATT are weighted averages of CATEsbull What about the regression estimand 120591119877 How does it related

to the ATEATT

37 58

Heterogeneous effects and regression

bull Letrsquos investigate this under a saturated regression model

119884119894 = 1114012119909119861119909119894120572119909 + 120591119877119863119894 + 119890119894

bull Use a dummy variable for each unique combination of 119883119894119861119909119894 = 120128(119883119894 = 119909)

bull Linear in 119883119894 by construction

38 58

Investigating the regression coefficient

bull How can we investigate 120591119877 Well we can rely on theregression anatomy

120591119877 = Cov(119884119894 119863119894 minus 119864[119863119894|119883119894])120141(119863119894 minus 119864[119863119894|119883119894])

bull 119863119894 minus 120124[119863119894|119883119894] is the residual from a regression of 119863119894 on the fullset of dummies

bull With a little work we can show

120591119877 =120124 1114094120591(119883119894)(119863119894 minus 120124[119863119894|119883119894])11135691114097

120124[(119863119894 minus 119864[119863119894|119883119894])1113569]=

120124[120591(119883119894)1205901113569119889(119883119894)]120124[1205901113569119889(119883119894)]

bull 1205901113569119889(119909) = 120141[119863119894|119883119894 = 119909] is the conditional variance of treatmentassignment

39 58

ATE versus OLS

120591119877 = 120124[120591(119883119894)119882119894] = 1114012119909120591(119909)

1205901113569119889(119909)120124[1205901113569119889(119883119894)]

โ„™[119883119894 = 119909]

bull Compare to the ATE

120591 = 120124[120591(119883119894)] = 1114012119909120591(119909)โ„™[119883119894 = 119909]

bull Both weight strata relative to their size (โ„™[119883119894 = 119909])bull OLS weights strata higher if the treatment variance in those

strata (1205901113569119889(119909)) is higher in those strata relative to the averagevariance across strata (120124[1205901113569119889(119883119894)])

bull The ATE weights only by their size

40 58

Regression weighting

119882119894 =1205901113569119889(119883119894)

120124[1205901113569119889(119883119894)]

bull Why does OLS weight like thisbull OLS is a minimum-variance estimator more weight to

more precise within-strata estimatesbull Within-strata estimates are most precise when the treatment

is evenly spread and thus has the highest variancebull If 119863119894 is binary then we know the conditional variance will be

1205901113569119889(119909) = โ„™[119863119894 = 1|119883119894 = 119909] (1 minus โ„™[119863119894 = 1|119883119894 = 119909])= 119890(119909) (1 minus 119890(119909))

bull Maximum variance with โ„™[119863119894 = 1|119883119894 = 119909] = 12

41 58

OLS weighting examplebull Binary covariate

โ„™[119883119894 = 1] = 075 โ„™[119883119894 = 0] = 025โ„™[119863119894 = 1|119883119894 = 1] = 09 โ„™[119863119894 = 1|119883119894 = 0] = 05

1205901113569119889(1) = 009 1205901113569119889(0) = 025120591(1) = 1 120591(0) = minus1

bull Implies the ATE is 120591 = 05bull Average conditional variance 120124[1205901113569119889(119883119894)] = 013bull weights for 119883119894 = 1 are 009013 = 0692 for 119883119894 = 0

025013 = 192120591119877 = 120124[120591(119883119894)119882119894]

= 120591(1)119882(1)โ„™[119883119894 = 1] + 120591(0)119882(0)โ„™[119883119894 = 0]= 1 times 0692 times 075 + minus1 times 192 times 025= 0039

42 58

When will OLS estimate the ATE

bull When does 120591 = 120591119877bull Constant treatment effects 120591(119909) = 120591 = 120591119877bull Constant probability of treatment 119890(119909) = โ„™[119863119894 = 1|119883119894 = 119909] = 119890

Implies that the OLS weights are 1

bull Incorrect linearity assumption in 119883119894 will lead to more bias

43 58

Other ways to use regression

bull Whatrsquos the path forward Accept the bias (might be relatively small with saturated

models) Use a different regression approach

bull Let 120583119889(119909) = 120124[119884119894(119889)|119883119894 = 119909] be the CEF for the potentialoutcome under 119863119894 = 119889

bull By consistency and nuc we have 120583119889(119909) = 120124[119884119894|119863119894 = 119889119883119894 = 119909]bull Estimate a regression of 119884119894 on 119883119894 among the 119863119894 = 119889 groupbull Then 119889(119909) is just a predicted value from the regression for

119883119894 = 119909bull How can we use this

44 58

Imputation estimators

bull Impute the treated potential outcomes with 1114023119884119894(1) = 1113568(119883119894)bull Impute the control potential outcomes with 1114023119884119894(0) = 1113567(119883119894)bull Procedure

Regress 119884119894 on 119883119894 in the treated group and get predicted valuesfor all units (treated or control)

Regress 119884119894 on 119883119894 in the control group and get predicted valuesfor all units (treated or control)

Take the average difference between these predicted values

bull More mathematically look like this

120591119894119898119901 =1119873

11140121198941113568(119883119894) minus 1113567(119883119894)

bull Sometimes called an imputation estimator

45 58

Simple imputation estimator

bull Use predict() from the within-group models on the data fromthe entire sample

bull Useful trick use a model on the entire data andmodelframe() to get the right design matrix

heterogeneous effects

yhet lt- ifelse(d == 1 y + rnorm(n 0 5) y)

mod lt- lm(yhet ~ d + X)

mod1 lt- lm(yhet ~ X subset = d == 1)

mod0 lt- lm(yhet ~ X subset = d == 0)

y1imps lt- predict(mod1 modelframe(mod))

y0imps lt- predict(mod0 modelframe(mod))

mean(y1imps - y0imps)

[1] 061

46 58

Notes on imputation estimators

bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE

bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS

bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods

such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)

47 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

48 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

49 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

50 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

51 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

52 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

53 58

Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible

estimatelibrary(mgcv)

mod0 lt- gam(y ~ s(x) subset = d == 0)

summary(mod0)

Family gaussian

Link function identity

Formula

y ~ s(x)

Parametric coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) -00225 00154 -146 016

Approximate significance of smooth terms

edf Refdf F p-value

s(x) 603 708 413 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

R-sq(adj) = 0909 Deviance explained = 928

GCV = 00093204 Scale est = 00071351 n = 30

54 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

55 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

56 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

57 58

Limited dependent variables

bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc

bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models

(stratified diff-in-means)

bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong

bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models

Could go wrong in small samples If using nonlinear models always get effects on the scale of the

outcome

58 58

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 16: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

Compare to within-strata means

bull The saturated model makes no assumptions about thebetween-strata relationships

bull Just calculates within-strata means

c1 lt- coef(lm(aauw ~ asfactor(totchi) - 1 data = girls))

c2 lt- with(girls tapply(aauw totchi mean narm = TRUE))

rbind(c1 c2)

0 1 2 3 4 5 6 7 8 9 10 12

c1 56 62 53 43 37 41 23 39 11 6 3 0

c2 56 62 53 43 37 41 23 39 11 6 3 0

16 58

Other justifications for OLS

bull Justification 2 119883prime119894 120573 is the best linear predictor (in a

mean-squared error sense) of 119884119894 Why 120573 = argmin119887 120124[(119884119894 minus 119883prime

119894 119887)1113569]

bull Justification 3 119883prime119894 120573 provides the minimum mean squared

error linear approxmiation to 119864[119884119894|119883119894]bull Even if the CEF is not linear a linear regression provides the

best linear approximation to that CEFbull Donrsquot need to believe the assumptions (linearity) in order to

use regression as a good approximation to the CEFbull Warning if the CEF is very nonlinear then this approximation

could be terrible

17 58

The error terms

bull Letrsquos define the error term 119890119894 equiv 119884119894 minus 119883prime119894 120573 so that

119884119894 = 119883prime119894 120573 + [119884119894 minus 119883prime

119894 120573] = 119883prime119894 120573 + 119890119894

bull Note the residual 119890119894 is uncorrelated with 119883119894

120124[119883119894119890119894] = 120124[119883119894(119884119894 minus 119883prime119894 120573)]

= 120124[119883119894119884119894] minus 120124[119883119894119883prime119894 120573]

= 120124[119883119894119884119894] minus 120124 1114094119883119894119883prime119894120124[119883119894119883prime

119894 ]minus1113568120124[119883119894119884119894]1114097= 120124[119883119894119884119894] minus 120124[119883119894119883prime

119894 ]120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]

= 120124[119883119894119884119894] minus 120124[119883119894119884119894] = 0

bull No assumptions on the linearity of 120124[119884119894|119883119894]

18 58

OLS estimator

• We know the population value of $\beta$ is:

$$\beta = \mathbb{E}[X_i X_i']^{-1}\mathbb{E}[X_i Y_i]$$

• How do we get an estimator of this?
• Plug-in principle: replace population expectations with sample versions:

$$\hat{\beta} = \left[\frac{1}{N}\sum_i X_i X_i'\right]^{-1} \frac{1}{N}\sum_i X_i Y_i$$

• If you work through the matrix algebra, this turns out to be:

$$\hat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$$

19/58
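The plug-in and matrix formulas can be checked directly against lm(); this is a sketch on simulated data, with all names invented:

```r
## Matrix-form OLS versus lm()
set.seed(2)
n <- 100
X <- cbind(1, rnorm(n), runif(n))       # design matrix, constant included
y <- X %*% c(1, 2, -3) + rnorm(n)

beta.hat <- solve(t(X) %*% X) %*% (t(X) %*% y)   # (X'X)^{-1} X'y

## lm() adds its own intercept, so suppress it with "- 1"
max(abs(beta.hat - coef(lm(y ~ X - 1))))         # numerically zero
```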

Asymptotic OLS inference

• With this representation in hand, we can write the OLS estimator as follows:

$$\hat{\beta} = \beta + \left[\sum_i X_i X_i'\right]^{-1}\sum_i X_i e_i$$

• Core idea: $\sum_i X_i e_i$ is a sum of r.v.s, so the CLT applies.
• That, plus some simple asymptotic theory, allows us to say:

$$\sqrt{N}(\hat{\beta} - \beta) \rightsquigarrow N(0, \Omega)$$

• Converges in distribution to a Normal distribution with mean vector 0 and covariance matrix $\Omega$:

$$\Omega = \mathbb{E}[X_i X_i']^{-1}\,\mathbb{E}[X_i X_i' e_i^2]\,\mathbb{E}[X_i X_i']^{-1}$$

• No linearity assumption needed.

20/58

Estimating the variance

• In large samples, then:

$$\sqrt{N}(\hat{\beta} - \beta) \sim N(0, \Omega)$$

• How to estimate $\Omega$? Plug-in principle again:

$$\hat{\Omega} = \left[\sum_i X_i X_i'\right]^{-1}\left[\sum_i X_i X_i'\,\hat{e}_i^2\right]\left[\sum_i X_i X_i'\right]^{-1}$$

• Replace $e_i$ with its empirical counterpart, the residuals $\hat{e}_i = Y_i - X_i'\hat{\beta}$.
• Replace the population moments of $X_i$ with their sample counterparts.
• The square roots of the diagonals of this covariance matrix are the "robust" or Huber–White standard errors that Stata commonly reports.

21/58
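The sandwich formula can be computed by hand in a few lines. A sketch, with heteroskedastic errors built in deliberately and all variable names invented:

```r
## Manual HC0 sandwich estimator following the slide's formula
set.seed(3)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 1 + abs(x))   # non-constant error variance
X <- cbind(1, x)

mod <- lm(y ~ x)
e.hat <- residuals(mod)

bread <- solve(t(X) %*% X)
meat  <- t(X) %*% (X * e.hat^2)   # sum_i X_i X_i' * ehat_i^2
Omega.hat <- bread %*% meat %*% bread

sqrt(diag(Omega.hat))             # "robust" standard errors
```

This should match the packaged version, sandwich::vcovHC(mod, type = "HC0"), if that package is available, though the manual computation is all the formula requires.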

Heteroskedasticity

• No assumptions of homoskedasticity.
• Heteroskedasticity will definitely occur when:
  – the CEF is linear, but $\sigma^2(x) = \mathbb{V}[Y_i|X_i = x]$ is not constant in $x$;
  – $\mathbb{E}[Y_i|X_i]$ is not linear, but we use linear regression to approximate it.

22/58

2. Regression and Causality

23/58

Regression and causality

• Most econometrics textbooks: regression is defined without respect to causality.
• But then, when is $\hat{\beta}$ "biased"? The above derivations hold for whatever the CEF $\mathbb{E}[Y_i|X_i]$ happens to be.
• The question, then, is: when does knowing the CEF tell us something about causality?
• MHE argues that a regression is causal when the CEF it approximates is causal. Identification is king.
• We will show that, under certain conditions, a regression of the outcome on the treatment and the covariates can recover a causal parameter, but perhaps not the one in which we are interested.

24/58

Review

• Quick reminder: we have potential outcomes $Y_i(1)$ and $Y_i(0)$ and two parameters, the ATE and ATT:

$$\tau = \mathbb{E}[Y_i(1) - Y_i(0)]$$
$$\tau_{\text{ATT}} = \mathbb{E}[Y_i(1) - Y_i(0) \mid D_i = 1]$$

• We have shown in past weeks that these effects are identified when ignorability holds. MHE calls this the conditional independence assumption (CIA).

25/58

Linear constant effects model, binary treatment

• Experiment: with a simple experiment, we can rewrite the consistency assumption as a regression formula:

$$\begin{aligned}
Y_i &= D_i Y_i(1) + (1 - D_i) Y_i(0) \\
&= Y_i(0) + (Y_i(1) - Y_i(0)) D_i \\
&= \mathbb{E}[Y_i(0)] + \tau D_i + (Y_i(0) - \mathbb{E}[Y_i(0)]) \\
&= \mu_0 + \tau D_i + v_{0i}
\end{aligned}$$

• Note that if ignorability holds (as in an experiment) for $Y_i(0)$, then it will also hold for $v_{0i}$, since $\mathbb{E}[Y_i(0)]$ is constant. Thus this satisfies the usual assumptions for regression.

26/58

Now with covariates

• Now assume no unmeasured confounders: $Y_i(d) \perp\!\!\!\perp D_i \mid X_i$.
• We will assume a linear model for the potential outcomes:

$$Y_i(d) = \alpha + \tau \cdot d + \eta_i$$

• Remember that linearity isn't an assumption if $D_i$ is binary.
• The effect of $D_i$ is constant here; the $\eta_i$ are the only source of individual variation, and we have $\mathbb{E}[\eta_i] = 0$.
• The consistency assumption allows us to write this as:

$$Y_i = \alpha + \tau D_i + \eta_i$$

27/58

Covariates in the error

• Let's assume that $\eta_i$ is linear in $X_i$: $\eta_i = X_i'\gamma + \nu_i$.
• The new error is uncorrelated with $X_i$: $\mathbb{E}[\nu_i|X_i] = 0$.
• This is an assumption! It might be false!
• Plug into the above:

$$\begin{aligned}
\mathbb{E}[Y_i(d)|X_i] = \mathbb{E}[Y_i|D_i, X_i] &= \alpha + \tau D_i + \mathbb{E}[\eta_i|X_i] \\
&= \alpha + \tau D_i + X_i'\gamma + \mathbb{E}[\nu_i|X_i] \\
&= \alpha + \tau D_i + X_i'\gamma
\end{aligned}$$

28/58

Summing up regression with constant effects

• Reviewing the assumptions we've used:
  – no unmeasured confounders,
  – constant treatment effects,
  – linearity of the treatment/covariates.
• Under these, we can run the following regression to estimate the ATE, $\tau$:

$$Y_i = \alpha + \tau D_i + X_i'\gamma + \nu_i$$

• Works with continuous or ordinal $D_i$ if the effect of these variables is truly linear.

29/58

OLS constant effects simulation

Model with linear covariates and a constant 0 effect of treatment:

library(mvtnorm)
n <- 100
p <- 4
X <- rmvnorm(n = 100, mean = rep(0, p))
gamma <- c(2.74, 1.37, 1.37, 1.37)
y <- 21 + X %*% c(gamma) + rnorm(n)
alpha <- c(-1, 0.5, -0.5, -0.1)
d.probs <- boot::inv.logit(X %*% alpha)
d <- rbinom(n, size = 1, prob = d.probs)

30/58

OLS with no covariates

summary(lm(y ~ d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   22.036      0.458   48.13  < 2e-16 ***
d             -2.869      0.654   -4.39 0.000029 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.3 on 98 degrees of freedom
Multiple R-squared: 0.164, Adjusted R-squared: 0.156
F-statistic: 19.2 on 1 and 98 DF, p-value: 0.000029

31/58

OLS with covariates

summary(lm(y ~ d + X))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  20.9952      0.159  132.07   <2e-16 ***
d            -0.128       0.253   -0.5      0.62
X1            2.7368      0.126   21.75   <2e-16 ***
X2            1.3677      0.114   12.01   <2e-16 ***
X3            1.3673      0.130   10.51   <2e-16 ***
X4            1.3570      0.106   12.80   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom
Multiple R-squared: 0.999, Adjusted R-squared: 0.999
F-statistic: 2.44e+04 on 5 and 94 DF, p-value: <2e-16

32/58

What happens with nonlinearity?

Suppose we can only observe the following covariates:

z1 <- exp(X[, 1]/2)
z2 <- X[, 2]/(1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3]/25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

Implies that $Y_i$ and $D_i$ are functions of $\log(Z_{i1})$, $Z_{i2}$, $Z_{i1}^2 Z_{i2}$, $1/\log(Z_{i1})$, $Z_{i3}/\log(Z_{i1})$, and $Z_{i4}^{1/2}$.

The regression is a nonlinear function of the observed covariates.

33/58

When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.18728    3.00799    0.73    0.469
d           -0.69292    0.33854   -2.05    0.043 *
z1           3.63110    0.26970   13.46   <2e-16 ***
z2          -0.29033    0.36619   -0.79    0.430
z3           8.62022    4.31030    2.00    0.048 *
z4           0.04021    0.00329   12.23   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.4 on 94 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF, p-value: <2e-16

34/58

3. Regression with Heterogeneous Treatment Effects

35/58

Heterogeneous effects, binary treatment

• In a completely randomized experiment:

$$\begin{aligned}
Y_i &= D_i Y_i(1) + (1 - D_i) Y_i(0) \\
&= Y_i(0) + (Y_i(1) - Y_i(0)) D_i \\
&= \mu_0 + \tau_i D_i + (Y_i(0) - \mu_0) \\
&= \mu_0 + \tau D_i + (Y_i(0) - \mu_0) + (\tau_i - \tau) \cdot D_i \\
&= \mu_0 + \tau D_i + \varepsilon_i
\end{aligned}$$

• The error term now includes two components:
  1. "Baseline" variation in the outcome: $(Y_i(0) - \mu_0)$.
  2. Variation in the treatment effect: $(\tau_i - \tau)$.
• It is easy to verify that, under an experiment, $\mathbb{E}[\varepsilon_i|D_i] = 0$.
• Thus, OLS estimates the ATE with no covariates.

36/58
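A small simulation illustrates the claim: with a randomized binary treatment and heterogeneous individual effects, the no-covariate OLS coefficient on the treatment is still (up to sampling noise) the ATE. All names and parameter values here are invented for the sketch:

```r
## Randomized treatment + heterogeneous effects: OLS still recovers the ATE
set.seed(4)
n <- 1e5
tau.i <- rnorm(n, mean = 2, sd = 3)     # individual effects; ATE = 2
y0 <- rnorm(n)                          # baseline outcomes
y1 <- y0 + tau.i
d <- rbinom(n, size = 1, prob = 0.5)    # complete randomization
y <- ifelse(d == 1, y1, y0)             # consistency

coef(lm(y ~ d))["d"]                    # approximately 2
```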

Adding covariates

• What happens with no unmeasured confounders? We need to condition on $X_i$ now.
• Remember identification of the ATE/ATT using iterated expectations.
• The ATE is the weighted sum of CATEs:

$$\tau = \sum_x \tau(x)\,\Pr[X_i = x]$$

• The ATE/ATT are weighted averages of CATEs.
• What about the regression estimand, $\tau_R$? How does it relate to the ATE/ATT?

37/58

Heterogeneous effects and regression

• Let's investigate this under a saturated regression model:

$$Y_i = \sum_x B_{xi}\alpha_x + \tau_R D_i + e_i$$

• Use a dummy variable for each unique combination of $X_i$: $B_{xi} = \mathbb{1}(X_i = x)$.
• Linear in $X_i$ by construction.

38/58

Investigating the regression coefficient

• How can we investigate $\tau_R$? Well, we can rely on the regression anatomy:

$$\tau_R = \frac{\mathrm{Cov}(Y_i,\, D_i - \mathbb{E}[D_i|X_i])}{\mathbb{V}(D_i - \mathbb{E}[D_i|X_i])}$$

• $D_i - \mathbb{E}[D_i|X_i]$ is the residual from a regression of $D_i$ on the full set of dummies.
• With a little work, we can show:

$$\tau_R = \frac{\mathbb{E}\left[\tau(X_i)\,(D_i - \mathbb{E}[D_i|X_i])^2\right]}{\mathbb{E}[(D_i - \mathbb{E}[D_i|X_i])^2]} = \frac{\mathbb{E}[\tau(X_i)\,\sigma_d^2(X_i)]}{\mathbb{E}[\sigma_d^2(X_i)]}$$

• $\sigma_d^2(x) = \mathbb{V}[D_i|X_i = x]$ is the conditional variance of treatment assignment.

39/58
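The anatomy formula is the Frisch–Waugh–Lovell decomposition, and it can be verified numerically: the coefficient on the treatment in the saturated regression equals the coefficient from regressing the outcome on the residualized treatment. A sketch with a discrete covariate (data and names invented):

```r
## Regression anatomy check via residualized treatment
set.seed(5)
n <- 1000
x <- sample(0:2, n, replace = TRUE)                 # discrete covariate
d <- rbinom(n, 1, prob = c(0.3, 0.5, 0.7)[x + 1])   # treatment depends on x
y <- x + d * (1 + x) + rnorm(n)                     # heterogeneous effects

B <- factor(x)                                      # saturated dummies
tau.R <- coef(lm(y ~ d + B))["d"]

d.tilde <- residuals(lm(d ~ B))                     # D_i - Ehat[D_i | X_i]
tau.anatomy <- coef(lm(y ~ d.tilde))["d.tilde"]

c(tau.R, tau.anatomy)                               # identical up to rounding
```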

ATE versus OLS

$$\tau_R = \mathbb{E}[\tau(X_i) W_i] = \sum_x \tau(x)\,\frac{\sigma_d^2(x)}{\mathbb{E}[\sigma_d^2(X_i)]}\,\Pr[X_i = x]$$

• Compare to the ATE:

$$\tau = \mathbb{E}[\tau(X_i)] = \sum_x \tau(x)\,\Pr[X_i = x]$$

• Both weight strata relative to their size, $\Pr[X_i = x]$.
• OLS weights strata more heavily if the treatment variance in those strata, $\sigma_d^2(x)$, is high relative to the average variance across strata, $\mathbb{E}[\sigma_d^2(X_i)]$.
• The ATE weights strata only by their size.

40/58

Regression weighting

$$W_i = \frac{\sigma_d^2(X_i)}{\mathbb{E}[\sigma_d^2(X_i)]}$$

• Why does OLS weight like this?
• OLS is a minimum-variance estimator ⇝ more weight to more precise within-strata estimates.
• Within-strata estimates are most precise when the treatment is evenly spread and thus has the highest variance.
• If $D_i$ is binary, then we know the conditional variance will be:

$$\sigma_d^2(x) = \Pr[D_i = 1|X_i = x]\,(1 - \Pr[D_i = 1|X_i = x]) = e(x)\,(1 - e(x))$$

• Maximum variance when $\Pr[D_i = 1|X_i = x] = 1/2$.

41/58

OLS weighting example

• Binary covariate:

$$\Pr[X_i = 1] = 0.75 \qquad \Pr[X_i = 0] = 0.25$$
$$\Pr[D_i = 1|X_i = 1] = 0.9 \qquad \Pr[D_i = 1|X_i = 0] = 0.5$$
$$\sigma_d^2(1) = 0.09 \qquad \sigma_d^2(0) = 0.25$$
$$\tau(1) = 1 \qquad \tau(0) = -1$$

• Implies the ATE is $\tau = 0.5$.
• Average conditional variance: $\mathbb{E}[\sigma_d^2(X_i)] = 0.13$.
• ⇝ weights for $X_i = 1$ are $0.09/0.13 = 0.692$; for $X_i = 0$, $0.25/0.13 = 1.92$.

$$\begin{aligned}
\tau_R &= \mathbb{E}[\tau(X_i) W_i] \\
&= \tau(1)\,W(1)\,\Pr[X_i = 1] + \tau(0)\,W(0)\,\Pr[X_i = 0] \\
&= 1 \times 0.692 \times 0.75 + (-1) \times 1.92 \times 0.25 \\
&= 0.039
\end{aligned}$$

42/58
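The arithmetic on this slide can be reproduced directly (variable names are invented):

```r
## Slide's OLS-weighting example, computed by hand
p.x   <- c(0.25, 0.75)          # P[X = 0], P[X = 1]
e.x   <- c(0.50, 0.90)          # P[D = 1 | X = x]
tau.x <- c(-1, 1)               # CATEs for x = 0, 1

s2.x   <- e.x * (1 - e.x)       # sigma^2_d(x): 0.25 and 0.09
s2.bar <- sum(s2.x * p.x)       # E[sigma^2_d(X)] = 0.13
w.x    <- s2.x / s2.bar         # OLS weights: 1.92 and 0.692

sum(tau.x * p.x)                # ATE: 0.5
sum(tau.x * w.x * p.x)          # tau_R: 0.0385 (the slide rounds to 0.039)
```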

When will OLS estimate the ATE?

• When does $\tau = \tau_R$?
• Constant treatment effects: $\tau(x) = \tau = \tau_R$.
• Constant probability of treatment: $e(x) = \Pr[D_i = 1|X_i = x] = e$.
  – Implies that the OLS weights are all 1.
• An incorrect linearity assumption in $X_i$ will lead to more bias.

43/58

Other ways to use regression

• What's the path forward?
  – Accept the bias (it might be relatively small with saturated models).
  – Use a different regression approach.
• Let $\mu_d(x) = \mathbb{E}[Y_i(d)|X_i = x]$ be the CEF for the potential outcome under $D_i = d$.
• By consistency and no unmeasured confounders, we have $\mu_d(x) = \mathbb{E}[Y_i|D_i = d, X_i = x]$.
• Estimate a regression of $Y_i$ on $X_i$ among the $D_i = d$ group.
• Then $\hat{\mu}_d(x)$ is just a predicted value from that regression at $X_i = x$.
• How can we use this?

44/58

Imputation estimators

• Impute the treated potential outcomes with $\hat{Y}_i(1) = \hat{\mu}_1(X_i)$.
• Impute the control potential outcomes with $\hat{Y}_i(0) = \hat{\mu}_0(X_i)$.
• Procedure:
  – Regress $Y_i$ on $X_i$ in the treated group and get predicted values for all units (treated or control).
  – Regress $Y_i$ on $X_i$ in the control group and get predicted values for all units (treated or control).
  – Take the average difference between these predicted values.
• More formally, it looks like this:

$$\hat{\tau}_{imp} = \frac{1}{N}\sum_i \left[\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)\right]$$

• Sometimes called an imputation estimator.

45/58

Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample.
• Useful trick: fit a model on the entire data and use model.frame() to get the right design matrix:

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61

46/58

Notes on imputation estimators

• If the $\hat{\mu}_d(x)$ are consistent estimators, then $\hat{\tau}_{imp}$ is consistent for the ATE.
• Why don't people use this?
  – Most people don't know the results we've been talking about.
  – Harder to implement than vanilla OLS.
• Can use linear regression to estimate $\hat{\mu}_d(x) = x'\hat{\beta}_d$.
• A recent trend is to estimate $\hat{\mu}_d(x)$ via non-parametric methods, such as:
  – kernel regression, local linear regression, regression trees, etc.
  – Easiest: generalized additive models (GAMs).

47/58

Imputation estimator visualization

[Figure: scatterplot of y against x for treated and control units, illustrating the within-group fits used by the imputation estimator]

48/58

Imputation estimator visualization

[Figure: same scatterplot of y against x, treated and control, with the imputation fits developed further]

49/58

Imputation estimator visualization

[Figure: same scatterplot of y against x, treated and control, completing the imputation illustration]

50/58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between $Y_i$ and $X_i$:

[Figure: y plotted against x for treated and control units with nonlinear CEFs]

51/58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between $Y_i$ and $X_i$:

[Figure: the same nonlinear treated/control data, with fits added]

52/58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between $Y_i$ and $X_i$:

[Figure: the same nonlinear treated/control data, completing the illustration]

53/58

Using semiparametric regression

• Here the CEFs are nonlinear, but we don't know their form.
• We can use GAMs from the mgcv package to estimate them flexibly:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204  Scale est. = 0.0071351  n = 30

54/58

Using GAMs

[Figure: estimated GAM fit for the control group plotted over y against x, treated and control points shown]

55/58

Using GAMs

[Figure: GAM fits for both treated and control groups over y against x]

56/58

Using GAMs

[Figure: treated and control GAM fits, completing the semiparametric imputation illustration]

57/58

Limited dependent variables

• Usual advice: model the data from first principles. Logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is A-OK with limited DVs when:
  – binary treatment and no covariates (just a difference in means);
  – binary treatment, discrete covariates, and saturated models (stratified differences in means).
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong.
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models.
  – Could go wrong in small samples.
  – If using nonlinear models, always get effects on the scale of the outcome.

58/58
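A sketch of the saturated-model claim: with a binary outcome, a binary treatment, and one binary covariate, the saturated OLS fit reproduces the stratified differences in means exactly. All names and parameter values below are invented for illustration:

```r
## Saturated OLS = stratified difference in means, even with a binary DV
set.seed(6)
n <- 500
x <- rbinom(n, 1, 0.4)                        # binary covariate
d <- rbinom(n, 1, 0.3 + 0.4 * x)              # treatment depends on x
y <- rbinom(n, 1, plogis(-1 + d + x))         # binary (limited) outcome

mod <- lm(y ~ d * x)                          # saturated model
ols.x0 <- coef(mod)["d"]                      # effect in the x = 0 stratum
ols.x1 <- coef(mod)["d"] + coef(mod)["d:x"]   # effect in the x = 1 stratum

dm <- function(s) mean(y[d == 1 & x == s]) - mean(y[d == 0 & x == s])
c(ols.x0 - dm(0), ols.x1 - dm(1))             # both essentially zero
```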

Page 17: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

Other justifications for OLS

bull Justification 2 119883prime119894 120573 is the best linear predictor (in a

mean-squared error sense) of 119884119894 Why 120573 = argmin119887 120124[(119884119894 minus 119883prime

119894 119887)1113569]

bull Justification 3 119883prime119894 120573 provides the minimum mean squared

error linear approxmiation to 119864[119884119894|119883119894]bull Even if the CEF is not linear a linear regression provides the

best linear approximation to that CEFbull Donrsquot need to believe the assumptions (linearity) in order to

use regression as a good approximation to the CEFbull Warning if the CEF is very nonlinear then this approximation

could be terrible

17 58

The error terms

bull Letrsquos define the error term 119890119894 equiv 119884119894 minus 119883prime119894 120573 so that

119884119894 = 119883prime119894 120573 + [119884119894 minus 119883prime

119894 120573] = 119883prime119894 120573 + 119890119894

bull Note the residual 119890119894 is uncorrelated with 119883119894

120124[119883119894119890119894] = 120124[119883119894(119884119894 minus 119883prime119894 120573)]

= 120124[119883119894119884119894] minus 120124[119883119894119883prime119894 120573]

= 120124[119883119894119884119894] minus 120124 1114094119883119894119883prime119894120124[119883119894119883prime

119894 ]minus1113568120124[119883119894119884119894]1114097= 120124[119883119894119884119894] minus 120124[119883119894119883prime

119894 ]120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]

= 120124[119883119894119884119894] minus 120124[119883119894119884119894] = 0

bull No assumptions on the linearity of 120124[119884119894|119883119894]

18 58

OLS estimator

bull We know the population value of 120573 is

120573 = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]

bull How do we get an estimator of thisbull Plug-in principle replace population expectation with

sample versions

=

โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1119873

1114012119894119883119894119883prime

119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

minus11135681119873

1114012119894119883119894119884119894

bull If you work through the matrix algebra this turns out to be

= (119831prime119831)minus1113568 119831prime119858

19 58

Asymptotic OLS inference

bull With this representation in hand we can write the OLSestimator as follows

= 120573 +

โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1114012

119894119883119894119883prime

119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

minus1113568

1114012119894119883119894119890119894

bull Core idea sum119894119883119894119890119894 is the sum of rvs so the CLT applies

bull That plus some simple asymptotic theory allows us to say

radic119873( minus 120573) 119873(0ฮฉ)

bull Converges in distribution to a Normal distribution with meanvector 0 and covariance matrix ฮฉ

ฮฉ = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119883prime

119894 1198901113569119894 ]120124[119883119894119883prime119894 ]minus1113568

bull No linearity assumption needed20 58

Estimating the variance

bull In large samples then

radic119873( minus 120573) sim 119873(0ฮฉ)

bull How to estimate ฮฉ Plug-in principle again

1114023ฮฉ =

โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1114012

119894119883119894119883prime

119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

minus1113568 โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1114012119894119883119894119883prime

119894 1113569119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1114012

119894119883119894119883prime

119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

minus1113568

bull Replace 119890119894 with its emprical counterpart (residuals)119894 = 119884119894 minus 119883prime

119894 bull Replace the population moments of 119883119894 with their sample

counterpartsbull The square root of the diagonals of this covariance matrix are

the ldquorobustrdquo or Huber-White standard errors that Statacommonly report

21 58

Heteroskedasticity

bull No assumptions of homoskedasticitybull Heteroskedaticity will definitely occur when

CEF is linear but the 1205901113569(119909) = 120141[119884119894|119883119894 = 119909] is not constant in 119909 119864[119884119894|119883119894] is not linear but we use the linear regression to

approxmiate it

22 58

2 Regression andCausality

23 58

Regression and causality

bull Most econometrics textbooks regression defined withoutrespect to causality

bull But then when is ldquobiasedrdquo The above derivations work forsome 120124[119884119894|119883119894]

bull The question then is when does knowing the CEF tell ussomething about causality

bull MHE argues that a regression is causal when the CEF itapproximates is causal Identification is king

bull We will show that under certain conditions a regression of theoutcome on the treatment and the covariates can recover acausal parameter but perhaps not the one in which we areinterested

24 58

Review

bull Quick reminder we have potential outcomes 119884119894(1) and 119884119894(0)and two parameters the ATE and ATT

120591 = 119864[119884119894(1) minus 119884119894(0)]120591ATT = 119864[119884119894(1) minus 119884119894(0)|119863119894 = 1]

bull We have shown in past weeks that these effects are identifiedwhen ignorability holds MHE calls this the conditionalindependence assumption (CIA)

25 58

Linear constant effects model binarytreatment

bull Experiment with a simple experiment we can rewrite theconsistency assumption to be a regression formula

119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894

= 120124[119884119894(0)] + 120591119863119894 + (119884119894(0) minus 120124[119884119894(0)])= 1205831113567 + 120591119863119894 + 1199071113567119894

bull Note that if ignorability holds (as in an experiment) for 119884119894(0)then it will also hold for 1199071113567119894 since 120124[119884119894(0)] is constant Thusthis satifies the usual assumptions for regression

26 58

Now with covariates

bull Now assume no unmeasured confounders 119884119894(119889) โŸ‚โŸ‚ 119863119894|119883119894bull We will assume a linear model for the potential outcomes

119884119894(119889) = 120572 + 120591 sdot 119889 + 120578119894

bull Remember that linearity isnrsquot an assumption if 119863119894 is binarybull Effect of 119863119894 is constant here the 120578119894 are the only source of

individual variation and we have 119864[120578119894] = 0bull Consistency assumption allows us to write this as

119884119894 = 120572 + 120591119863119894 + 120578119894

27 58

Covariates in the error

bull Letrsquos assume that 120578119894 is linear in 119883119894 120578119894 = 119883prime119894 120574 + 120584119894

bull New error is uncorrelated with 119883119894 120124[120584119894|119883119894] = 0bull This is an assumption Might be falsebull Plug into the above

120124[119884119894(119889)|119883119894] = 119864[119884119894|119863119894 119883119894] = 120572 + 120591119863119894 + 119864[120578119894|119883119894]= 120572 + 120591119863119894 + 119883prime

119894 120574 + 119864[120584119894|119883119894]= 120572 + 120591119863119894 + 119883prime

119894 120574

28 58

Summing up regression with constanteffects

bull Reviewing the assumptions wersquove used no unmeasured confounders constant treatment effects linearity of the treatmentcovariates

bull Under these we can run the following regression to estimatethe ATE 120591

119884119894 = 120572 + 120591119863119894 + 119883prime119894 120574 + 120584119894

bull Works with continuous or ordinal 119863119894 if linearity in the effect ofthese variables is truly linear

29 58

OLS constant effects simulation

Model with linear covariates constant 0 effect of treatment

library(mvtnorm)

n lt- 100

p lt- 4

X lt- rmvnorm(n = 100 mean = rep(0 p))

gamma lt- c(274 137 137 137)

y lt- 210 + X c(gamma) + rnorm(n)

alpha lt- c(-1 05 -05 -01)

dprobs lt- bootinvlogit(X alpha)

d lt- rbinom(n size = 1 prob = dprobs)

30 58

OLS with no covariates

summary(lm(y ~ d))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 22036 458 4813 lt 2e-16

d -2869 654 -439 0000029

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 33 on 98 degrees of freedom

Multiple R-squared 0164 Adjusted R-squared 0156

F-statistic 192 on 1 and 98 DF p-value 0000029

31 58

OLS with covariates

summary(lm(y ~ d + X))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 209952 0159 13207 lt2e-16

d -0128 0253 -05 062

X1 27368 0126 2175 lt2e-16

X2 13677 0114 1201 lt2e-16

X3 13673 0130 1051 lt2e-16

X4 13570 0106 1280 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 1 on 94 degrees of freedom

Multiple R-squared 0999 Adjusted R-squared 0999

F-statistic 244e+04 on 5 and 94 DF p-value lt2e-16

32 58

What happens with nonlinearity

Suppose we can only observe the following covariates

z1 lt- exp(X[ 1]2)

z2 lt- X[ 2](1 + exp(X[ 1])) + 10

z3 lt- (X[ 1] X[ 3]25 + 06)^3

z4 lt- (X[ 2] + X[ 4] + 20)^2

Implies that 119884119894 and 119863119894 are functions of log(1198851198941113568) 1198851198941113569 119885111356911989411135681198851198941113569

1 log(1198851198941113568) 1198851198941113570 log(1198851198941113568) and 119883111356811135691198941113571

Regression is a nonlinear function of the observed covariates

33 58

When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 218728 300799 073 0469

d -69292 33854 -205 0043

z1 363110 26970 1346 lt2e-16

z2 -29033 36619 -079 0430

z3 862022 431030 200 0048

z4 04021 00329 1223 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 14 on 94 degrees of freedom

Multiple R-squared 0855 Adjusted R-squared 0847

F-statistic 111 on 5 and 94 DF p-value lt2e-16

34 58

3 Regression withHeterogeneousTreatment Effects

35 58

Heterogeneous effects binary treatment

bull Completely randomized experiment

119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894

= 1205831113567 + 120591119894119863119894 + (119884119894(0) minus 1205831113567)= 1205831113567 + 120591119863119894 + (119884119894(0) minus 1205831113567) + (120591119894 minus 120591) sdot 119863119894

= 1205831113567 + 120591119863119894 + 120576119894

bull Error term now includes two components1 ldquoBaselinerdquo variation in the outcome (119884119894(0) minus 1205831113567)2 Variation in the treatment effect (120591119894 minus 120591)

bull Easy to verify that under experiment 120124[120576119894|119863119894] = 0bull Thus OLS estimates the ATE with no covariates

36 58

Adding covariates

bull What happens with no unmeasured confounders Need tocondition on 119883119894 now

bull Remember identification of the ATEATT using iteratedexpectations

bull ATE is the weighted sum of CATEs

120591 = 1114012119909120591(119909) Pr[119883119894 = 119909]

bull ATEATT are weighted averages of CATEsbull What about the regression estimand 120591119877 How does it related

to the ATEATT

37 58

Heterogeneous effects and regression

bull Letrsquos investigate this under a saturated regression model

119884119894 = 1114012119909119861119909119894120572119909 + 120591119877119863119894 + 119890119894

bull Use a dummy variable for each unique combination of 119883119894119861119909119894 = 120128(119883119894 = 119909)

bull Linear in 119883119894 by construction

38 58

Investigating the regression coefficient

bull How can we investigate 120591119877 Well we can rely on theregression anatomy

120591119877 = Cov(119884119894 119863119894 minus 119864[119863119894|119883119894])120141(119863119894 minus 119864[119863119894|119883119894])

bull 119863119894 minus 120124[119863119894|119883119894] is the residual from a regression of 119863119894 on the fullset of dummies

bull With a little work we can show

120591119877 =120124 1114094120591(119883119894)(119863119894 minus 120124[119863119894|119883119894])11135691114097

120124[(119863119894 minus 119864[119863119894|119883119894])1113569]=

120124[120591(119883119894)1205901113569119889(119883119894)]120124[1205901113569119889(119883119894)]

bull 1205901113569119889(119909) = 120141[119863119894|119883119894 = 119909] is the conditional variance of treatmentassignment

39 58

ATE versus OLS

120591119877 = 120124[120591(119883119894)119882119894] = 1114012119909120591(119909)

1205901113569119889(119909)120124[1205901113569119889(119883119894)]

โ„™[119883119894 = 119909]

bull Compare to the ATE

120591 = 120124[120591(119883119894)] = 1114012119909120591(119909)โ„™[119883119894 = 119909]

bull Both weight strata relative to their size (โ„™[119883119894 = 119909])bull OLS weights strata higher if the treatment variance in those

strata (1205901113569119889(119909)) is higher in those strata relative to the averagevariance across strata (120124[1205901113569119889(119883119894)])

bull The ATE weights only by their size

40 58

Regression weighting

119882119894 =1205901113569119889(119883119894)

120124[1205901113569119889(119883119894)]

bull Why does OLS weight like thisbull OLS is a minimum-variance estimator more weight to

more precise within-strata estimatesbull Within-strata estimates are most precise when the treatment

is evenly spread and thus has the highest variancebull If 119863119894 is binary then we know the conditional variance will be

1205901113569119889(119909) = โ„™[119863119894 = 1|119883119894 = 119909] (1 minus โ„™[119863119894 = 1|119883119894 = 119909])= 119890(119909) (1 minus 119890(119909))

bull Maximum variance with โ„™[119863119894 = 1|119883119894 = 119909] = 12

41 58

OLS weighting example

• Binary covariate:

P[X_i = 1] = 0.75    P[X_i = 0] = 0.25
P[D_i = 1|X_i = 1] = 0.9    P[D_i = 1|X_i = 0] = 0.5
σ²_d(1) = 0.09    σ²_d(0) = 0.25
τ(1) = 1    τ(0) = −1

• Implies the ATE is τ = 0.5
• Average conditional variance: E[σ²_d(X_i)] = 0.13
• ⇝ weights for X_i = 1 are 0.09/0.13 = 0.692; for X_i = 0, 0.25/0.13 = 1.92

τ_R = E[τ(X_i) W_i]
    = τ(1) W(1) P[X_i = 1] + τ(0) W(0) P[X_i = 0]
    = 1 × 0.692 × 0.75 + (−1) × 1.92 × 0.25
    = 0.039

42 / 58
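The arithmetic on this slide can be replicated in a few lines of R (the numbers are the slide's own; the variable names are just for illustration):

```r
# Replicating the OLS weighting example: binary X, binary D
p_x   <- c(0.25, 0.75)                # P[X = 0], P[X = 1]
e_x   <- c(0.50, 0.90)                # P[D = 1 | X = x]
tau_x <- c(-1, 1)                     # CATEs: tau(0), tau(1)
var_d <- e_x * (1 - e_x)              # sigma^2_d(x) = 0.25, 0.09
avg_var <- sum(var_d * p_x)           # E[sigma^2_d(X)] = 0.13
w     <- var_d / avg_var              # OLS weights: 1.92, 0.692
tau   <- sum(tau_x * p_x)             # ATE = 0.5
tau_R <- sum(tau_x * w * p_x)         # regression estimand
round(c(ATE = tau, tau_R = tau_R), 3) # 0.500 vs 0.038
```

The gap between 0.5 and 0.038 is entirely due to the variance weights: the stratum with the opposite-signed effect has treatment probability near 1/2 and so gets far more weight.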

When will OLS estimate the ATE?

• When does τ = τ_R?
• Constant treatment effects: τ(x) = τ = τ_R
• Constant probability of treatment: e(x) = P[D_i = 1|X_i = x] = e
  ▶ Implies that the OLS weights are 1
• An incorrect linearity assumption in X_i will lead to more bias

43 / 58
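A quick numerical check of the constant-propensity case, reusing the two-stratum setup from the weighting example but with a common treatment probability (the value 0.7 is arbitrary):

```r
# With e(x) = e constant, sigma^2_d(x) is the same in every stratum,
# so the OLS weights are all 1 and tau_R reduces to the ATE
p_x   <- c(0.25, 0.75)            # P[X = 0], P[X = 1]
tau_x <- c(-1, 1)                 # CATEs
e     <- 0.7                      # common propensity (illustrative)
var_d <- rep(e * (1 - e), 2)      # constant across strata
w     <- var_d / sum(var_d * p_x) # both weights equal 1
tau_R <- sum(tau_x * w * p_x)
c(w, tau_R)                       # 1, 1, and the ATE of 0.5
```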

Other ways to use regression

• What's the path forward?
  ▶ Accept the bias (might be relatively small with saturated models)
  ▶ Use a different regression approach
• Let μ_d(x) = E[Y_i(d)|X_i = x] be the CEF for the potential outcome under D_i = d
• By consistency and no unmeasured confounders, we have μ_d(x) = E[Y_i|D_i = d, X_i = x]
• Estimate a regression of Y_i on X_i among the D_i = d group
• Then μ̂_d(x) is just a predicted value from that regression at X_i = x
• How can we use this?

44 / 58

Imputation estimators

• Impute the treated potential outcomes with Ŷ_i(1) = μ̂₁(X_i)
• Impute the control potential outcomes with Ŷ_i(0) = μ̂₀(X_i)
• Procedure:
  ▶ Regress Y_i on X_i in the treated group and get predicted values for all units (treated or control)
  ▶ Regress Y_i on X_i in the control group and get predicted values for all units (treated or control)
  ▶ Take the average difference between these predicted values
• More formally, the estimator looks like this:

τ̂_imp = (1/N) Σ_i [μ̂₁(X_i) − μ̂₀(X_i)]

• Sometimes called an imputation estimator

45 / 58

Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample
• Useful trick: use a model on the entire data and model.frame() to get the right design matrix

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

## [1] 0.61

46 / 58
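The snippet above relies on y, d, X, and n defined earlier in the deck. A self-contained sketch with its own simulated data (the data-generating process here is illustrative, with a known ATE of 1):

```r
# Self-contained imputation estimator on simulated data
set.seed(02138)
n <- 1000
x <- rnorm(n)
d <- rbinom(n, 1, plogis(0.5 * x))   # treatment depends on x
y <- 2 + x + d * (1 + x) + rnorm(n)  # heterogeneous effects, ATE = 1

mod1 <- lm(y ~ x, subset = d == 1)   # mu_1-hat
mod0 <- lm(y ~ x, subset = d == 0)   # mu_0-hat
tau_imp <- mean(predict(mod1, data.frame(x = x)) -
                predict(mod0, data.frame(x = x)))
tau_imp                              # close to the ATE of 1
```

Because both potential-outcome CEFs are truly linear in x here, the within-group regressions are consistent and the imputation estimator recovers the ATE.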

Notes on imputation estimators

• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE
• Why don't people use this?
  ▶ Most people don't know the results we've been talking about
  ▶ Harder to implement than vanilla OLS
• Can use linear regression to estimate μ̂_d(x) = x′β̂_d
• Recent trend is to estimate μ̂_d(x) via non-parametric methods such as:
  ▶ Kernel regression, local linear regression
  ▶ Regression trees, etc.
  ▶ Easiest: generalized additive models (GAMs)

47 / 58

Imputation estimator visualization

[Figure: scatterplot of y against x for treated and control units, building up the within-group fits used for imputation over three slides]

48 / 58

Imputation estimator visualization

[Figure: continued]

49 / 58

Imputation estimator visualization

[Figure: continued]

50 / 58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i

[Figure: scatterplot of y against x for treated and control units with nonlinear CEFs, built up over three slides]

51 / 58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i

[Figure: continued]

52 / 58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i

[Figure: continued]

53 / 58

Using semiparametric regression

• Here, the CEFs are nonlinear, but we don't know their form
• We can use GAMs from the mgcv package for a flexible estimate

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

## Family: gaussian
## Link function: identity
##
## Formula:
## y ~ s(x)
##
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.0225     0.0154   -1.46     0.16
##
## Approximate significance of smooth terms:
##       edf Ref.df    F p-value
## s(x) 6.03   7.08 41.3  <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.909   Deviance explained = 92.8%
## GCV = 0.0093204   Scale est. = 0.0071351   n = 30

54 / 58
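The same imputation logic carries over to any flexible smoother. A self-contained sketch using base R's loess() in place of mgcv's gam() (so it runs without extra packages; the data-generating process is illustrative, with a true constant effect of 0.5):

```r
# Semiparametric imputation: fit a smoother within each treatment group,
# predict both potential outcomes for every unit, average the difference
set.seed(02138)
n <- 300
x <- runif(n, -2, 2)
d <- rbinom(n, 1, 0.5)
y <- sin(x) + 0.5 * d + rnorm(n, 0, 0.25)  # nonlinear CEF, effect = 0.5

ctl  <- loess.control(surface = "direct")  # allow prediction at all x
mod1 <- loess(y ~ x, subset = d == 1, control = ctl)
mod0 <- loess(y ~ x, subset = d == 0, control = ctl)
tau_imp <- mean(predict(mod1, newdata = data.frame(x = x)) -
                predict(mod0, newdata = data.frame(x = x)))
tau_imp                                     # close to the true 0.5
```

Swapping loess() for gam(y ~ s(x)) within each group gives the slide's GAM version of the same estimator.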

Using GAMs

[Figure: the same nonlinear data with GAM fits for the treated and control groups, built up over three slides]

55 / 58

Using GAMs

[Figure: continued]

56 / 58

Using GAMs

[Figure: continued]

57 / 58

Limited dependent variables

• Usual advice: model the data from first principles
  ▶ Logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is a-ok with limited DVs when:
  ▶ Binary treatment and no covariates (just a difference in means)
  ▶ Binary treatment, discrete covariates, and saturated models (stratified differences in means)
• Imposing a model on LDVs in these cases imposes a distributional assumption, which could be wrong
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models
  ▶ Could go wrong in small samples
  ▶ If using nonlinear models, always get effects on the scale of the outcome

58 / 58
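A quick illustration of the first "a-ok" case: with a binary treatment, a binary outcome, and no covariates, the difference in means, the OLS coefficient, and the logit-implied effect on the probability scale all coincide (simulated, illustrative data):

```r
# Saturated case: logit fitted probabilities equal the group means,
# so logit and OLS give the same effect on the outcome scale
set.seed(02138)
n <- 500
d <- rbinom(n, 1, 0.5)
y <- rbinom(n, 1, 0.3 + 0.2 * d)             # binary outcome

diff_means <- mean(y[d == 1]) - mean(y[d == 0])
ols_coef   <- unname(coef(lm(y ~ d))["d"])
lgt <- glm(y ~ d, family = binomial)
logit_eff <- unname(predict(lgt, data.frame(d = 1), type = "response") -
                    predict(lgt, data.frame(d = 0), type = "response"))
round(c(diff_means, ols_coef, logit_eff), 6)  # all three identical
```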

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 18: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

The error terms

bull Letrsquos define the error term 119890119894 equiv 119884119894 minus 119883prime119894 120573 so that

119884119894 = 119883prime119894 120573 + [119884119894 minus 119883prime

119894 120573] = 119883prime119894 120573 + 119890119894

bull Note the residual 119890119894 is uncorrelated with 119883119894

120124[119883119894119890119894] = 120124[119883119894(119884119894 minus 119883prime119894 120573)]

= 120124[119883119894119884119894] minus 120124[119883119894119883prime119894 120573]

= 120124[119883119894119884119894] minus 120124 1114094119883119894119883prime119894120124[119883119894119883prime

119894 ]minus1113568120124[119883119894119884119894]1114097= 120124[119883119894119884119894] minus 120124[119883119894119883prime

119894 ]120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]

= 120124[119883119894119884119894] minus 120124[119883119894119884119894] = 0

bull No assumptions on the linearity of 120124[119884119894|119883119894]

18 58

OLS estimator

bull We know the population value of 120573 is

120573 = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]

bull How do we get an estimator of thisbull Plug-in principle replace population expectation with

sample versions

=

โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1119873

1114012119894119883119894119883prime

119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

minus11135681119873

1114012119894119883119894119884119894

bull If you work through the matrix algebra this turns out to be

= (119831prime119831)minus1113568 119831prime119858

19 58

Asymptotic OLS inference

bull With this representation in hand we can write the OLSestimator as follows

= 120573 +

โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1114012

119894119883119894119883prime

119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

minus1113568

1114012119894119883119894119890119894

bull Core idea sum119894119883119894119890119894 is the sum of rvs so the CLT applies

bull That plus some simple asymptotic theory allows us to say

radic119873( minus 120573) 119873(0ฮฉ)

bull Converges in distribution to a Normal distribution with meanvector 0 and covariance matrix ฮฉ

ฮฉ = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119883prime

119894 1198901113569119894 ]120124[119883119894119883prime119894 ]minus1113568

bull No linearity assumption needed20 58

Estimating the variance

bull In large samples then

radic119873( minus 120573) sim 119873(0ฮฉ)

bull How to estimate ฮฉ Plug-in principle again

1114023ฮฉ =

โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1114012

119894119883119894119883prime

119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

minus1113568 โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1114012119894119883119894119883prime

119894 1113569119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1114012

119894119883119894119883prime

119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

minus1113568

bull Replace 119890119894 with its emprical counterpart (residuals)119894 = 119884119894 minus 119883prime

119894 bull Replace the population moments of 119883119894 with their sample

counterpartsbull The square root of the diagonals of this covariance matrix are

the ldquorobustrdquo or Huber-White standard errors that Statacommonly report

21 58

Heteroskedasticity

bull No assumptions of homoskedasticitybull Heteroskedaticity will definitely occur when

CEF is linear but the 1205901113569(119909) = 120141[119884119894|119883119894 = 119909] is not constant in 119909 119864[119884119894|119883119894] is not linear but we use the linear regression to

approxmiate it

22 58

2 Regression andCausality

23 58

Regression and causality

bull Most econometrics textbooks regression defined withoutrespect to causality

bull But then when is ldquobiasedrdquo The above derivations work forsome 120124[119884119894|119883119894]

bull The question then is when does knowing the CEF tell ussomething about causality

bull MHE argues that a regression is causal when the CEF itapproximates is causal Identification is king

bull We will show that under certain conditions a regression of theoutcome on the treatment and the covariates can recover acausal parameter but perhaps not the one in which we areinterested

24 58

Review

bull Quick reminder we have potential outcomes 119884119894(1) and 119884119894(0)and two parameters the ATE and ATT

120591 = 119864[119884119894(1) minus 119884119894(0)]120591ATT = 119864[119884119894(1) minus 119884119894(0)|119863119894 = 1]

bull We have shown in past weeks that these effects are identifiedwhen ignorability holds MHE calls this the conditionalindependence assumption (CIA)

25 58

Linear constant effects model binarytreatment

bull Experiment with a simple experiment we can rewrite theconsistency assumption to be a regression formula

119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894

= 120124[119884119894(0)] + 120591119863119894 + (119884119894(0) minus 120124[119884119894(0)])= 1205831113567 + 120591119863119894 + 1199071113567119894

bull Note that if ignorability holds (as in an experiment) for 119884119894(0)then it will also hold for 1199071113567119894 since 120124[119884119894(0)] is constant Thusthis satifies the usual assumptions for regression

26 58

Now with covariates

bull Now assume no unmeasured confounders 119884119894(119889) โŸ‚โŸ‚ 119863119894|119883119894bull We will assume a linear model for the potential outcomes

119884119894(119889) = 120572 + 120591 sdot 119889 + 120578119894

bull Remember that linearity isnrsquot an assumption if 119863119894 is binarybull Effect of 119863119894 is constant here the 120578119894 are the only source of

individual variation and we have 119864[120578119894] = 0bull Consistency assumption allows us to write this as

119884119894 = 120572 + 120591119863119894 + 120578119894

27 58

Covariates in the error

bull Letrsquos assume that 120578119894 is linear in 119883119894 120578119894 = 119883prime119894 120574 + 120584119894

bull New error is uncorrelated with 119883119894 120124[120584119894|119883119894] = 0bull This is an assumption Might be falsebull Plug into the above

120124[119884119894(119889)|119883119894] = 119864[119884119894|119863119894 119883119894] = 120572 + 120591119863119894 + 119864[120578119894|119883119894]= 120572 + 120591119863119894 + 119883prime

119894 120574 + 119864[120584119894|119883119894]= 120572 + 120591119863119894 + 119883prime

119894 120574

28 58

Summing up regression with constanteffects

bull Reviewing the assumptions wersquove used no unmeasured confounders constant treatment effects linearity of the treatmentcovariates

bull Under these we can run the following regression to estimatethe ATE 120591

119884119894 = 120572 + 120591119863119894 + 119883prime119894 120574 + 120584119894

bull Works with continuous or ordinal 119863119894 if linearity in the effect ofthese variables is truly linear

29 58

OLS constant effects simulation

Model with linear covariates constant 0 effect of treatment

library(mvtnorm)

n lt- 100

p lt- 4

X lt- rmvnorm(n = 100 mean = rep(0 p))

gamma lt- c(274 137 137 137)

y lt- 210 + X c(gamma) + rnorm(n)

alpha lt- c(-1 05 -05 -01)

dprobs lt- bootinvlogit(X alpha)

d lt- rbinom(n size = 1 prob = dprobs)

30 58

OLS with no covariates

summary(lm(y ~ d))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 22036 458 4813 lt 2e-16

d -2869 654 -439 0000029

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 33 on 98 degrees of freedom

Multiple R-squared 0164 Adjusted R-squared 0156

F-statistic 192 on 1 and 98 DF p-value 0000029

31 58

OLS with covariates

summary(lm(y ~ d + X))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 209952 0159 13207 lt2e-16

d -0128 0253 -05 062

X1 27368 0126 2175 lt2e-16

X2 13677 0114 1201 lt2e-16

X3 13673 0130 1051 lt2e-16

X4 13570 0106 1280 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 1 on 94 degrees of freedom

Multiple R-squared 0999 Adjusted R-squared 0999

F-statistic 244e+04 on 5 and 94 DF p-value lt2e-16

32 58

What happens with nonlinearity

Suppose we can only observe the following covariates

z1 lt- exp(X[ 1]2)

z2 lt- X[ 2](1 + exp(X[ 1])) + 10

z3 lt- (X[ 1] X[ 3]25 + 06)^3

z4 lt- (X[ 2] + X[ 4] + 20)^2

Implies that 119884119894 and 119863119894 are functions of log(1198851198941113568) 1198851198941113569 119885111356911989411135681198851198941113569

1 log(1198851198941113568) 1198851198941113570 log(1198851198941113568) and 119883111356811135691198941113571

Regression is a nonlinear function of the observed covariates

33 58

When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 218728 300799 073 0469

d -69292 33854 -205 0043

z1 363110 26970 1346 lt2e-16

z2 -29033 36619 -079 0430

z3 862022 431030 200 0048

z4 04021 00329 1223 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 14 on 94 degrees of freedom

Multiple R-squared 0855 Adjusted R-squared 0847

F-statistic 111 on 5 and 94 DF p-value lt2e-16

34 58

3 Regression withHeterogeneousTreatment Effects

35 58

Heterogeneous effects binary treatment

bull Completely randomized experiment

119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894

= 1205831113567 + 120591119894119863119894 + (119884119894(0) minus 1205831113567)= 1205831113567 + 120591119863119894 + (119884119894(0) minus 1205831113567) + (120591119894 minus 120591) sdot 119863119894

= 1205831113567 + 120591119863119894 + 120576119894

bull Error term now includes two components1 ldquoBaselinerdquo variation in the outcome (119884119894(0) minus 1205831113567)2 Variation in the treatment effect (120591119894 minus 120591)

bull Easy to verify that under experiment 120124[120576119894|119863119894] = 0bull Thus OLS estimates the ATE with no covariates

36 58

Adding covariates

bull What happens with no unmeasured confounders Need tocondition on 119883119894 now

bull Remember identification of the ATEATT using iteratedexpectations

bull ATE is the weighted sum of CATEs

120591 = 1114012119909120591(119909) Pr[119883119894 = 119909]

bull ATEATT are weighted averages of CATEsbull What about the regression estimand 120591119877 How does it related

to the ATEATT

37 58

Heterogeneous effects and regression

bull Letrsquos investigate this under a saturated regression model

119884119894 = 1114012119909119861119909119894120572119909 + 120591119877119863119894 + 119890119894

bull Use a dummy variable for each unique combination of 119883119894119861119909119894 = 120128(119883119894 = 119909)

bull Linear in 119883119894 by construction

38 58

Investigating the regression coefficient

bull How can we investigate 120591119877 Well we can rely on theregression anatomy

120591119877 = Cov(119884119894 119863119894 minus 119864[119863119894|119883119894])120141(119863119894 minus 119864[119863119894|119883119894])

bull 119863119894 minus 120124[119863119894|119883119894] is the residual from a regression of 119863119894 on the fullset of dummies

bull With a little work we can show

120591119877 =120124 1114094120591(119883119894)(119863119894 minus 120124[119863119894|119883119894])11135691114097

120124[(119863119894 minus 119864[119863119894|119883119894])1113569]=

120124[120591(119883119894)1205901113569119889(119883119894)]120124[1205901113569119889(119883119894)]

bull 1205901113569119889(119909) = 120141[119863119894|119883119894 = 119909] is the conditional variance of treatmentassignment

39 58

ATE versus OLS

120591119877 = 120124[120591(119883119894)119882119894] = 1114012119909120591(119909)

1205901113569119889(119909)120124[1205901113569119889(119883119894)]

โ„™[119883119894 = 119909]

bull Compare to the ATE

120591 = 120124[120591(119883119894)] = 1114012119909120591(119909)โ„™[119883119894 = 119909]

bull Both weight strata relative to their size (โ„™[119883119894 = 119909])bull OLS weights strata higher if the treatment variance in those

strata (1205901113569119889(119909)) is higher in those strata relative to the averagevariance across strata (120124[1205901113569119889(119883119894)])

bull The ATE weights only by their size

40 58

Regression weighting

119882119894 =1205901113569119889(119883119894)

120124[1205901113569119889(119883119894)]

bull Why does OLS weight like thisbull OLS is a minimum-variance estimator more weight to

more precise within-strata estimatesbull Within-strata estimates are most precise when the treatment

is evenly spread and thus has the highest variancebull If 119863119894 is binary then we know the conditional variance will be

1205901113569119889(119909) = โ„™[119863119894 = 1|119883119894 = 119909] (1 minus โ„™[119863119894 = 1|119883119894 = 119909])= 119890(119909) (1 minus 119890(119909))

bull Maximum variance with โ„™[119863119894 = 1|119883119894 = 119909] = 12

41 58

OLS weighting examplebull Binary covariate

โ„™[119883119894 = 1] = 075 โ„™[119883119894 = 0] = 025โ„™[119863119894 = 1|119883119894 = 1] = 09 โ„™[119863119894 = 1|119883119894 = 0] = 05

1205901113569119889(1) = 009 1205901113569119889(0) = 025120591(1) = 1 120591(0) = minus1

bull Implies the ATE is 120591 = 05bull Average conditional variance 120124[1205901113569119889(119883119894)] = 013bull weights for 119883119894 = 1 are 009013 = 0692 for 119883119894 = 0

025013 = 192120591119877 = 120124[120591(119883119894)119882119894]

= 120591(1)119882(1)โ„™[119883119894 = 1] + 120591(0)119882(0)โ„™[119883119894 = 0]= 1 times 0692 times 075 + minus1 times 192 times 025= 0039

42 58

When will OLS estimate the ATE

bull When does 120591 = 120591119877bull Constant treatment effects 120591(119909) = 120591 = 120591119877bull Constant probability of treatment 119890(119909) = โ„™[119863119894 = 1|119883119894 = 119909] = 119890

Implies that the OLS weights are 1

bull Incorrect linearity assumption in 119883119894 will lead to more bias

43 58

Other ways to use regression

bull Whatrsquos the path forward Accept the bias (might be relatively small with saturated

models) Use a different regression approach

bull Let 120583119889(119909) = 120124[119884119894(119889)|119883119894 = 119909] be the CEF for the potentialoutcome under 119863119894 = 119889

bull By consistency and nuc we have 120583119889(119909) = 120124[119884119894|119863119894 = 119889119883119894 = 119909]bull Estimate a regression of 119884119894 on 119883119894 among the 119863119894 = 119889 groupbull Then 119889(119909) is just a predicted value from the regression for

119883119894 = 119909bull How can we use this

44 58

Imputation estimators

bull Impute the treated potential outcomes with 1114023119884119894(1) = 1113568(119883119894)bull Impute the control potential outcomes with 1114023119884119894(0) = 1113567(119883119894)bull Procedure

Regress 119884119894 on 119883119894 in the treated group and get predicted valuesfor all units (treated or control)

Regress 119884119894 on 119883119894 in the control group and get predicted valuesfor all units (treated or control)

Take the average difference between these predicted values

bull More mathematically look like this

120591119894119898119901 =1119873

11140121198941113568(119883119894) minus 1113567(119883119894)

bull Sometimes called an imputation estimator

45 58

Simple imputation estimator

bull Use predict() from the within-group models on the data fromthe entire sample

bull Useful trick use a model on the entire data andmodelframe() to get the right design matrix

heterogeneous effects

yhet lt- ifelse(d == 1 y + rnorm(n 0 5) y)

mod lt- lm(yhet ~ d + X)

mod1 lt- lm(yhet ~ X subset = d == 1)

mod0 lt- lm(yhet ~ X subset = d == 0)

y1imps lt- predict(mod1 modelframe(mod))

y0imps lt- predict(mod0 modelframe(mod))

mean(y1imps - y0imps)

[1] 061

46 58

Notes on imputation estimators

bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE

bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS

bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods

such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)

47 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

48 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

49 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

50 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

51 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

52 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

53 58

Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible

estimatelibrary(mgcv)

mod0 lt- gam(y ~ s(x) subset = d == 0)

summary(mod0)

Family gaussian

Link function identity

Formula

y ~ s(x)

Parametric coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) -00225 00154 -146 016

Approximate significance of smooth terms

edf Refdf F p-value

s(x) 603 708 413 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

R-sq(adj) = 0909 Deviance explained = 928

GCV = 00093204 Scale est = 00071351 n = 30

54 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

55 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

56 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

57 58

Limited dependent variables

bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc

bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models

(stratified diff-in-means)

bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong

bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models

Could go wrong in small samples If using nonlinear models always get effects on the scale of the

outcome

58 58

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 19: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

OLS estimator

bull We know the population value of 120573 is

120573 = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119884119894]

bull How do we get an estimator of thisbull Plug-in principle replace population expectation with

sample versions

=

โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1119873

1114012119894119883119894119883prime

119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

minus11135681119873

1114012119894119883119894119884119894

bull If you work through the matrix algebra this turns out to be

= (119831prime119831)minus1113568 119831prime119858

19 58

Asymptotic OLS inference

bull With this representation in hand we can write the OLSestimator as follows

= 120573 +

โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1114012

119894119883119894119883prime

119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

minus1113568

1114012119894119883119894119890119894

bull Core idea sum119894119883119894119890119894 is the sum of rvs so the CLT applies

bull That plus some simple asymptotic theory allows us to say

radic119873( minus 120573) 119873(0ฮฉ)

bull Converges in distribution to a Normal distribution with meanvector 0 and covariance matrix ฮฉ

ฮฉ = 120124[119883119894119883prime119894 ]minus1113568120124[119883119894119883prime

119894 1198901113569119894 ]120124[119883119894119883prime119894 ]minus1113568

bull No linearity assumption needed20 58

Estimating the variance

bull In large samples then

radic119873( minus 120573) sim 119873(0ฮฉ)

bull How to estimate ฮฉ Plug-in principle again

1114023ฮฉ =

โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1114012

119894119883119894119883prime

119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

minus1113568 โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1114012119894119883119894119883prime

119894 1113569119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1114012

119894119883119894119883prime

119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

minus1113568

bull Replace 119890119894 with its emprical counterpart (residuals)119894 = 119884119894 minus 119883prime

119894 bull Replace the population moments of 119883119894 with their sample

counterpartsbull The square root of the diagonals of this covariance matrix are

the ldquorobustrdquo or Huber-White standard errors that Statacommonly report

21 58

Heteroskedasticity

bull No assumptions of homoskedasticitybull Heteroskedaticity will definitely occur when

CEF is linear but the 1205901113569(119909) = 120141[119884119894|119883119894 = 119909] is not constant in 119909 119864[119884119894|119883119894] is not linear but we use the linear regression to

approxmiate it

22 58

2 Regression andCausality

23 58

Regression and causality

bull Most econometrics textbooks regression defined withoutrespect to causality

bull But then when is ldquobiasedrdquo The above derivations work forsome 120124[119884119894|119883119894]

bull The question then is when does knowing the CEF tell ussomething about causality

bull MHE argues that a regression is causal when the CEF itapproximates is causal Identification is king

bull We will show that under certain conditions a regression of theoutcome on the treatment and the covariates can recover acausal parameter but perhaps not the one in which we areinterested

24 58

Review

bull Quick reminder we have potential outcomes 119884119894(1) and 119884119894(0)and two parameters the ATE and ATT

120591 = 119864[119884119894(1) minus 119884119894(0)]120591ATT = 119864[119884119894(1) minus 119884119894(0)|119863119894 = 1]

bull We have shown in past weeks that these effects are identifiedwhen ignorability holds MHE calls this the conditionalindependence assumption (CIA)

25 58

Linear constant effects model binarytreatment

bull Experiment with a simple experiment we can rewrite theconsistency assumption to be a regression formula

119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894

= 120124[119884119894(0)] + 120591119863119894 + (119884119894(0) minus 120124[119884119894(0)])= 1205831113567 + 120591119863119894 + 1199071113567119894

bull Note that if ignorability holds (as in an experiment) for 119884119894(0)then it will also hold for 1199071113567119894 since 120124[119884119894(0)] is constant Thusthis satifies the usual assumptions for regression

26 58

Now with covariates

bull Now assume no unmeasured confounders 119884119894(119889) โŸ‚โŸ‚ 119863119894|119883119894bull We will assume a linear model for the potential outcomes

119884119894(119889) = 120572 + 120591 sdot 119889 + 120578119894

bull Remember that linearity isnrsquot an assumption if 119863119894 is binarybull Effect of 119863119894 is constant here the 120578119894 are the only source of

individual variation and we have 119864[120578119894] = 0bull Consistency assumption allows us to write this as

119884119894 = 120572 + 120591119863119894 + 120578119894

27 58

Covariates in the error

bull Letrsquos assume that 120578119894 is linear in 119883119894 120578119894 = 119883prime119894 120574 + 120584119894

bull New error is uncorrelated with 119883119894 120124[120584119894|119883119894] = 0bull This is an assumption Might be falsebull Plug into the above

120124[119884119894(119889)|119883119894] = 119864[119884119894|119863119894 119883119894] = 120572 + 120591119863119894 + 119864[120578119894|119883119894]= 120572 + 120591119863119894 + 119883prime

119894 120574 + 119864[120584119894|119883119894]= 120572 + 120591119863119894 + 119883prime

119894 120574

28 58

Summing up regression with constanteffects

bull Reviewing the assumptions wersquove used no unmeasured confounders constant treatment effects linearity of the treatmentcovariates

bull Under these we can run the following regression to estimatethe ATE 120591

119884119894 = 120572 + 120591119863119894 + 119883prime119894 120574 + 120584119894

bull Works with continuous or ordinal 119863119894 if linearity in the effect ofthese variables is truly linear

29 58

OLS constant effects simulation

Model with linear covariates constant 0 effect of treatment

library(mvtnorm)

n lt- 100

p lt- 4

X lt- rmvnorm(n = 100 mean = rep(0 p))

gamma lt- c(274 137 137 137)

y lt- 210 + X c(gamma) + rnorm(n)

alpha lt- c(-1 05 -05 -01)

dprobs lt- bootinvlogit(X alpha)

d lt- rbinom(n size = 1 prob = dprobs)

30 58

OLS with no covariates

summary(lm(y ~ d))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 22036 458 4813 lt 2e-16

d -2869 654 -439 0000029

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 33 on 98 degrees of freedom

Multiple R-squared 0164 Adjusted R-squared 0156

F-statistic 192 on 1 and 98 DF p-value 0000029

31 58

OLS with covariates

summary(lm(y ~ d + X))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  20.9952     0.159  132.07   <2e-16
d            -0.128      0.253   -0.50     0.62
X1            2.7368     0.126   21.75   <2e-16
X2            1.3677     0.114   12.01   <2e-16
X3            1.3673     0.130   10.51   <2e-16
X4            1.3570     0.106   12.80   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom
Multiple R-squared: 0.999, Adjusted R-squared: 0.999
F-statistic: 2.44e+04 on 5 and 94 DF, p-value: <2e-16

32 58

What happens with nonlinearity?

Suppose we can only observe the following covariates:

z1 <- exp(X[, 1]/2)
z2 <- X[, 2]/(1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3]/25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

Implies that Y_i and D_i are functions of log(Z_i1), Z_i2, Z_i1² Z_i2, 1/log(Z_i1), Z_i3 log(Z_i1), and Z_i4^(1/2).

Regression is a nonlinear function of the observed covariates.

33 58

When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.18728    3.00799    0.73    0.469
d           -0.69292    0.33854   -2.05    0.043
z1           3.63110    0.26970   13.46   <2e-16
z2          -0.29033    0.36619   -0.79    0.430
z3           8.62022    4.31030    2.00    0.048
z4           0.04021    0.00329   12.23   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.4 on 94 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF, p-value: <2e-16

34 58

3. Regression with Heterogeneous Treatment Effects

35 58

Heterogeneous effects, binary treatment

• Completely randomized experiment:

Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
    = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
    = μ_0 + τ_i D_i + (Y_i(0) − μ_0)
    = μ_0 + τD_i + (Y_i(0) − μ_0) + (τ_i − τ)·D_i
    = μ_0 + τD_i + ε_i

• Error term now includes two components:
  1. "Baseline" variation in the outcome: (Y_i(0) − μ_0)
  2. Variation in the treatment effect: (τ_i − τ)
• Easy to verify that under an experiment, E[ε_i | D_i] = 0.
• Thus, OLS estimates the ATE with no covariates.

36 58
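The claim on this slide is easy to see by simulation: with heterogeneous unit-level effects τ_i, random assignment still makes the difference in means track the ATE. A quick Python sketch (my own illustration, not course code; the deck itself uses R):

```python
import numpy as np

rng = np.random.default_rng(42)
N = 100_000
y0 = rng.normal(0, 1, size=N)          # "baseline" variation Y_i(0)
tau_i = rng.normal(2, 3, size=N)       # heterogeneous unit-level effects, ATE = 2
y1 = y0 + tau_i
d = rng.binomial(1, 0.5, size=N)       # completely randomized treatment
y = np.where(d == 1, y1, y0)           # observed outcome

ate = tau_i.mean()                     # (sample) average treatment effect
diff_means = y[d == 1].mean() - y[d == 0].mean()
print(ate, diff_means)
```

Both error components are mean-independent of D_i by randomization, so the two numbers agree up to sampling noise.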

Adding covariates

• What happens with no unmeasured confounders? Need to condition on X_i now.
• Remember identification of the ATE/ATT using iterated expectations.
• ATE is the weighted sum of CATEs:

τ = Σ_x τ(x) Pr[X_i = x]

• ATE/ATT are weighted averages of CATEs.
• What about the regression estimand, τ_R? How does it relate to the ATE/ATT?

37 58

Heterogeneous effects and regression

• Let's investigate this under a saturated regression model:

Y_i = Σ_x B_xi α_x + τ_R D_i + e_i

• Use a dummy variable for each unique combination of X_i: B_xi = 1(X_i = x)
• Linear in X_i by construction

38 58

Investigating the regression coefficient

• How can we investigate τ_R? Well, we can rely on the regression anatomy:

τ_R = Cov(Y_i, D_i − E[D_i | X_i]) / Var(D_i − E[D_i | X_i])

• D_i − E[D_i | X_i] is the residual from a regression of D_i on the full set of dummies.
• With a little work, we can show:

τ_R = E[τ(X_i)(D_i − E[D_i | X_i])²] / E[(D_i − E[D_i | X_i])²]
    = E[τ(X_i)σ²_d(X_i)] / E[σ²_d(X_i)]

• σ²_d(x) = Var[D_i | X_i = x] is the conditional variance of treatment assignment.

39 58
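This decomposition also holds exactly in-sample: with a saturated model, the OLS coefficient on D_i equals the treatment-variance-weighted average of the within-stratum differences in means. A Python check (my own simulation, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 2000
x = rng.integers(0, 2, size=N)              # binary covariate: two strata
e_x = np.where(x == 1, 0.9, 0.5)            # propensity differs by stratum
d = rng.binomial(1, e_x)
tau_x = np.where(x == 1, 1.0, -1.0)         # heterogeneous CATEs
y = 0.5 * x + tau_x * d + rng.normal(size=N)

# saturated OLS: intercept, stratum dummy, treatment
Z = np.column_stack([np.ones(N), x, d])
tau_r_ols = np.linalg.lstsq(Z, y, rcond=None)[0][2]

# variance-weighted average of within-stratum differences in means
num = den = 0.0
for s in (0, 1):
    m = x == s
    cate = y[m & (d == 1)].mean() - y[m & (d == 0)].mean()
    w = m.mean() * d[m].var()               # Pr[X = s] * sample var of D in stratum
    num += cate * w
    den += w
print(tau_r_ols, num / den)
```

The two quantities match to floating-point precision, which is the in-sample version of the identity above.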

ATE versus OLS

τ_R = E[τ(X_i)W_i] = Σ_x τ(x) (σ²_d(x) / E[σ²_d(X_i)]) Pr[X_i = x]

• Compare to the ATE:

τ = E[τ(X_i)] = Σ_x τ(x) Pr[X_i = x]

• Both weight strata relative to their size, Pr[X_i = x].
• OLS weights strata higher if the treatment variance in those strata (σ²_d(x)) is higher relative to the average variance across strata (E[σ²_d(X_i)]).
• The ATE weights strata only by their size.

40 58

Regression weighting

W_i = σ²_d(X_i) / E[σ²_d(X_i)]

• Why does OLS weight like this?
• OLS is a minimum-variance estimator: more weight to more precise within-strata estimates.
• Within-strata estimates are most precise when the treatment is evenly spread and thus has the highest variance.
• If D_i is binary, then we know the conditional variance will be:

σ²_d(x) = Pr[D_i = 1 | X_i = x] (1 − Pr[D_i = 1 | X_i = x])
        = e(x)(1 − e(x))

• Maximum variance with Pr[D_i = 1 | X_i = x] = 1/2.

41 58

OLS weighting example
• Binary covariate:

Pr[X_i = 1] = 0.75    Pr[X_i = 0] = 0.25
Pr[D_i = 1 | X_i = 1] = 0.9    Pr[D_i = 1 | X_i = 0] = 0.5
σ²_d(1) = 0.09    σ²_d(0) = 0.25
τ(1) = 1    τ(0) = −1

• Implies the ATE is τ = 0.5.
• Average conditional variance: E[σ²_d(X_i)] = 0.13.
• Weights: for X_i = 1, 0.09/0.13 = 0.692; for X_i = 0, 0.25/0.13 = 1.92.

τ_R = E[τ(X_i)W_i]
    = τ(1)W(1)Pr[X_i = 1] + τ(0)W(0)Pr[X_i = 0]
    = 1 × 0.692 × 0.75 + −1 × 1.92 × 0.25
    = 0.039

42 58
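The arithmetic above can be verified directly. A small Python check of the slide's numbers (mine, not course code); note the exact value is 0.0385, which rounds to the slide's 0.039 once the weights are rounded:

```python
# Strata sizes, propensities, and CATEs from the slide
p_x = {1: 0.75, 0: 0.25}        # Pr[X_i = x]
e_x = {1: 0.9, 0: 0.5}          # Pr[D_i = 1 | X_i = x]
tau_x = {1: 1.0, 0: -1.0}       # conditional ATEs

var_d = {x: e * (1 - e) for x, e in e_x.items()}    # sigma^2_d(x) = e(x)(1 - e(x))
avg_var = sum(var_d[x] * p_x[x] for x in p_x)       # E[sigma^2_d(X_i)] = 0.13

ate = sum(tau_x[x] * p_x[x] for x in p_x)           # size-weighted: 0.5
tau_r = sum(tau_x[x] * (var_d[x] / avg_var) * p_x[x] for x in p_x)
print(ate, tau_r)
```

The variance weighting drags the estimand from 0.5 nearly to zero, because the stratum with the negative effect has far more treatment variance.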

When will OLS estimate the ATE?

• When does τ = τ_R?
• Constant treatment effects: τ(x) = τ = τ_R.
• Constant probability of treatment: e(x) = Pr[D_i = 1 | X_i = x] = e. Implies that the OLS weights are 1.
• An incorrect linearity assumption in X_i will lead to more bias.

43 58

Other ways to use regression

• What's the path forward?
  - Accept the bias (might be relatively small with saturated models)
  - Use a different regression approach
• Let μ_d(x) = E[Y_i(d) | X_i = x] be the CEF for the potential outcome under D_i = d.
• By consistency and no unmeasured confounders, we have μ_d(x) = E[Y_i | D_i = d, X_i = x].
• Estimate a regression of Y_i on X_i among the D_i = d group.
• Then μ̂_d(x) is just a predicted value from the regression for X_i = x.
• How can we use this?

44 58

Imputation estimators

• Impute the treated potential outcomes with Ŷ_i(1) = μ̂_1(X_i).
• Impute the control potential outcomes with Ŷ_i(0) = μ̂_0(X_i).
• Procedure:
  - Regress Y_i on X_i in the treated group and get predicted values for all units (treated or control).
  - Regress Y_i on X_i in the control group and get predicted values for all units (treated or control).
  - Take the average difference between these predicted values.
• More mathematically, it looks like this:

τ̂_imp = (1/N) Σ_i [μ̂_1(X_i) − μ̂_0(X_i)]

• Sometimes called an imputation estimator.

45 58

Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample.
• Useful trick: use a model on the entire data and model.frame() to get the right design matrix.

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61

46 58
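The same procedure is easy to express outside R. A hedged Python/numpy translation (my own sketch with made-up simulated data, not the course's code), fitting separate linear predictors in each treatment group and averaging the difference of predictions over the full sample:

```python
import numpy as np

rng = np.random.default_rng(7)
N = 1000
x = rng.normal(size=N)
d = rng.binomial(1, 0.5, size=N)
tau_i = 1.0 + 0.5 * x                 # effect varies with x; ATE = 1.0
y = 2.0 + x + tau_i * d + rng.normal(0, 0.5, size=N)

def fit_predict(xs, ys, x_all):
    # within-group linear regression, predictions for every unit
    Z = np.column_stack([np.ones(len(xs)), xs])
    beta = np.linalg.lstsq(Z, ys, rcond=None)[0]
    return np.column_stack([np.ones(len(x_all)), x_all]) @ beta

mu1 = fit_predict(x[d == 1], y[d == 1], x)   # imputed Y_i(1) for all units
mu0 = fit_predict(x[d == 0], y[d == 0], x)   # imputed Y_i(0) for all units
tau_imp = (mu1 - mu0).mean()
print(tau_imp)
```

Because the within-group CEFs really are linear here, the estimate lands near the true ATE of 1.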

Notes on imputation estimators

• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE.
• Why don't people use this?
  - Most people don't know the results we've been talking about.
  - Harder to implement than vanilla OLS.
• Can use linear regression to estimate μ̂_d(x) = x'β̂_d.
• Recent trend is to estimate μ̂_d(x) via non-parametric methods such as:
  - kernel regression, local linear regression, regression trees, etc.
  - Easiest is generalized additive models (GAMs).

47 58

Imputation estimator visualization

[Figure: scatterplot of y against x, treated and control units]

48 58

Imputation estimator visualization

[Figure: scatterplot of y against x, treated and control units]

49 58

Imputation estimator visualization

[Figure: scatterplot of y against x, treated and control units]

50 58

Nonlinear relationships
• Same idea, but with a nonlinear relationship between Y_i and X_i:

[Figure: scatterplot of y against x, treated and control units]

51 58

Nonlinear relationships
• Same idea, but with a nonlinear relationship between Y_i and X_i:

[Figure: scatterplot of y against x, treated and control units]

52 58

Nonlinear relationships
• Same idea, but with a nonlinear relationship between Y_i and X_i:

[Figure: scatterplot of y against x, treated and control units]

53 58

Using semiparametric regression
• Here, the CEFs are nonlinear, but we don't know their form.
• We can use GAMs from the mgcv package for flexible estimates:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204  Scale est. = 0.0071351  n = 30

54 58

Using GAMs

[Figure: GAM fits of y against x, treated and control units]

55 58

Using GAMs

[Figure: GAM fits of y against x, treated and control units]

56 58

Using GAMs

[Figure: GAM fits of y against x, treated and control units]

57 58

Limited dependent variables

• Usual advice: model the data from first principles: logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is a-ok with limited DVs when:
  - binary treatment and no covariates (just diff-in-means)
  - binary treatment, discrete covariates, and saturated models (stratified diff-in-means)
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong.
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models.
  - Could go wrong in small samples.
  - If using nonlinear models, always get effects on the scale of the outcome.

58 58
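The first "a-ok" case is a pure algebraic identity: with a single binary regressor, the OLS slope is exactly the difference in group means, whatever the outcome's distribution. A small Python check (mine, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 400
d = rng.binomial(1, 0.4, size=N)
y = rng.binomial(1, np.where(d == 1, 0.7, 0.3))   # binary (limited) outcome

# OLS of y on an intercept and d
Z = np.column_stack([np.ones(N), d])
slope = np.linalg.lstsq(Z, y, rcond=None)[0][1]

diff_means = y[d == 1].mean() - y[d == 0].mean()
print(slope, diff_means)
```

The two numbers agree to machine precision, which is why no distributional model is needed in this case.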

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 20: Gov 2002: 7. Regression and Causality

Asymptotic OLS inference

• With this representation in hand, we can write the OLS estimator as follows:

β̂ = β + (Σ_i X_i X_i')⁻¹ Σ_i X_i e_i

• Core idea: Σ_i X_i e_i is a sum of random variables, so the CLT applies.
• That, plus some simple asymptotic theory, allows us to say:

√N(β̂ − β) →d N(0, Ω)

• Converges in distribution to a Normal distribution with mean vector 0 and covariance matrix Ω:

Ω = E[X_i X_i']⁻¹ E[X_i X_i' e_i²] E[X_i X_i']⁻¹

• No linearity assumption needed.

20 58

Estimating the variance

• In large samples, then:

√N(β̂ − β) ~ N(0, Ω)

• How to estimate Ω? Plug-in principle again:

Ω̂ = (Σ_i X_i X_i')⁻¹ (Σ_i X_i X_i' ê_i²) (Σ_i X_i X_i')⁻¹

• Replace e_i with its empirical counterpart, the residuals ê_i = Y_i − X_i'β̂.
• Replace the population moments of X_i with their sample counterparts.
• The square roots of the diagonals of this covariance matrix are the "robust" or Huber-White standard errors that Stata commonly reports.

21 58
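The sandwich formula above is straightforward to compute by hand. A minimal numpy sketch (my own illustration, not course code; the deck's examples are in R), with deliberately heteroskedastic errors:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=N)])  # intercept + one regressor
e = rng.normal(size=N) * (1 + np.abs(X[:, 1]))         # variance grows with |x|
y = X @ np.array([1.0, 2.0]) + e

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)       # OLS coefficients
resid = y - X @ beta_hat                           # empirical counterpart of e_i
bread = np.linalg.inv(X.T @ X)                     # (sum_i X_i X_i')^{-1}
meat = (X * resid[:, None] ** 2).T @ X             # sum_i X_i X_i' ehat_i^2
omega_hat = bread @ meat @ bread                   # sandwich covariance
robust_se = np.sqrt(np.diag(omega_hat))            # Huber-White standard errors
print(robust_se)
```

This is the HC0 variant; the flavors reported by software differ only in small-sample corrections to the residuals.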

Heteroskedasticity

• No assumption of homoskedasticity here.
• Heteroskedasticity will definitely occur when:
  - the CEF is linear, but σ²(x) = Var[Y_i | X_i = x] is not constant in x
  - E[Y_i | X_i] is not linear, but we use linear regression to approximate it

22 58

2. Regression and Causality

23 58

Regression and causality

• Most econometrics textbooks: regression defined without respect to causality.
• But then, when is β̂ "biased"? The above derivations work for some CEF, E[Y_i | X_i].
• The question, then, is when does knowing the CEF tell us something about causality?
• MHE argues that a regression is causal when the CEF it approximates is causal. Identification is king.
• We will show that under certain conditions, a regression of the outcome on the treatment and the covariates can recover a causal parameter, but perhaps not the one in which we are interested.

24 58

Review

• Quick reminder: we have potential outcomes Y_i(1) and Y_i(0) and two parameters, the ATE and the ATT:

τ = E[Y_i(1) − Y_i(0)]
τ_ATT = E[Y_i(1) − Y_i(0) | D_i = 1]

• We have shown in past weeks that these effects are identified when ignorability holds. MHE calls this the conditional independence assumption (CIA).

25 58

Linear constant effects model, binary treatment

• Experiment: with a simple experiment, we can rewrite the consistency assumption to be a regression formula:

Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
    = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
    = E[Y_i(0)] + τD_i + (Y_i(0) − E[Y_i(0)])
    = μ_0 + τD_i + v_0i

• Note that if ignorability holds (as in an experiment) for Y_i(0), then it will also hold for v_0i, since E[Y_i(0)] is constant. Thus, this satisfies the usual assumptions for regression.

26 58

Now with covariates

• Now assume no unmeasured confounders: Y_i(d) ⊥⊥ D_i | X_i.
• We will assume a linear model for the potential outcomes:

Y_i(d) = α + τ·d + η_i

• Remember that linearity isn't an assumption if D_i is binary.
• Effect of D_i is constant here; the η_i are the only source of individual variation, and we have E[η_i] = 0.
• The consistency assumption allows us to write this as:

Y_i = α + τD_i + η_i

27 58

Covariates in the error

bull Letrsquos assume that 120578119894 is linear in 119883119894 120578119894 = 119883prime119894 120574 + 120584119894

bull New error is uncorrelated with 119883119894 120124[120584119894|119883119894] = 0bull This is an assumption Might be falsebull Plug into the above

120124[119884119894(119889)|119883119894] = 119864[119884119894|119863119894 119883119894] = 120572 + 120591119863119894 + 119864[120578119894|119883119894]= 120572 + 120591119863119894 + 119883prime

119894 120574 + 119864[120584119894|119883119894]= 120572 + 120591119863119894 + 119883prime

119894 120574

28 58

Summing up regression with constanteffects

bull Reviewing the assumptions wersquove used no unmeasured confounders constant treatment effects linearity of the treatmentcovariates

bull Under these we can run the following regression to estimatethe ATE 120591

119884119894 = 120572 + 120591119863119894 + 119883prime119894 120574 + 120584119894

bull Works with continuous or ordinal 119863119894 if linearity in the effect ofthese variables is truly linear

29 58

OLS constant effects simulation

Model with linear covariates constant 0 effect of treatment

library(mvtnorm)

n lt- 100

p lt- 4

X lt- rmvnorm(n = 100 mean = rep(0 p))

gamma lt- c(274 137 137 137)

y lt- 210 + X c(gamma) + rnorm(n)

alpha lt- c(-1 05 -05 -01)

dprobs lt- bootinvlogit(X alpha)

d lt- rbinom(n size = 1 prob = dprobs)

30 58

OLS with no covariates

summary(lm(y ~ d))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 22036 458 4813 lt 2e-16

d -2869 654 -439 0000029

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 33 on 98 degrees of freedom

Multiple R-squared 0164 Adjusted R-squared 0156

F-statistic 192 on 1 and 98 DF p-value 0000029

31 58

OLS with covariates

summary(lm(y ~ d + X))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 209952 0159 13207 lt2e-16

d -0128 0253 -05 062

X1 27368 0126 2175 lt2e-16

X2 13677 0114 1201 lt2e-16

X3 13673 0130 1051 lt2e-16

X4 13570 0106 1280 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 1 on 94 degrees of freedom

Multiple R-squared 0999 Adjusted R-squared 0999

F-statistic 244e+04 on 5 and 94 DF p-value lt2e-16

32 58

What happens with nonlinearity

Suppose we can only observe the following covariates

z1 lt- exp(X[ 1]2)

z2 lt- X[ 2](1 + exp(X[ 1])) + 10

z3 lt- (X[ 1] X[ 3]25 + 06)^3

z4 lt- (X[ 2] + X[ 4] + 20)^2

Implies that 119884119894 and 119863119894 are functions of log(1198851198941113568) 1198851198941113569 119885111356911989411135681198851198941113569

1 log(1198851198941113568) 1198851198941113570 log(1198851198941113568) and 119883111356811135691198941113571

Regression is a nonlinear function of the observed covariates

33 58

When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 218728 300799 073 0469

d -69292 33854 -205 0043

z1 363110 26970 1346 lt2e-16

z2 -29033 36619 -079 0430

z3 862022 431030 200 0048

z4 04021 00329 1223 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 14 on 94 degrees of freedom

Multiple R-squared 0855 Adjusted R-squared 0847

F-statistic 111 on 5 and 94 DF p-value lt2e-16

34 58

3 Regression withHeterogeneousTreatment Effects

35 58

Heterogeneous effects binary treatment

bull Completely randomized experiment

119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894

= 1205831113567 + 120591119894119863119894 + (119884119894(0) minus 1205831113567)= 1205831113567 + 120591119863119894 + (119884119894(0) minus 1205831113567) + (120591119894 minus 120591) sdot 119863119894

= 1205831113567 + 120591119863119894 + 120576119894

bull Error term now includes two components1 ldquoBaselinerdquo variation in the outcome (119884119894(0) minus 1205831113567)2 Variation in the treatment effect (120591119894 minus 120591)

bull Easy to verify that under experiment 120124[120576119894|119863119894] = 0bull Thus OLS estimates the ATE with no covariates

36 58

Adding covariates

bull What happens with no unmeasured confounders Need tocondition on 119883119894 now

bull Remember identification of the ATEATT using iteratedexpectations

bull ATE is the weighted sum of CATEs

120591 = 1114012119909120591(119909) Pr[119883119894 = 119909]

bull ATEATT are weighted averages of CATEsbull What about the regression estimand 120591119877 How does it related

to the ATEATT

37 58

Heterogeneous effects and regression

bull Letrsquos investigate this under a saturated regression model

119884119894 = 1114012119909119861119909119894120572119909 + 120591119877119863119894 + 119890119894

bull Use a dummy variable for each unique combination of 119883119894119861119909119894 = 120128(119883119894 = 119909)

bull Linear in 119883119894 by construction

38 58

Investigating the regression coefficient

bull How can we investigate 120591119877 Well we can rely on theregression anatomy

120591119877 = Cov(119884119894 119863119894 minus 119864[119863119894|119883119894])120141(119863119894 minus 119864[119863119894|119883119894])

bull 119863119894 minus 120124[119863119894|119883119894] is the residual from a regression of 119863119894 on the fullset of dummies

bull With a little work we can show

120591119877 =120124 1114094120591(119883119894)(119863119894 minus 120124[119863119894|119883119894])11135691114097

120124[(119863119894 minus 119864[119863119894|119883119894])1113569]=

120124[120591(119883119894)1205901113569119889(119883119894)]120124[1205901113569119889(119883119894)]

bull 1205901113569119889(119909) = 120141[119863119894|119883119894 = 119909] is the conditional variance of treatmentassignment

39 58

ATE versus OLS

120591119877 = 120124[120591(119883119894)119882119894] = 1114012119909120591(119909)

1205901113569119889(119909)120124[1205901113569119889(119883119894)]

โ„™[119883119894 = 119909]

bull Compare to the ATE

120591 = 120124[120591(119883119894)] = 1114012119909120591(119909)โ„™[119883119894 = 119909]

bull Both weight strata relative to their size (โ„™[119883119894 = 119909])bull OLS weights strata higher if the treatment variance in those

strata (1205901113569119889(119909)) is higher in those strata relative to the averagevariance across strata (120124[1205901113569119889(119883119894)])

bull The ATE weights only by their size

40 58

Regression weighting

119882119894 =1205901113569119889(119883119894)

120124[1205901113569119889(119883119894)]

bull Why does OLS weight like thisbull OLS is a minimum-variance estimator more weight to

more precise within-strata estimatesbull Within-strata estimates are most precise when the treatment

is evenly spread and thus has the highest variancebull If 119863119894 is binary then we know the conditional variance will be

1205901113569119889(119909) = โ„™[119863119894 = 1|119883119894 = 119909] (1 minus โ„™[119863119894 = 1|119883119894 = 119909])= 119890(119909) (1 minus 119890(119909))

bull Maximum variance with โ„™[119863119894 = 1|119883119894 = 119909] = 12

41 58

OLS weighting examplebull Binary covariate

โ„™[119883119894 = 1] = 075 โ„™[119883119894 = 0] = 025โ„™[119863119894 = 1|119883119894 = 1] = 09 โ„™[119863119894 = 1|119883119894 = 0] = 05

1205901113569119889(1) = 009 1205901113569119889(0) = 025120591(1) = 1 120591(0) = minus1

bull Implies the ATE is 120591 = 05bull Average conditional variance 120124[1205901113569119889(119883119894)] = 013bull weights for 119883119894 = 1 are 009013 = 0692 for 119883119894 = 0

025013 = 192120591119877 = 120124[120591(119883119894)119882119894]

= 120591(1)119882(1)โ„™[119883119894 = 1] + 120591(0)119882(0)โ„™[119883119894 = 0]= 1 times 0692 times 075 + minus1 times 192 times 025= 0039

42 58

When will OLS estimate the ATE

bull When does 120591 = 120591119877bull Constant treatment effects 120591(119909) = 120591 = 120591119877bull Constant probability of treatment 119890(119909) = โ„™[119863119894 = 1|119883119894 = 119909] = 119890

Implies that the OLS weights are 1

bull Incorrect linearity assumption in 119883119894 will lead to more bias

43 58

Other ways to use regression

bull Whatrsquos the path forward Accept the bias (might be relatively small with saturated

models) Use a different regression approach

bull Let 120583119889(119909) = 120124[119884119894(119889)|119883119894 = 119909] be the CEF for the potentialoutcome under 119863119894 = 119889

bull By consistency and nuc we have 120583119889(119909) = 120124[119884119894|119863119894 = 119889119883119894 = 119909]bull Estimate a regression of 119884119894 on 119883119894 among the 119863119894 = 119889 groupbull Then 119889(119909) is just a predicted value from the regression for

119883119894 = 119909bull How can we use this

44 58

Imputation estimators

bull Impute the treated potential outcomes with 1114023119884119894(1) = 1113568(119883119894)bull Impute the control potential outcomes with 1114023119884119894(0) = 1113567(119883119894)bull Procedure

Regress 119884119894 on 119883119894 in the treated group and get predicted valuesfor all units (treated or control)

Regress 119884119894 on 119883119894 in the control group and get predicted valuesfor all units (treated or control)

Take the average difference between these predicted values

bull More mathematically look like this

120591119894119898119901 =1119873

11140121198941113568(119883119894) minus 1113567(119883119894)

bull Sometimes called an imputation estimator

45 58

Simple imputation estimator

bull Use predict() from the within-group models on the data fromthe entire sample

bull Useful trick use a model on the entire data andmodelframe() to get the right design matrix

heterogeneous effects

yhet lt- ifelse(d == 1 y + rnorm(n 0 5) y)

mod lt- lm(yhet ~ d + X)

mod1 lt- lm(yhet ~ X subset = d == 1)

mod0 lt- lm(yhet ~ X subset = d == 0)

y1imps lt- predict(mod1 modelframe(mod))

y0imps lt- predict(mod0 modelframe(mod))

mean(y1imps - y0imps)

[1] 061

46 58

Notes on imputation estimators

bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE

bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS

bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods

such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)

47 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

48 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

49 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

50 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

51 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

52 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

53 58

Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible

estimatelibrary(mgcv)

mod0 lt- gam(y ~ s(x) subset = d == 0)

summary(mod0)

Family gaussian

Link function identity

Formula

y ~ s(x)

Parametric coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) -00225 00154 -146 016

Approximate significance of smooth terms

edf Refdf F p-value

s(x) 603 708 413 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

R-sq(adj) = 0909 Deviance explained = 928

GCV = 00093204 Scale est = 00071351 n = 30

54 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

55 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

56 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

57 58

Limited dependent variables

bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc

bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models

(stratified diff-in-means)

bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong

bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models

Could go wrong in small samples If using nonlinear models always get effects on the scale of the

outcome

58 58

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 21: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

Estimating the variance

bull In large samples then

radic119873( minus 120573) sim 119873(0ฮฉ)

bull How to estimate ฮฉ Plug-in principle again

1114023ฮฉ =

โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1114012

119894119883119894119883prime

119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

minus1113568 โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1114012119894119883119894119883prime

119894 1113569119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

โŽกโŽขโŽขโŽขโŽขโŽขโŽฃ1114012

119894119883119894119883prime

119894

โŽคโŽฅโŽฅโŽฅโŽฅโŽฅโŽฆ

minus1113568

bull Replace 119890119894 with its emprical counterpart (residuals)119894 = 119884119894 minus 119883prime

119894 bull Replace the population moments of 119883119894 with their sample

counterpartsbull The square root of the diagonals of this covariance matrix are

the ldquorobustrdquo or Huber-White standard errors that Statacommonly report

21 58

Heteroskedasticity

bull No assumptions of homoskedasticitybull Heteroskedaticity will definitely occur when

CEF is linear but the 1205901113569(119909) = 120141[119884119894|119883119894 = 119909] is not constant in 119909 119864[119884119894|119883119894] is not linear but we use the linear regression to

approxmiate it

22 58

2 Regression andCausality

23 58

Regression and causality

bull Most econometrics textbooks regression defined withoutrespect to causality

bull But then when is ldquobiasedrdquo The above derivations work forsome 120124[119884119894|119883119894]

bull The question then is when does knowing the CEF tell ussomething about causality

bull MHE argues that a regression is causal when the CEF itapproximates is causal Identification is king

bull We will show that under certain conditions a regression of theoutcome on the treatment and the covariates can recover acausal parameter but perhaps not the one in which we areinterested

24 58

Review

bull Quick reminder we have potential outcomes 119884119894(1) and 119884119894(0)and two parameters the ATE and ATT

120591 = 119864[119884119894(1) minus 119884119894(0)]120591ATT = 119864[119884119894(1) minus 119884119894(0)|119863119894 = 1]

bull We have shown in past weeks that these effects are identifiedwhen ignorability holds MHE calls this the conditionalindependence assumption (CIA)

25 58

Linear constant effects model binarytreatment

bull Experiment with a simple experiment we can rewrite theconsistency assumption to be a regression formula

119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894

= 120124[119884119894(0)] + 120591119863119894 + (119884119894(0) minus 120124[119884119894(0)])= 1205831113567 + 120591119863119894 + 1199071113567119894

bull Note that if ignorability holds (as in an experiment) for 119884119894(0)then it will also hold for 1199071113567119894 since 120124[119884119894(0)] is constant Thusthis satifies the usual assumptions for regression

26 58

Now with covariates

• Now assume no unmeasured confounders: Y_i(d) ⊥⊥ D_i | X_i
• We will assume a linear model for the potential outcomes:

  Y_i(d) = α + τ·d + η_i

• Remember that linearity isn't an assumption if D_i is binary
• Effect of D_i is constant here; the η_i are the only source of individual variation, and we have E[η_i] = 0
• Consistency assumption allows us to write this as:

  Y_i = α + τ D_i + η_i

27/58

Covariates in the error

• Let's assume that η_i is linear in X_i: η_i = X_i′γ + ν_i
• New error is uncorrelated with X_i: E[ν_i | X_i] = 0
• This is an assumption. It might be false!
• Plug into the above:

  E[Y_i(d) | X_i] = E[Y_i | D_i, X_i] = α + τ D_i + E[η_i | X_i]
                  = α + τ D_i + X_i′γ + E[ν_i | X_i]
                  = α + τ D_i + X_i′γ

28/58

Summing up regression with constant effects

• Reviewing the assumptions we've used:
  – no unmeasured confounders
  – constant treatment effects
  – linearity of the treatment/covariates
• Under these, we can run the following regression to estimate the ATE, τ:

  Y_i = α + τ D_i + X_i′γ + ν_i

• Works with continuous or ordinal D_i if the effect of these variables is truly linear

29/58

OLS, constant effects simulation

Model with linear covariates, constant 0 effect of treatment:

library(mvtnorm)
n <- 100
p <- 4
X <- rmvnorm(n = 100, mean = rep(0, p))
gamma <- c(2.74, 1.37, 1.37, 1.37)
y <- 21.0 + X %*% c(gamma) + rnorm(n)
alpha <- c(-1, 0.5, -0.5, -0.1)
d.probs <- boot::inv.logit(X %*% alpha)
d <- rbinom(n, size = 1, prob = d.probs)

30/58

OLS with no covariates

summary(lm(y ~ d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   22.036      0.458   48.13  < 2e-16
d             -2.869      0.654   -4.39 0.000029
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.3 on 98 degrees of freedom
Multiple R-squared: 0.164, Adjusted R-squared: 0.156
F-statistic: 19.2 on 1 and 98 DF, p-value: 0.000029

31/58

OLS with covariates

summary(lm(y ~ d + X))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  20.9952      0.159  132.07   <2e-16
d            -0.128       0.253   -0.5      0.62
X1            2.7368      0.126   21.75   <2e-16
X2            1.3677      0.114   12.01   <2e-16
X3            1.3673      0.130   10.51   <2e-16
X4            1.3570      0.106   12.80   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom
Multiple R-squared: 0.999, Adjusted R-squared: 0.999
F-statistic: 2.44e+04 on 5 and 94 DF, p-value: <2e-16

32/58

What happens with nonlinearity?

Suppose we can only observe the following covariates:

z1 <- exp(X[, 1] / 2)
z2 <- X[, 2] / (1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3] / 25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

Implies that Y_i and D_i are functions of log(Z_i1), Z_i2, Z²_i1·Z_i2, 1/log(Z_i1), Z_i3^(1/3)/log(Z_i1), and Z_i4^(1/2).

The regression is a nonlinear function of the observed covariates.

33/58
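The map from the latent X's to the observed Z's can be inverted by hand, which is what makes Y_i a nonlinear function of the observed covariates. A small sketch of that inversion, in Python rather than the slides' R, and assuming X_i1 ≠ 0 and X_i2 + X_i4 + 20 > 0 (near-certain for standard normals):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 4))

# The observed (transformed) covariates from the slide
z1 = np.exp(X[:, 0] / 2)
z2 = X[:, 1] / (1 + np.exp(X[:, 0])) + 10
z3 = (X[:, 0] * X[:, 2] / 25 + 0.6) ** 3
z4 = (X[:, 1] + X[:, 3] + 20) ** 2

# Inverting the map: each latent covariate is a *nonlinear*
# function of the Z's, so a model linear in Z is misspecified.
x1 = 2 * np.log(z1)
x2 = (z2 - 10) * (1 + np.exp(x1))
x3 = 25 * (np.cbrt(z3) - 0.6) / x1   # assumes x1 != 0
x4 = np.sqrt(z4) - 20 - x2           # assumes the sum inside the square was positive
```

Each recovered column matches the original X exactly (up to floating point), confirming the list of transformations on the slide.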

When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.8728    30.0799    0.73    0.469
d            -6.9292     3.3854   -2.05    0.043
z1           36.3110     2.6970   13.46   <2e-16
z2           -2.9033     3.6619   -0.79    0.430
z3           86.2022    43.1030    2.00    0.048
z4            0.4021     0.0329   12.23   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.4 on 94 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF, p-value: <2e-16

34/58

3. Regression with Heterogeneous Treatment Effects

35/58

Heterogeneous effects, binary treatment

• Completely randomized experiment:

  Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
      = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
      = μ_0 + τ_i D_i + (Y_i(0) − μ_0)
      = μ_0 + τ D_i + (Y_i(0) − μ_0) + (τ_i − τ)·D_i
      = μ_0 + τ D_i + ε_i

• Error term now includes two components:
  1. "Baseline" variation in the outcome: (Y_i(0) − μ_0)
  2. Variation in the treatment effect: (τ_i − τ)
• Easy to verify that under an experiment, E[ε_i | D_i] = 0
• Thus, OLS estimates the ATE with no covariates

36/58
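Because E[ε_i | D_i] = 0 under randomization, the no-covariate difference in means targets the ATE even when τ_i varies across units. A minimal simulation of this claim, in Python rather than the slides' R, with an invented ATE of 2:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Heterogeneous effects: tau_i varies across units, ATE = 2
y0 = rng.normal(0, 1, n)          # Y_i(0)
tau_i = 2 + rng.normal(0, 1, n)   # Y_i(1) - Y_i(0)
y1 = y0 + tau_i

d = rng.binomial(1, 0.5, n)       # completely randomized treatment
y = np.where(d == 1, y1, y0)      # consistency: observe one potential outcome

diff_in_means = y[d == 1].mean() - y[d == 0].mean()
```

With n this large, the difference in means sits within a few thousandths of the ATE of 2, despite the unit-level effects having standard deviation 1.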

Adding covariates

• What happens with no unmeasured confounders? Need to condition on X_i now
• Remember identification of the ATE/ATT using iterated expectations
• ATE is the weighted sum of CATEs:

  τ = Σ_x τ(x) Pr[X_i = x]

• ATE/ATT are weighted averages of CATEs
• What about the regression estimand, τ_R? How does it relate to the ATE/ATT?

37/58

Heterogeneous effects and regression

• Let's investigate this under a saturated regression model:

  Y_i = Σ_x B_xi α_x + τ_R D_i + e_i

• Use a dummy variable for each unique combination of X_i: B_xi = 1(X_i = x)
• Linear in X_i by construction

38/58

Investigating the regression coefficient

• How can we investigate τ_R? Well, we can rely on the regression anatomy:

  τ_R = Cov(Y_i, D_i − E[D_i | X_i]) / V(D_i − E[D_i | X_i])

• D_i − E[D_i | X_i] is the residual from a regression of D_i on the full set of dummies
• With a little work, we can show:

  τ_R = E[τ(X_i)(D_i − E[D_i | X_i])²] / E[(D_i − E[D_i | X_i])²]
      = E[τ(X_i) σ²_d(X_i)] / E[σ²_d(X_i)]

• σ²_d(x) = V[D_i | X_i = x] is the conditional variance of treatment assignment

39/58
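This variance-weighting result also holds exactly in-sample: with a saturated model, the OLS coefficient on D_i equals the variance-weighted average of within-stratum differences in means. A quick Python check (the slides' code is in R; the setup here, one binary covariate with CATEs of 1 and 3, is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.binomial(1, 0.5, n)                   # a single binary covariate
p = np.where(x == 1, 0.8, 0.3)                # treatment probability by stratum
d = rng.binomial(1, p)
y = 1.0 + x + d * (1 + 2 * x) + rng.normal(0, 1, n)   # tau(0) = 1, tau(1) = 3

# Saturated OLS: intercept, dummy for x, and d; coefficient [2] is tau_R-hat
Z = np.column_stack([np.ones(n), x, d])
tau_R = np.linalg.lstsq(Z, y, rcond=None)[0][2]

# Variance-weighted average of within-stratum differences in means
num = den = 0.0
for s in (0, 1):
    m = x == s
    phat = d[m].mean()
    w = m.sum() * phat * (1 - phat)           # n_x times within-stratum var of D
    tau_x = y[m & (d == 1)].mean() - y[m & (d == 0)].mean()
    num += w * tau_x
    den += w
tau_weighted = num / den
```

The two numbers agree to floating-point precision, which is the sample analog of the displayed formula.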

ATE versus OLS

  τ_R = E[τ(X_i) W_i] = Σ_x τ(x) · (σ²_d(x) / E[σ²_d(X_i)]) · P[X_i = x]

• Compare to the ATE:

  τ = E[τ(X_i)] = Σ_x τ(x) P[X_i = x]

• Both weight strata relative to their size (P[X_i = x])
• OLS weights strata higher if the treatment variance in those strata (σ²_d(x)) is higher relative to the average variance across strata (E[σ²_d(X_i)])
• The ATE weights strata only by their size

40/58

Regression weighting

  W_i = σ²_d(X_i) / E[σ²_d(X_i)]

• Why does OLS weight like this?
• OLS is a minimum-variance estimator ⟹ more weight to more precise within-strata estimates
• Within-strata estimates are most precise when the treatment is evenly spread and thus has the highest variance
• If D_i is binary, then we know the conditional variance will be:

  σ²_d(x) = P[D_i = 1 | X_i = x] (1 − P[D_i = 1 | X_i = x])
          = e(x) (1 − e(x))

• Maximum variance with P[D_i = 1 | X_i = x] = 1/2

41/58

OLS weighting example

• Binary covariate:

  P[X_i = 1] = 0.75, P[X_i = 0] = 0.25
  P[D_i = 1 | X_i = 1] = 0.9, P[D_i = 1 | X_i = 0] = 0.5
  σ²_d(1) = 0.09, σ²_d(0) = 0.25
  τ(1) = 1, τ(0) = −1

• Implies the ATE is τ = 0.5
• Average conditional variance: E[σ²_d(X_i)] = 0.13
• ⟹ weights for X_i = 1 are 0.09/0.13 = 0.692; for X_i = 0, 0.25/0.13 = 1.92

  τ_R = E[τ(X_i) W_i]
      = τ(1) W(1) P[X_i = 1] + τ(0) W(0) P[X_i = 0]
      = 1 × 0.692 × 0.75 + (−1) × 1.92 × 0.25
      = 0.039

42/58
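The example's arithmetic can be checked directly; a few lines of Python reproduce the weights and both estimands:

```python
# The slide's example, verified by direct computation
p_x = {1: 0.75, 0: 0.25}      # P[X = x]
e_x = {1: 0.9, 0: 0.5}        # P[D = 1 | X = x]
tau_x = {1: 1.0, 0: -1.0}     # CATEs

var_d = {x: e_x[x] * (1 - e_x[x]) for x in (0, 1)}    # 0.09 and 0.25
avg_var = sum(var_d[x] * p_x[x] for x in (0, 1))      # E[sigma^2_d(X)] = 0.13
w = {x: var_d[x] / avg_var for x in (0, 1)}           # approx 0.692 and 1.92

tau = sum(tau_x[x] * p_x[x] for x in (0, 1))          # ATE = 0.5
# regression estimand; about 0.0385 exactly (the slide's 0.039
# comes from rounding the weights before multiplying)
tau_R = sum(tau_x[x] * w[x] * p_x[x] for x in (0, 1))
```

Even though every stratum-level effect is ±1 and the ATE is 0.5, the variance weighting drags the regression estimand nearly to zero.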

When will OLS estimate the ATE?

• When does τ = τ_R?
• Constant treatment effects: τ(x) = τ = τ_R
• Constant probability of treatment: e(x) = P[D_i = 1 | X_i = x] = e
  – Implies that the OLS weights are 1
• Incorrect linearity assumption in X_i will lead to more bias

43/58

Other ways to use regression

• What's the path forward?
  – Accept the bias (might be relatively small with saturated models)
  – Use a different regression approach
• Let μ_d(x) = E[Y_i(d) | X_i = x] be the CEF for the potential outcome under D_i = d
• By consistency and no unmeasured confounders, we have μ_d(x) = E[Y_i | D_i = d, X_i = x]
• Estimate a regression of Y_i on X_i among the D_i = d group
• Then μ̂_d(x) is just a predicted value from that regression at X_i = x
• How can we use this?

44/58

Imputation estimators

• Impute the treated potential outcomes with Ŷ_i(1) = μ̂_1(X_i)
• Impute the control potential outcomes with Ŷ_i(0) = μ̂_0(X_i)
• Procedure:
  – Regress Y_i on X_i in the treated group and get predicted values for all units (treated or control)
  – Regress Y_i on X_i in the control group and get predicted values for all units (treated or control)
  – Take the average difference between these predicted values
• More mathematically, it looks like this:

  τ̂_imp = (1/N) Σ_i [μ̂_1(X_i) − μ̂_0(X_i)]

• Sometimes called an imputation estimator

45/58
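A minimal Python sketch of the same procedure (the slides implement it in R with predict() and model.frame()); here the potential outcomes are noiseless and linear with an invented ATE of 3, so the imputation estimator recovers it exactly:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
x = rng.normal(size=n)
d = rng.binomial(1, 0.5, n)
# Noiseless linear potential outcomes: Y(0) = 1 + 2x, Y(1) = Y(0) + 3
y = 1 + 2 * x + 3 * d

def fit_line(xs, ys):
    """Least-squares intercept and slope."""
    A = np.column_stack([np.ones(len(xs)), xs])
    return np.linalg.lstsq(A, ys, rcond=None)[0]

b1 = fit_line(x[d == 1], y[d == 1])   # mu-hat_1, fit on treated only
b0 = fit_line(x[d == 0], y[d == 0])   # mu-hat_0, fit on controls only

# Impute both potential outcomes for *every* unit, then average the difference
y1_imp = b1[0] + b1[1] * x
y0_imp = b0[0] + b0[1] * x
tau_imp = np.mean(y1_imp - y0_imp)
```

The key step is that both fitted models are evaluated on the full sample, not just on the group they were estimated in.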

Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample
• Useful trick: use a model on the entire data and model.frame() to get the right design matrix

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61

46/58

Notes on imputation estimators

• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE
• Why don't people use this?
  – Most people don't know the results we've been talking about
  – Harder to implement than vanilla OLS
• Can use linear regression to estimate μ̂_d(x) = x′β̂_d
• Recent trend is to estimate μ̂_d(x) via non-parametric methods, such as:
  – Kernel regression, local linear regression, regression trees, etc.
  – Easiest is generalized additive models (GAMs)

47/58

Imputation estimator visualization

[Figure: scatterplot of y against x (x from −2 to 3, y from −0.5 to 1.5), treated and control units shown separately]

48/58

Imputation estimator visualization

[Figure: same scatterplot of y against x, treated and control units shown separately]

49/58

Imputation estimator visualization

[Figure: same scatterplot of y against x, treated and control units shown separately]

50/58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i

[Figure: scatterplot of y against x (x from −2 to 2, y from −1 to 0.5), treated and control units shown separately]

51/58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i

[Figure: same nonlinear scatterplot of y against x, treated and control units shown separately]

52/58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i

[Figure: same nonlinear scatterplot of y against x, treated and control units shown separately]

53/58

Using semiparametric regression

• Here, the CEFs are nonlinear, but we don't know their form
• We can use GAMs from the mgcv package for flexible estimates:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204   Scale est. = 0.0071351   n = 30

54/58

Using GAMs

[Figure: scatterplot of y against x (x from −2 to 2, y from −1 to 0.5) with smooth GAM fits, treated and control units shown separately]

55/58

Using GAMs

[Figure: same scatterplot with smooth GAM fits, treated and control units shown separately]

56/58

Using GAMs

[Figure: same scatterplot with smooth GAM fits, treated and control units shown separately]

57/58

Limited dependent variables

• Usual advice: model the data from first principles (logit/probit for binary outcomes, Poisson for counts, etc.)
• OLS is a-ok with limited DVs when:
  – Binary treatment and no covariates (just diff-in-means)
  – Binary treatment, discrete covariates, and saturated models (stratified diff-in-means)
• Imposing a model on LDVs in these cases imposes a distributional assumption, which could be wrong
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models
  – Could go wrong in small samples
  – If using nonlinear models, always get effects on the scale of the outcome

58/58
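The first "a-ok" case is an exact algebraic identity: with a binary treatment and no covariates, the OLS slope is the difference in means, even when the outcome is binary. A quick Python check on invented data:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
d = rng.binomial(1, 0.4, n)
y = rng.binomial(1, np.where(d == 1, 0.7, 0.4))   # a binary ("limited") outcome

# OLS of y on an intercept and d: the slope coefficient
Z = np.column_stack([np.ones(n), d])
ols_slope = np.linalg.lstsq(Z, y, rcond=None)[0][1]

# Plain difference in means between treated and control
diff_in_means = y[d == 1].mean() - y[d == 0].mean()
```

The two quantities coincide to floating-point precision, so no distributional assumption is being imposed in this case.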

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 22: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

Heteroskedasticity

bull No assumptions of homoskedasticitybull Heteroskedaticity will definitely occur when

CEF is linear but the 1205901113569(119909) = 120141[119884119894|119883119894 = 119909] is not constant in 119909 119864[119884119894|119883119894] is not linear but we use the linear regression to

approxmiate it

22 58

2 Regression andCausality

23 58

Regression and causality

bull Most econometrics textbooks regression defined withoutrespect to causality

bull But then when is ldquobiasedrdquo The above derivations work forsome 120124[119884119894|119883119894]

bull The question then is when does knowing the CEF tell ussomething about causality

bull MHE argues that a regression is causal when the CEF itapproximates is causal Identification is king

bull We will show that under certain conditions a regression of theoutcome on the treatment and the covariates can recover acausal parameter but perhaps not the one in which we areinterested

24 58

Review

bull Quick reminder we have potential outcomes 119884119894(1) and 119884119894(0)and two parameters the ATE and ATT

120591 = 119864[119884119894(1) minus 119884119894(0)]120591ATT = 119864[119884119894(1) minus 119884119894(0)|119863119894 = 1]

bull We have shown in past weeks that these effects are identifiedwhen ignorability holds MHE calls this the conditionalindependence assumption (CIA)

25 58

Linear constant effects model binarytreatment

bull Experiment with a simple experiment we can rewrite theconsistency assumption to be a regression formula

119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894

= 120124[119884119894(0)] + 120591119863119894 + (119884119894(0) minus 120124[119884119894(0)])= 1205831113567 + 120591119863119894 + 1199071113567119894

bull Note that if ignorability holds (as in an experiment) for 119884119894(0)then it will also hold for 1199071113567119894 since 120124[119884119894(0)] is constant Thusthis satifies the usual assumptions for regression

26 58

Now with covariates

bull Now assume no unmeasured confounders 119884119894(119889) โŸ‚โŸ‚ 119863119894|119883119894bull We will assume a linear model for the potential outcomes

119884119894(119889) = 120572 + 120591 sdot 119889 + 120578119894

bull Remember that linearity isnrsquot an assumption if 119863119894 is binarybull Effect of 119863119894 is constant here the 120578119894 are the only source of

individual variation and we have 119864[120578119894] = 0bull Consistency assumption allows us to write this as

119884119894 = 120572 + 120591119863119894 + 120578119894

27 58

Covariates in the error

bull Letrsquos assume that 120578119894 is linear in 119883119894 120578119894 = 119883prime119894 120574 + 120584119894

bull New error is uncorrelated with 119883119894 120124[120584119894|119883119894] = 0bull This is an assumption Might be falsebull Plug into the above

120124[119884119894(119889)|119883119894] = 119864[119884119894|119863119894 119883119894] = 120572 + 120591119863119894 + 119864[120578119894|119883119894]= 120572 + 120591119863119894 + 119883prime

119894 120574 + 119864[120584119894|119883119894]= 120572 + 120591119863119894 + 119883prime

119894 120574

28 58

Summing up regression with constanteffects

bull Reviewing the assumptions wersquove used no unmeasured confounders constant treatment effects linearity of the treatmentcovariates

bull Under these we can run the following regression to estimatethe ATE 120591

119884119894 = 120572 + 120591119863119894 + 119883prime119894 120574 + 120584119894

bull Works with continuous or ordinal 119863119894 if linearity in the effect ofthese variables is truly linear

29 58

OLS constant effects simulation

Model with linear covariates constant 0 effect of treatment

library(mvtnorm)

n lt- 100

p lt- 4

X lt- rmvnorm(n = 100 mean = rep(0 p))

gamma lt- c(274 137 137 137)

y lt- 210 + X c(gamma) + rnorm(n)

alpha lt- c(-1 05 -05 -01)

dprobs lt- bootinvlogit(X alpha)

d lt- rbinom(n size = 1 prob = dprobs)

30 58

OLS with no covariates

summary(lm(y ~ d))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 22036 458 4813 lt 2e-16

d -2869 654 -439 0000029

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 33 on 98 degrees of freedom

Multiple R-squared 0164 Adjusted R-squared 0156

F-statistic 192 on 1 and 98 DF p-value 0000029

31 58

OLS with covariates

summary(lm(y ~ d + X))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 209952 0159 13207 lt2e-16

d -0128 0253 -05 062

X1 27368 0126 2175 lt2e-16

X2 13677 0114 1201 lt2e-16

X3 13673 0130 1051 lt2e-16

X4 13570 0106 1280 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 1 on 94 degrees of freedom

Multiple R-squared 0999 Adjusted R-squared 0999

F-statistic 244e+04 on 5 and 94 DF p-value lt2e-16

32 58

What happens with nonlinearity

Suppose we can only observe the following covariates

z1 lt- exp(X[ 1]2)

z2 lt- X[ 2](1 + exp(X[ 1])) + 10

z3 lt- (X[ 1] X[ 3]25 + 06)^3

z4 lt- (X[ 2] + X[ 4] + 20)^2

Implies that 119884119894 and 119863119894 are functions of log(1198851198941113568) 1198851198941113569 119885111356911989411135681198851198941113569

1 log(1198851198941113568) 1198851198941113570 log(1198851198941113568) and 119883111356811135691198941113571

Regression is a nonlinear function of the observed covariates

33 58

When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 218728 300799 073 0469

d -69292 33854 -205 0043

z1 363110 26970 1346 lt2e-16

z2 -29033 36619 -079 0430

z3 862022 431030 200 0048

z4 04021 00329 1223 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 14 on 94 degrees of freedom

Multiple R-squared 0855 Adjusted R-squared 0847

F-statistic 111 on 5 and 94 DF p-value lt2e-16

34 58

3 Regression withHeterogeneousTreatment Effects

35 58

Heterogeneous effects binary treatment

bull Completely randomized experiment

119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894

= 1205831113567 + 120591119894119863119894 + (119884119894(0) minus 1205831113567)= 1205831113567 + 120591119863119894 + (119884119894(0) minus 1205831113567) + (120591119894 minus 120591) sdot 119863119894

= 1205831113567 + 120591119863119894 + 120576119894

bull Error term now includes two components1 ldquoBaselinerdquo variation in the outcome (119884119894(0) minus 1205831113567)2 Variation in the treatment effect (120591119894 minus 120591)

bull Easy to verify that under experiment 120124[120576119894|119863119894] = 0bull Thus OLS estimates the ATE with no covariates

36 58

Adding covariates

bull What happens with no unmeasured confounders Need tocondition on 119883119894 now

bull Remember identification of the ATEATT using iteratedexpectations

bull ATE is the weighted sum of CATEs

120591 = 1114012119909120591(119909) Pr[119883119894 = 119909]

bull ATEATT are weighted averages of CATEsbull What about the regression estimand 120591119877 How does it related

to the ATEATT

37 58

Heterogeneous effects and regression

bull Letrsquos investigate this under a saturated regression model

119884119894 = 1114012119909119861119909119894120572119909 + 120591119877119863119894 + 119890119894

bull Use a dummy variable for each unique combination of 119883119894119861119909119894 = 120128(119883119894 = 119909)

bull Linear in 119883119894 by construction

38 58

Investigating the regression coefficient

bull How can we investigate 120591119877 Well we can rely on theregression anatomy

120591119877 = Cov(119884119894 119863119894 minus 119864[119863119894|119883119894])120141(119863119894 minus 119864[119863119894|119883119894])

bull 119863119894 minus 120124[119863119894|119883119894] is the residual from a regression of 119863119894 on the fullset of dummies

bull With a little work we can show

120591119877 =120124 1114094120591(119883119894)(119863119894 minus 120124[119863119894|119883119894])11135691114097

120124[(119863119894 minus 119864[119863119894|119883119894])1113569]=

120124[120591(119883119894)1205901113569119889(119883119894)]120124[1205901113569119889(119883119894)]

bull 1205901113569119889(119909) = 120141[119863119894|119883119894 = 119909] is the conditional variance of treatmentassignment

39 58

ATE versus OLS

120591119877 = 120124[120591(119883119894)119882119894] = 1114012119909120591(119909)

1205901113569119889(119909)120124[1205901113569119889(119883119894)]

โ„™[119883119894 = 119909]

bull Compare to the ATE

120591 = 120124[120591(119883119894)] = 1114012119909120591(119909)โ„™[119883119894 = 119909]

bull Both weight strata relative to their size (โ„™[119883119894 = 119909])bull OLS weights strata higher if the treatment variance in those

strata (1205901113569119889(119909)) is higher in those strata relative to the averagevariance across strata (120124[1205901113569119889(119883119894)])

bull The ATE weights only by their size

40 58

Regression weighting

119882119894 =1205901113569119889(119883119894)

120124[1205901113569119889(119883119894)]

bull Why does OLS weight like thisbull OLS is a minimum-variance estimator more weight to

more precise within-strata estimatesbull Within-strata estimates are most precise when the treatment

is evenly spread and thus has the highest variancebull If 119863119894 is binary then we know the conditional variance will be

1205901113569119889(119909) = โ„™[119863119894 = 1|119883119894 = 119909] (1 minus โ„™[119863119894 = 1|119883119894 = 119909])= 119890(119909) (1 minus 119890(119909))

bull Maximum variance with โ„™[119863119894 = 1|119883119894 = 119909] = 12

41 58

OLS weighting examplebull Binary covariate

โ„™[119883119894 = 1] = 075 โ„™[119883119894 = 0] = 025โ„™[119863119894 = 1|119883119894 = 1] = 09 โ„™[119863119894 = 1|119883119894 = 0] = 05

1205901113569119889(1) = 009 1205901113569119889(0) = 025120591(1) = 1 120591(0) = minus1

bull Implies the ATE is 120591 = 05bull Average conditional variance 120124[1205901113569119889(119883119894)] = 013bull weights for 119883119894 = 1 are 009013 = 0692 for 119883119894 = 0

025013 = 192120591119877 = 120124[120591(119883119894)119882119894]

= 120591(1)119882(1)โ„™[119883119894 = 1] + 120591(0)119882(0)โ„™[119883119894 = 0]= 1 times 0692 times 075 + minus1 times 192 times 025= 0039

42 58

When will OLS estimate the ATE

bull When does 120591 = 120591119877bull Constant treatment effects 120591(119909) = 120591 = 120591119877bull Constant probability of treatment 119890(119909) = โ„™[119863119894 = 1|119883119894 = 119909] = 119890

Implies that the OLS weights are 1

bull Incorrect linearity assumption in 119883119894 will lead to more bias

43 58

Other ways to use regression

bull Whatrsquos the path forward Accept the bias (might be relatively small with saturated

models) Use a different regression approach

bull Let 120583119889(119909) = 120124[119884119894(119889)|119883119894 = 119909] be the CEF for the potentialoutcome under 119863119894 = 119889

bull By consistency and nuc we have 120583119889(119909) = 120124[119884119894|119863119894 = 119889119883119894 = 119909]bull Estimate a regression of 119884119894 on 119883119894 among the 119863119894 = 119889 groupbull Then 119889(119909) is just a predicted value from the regression for

119883119894 = 119909bull How can we use this

44 58

Imputation estimators

bull Impute the treated potential outcomes with 1114023119884119894(1) = 1113568(119883119894)bull Impute the control potential outcomes with 1114023119884119894(0) = 1113567(119883119894)bull Procedure

Regress 119884119894 on 119883119894 in the treated group and get predicted valuesfor all units (treated or control)

Regress 119884119894 on 119883119894 in the control group and get predicted valuesfor all units (treated or control)

Take the average difference between these predicted values

bull More mathematically look like this

120591119894119898119901 =1119873

11140121198941113568(119883119894) minus 1113567(119883119894)

bull Sometimes called an imputation estimator

45 58

Simple imputation estimator

bull Use predict() from the within-group models on the data fromthe entire sample

bull Useful trick use a model on the entire data andmodelframe() to get the right design matrix

heterogeneous effects

yhet lt- ifelse(d == 1 y + rnorm(n 0 5) y)

mod lt- lm(yhet ~ d + X)

mod1 lt- lm(yhet ~ X subset = d == 1)

mod0 lt- lm(yhet ~ X subset = d == 0)

y1imps lt- predict(mod1 modelframe(mod))

y0imps lt- predict(mod0 modelframe(mod))

mean(y1imps - y0imps)

[1] 061

46 58

Notes on imputation estimators

bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE

bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS

bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods

such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)

47 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

48 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

49 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

50 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

51 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

52 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

53 58

Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible

estimatelibrary(mgcv)

mod0 lt- gam(y ~ s(x) subset = d == 0)

summary(mod0)

Family gaussian

Link function identity

Formula

y ~ s(x)

Parametric coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) -00225 00154 -146 016

Approximate significance of smooth terms

edf Refdf F p-value

s(x) 603 708 413 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

R-sq(adj) = 0909 Deviance explained = 928

GCV = 00093204 Scale est = 00071351 n = 30

54 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

55 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

56 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

57 58

Limited dependent variables

bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc

bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models

(stratified diff-in-means)

bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong

bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models

Could go wrong in small samples If using nonlinear models always get effects on the scale of the

outcome

58 58

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 23: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

2 Regression andCausality

23 58

Regression and causality

bull Most econometrics textbooks regression defined withoutrespect to causality

bull But then when is ldquobiasedrdquo The above derivations work forsome 120124[119884119894|119883119894]

bull The question then is when does knowing the CEF tell ussomething about causality

bull MHE argues that a regression is causal when the CEF itapproximates is causal Identification is king

bull We will show that under certain conditions a regression of theoutcome on the treatment and the covariates can recover acausal parameter but perhaps not the one in which we areinterested

24 58

Review

bull Quick reminder we have potential outcomes 119884119894(1) and 119884119894(0)and two parameters the ATE and ATT

120591 = 119864[119884119894(1) minus 119884119894(0)]120591ATT = 119864[119884119894(1) minus 119884119894(0)|119863119894 = 1]

bull We have shown in past weeks that these effects are identifiedwhen ignorability holds MHE calls this the conditionalindependence assumption (CIA)

25 58

Linear constant effects model binarytreatment

bull Experiment with a simple experiment we can rewrite theconsistency assumption to be a regression formula

119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894

= 120124[119884119894(0)] + 120591119863119894 + (119884119894(0) minus 120124[119884119894(0)])= 1205831113567 + 120591119863119894 + 1199071113567119894

bull Note that if ignorability holds (as in an experiment) for 119884119894(0)then it will also hold for 1199071113567119894 since 120124[119884119894(0)] is constant Thusthis satifies the usual assumptions for regression

26 58

Now with covariates

• Now assume no unmeasured confounders: Y_i(d) ⊥⊥ D_i | X_i
• We will assume a linear model for the potential outcomes:

Y_i(d) = α + τ·d + η_i

• Remember that linearity isn't an assumption if D_i is binary
• Effect of D_i is constant here; the η_i are the only source of individual variation, and we have E[η_i] = 0
• Consistency assumption allows us to write this as:

Y_i = α + τ D_i + η_i

27 58

Covariates in the error

• Let's assume that η_i is linear in X_i: η_i = X_i′γ + ν_i
• New error is uncorrelated with X_i: E[ν_i | X_i] = 0
• This is an assumption! Might be false!
• Plug into the above:

E[Y_i(d) | X_i] = E[Y_i | D_i, X_i] = α + τ D_i + E[η_i | X_i]
                = α + τ D_i + X_i′γ + E[ν_i | X_i]
                = α + τ D_i + X_i′γ

28 58

Summing up regression with constant effects

• Reviewing the assumptions we've used: no unmeasured confounders, constant treatment effects, linearity of the treatment/covariates
• Under these, we can run the following regression to estimate the ATE, τ:

Y_i = α + τ D_i + X_i′γ + ν_i

• Works with continuous or ordinal D_i if the effect of these variables is truly linear

29 58

OLS constant effects simulation

Model with linear covariates, constant 0 effect of treatment:

library(mvtnorm)
n <- 100
p <- 4
X <- rmvnorm(n = 100, mean = rep(0, p))
gamma <- c(27.4, 13.7, 13.7, 13.7)
y <- 210 + X %*% c(gamma) + rnorm(n)
alpha <- c(-1, 0.5, -0.5, -0.1)
d.probs <- boot::inv.logit(X %*% alpha)
d <- rbinom(n, size = 1, prob = d.probs)

30 58

OLS with no covariates

summary(lm(y ~ d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   220.36       4.58   48.13  < 2e-16
d             -28.69       6.54   -4.39 0.000029
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 33 on 98 degrees of freedom
Multiple R-squared: 0.164, Adjusted R-squared: 0.156
F-statistic: 19.2 on 1 and 98 DF, p-value: 0.000029

31 58

OLS with covariates

summary(lm(y ~ d + X))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  209.952      0.159  1320.7   <2e-16
d             -0.128      0.253    -0.5     0.62
X1            27.368      0.126   217.5   <2e-16
X2            13.677      0.114   120.1   <2e-16
X3            13.673      0.130   105.1   <2e-16
X4            13.570      0.106   128.0   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom
Multiple R-squared: 0.999, Adjusted R-squared: 0.999
F-statistic: 2.44e+04 on 5 and 94 DF, p-value: <2e-16

32 58

What happens with nonlinearity?

Suppose we can only observe the following covariates:

z1 <- exp(X[, 1]/2)
z2 <- X[, 2]/(1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3]/25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

Implies that Y_i and D_i are functions of log(Z_i1), Z_i2, Z_i1² Z_i2, 1/log(Z_i1), Z_i3^(1/3)/log(Z_i1), and Z_i4^(1/2)

Regression is a nonlinear function of the observed covariates

33 58
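To see why this breaks linearity, note that the true confounders can only be recovered from the observed z's through nonlinear transformations. A minimal check of that inversion (in Python, since the slides use R; the values chosen for x are illustrative):

```python
import math

# The slide's covariate transformations, applied to one illustrative unit
# with true confounders x = (x1, x2, x3, x4).
x1, x2, x3, x4 = 0.5, -1.0, 2.0, 0.25

z1 = math.exp(x1 / 2)
z2 = x2 / (1 + math.exp(x1)) + 10
z3 = (x1 * x3 / 25 + 0.6) ** 3
z4 = (x2 + x4 + 20) ** 2

# The true confounders are nonlinear functions of the observed z's:
x1_back = 2 * math.log(z1)
x2_back = (z2 - 10) * (1 + math.exp(x1_back))
x3_back = 25 * (z3 ** (1 / 3) - 0.6) / x1_back
x4_back = math.sqrt(z4) - 20 - x2_back
print(x1_back, x2_back, x3_back, x4_back)  # recovers 0.5, -1.0, 2.0, 0.25
```

So any regression that is linear in (z1, z2, z3, z4) is misspecified for a CEF that is linear in the original x's.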

When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  218.728    300.799    0.73    0.469
d            -69.292     33.854   -2.05    0.043
z1           363.110     26.970   13.46   <2e-16
z2           -29.033     36.619   -0.79    0.430
z3           862.022    431.030    2.00    0.048
z4            0.4021     0.0329   12.23   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14 on 94 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF, p-value: <2e-16

34 58

3 Regression with Heterogeneous Treatment Effects

35 58

Heterogeneous effects, binary treatment

• Completely randomized experiment:

Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
    = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
    = μ_0 + τ_i D_i + (Y_i(0) − μ_0)
    = μ_0 + τ D_i + (Y_i(0) − μ_0) + (τ_i − τ)·D_i
    = μ_0 + τ D_i + ε_i

• Error term now includes two components:
  1. “Baseline” variation in the outcome: (Y_i(0) − μ_0)
  2. Variation in the treatment effect: (τ_i − τ)
• Easy to verify that, under an experiment, E[ε_i | D_i] = 0
• Thus, OLS estimates the ATE with no covariates

36 58
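The claim that E[ε_i | D_i] = 0 under randomization is exactly what makes the difference in means unbiased for the ATE. A small self-contained check (in Python, since the slides use R) with a made-up four-unit population, averaging over every complete-randomization assignment:

```python
from itertools import combinations
from statistics import mean

# Hypothetical potential outcomes for four units (illustrative numbers).
y0 = [1.0, 2.0, 3.0, 4.0]
y1 = [2.0, 5.0, 3.0, 6.0]
ate = mean(y1[i] - y0[i] for i in range(4))  # true ATE = 1.5

# Difference in means under every assignment treating exactly 2 of 4 units.
diffs = []
for treated in combinations(range(4), 2):
    t_mean = mean(y1[i] for i in treated)
    c_mean = mean(y0[i] for i in range(4) if i not in treated)
    diffs.append(t_mean - c_mean)

print(mean(diffs))  # 1.5: averaged over assignments, diff-in-means = ATE
```

Any single assignment can miss the ATE, but the average over the randomization distribution hits it exactly.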

Adding covariates

• What happens with no unmeasured confounders? Need to condition on X_i now
• Remember identification of the ATE/ATT using iterated expectations
• ATE is the weighted sum of CATEs:

τ = Σ_x τ(x) Pr[X_i = x]

• ATE/ATT are weighted averages of CATEs
• What about the regression estimand, τ_R? How does it relate to the ATE/ATT?

37 58

Heterogeneous effects and regression

• Let's investigate this under a saturated regression model:

Y_i = Σ_x B_xi α_x + τ_R D_i + e_i

• Use a dummy variable for each unique combination of X_i: B_xi = 1(X_i = x)
• Linear in X_i by construction

38 58

Investigating the regression coefficient

• How can we investigate τ_R? Well, we can rely on the regression anatomy:

τ_R = Cov(Y_i, D_i − E[D_i | X_i]) / V(D_i − E[D_i | X_i])

• D_i − E[D_i | X_i] is the residual from a regression of D_i on the full set of dummies
• With a little work, we can show:

τ_R = E[τ(X_i)(D_i − E[D_i | X_i])²] / E[(D_i − E[D_i | X_i])²]
    = E[τ(X_i) σ²_d(X_i)] / E[σ²_d(X_i)]

• σ²_d(x) = V[D_i | X_i = x] is the conditional variance of treatment assignment

39 58
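These three expressions can be checked numerically. A sketch on a tiny made-up population (Python here, though the slides use R): the saturated-OLS coefficient on D, the regression-anatomy ratio, and the variance-weighted average of stratum effects all coincide.

```python
import numpy as np

# Illustrative finite population (not from the slides): binary covariate x,
# stratum effects tau(0) = 2 and tau(1) = -1, baselines 1 and 5.
x = np.array([0, 0, 0, 0, 1, 1, 1, 1])
d = np.array([1, 0, 0, 0, 1, 1, 0, 0])
y = np.where(x == 0, 1.0, 5.0) + np.where(x == 0, 2.0, -1.0) * d

# (1) Saturated OLS: regress y on an intercept, d, and the x dummy.
Z = np.column_stack([np.ones_like(y), d, x])
tau_R_ols = np.linalg.lstsq(Z, y, rcond=None)[0][1]

# (2) Regression anatomy: residualize d on x, then Cov(y, dres)/V(dres).
e_x = np.array([d[x == v].mean() for v in x])  # E[D | X] within strata
dres = d - e_x
tau_R_anatomy = np.sum(y * dres) / np.sum(dres ** 2)

# (3) Variance-weighted average of the stratum effects.
taus = {0: 2.0, 1: -1.0}
var_d = {v: e * (1 - e) for v, e in ((0, 0.25), (1, 0.5))}
num = sum(taus[v] * var_d[v] * 0.5 for v in (0, 1))
den = sum(var_d[v] * 0.5 for v in (0, 1))
tau_R_formula = num / den

print(tau_R_ols, tau_R_anatomy, tau_R_formula)  # all three equal 2/7
```

Note that 2/7 ≈ 0.286 is neither stratum effect nor the ATE (which is 0.5 here): the regression estimand is its own variance-weighted quantity.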

ATE versus OLS

τ_R = E[τ(X_i) W_i] = Σ_x τ(x) (σ²_d(x) / E[σ²_d(X_i)]) ℙ[X_i = x]

• Compare to the ATE:

τ = E[τ(X_i)] = Σ_x τ(x) ℙ[X_i = x]

• Both weight strata relative to their size (ℙ[X_i = x])
• OLS weights strata higher if the treatment variance in those strata (σ²_d(x)) is high relative to the average variance across strata (E[σ²_d(X_i)])
• The ATE weights only by their size

40 58

Regression weighting

W_i = σ²_d(X_i) / E[σ²_d(X_i)]

• Why does OLS weight like this?
• OLS is a minimum-variance estimator: more weight to more precise within-strata estimates
• Within-strata estimates are most precise when the treatment is evenly spread, and thus has the highest variance
• If D_i is binary, then we know the conditional variance will be:

σ²_d(x) = ℙ[D_i = 1 | X_i = x](1 − ℙ[D_i = 1 | X_i = x]) = e(x)(1 − e(x))

• Maximum variance with ℙ[D_i = 1 | X_i = x] = 1/2

41 58
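The last point is quick to confirm numerically: over a fine grid of propensities, e(1 − e) peaks at e = 1/2 with value 1/4. An illustrative Python check:

```python
# Grid search confirming that e * (1 - e) peaks at e = 0.5 (value 0.25).
es = [i / 1000 for i in range(1001)]
curve = [e * (1 - e) for e in es]
best_e = es[curve.index(max(curve))]
print(best_e, max(curve))  # 0.5 0.25
```

So strata where treatment assignment is closest to a coin flip get the most weight in τ_R.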

OLS weighting example

• Binary covariate:

ℙ[X_i = 1] = 0.75, ℙ[X_i = 0] = 0.25
ℙ[D_i = 1 | X_i = 1] = 0.9, ℙ[D_i = 1 | X_i = 0] = 0.5
σ²_d(1) = 0.09, σ²_d(0) = 0.25
τ(1) = 1, τ(0) = −1

• Implies the ATE is τ = 0.5
• Average conditional variance: E[σ²_d(X_i)] = 0.13
• Weights: for X_i = 1, 0.09/0.13 = 0.692; for X_i = 0, 0.25/0.13 = 1.92

τ_R = E[τ(X_i) W_i]
    = τ(1) W(1) ℙ[X_i = 1] + τ(0) W(0) ℙ[X_i = 0]
    = 1 × 0.692 × 0.75 + (−1) × 1.92 × 0.25
    = 0.039

42 58
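The example's arithmetic, re-derived in Python (the slides use R). The exact value is 0.005/0.13 ≈ 0.0385; the slide's 0.039 comes from rounding the weights before multiplying.

```python
# Slide example: binary covariate, OLS weights, and the regression estimand.
p_x = {1: 0.75, 0: 0.25}   # P[X = x]
e = {1: 0.9, 0: 0.5}       # P[D = 1 | X = x]
tau = {1: 1.0, 0: -1.0}    # stratum effects

var_d = {x: e[x] * (1 - e[x]) for x in (0, 1)}       # 0.25 and 0.09
avg_var = sum(var_d[x] * p_x[x] for x in (0, 1))     # 0.13
w = {x: var_d[x] / avg_var for x in (0, 1)}          # 1.92 and 0.692

ate = sum(tau[x] * p_x[x] for x in (0, 1))           # 0.5
tau_R = sum(tau[x] * w[x] * p_x[x] for x in (0, 1))  # 0.005/0.13, about 0.0385
print(ate, round(tau_R, 4))
```

The gap between 0.5 and 0.04 shows how far the regression estimand can drift from the ATE when effects and propensities vary together across strata.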

When will OLS estimate the ATE?

• When does τ = τ_R?
• Constant treatment effects: τ(x) = τ = τ_R
• Constant probability of treatment: e(x) = ℙ[D_i = 1 | X_i = x] = e
  - implies that the OLS weights are 1
• Incorrect linearity assumption in X_i will lead to more bias

43 58

Other ways to use regression

• What's the path forward?
  - Accept the bias (might be relatively small with saturated models)
  - Use a different regression approach
• Let μ_d(x) = E[Y_i(d) | X_i = x] be the CEF for the potential outcome under D_i = d
• By consistency and no unmeasured confounders, we have μ_d(x) = E[Y_i | D_i = d, X_i = x]
• Estimate a regression of Y_i on X_i among the D_i = d group
• Then μ̂_d(x) is just a predicted value from the regression for X_i = x
• How can we use this?

44 58

Imputation estimators

• Impute the treated potential outcomes with Ŷ_i(1) = μ̂_1(X_i)
• Impute the control potential outcomes with Ŷ_i(0) = μ̂_0(X_i)
• Procedure:
  - Regress Y_i on X_i in the treated group and get predicted values for all units (treated or control)
  - Regress Y_i on X_i in the control group and get predicted values for all units (treated or control)
  - Take the average difference between these predicted values
• More mathematically, it looks like this:

τ̂_imp = (1/N) Σ_i [μ̂_1(X_i) − μ̂_0(X_i)]

• Sometimes called an imputation estimator

45 58
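A minimal sketch of this procedure (Python with numpy, paralleling the R code on the next slide). The data are made up, with a constant effect of 3, which the estimator recovers exactly when both group models are correctly specified:

```python
import numpy as np

# Deterministic toy data: Y_i(0) = 1 + 2x, constant treatment effect of 3.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
d = np.array([0, 1, 0, 1, 0, 1])
y = 1.0 + 2.0 * x + 3.0 * d

def fit_predict(xs, ys, x_all):
    """OLS of ys on (1, xs), then predict at every point in x_all."""
    Z = np.column_stack([np.ones_like(xs), xs])
    beta = np.linalg.lstsq(Z, ys, rcond=None)[0]
    return beta[0] + beta[1] * x_all

y1_hat = fit_predict(x[d == 1], y[d == 1], x)  # mu_1 fit on treated only
y0_hat = fit_predict(x[d == 0], y[d == 0], x)  # mu_0 fit on controls only
tau_imp = np.mean(y1_hat - y0_hat)
print(tau_imp)  # 3.0: the constant effect is recovered
```

Each group's model is fit only on that group, but predictions are made for every unit, exactly the "predict for all units" step in the procedure above.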

Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample
• Useful trick: use a model on the entire data and model.frame() to get the right design matrix

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61

46 58

Notes on imputation estimators

• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE
• Why don't people use this?
  - Most people don't know the results we've been talking about
  - Harder to implement than vanilla OLS
• Can use linear regression to estimate μ̂_d(x) = x′β̂_d
• Recent trend is to estimate μ̂_d(x) via non-parametric methods, such as:
  - Kernel regression, local linear regression, regression trees, etc.
  - Easiest is generalized additive models (GAMs)

47 58

Imputation estimator visualization

[Figure: scatterplot of y against x for treated and control units, with within-group fitted lines]

48 58


Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i

[Figure: scatterplot of y against x with nonlinear CEFs for treated and control units]

51 58


Using semiparametric regression

• Here, CEFs are nonlinear, but we don't know their form
• We can use GAMs from the mgcv package for flexible estimation:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909, Deviance explained = 92.8%
GCV = 0.0093204, Scale est. = 0.0071351, n = 30

54 58

Using GAMs

[Figure: GAM fits of y against x for treated and control units]

55 58


Limited dependent variables

• Usual advice: model the data from first principles: logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is a-ok with limited DVs when:
  - Binary treatment and no covariates (just diff-in-means)
  - Binary treatment, discrete covariates, and saturated models (stratified diff-in-means)
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models
  - Could go wrong in small samples
  - If using nonlinear models, always get effects on the scale of the outcome

58 58

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 24: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

Regression and causality

bull Most econometrics textbooks regression defined withoutrespect to causality

bull But then when is ldquobiasedrdquo The above derivations work forsome 120124[119884119894|119883119894]

bull The question then is when does knowing the CEF tell ussomething about causality

bull MHE argues that a regression is causal when the CEF itapproximates is causal Identification is king

bull We will show that under certain conditions a regression of theoutcome on the treatment and the covariates can recover acausal parameter but perhaps not the one in which we areinterested

24 58

Review

bull Quick reminder we have potential outcomes 119884119894(1) and 119884119894(0)and two parameters the ATE and ATT

120591 = 119864[119884119894(1) minus 119884119894(0)]120591ATT = 119864[119884119894(1) minus 119884119894(0)|119863119894 = 1]

bull We have shown in past weeks that these effects are identifiedwhen ignorability holds MHE calls this the conditionalindependence assumption (CIA)

25 58

Linear constant effects model binarytreatment

bull Experiment with a simple experiment we can rewrite theconsistency assumption to be a regression formula

119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894

= 120124[119884119894(0)] + 120591119863119894 + (119884119894(0) minus 120124[119884119894(0)])= 1205831113567 + 120591119863119894 + 1199071113567119894

bull Note that if ignorability holds (as in an experiment) for 119884119894(0)then it will also hold for 1199071113567119894 since 120124[119884119894(0)] is constant Thusthis satifies the usual assumptions for regression

26 58

Now with covariates

bull Now assume no unmeasured confounders 119884119894(119889) โŸ‚โŸ‚ 119863119894|119883119894bull We will assume a linear model for the potential outcomes

119884119894(119889) = 120572 + 120591 sdot 119889 + 120578119894

bull Remember that linearity isnrsquot an assumption if 119863119894 is binarybull Effect of 119863119894 is constant here the 120578119894 are the only source of

individual variation and we have 119864[120578119894] = 0bull Consistency assumption allows us to write this as

119884119894 = 120572 + 120591119863119894 + 120578119894

27 58

Covariates in the error

bull Letrsquos assume that 120578119894 is linear in 119883119894 120578119894 = 119883prime119894 120574 + 120584119894

bull New error is uncorrelated with 119883119894 120124[120584119894|119883119894] = 0bull This is an assumption Might be falsebull Plug into the above

120124[119884119894(119889)|119883119894] = 119864[119884119894|119863119894 119883119894] = 120572 + 120591119863119894 + 119864[120578119894|119883119894]= 120572 + 120591119863119894 + 119883prime

119894 120574 + 119864[120584119894|119883119894]= 120572 + 120591119863119894 + 119883prime

119894 120574

28 58

Summing up regression with constanteffects

bull Reviewing the assumptions wersquove used no unmeasured confounders constant treatment effects linearity of the treatmentcovariates

bull Under these we can run the following regression to estimatethe ATE 120591

119884119894 = 120572 + 120591119863119894 + 119883prime119894 120574 + 120584119894

bull Works with continuous or ordinal 119863119894 if linearity in the effect ofthese variables is truly linear

29 58

OLS constant effects simulation

Model with linear covariates constant 0 effect of treatment

library(mvtnorm)

n lt- 100

p lt- 4

X lt- rmvnorm(n = 100 mean = rep(0 p))

gamma lt- c(274 137 137 137)

y lt- 210 + X c(gamma) + rnorm(n)

alpha lt- c(-1 05 -05 -01)

dprobs lt- bootinvlogit(X alpha)

d lt- rbinom(n size = 1 prob = dprobs)

30 58

OLS with no covariates

summary(lm(y ~ d))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 22036 458 4813 lt 2e-16

d -2869 654 -439 0000029

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 33 on 98 degrees of freedom

Multiple R-squared 0164 Adjusted R-squared 0156

F-statistic 192 on 1 and 98 DF p-value 0000029

31 58

OLS with covariates

summary(lm(y ~ d + X))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 209952 0159 13207 lt2e-16

d -0128 0253 -05 062

X1 27368 0126 2175 lt2e-16

X2 13677 0114 1201 lt2e-16

X3 13673 0130 1051 lt2e-16

X4 13570 0106 1280 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 1 on 94 degrees of freedom

Multiple R-squared 0999 Adjusted R-squared 0999

F-statistic 244e+04 on 5 and 94 DF p-value lt2e-16

32 58

What happens with nonlinearity

Suppose we can only observe the following covariates

z1 lt- exp(X[ 1]2)

z2 lt- X[ 2](1 + exp(X[ 1])) + 10

z3 lt- (X[ 1] X[ 3]25 + 06)^3

z4 lt- (X[ 2] + X[ 4] + 20)^2

Implies that 119884119894 and 119863119894 are functions of log(1198851198941113568) 1198851198941113569 119885111356911989411135681198851198941113569

1 log(1198851198941113568) 1198851198941113570 log(1198851198941113568) and 119883111356811135691198941113571

Regression is a nonlinear function of the observed covariates

33 58

When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 218728 300799 073 0469

d -69292 33854 -205 0043

z1 363110 26970 1346 lt2e-16

z2 -29033 36619 -079 0430

z3 862022 431030 200 0048

z4 04021 00329 1223 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 14 on 94 degrees of freedom

Multiple R-squared 0855 Adjusted R-squared 0847

F-statistic 111 on 5 and 94 DF p-value lt2e-16

34 58

3 Regression withHeterogeneousTreatment Effects

35 58

Heterogeneous effects binary treatment

bull Completely randomized experiment

119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894

= 1205831113567 + 120591119894119863119894 + (119884119894(0) minus 1205831113567)= 1205831113567 + 120591119863119894 + (119884119894(0) minus 1205831113567) + (120591119894 minus 120591) sdot 119863119894

= 1205831113567 + 120591119863119894 + 120576119894

bull Error term now includes two components1 ldquoBaselinerdquo variation in the outcome (119884119894(0) minus 1205831113567)2 Variation in the treatment effect (120591119894 minus 120591)

bull Easy to verify that under experiment 120124[120576119894|119863119894] = 0bull Thus OLS estimates the ATE with no covariates

36 58

Adding covariates

bull What happens with no unmeasured confounders Need tocondition on 119883119894 now

bull Remember identification of the ATEATT using iteratedexpectations

bull ATE is the weighted sum of CATEs

120591 = 1114012119909120591(119909) Pr[119883119894 = 119909]

bull ATEATT are weighted averages of CATEsbull What about the regression estimand 120591119877 How does it related

to the ATEATT

37 58

Heterogeneous effects and regression

bull Letrsquos investigate this under a saturated regression model

119884119894 = 1114012119909119861119909119894120572119909 + 120591119877119863119894 + 119890119894

bull Use a dummy variable for each unique combination of 119883119894119861119909119894 = 120128(119883119894 = 119909)

bull Linear in 119883119894 by construction

38 58

Investigating the regression coefficient

bull How can we investigate 120591119877 Well we can rely on theregression anatomy

120591119877 = Cov(119884119894 119863119894 minus 119864[119863119894|119883119894])120141(119863119894 minus 119864[119863119894|119883119894])

bull 119863119894 minus 120124[119863119894|119883119894] is the residual from a regression of 119863119894 on the fullset of dummies

bull With a little work we can show

120591119877 =120124 1114094120591(119883119894)(119863119894 minus 120124[119863119894|119883119894])11135691114097

120124[(119863119894 minus 119864[119863119894|119883119894])1113569]=

120124[120591(119883119894)1205901113569119889(119883119894)]120124[1205901113569119889(119883119894)]

bull 1205901113569119889(119909) = 120141[119863119894|119883119894 = 119909] is the conditional variance of treatmentassignment

39 58

ATE versus OLS

120591119877 = 120124[120591(119883119894)119882119894] = 1114012119909120591(119909)

1205901113569119889(119909)120124[1205901113569119889(119883119894)]

โ„™[119883119894 = 119909]

bull Compare to the ATE

120591 = 120124[120591(119883119894)] = 1114012119909120591(119909)โ„™[119883119894 = 119909]

bull Both weight strata relative to their size (โ„™[119883119894 = 119909])bull OLS weights strata higher if the treatment variance in those

strata (1205901113569119889(119909)) is higher in those strata relative to the averagevariance across strata (120124[1205901113569119889(119883119894)])

bull The ATE weights only by their size

40 58

Regression weighting

119882119894 =1205901113569119889(119883119894)

120124[1205901113569119889(119883119894)]

bull Why does OLS weight like thisbull OLS is a minimum-variance estimator more weight to

more precise within-strata estimatesbull Within-strata estimates are most precise when the treatment

is evenly spread and thus has the highest variancebull If 119863119894 is binary then we know the conditional variance will be

1205901113569119889(119909) = โ„™[119863119894 = 1|119883119894 = 119909] (1 minus โ„™[119863119894 = 1|119883119894 = 119909])= 119890(119909) (1 minus 119890(119909))

bull Maximum variance with โ„™[119863119894 = 1|119883119894 = 119909] = 12

41 58

OLS weighting examplebull Binary covariate

โ„™[119883119894 = 1] = 075 โ„™[119883119894 = 0] = 025โ„™[119863119894 = 1|119883119894 = 1] = 09 โ„™[119863119894 = 1|119883119894 = 0] = 05

1205901113569119889(1) = 009 1205901113569119889(0) = 025120591(1) = 1 120591(0) = minus1

bull Implies the ATE is 120591 = 05bull Average conditional variance 120124[1205901113569119889(119883119894)] = 013bull weights for 119883119894 = 1 are 009013 = 0692 for 119883119894 = 0

025013 = 192120591119877 = 120124[120591(119883119894)119882119894]

= 120591(1)119882(1)โ„™[119883119894 = 1] + 120591(0)119882(0)โ„™[119883119894 = 0]= 1 times 0692 times 075 + minus1 times 192 times 025= 0039

42 58

When will OLS estimate the ATE

bull When does 120591 = 120591119877bull Constant treatment effects 120591(119909) = 120591 = 120591119877bull Constant probability of treatment 119890(119909) = โ„™[119863119894 = 1|119883119894 = 119909] = 119890

Implies that the OLS weights are 1

bull Incorrect linearity assumption in 119883119894 will lead to more bias

43 58

Other ways to use regression

bull Whatrsquos the path forward Accept the bias (might be relatively small with saturated

models) Use a different regression approach

bull Let 120583119889(119909) = 120124[119884119894(119889)|119883119894 = 119909] be the CEF for the potentialoutcome under 119863119894 = 119889

bull By consistency and nuc we have 120583119889(119909) = 120124[119884119894|119863119894 = 119889119883119894 = 119909]bull Estimate a regression of 119884119894 on 119883119894 among the 119863119894 = 119889 groupbull Then 119889(119909) is just a predicted value from the regression for

119883119894 = 119909bull How can we use this

44 58

Imputation estimators

bull Impute the treated potential outcomes with 1114023119884119894(1) = 1113568(119883119894)bull Impute the control potential outcomes with 1114023119884119894(0) = 1113567(119883119894)bull Procedure

Regress 119884119894 on 119883119894 in the treated group and get predicted valuesfor all units (treated or control)

Regress 119884119894 on 119883119894 in the control group and get predicted valuesfor all units (treated or control)

Take the average difference between these predicted values

bull More mathematically look like this

120591119894119898119901 =1119873

11140121198941113568(119883119894) minus 1113567(119883119894)

bull Sometimes called an imputation estimator

45 58

Simple imputation estimator

bull Use predict() from the within-group models on the data fromthe entire sample

bull Useful trick use a model on the entire data andmodelframe() to get the right design matrix

heterogeneous effects

yhet lt- ifelse(d == 1 y + rnorm(n 0 5) y)

mod lt- lm(yhet ~ d + X)

mod1 lt- lm(yhet ~ X subset = d == 1)

mod0 lt- lm(yhet ~ X subset = d == 0)

y1imps lt- predict(mod1 modelframe(mod))

y0imps lt- predict(mod0 modelframe(mod))

mean(y1imps - y0imps)

[1] 061

46 58

Notes on imputation estimators

bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE

bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS

bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods

such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)

47 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

48 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

49 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

50 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

51 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

52 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

53 58

Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible

estimatelibrary(mgcv)

mod0 lt- gam(y ~ s(x) subset = d == 0)

summary(mod0)

Family gaussian

Link function identity

Formula

y ~ s(x)

Parametric coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) -00225 00154 -146 016

Approximate significance of smooth terms

edf Refdf F p-value

s(x) 603 708 413 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

R-sq(adj) = 0909 Deviance explained = 928

GCV = 00093204 Scale est = 00071351 n = 30

54 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

55 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

56 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

57 58

Limited dependent variables

bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc

bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models

(stratified diff-in-means)

bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong

bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models

Could go wrong in small samples If using nonlinear models always get effects on the scale of the

outcome

58 58

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 25: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

Review

bull Quick reminder we have potential outcomes 119884119894(1) and 119884119894(0)and two parameters the ATE and ATT

120591 = 119864[119884119894(1) minus 119884119894(0)]120591ATT = 119864[119884119894(1) minus 119884119894(0)|119863119894 = 1]

bull We have shown in past weeks that these effects are identifiedwhen ignorability holds MHE calls this the conditionalindependence assumption (CIA)

25 58

Linear constant effects model binarytreatment

bull Experiment with a simple experiment we can rewrite theconsistency assumption to be a regression formula

119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894

= 120124[119884119894(0)] + 120591119863119894 + (119884119894(0) minus 120124[119884119894(0)])= 1205831113567 + 120591119863119894 + 1199071113567119894

bull Note that if ignorability holds (as in an experiment) for 119884119894(0)then it will also hold for 1199071113567119894 since 120124[119884119894(0)] is constant Thusthis satifies the usual assumptions for regression

26 58

Now with covariates

bull Now assume no unmeasured confounders 119884119894(119889) โŸ‚โŸ‚ 119863119894|119883119894bull We will assume a linear model for the potential outcomes

119884119894(119889) = 120572 + 120591 sdot 119889 + 120578119894

bull Remember that linearity isnrsquot an assumption if 119863119894 is binarybull Effect of 119863119894 is constant here the 120578119894 are the only source of

individual variation and we have 119864[120578119894] = 0bull Consistency assumption allows us to write this as

119884119894 = 120572 + 120591119863119894 + 120578119894

27 58

Covariates in the error

bull Letrsquos assume that 120578119894 is linear in 119883119894 120578119894 = 119883prime119894 120574 + 120584119894

bull New error is uncorrelated with 119883119894 120124[120584119894|119883119894] = 0bull This is an assumption Might be falsebull Plug into the above

120124[119884119894(119889)|119883119894] = 119864[119884119894|119863119894 119883119894] = 120572 + 120591119863119894 + 119864[120578119894|119883119894]= 120572 + 120591119863119894 + 119883prime

119894 120574 + 119864[120584119894|119883119894]= 120572 + 120591119863119894 + 119883prime

119894 120574

28 58

Summing up: regression with constant effects

• Reviewing the assumptions we've used:
  ▶ no unmeasured confounders
  ▶ constant treatment effects
  ▶ linearity of the treatment/covariates
• Under these, we can run the following regression to estimate the ATE, τ:

  Y_i = α + τ D_i + X_i′γ + ν_i

• Works with continuous or ordinal D_i, if the effect of these variables is truly linear.

29/58

OLS constant effects simulation

Model with linear covariates, constant 0 effect of treatment:

library(mvtnorm)
n <- 100
p <- 4
X <- rmvnorm(n = 100, mean = rep(0, p))
gamma <- c(2.74, 1.37, 1.37, 1.37)
y <- 210 + X %*% c(gamma) + rnorm(n)
alpha <- c(-1, 0.5, -0.5, -0.1)
d.probs <- boot::inv.logit(X %*% alpha)
d <- rbinom(n, size = 1, prob = d.probs)

30/58

OLS with no covariates

summary(lm(y ~ d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   220.36       4.58   48.13  < 2e-16
d             -28.69       6.54   -4.39 0.000029
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 33 on 98 degrees of freedom
Multiple R-squared: 0.164, Adjusted R-squared: 0.156
F-statistic: 19.2 on 1 and 98 DF, p-value: 0.000029

31/58

OLS with covariates

summary(lm(y ~ d + X))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 209.9520     0.159  1320.7   <2e-16
d            -0.1280     0.253    -0.5     0.62
X1            2.7368     0.126   21.75   <2e-16
X2            1.3677     0.114   12.01   <2e-16
X3            1.3673     0.130   10.51   <2e-16
X4            1.3570     0.106   12.80   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom
Multiple R-squared: 0.999, Adjusted R-squared: 0.999
F-statistic: 2.44e+04 on 5 and 94 DF, p-value: <2e-16

32/58

What happens with nonlinearity?

Suppose we can only observe the following covariates:

z1 <- exp(X[, 1] / 2)
z2 <- X[, 2] / (1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3] / 25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

• Implies that Y_i and D_i are functions of log(Z_{i1}), Z_{i1}² Z_{i2}, Z_{i3}^{1/3}/log(Z_{i1}), and Z_{i4}^{1/2}.
• Regression is a nonlinear function of the observed covariates.

33/58

When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.8728    30.0799    0.73    0.469
d            -6.9292     3.3854   -2.05    0.043
z1           36.3110     2.6970   13.46   <2e-16
z2           -2.9033     3.6619   -0.79    0.430
z3           86.2022    43.1030    2.00    0.048
z4            0.4021     0.0329   12.23   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14 on 94 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF, p-value: <2e-16

34/58

3. Regression with Heterogeneous Treatment Effects

35/58

Heterogeneous effects, binary treatment

• Completely randomized experiment:

  Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
      = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
      = μ_0 + τ_i D_i + (Y_i(0) − μ_0)
      = μ_0 + τ D_i + (Y_i(0) − μ_0) + (τ_i − τ)·D_i
      = μ_0 + τ D_i + ε_i

• The error term now includes two components:
  1. "Baseline" variation in the outcome: (Y_i(0) − μ_0)
  2. Variation in the treatment effect: (τ_i − τ)
• Easy to verify that under an experiment, E[ε_i | D_i] = 0.
• Thus, OLS estimates the ATE with no covariates.

36/58
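The claim on this slide, that E[ε_i | D_i] = 0 under randomization so OLS with no covariates recovers the ATE even when τ_i varies across units, can be checked with a quick simulation. This is a sketch with a made-up effect distribution (ATE of 2), using Bernoulli rather than complete randomization for simplicity:

```r
set.seed(42)
n <- 100000
tau.i <- rnorm(n, mean = 2, sd = 1)       # heterogeneous unit-level effects
y0    <- rnorm(n)                         # baseline potential outcomes
y1    <- y0 + tau.i
d     <- rbinom(n, size = 1, prob = 0.5)  # randomized treatment
y     <- ifelse(d == 1, y1, y0)           # consistency: observed outcome

ate <- mean(tau.i)                        # (sample) ATE, approximately 2
ols <- unname(coef(lm(y ~ d))["d"])       # OLS slope = difference in means
c(ate = ate, ols = ols)                   # the two should be very close
```

With n this large the OLS slope lands within a few hundredths of the ATE, despite the effect heterogeneity sitting in the error term.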

Adding covariates

• What happens with no unmeasured confounders? We need to condition on X_i now.
• Remember identification of the ATE/ATT using iterated expectations.
• The ATE is the weighted sum of CATEs:

  τ = Σ_x τ(x) Pr[X_i = x]

• ATE/ATT are weighted averages of CATEs.
• What about the regression estimand, τ_R? How does it relate to the ATE/ATT?

37/58

Heterogeneous effects and regression

• Let's investigate this under a saturated regression model:

  Y_i = Σ_x B_{xi} α_x + τ_R D_i + e_i

• Use a dummy variable for each unique combination of X_i: B_{xi} = 1(X_i = x)
• Linear in X_i by construction.

38/58

Investigating the regression coefficient

• How can we investigate τ_R? We can rely on the regression anatomy:

  τ_R = Cov(Y_i, D_i − E[D_i | X_i]) / Var(D_i − E[D_i | X_i])

• D_i − E[D_i | X_i] is the residual from a regression of D_i on the full set of dummies.
• With a little work, we can show:

  τ_R = E[τ(X_i)(D_i − E[D_i | X_i])²] / E[(D_i − E[D_i | X_i])²]
      = E[τ(X_i) σ²_d(X_i)] / E[σ²_d(X_i)]

• σ²_d(x) = Var[D_i | X_i = x] is the conditional variance of treatment assignment.

39/58
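The regression-anatomy identity above is exact in-sample and can be verified numerically: residualize D_i on the strata dummies, then compare Cov(Y, D̃)/Var(D̃) to the OLS coefficient from the saturated model. A sketch with a made-up three-stratum covariate:

```r
set.seed(7)
n <- 5000
x <- sample(1:3, n, replace = TRUE)      # discrete covariate, 3 strata
e.x   <- c(0.2, 0.5, 0.8)[x]             # propensity score in each stratum
tau.x <- c(-1, 0, 1)[x]                  # stratum-level treatment effects
d <- rbinom(n, 1, e.x)
y <- tau.x * d + rnorm(n)

tau.R <- unname(coef(lm(y ~ factor(x) + d))["d"])  # saturated OLS

d.res   <- residuals(lm(d ~ factor(x)))  # sample analog of D_i - E[D_i|X_i]
anatomy <- cov(y, d.res) / var(d.res)    # Cov(Y, D~) / Var(D~)

all.equal(tau.R, anatomy)                # TRUE: identical up to rounding
```

The equality is the Frisch-Waugh-Lovell theorem at work: the part of y explained by the dummies is exactly orthogonal (in-sample) to the residualized treatment, so it drops out of the covariance.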

ATE versus OLS

  τ_R = E[τ(X_i) W_i] = Σ_x τ(x) · (σ²_d(x) / E[σ²_d(X_i)]) · Pr[X_i = x]

• Compare to the ATE:

  τ = E[τ(X_i)] = Σ_x τ(x) Pr[X_i = x]

• Both weight strata relative to their size (Pr[X_i = x]).
• OLS weights strata higher if the treatment variance in those strata (σ²_d(x)) is higher relative to the average variance across strata (E[σ²_d(X_i)]).
• The ATE weights strata only by their size.

40/58

Regression weighting

  W_i = σ²_d(X_i) / E[σ²_d(X_i)]

• Why does OLS weight like this?
• OLS is a minimum-variance estimator: more weight to more precise within-strata estimates.
• Within-strata estimates are most precise when the treatment is evenly spread and thus has the highest variance.
• If D_i is binary, then we know the conditional variance will be:

  σ²_d(x) = Pr[D_i = 1 | X_i = x](1 − Pr[D_i = 1 | X_i = x]) = e(x)(1 − e(x))

• Maximum variance with Pr[D_i = 1 | X_i = x] = 1/2.

41/58
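The last bullet, that e(x)(1 − e(x)) peaks at e(x) = 1/2, is easy to confirm on a grid (a minimal sketch):

```r
e <- seq(0.01, 0.99, by = 0.01)  # grid of propensity scores
v <- e * (1 - e)                 # conditional variance of a binary treatment
e[which.max(v)]                  # variance is maximized at e = 0.5
```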

OLS weighting example

• Binary covariate:

  Pr[X_i = 1] = 0.75         Pr[X_i = 0] = 0.25
  Pr[D_i = 1 | X_i = 1] = 0.9   Pr[D_i = 1 | X_i = 0] = 0.5
  σ²_d(1) = 0.09             σ²_d(0) = 0.25
  τ(1) = 1                   τ(0) = −1

• Implies the ATE is τ = 0.5.
• Average conditional variance: E[σ²_d(X_i)] = 0.13.
• Weights: for X_i = 1, 0.09/0.13 = 0.692; for X_i = 0, 0.25/0.13 = 1.92.

  τ_R = E[τ(X_i) W_i]
      = τ(1) W(1) Pr[X_i = 1] + τ(0) W(0) Pr[X_i = 0]
      = 1 × 0.692 × 0.75 + (−1) × 1.92 × 0.25
      = 0.039

42/58
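The numbers in this example can be reproduced directly. Exact arithmetic gives τ_R = 0.005/0.13 ≈ 0.0385, matching the slide's 0.039 once the weights are rounded:

```r
p.x   <- c(0.25, 0.75)      # Pr[X = x] for x = 0, 1
e.x   <- c(0.5,  0.9)       # Pr[D = 1 | X = x]
tau.x <- c(-1,   1)         # conditional average treatment effects

s2     <- e.x * (1 - e.x)   # sigma^2_d(x): 0.25 and 0.09
s2.bar <- sum(s2 * p.x)     # E[sigma^2_d(X)] = 0.13
w      <- s2 / s2.bar       # OLS weights: 1.92 and 0.692

ate   <- sum(tau.x * p.x)       # 0.5
tau.R <- sum(tau.x * w * p.x)   # about 0.038
c(ate = ate, tau.R = tau.R)
```

Even though every stratum-size weight is correct, the variance weighting drags the regression estimand from 0.5 nearly to zero, because the high-variance stratum has the negative effect.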

When will OLS estimate the ATE?

• When does τ = τ_R?
• Constant treatment effects: τ(x) = τ = τ_R.
• Constant probability of treatment: e(x) = Pr[D_i = 1 | X_i = x] = e.
  ▶ Implies that the OLS weights are 1.
• An incorrect linearity assumption in X_i will lead to more bias.

43/58

Other ways to use regression

• What's the path forward?
  ▶ Accept the bias (might be relatively small with saturated models).
  ▶ Use a different regression approach.
• Let μ_d(x) = E[Y_i(d) | X_i = x] be the CEF for the potential outcome under D_i = d.
• By consistency and no unmeasured confounders, we have μ_d(x) = E[Y_i | D_i = d, X_i = x].
• Estimate a regression of Y_i on X_i among the D_i = d group.
• Then μ̂_d(x) is just a predicted value from that regression at X_i = x.
• How can we use this?

44/58

Imputation estimators

• Impute the treated potential outcomes with Ŷ_i(1) = μ̂_1(X_i).
• Impute the control potential outcomes with Ŷ_i(0) = μ̂_0(X_i).
• Procedure:
  ▶ Regress Y_i on X_i in the treated group and get predicted values for all units (treated or control).
  ▶ Regress Y_i on X_i in the control group and get predicted values for all units (treated or control).
  ▶ Take the average difference between these predicted values.
• More mathematically, this looks like:

  τ̂_imp = (1/N) Σ_i [μ̂_1(X_i) − μ̂_0(X_i)]

• Sometimes called an imputation estimator.

45/58

Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample.
• Useful trick: fit a model on the entire data and use model.frame() to get the right design matrix.

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61

46/58

Notes on imputation estimators

• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE.
• Why don't people use this?
  ▶ Most people don't know the results we've been talking about.
  ▶ Harder to implement than vanilla OLS.
• Can use linear regression to estimate μ̂_d(x) = x′β̂_d.
• Recent trend is to estimate μ̂_d(x) via non-parametric methods such as:
  ▶ kernel regression, local linear regression, regression trees, etc.
  ▶ Easiest is generalized additive models (GAMs).

47/58

Imputation estimator visualization

[Figure, built up over three slides: scatterplot of y against x for treated and control units, showing the within-group regression fits and the imputed potential outcomes.]

48–50/58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i:

[Figure, built up over three slides: scatterplot of y against x for treated and control units, where the true CEFs are nonlinear.]

51–53/58

Using semiparametric regression

• Here, the CEFs are nonlinear, but we don't know their form.
• We can use GAMs from the mgcv package for a flexible estimate:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204   Scale est. = 0.0071351   n = 30

54/58

Using GAMs

[Figure, built up over three slides: the same scatterplot with GAM fits for the treated and control groups tracing out the nonlinear CEFs.]

55–57/58

Limited dependent variables

• Usual advice: model the data from first principles (logit/probit for binary outcomes, Poisson for counts, etc.)
• OLS is a-ok with limited DVs when:
  ▶ Binary treatment and no covariates (just diff-in-means)
  ▶ Binary treatment, discrete covariates, and saturated models (stratified diff-in-means)
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong.
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models.
  ▶ Could go wrong in small samples.
  ▶ If using nonlinear models, always get effects on the scale of the outcome.

58/58
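The first case above, binary treatment with no covariates, is an algebraic identity rather than an approximation: the OLS slope equals the difference in means exactly, even when the outcome is binary. A quick check on simulated data (the data-generating numbers are made up for illustration):

```r
set.seed(11)
n <- 1000
d <- rbinom(n, 1, 0.4)             # binary treatment
y <- rbinom(n, 1, 0.3 + 0.2 * d)   # binary (limited) outcome

ols.slope  <- unname(coef(lm(y ~ d))["d"])
diff.means <- mean(y[d == 1]) - mean(y[d == 0])
all.equal(ols.slope, diff.means)   # TRUE: identical up to rounding
```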

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 26: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

Linear constant effects model binarytreatment

bull Experiment with a simple experiment we can rewrite theconsistency assumption to be a regression formula

119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894

= 120124[119884119894(0)] + 120591119863119894 + (119884119894(0) minus 120124[119884119894(0)])= 1205831113567 + 120591119863119894 + 1199071113567119894

bull Note that if ignorability holds (as in an experiment) for 119884119894(0)then it will also hold for 1199071113567119894 since 120124[119884119894(0)] is constant Thusthis satifies the usual assumptions for regression

26 58

Now with covariates

bull Now assume no unmeasured confounders 119884119894(119889) โŸ‚โŸ‚ 119863119894|119883119894bull We will assume a linear model for the potential outcomes

119884119894(119889) = 120572 + 120591 sdot 119889 + 120578119894

bull Remember that linearity isnrsquot an assumption if 119863119894 is binarybull Effect of 119863119894 is constant here the 120578119894 are the only source of

individual variation and we have 119864[120578119894] = 0bull Consistency assumption allows us to write this as

119884119894 = 120572 + 120591119863119894 + 120578119894

27 58

Covariates in the error

bull Letrsquos assume that 120578119894 is linear in 119883119894 120578119894 = 119883prime119894 120574 + 120584119894

bull New error is uncorrelated with 119883119894 120124[120584119894|119883119894] = 0bull This is an assumption Might be falsebull Plug into the above

120124[119884119894(119889)|119883119894] = 119864[119884119894|119863119894 119883119894] = 120572 + 120591119863119894 + 119864[120578119894|119883119894]= 120572 + 120591119863119894 + 119883prime

119894 120574 + 119864[120584119894|119883119894]= 120572 + 120591119863119894 + 119883prime

119894 120574

28 58

Summing up regression with constanteffects

bull Reviewing the assumptions wersquove used no unmeasured confounders constant treatment effects linearity of the treatmentcovariates

bull Under these we can run the following regression to estimatethe ATE 120591

119884119894 = 120572 + 120591119863119894 + 119883prime119894 120574 + 120584119894

bull Works with continuous or ordinal 119863119894 if linearity in the effect ofthese variables is truly linear

29 58

OLS constant effects simulation

Model with linear covariates constant 0 effect of treatment

library(mvtnorm)

n lt- 100

p lt- 4

X lt- rmvnorm(n = 100 mean = rep(0 p))

gamma lt- c(274 137 137 137)

y lt- 210 + X c(gamma) + rnorm(n)

alpha lt- c(-1 05 -05 -01)

dprobs lt- bootinvlogit(X alpha)

d lt- rbinom(n size = 1 prob = dprobs)

30 58

OLS with no covariates

summary(lm(y ~ d))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 22036 458 4813 lt 2e-16

d -2869 654 -439 0000029

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 33 on 98 degrees of freedom

Multiple R-squared 0164 Adjusted R-squared 0156

F-statistic 192 on 1 and 98 DF p-value 0000029

31 58

OLS with covariates

summary(lm(y ~ d + X))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 209952 0159 13207 lt2e-16

d -0128 0253 -05 062

X1 27368 0126 2175 lt2e-16

X2 13677 0114 1201 lt2e-16

X3 13673 0130 1051 lt2e-16

X4 13570 0106 1280 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 1 on 94 degrees of freedom

Multiple R-squared 0999 Adjusted R-squared 0999

F-statistic 244e+04 on 5 and 94 DF p-value lt2e-16

32 58

What happens with nonlinearity

Suppose we can only observe the following covariates

z1 lt- exp(X[ 1]2)

z2 lt- X[ 2](1 + exp(X[ 1])) + 10

z3 lt- (X[ 1] X[ 3]25 + 06)^3

z4 lt- (X[ 2] + X[ 4] + 20)^2

Implies that 119884119894 and 119863119894 are functions of log(1198851198941113568) 1198851198941113569 119885111356911989411135681198851198941113569

1 log(1198851198941113568) 1198851198941113570 log(1198851198941113568) and 119883111356811135691198941113571

Regression is a nonlinear function of the observed covariates

33 58

When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 218728 300799 073 0469

d -69292 33854 -205 0043

z1 363110 26970 1346 lt2e-16

z2 -29033 36619 -079 0430

z3 862022 431030 200 0048

z4 04021 00329 1223 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 14 on 94 degrees of freedom

Multiple R-squared 0855 Adjusted R-squared 0847

F-statistic 111 on 5 and 94 DF p-value lt2e-16

34 58

3 Regression withHeterogeneousTreatment Effects

35 58

Heterogeneous effects binary treatment

bull Completely randomized experiment

119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894

= 1205831113567 + 120591119894119863119894 + (119884119894(0) minus 1205831113567)= 1205831113567 + 120591119863119894 + (119884119894(0) minus 1205831113567) + (120591119894 minus 120591) sdot 119863119894

= 1205831113567 + 120591119863119894 + 120576119894

bull Error term now includes two components1 ldquoBaselinerdquo variation in the outcome (119884119894(0) minus 1205831113567)2 Variation in the treatment effect (120591119894 minus 120591)

bull Easy to verify that under experiment 120124[120576119894|119863119894] = 0bull Thus OLS estimates the ATE with no covariates

36 58

Adding covariates

bull What happens with no unmeasured confounders Need tocondition on 119883119894 now

bull Remember identification of the ATEATT using iteratedexpectations

bull ATE is the weighted sum of CATEs

120591 = 1114012119909120591(119909) Pr[119883119894 = 119909]

bull ATEATT are weighted averages of CATEsbull What about the regression estimand 120591119877 How does it related

to the ATEATT

37 58

Heterogeneous effects and regression

bull Letrsquos investigate this under a saturated regression model

119884119894 = 1114012119909119861119909119894120572119909 + 120591119877119863119894 + 119890119894

bull Use a dummy variable for each unique combination of 119883119894119861119909119894 = 120128(119883119894 = 119909)

bull Linear in 119883119894 by construction

38 58

Investigating the regression coefficient

bull How can we investigate 120591119877 Well we can rely on theregression anatomy

120591119877 = Cov(119884119894 119863119894 minus 119864[119863119894|119883119894])120141(119863119894 minus 119864[119863119894|119883119894])

bull 119863119894 minus 120124[119863119894|119883119894] is the residual from a regression of 119863119894 on the fullset of dummies

bull With a little work we can show

120591119877 =120124 1114094120591(119883119894)(119863119894 minus 120124[119863119894|119883119894])11135691114097

120124[(119863119894 minus 119864[119863119894|119883119894])1113569]=

120124[120591(119883119894)1205901113569119889(119883119894)]120124[1205901113569119889(119883119894)]

bull 1205901113569119889(119909) = 120141[119863119894|119883119894 = 119909] is the conditional variance of treatmentassignment

39 58

ATE versus OLS

120591119877 = 120124[120591(119883119894)119882119894] = 1114012119909120591(119909)

1205901113569119889(119909)120124[1205901113569119889(119883119894)]

โ„™[119883119894 = 119909]

bull Compare to the ATE

120591 = 120124[120591(119883119894)] = 1114012119909120591(119909)โ„™[119883119894 = 119909]

bull Both weight strata relative to their size (โ„™[119883119894 = 119909])bull OLS weights strata higher if the treatment variance in those

strata (1205901113569119889(119909)) is higher in those strata relative to the averagevariance across strata (120124[1205901113569119889(119883119894)])

bull The ATE weights only by their size

40 58

Regression weighting

119882119894 =1205901113569119889(119883119894)

120124[1205901113569119889(119883119894)]

bull Why does OLS weight like thisbull OLS is a minimum-variance estimator more weight to

more precise within-strata estimatesbull Within-strata estimates are most precise when the treatment

is evenly spread and thus has the highest variancebull If 119863119894 is binary then we know the conditional variance will be

1205901113569119889(119909) = โ„™[119863119894 = 1|119883119894 = 119909] (1 minus โ„™[119863119894 = 1|119883119894 = 119909])= 119890(119909) (1 minus 119890(119909))

bull Maximum variance with โ„™[119863119894 = 1|119883119894 = 119909] = 12

41 58

OLS weighting examplebull Binary covariate

โ„™[119883119894 = 1] = 075 โ„™[119883119894 = 0] = 025โ„™[119863119894 = 1|119883119894 = 1] = 09 โ„™[119863119894 = 1|119883119894 = 0] = 05

1205901113569119889(1) = 009 1205901113569119889(0) = 025120591(1) = 1 120591(0) = minus1

bull Implies the ATE is 120591 = 05bull Average conditional variance 120124[1205901113569119889(119883119894)] = 013bull weights for 119883119894 = 1 are 009013 = 0692 for 119883119894 = 0

025013 = 192120591119877 = 120124[120591(119883119894)119882119894]

= 120591(1)119882(1)โ„™[119883119894 = 1] + 120591(0)119882(0)โ„™[119883119894 = 0]= 1 times 0692 times 075 + minus1 times 192 times 025= 0039

42 58

When will OLS estimate the ATE

bull When does 120591 = 120591119877bull Constant treatment effects 120591(119909) = 120591 = 120591119877bull Constant probability of treatment 119890(119909) = โ„™[119863119894 = 1|119883119894 = 119909] = 119890

Implies that the OLS weights are 1

bull Incorrect linearity assumption in 119883119894 will lead to more bias

43 58

Other ways to use regression

bull Whatrsquos the path forward Accept the bias (might be relatively small with saturated

models) Use a different regression approach

bull Let 120583119889(119909) = 120124[119884119894(119889)|119883119894 = 119909] be the CEF for the potentialoutcome under 119863119894 = 119889

bull By consistency and nuc we have 120583119889(119909) = 120124[119884119894|119863119894 = 119889119883119894 = 119909]bull Estimate a regression of 119884119894 on 119883119894 among the 119863119894 = 119889 groupbull Then 119889(119909) is just a predicted value from the regression for

119883119894 = 119909bull How can we use this

44 58

Imputation estimators

bull Impute the treated potential outcomes with 1114023119884119894(1) = 1113568(119883119894)bull Impute the control potential outcomes with 1114023119884119894(0) = 1113567(119883119894)bull Procedure

Regress 119884119894 on 119883119894 in the treated group and get predicted valuesfor all units (treated or control)

Regress 119884119894 on 119883119894 in the control group and get predicted valuesfor all units (treated or control)

Take the average difference between these predicted values

bull More mathematically look like this

120591119894119898119901 =1119873

11140121198941113568(119883119894) minus 1113567(119883119894)

bull Sometimes called an imputation estimator

45 58

Simple imputation estimator

bull Use predict() from the within-group models on the data fromthe entire sample

bull Useful trick use a model on the entire data andmodelframe() to get the right design matrix

heterogeneous effects

yhet lt- ifelse(d == 1 y + rnorm(n 0 5) y)

mod lt- lm(yhet ~ d + X)

mod1 lt- lm(yhet ~ X subset = d == 1)

mod0 lt- lm(yhet ~ X subset = d == 0)

y1imps lt- predict(mod1 modelframe(mod))

y0imps lt- predict(mod0 modelframe(mod))

mean(y1imps - y0imps)

[1] 061

46 58

Notes on imputation estimators

bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE

bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS

bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods

such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)

47 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

48 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

49 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

50 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

51 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

52 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

53 58

Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible

estimatelibrary(mgcv)

mod0 lt- gam(y ~ s(x) subset = d == 0)

summary(mod0)

Family gaussian

Link function identity

Formula

y ~ s(x)

Parametric coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) -00225 00154 -146 016

Approximate significance of smooth terms

edf Refdf F p-value

s(x) 603 708 413 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

R-sq(adj) = 0909 Deviance explained = 928

GCV = 00093204 Scale est = 00071351 n = 30

54 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

55 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

56 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

57 58

Limited dependent variables

bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc

bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models

(stratified diff-in-means)

bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong

bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models

Could go wrong in small samples If using nonlinear models always get effects on the scale of the

outcome

58 58

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 27: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

Now with covariates

bull Now assume no unmeasured confounders 119884119894(119889) โŸ‚โŸ‚ 119863119894|119883119894bull We will assume a linear model for the potential outcomes

119884119894(119889) = 120572 + 120591 sdot 119889 + 120578119894

bull Remember that linearity isnrsquot an assumption if 119863119894 is binarybull Effect of 119863119894 is constant here the 120578119894 are the only source of

individual variation and we have 119864[120578119894] = 0bull Consistency assumption allows us to write this as

119884119894 = 120572 + 120591119863119894 + 120578119894

27 58

Covariates in the error

bull Letrsquos assume that 120578119894 is linear in 119883119894 120578119894 = 119883prime119894 120574 + 120584119894

bull New error is uncorrelated with 119883119894 120124[120584119894|119883119894] = 0bull This is an assumption Might be falsebull Plug into the above

120124[119884119894(119889)|119883119894] = 119864[119884119894|119863119894 119883119894] = 120572 + 120591119863119894 + 119864[120578119894|119883119894]= 120572 + 120591119863119894 + 119883prime

119894 120574 + 119864[120584119894|119883119894]= 120572 + 120591119863119894 + 119883prime

119894 120574

28 58

Summing up regression with constanteffects

bull Reviewing the assumptions wersquove used no unmeasured confounders constant treatment effects linearity of the treatmentcovariates

bull Under these we can run the following regression to estimatethe ATE 120591

119884119894 = 120572 + 120591119863119894 + 119883prime119894 120574 + 120584119894

bull Works with continuous or ordinal 119863119894 if linearity in the effect ofthese variables is truly linear

29 58

OLS constant effects simulation

Model with linear covariates constant 0 effect of treatment

library(mvtnorm)

n lt- 100

p lt- 4

X lt- rmvnorm(n = 100 mean = rep(0 p))

gamma lt- c(274 137 137 137)

y lt- 210 + X c(gamma) + rnorm(n)

alpha lt- c(-1 05 -05 -01)

dprobs lt- bootinvlogit(X alpha)

d lt- rbinom(n size = 1 prob = dprobs)

30 58

OLS with no covariates

summary(lm(y ~ d))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 22036 458 4813 lt 2e-16

d -2869 654 -439 0000029

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 33 on 98 degrees of freedom

Multiple R-squared 0164 Adjusted R-squared 0156

F-statistic 192 on 1 and 98 DF p-value 0000029

31 58

OLS with covariates

summary(lm(y ~ d + X))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 209952 0159 13207 lt2e-16

d -0128 0253 -05 062

X1 27368 0126 2175 lt2e-16

X2 13677 0114 1201 lt2e-16

X3 13673 0130 1051 lt2e-16

X4 13570 0106 1280 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 1 on 94 degrees of freedom

Multiple R-squared 0999 Adjusted R-squared 0999

F-statistic 244e+04 on 5 and 94 DF p-value lt2e-16

32 58

What happens with nonlinearity

Suppose we can only observe the following covariates

z1 lt- exp(X[ 1]2)

z2 lt- X[ 2](1 + exp(X[ 1])) + 10

z3 lt- (X[ 1] X[ 3]25 + 06)^3

z4 lt- (X[ 2] + X[ 4] + 20)^2

Implies that 119884119894 and 119863119894 are functions of log(1198851198941113568) 1198851198941113569 119885111356911989411135681198851198941113569

1 log(1198851198941113568) 1198851198941113570 log(1198851198941113568) and 119883111356811135691198941113571

Regression is a nonlinear function of the observed covariates

33 58

When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 218728 300799 073 0469

d -69292 33854 -205 0043

z1 363110 26970 1346 lt2e-16

z2 -29033 36619 -079 0430

z3 862022 431030 200 0048

z4 04021 00329 1223 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 14 on 94 degrees of freedom

Multiple R-squared 0855 Adjusted R-squared 0847

F-statistic 111 on 5 and 94 DF p-value lt2e-16

34 58

3 Regression withHeterogeneousTreatment Effects

35 58

Heterogeneous effects, binary treatment

• Completely randomized experiment:

Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
    = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
    = μ₀ + τ_i D_i + (Y_i(0) − μ₀)
    = μ₀ + τ D_i + (Y_i(0) − μ₀) + (τ_i − τ) · D_i
    = μ₀ + τ D_i + ε_i

• Error term now includes two components:
  1. "Baseline" variation in the outcome: (Y_i(0) − μ₀)
  2. Variation in the treatment effect: (τ_i − τ)

• Easy to verify that under the experiment, 𝔼[ε_i | D_i] = 0

• Thus, OLS estimates the ATE with no covariates

36 / 58
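The last two bullets can be checked by simulation. A minimal sketch in Python rather than the course's R (the distributions and sample size here are invented for illustration): with completely randomized D_i and heterogeneous unit-level effects τ_i, the difference in means — which is exactly the OLS coefficient on a binary D_i with no covariates — sits close to the ATE.

```python
import random

random.seed(42)
n = 100_000

# Potential outcomes with heterogeneous effects: tau_i varies across units
y0 = [random.gauss(0, 1) for _ in range(n)]    # baseline outcomes Y_i(0), so mu_0 = 0
tau = [random.gauss(2, 1) for _ in range(n)]   # unit-level effects tau_i, ATE = 2
d = [random.random() < 0.5 for _ in range(n)]  # completely randomized treatment
y = [y0i + ti * di for y0i, ti, di in zip(y0, tau, d)]

ate = sum(tau) / n
y_t = [yi for yi, di in zip(y, d) if di]
y_c = [yi for yi, di in zip(y, d) if not di]
diff_in_means = sum(y_t) / len(y_t) - sum(y_c) / len(y_c)

# With a binary D and no covariates, the OLS coefficient on D is exactly
# this difference in means, and it should be close to the ATE
print(round(ate, 2), round(diff_in_means, 2))
```

Randomization makes ε_i mean-independent of D_i, so the two numbers differ only by sampling error.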

Adding covariates

• What happens with no unmeasured confounders? Need to condition on X_i now.

• Remember identification of the ATE/ATT using iterated expectations.

• ATE is the weighted sum of CATEs:

τ = Σ_x τ(x) ℙ[X_i = x]

• ATE/ATT are weighted averages of CATEs.

• What about the regression estimand τ^R? How does it relate to the ATE/ATT?

37 / 58

Heterogeneous effects and regression

• Let's investigate this under a saturated regression model:

Y_i = Σ_x B_xi α_x + τ^R D_i + e_i

• Use a dummy variable for each unique combination of X_i: B_xi = 𝟙(X_i = x)

• Linear in X_i by construction

38 / 58

Investigating the regression coefficient

• How can we investigate τ^R? Well, we can rely on the regression anatomy:

τ^R = Cov(Y_i, D_i − 𝔼[D_i | X_i]) / 𝕍(D_i − 𝔼[D_i | X_i])

• D_i − 𝔼[D_i | X_i] is the residual from a regression of D_i on the full set of dummies

• With a little work, we can show:

τ^R = 𝔼[τ(X_i)(D_i − 𝔼[D_i | X_i])²] / 𝔼[(D_i − 𝔼[D_i | X_i])²] = 𝔼[τ(X_i) σ²_d(X_i)] / 𝔼[σ²_d(X_i)]

• σ²_d(x) = 𝕍[D_i | X_i = x] is the conditional variance of treatment assignment

39 / 58
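The second display is an exact in-sample identity, not just an approximation. A Python sketch (names and parameter values are invented; one binary covariate stands in for the full set of dummies) checks that the regression-anatomy coefficient equals the variance-weighted average of within-strata differences in means:

```python
import random

random.seed(7)
n = 20_000

x = [random.random() < 0.6 for _ in range(n)]   # one binary covariate (the "strata")
e = {True: 0.8, False: 0.3}                     # P(D = 1 | X = x): very different variances
d = [random.random() < e[xi] for xi in x]
cate = {True: 1.0, False: -1.0}                 # tau(x)
y = [cate[xi] * di + random.gauss(0, 1) for xi, di in zip(x, d)]

def mean(v):
    return sum(v) / len(v)

# Residual of D on E[D | X]: strata means play the role of the saturated dummies
dbar = {s: mean([di for di, xi in zip(d, x) if xi == s]) for s in (True, False)}
d_tilde = [di - dbar[xi] for di, xi in zip(d, x)]

# Regression-anatomy coefficient: Cov(Y, D - E[D|X]) / V(D - E[D|X])
tau_r = sum(yi * dt for yi, dt in zip(y, d_tilde)) / sum(dt * dt for dt in d_tilde)

# Variance-weighted average of within-strata differences in means
num = den = 0.0
for s in (True, False):
    ys = [yi for yi, xi in zip(y, x) if xi == s]
    ds = [di for di, xi in zip(d, x) if xi == s]
    p_s = len(ys) / n
    var_s = mean([(di - mean(ds)) ** 2 for di in ds])
    tau_s = (mean([yi for yi, di in zip(ys, ds) if di]) -
             mean([yi for yi, di in zip(ys, ds) if not di]))
    num += tau_s * var_s * p_s
    den += var_s * p_s

print(abs(tau_r - num / den))  # agrees up to floating-point error
```

Because the two strata get very different treatment variances here, τ^R lands well away from the size-weighted ATE — which is the point of the next slides.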

ATE versus OLS

τ^R = 𝔼[τ(X_i) W_i] = Σ_x τ(x) (σ²_d(x) / 𝔼[σ²_d(X_i)]) ℙ[X_i = x]

• Compare to the ATE:

τ = 𝔼[τ(X_i)] = Σ_x τ(x) ℙ[X_i = x]

• Both weight strata relative to their size, ℙ[X_i = x]

• OLS weights strata higher if the treatment variance in those strata, σ²_d(x), is high relative to the average variance across strata, 𝔼[σ²_d(X_i)]

• The ATE weights strata only by their size

40 / 58

Regression weighting

W_i = σ²_d(X_i) / 𝔼[σ²_d(X_i)]

• Why does OLS weight like this?

• OLS is a minimum-variance estimator ⇒ more weight to more precise within-strata estimates

• Within-strata estimates are most precise when the treatment is evenly spread, and thus has the highest variance

• If D_i is binary, then the conditional variance is

σ²_d(x) = ℙ[D_i = 1 | X_i = x](1 − ℙ[D_i = 1 | X_i = x]) = e(x)(1 − e(x))

• Maximum variance when ℙ[D_i = 1 | X_i = x] = 1/2

41 / 58
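A throwaway numerical check of the last bullet (not from the slides): scanning propensity scores on a grid confirms that e(1 − e) peaks at e = 1/2, where it equals 1/4.

```python
# Treatment variance e(1 - e) over a grid of propensity scores
grid = [i / 1000 for i in range(1001)]
best_var, best_e = max((e * (1 - e), e) for e in grid)
print(best_e, best_var)  # maximized at e = 0.5, where the variance is 0.25
```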

OLS weighting example

• Binary covariate:

ℙ[X_i = 1] = 0.75        ℙ[X_i = 0] = 0.25
ℙ[D_i = 1 | X_i = 1] = 0.9        ℙ[D_i = 1 | X_i = 0] = 0.5
σ²_d(1) = 0.09        σ²_d(0) = 0.25
τ(1) = 1        τ(0) = −1

• Implies the ATE is τ = 0.5

• Average conditional variance: 𝔼[σ²_d(X_i)] = 0.13

• Weights: for X_i = 1, 0.09/0.13 = 0.692; for X_i = 0, 0.25/0.13 = 1.92

τ^R = 𝔼[τ(X_i) W_i]
    = τ(1) W(1) ℙ[X_i = 1] + τ(0) W(0) ℙ[X_i = 0]
    = 1 × 0.692 × 0.75 + (−1) × 1.92 × 0.25
    = 0.039

42 / 58
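The arithmetic above is easy to reproduce. A small Python sketch with the values taken straight from the slide (the exact answer is 2/52 ≈ 0.0385, which the slide rounds to 0.039):

```python
# Quantities straight from the slide's binary-covariate example
p    = {1: 0.75, 0: 0.25}   # P[X_i = x]
prop = {1: 0.90, 0: 0.50}   # P[D_i = 1 | X_i = x]
cate = {1: 1.0,  0: -1.0}   # tau(x)

var_d   = {x: prop[x] * (1 - prop[x]) for x in (0, 1)}  # sigma^2_d(x): 0.09 and 0.25
avg_var = sum(var_d[x] * p[x] for x in (0, 1))          # E[sigma^2_d(X_i)] = 0.13
w       = {x: var_d[x] / avg_var for x in (0, 1)}       # OLS weights: ~0.692 and ~1.92

ate   = sum(cate[x] * p[x] for x in (0, 1))             # 0.5
tau_r = sum(cate[x] * w[x] * p[x] for x in (0, 1))      # 2/52 ~ 0.0385
print(ate, round(tau_r, 3))
```

Same-sized strata, wildly different answers: the ATE is 0.5, but the variance weighting drags τ^R almost to zero.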

When will OLS estimate the ATE?

• When does τ = τ^R?

• Constant treatment effects: τ(x) = τ = τ^R

• Constant probability of treatment: e(x) = ℙ[D_i = 1 | X_i = x] = e
  • Implies that the OLS weights are 1

• An incorrect linearity assumption in X_i will lead to more bias

43 / 58
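The constant-propensity case is worth a quick check with hypothetical numbers: when e(x) = e in every stratum, σ²_d(x) is constant, every weight is 1, and τ^R collapses to the ATE even with heterogeneous effects.

```python
# Heterogeneous CATEs and unequal strata, but a CONSTANT propensity score
p    = {0: 0.2, 1: 0.5, 2: 0.3}    # P[X_i = x]
cate = {0: -1.0, 1: 0.5, 2: 2.0}   # tau(x), deliberately different
e    = 0.4                         # e(x) = e in every stratum

var_d   = e * (1 - e)                      # sigma^2_d(x) is the same everywhere
avg_var = sum(var_d * p[x] for x in p)     # equals var_d, since the p's sum to 1
w       = {x: var_d / avg_var for x in p}  # so every weight is 1

ate   = sum(cate[x] * p[x] for x in p)
tau_r = sum(cate[x] * w[x] * p[x] for x in p)
print(ate, tau_r)  # identical up to floating point
```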

Other ways to use regression

• What's the path forward?
  • Accept the bias (might be relatively small with saturated models)
  • Use a different regression approach

• Let μ_d(x) = 𝔼[Y_i(d) | X_i = x] be the CEF for the potential outcome under D_i = d

• By consistency and no unmeasured confounders, we have μ_d(x) = 𝔼[Y_i | D_i = d, X_i = x]

• Estimate a regression of Y_i on X_i among the D_i = d group

• Then μ̂_d(x) is just a predicted value from that regression at X_i = x

• How can we use this?

44 / 58

Imputation estimators

• Impute the treated potential outcomes with Ŷ_i(1) = μ̂₁(X_i)

• Impute the control potential outcomes with Ŷ_i(0) = μ̂₀(X_i)

• Procedure:
  • Regress Y_i on X_i in the treated group and get predicted values for all units (treated or control)
  • Regress Y_i on X_i in the control group and get predicted values for all units (treated or control)
  • Take the average difference between these predicted values

• More mathematically, it looks like this:

τ̂_imp = (1/N) Σ_i [μ̂₁(X_i) − μ̂₀(X_i)]

• Sometimes called an imputation estimator

45 / 58
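To make the formula concrete, here is a pure-Python sketch with a single binary covariate (all names and parameter values are invented for illustration). With one discrete covariate, the "regressions" μ̂_d(x) are just cell means, and τ̂_imp reduces exactly to the size-weighted stratified difference in means:

```python
import random

random.seed(1)
n = 10_000
x = [random.random() < 0.3 for _ in range(n)]             # one binary covariate
d = [random.random() < (0.7 if xi else 0.4) for xi in x]  # confounded treatment
# CATEs: tau(x=1) = 2, tau(x=0) = 1, so the ATE is 0.3*2 + 0.7*1 = 1.3
y = [2.0 * di * xi + 1.0 * di * (not xi) + random.gauss(0, 1)
     for xi, di in zip(x, d)]

def mean(v):
    return sum(v) / len(v)

# "Regress" Y on X within each treatment group: with one binary X, the fitted
# values mu_hat[d][x] are just the four cell means
mu_hat = {dd: {s: mean([yi for yi, di, xi in zip(y, d, x) if di == dd and xi == s])
               for s in (True, False)}
          for dd in (True, False)}

# Predict BOTH potential outcomes for every unit, then average the difference
tau_imp = mean([mu_hat[True][xi] - mu_hat[False][xi] for xi in x])

# Same thing as weighting the within-strata effects by strata size
p1 = mean(x)
strat = (p1 * (mu_hat[True][True] - mu_hat[False][True]) +
         (1 - p1) * (mu_hat[True][False] - mu_hat[False][False]))
print(round(tau_imp, 2), abs(tau_imp - strat) < 1e-9)
```

Unlike the raw OLS coefficient, the imputation estimator weights strata by size alone, so it targets the ATE.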

Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample

• Useful trick: fit a model on the entire data and use model.frame() to get the right design matrix

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61

46 / 58

Notes on imputation estimators

• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE

• Why don't people use this?
  • Most people don't know the results we've been talking about
  • Harder to implement than vanilla OLS

• Can use linear regression to estimate μ̂_d(x) = x′β̂_d

• Recent trend is to estimate μ̂_d(x) via non-parametric methods, such as:
  • Kernel regression, local linear regression, regression trees, etc.
  • Easiest is generalized additive models (GAMs)

47 / 58

Imputation estimator visualization

[Figure: simulated data, y (−0.5 to 1.5) plotted against x (−2 to 3), treated and control units marked separately]

48 / 58

Imputation estimator visualization

[Figure: same data, y against x, treated and control units marked separately]

49 / 58

Imputation estimator visualization

[Figure: same data, y against x, treated and control units marked separately]

50 / 58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i

[Figure: y (−1.0 to 0.5) against x (−2 to 2), treated and control units with nonlinear CEFs]

51 / 58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i

[Figure: same data, y against x, treated and control units with nonlinear CEFs]

52 / 58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i

[Figure: same data, y against x, treated and control units with nonlinear CEFs]

53 / 58

Using semiparametric regression

• Here, the CEFs are nonlinear, but we don't know their form

• We can use GAMs from the mgcv package for flexible estimation:

library(mgcv)

mod0 <- gam(y ~ s(x), subset = d == 0)

summary(mod0)

Family: gaussian

Link function: identity

Formula:

y ~ s(x)

Parametric coefficients:

            Estimate Std. Error t value Pr(>|t|)

(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:

      edf Ref.df    F p-value

s(x) 6.03   7.08 41.3  <2e-16 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%

GCV = 0.0093204  Scale est. = 0.0071351  n = 30

54 / 58

Using GAMs

[Figure: GAM fits of y against x for treated and control units]

55 / 58

Using GAMs

[Figure: GAM fits of y against x for treated and control units]

56 / 58

Using GAMs

[Figure: GAM fits of y against x for treated and control units]

57 / 58

Limited dependent variables

• Usual advice: model the data from first principles
  • Logit/probit for binary, Poisson for counts, etc.

• OLS is a-ok with limited DVs when:
  • Binary treatment and no covariates (just diff-in-means)
  • Binary treatment, discrete covariates, and saturated models (stratified diff-in-means)

• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong

• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models
  • Could go wrong in small samples
  • If using nonlinear models, always get effects on the scale of the outcome

58 / 58
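The first "a-ok" case can be sketched quickly in Python (probabilities are made up for illustration): with a binary outcome, a binary treatment, and no covariates, the OLS slope is mechanically the difference in proportions, so the linear model imposes no extra distributional assumption.

```python
import random

random.seed(3)
n = 5_000
d = [random.random() < 0.5 for _ in range(n)]
# Binary outcome: P(Y = 1) is 0.3 in control and 0.5 under treatment
y = [1 if random.random() < (0.5 if di else 0.3) else 0 for di in d]

def mean(v):
    return sum(v) / len(v)

# OLS slope of Y on a binary D: Cov(Y, D) / V(D)
ybar, dbar = mean(y), mean(d)
slope = (sum((yi - ybar) * (di - dbar) for yi, di in zip(y, d)) /
         sum((di - dbar) ** 2 for di in d))

# ...which is mechanically the difference in proportions
diff = (mean([yi for yi, di in zip(y, d) if di]) -
        mean([yi for yi, di in zip(y, d) if not di]))
print(abs(slope - diff) < 1e-9)
```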

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects

estimatelibrary(mgcv)

mod0 lt- gam(y ~ s(x) subset = d == 0)

summary(mod0)

Family gaussian

Link function identity

Formula

y ~ s(x)

Parametric coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) -00225 00154 -146 016

Approximate significance of smooth terms

edf Refdf F p-value

s(x) 603 708 413 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

R-sq(adj) = 0909 Deviance explained = 928

GCV = 00093204 Scale est = 00071351 n = 30

54 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

55 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

56 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

57 58

Limited dependent variables

bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc

bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models

(stratified diff-in-means)

bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong

bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models

Could go wrong in small samples If using nonlinear models always get effects on the scale of the

outcome

58 58

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 30: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

OLS constant effects simulation

Model with linear covariates, constant 0 effect of treatment:

library(mvtnorm)
n <- 100
p <- 4
X <- rmvnorm(n = 100, mean = rep(0, p))
gamma <- c(2.74, 1.37, 1.37, 1.37)
y <- 21.0 + X %*% c(gamma) + rnorm(n)
alpha <- c(-1, 0.5, -0.5, -0.1)
d.probs <- boot::inv.logit(X %*% alpha)
d <- rbinom(n, size = 1, prob = d.probs)

30 58

OLS with no covariates

summary(lm(y ~ d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   22.036      0.458   48.13  < 2e-16
d             -2.869      0.654   -4.39 0.000029
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.3 on 98 degrees of freedom
Multiple R-squared: 0.164,  Adjusted R-squared: 0.156
F-statistic: 19.2 on 1 and 98 DF,  p-value: 0.000029

31 58

OLS with covariates

summary(lm(y ~ d + X))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  20.9952      0.159  132.07   <2e-16
d            -0.128       0.253   -0.50     0.62
X1            2.7368      0.126   21.75   <2e-16
X2            1.3677      0.114   12.01   <2e-16
X3            1.3673      0.130   10.51   <2e-16
X4            1.3570      0.106   12.80   <2e-16
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 94 degrees of freedom
Multiple R-squared: 0.999,  Adjusted R-squared: 0.999
F-statistic: 2.44e+04 on 5 and 94 DF,  p-value: <2e-16

32 58

What happens with nonlinearity?

Suppose we can only observe the following covariates:

z1 <- exp(X[, 1]/2)
z2 <- X[, 2]/(1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3]/25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

Implies that Y_i and D_i are functions of log(Z_i1), Z_i2, Z_i1² Z_i2, 1/log(Z_i1), Z_i3^{1/3}/log(Z_i1), and Z_i4^{1/2}

⇒ the regression is a nonlinear function of the observed covariates

33 58

When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.18728    3.00799    0.73    0.469
d           -0.69292    0.33854   -2.05    0.043
z1           3.63110    0.26970   13.46   <2e-16
z2          -0.29033    0.36619   -0.79    0.430
z3           8.62022    4.31030    2.00    0.048
z4           0.40210    0.03290   12.23   <2e-16
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.4 on 94 degrees of freedom
Multiple R-squared: 0.855,  Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF,  p-value: <2e-16

34 58

3/ Regression with Heterogeneous Treatment Effects

35 58

Heterogeneous effects, binary treatment

• Completely randomized experiment:

Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)
    = Y_i(0) + (Y_i(1) − Y_i(0)) D_i
    = μ_0 + τ_i D_i + (Y_i(0) − μ_0)
    = μ_0 + τ D_i + (Y_i(0) − μ_0) + (τ_i − τ) · D_i
    = μ_0 + τ D_i + ε_i

• Error term now includes two components:
  1. "Baseline" variation in the outcome: (Y_i(0) − μ_0)
  2. Variation in the treatment effect: (τ_i − τ)
• Easy to verify that under an experiment, E[ε_i | D_i] = 0
• Thus, OLS estimates the ATE with no covariates

36 58
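This claim is easy to check with a quick simulation (a sketch, not from the slides; Bernoulli assignment and a normal baseline outcome are assumed): even with heterogeneous unit-level effects τ_i, randomization makes the error mean-independent of D_i, so the simple regression recovers the ATE.

```r
set.seed(02138)
n <- 1e5
tau_i <- rnorm(n, mean = 2, sd = 1)   # heterogeneous unit effects, ATE = 2
y0 <- rnorm(n)                        # baseline outcomes Y_i(0)
d <- rbinom(n, 1, 0.5)                # randomized treatment (Bernoulli here)
y <- y0 + tau_i * d                   # observed outcome
coef(lm(y ~ d))[["d"]]                # close to 2, the ATE
```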

Adding covariates

• What happens with no unmeasured confounders? Need to condition on X_i now
• Remember identification of the ATE/ATT using iterated expectations
• ATE is the weighted sum of CATEs:

τ = Σ_x τ(x) P[X_i = x]

• ATE/ATT are weighted averages of CATEs
• What about the regression estimand, τ_R? How does it relate to the ATE/ATT?

37 58
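As a tiny worked example of this identity (the strata values are the ones reused in the OLS weighting example below):

```r
tau_x <- c(1, -1)        # CATEs: tau(1) and tau(0)
p_x <- c(0.75, 0.25)     # P[X_i = 1] and P[X_i = 0]
sum(tau_x * p_x)         # ATE = 0.5
```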

Heterogeneous effects and regression

• Let's investigate this under a saturated regression model:

Y_i = Σ_x B_xi α_x + τ_R D_i + e_i

• Use a dummy variable for each unique combination of X_i: B_xi = 1(X_i = x)
• Linear in X_i by construction

38 58

Investigating the regression coefficient

• How can we investigate τ_R? Well, we can rely on the regression anatomy:

τ_R = Cov(Y_i, D_i − E[D_i|X_i]) / V(D_i − E[D_i|X_i])

• D_i − E[D_i|X_i] is the residual from a regression of D_i on the full set of dummies
• With a little work, we can show:

τ_R = E[τ(X_i)(D_i − E[D_i|X_i])²] / E[(D_i − E[D_i|X_i])²]
    = E[τ(X_i) σ²_d(X_i)] / E[σ²_d(X_i)]

• σ²_d(x) = V[D_i|X_i = x] is the conditional variance of treatment assignment

39 58
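The regression-anatomy identity can be verified numerically (a sketch with an assumed data-generating process; `b` plays the role of the full set of dummies): the coefficient on D_i from the saturated model equals the slope from regressing Y_i on the residualized treatment.

```r
set.seed(02138)
n <- 1000
x <- sample(1:3, n, replace = TRUE)          # discrete covariate
d <- rbinom(n, 1, c(0.2, 0.5, 0.8)[x])       # treatment prob. varies by stratum
y <- x + x * d + rnorm(n)                    # heterogeneous effects: tau(x) = x
b <- factor(x)                               # full set of stratum dummies
full <- coef(lm(y ~ d + b))[["d"]]
dtilde <- residuals(lm(d ~ b))               # D_i - Ehat[D_i | X_i]
anatomy <- coef(lm(y ~ dtilde))[["dtilde"]]
all.equal(full, anatomy)                     # TRUE: the two slopes coincide
```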

ATE versus OLS

τ_R = E[τ(X_i) W_i] = Σ_x τ(x) (σ²_d(x) / E[σ²_d(X_i)]) P[X_i = x]

• Compare to the ATE:

τ = E[τ(X_i)] = Σ_x τ(x) P[X_i = x]

• Both weight strata relative to their size, P[X_i = x]
• OLS weights strata more heavily if the treatment variance in those strata, σ²_d(x), is high relative to the average variance across strata, E[σ²_d(X_i)]
• The ATE weights strata only by their size

40 58

Regression weighting

W_i = σ²_d(X_i) / E[σ²_d(X_i)]

• Why does OLS weight like this?
• OLS is a minimum-variance estimator ⇒ more weight to more precise within-strata estimates
• Within-strata estimates are most precise when the treatment is evenly spread, and thus has the highest variance
• If D_i is binary, then we know the conditional variance will be:

σ²_d(x) = P[D_i = 1|X_i = x] (1 − P[D_i = 1|X_i = x])
        = e(x) (1 − e(x))

• Maximum variance with P[D_i = 1|X_i = x] = 1/2

41 58

OLS weighting example
• Binary covariate:

P[X_i = 1] = 0.75        P[X_i = 0] = 0.25
P[D_i = 1|X_i = 1] = 0.9   P[D_i = 1|X_i = 0] = 0.5
σ²_d(1) = 0.09           σ²_d(0) = 0.25
τ(1) = 1                 τ(0) = −1

• Implies the ATE is τ = 0.5
• Average conditional variance: E[σ²_d(X_i)] = 0.13
• ⇒ weights for X_i = 1 are 0.09/0.13 = 0.692; for X_i = 0, 0.25/0.13 = 1.92

τ_R = E[τ(X_i) W_i]
    = τ(1) W(1) P[X_i = 1] + τ(0) W(0) P[X_i = 0]
    = 1 × 0.692 × 0.75 + −1 × 1.92 × 0.25
    = 0.039

42 58
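A simulation sketch of this example (the data-generating process beyond the stated strata probabilities and CATEs is assumed): with a saturated regression, the coefficient on the treatment lands near τ_R ≈ 0.039 rather than the ATE of 0.5.

```r
set.seed(02138)
n <- 1e6
x <- rbinom(n, 1, 0.75)               # P[X = 1] = 0.75
e <- ifelse(x == 1, 0.9, 0.5)         # propensity scores from the example
d <- rbinom(n, 1, e)
tau_x <- ifelse(x == 1, 1, -1)        # CATEs: tau(1) = 1, tau(0) = -1
y <- 2 * x + tau_x * d + rnorm(n)     # assumed baseline CEF, linear in x
coef(lm(y ~ d + x))[["d"]]            # close to 0.039, far from the ATE 0.5
```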

When will OLS estimate the ATE?

• When does τ = τ_R?
• Constant treatment effects: τ(x) = τ = τ_R
• Constant probability of treatment: e(x) = P[D_i = 1|X_i = x] = e
  • Implies that the OLS weights are 1
• Incorrect linearity assumption in X_i will lead to more bias

43 58

Other ways to use regression

• What's the path forward?
  • Accept the bias (might be relatively small with saturated models)
  • Use a different regression approach
• Let μ_d(x) = E[Y_i(d)|X_i = x] be the CEF for the potential outcome under D_i = d
• By consistency and no unmeasured confounders, we have μ_d(x) = E[Y_i|D_i = d, X_i = x]
• Estimate a regression of Y_i on X_i among the D_i = d group
• Then μ̂_d(x) is just a predicted value from the regression for X_i = x
• How can we use this?

44 58

Imputation estimators

• Impute the treated potential outcomes with Ŷ_i(1) = μ̂_1(X_i)
• Impute the control potential outcomes with Ŷ_i(0) = μ̂_0(X_i)
• Procedure:
  • Regress Y_i on X_i in the treated group and get predicted values for all units (treated or control)
  • Regress Y_i on X_i in the control group and get predicted values for all units (treated or control)
  • Take the average difference between these predicted values
• More formally, it looks like this:

τ̂_imp = (1/N) Σ_i [μ̂_1(X_i) − μ̂_0(X_i)]

• Sometimes called an imputation estimator

45 58

Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample
• Useful trick: use a model on the entire data and model.frame() to get the right design matrix

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61

46 58

Notes on imputation estimators

• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE
• Why don't people use this?
  • Most people don't know the results we've been talking about
  • Harder to implement than vanilla OLS
• Can use linear regression to estimate μ̂_d(x) = x′β_d
• Recent trend is to estimate μ̂_d(x) via non-parametric methods, such as:
  • Kernel regression, local linear regression, regression trees, etc.
  • Easiest is generalized additive models (GAMs)

47 58

Imputation estimator visualization

[figure: scatter of y against x for treated and control units, with the group-specific regression fits and imputed values built up over three frames]

48 58

Nonlinear relationships
• Same idea, but with a nonlinear relationship between Y_i and X_i

[figure: scatter of y against x for treated and control units with a visibly nonlinear CEF, built up over three frames]

51 58

Using semiparametric regression
• Here, the CEFs are nonlinear, but we don't know their form
• We can use GAMs from the mgcv package for a flexible estimate:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204  Scale est. = 0.0071351  n = 30

54 58
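The code above fits only the control-group model; here is a sketch of the full GAM-based imputation estimator on simulated data (the data-generating process and variable names are assumed for illustration):

```r
library(mgcv)
set.seed(02138)
n <- 200
x <- runif(n, -2, 2)
d <- rbinom(n, 1, 0.5)
y <- sin(x) + 0.5 * d + rnorm(n, sd = 0.2)   # true ATE = 0.5
mod1 <- gam(y ~ s(x), subset = d == 1)       # treated-group CEF
mod0 <- gam(y ~ s(x), subset = d == 0)       # control-group CEF
newdat <- data.frame(x = x)                  # impute for every unit
tau_imp <- mean(predict(mod1, newdat) - predict(mod0, newdat))
tau_imp                                      # close to 0.5
```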

Using GAMs

[figure: GAM fits for the treated and control groups overlaid on the scatter of y against x, built up over three frames]

55 58

Limited dependent variables

• Usual advice: model the data from first principles
  • Logit/probit for binary, Poisson for counts, etc.
• OLS is a-ok with limited DVs when:
  • Binary treatment and no covariates (just diff-in-means)
  • Binary treatment, discrete covariates, and saturated models (stratified diff-in-means)
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong
• Even in unsaturated models, the marginal effects from OLS are often decent compared to nonlinear models
  • Could go wrong in small samples
  • If using nonlinear models, always get effects on the scale of the outcome

58 58
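For the first case (binary outcome, binary treatment, no covariates), the equivalence is exact: the OLS coefficient is the difference in proportions, and a saturated logit implies the same risk difference. A small check with an assumed data-generating process:

```r
set.seed(02138)
n <- 500
d <- rbinom(n, 1, 0.5)
ybin <- rbinom(n, 1, 0.3 + 0.2 * d)              # binary outcome
ols <- coef(lm(ybin ~ d))[["d"]]                 # difference in proportions
lgt <- glm(ybin ~ d, family = binomial)
ps <- predict(lgt, data.frame(d = c(0, 1)), type = "response")
rd <- ps[[2]] - ps[[1]]                          # logit-implied risk difference
all.equal(ols, rd, tolerance = 1e-6)             # TRUE: identical estimates
```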

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 31: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

OLS with no covariates

summary(lm(y ~ d))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 22036 458 4813 lt 2e-16

d -2869 654 -439 0000029

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 33 on 98 degrees of freedom

Multiple R-squared 0164 Adjusted R-squared 0156

F-statistic 192 on 1 and 98 DF p-value 0000029

31 58

OLS with covariates

summary(lm(y ~ d + X))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 209952 0159 13207 lt2e-16

d -0128 0253 -05 062

X1 27368 0126 2175 lt2e-16

X2 13677 0114 1201 lt2e-16

X3 13673 0130 1051 lt2e-16

X4 13570 0106 1280 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 1 on 94 degrees of freedom

Multiple R-squared 0999 Adjusted R-squared 0999

F-statistic 244e+04 on 5 and 94 DF p-value lt2e-16

32 58

What happens with nonlinearity

Suppose we can only observe the following covariates

z1 lt- exp(X[ 1]2)

z2 lt- X[ 2](1 + exp(X[ 1])) + 10

z3 lt- (X[ 1] X[ 3]25 + 06)^3

z4 lt- (X[ 2] + X[ 4] + 20)^2

Implies that 119884119894 and 119863119894 are functions of log(1198851198941113568) 1198851198941113569 119885111356911989411135681198851198941113569

1 log(1198851198941113568) 1198851198941113570 log(1198851198941113568) and 119883111356811135691198941113571

Regression is a nonlinear function of the observed covariates

33 58

When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 218728 300799 073 0469

d -69292 33854 -205 0043

z1 363110 26970 1346 lt2e-16

z2 -29033 36619 -079 0430

z3 862022 431030 200 0048

z4 04021 00329 1223 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 14 on 94 degrees of freedom

Multiple R-squared 0855 Adjusted R-squared 0847

F-statistic 111 on 5 and 94 DF p-value lt2e-16

34 58

3 Regression withHeterogeneousTreatment Effects

35 58

Heterogeneous effects binary treatment

bull Completely randomized experiment

119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894

= 1205831113567 + 120591119894119863119894 + (119884119894(0) minus 1205831113567)= 1205831113567 + 120591119863119894 + (119884119894(0) minus 1205831113567) + (120591119894 minus 120591) sdot 119863119894

= 1205831113567 + 120591119863119894 + 120576119894

bull Error term now includes two components1 ldquoBaselinerdquo variation in the outcome (119884119894(0) minus 1205831113567)2 Variation in the treatment effect (120591119894 minus 120591)

bull Easy to verify that under experiment 120124[120576119894|119863119894] = 0bull Thus OLS estimates the ATE with no covariates

36 58

Adding covariates

bull What happens with no unmeasured confounders Need tocondition on 119883119894 now

bull Remember identification of the ATEATT using iteratedexpectations

bull ATE is the weighted sum of CATEs

120591 = 1114012119909120591(119909) Pr[119883119894 = 119909]

bull ATEATT are weighted averages of CATEsbull What about the regression estimand 120591119877 How does it related

to the ATEATT

37 58

Heterogeneous effects and regression

bull Letrsquos investigate this under a saturated regression model

119884119894 = 1114012119909119861119909119894120572119909 + 120591119877119863119894 + 119890119894

bull Use a dummy variable for each unique combination of 119883119894119861119909119894 = 120128(119883119894 = 119909)

bull Linear in 119883119894 by construction

38 58

Investigating the regression coefficient

bull How can we investigate 120591119877 Well we can rely on theregression anatomy

120591119877 = Cov(119884119894 119863119894 minus 119864[119863119894|119883119894])120141(119863119894 minus 119864[119863119894|119883119894])

bull 119863119894 minus 120124[119863119894|119883119894] is the residual from a regression of 119863119894 on the fullset of dummies

bull With a little work we can show

120591119877 =120124 1114094120591(119883119894)(119863119894 minus 120124[119863119894|119883119894])11135691114097

120124[(119863119894 minus 119864[119863119894|119883119894])1113569]=

120124[120591(119883119894)1205901113569119889(119883119894)]120124[1205901113569119889(119883119894)]

bull 1205901113569119889(119909) = 120141[119863119894|119883119894 = 119909] is the conditional variance of treatmentassignment

39 58

ATE versus OLS

120591119877 = 120124[120591(119883119894)119882119894] = 1114012119909120591(119909)

1205901113569119889(119909)120124[1205901113569119889(119883119894)]

โ„™[119883119894 = 119909]

bull Compare to the ATE

120591 = 120124[120591(119883119894)] = 1114012119909120591(119909)โ„™[119883119894 = 119909]

bull Both weight strata relative to their size (โ„™[119883119894 = 119909])bull OLS weights strata higher if the treatment variance in those

strata (1205901113569119889(119909)) is higher in those strata relative to the averagevariance across strata (120124[1205901113569119889(119883119894)])

bull The ATE weights only by their size

40 58

Regression weighting

119882119894 =1205901113569119889(119883119894)

120124[1205901113569119889(119883119894)]

bull Why does OLS weight like thisbull OLS is a minimum-variance estimator more weight to

more precise within-strata estimatesbull Within-strata estimates are most precise when the treatment

is evenly spread and thus has the highest variancebull If 119863119894 is binary then we know the conditional variance will be

1205901113569119889(119909) = โ„™[119863119894 = 1|119883119894 = 119909] (1 minus โ„™[119863119894 = 1|119883119894 = 119909])= 119890(119909) (1 minus 119890(119909))

bull Maximum variance with โ„™[119863119894 = 1|119883119894 = 119909] = 12

41 58

OLS weighting examplebull Binary covariate

โ„™[119883119894 = 1] = 075 โ„™[119883119894 = 0] = 025โ„™[119863119894 = 1|119883119894 = 1] = 09 โ„™[119863119894 = 1|119883119894 = 0] = 05

1205901113569119889(1) = 009 1205901113569119889(0) = 025120591(1) = 1 120591(0) = minus1

bull Implies the ATE is 120591 = 05bull Average conditional variance 120124[1205901113569119889(119883119894)] = 013bull weights for 119883119894 = 1 are 009013 = 0692 for 119883119894 = 0

025013 = 192120591119877 = 120124[120591(119883119894)119882119894]

= 120591(1)119882(1)โ„™[119883119894 = 1] + 120591(0)119882(0)โ„™[119883119894 = 0]= 1 times 0692 times 075 + minus1 times 192 times 025= 0039

42 58

When will OLS estimate the ATE

bull When does 120591 = 120591119877bull Constant treatment effects 120591(119909) = 120591 = 120591119877bull Constant probability of treatment 119890(119909) = โ„™[119863119894 = 1|119883119894 = 119909] = 119890

Implies that the OLS weights are 1

bull Incorrect linearity assumption in 119883119894 will lead to more bias

43 58

Other ways to use regression

bull Whatrsquos the path forward Accept the bias (might be relatively small with saturated

models) Use a different regression approach

bull Let 120583119889(119909) = 120124[119884119894(119889)|119883119894 = 119909] be the CEF for the potentialoutcome under 119863119894 = 119889

bull By consistency and nuc we have 120583119889(119909) = 120124[119884119894|119863119894 = 119889119883119894 = 119909]bull Estimate a regression of 119884119894 on 119883119894 among the 119863119894 = 119889 groupbull Then 119889(119909) is just a predicted value from the regression for

119883119894 = 119909bull How can we use this

44 58

Imputation estimators

bull Impute the treated potential outcomes with 1114023119884119894(1) = 1113568(119883119894)bull Impute the control potential outcomes with 1114023119884119894(0) = 1113567(119883119894)bull Procedure

Regress 119884119894 on 119883119894 in the treated group and get predicted valuesfor all units (treated or control)

Regress 119884119894 on 119883119894 in the control group and get predicted valuesfor all units (treated or control)

Take the average difference between these predicted values

bull More mathematically look like this

120591119894119898119901 =1119873

11140121198941113568(119883119894) minus 1113567(119883119894)

bull Sometimes called an imputation estimator

45 58

Simple imputation estimator

bull Use predict() from the within-group models on the data fromthe entire sample

bull Useful trick use a model on the entire data andmodelframe() to get the right design matrix

heterogeneous effects

yhet lt- ifelse(d == 1 y + rnorm(n 0 5) y)

mod lt- lm(yhet ~ d + X)

mod1 lt- lm(yhet ~ X subset = d == 1)

mod0 lt- lm(yhet ~ X subset = d == 0)

y1imps lt- predict(mod1 modelframe(mod))

y0imps lt- predict(mod0 modelframe(mod))

mean(y1imps - y0imps)

[1] 061

46 58

Notes on imputation estimators

bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE

bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS

bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods

such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)

47 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

48 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

49 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

50 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

51 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

52 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

53 58

Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible

estimatelibrary(mgcv)

mod0 lt- gam(y ~ s(x) subset = d == 0)

summary(mod0)

Family gaussian

Link function identity

Formula

y ~ s(x)

Parametric coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) -00225 00154 -146 016

Approximate significance of smooth terms

edf Refdf F p-value

s(x) 603 708 413 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

R-sq(adj) = 0909 Deviance explained = 928

GCV = 00093204 Scale est = 00071351 n = 30

54 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

55 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

56 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

57 58

Limited dependent variables

bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc

bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models

(stratified diff-in-means)

bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong

bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models

Could go wrong in small samples If using nonlinear models always get effects on the scale of the

outcome

58 58

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 32: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

OLS with covariates

summary(lm(y ~ d + X))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 209952 0159 13207 lt2e-16

d -0128 0253 -05 062

X1 27368 0126 2175 lt2e-16

X2 13677 0114 1201 lt2e-16

X3 13673 0130 1051 lt2e-16

X4 13570 0106 1280 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 1 on 94 degrees of freedom

Multiple R-squared 0999 Adjusted R-squared 0999

F-statistic 244e+04 on 5 and 94 DF p-value lt2e-16

32 58

What happens with nonlinearity

Suppose we can only observe the following covariates

z1 lt- exp(X[ 1]2)

z2 lt- X[ 2](1 + exp(X[ 1])) + 10

z3 lt- (X[ 1] X[ 3]25 + 06)^3

z4 lt- (X[ 2] + X[ 4] + 20)^2

Implies that 119884119894 and 119863119894 are functions of log(1198851198941113568) 1198851198941113569 119885111356911989411135681198851198941113569

1 log(1198851198941113568) 1198851198941113570 log(1198851198941113568) and 119883111356811135691198941113571

Regression is a nonlinear function of the observed covariates

33 58

When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 218728 300799 073 0469

d -69292 33854 -205 0043

z1 363110 26970 1346 lt2e-16

z2 -29033 36619 -079 0430

z3 862022 431030 200 0048

z4 04021 00329 1223 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 14 on 94 degrees of freedom

Multiple R-squared 0855 Adjusted R-squared 0847

F-statistic 111 on 5 and 94 DF p-value lt2e-16

34 58

3 Regression withHeterogeneousTreatment Effects

35 58

Heterogeneous effects binary treatment

bull Completely randomized experiment

119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894

= 1205831113567 + 120591119894119863119894 + (119884119894(0) minus 1205831113567)= 1205831113567 + 120591119863119894 + (119884119894(0) minus 1205831113567) + (120591119894 minus 120591) sdot 119863119894

= 1205831113567 + 120591119863119894 + 120576119894

bull Error term now includes two components1 ldquoBaselinerdquo variation in the outcome (119884119894(0) minus 1205831113567)2 Variation in the treatment effect (120591119894 minus 120591)

bull Easy to verify that under experiment 120124[120576119894|119863119894] = 0bull Thus OLS estimates the ATE with no covariates

36 58

Adding covariates

bull What happens with no unmeasured confounders Need tocondition on 119883119894 now

bull Remember identification of the ATEATT using iteratedexpectations

bull ATE is the weighted sum of CATEs

120591 = 1114012119909120591(119909) Pr[119883119894 = 119909]

bull ATEATT are weighted averages of CATEsbull What about the regression estimand 120591119877 How does it related

to the ATEATT

37 58

Heterogeneous effects and regression

bull Letrsquos investigate this under a saturated regression model

119884119894 = 1114012119909119861119909119894120572119909 + 120591119877119863119894 + 119890119894

bull Use a dummy variable for each unique combination of 119883119894119861119909119894 = 120128(119883119894 = 119909)

bull Linear in 119883119894 by construction

38 58

Investigating the regression coefficient

• How can we investigate $\tau_R$? We can rely on the regression anatomy:

$$\tau_R = \frac{\mathrm{Cov}(Y_i, D_i - E[D_i \mid X_i])}{\mathrm{Var}(D_i - E[D_i \mid X_i])}$$

• $D_i - \mathbb{E}[D_i \mid X_i]$ is the residual from a regression of $D_i$ on the full set of dummies
• With a little work, we can show:

$$\tau_R = \frac{\mathbb{E}\left[\tau(X_i)\,(D_i - \mathbb{E}[D_i \mid X_i])^2\right]}{\mathbb{E}\left[(D_i - \mathbb{E}[D_i \mid X_i])^2\right]} = \frac{\mathbb{E}[\tau(X_i)\,\sigma^2_d(X_i)]}{\mathbb{E}[\sigma^2_d(X_i)]}$$

• $\sigma^2_d(x) = \mathrm{Var}[D_i \mid X_i = x]$ is the conditional variance of treatment assignment

39 58

ATE versus OLS

$$\tau_R = \mathbb{E}[\tau(X_i) W_i] = \sum_x \tau(x)\, \frac{\sigma^2_d(x)}{\mathbb{E}[\sigma^2_d(X_i)]}\, \mathbb{P}[X_i = x]$$

• Compare to the ATE:

$$\tau = \mathbb{E}[\tau(X_i)] = \sum_x \tau(x)\, \mathbb{P}[X_i = x]$$

• Both weight strata relative to their size ($\mathbb{P}[X_i = x]$)
• OLS weights strata higher if the treatment variance in those strata ($\sigma^2_d(x)$) is high relative to the average variance across strata ($\mathbb{E}[\sigma^2_d(X_i)]$)
• The ATE weights strata only by their size

40 58

Regression weighting

$$W_i = \frac{\sigma^2_d(X_i)}{\mathbb{E}[\sigma^2_d(X_i)]}$$

• Why does OLS weight like this?
• OLS is a minimum-variance estimator: more weight to more precise within-strata estimates
• Within-strata estimates are most precise when the treatment is evenly spread and thus has the highest variance
• If $D_i$ is binary, then we know the conditional variance will be:

$$\sigma^2_d(x) = \mathbb{P}[D_i = 1 \mid X_i = x]\,(1 - \mathbb{P}[D_i = 1 \mid X_i = x]) = e(x)(1 - e(x))$$

• Maximum variance with $\mathbb{P}[D_i = 1 \mid X_i = x] = 1/2$

41 58

OLS weighting example
• Binary covariate:

$$\mathbb{P}[X_i = 1] = 0.75 \qquad \mathbb{P}[X_i = 0] = 0.25$$
$$\mathbb{P}[D_i = 1 \mid X_i = 1] = 0.9 \qquad \mathbb{P}[D_i = 1 \mid X_i = 0] = 0.5$$
$$\sigma^2_d(1) = 0.09 \qquad \sigma^2_d(0) = 0.25$$
$$\tau(1) = 1 \qquad \tau(0) = -1$$

• Implies the ATE is $\tau = 0.5$
• Average conditional variance: $\mathbb{E}[\sigma^2_d(X_i)] = 0.13$
• Weights: for $X_i = 1$, $0.09/0.13 = 0.692$; for $X_i = 0$, $0.25/0.13 = 1.92$

$$
\begin{aligned}
\tau_R &= \mathbb{E}[\tau(X_i) W_i] \\
&= \tau(1) W(1)\, \mathbb{P}[X_i = 1] + \tau(0) W(0)\, \mathbb{P}[X_i = 0] \\
&= 1 \times 0.692 \times 0.75 + (-1) \times 1.92 \times 0.25 \\
&= 0.039
\end{aligned}
$$

42 58
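The arithmetic in this example can be checked directly. A small Python sketch (the course examples are in R) using the same stratum probabilities, propensity scores, and CATEs:

```python
# Stratum probabilities, propensity scores, and CATEs from the slide
p_x = {1: 0.75, 0: 0.25}
e = {1: 0.9, 0: 0.5}
tau = {1: 1.0, 0: -1.0}

var_d = {x: e[x] * (1 - e[x]) for x in p_x}    # sigma^2_d(x) = e(x)(1 - e(x))
avg_var = sum(var_d[x] * p_x[x] for x in p_x)  # E[sigma^2_d(X_i)] = 0.13

ate = sum(tau[x] * p_x[x] for x in p_x)        # 0.5
tau_r = sum(tau[x] * (var_d[x] / avg_var) * p_x[x] for x in p_x)  # about 0.039
```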

When will OLS estimate the ATE

• When does $\tau = \tau_R$?
• Constant treatment effects: $\tau(x) = \tau = \tau_R$
• Constant probability of treatment: $e(x) = \mathbb{P}[D_i = 1 \mid X_i = x] = e$
  - Implies that the OLS weights are 1
• An incorrect linearity assumption in $X_i$ will lead to more bias

43 58

Other ways to use regression

• What's the path forward?
  - Accept the bias (might be relatively small with saturated models)
  - Use a different regression approach
• Let $\mu_d(x) = \mathbb{E}[Y_i(d) \mid X_i = x]$ be the CEF for the potential outcome under $D_i = d$
• By consistency and no unmeasured confounders, we have $\mu_d(x) = \mathbb{E}[Y_i \mid D_i = d, X_i = x]$
• Estimate a regression of $Y_i$ on $X_i$ among the $D_i = d$ group
• Then $\hat{\mu}_d(x)$ is just a predicted value from the regression for $X_i = x$
• How can we use this?

44 58

Imputation estimators

• Impute the treated potential outcomes with $\hat{Y}_i(1) = \hat{\mu}_1(X_i)$
• Impute the control potential outcomes with $\hat{Y}_i(0) = \hat{\mu}_0(X_i)$
• Procedure:
  1. Regress $Y_i$ on $X_i$ in the treated group and get predicted values for all units (treated or control)
  2. Regress $Y_i$ on $X_i$ in the control group and get predicted values for all units (treated or control)
  3. Take the average difference between these predicted values
• More mathematically, it looks like this:

$$\hat{\tau}_{imp} = \frac{1}{N}\sum_i \left[\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)\right]$$

• Sometimes called an imputation estimator

45 58
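The estimator above can be sketched in a few lines. This is a Python illustration on simulated data (the course code is in R); the CEFs are chosen linear with a true ATE of 1, and all names here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
x = rng.normal(size=n)
d = rng.binomial(1, 0.5, n)
# Potential-outcome CEFs: mu_1(x) = 1 + x, mu_0(x) = x, so the true ATE is 1
y = d * (1 + x) + (1 - d) * x + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x])  # design matrix with intercept

# Fit separate regressions within each treatment group
beta1, *_ = np.linalg.lstsq(X[d == 1], y[d == 1], rcond=None)
beta0, *_ = np.linalg.lstsq(X[d == 0], y[d == 0], rcond=None)

# Impute both potential outcomes for every unit; average the difference
tau_imp = np.mean(X @ beta1 - X @ beta0)
```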

Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample
• Useful trick: use a model on the entire data and model.frame() to get the right design matrix

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61

46 58

Notes on imputation estimators

• If the $\hat{\mu}_d(x)$ are consistent estimators, then $\hat{\tau}_{imp}$ is consistent for the ATE
• Why don't people use this?
  - Most people don't know the results we've been talking about
  - Harder to implement than vanilla OLS
• Can use linear regression to estimate $\hat{\mu}_d(x) = x'\hat{\beta}_d$
• Recent trend is to estimate $\hat{\mu}_d(x)$ via non-parametric methods such as kernel regression, local linear regression, regression trees, etc.
• Easiest is generalized additive models (GAMs)

47 58

Imputation estimator visualization

[Figure: y plotted against x for treated and control units]

48 58

Imputation estimator visualization

[Figure: y plotted against x for treated and control units]

49 58

Imputation estimator visualization

[Figure: y plotted against x for treated and control units]

50 58

Nonlinear relationships
• Same idea, but with a nonlinear relationship between $Y_i$ and $X_i$:

[Figure: y plotted against x for treated and control units, with nonlinear CEFs]

51 58

Nonlinear relationships
• Same idea, but with a nonlinear relationship between $Y_i$ and $X_i$:

[Figure: y plotted against x for treated and control units, with nonlinear CEFs]

52 58

Nonlinear relationships
• Same idea, but with a nonlinear relationship between $Y_i$ and $X_i$:

[Figure: y plotted against x for treated and control units, with nonlinear CEFs]

53 58

Using semiparametric regression
• Here, the CEFs are nonlinear, but we don't know their form
• We can use GAMs from the mgcv package for a flexible estimate:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204  Scale est. = 0.0071351  n = 30

54 58

Using GAMs

[Figure: GAM fits of y on x for treated and control units]

55 58

Using GAMs

[Figure: GAM fits of y on x for treated and control units]

56 58

Using GAMs

[Figure: GAM fits of y on x for treated and control units]

57 58

Limited dependent variables

• Usual advice: model the data from first principles (logit/probit for binary outcomes, Poisson for counts, etc.)
• OLS is a-ok with limited DVs when:
  - binary treatment and no covariates (just diff-in-means)
  - binary treatment, discrete covariates, and saturated models (stratified diff-in-means)
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models
  - Could go wrong in small samples
  - If using nonlinear models, always get effects on the scale of the outcome

58 58
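The first "a-ok" case is an algebraic identity: with a single binary regressor, the OLS slope $\mathrm{Cov}(Y_i, D_i)/\mathrm{Var}(D_i)$ equals the difference in means, limited DV or not. A quick Python check on simulated binary data (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000
d = rng.binomial(1, 0.4, n)
y = rng.binomial(1, 0.2 + 0.3 * d)  # binary (limited) outcome

# OLS slope with a single regressor: cov(y, d) / var(d)
slope = np.cov(y, d, bias=True)[0, 1] / np.var(d)
diff_means = y[d == 1].mean() - y[d == 0].mean()
# slope equals diff_means up to floating-point error
```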

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 33: Gov 2002: 7. Regression and Causality

What happens with nonlinearity?

Suppose we can only observe the following covariates:

z1 <- exp(X[, 1]/2)
z2 <- X[, 2]/(1 + exp(X[, 1])) + 10
z3 <- (X[, 1] * X[, 3]/25 + 0.6)^3
z4 <- (X[, 2] + X[, 4] + 20)^2

Implies that $Y_i$ and $D_i$ are functions of $\log(Z_{i1})$, $Z_{i2}$, $Z^2_{i1}Z_{i2}$, $1/\log(Z_{i1})$, $Z_{i3}/\log(Z_{i1})$, and $Z^{1/2}_{i4}$

Regression is a nonlinear function of the observed covariates

33 58

When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.18728    3.00799    0.73    0.469
d           -6.9292     3.3854    -2.05    0.043
z1          36.3110     2.6970    13.46   <2e-16
z2          -2.9033     3.6619    -0.79    0.430
z3          86.2022    43.1030     2.00    0.048
z4           0.4021     0.0329    12.23   <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14 on 94 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.847
F-statistic: 111 on 5 and 94 DF, p-value: <2e-16
34 58

3/ Regression with Heterogeneous Treatment Effects

35 58

Page 34: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

When linearity goes wrong

summary(lm(y ~ d + z1 + z2 + z3 + z4))

Coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) 218728 300799 073 0469

d -69292 33854 -205 0043

z1 363110 26970 1346 lt2e-16

z2 -29033 36619 -079 0430

z3 862022 431030 200 0048

z4 04021 00329 1223 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

Residual standard error 14 on 94 degrees of freedom

Multiple R-squared 0855 Adjusted R-squared 0847

F-statistic 111 on 5 and 94 DF p-value lt2e-16

34 58

3 Regression withHeterogeneousTreatment Effects

35 58

Heterogeneous effects binary treatment

bull Completely randomized experiment

119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894

= 1205831113567 + 120591119894119863119894 + (119884119894(0) minus 1205831113567)= 1205831113567 + 120591119863119894 + (119884119894(0) minus 1205831113567) + (120591119894 minus 120591) sdot 119863119894

= 1205831113567 + 120591119863119894 + 120576119894

bull Error term now includes two components1 ldquoBaselinerdquo variation in the outcome (119884119894(0) minus 1205831113567)2 Variation in the treatment effect (120591119894 minus 120591)

bull Easy to verify that under experiment 120124[120576119894|119863119894] = 0bull Thus OLS estimates the ATE with no covariates

36 58

Adding covariates

bull What happens with no unmeasured confounders Need tocondition on 119883119894 now

bull Remember identification of the ATEATT using iteratedexpectations

bull ATE is the weighted sum of CATEs

120591 = 1114012119909120591(119909) Pr[119883119894 = 119909]

bull ATEATT are weighted averages of CATEsbull What about the regression estimand 120591119877 How does it related

to the ATEATT

37 58

Heterogeneous effects and regression

bull Letrsquos investigate this under a saturated regression model

119884119894 = 1114012119909119861119909119894120572119909 + 120591119877119863119894 + 119890119894

bull Use a dummy variable for each unique combination of 119883119894119861119909119894 = 120128(119883119894 = 119909)

bull Linear in 119883119894 by construction

38 58

Investigating the regression coefficient

bull How can we investigate 120591119877 Well we can rely on theregression anatomy

120591119877 = Cov(119884119894 119863119894 minus 119864[119863119894|119883119894])120141(119863119894 minus 119864[119863119894|119883119894])

bull 119863119894 minus 120124[119863119894|119883119894] is the residual from a regression of 119863119894 on the fullset of dummies

bull With a little work we can show

120591119877 =120124 1114094120591(119883119894)(119863119894 minus 120124[119863119894|119883119894])11135691114097

120124[(119863119894 minus 119864[119863119894|119883119894])1113569]=

120124[120591(119883119894)1205901113569119889(119883119894)]120124[1205901113569119889(119883119894)]

bull 1205901113569119889(119909) = 120141[119863119894|119883119894 = 119909] is the conditional variance of treatmentassignment

39 58

ATE versus OLS

120591119877 = 120124[120591(119883119894)119882119894] = 1114012119909120591(119909)

1205901113569119889(119909)120124[1205901113569119889(119883119894)]

โ„™[119883119894 = 119909]

bull Compare to the ATE

120591 = 120124[120591(119883119894)] = 1114012119909120591(119909)โ„™[119883119894 = 119909]

bull Both weight strata relative to their size (โ„™[119883119894 = 119909])bull OLS weights strata higher if the treatment variance in those

strata (1205901113569119889(119909)) is higher in those strata relative to the averagevariance across strata (120124[1205901113569119889(119883119894)])

bull The ATE weights only by their size

40 58

Regression weighting

119882119894 =1205901113569119889(119883119894)

120124[1205901113569119889(119883119894)]

bull Why does OLS weight like thisbull OLS is a minimum-variance estimator more weight to

more precise within-strata estimatesbull Within-strata estimates are most precise when the treatment

is evenly spread and thus has the highest variancebull If 119863119894 is binary then we know the conditional variance will be

1205901113569119889(119909) = โ„™[119863119894 = 1|119883119894 = 119909] (1 minus โ„™[119863119894 = 1|119883119894 = 119909])= 119890(119909) (1 minus 119890(119909))

bull Maximum variance with โ„™[119863119894 = 1|119883119894 = 119909] = 12

41 58

OLS weighting examplebull Binary covariate

โ„™[119883119894 = 1] = 075 โ„™[119883119894 = 0] = 025โ„™[119863119894 = 1|119883119894 = 1] = 09 โ„™[119863119894 = 1|119883119894 = 0] = 05

1205901113569119889(1) = 009 1205901113569119889(0) = 025120591(1) = 1 120591(0) = minus1

bull Implies the ATE is 120591 = 05bull Average conditional variance 120124[1205901113569119889(119883119894)] = 013bull weights for 119883119894 = 1 are 009013 = 0692 for 119883119894 = 0

025013 = 192120591119877 = 120124[120591(119883119894)119882119894]

= 120591(1)119882(1)โ„™[119883119894 = 1] + 120591(0)119882(0)โ„™[119883119894 = 0]= 1 times 0692 times 075 + minus1 times 192 times 025= 0039

42 58

When will OLS estimate the ATE

bull When does 120591 = 120591119877bull Constant treatment effects 120591(119909) = 120591 = 120591119877bull Constant probability of treatment 119890(119909) = โ„™[119863119894 = 1|119883119894 = 119909] = 119890

Implies that the OLS weights are 1

bull Incorrect linearity assumption in 119883119894 will lead to more bias

43 58

Other ways to use regression

bull Whatrsquos the path forward Accept the bias (might be relatively small with saturated

models) Use a different regression approach

bull Let 120583119889(119909) = 120124[119884119894(119889)|119883119894 = 119909] be the CEF for the potentialoutcome under 119863119894 = 119889

bull By consistency and nuc we have 120583119889(119909) = 120124[119884119894|119863119894 = 119889119883119894 = 119909]bull Estimate a regression of 119884119894 on 119883119894 among the 119863119894 = 119889 groupbull Then 119889(119909) is just a predicted value from the regression for

119883119894 = 119909bull How can we use this

44 58

Imputation estimators

bull Impute the treated potential outcomes with 1114023119884119894(1) = 1113568(119883119894)bull Impute the control potential outcomes with 1114023119884119894(0) = 1113567(119883119894)bull Procedure

Regress 119884119894 on 119883119894 in the treated group and get predicted valuesfor all units (treated or control)

Regress 119884119894 on 119883119894 in the control group and get predicted valuesfor all units (treated or control)

Take the average difference between these predicted values

bull More mathematically look like this

120591119894119898119901 =1119873

11140121198941113568(119883119894) minus 1113567(119883119894)

bull Sometimes called an imputation estimator

45 58

Simple imputation estimator

bull Use predict() from the within-group models on the data fromthe entire sample

bull Useful trick use a model on the entire data andmodelframe() to get the right design matrix

heterogeneous effects

yhet lt- ifelse(d == 1 y + rnorm(n 0 5) y)

mod lt- lm(yhet ~ d + X)

mod1 lt- lm(yhet ~ X subset = d == 1)

mod0 lt- lm(yhet ~ X subset = d == 0)

y1imps lt- predict(mod1 modelframe(mod))

y0imps lt- predict(mod0 modelframe(mod))

mean(y1imps - y0imps)

[1] 061

46 58

Notes on imputation estimators

bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE

bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS

bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods

such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)

47 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

48 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

49 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

50 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

51 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

52 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

53 58

Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible

estimatelibrary(mgcv)

mod0 lt- gam(y ~ s(x) subset = d == 0)

summary(mod0)

Family gaussian

Link function identity

Formula

y ~ s(x)

Parametric coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) -00225 00154 -146 016

Approximate significance of smooth terms

edf Refdf F p-value

s(x) 603 708 413 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

R-sq(adj) = 0909 Deviance explained = 928

GCV = 00093204 Scale est = 00071351 n = 30

54 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

55 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

56 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

57 58

Limited dependent variables

bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc

bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models

(stratified diff-in-means)

bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong

bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models

Could go wrong in small samples If using nonlinear models always get effects on the scale of the

outcome

58 58

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 35: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

3 Regression withHeterogeneousTreatment Effects

35 58

Heterogeneous effects binary treatment

bull Completely randomized experiment

119884119894 = 119863119894119884119894(1) + (1 minus 119863119894)119884119894(0)= 119884119894(0) + (119884119894(1) minus 119884119894(0))119863119894

= 1205831113567 + 120591119894119863119894 + (119884119894(0) minus 1205831113567)= 1205831113567 + 120591119863119894 + (119884119894(0) minus 1205831113567) + (120591119894 minus 120591) sdot 119863119894

= 1205831113567 + 120591119863119894 + 120576119894

bull Error term now includes two components1 ldquoBaselinerdquo variation in the outcome (119884119894(0) minus 1205831113567)2 Variation in the treatment effect (120591119894 minus 120591)

bull Easy to verify that under experiment 120124[120576119894|119863119894] = 0bull Thus OLS estimates the ATE with no covariates

36 58

Adding covariates

bull What happens with no unmeasured confounders Need tocondition on 119883119894 now

bull Remember identification of the ATEATT using iteratedexpectations

bull ATE is the weighted sum of CATEs

120591 = 1114012119909120591(119909) Pr[119883119894 = 119909]

bull ATEATT are weighted averages of CATEsbull What about the regression estimand 120591119877 How does it related

to the ATEATT

37 58

Heterogeneous effects and regression

bull Letrsquos investigate this under a saturated regression model

119884119894 = 1114012119909119861119909119894120572119909 + 120591119877119863119894 + 119890119894

bull Use a dummy variable for each unique combination of 119883119894119861119909119894 = 120128(119883119894 = 119909)

bull Linear in 119883119894 by construction

38 58

Investigating the regression coefficient

bull How can we investigate 120591119877 Well we can rely on theregression anatomy

120591119877 = Cov(119884119894 119863119894 minus 119864[119863119894|119883119894])120141(119863119894 minus 119864[119863119894|119883119894])

bull 119863119894 minus 120124[119863119894|119883119894] is the residual from a regression of 119863119894 on the fullset of dummies

bull With a little work we can show

120591119877 =120124 1114094120591(119883119894)(119863119894 minus 120124[119863119894|119883119894])11135691114097

120124[(119863119894 minus 119864[119863119894|119883119894])1113569]=

120124[120591(119883119894)1205901113569119889(119883119894)]120124[1205901113569119889(119883119894)]

bull 1205901113569119889(119909) = 120141[119863119894|119883119894 = 119909] is the conditional variance of treatmentassignment

39 58
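The anatomy result can be verified on a small exact population. The Python sketch below uses hypothetical numbers (two strata, made-up propensity scores and CATEs) and enumerates the joint distribution of $(X_i, D_i)$: averaging $\tau(X_i)$ against the squared treatment residuals reproduces the variance-weighted formula exactly.

```python
# Hypothetical two-stratum population, chosen for illustration only.
p_x = {1: 0.5, 0: 0.5}   # P[X = x]
e = {1: 0.8, 0: 0.3}     # propensity scores e(x) = P[D = 1 | X = x]
tau = {1: 2.0, 0: 0.0}   # CATEs tau(x)

# Residual-based version: E[tau(X)(D - e(X))^2] / E[(D - e(X))^2]
num, den = 0.0, 0.0
for x, px in p_x.items():
    for d in (0, 1):
        pd = e[x] if d == 1 else 1 - e[x]
        resid_sq = (d - e[x]) ** 2
        num += px * pd * tau[x] * resid_sq
        den += px * pd * resid_sq
tau_R = num / den

# Weighted-CATE version: weights sigma^2_d(x) / E[sigma^2_d(X)]
sigma2 = {x: e[x] * (1 - e[x]) for x in p_x}
avg_var = sum(sigma2[x] * p_x[x] for x in p_x)
tau_R_direct = sum(tau[x] * sigma2[x] * p_x[x] for x in p_x) / avg_var
ate = sum(tau[x] * p_x[x] for x in p_x)

print(round(tau_R, 4), round(tau_R_direct, 4), ate)  # 0.8649 0.8649 1.0
```

Note the gap between $\tau_R \approx 0.865$ and the ATE of $1$: the zero-effect stratum has the larger treatment variance ($0.21$ vs. $0.16$), so OLS pulls the estimand toward it.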

ATE versus OLS

$$\tau_R = \mathbb{E}[\tau(X_i) W_i] = \sum_x \tau(x)\,\frac{\sigma^2_d(x)}{\mathbb{E}[\sigma^2_d(X_i)]}\,\mathbb{P}[X_i = x]$$

• Compare to the ATE:

$$\tau = \mathbb{E}[\tau(X_i)] = \sum_x \tau(x)\,\mathbb{P}[X_i = x]$$

• Both weight strata relative to their size ($\mathbb{P}[X_i = x]$)
• OLS weights strata higher if the treatment variance in those strata ($\sigma^2_d(x)$) is high relative to the average variance across strata ($\mathbb{E}[\sigma^2_d(X_i)]$)
• The ATE weights strata only by their size

40 58

Regression weighting

$$W_i = \frac{\sigma^2_d(X_i)}{\mathbb{E}[\sigma^2_d(X_i)]}$$

• Why does OLS weight like this?
• OLS is a minimum-variance estimator: it puts more weight on more precise within-strata estimates
• Within-strata estimates are most precise when the treatment is evenly spread, and thus has the highest variance
• If $D_i$ is binary, then the conditional variance is

$$\sigma^2_d(x) = \mathbb{P}[D_i = 1 \mid X_i = x]\,(1 - \mathbb{P}[D_i = 1 \mid X_i = x]) = e(x)(1 - e(x))$$

• Maximum variance with $\mathbb{P}[D_i = 1 \mid X_i = x] = 1/2$

41 58

OLS weighting example

• Binary covariate:

$$\mathbb{P}[X_i = 1] = 0.75 \qquad \mathbb{P}[X_i = 0] = 0.25$$
$$\mathbb{P}[D_i = 1 \mid X_i = 1] = 0.9 \qquad \mathbb{P}[D_i = 1 \mid X_i = 0] = 0.5$$
$$\sigma^2_d(1) = 0.09 \qquad \sigma^2_d(0) = 0.25 \qquad \tau(1) = 1 \qquad \tau(0) = -1$$

• Implies the ATE is $\tau = 0.5$
• Average conditional variance: $\mathbb{E}[\sigma^2_d(X_i)] = 0.13$
• Weights: for $X_i = 1$, $0.09/0.13 = 0.692$; for $X_i = 0$, $0.25/0.13 = 1.92$

$$
\begin{aligned}
\tau_R = \mathbb{E}[\tau(X_i) W_i]
&= \tau(1) W(1)\,\mathbb{P}[X_i = 1] + \tau(0) W(0)\,\mathbb{P}[X_i = 0] \\
&= 1 \times 0.692 \times 0.75 + (-1) \times 1.92 \times 0.25 \\
&= 0.039
\end{aligned}
$$

42 58
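The slide's arithmetic, spelled out (Python used here purely as a calculator). One small caveat: the exact value is $0.005/0.13 \approx 0.0385$; the $0.039$ on the slide comes from carrying the rounded weights through.

```python
# Strata sizes, propensity scores, and CATEs from the slide's example.
p_x1, p_x0 = 0.75, 0.25
e1, e0 = 0.9, 0.5
tau1, tau0 = 1.0, -1.0

s1 = e1 * (1 - e1)                 # sigma^2_d(1) = 0.09
s0 = e0 * (1 - e0)                 # sigma^2_d(0) = 0.25
avg_var = s1 * p_x1 + s0 * p_x0    # E[sigma^2_d(X)] = 0.13

w1, w0 = s1 / avg_var, s0 / avg_var   # OLS weights: 0.692 and 1.923
tau_R = tau1 * w1 * p_x1 + tau0 * w0 * p_x0
ate = tau1 * p_x1 + tau0 * p_x0

print(round(avg_var, 4), round(w1, 3), round(w0, 3), round(tau_R, 4), ate)
# 0.13 0.692 1.923 0.0385 0.5
```

Despite an ATE of $0.5$, the regression estimand is nearly zero: the low-variance stratum ($X_i = 1$, with $e(x) = 0.9$) holds 75% of the population but gets a weight well below one.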

When will OLS estimate the ATE?

• When does $\tau = \tau_R$?
  - Constant treatment effects: $\tau(x) = \tau = \tau_R$
  - Constant probability of treatment: $e(x) = \mathbb{P}[D_i = 1 \mid X_i = x] = e$, which implies that the OLS weights are 1
• An incorrect linearity assumption in $X_i$ will lead to more bias

43 58
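The constant-propensity case can be written out in one line: when $e(x) = e$ does not vary, $\sigma^2_d(x)$ is the same constant in every stratum, so the weights collapse to one and the regression estimand is the ATE.

```latex
e(x) = e \;\Longrightarrow\; \sigma^2_d(x) = e(1 - e) \;\Longrightarrow\;
W_i = \frac{\sigma^2_d(X_i)}{\mathbb{E}[\sigma^2_d(X_i)]}
    = \frac{e(1-e)}{e(1-e)} = 1
\;\Longrightarrow\; \tau_R = \mathbb{E}[\tau(X_i) \cdot 1] = \tau
```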

Other ways to use regression

• What's the path forward?
  - Accept the bias (might be relatively small with saturated models)
  - Use a different regression approach
• Let $\mu_d(x) = \mathbb{E}[Y_i(d) \mid X_i = x]$ be the CEF for the potential outcome under $D_i = d$
• By consistency and no unmeasured confounders, we have $\mu_d(x) = \mathbb{E}[Y_i \mid D_i = d, X_i = x]$
• Estimate a regression of $Y_i$ on $X_i$ among the $D_i = d$ group
• Then $\hat\mu_d(x)$ is just a predicted value from the regression at $X_i = x$
• How can we use this?

44 58

Imputation estimators

• Impute the treated potential outcomes with $\hat{Y}_i(1) = \hat\mu_1(X_i)$
• Impute the control potential outcomes with $\hat{Y}_i(0) = \hat\mu_0(X_i)$
• Procedure:
  1. Regress $Y_i$ on $X_i$ in the treated group and get predicted values for all units (treated or control)
  2. Regress $Y_i$ on $X_i$ in the control group and get predicted values for all units (treated or control)
  3. Take the average difference between these predicted values
• More mathematically, it looks like this:

$$\hat\tau_{imp} = \frac{1}{N} \sum_i \left[ \hat\mu_1(X_i) - \hat\mu_0(X_i) \right]$$

• Sometimes called an imputation estimator

45 58

Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample
• Useful trick: fit a model on the entire data and use model.frame() to get the right design matrix

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61

46 58
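For readers who prefer a from-scratch version, here is a self-contained Python sketch of the same imputation estimator with a single covariate and a hypothetical data-generating process (true ATE of 1). The group-specific fits use the closed-form simple-OLS solution rather than a library call.

```python
import random

random.seed(0)
n = 5000
x = [random.gauss(0, 1) for _ in range(n)]
d = [random.random() < 0.5 for _ in range(n)]
# hypothetical CEFs: mu0(x) = x, mu1(x) = 1 + 2x, so the ATE is 1 + 2*E[X] = 1
y = [(1 + 2 * x[i] if d[i] else x[i]) + random.gauss(0, 0.5) for i in range(n)]

def ols_fit(xs, ys):
    """Closed-form simple regression: returns (intercept, slope)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
             / sum((a - mx) ** 2 for a in xs))
    return my - slope * mx, slope

# fit Y on X separately within the treated and control groups
a1, b1 = ols_fit([x[i] for i in range(n) if d[i]],
                 [y[i] for i in range(n) if d[i]])
a0, b0 = ols_fit([x[i] for i in range(n) if not d[i]],
                 [y[i] for i in range(n) if not d[i]])

# impute both potential outcomes for *all* units, then average the difference
tau_imp = sum((a1 + b1 * xi) - (a0 + b0 * xi) for xi in x) / n
print(round(tau_imp, 2))
```

The estimate lands close to the true ATE of 1; the key step is that each group's fitted line is evaluated at every unit's $X_i$, not just at the units used to fit it.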

Notes on imputation estimators

• If the $\hat\mu_d(x)$ are consistent estimators, then $\hat\tau_{imp}$ is consistent for the ATE
• Why don't people use this?
  - Most people don't know the results we've been talking about
  - Harder to implement than vanilla OLS
• Can use linear regression to estimate $\hat\mu_d(x) = x'\hat\beta_d$
• Recent trend is to estimate $\hat\mu_d(x)$ via nonparametric methods, such as kernel regression, local linear regression, regression trees, etc. Easiest is generalized additive models (GAMs).

47 58

Imputation estimator visualization

[Figure, slides 48-50: scatterplots of y against x for treated and control units]

50 58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between $Y_i$ and $X_i$

[Figure, slides 51-53: scatterplots of y against x for treated and control units]

53 58

Using semiparametric regression

• Here the CEFs are nonlinear, but we don't know their form
• We can use GAMs from the mgcv package for flexible estimation:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16

---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204  Scale est. = 0.0071351  n = 30

54 58

Using GAMs

[Figure, slides 55-57: y against x with the fitted GAM curves for the treated and control groups]

57 58

Limited dependent variables

bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc

bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models

(stratified diff-in-means)

bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong

bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models

Could go wrong in small samples If using nonlinear models always get effects on the scale of the

outcome

58 58
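The first "OLS is a-ok" case can be demonstrated directly. In this Python sketch (hypothetical success probabilities), the outcome is binary, yet the OLS slope equals the difference in proportions exactly, so no logit/probit distributional assumption is needed for this estimand.

```python
import random

random.seed(7)
n = 20_000
d = [random.random() < 0.5 for _ in range(n)]
# hypothetical probabilities: P[Y=1] is 0.3 under control, 0.5 under treatment
y = [1 if random.random() < (0.5 if d[i] else 0.3) else 0 for i in range(n)]

# OLS slope for a single binary regressor: cov(D, Y) / var(D)
db = sum(d) / n
ols_slope = (sum((d[i] - db) * y[i] for i in range(n))
             / sum((di - db) ** 2 for di in d))
# difference in proportions (the stratum-free diff-in-means)
diff_props = (sum(y[i] for i in range(n) if d[i]) / sum(d)
              - sum(y[i] for i in range(n) if not d[i]) / (n - sum(d)))

print(abs(ols_slope - diff_props) < 1e-8)  # True: identical estimands
```

The linearity "assumption" is vacuous here because a line through two points (the two treatment groups) fits them perfectly; the distributional worry only bites once continuous covariates enter unsaturated.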

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects

R-sq(adj) = 0909 Deviance explained = 928

GCV = 00093204 Scale est = 00071351 n = 30

54 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

55 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

56 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

57 58

Limited dependent variables

bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc

bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models

(stratified diff-in-means)

bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong

bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models

Could go wrong in small samples If using nonlinear models always get effects on the scale of the

outcome

58 58

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 39: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

Investigating the regression coefficient

• How can we investigate τ_R? We can rely on the regression anatomy:

  τ_R = Cov(Y_i, D_i − E[D_i | X_i]) / V(D_i − E[D_i | X_i])

• D_i − E[D_i | X_i] is the residual from a regression of D_i on the full set of dummies.

• With a little work, we can show:

  τ_R = E[τ(X_i) (D_i − E[D_i | X_i])²] / E[(D_i − E[D_i | X_i])²] = E[τ(X_i) σ²_d(X_i)] / E[σ²_d(X_i)]

• σ²_d(x) = V[D_i | X_i = x] is the conditional variance of treatment assignment.

39 58
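The regression-anatomy identity is easy to verify numerically. A minimal sketch (Python used here for illustration; the simulated strata, probabilities, and variable names are my own) checks that residualizing D_i on the stratum dummies and taking the covariance ratio reproduces the OLS coefficient on D_i exactly:

```python
import numpy as np

# Simulated data with three discrete strata (all numbers here are my own).
rng = np.random.default_rng(0)
n = 10_000
x = rng.integers(0, 3, size=n)              # covariate with three strata
e = np.array([0.2, 0.5, 0.8])[x]            # P[D_i = 1 | X_i] by stratum
d = rng.binomial(1, e)
tau = np.array([1.0, 2.0, 4.0])[x]          # heterogeneous effects tau(x)
y = 0.5 * x + tau * d + rng.normal(size=n)

# OLS of Y on D plus a full set of stratum dummies: coefficient on D is tau_R
X = np.column_stack([np.ones(n), d, x == 1, x == 2]).astype(float)
tau_R = np.linalg.lstsq(X, y, rcond=None)[0][1]

# Regression anatomy: residualize D on the dummies, then take the ratio
# Cov(Y, D - E[D|X]) / V(D - E[D|X])
Z = np.column_stack([np.ones(n), x == 1, x == 2]).astype(float)
d_resid = d - Z @ np.linalg.lstsq(Z, d.astype(float), rcond=None)[0]
tau_anatomy = (y @ d_resid) / (d_resid @ d_resid)

print(abs(tau_R - tau_anatomy))  # essentially zero: the FWL identity is exact
```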

ATE versus OLS

τ_R = E[τ(X_i) W_i] = Σ_x τ(x) · (σ²_d(x) / E[σ²_d(X_i)]) · P[X_i = x]

• Compare to the ATE:

  τ = E[τ(X_i)] = Σ_x τ(x) P[X_i = x]

• Both weight strata relative to their size, P[X_i = x].

• OLS weights strata higher if the treatment variance in those strata, σ²_d(x), is high relative to the average variance across strata, E[σ²_d(X_i)].

• The ATE weights strata only by their size.

40 58

Regression weighting

W_i = σ²_d(X_i) / E[σ²_d(X_i)]

• Why does OLS weight like this?

• OLS is a minimum-variance estimator: it gives more weight to the more precise within-strata estimates.

• Within-strata estimates are most precise when the treatment is evenly spread within the stratum and thus has the highest variance.

• If D_i is binary, then the conditional variance is

  σ²_d(x) = P[D_i = 1 | X_i = x] (1 − P[D_i = 1 | X_i = x]) = e(x) (1 − e(x))

• Maximum variance occurs when P[D_i = 1 | X_i = x] = 1/2.

41 58

OLS weighting example

• Binary covariate:

  P[X_i = 1] = 0.75, P[X_i = 0] = 0.25
  P[D_i = 1 | X_i = 1] = 0.9, P[D_i = 1 | X_i = 0] = 0.5
  σ²_d(1) = 0.09, σ²_d(0) = 0.25
  τ(1) = 1, τ(0) = −1

• Implies the ATE is τ = 0.5.
• Average conditional variance: E[σ²_d(X_i)] = 0.13.
• Weights: 0.09/0.13 = 0.692 for X_i = 1, and 0.25/0.13 = 1.92 for X_i = 0.

  τ_R = E[τ(X_i) W_i]
      = τ(1) W(1) P[X_i = 1] + τ(0) W(0) P[X_i = 0]
      = 1 × 0.692 × 0.75 + (−1) × 1.92 × 0.25
      = 0.039

42 58
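The gap between τ and τ_R in this example is easy to reproduce. A quick check (Python used purely as a calculator; the dictionary names are mine) gives the exact value 1/26 ≈ 0.0385, which matches the slide's 0.039 once the weights are rounded:

```python
# Reproducing the slide's weighting arithmetic.
p_x = {1: 0.75, 0: 0.25}                    # P[X_i = x]
e_x = {1: 0.9, 0: 0.5}                      # e(x) = P[D_i = 1 | X_i = x]
tau_x = {1: 1.0, 0: -1.0}                   # tau(x)

s2 = {x: e_x[x] * (1 - e_x[x]) for x in p_x}        # sigma^2_d(x)
avg_s2 = sum(s2[x] * p_x[x] for x in p_x)           # E[sigma^2_d(X_i)] = 0.13
w = {x: s2[x] / avg_s2 for x in p_x}                # OLS weights W(x)

ate = sum(tau_x[x] * p_x[x] for x in p_x)           # tau = 0.5
tau_R = sum(tau_x[x] * w[x] * p_x[x] for x in p_x)  # exactly 1/26, about 0.0385

print(ate, tau_R)
```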

When will OLS estimate the ATE?

• When does τ = τ_R?
• Constant treatment effects: τ(x) = τ = τ_R.
• Constant probability of treatment: e(x) = P[D_i = 1 | X_i = x] = e.
  – Implies that the OLS weights are 1.
• An incorrect linearity assumption in X_i will lead to more bias.

43 58

Other ways to use regression

• What's the path forward?
  – Accept the bias (it might be relatively small with saturated models).
  – Use a different regression approach.
• Let μ_d(x) = E[Y_i(d) | X_i = x] be the CEF for the potential outcome under D_i = d.
• By consistency and no unmeasured confounders, we have μ_d(x) = E[Y_i | D_i = d, X_i = x].
• Estimate a regression of Y_i on X_i among the D_i = d group.
• Then μ̂_d(x) is just a predicted value from that regression at X_i = x.
• How can we use this?

44 58

Imputation estimators

• Impute the treated potential outcomes with Ŷ_i(1) = μ̂_1(X_i).
• Impute the control potential outcomes with Ŷ_i(0) = μ̂_0(X_i).
• Procedure:
  1. Regress Y_i on X_i in the treated group and get predicted values for all units (treated or control).
  2. Regress Y_i on X_i in the control group and get predicted values for all units (treated or control).
  3. Take the average difference between these predicted values.
• More mathematically, it looks like this:

  τ̂_imp = (1/N) Σ_i [μ̂_1(X_i) − μ̂_0(X_i)]

• Sometimes called an imputation estimator.

45 58

Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample.
• Useful trick: fit a model on the entire data and use model.frame() to get the right design matrix.

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

[1] 0.61

46 58

Notes on imputation estimators

• If the μ̂_d(x) are consistent estimators, then τ̂_imp is consistent for the ATE.
• Why don't people use this?
  – Most people don't know the results we've been talking about.
  – Harder to implement than vanilla OLS.
• Can use linear regression to estimate μ̂_d(x) = x′β̂_d.
• Recent trend is to estimate μ̂_d(x) via nonparametric methods such as kernel regression, local linear regression, regression trees, etc. The easiest is generalized additive models (GAMs).

47 58
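The two-regression procedure from the previous slides is short to code. Here is a minimal sketch of the imputation estimator in Python with numpy (rather than the slides' R); the data-generating process and names are my own, with a true ATE of 2 and heterogeneous effects, so the imputation estimator recovers 2 while the raw difference in means is confounded:

```python
import numpy as np

# My own DGP: treatment probability and treatment effect both depend on x.
rng = np.random.default_rng(1)
n = 50_000
x = rng.normal(size=n)
d = rng.binomial(1, 1 / (1 + np.exp(-x)))               # selection on x
y = 1.0 + 0.5 * x + (2.0 + x) * d + rng.normal(size=n)  # effect 2 + x, ATE = 2

def fit_predict(x_grp, y_grp, x_all):
    """Within-group OLS of y on (1, x); predictions for every unit."""
    X = np.column_stack([np.ones(len(x_grp)), x_grp])
    b = np.linalg.lstsq(X, y_grp, rcond=None)[0]
    return b[0] + b[1] * x_all

y1_imp = fit_predict(x[d == 1], y[d == 1], x)   # mu-hat_1(X_i) for all i
y0_imp = fit_predict(x[d == 0], y[d == 0], x)   # mu-hat_0(X_i) for all i
tau_imp = np.mean(y1_imp - y0_imp)              # imputation estimator, ~2

naive = y[d == 1].mean() - y[d == 0].mean()     # confounded, well above 2
print(tau_imp, naive)
```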

Imputation estimator visualization

[Figure: scatterplot of y against x for treated and control units]

48 58


Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i.

[Figure: scatterplot of y against x with nonlinear CEFs, treated and control units]

51 58


Using semiparametric regression

• Here the CEFs are nonlinear, but we don't know their form.
• We can use GAMs from the mgcv package for flexible estimates:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204   Scale est. = 0.0071351   n = 30

54 58

Using GAMs

[Figure: y against x with GAM fits for treated and control units]

55 58


Limited dependent variables

• Usual advice: model the data from first principles. Logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is a-ok with limited DVs when:
  – Binary treatment and no covariates (just the difference in means)
  – Binary treatment, discrete covariates, and saturated models (stratified difference in means)
• Imposing a model on LDVs in these cases imposes a distributional assumption, which could be wrong.
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models.
  – Could go wrong in small samples.
  – If using nonlinear models, always get effects on the scale of the outcome.

58 58
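The first "OLS is a-ok" case is an algebraic identity rather than an approximation, and a small simulation confirms it (a sketch in Python; the simulated data are my own):

```python
import numpy as np

# Binary treatment, binary outcome, no covariates.
rng = np.random.default_rng(2)
d = rng.binomial(1, 0.4, size=1_000)
y = rng.binomial(1, 0.3 + 0.2 * d).astype(float)    # limited (binary) DV

# OLS of y on an intercept and the binary treatment...
X = np.column_stack([np.ones_like(d), d]).astype(float)
slope = np.linalg.lstsq(X, y, rcond=None)[0][1]

# ...has a slope exactly equal to the difference in means.
diff_means = y[d == 1].mean() - y[d == 0].mean()
print(abs(slope - diff_means))  # zero up to floating-point error
```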

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 40: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

ATE versus OLS

120591119877 = 120124[120591(119883119894)119882119894] = 1114012119909120591(119909)

1205901113569119889(119909)120124[1205901113569119889(119883119894)]

โ„™[119883119894 = 119909]

bull Compare to the ATE

120591 = 120124[120591(119883119894)] = 1114012119909120591(119909)โ„™[119883119894 = 119909]

bull Both weight strata relative to their size (โ„™[119883119894 = 119909])bull OLS weights strata higher if the treatment variance in those

strata (1205901113569119889(119909)) is higher in those strata relative to the averagevariance across strata (120124[1205901113569119889(119883119894)])

bull The ATE weights only by their size

40 58

Regression weighting

119882119894 =1205901113569119889(119883119894)

120124[1205901113569119889(119883119894)]

bull Why does OLS weight like thisbull OLS is a minimum-variance estimator more weight to

more precise within-strata estimatesbull Within-strata estimates are most precise when the treatment

is evenly spread and thus has the highest variancebull If 119863119894 is binary then we know the conditional variance will be

1205901113569119889(119909) = โ„™[119863119894 = 1|119883119894 = 119909] (1 minus โ„™[119863119894 = 1|119883119894 = 119909])= 119890(119909) (1 minus 119890(119909))

bull Maximum variance with โ„™[119863119894 = 1|119883119894 = 119909] = 12

41 58

OLS weighting examplebull Binary covariate

โ„™[119883119894 = 1] = 075 โ„™[119883119894 = 0] = 025โ„™[119863119894 = 1|119883119894 = 1] = 09 โ„™[119863119894 = 1|119883119894 = 0] = 05

1205901113569119889(1) = 009 1205901113569119889(0) = 025120591(1) = 1 120591(0) = minus1

bull Implies the ATE is 120591 = 05bull Average conditional variance 120124[1205901113569119889(119883119894)] = 013bull weights for 119883119894 = 1 are 009013 = 0692 for 119883119894 = 0

025013 = 192120591119877 = 120124[120591(119883119894)119882119894]

= 120591(1)119882(1)โ„™[119883119894 = 1] + 120591(0)119882(0)โ„™[119883119894 = 0]= 1 times 0692 times 075 + minus1 times 192 times 025= 0039

42 58

When will OLS estimate the ATE

bull When does 120591 = 120591119877bull Constant treatment effects 120591(119909) = 120591 = 120591119877bull Constant probability of treatment 119890(119909) = โ„™[119863119894 = 1|119883119894 = 119909] = 119890

Implies that the OLS weights are 1

bull Incorrect linearity assumption in 119883119894 will lead to more bias

43 58

Other ways to use regression

bull Whatrsquos the path forward Accept the bias (might be relatively small with saturated

models) Use a different regression approach

bull Let 120583119889(119909) = 120124[119884119894(119889)|119883119894 = 119909] be the CEF for the potentialoutcome under 119863119894 = 119889

bull By consistency and nuc we have 120583119889(119909) = 120124[119884119894|119863119894 = 119889119883119894 = 119909]bull Estimate a regression of 119884119894 on 119883119894 among the 119863119894 = 119889 groupbull Then 119889(119909) is just a predicted value from the regression for

119883119894 = 119909bull How can we use this

44 58

Imputation estimators

bull Impute the treated potential outcomes with 1114023119884119894(1) = 1113568(119883119894)bull Impute the control potential outcomes with 1114023119884119894(0) = 1113567(119883119894)bull Procedure

Regress 119884119894 on 119883119894 in the treated group and get predicted valuesfor all units (treated or control)

Regress 119884119894 on 119883119894 in the control group and get predicted valuesfor all units (treated or control)

Take the average difference between these predicted values

bull More mathematically look like this

120591119894119898119901 =1119873

11140121198941113568(119883119894) minus 1113567(119883119894)

bull Sometimes called an imputation estimator

45 58

Simple imputation estimator

bull Use predict() from the within-group models on the data fromthe entire sample

bull Useful trick use a model on the entire data andmodelframe() to get the right design matrix

heterogeneous effects

yhet lt- ifelse(d == 1 y + rnorm(n 0 5) y)

mod lt- lm(yhet ~ d + X)

mod1 lt- lm(yhet ~ X subset = d == 1)

mod0 lt- lm(yhet ~ X subset = d == 0)

y1imps lt- predict(mod1 modelframe(mod))

y0imps lt- predict(mod0 modelframe(mod))

mean(y1imps - y0imps)

[1] 061

46 58

Notes on imputation estimators

bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE

bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS

bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods

such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)

47 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

48 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

49 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

50 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

51 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

52 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

53 58

Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible

estimatelibrary(mgcv)

mod0 lt- gam(y ~ s(x) subset = d == 0)

summary(mod0)

Family gaussian

Link function identity

Formula

y ~ s(x)

Parametric coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) -00225 00154 -146 016

Approximate significance of smooth terms

edf Refdf F p-value

s(x) 603 708 413 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

R-sq(adj) = 0909 Deviance explained = 928

GCV = 00093204 Scale est = 00071351 n = 30

54 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

55 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

56 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

57 58

Limited dependent variables

bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc

bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models

(stratified diff-in-means)

bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong

bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models

Could go wrong in small samples If using nonlinear models always get effects on the scale of the

outcome

58 58

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 41: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

Regression weighting

119882119894 =1205901113569119889(119883119894)

120124[1205901113569119889(119883119894)]

bull Why does OLS weight like thisbull OLS is a minimum-variance estimator more weight to

more precise within-strata estimatesbull Within-strata estimates are most precise when the treatment

is evenly spread and thus has the highest variancebull If 119863119894 is binary then we know the conditional variance will be

1205901113569119889(119909) = โ„™[119863119894 = 1|119883119894 = 119909] (1 minus โ„™[119863119894 = 1|119883119894 = 119909])= 119890(119909) (1 minus 119890(119909))

bull Maximum variance with โ„™[119863119894 = 1|119883119894 = 119909] = 12

41 58

OLS weighting examplebull Binary covariate

โ„™[119883119894 = 1] = 075 โ„™[119883119894 = 0] = 025โ„™[119863119894 = 1|119883119894 = 1] = 09 โ„™[119863119894 = 1|119883119894 = 0] = 05

1205901113569119889(1) = 009 1205901113569119889(0) = 025120591(1) = 1 120591(0) = minus1

bull Implies the ATE is 120591 = 05bull Average conditional variance 120124[1205901113569119889(119883119894)] = 013bull weights for 119883119894 = 1 are 009013 = 0692 for 119883119894 = 0

025013 = 192120591119877 = 120124[120591(119883119894)119882119894]

= 120591(1)119882(1)โ„™[119883119894 = 1] + 120591(0)119882(0)โ„™[119883119894 = 0]= 1 times 0692 times 075 + minus1 times 192 times 025= 0039

42 58

When will OLS estimate the ATE

bull When does 120591 = 120591119877bull Constant treatment effects 120591(119909) = 120591 = 120591119877bull Constant probability of treatment 119890(119909) = โ„™[119863119894 = 1|119883119894 = 119909] = 119890

Implies that the OLS weights are 1

bull Incorrect linearity assumption in 119883119894 will lead to more bias

43 58

Other ways to use regression

bull Whatrsquos the path forward Accept the bias (might be relatively small with saturated

models) Use a different regression approach

bull Let 120583119889(119909) = 120124[119884119894(119889)|119883119894 = 119909] be the CEF for the potentialoutcome under 119863119894 = 119889

bull By consistency and nuc we have 120583119889(119909) = 120124[119884119894|119863119894 = 119889119883119894 = 119909]bull Estimate a regression of 119884119894 on 119883119894 among the 119863119894 = 119889 groupbull Then 119889(119909) is just a predicted value from the regression for

119883119894 = 119909bull How can we use this

44 58

Imputation estimators

bull Impute the treated potential outcomes with 1114023119884119894(1) = 1113568(119883119894)bull Impute the control potential outcomes with 1114023119884119894(0) = 1113567(119883119894)bull Procedure

Regress 119884119894 on 119883119894 in the treated group and get predicted valuesfor all units (treated or control)

Regress 119884119894 on 119883119894 in the control group and get predicted valuesfor all units (treated or control)

Take the average difference between these predicted values

bull More mathematically look like this

120591119894119898119901 =1119873

11140121198941113568(119883119894) minus 1113567(119883119894)

bull Sometimes called an imputation estimator

45 58

Simple imputation estimator

bull Use predict() from the within-group models on the data fromthe entire sample

bull Useful trick use a model on the entire data andmodelframe() to get the right design matrix

heterogeneous effects

yhet lt- ifelse(d == 1 y + rnorm(n 0 5) y)

mod lt- lm(yhet ~ d + X)

mod1 lt- lm(yhet ~ X subset = d == 1)

mod0 lt- lm(yhet ~ X subset = d == 0)

y1imps lt- predict(mod1 modelframe(mod))

y0imps lt- predict(mod0 modelframe(mod))

mean(y1imps - y0imps)

[1] 061

46 58

Notes on imputation estimators

bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE

bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS

bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods

such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)

47 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

48 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

49 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

50 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

51 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

52 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

53 58

Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible

estimatelibrary(mgcv)

mod0 lt- gam(y ~ s(x) subset = d == 0)

summary(mod0)

Family gaussian

Link function identity

Formula

y ~ s(x)

Parametric coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) -00225 00154 -146 016

Approximate significance of smooth terms

edf Refdf F p-value

s(x) 603 708 413 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

R-sq(adj) = 0909 Deviance explained = 928

GCV = 00093204 Scale est = 00071351 n = 30

54 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

55 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

56 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

57 58

Limited dependent variables

bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc

bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models

(stratified diff-in-means)

bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong

bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models

Could go wrong in small samples If using nonlinear models always get effects on the scale of the

outcome

58 58

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 42: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

OLS weighting examplebull Binary covariate

โ„™[119883119894 = 1] = 075 โ„™[119883119894 = 0] = 025โ„™[119863119894 = 1|119883119894 = 1] = 09 โ„™[119863119894 = 1|119883119894 = 0] = 05

1205901113569119889(1) = 009 1205901113569119889(0) = 025120591(1) = 1 120591(0) = minus1

bull Implies the ATE is 120591 = 05bull Average conditional variance 120124[1205901113569119889(119883119894)] = 013bull weights for 119883119894 = 1 are 009013 = 0692 for 119883119894 = 0

025013 = 192120591119877 = 120124[120591(119883119894)119882119894]

= 120591(1)119882(1)โ„™[119883119894 = 1] + 120591(0)119882(0)โ„™[119883119894 = 0]= 1 times 0692 times 075 + minus1 times 192 times 025= 0039

42 58

When will OLS estimate the ATE

bull When does 120591 = 120591119877bull Constant treatment effects 120591(119909) = 120591 = 120591119877bull Constant probability of treatment 119890(119909) = โ„™[119863119894 = 1|119883119894 = 119909] = 119890

Implies that the OLS weights are 1

bull Incorrect linearity assumption in 119883119894 will lead to more bias

43 58

Other ways to use regression

bull Whatrsquos the path forward Accept the bias (might be relatively small with saturated

models) Use a different regression approach

bull Let 120583119889(119909) = 120124[119884119894(119889)|119883119894 = 119909] be the CEF for the potentialoutcome under 119863119894 = 119889

bull By consistency and nuc we have 120583119889(119909) = 120124[119884119894|119863119894 = 119889119883119894 = 119909]bull Estimate a regression of 119884119894 on 119883119894 among the 119863119894 = 119889 groupbull Then 119889(119909) is just a predicted value from the regression for

119883119894 = 119909bull How can we use this

44 58

Imputation estimators

bull Impute the treated potential outcomes with 1114023119884119894(1) = 1113568(119883119894)bull Impute the control potential outcomes with 1114023119884119894(0) = 1113567(119883119894)bull Procedure

Regress 119884119894 on 119883119894 in the treated group and get predicted valuesfor all units (treated or control)

Regress 119884119894 on 119883119894 in the control group and get predicted valuesfor all units (treated or control)

Take the average difference between these predicted values

bull More mathematically look like this

120591119894119898119901 =1119873

11140121198941113568(119883119894) minus 1113567(119883119894)

bull Sometimes called an imputation estimator

45 58

Simple imputation estimator

bull Use predict() from the within-group models on the data fromthe entire sample

bull Useful trick use a model on the entire data andmodelframe() to get the right design matrix

heterogeneous effects

yhet lt- ifelse(d == 1 y + rnorm(n 0 5) y)

mod lt- lm(yhet ~ d + X)

mod1 lt- lm(yhet ~ X subset = d == 1)

mod0 lt- lm(yhet ~ X subset = d == 0)

y1imps lt- predict(mod1 modelframe(mod))

y0imps lt- predict(mod0 modelframe(mod))

mean(y1imps - y0imps)

[1] 061

46 58

Notes on imputation estimators

bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE

bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS

bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods

such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)

47 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

48 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

49 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

50 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

51 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

52 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

53 58

Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible

estimatelibrary(mgcv)

mod0 lt- gam(y ~ s(x) subset = d == 0)

summary(mod0)

Family gaussian

Link function identity

Formula

y ~ s(x)

Parametric coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) -00225 00154 -146 016

Approximate significance of smooth terms

edf Refdf F p-value

s(x) 603 708 413 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

R-sq(adj) = 0909 Deviance explained = 928

GCV = 00093204 Scale est = 00071351 n = 30

54 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

55 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

56 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

57 58

Limited dependent variables

bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc

bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models

(stratified diff-in-means)

bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong

bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models

Could go wrong in small samples If using nonlinear models always get effects on the scale of the

outcome

58 58

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 43: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

When will OLS estimate the ATE?

• When does τ = τ_R?
• Constant treatment effects: τ(x) = τ = τ_R
• Constant probability of treatment: e(x) = ℙ[D_i = 1 | X_i = x] = e
  ◦ Implies that the OLS weights are 1
• Incorrect linearity assumption in X_i will lead to more bias

43 / 58
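The constant-propensity condition above can be checked in simulation: with heterogeneous effects τ(x) but a constant e(x), the OLS coefficient on the treatment recovers the ATE. A minimal sketch with a hypothetical data-generating process (in Python rather than the slides' R):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(size=n)
d = rng.binomial(1, 0.4, size=n)       # constant propensity: e(x) = 0.4 for all x
tau_x = 1.0 + 0.5 * x                  # heterogeneous effect, E[tau(X)] = ATE = 1.0
y = 2.0 + x + tau_x * d + rng.normal(scale=0.5, size=n)

# OLS of y on [1, d, x]: the coefficient on d is tau_R
X = np.column_stack([np.ones(n), d, x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
# with constant e(x), the OLS weights are equal, so beta[1] is close to 1.0
```

With a non-constant e(x), beta[1] would instead converge to a variance-weighted average of τ(x), generally not the ATE.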

Other ways to use regression

• What's the path forward?
  ◦ Accept the bias (might be relatively small with saturated models)
  ◦ Use a different regression approach
• Let μ_d(x) = 𝔼[Y_i(d) | X_i = x] be the CEF for the potential outcome under D_i = d
• By consistency and no unmeasured confounders, we have μ_d(x) = 𝔼[Y_i | D_i = d, X_i = x]
• Estimate a regression of Y_i on X_i among the D_i = d group
• Then μ̂_d(x) is just a predicted value from the regression for X_i = x
• How can we use this?

44 / 58

Imputation estimators

• Impute the treated potential outcomes with Ŷ_i(1) = μ̂₁(X_i)
• Impute the control potential outcomes with Ŷ_i(0) = μ̂₀(X_i)
• Procedure:
  ◦ Regress Y_i on X_i in the treated group and get predicted values for all units (treated or control)
  ◦ Regress Y_i on X_i in the control group and get predicted values for all units (treated or control)
  ◦ Take the average difference between these predicted values
• More mathematically, it looks like this:

  τ̂_imp = (1/N) Σ_i [μ̂₁(X_i) − μ̂₀(X_i)]

• Sometimes called an imputation estimator

45 / 58
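The procedure above can be sketched directly. A minimal Python version on hypothetical simulated data (the slides' own example uses R): fit a separate linear regression in each treatment group, then average the difference in predictions over the whole sample.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(size=n)
e_x = 1 / (1 + np.exp(-x))              # propensity score that varies with x
d = rng.binomial(1, e_x)
y = x + (1.0 + 0.5 * x) * d + rng.normal(scale=0.5, size=n)  # true ATE = 1.0

def group_fit(mask):
    """Within-group OLS of y on [1, x], i.e. an estimate of mu_d(x)."""
    X = np.column_stack([np.ones(mask.sum()), x[mask]])
    return np.linalg.lstsq(X, y[mask], rcond=None)[0]

b1, b0 = group_fit(d == 1), group_fit(d == 0)

# impute both potential outcomes for every unit and average the difference
X_all = np.column_stack([np.ones(n), x])
tau_imp = np.mean(X_all @ b1 - X_all @ b0)   # close to the ATE of 1.0
```

Because both group CEFs are linear here, the imputation estimator is consistent for the ATE even though the propensity score depends on x.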

Simple imputation estimator

• Use predict() from the within-group models on the data from the entire sample
• Useful trick: fit a model on the entire data and use model.frame() to get the right design matrix

## heterogeneous effects
y.het <- ifelse(d == 1, y + rnorm(n, 0, 5), y)
mod <- lm(y.het ~ d + X)
mod1 <- lm(y.het ~ X, subset = d == 1)
mod0 <- lm(y.het ~ X, subset = d == 0)
y1.imps <- predict(mod1, model.frame(mod))
y0.imps <- predict(mod0, model.frame(mod))
mean(y1.imps - y0.imps)

## [1] 0.61

46 / 58

Notes on imputation estimators

• If the μ̂_d(x) are consistent estimators of the CEFs, then τ̂_imp is consistent for the ATE
• Why don't people use this?
  ◦ Most people don't know the results we've been talking about
  ◦ Harder to implement than vanilla OLS
• Can use linear regression to estimate μ̂_d(x) = x′β̂_d
• Recent trend is to estimate μ̂_d(x) via non-parametric methods, such as:
  ◦ Kernel regression, local linear regression, regression trees, etc.
  ◦ Easiest is generalized additive models (GAMs)

47 / 58

Imputation estimator visualization

[Figures, slides 48–50: scatterplots of y against x for treated and control units, with within-group regression fits used to impute the missing potential outcomes]

50 / 58

Nonlinear relationships

• Same idea, but with a nonlinear relationship between Y_i and X_i

[Figures, slides 51–53: treated and control scatterplots of y against x with nonlinear CEFs]

53 / 58

Using semiparametric regression

• Here the CEFs are nonlinear, but we don't know their form
• We can use GAMs from the mgcv package for flexible estimates:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

## Family: gaussian
## Link function: identity
##
## Formula:
## y ~ s(x)
##
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.0225     0.0154   -1.46     0.16
##
## Approximate significance of smooth terms:
##       edf Ref.df    F p-value
## s(x) 6.03   7.08 41.3  <2e-16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.909   Deviance explained = 92.8%
## GCV = 0.0093204   Scale est. = 0.0071351   n = 30

54 / 58
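The same imputation logic carries over when μ̂_d(x) comes from a flexible smoother. As a rough stand-in for gam(y ~ s(x)), the sketch below (hypothetical simulated data, in Python) fits a degree-5 polynomial within each group and averages the difference in predictions:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2_000
x = rng.uniform(-2, 2, size=n)
d = rng.binomial(1, 0.5, size=n)
# nonlinear CEFs: mu_0(x) = sin(x), mu_1(x) = sin(x) + 0.5, so ATE = 0.5
y = np.sin(x) + 0.5 * d + rng.normal(scale=0.1, size=n)

# flexible fit per treatment group (a crude stand-in for a GAM smooth term)
p1 = np.polyfit(x[d == 1], y[d == 1], deg=5)
p0 = np.polyfit(x[d == 0], y[d == 0], deg=5)

# impute both potential outcomes for every unit and average the difference
tau_imp = np.mean(np.polyval(p1, x) - np.polyval(p0, x))  # close to 0.5
```

A real GAM chooses the amount of smoothing automatically; the fixed-degree polynomial here just illustrates that any reasonable estimate of the two CEFs can be plugged into the imputation formula.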

Using GAMs

[Figures, slides 55–57: GAM fits for the treated and control groups overlaid on the scatterplot of y against x]

57 / 58

Limited dependent variables

• Usual advice: model the data from first principles
  ◦ Logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is a-ok with limited DVs when:
  ◦ Binary treatment and no covariates (just diff-in-means)
  ◦ Binary treatment, discrete covariates, and saturated models (stratified diff-in-means)
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models
  ◦ Could go wrong in small samples
  ◦ If using nonlinear models, always get effects on the scale of the outcome

58 / 58
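The first "a-ok" case above is easy to verify: with a binary outcome, a binary treatment, and no covariates, the OLS slope on the treatment is exactly the difference in means. A tiny Python check with made-up data:

```python
import numpy as np

# made-up binary outcome and binary treatment
y = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=float)
d = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)

# OLS of y on [1, d]
X = np.column_stack([np.ones_like(d), d])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

diff_means = y[d == 1].mean() - y[d == 0].mean()
# beta[1] and diff_means are both 0.5 here: 0.75 - 0.25
```

This identity is algebraic, not asymptotic, which is why no distributional assumption is needed in this case.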

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 44: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

Other ways to use regression

bull Whatrsquos the path forward Accept the bias (might be relatively small with saturated

models) Use a different regression approach

bull Let 120583119889(119909) = 120124[119884119894(119889)|119883119894 = 119909] be the CEF for the potentialoutcome under 119863119894 = 119889

bull By consistency and nuc we have 120583119889(119909) = 120124[119884119894|119863119894 = 119889119883119894 = 119909]bull Estimate a regression of 119884119894 on 119883119894 among the 119863119894 = 119889 groupbull Then 119889(119909) is just a predicted value from the regression for

119883119894 = 119909bull How can we use this

44 58

Imputation estimators

bull Impute the treated potential outcomes with 1114023119884119894(1) = 1113568(119883119894)bull Impute the control potential outcomes with 1114023119884119894(0) = 1113567(119883119894)bull Procedure

Regress 119884119894 on 119883119894 in the treated group and get predicted valuesfor all units (treated or control)

Regress 119884119894 on 119883119894 in the control group and get predicted valuesfor all units (treated or control)

Take the average difference between these predicted values

bull More mathematically look like this

120591119894119898119901 =1119873

11140121198941113568(119883119894) minus 1113567(119883119894)

bull Sometimes called an imputation estimator

45 58

Simple imputation estimator

bull Use predict() from the within-group models on the data fromthe entire sample

bull Useful trick use a model on the entire data andmodelframe() to get the right design matrix

heterogeneous effects

yhet lt- ifelse(d == 1 y + rnorm(n 0 5) y)

mod lt- lm(yhet ~ d + X)

mod1 lt- lm(yhet ~ X subset = d == 1)

mod0 lt- lm(yhet ~ X subset = d == 0)

y1imps lt- predict(mod1 modelframe(mod))

y0imps lt- predict(mod0 modelframe(mod))

mean(y1imps - y0imps)

[1] 061

46 58

Notes on imputation estimators

bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE

bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS

bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods

such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)

47 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

48 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

49 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

50 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

51 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

52 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

53 58

Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible

estimatelibrary(mgcv)

mod0 lt- gam(y ~ s(x) subset = d == 0)

summary(mod0)

Family gaussian

Link function identity

Formula

y ~ s(x)

Parametric coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) -00225 00154 -146 016

Approximate significance of smooth terms

edf Refdf F p-value

s(x) 603 708 413 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

R-sq(adj) = 0909 Deviance explained = 928

GCV = 00093204 Scale est = 00071351 n = 30

54 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

55 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

56 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

57 58

Limited dependent variables

bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc

bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models

(stratified diff-in-means)

bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong

bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models

Could go wrong in small samples If using nonlinear models always get effects on the scale of the

outcome

58 58

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 45: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

Imputation estimators

bull Impute the treated potential outcomes with 1114023119884119894(1) = 1113568(119883119894)bull Impute the control potential outcomes with 1114023119884119894(0) = 1113567(119883119894)bull Procedure

Regress 119884119894 on 119883119894 in the treated group and get predicted valuesfor all units (treated or control)

Regress 119884119894 on 119883119894 in the control group and get predicted valuesfor all units (treated or control)

Take the average difference between these predicted values

bull More mathematically look like this

120591119894119898119901 =1119873

11140121198941113568(119883119894) minus 1113567(119883119894)

bull Sometimes called an imputation estimator

45 58

Simple imputation estimator

bull Use predict() from the within-group models on the data fromthe entire sample

bull Useful trick use a model on the entire data andmodelframe() to get the right design matrix

heterogeneous effects

yhet lt- ifelse(d == 1 y + rnorm(n 0 5) y)

mod lt- lm(yhet ~ d + X)

mod1 lt- lm(yhet ~ X subset = d == 1)

mod0 lt- lm(yhet ~ X subset = d == 0)

y1imps lt- predict(mod1 modelframe(mod))

y0imps lt- predict(mod0 modelframe(mod))

mean(y1imps - y0imps)

[1] 061

46 58

Notes on imputation estimators

bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE

bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS

bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods

such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)

47 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

48 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

49 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

50 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

51 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

52 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

53 58

Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible

estimatelibrary(mgcv)

mod0 lt- gam(y ~ s(x) subset = d == 0)

summary(mod0)

Family gaussian

Link function identity

Formula

y ~ s(x)

Parametric coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) -00225 00154 -146 016

Approximate significance of smooth terms

edf Refdf F p-value

s(x) 603 708 413 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

R-sq(adj) = 0909 Deviance explained = 928

GCV = 00093204 Scale est = 00071351 n = 30

54 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

55 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

56 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

57 58

Limited dependent variables

bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc

bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models

(stratified diff-in-means)

bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong

bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models

Could go wrong in small samples If using nonlinear models always get effects on the scale of the

outcome

58 58

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 46: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

Simple imputation estimator

bull Use predict() from the within-group models on the data fromthe entire sample

bull Useful trick use a model on the entire data andmodelframe() to get the right design matrix

heterogeneous effects

yhet lt- ifelse(d == 1 y + rnorm(n 0 5) y)

mod lt- lm(yhet ~ d + X)

mod1 lt- lm(yhet ~ X subset = d == 1)

mod0 lt- lm(yhet ~ X subset = d == 0)

y1imps lt- predict(mod1 modelframe(mod))

y0imps lt- predict(mod0 modelframe(mod))

mean(y1imps - y0imps)

[1] 061

46 58

Notes on imputation estimators

bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE

bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS

bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods

such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)

47 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

48 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

49 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

50 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

51 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

52 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

53 58

Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible

estimatelibrary(mgcv)

mod0 lt- gam(y ~ s(x) subset = d == 0)

summary(mod0)

Family gaussian

Link function identity

Formula

y ~ s(x)

Parametric coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) -00225 00154 -146 016

Approximate significance of smooth terms

edf Refdf F p-value

s(x) 603 708 413 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

R-sq(adj) = 0909 Deviance explained = 928

GCV = 00093204 Scale est = 00071351 n = 30

54 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

55 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

56 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

57 58

Limited dependent variables

bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc

bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models

(stratified diff-in-means)

bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong

bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models

Could go wrong in small samples If using nonlinear models always get effects on the scale of the

outcome

58 58

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 47: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

Notes on imputation estimators

bull If 119889(119909) are consistent estimators then 120591119894119898119901 is consistent forthe ATE

bull Why donrsquot people use this Most people donrsquot know the results wersquove been talking about Harder to implement than vanilla OLS

bull Can use linear regression to estimate 119889(119909) = 119909prime120573119889bull Recent trend is to estimate 119889(119909) via non-parametric methods

such as Kernel regression local linear regression regression trees etc Easiest is generalized additive models (GAMs)

47 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

48 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

49 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

50 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

51 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

52 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

53 58

Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible

estimatelibrary(mgcv)

mod0 lt- gam(y ~ s(x) subset = d == 0)

summary(mod0)

Family gaussian

Link function identity

Formula

y ~ s(x)

Parametric coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) -00225 00154 -146 016

Approximate significance of smooth terms

edf Refdf F p-value

s(x) 603 708 413 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

R-sq(adj) = 0909 Deviance explained = 928

GCV = 00093204 Scale est = 00071351 n = 30

54 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

55 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

56 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

57 58

Limited dependent variables

bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc

bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models

(stratified diff-in-means)

bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong

bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models

Could go wrong in small samples If using nonlinear models always get effects on the scale of the

outcome

58 58

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 48: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

48 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

49 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

50 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

51 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

52 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

53 58

Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible

estimatelibrary(mgcv)

mod0 lt- gam(y ~ s(x) subset = d == 0)

summary(mod0)

Family gaussian

Link function identity

Formula

y ~ s(x)

Parametric coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) -00225 00154 -146 016

Approximate significance of smooth terms

edf Refdf F p-value

s(x) 603 708 413 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

R-sq(adj) = 0909 Deviance explained = 928

GCV = 00093204 Scale est = 00071351 n = 30

54 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

55 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

56 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

57 58

Limited dependent variables

bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc

bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models

(stratified diff-in-means)

bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong

bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models

Could go wrong in small samples If using nonlinear models always get effects on the scale of the

outcome

58 58

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 49: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

49 58

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

50 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

51 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

52 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

53 58

Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible

estimatelibrary(mgcv)

mod0 lt- gam(y ~ s(x) subset = d == 0)

summary(mod0)

Family gaussian

Link function identity

Formula

y ~ s(x)

Parametric coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) -00225 00154 -146 016

Approximate significance of smooth terms

edf Refdf F p-value

s(x) 603 708 413 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

R-sq(adj) = 0909 Deviance explained = 928

GCV = 00093204 Scale est = 00071351 n = 30

54 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

55 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

56 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

57 58

Limited dependent variables

bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc

bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models

(stratified diff-in-means)

bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong

bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models

Could go wrong in small samples If using nonlinear models always get effects on the scale of the

outcome

58 58

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 50: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

Imputation estimator visualization

-2 -1 0 1 2 3

-05

00

05

10

15

x

y

TreatedControl

50 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

51 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

52 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

53 58

Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible

estimatelibrary(mgcv)

mod0 lt- gam(y ~ s(x) subset = d == 0)

summary(mod0)

Family gaussian

Link function identity

Formula

y ~ s(x)

Parametric coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) -00225 00154 -146 016

Approximate significance of smooth terms

edf Refdf F p-value

s(x) 603 708 413 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

R-sq(adj) = 0909 Deviance explained = 928

GCV = 00093204 Scale est = 00071351 n = 30

54 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

55 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

56 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

57 58

Limited dependent variables

bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc

bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models

(stratified diff-in-means)

bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong

bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models

Could go wrong in small samples If using nonlinear models always get effects on the scale of the

outcome

58 58

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 51: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

51 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

52 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

53 58

Using semiparametric regressionbull Here CEFs are nonlinear but we donrsquot know their formbull We can use GAMs from the mgcv package to for flexible

estimatelibrary(mgcv)

mod0 lt- gam(y ~ s(x) subset = d == 0)

summary(mod0)

Family gaussian

Link function identity

Formula

y ~ s(x)

Parametric coefficients

Estimate Std Error t value Pr(gt|t|)

(Intercept) -00225 00154 -146 016

Approximate significance of smooth terms

edf Refdf F p-value

s(x) 603 708 413 lt2e-16

---

Signif codes 0 rsquorsquo 0001 rsquorsquo 001 rsquorsquo 005 rsquorsquo 01 rsquo rsquo 1

R-sq(adj) = 0909 Deviance explained = 928

GCV = 00093204 Scale est = 00071351 n = 30

54 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

55 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

56 58

Using GAMs

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

57 58

Limited dependent variables

bull Usual advice model the data from first principles Logitprobit for binary Poisson for counts etc

bull OLS is a-ok with limited DVs when Binary treatment and no covariates (just diff-in-means) Binary treatment discrete covariates and saturated models

(stratified diff-in-means)

bull Imposing a model on LDVs in this case imposes adistributional assumption which could be wrong

bull Even in unsaturated models the marginal effect from OLSoften decent compared to nonlinear models

Could go wrong in small samples If using nonlinear models always get effects on the scale of the

outcome

58 58

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects
Page 52: Gov 2002: 7. Regression and Causalityโ€ข Define linear regression: =argmin ๐”ผ[( ๐‘–โˆ’ โ€ฒ ) ] โ€ข The solution to this is the following: =๐”ผ[ ๐‘– โ€ฒ]โˆ’ ๐”ผ[ ๐‘– ๐‘–] โ€ข

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

52 58

Nonlinear relationshipsbull Same idea but with nonlinear relationship between 119884119894 and 119883119894

-2 -1 0 1 2

-10

-05

00

05

x

y

TreatedControl

53 58

Using semiparametric regression

• Here, CEFs are nonlinear, but we don't know their form.
• We can use GAMs from the mgcv package for a flexible estimate:

library(mgcv)
mod0 <- gam(y ~ s(x), subset = d == 0)
summary(mod0)

Family: gaussian
Link function: identity

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0225     0.0154   -1.46     0.16

Approximate significance of smooth terms:
      edf Ref.df    F p-value
s(x) 6.03   7.08 41.3  <2e-16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.909   Deviance explained = 92.8%
GCV = 0.0093204  Scale est. = 0.0071351  n = 30

54 58
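The flexible-fit idea behind s(x) can be sketched without mgcv. Below is a minimal Nadaraya-Watson kernel smoother in Python as a stand-in for a GAM smooth (the bandwidth, sample size, and simulated data are illustrative assumptions, and this is a simpler estimator than mgcv's penalized splines):

```python
import numpy as np

def kernel_smoother(x_train, y_train, x_eval, bandwidth=0.3):
    """Nadaraya-Watson kernel regression: weighted average of y_train,
    with Gaussian weights based on distance from each evaluation point."""
    w = np.exp(-0.5 * ((x_eval[:, None] - x_train[None, :]) / bandwidth) ** 2)
    return (w * y_train[None, :]).sum(axis=1) / w.sum(axis=1)

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 200)
y = np.sin(x) + rng.normal(0, 0.1, 200)  # nonlinear CEF plus noise

grid = np.linspace(-2, 2, 50)
fit = kernel_smoother(x, y, grid)  # recovers sin(x) without specifying its form
```

As with a GAM, nothing about the shape of the CEF is specified in advance; the smoother lets the data determine it, at the cost of choosing a bandwidth (mgcv instead chooses smoothness by GCV, as in the output above).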

Using GAMs

[Figure: GAM fit(s) overlaid on the scatterplot of y against x, Treated and Control groups]

55 58

Using GAMs

[Figure: GAM fit(s) overlaid on the scatterplot of y against x, Treated and Control groups]

56 58

Using GAMs

[Figure: GAM fit(s) overlaid on the scatterplot of y against x, Treated and Control groups]

57 58

Limited dependent variables

• Usual advice: model the data from first principles. Logit/probit for binary outcomes, Poisson for counts, etc.
• OLS is a-ok with limited DVs when:
  ▶ Binary treatment and no covariates (just diff-in-means)
  ▶ Binary treatment, discrete covariates, and saturated models (stratified diff-in-means)
• Imposing a model on LDVs in this case imposes a distributional assumption, which could be wrong.
• Even in unsaturated models, the marginal effect from OLS is often decent compared to nonlinear models.
  ▶ Could go wrong in small samples.
  ▶ If using nonlinear models, always get effects on the scale of the outcome.

58 58
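The first "OLS is a-ok" case can be verified directly: with a binary treatment and no covariates, the OLS coefficient on the treatment equals the difference in means exactly. A small Python check (the simulated data here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d = rng.integers(0, 2, 100)                 # binary treatment
y = 1.0 + 2.0 * d + rng.normal(0, 1, 100)   # outcome

# OLS of y on d with an intercept
X = np.column_stack([np.ones(100), d])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

diff_in_means = y[d == 1].mean() - y[d == 0].mean()
# The coefficient on d matches the difference in means (up to float error)
assert np.isclose(beta[1], diff_in_means)
```

This is why no distributional assumption is being imposed in the saturated cases: OLS is just reproducing (stratified) differences in means, whatever the outcome's distribution.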

  • Agnostic Regression
  • Regression and Causality
  • Regression with Heterogeneous Treatment Effects