EPSE 581C: Causal Inference for Applied Researchers

Ed Kroc

University of British Columbia

[email protected]

May 22, 2019

Ed Kroc (UBC) Causal Inference May 22, 2019 1 / 48

Last time

Model misspecification and (some of) its effects


Today

More model misspecification and (some of) its effects

Consistency and unbiasedness of estimators


Regression Discontinuity (RD) design

Suppose our data look like this:


Regression discontinuity design

Estimation:

It would be unreasonable to assume equal slopes on both sides of the threshold. Thus, we may propose the model:

$$Y = \beta_0 + \beta_T T + \beta_X X + \beta_{TX}\, T \cdot X + \delta.$$

Under this specification, our estimate of the ACE is:

$$\widehat{\mathrm{ACE}}(X = 2) = \hat{E}(Y(1) \mid X = 2) - \hat{E}(Y(0) \mid X = 2) = \hat\beta_T + 2\hat\beta_{TX}$$
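The expression for the ACE follows from evaluating the fitted model at T = 1 and at T = 0 with X held at 2. A short worked version, using only the model stated above:

```latex
\begin{align*}
\hat{E}(Y(1) \mid X = 2) &= \hat\beta_0 + \hat\beta_T + 2\hat\beta_X + 2\hat\beta_{TX}, \\
\hat{E}(Y(0) \mid X = 2) &= \hat\beta_0 + 2\hat\beta_X, \\
\widehat{\mathrm{ACE}}(X = 2) &= \hat\beta_T + 2\hat\beta_{TX}.
\end{align*}
```

The intercept and the main effect of X cancel in the difference, which is why only the two treatment-related coefficients remain.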

But what if we misspecified the model by assuming equal slopes on both sides of the threshold?

This would produce a case of model misspecification.


Model misspecification

Broadly construed, there are three main types of model misspecification:

(1) Misspecification of the random error structure.

Heteroskedasticity of errors

Autocorrelation of errors (response)

(2) Misspecification of the “link” function.

Severe lack of normality of errors

(3) Misspecification of the covariate structure.

Misspecified functional form for covariates

Omitted covariates

All three of these issues are common to all forms of regression analysis (including factor analysis, SEMs, mixed effects modelling, etc.)

In practice, (3) can be very difficult to detect and to properly correct for. Unfortunately, (3) is also the most important case.


Regression assumptions: descriptive/predictive vs. causal

The best way to check for violations of any of the regression assumptions is by examining residual plots (or standardized residual plots for GLMs).

One should always plot residuals vs. fitted values, and residuals vs. each predictor (even this is not sufficient to detect violations).

If all assumptions are satisfied, then all residual plots should look something like a random blob:




If errors autocorrelate, residual vs. fitted plot may look like:




If errors have unequal variances, then residual vs. fitted plot may look like:




If the functional form of the predictors is misspecified, then residuals vs. fitted plot may look like:


Model misspecification: misspecified covariate function

(3) Misspecification of the covariate structure.

Must respecify the functional form of the model.

Usually diagnosable by looking at residual plots and/or examining the raw data, but this is rarely trivial.

Moreover, a better functional form may be too complicated to reasonably estimate given the amount of data we have.

We are taught to err on the side of simplicity in explanatory/predictive inference,

... but for causal inference, this issue cannot be downplayed.
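A residual check of this kind can also be done numerically. The following is a minimal sketch on hypothetical simulated data (the quadratic truth, the noise level, and all variable names are assumptions for illustration): a straight line is fitted to a quadratic relationship, and the mean residual is computed within the low, middle, and high thirds of the fitted values. A "random blob" would give three means near zero; a misspecified functional form tends to leave a systematic pattern.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: the true relationship is quadratic, but we fit a straight line.
n = 500
x = rng.uniform(0.0, 1.0, size=n)
y = 1.0 + 2.0 * x**2 + rng.normal(0.0, 0.1, size=n)

# Misspecified model: intercept plus a linear term in x.
design = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
fitted = design @ coef
resid = y - fitted

# Mean residual within each third of the fitted values (low, middle, high).
order = np.argsort(fitted)
low, mid, high = (resid[idx].mean() for idx in np.array_split(order, 3))
print(f"mean residual: low={low:.3f}, middle={mid:.3f}, high={high:.3f}")
```

Here the residuals should be positive at both ends and negative in the middle, the numerical counterpart of a U-shaped residuals vs. fitted plot.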






Suppose we propose the misspecified model for our example data in the previous diagram:

$$Y = \beta_0 + \beta_T T + \beta_X X + \delta'.$$

$$\widehat{\mathrm{ACE}}_{\mathrm{wrong}}(X = 2) = \hat{E}(Y(1) \mid X = 2) - \hat{E}(Y(0) \mid X = 2) = \hat\beta'_T$$

However, we know that the more appropriate estimate from the properly specified model is

$$\widehat{\mathrm{ACE}}_{\mathrm{right}}(X = 2) = \hat\beta_T + 2\hat\beta_{TX}.$$



Misspecified regression model in orange:



Thus, the estimate from our misspecified model is off by:

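$$\widehat{\mathrm{ACE}}_{\mathrm{right}}(X = 2) - \widehat{\mathrm{ACE}}_{\mathrm{wrong}}(X = 2) = \hat\beta_T + 2\hat\beta_{TX} - \hat\beta'_T$$

It is very important to notice that

$$\hat\beta_T \neq \hat\beta'_T$$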

This is because our estimates depend on the model specification.
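To see how far apart the two estimates can be, here is a minimal simulation sketch. All numerical values (coefficients, noise level, the range of X, the sample size) are hypothetical choices for illustration, not the data from the slides; the threshold is placed at X = 2 to match the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data-generating values (assumptions for illustration only).
n = 2000
threshold = 2.0
x = rng.uniform(0.0, 3.0, size=n)
t = (x >= threshold).astype(float)        # treatment assigned by the threshold
b0, bT, bX, bTX = 1.0, 0.5, 1.0, 0.8
y = b0 + bT * t + bX * x + bTX * t * x + rng.normal(0.0, 0.2, size=n)

def ols(columns, response):
    """Least-squares coefficients for an intercept plus the given columns."""
    design = np.column_stack([np.ones(len(response))] + list(columns))
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef

# Properly specified model: separate slopes on each side of the threshold.
_, bT_hat, _, bTX_hat = ols([t, x, t * x], y)
ace_right = bT_hat + threshold * bTX_hat

# Misspecified model: equal slopes, so the ACE estimate is just the T coefficient.
_, bT_prime_hat, _ = ols([t, x], y)
ace_wrong = bT_prime_hat

true_ace = bT + threshold * bTX
print(f"true ACE(X=2)        = {true_ace:.3f}")
print(f"right model: bT_hat  = {bT_hat:.3f}, ACE = {ace_right:.3f}")
print(f"wrong model: bT'_hat = {bT_prime_hat:.3f}, ACE = {ace_wrong:.3f}")
```

In this sketch the coefficient on T changes substantially between the two fits, even though the data are identical: the coefficient only has meaning relative to the model it sits in.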



Recall: we proposed the misspecified model for our example data:

$$Y = \beta_0 + \beta_T T + \beta_X X + \delta'.$$

In actuality, the true model is:

$$Y = \beta_0 + \beta_T T + \beta_X X + \beta_{TX}\, T \cdot X + \delta.$$

So the error in the misspecified model, $\delta'$, does not satisfy the necessary assumptions of the regression framework. In particular:

$$\delta' = \beta_{TX}\, T \cdot X + \delta,$$

so $\delta'$ is confounded with T and X; i.e. $\delta'$ is not independent of T or X.



Under our model misspecification, we know that a term is missing from our model; i.e. the interaction of T and X is absorbed into the error term:

$$\delta' = \beta_{TX}\, T \cdot X + \delta,$$

Thus, $\mathrm{Cov}(\delta', T) \neq 0$, and so

$$\beta_T = \frac{\mathrm{Cov}(Y, T) - \mathrm{Cov}(\delta', T)}{\mathrm{Var}(T)}$$

However, our standard regression estimators assume that there are no violations of assumptions; thus, our actual estimate is:

$$\hat\beta'_T = \frac{\widehat{\mathrm{Cov}}(Y, T)}{\widehat{\mathrm{Var}}(T)} = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(t_i - \bar{t})}{\sum_{i=1}^{n} (t_i - \bar{t})^2}$$



We know that the actual population parameter we are interested in is $\beta_T$ from the correctly specified model:

$$Y = \beta_0 + \beta_T T + \beta_X X + \beta_{TX}\, T \cdot X + \delta$$

Doing the same covariance algebra as before, and noting that all regression assumptions are (mostly) satisfied since the model is properly specified, we find

$$\beta_T = \frac{\mathrm{Cov}(Y, T) - \beta_{TX}\,\mathrm{Cov}(TX, T)}{\mathrm{Var}(T)} = \frac{\mathrm{Cov}(Y, T)}{\mathrm{Var}(T)} - \beta_{TX}\, E(X).$$

But using the misspecified model, we do not estimate this! Instead, we estimate only the first term:

$$\beta'_T = \frac{\mathrm{Cov}(Y, T)}{\mathrm{Var}(T)}$$
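The step that replaces Cov(TX, T) / Var(T) by E(X) is worth spelling out. A sketch of the algebra, under the added assumption (not stated explicitly above) that T is independent of X, as it would be if T were randomized:

```latex
\begin{align*}
\mathrm{Cov}(TX, T) &= E(T^2 X) - E(TX)\,E(T) \\
                    &= E(X)\,E(T^2) - E(X)\,E(T)^2 && \text{(assuming } T \perp X\text{)} \\
                    &= E(X)\,\mathrm{Var}(T).
\end{align*}
```

Under the same assumption Cov(X, T) = 0, which is why no $\beta_X$ term appears in the covariance algebra. Comparing the two population quantities then gives $\beta'_T - \beta_T = \beta_{TX}\,E(X)$: the misspecified coefficient is off by the omitted interaction coefficient times the mean of X. When T depends on X, this independence is at best approximate and the expression for the gap changes accordingly.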


Model misspecification: Ex. 1

Misspecified model on LEFT; properly specified model on RIGHT:

[Figure: two scatterplots of y against x (x from 0 to 1, y from 0 to 2) with fitted regression models; misspecified model on the left, properly specified model on the right.]




[Figure: residuals vs. fitted values. Left panel: residuals(mod.w) against fitted(mod.w). Right panel: residuals(mod.r) against fitted(mod.r).]

Clear evidence of model misspecification in residuals vs. fitted plot!




[Figure: the two scatterplots of y against x (x from 0 to 1, y from 0 to 2), shown again.]

$$\widehat{\mathrm{ACE}}_{\mathrm{wrong}}(X = 0.5) = \hat\beta'_T = 0.515$$

$$\widehat{\mathrm{ACE}}_{\mathrm{right}}(X = 0.5) = \hat\beta_T + 0.5\,\hat\beta_{TX} = 0.008 + 0.5 \times 0.999 = 0.508$$

Not too bad... but what if the misspecification was worse?




[Figure: two scatterplots of y against x for a second example (x from 0 to 1, y from -2.5 to 0) with fitted regression models.]




[Figure: residuals vs. fitted values. Left panel: residuals(mod.w2) against fitted(mod.w2). Right panel: residuals(mod.r) against fitted(mod.r).]

Clear evidence of model misspecification in residuals vs. fitted plot!




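[Figure: the two scatterplots of y against x for the second example (x from 0 to 1, y from -2.5 to 0), shown again.]

$$\widehat{\mathrm{ACE}}_{\mathrm{wrong}}(X = 0.5) = \hat\beta'_T = -0.152$$

$$\widehat{\mathrm{ACE}}_{\mathrm{right}}(X = 0.5) = \hat\beta_T + 0.5\,\hat\beta_{TX} + (0.5)^2\,\hat\beta_{TX^2} = -0.527$$

The ACE estimate from the misspecified model is more than 3 times too small in magnitude.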


Model misspecification: ignore fit statistics

Notice: fit statistics are useless here.

That is, misspecified models can still “fit” the data very well.

Good enough for explanatory modelling.

Not good enough for causal modelling!

Ignore all fit statistics when performing causal modelling, including:

Goodness-of-fit F-tests

$R^2$ statistics

Information criterion statistics (AIC, BIC, DIC, etc.)
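As a small illustration of why a good fit says little about the causal estimate, the sketch below fits a misspecified equal-slopes model to hypothetical simulated data (all numerical values and names are assumptions for illustration) and reports both its $R^2$ and its ACE estimate at the threshold.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data-generating values (assumptions for illustration only).
n = 2000
threshold = 2.0
x = rng.uniform(0.0, 3.0, size=n)
t = (x >= threshold).astype(float)
y = 1.0 + 0.5 * t + 1.0 * x + 0.8 * t * x + rng.normal(0.0, 0.2, size=n)
true_ace = 0.5 + threshold * 0.8          # ACE at the threshold under these values

# Misspecified model: equal slopes on both sides of the threshold.
design = np.column_stack([np.ones(n), t, x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
resid = y - design @ coef

# R^2 = 1 - SSR/SST (residuals have mean zero because the model has an intercept).
r_squared = 1.0 - resid.var() / y.var()
ace_wrong = coef[1]

print(f"R^2 of misspecified model = {r_squared:.3f}")
print(f"ACE estimate = {ace_wrong:.3f} vs true ACE = {true_ace:.3f}")
```

In this setup the misspecified model explains nearly all of the variance in y, yet its ACE estimate is visibly off.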


Model misspecification: ignore statistical significance

Notice: statistical significance of model coefficient estimates is irrelevant here.

Recall numerical Ex. 1:

All estimates significant for misspecified model

In properly specified model, intercept ($\hat\beta_0$) and marginal treatment ($\hat\beta_T$) estimates not statistically significant.

Recall numerical Ex. 2:

All estimates significant for misspecified model

In properly specified model, intercept ($\hat\beta_0$) and marginal first-order treatment ($\hat\beta_T$) estimates not statistically significant.


Model misspecification: bigger sample size will never fix the problem

It is common “wisdom” that the more data you have, the better you will be able to quantify your effects of interest.

This is true for explanatory/descriptive and predictive modelling, but false for causal modelling.



There are two extremely important and desirable properties we usually like our estimators to have:

Consistency

Unbiasedness

Other properties are also often desirable (e.g. asymptotic normality), but consistency and unbiasedness are by far the most important.


Unbiasedness of estimators

Generally, an estimator $\hat\theta$ for some population parameter $\theta$ of a random variable of interest X is called unbiased if:

$$E(\hat\theta) = \theta$$

In words, an estimator is unbiased for its estimand (what it is trying to estimate) if, on average, the estimator equals the estimand.

Example: In a random sample, the sample mean, $\hat\theta = \frac{1}{n}\sum_{i=1}^{n} X_i$, is an unbiased estimator of the population mean, $\theta = E(X)$:

$$E\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \frac{1}{n}\sum_{i=1}^{n} E(X) = \frac{n\,E(X)}{n} = E(X)$$


Consistency of estimators

Generally, an estimator $\hat\theta$ is called consistent if, as the sample size increases without bound, the sample value of $\hat\theta$ approaches a single number, a:

$$\text{for all } \varepsilon > 0, \quad \lim_{n \to \infty} \Pr(|\hat\theta - a| > \varepsilon \mid S_n) = 0,$$

where $S_n$ denotes a random sample of size n.

If an estimator is both unbiased and consistent, then not only does its average value equal the true estimand of interest, but as we increase the sample size, the estimator becomes more and more precise about this true value.

That is, such an estimator is both accurate and precise as sample size increases.



It is entirely possible that an estimator is consistent but biased; e.g. the unadjusted sample variance:

$$\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

It is also entirely possible that an estimator is unbiased but inconsistent; e.g. using the average of the sample min and max to estimate the mean of a symmetric population:

$$\frac{\max\{x_i : 1 \le i \le n\} + \min\{x_i : 1 \le i \le n\}}{2}$$

Estimators can also be neither unbiased nor consistent. Very bad!

Also, some estimators are asymptotically unbiased and consistent.
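The unadjusted sample variance is a convenient case to check numerically: its expected value is $(n-1)/n$ times the population variance, so the bias is noticeable for small n and vanishes as n grows. A minimal simulation sketch, where the population variance, the sample sizes, and the number of replications are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 4.0                  # hypothetical population variance
reps = 5000                   # simulated samples per sample size

mean_unadjusted = {}
mean_adjusted = {}
for n in (5, 50, 500):
    samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
    mean_unadjusted[n] = samples.var(axis=1, ddof=0).mean()   # divides by n
    mean_adjusted[n] = samples.var(axis=1, ddof=1).mean()     # divides by n - 1
    print(f"n = {n:>3}: average unadjusted = {mean_unadjusted[n]:.3f}, "
          f"average adjusted = {mean_adjusted[n]:.3f}, "
          f"(n-1)/n * sigma^2 = {(n - 1) / n * sigma2:.3f}")
```

The average of the unadjusted estimator tracks $(n-1)/n \cdot \sigma^2$ rather than $\sigma^2$, while the adjusted version stays near $\sigma^2$ at every n.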


Consistency and unbiasedness of estimators: Example

The sample mean is an unbiased and consistent estimator of the population mean (for population random variables with finite mean):

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

The average of the sample extremes is an unbiased but inconsistent estimator of the population mean:

$$\mathrm{Avg}(\min, \max) := \frac{\max\{x_i : 1 \le i \le n\} + \min\{x_i : 1 \le i \le n\}}{2}$$



Example: Suppose we have 30 observations from a normally distributed population, $X \sim N(3.3, 1)$.

These observations generate the two sample statistics:

$$\bar{X} = 3.39, \qquad \mathrm{Avg}(\min, \max) = 3.28$$

Both seem pretty good. This is no accident either.



[Figure: two histograms (frequency). Left: "Sampling Distribution of the Sample Mean". Right: "Sampling Distribution of the Sample Average Spread".]

Simulated 1000 draws of 30 observations from X to create these (estimated) sampling distributions.

Both estimators unbiased, but average of extremes is not very precise...


[Figure: two histograms (frequency). Left: "Histogram of avg" (sample mean). Right: "Histogram of rng" (average of extremes).]

Increased sample size: simulated 1000 draws of 100 observations from X to create these new (estimated) sampling distributions.

Notice: sample mean gets more precise with larger sample size, but sample average of extremes does not.



[Figure: two histograms (frequency). Left: "Histogram of avg" (sample mean). Right: "Histogram of rng" (average of extremes).]

Increased sample size: simulated 1000 draws of 1000 observations from X to create these new (estimated) sampling distributions.




[Figure: two histograms (frequency). Left: "Histogram of avg" (sample mean). Right: "Histogram of rng" (average of extremes).]

Increased sample size: simulated 1000 draws of 10,000 observations from X to create these new (estimated) sampling distributions.

Observe consistency of sample mean, inconsistency of sample average of extremes.
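The simulation described above can be sketched as follows. The population N(3.3, 1), the 1000 draws, and the four sample sizes come from the example; the seed and variable names are arbitrary. Instead of histograms, the sketch reports the standard deviation of each estimator across draws.

```python
import numpy as np

rng = np.random.default_rng(8)
mu, sigma, draws = 3.3, 1.0, 1000

sd_mean, sd_extremes = {}, {}
for n in (30, 100, 1000, 10_000):
    samples = rng.normal(mu, sigma, size=(draws, n))
    means = samples.mean(axis=1)                                    # sample mean
    extremes = (samples.max(axis=1) + samples.min(axis=1)) / 2.0    # average of min and max
    sd_mean[n] = means.std()
    sd_extremes[n] = extremes.std()
    print(f"n = {n:>6}: mean of means = {means.mean():.3f} (sd {sd_mean[n]:.3f}), "
          f"mean of avg extremes = {extremes.mean():.3f} (sd {sd_extremes[n]:.3f})")
```

Both estimators centre on 3.3, but only the sample mean tightens appreciably as n grows. For a normal population the spread of the average of extremes does shrink a little, only very slowly (roughly logarithmically in n), so at any practical sample size it behaves like an inconsistent estimator.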



When all the usual regression assumptions hold, the standard estimators for the model coefficients (e.g. maximum likelihood or ordinary least squares estimators) are consistent and unbiased for the true population values of those parameters.

However, when the regression model is misspecified, the estimators are still consistent (they converge to a single value), but they are no longer unbiased for the true parameter values. Moreover, they are not even asymptotically unbiased.

White (1982), Econometrica: MLEs of regression coefficients will approach the values that minimize the Kullback-Leibler divergence between the specified model and the true model.



UPSHOT: if the functional form of your model is misspecified, and/or if you are missing important covariates, it doesn’t matter how much data you have: your estimates will always be wrong, even if they are very precise.

This is a HUGE problem for causal inference.

It is common “wisdom” that the more data you have, the better you will be able to quantify your effects of interest; this is false when performing model-based causal inference.
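A minimal sketch of this point, on hypothetical simulated data (the coefficients, noise level, and sample sizes are assumptions for illustration): the misspecified equal-slopes model is refitted at increasing sample sizes, and its ACE estimate is compared with the true ACE at the threshold.

```python
import numpy as np

rng = np.random.default_rng(5)
threshold = 2.0
true_ace = 0.5 + threshold * 0.8          # hypothetical true ACE at the threshold

def wrong_ace(n):
    """ACE estimate from the equal-slopes (misspecified) model on n simulated points."""
    x = rng.uniform(0.0, 3.0, size=n)
    t = (x >= threshold).astype(float)
    y = 1.0 + 0.5 * t + 1.0 * x + 0.8 * t * x + rng.normal(0.0, 0.2, size=n)
    design = np.column_stack([np.ones(n), t, x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

estimates = {n: wrong_ace(n) for n in (200, 5_000, 200_000)}
for n, est in estimates.items():
    print(f"n = {n:>7}: misspecified ACE estimate = {est:.3f} (true ACE = {true_ace:.3f})")
```

The estimates settle down as n grows, but they settle on a value that is not the true ACE: more data buys precision about the wrong number.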


Model misspecification: omitted variables

So far, we have only focused on model misspecification where the functional form of the covariates is misspecified, but our models always contained all explanatory variables.

In practical non-experimental research, we will always be missing some confounders; we can’t measure everything, or even know everything we should always be measuring!

Detecting important omitted variables can be very difficult.

Residual plots still the way to go, but they will not always suggest omitted variable bias.

Hence, why the exchangeability of treatment is so important in an RD-design: treatment is “as good as” randomly assigned near the threshold; thus, biasing effects of omitted variables should be negligible (near the threshold).
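To illustrate omitted variable bias outside the RD setting, here is a minimal sketch with a single hypothetical confounder `u` that influences both treatment uptake and the outcome. The effect sizes and the way treatment depends on `u` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical setup: an unmeasured confounder u drives both treatment and outcome.
n = 5000
u = rng.normal(size=n)
t = (u + rng.normal(size=n) > 0).astype(float)    # treatment uptake depends on u
true_effect = 2.0
y = 1.0 + true_effect * t + 3.0 * u + rng.normal(size=n)

def ols(columns, response):
    """Return least-squares coefficients and residuals (intercept included)."""
    design = np.column_stack([np.ones(len(response))] + list(columns))
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef, response - design @ coef

coef_full, _ = ols([t, u], y)             # confounder included
coef_omit, resid_omit = ols([t], y)       # confounder omitted

print(f"treatment coefficient with u included: {coef_full[1]:.3f}")
print(f"treatment coefficient with u omitted:  {coef_omit[1]:.3f}")
print(f"mean residual by group (omitted model): "
      f"{resid_omit[t == 0].mean():.2e}, {resid_omit[t == 1].mean():.2e}")
```

With `u` omitted, the treatment coefficient absorbs part of the confounder's effect. The residuals of the omitted-variable model still average to zero within each treatment group, so a residuals vs. fitted plot alone gives no hint that anything is wrong.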


A return to controlled experiments

Why don’t we hear about these issues (omitted variables, model misspecification) in the context of controlled experiments?

ANSWER: usually, well-controlled experiments bypass these issues by design.

Example: Does an increase in NO2 in native SE BC soil cause Arabidopsis lyrata leaves to grow larger?

3 × 5 factorial design on 90 seeds:

3 levels of NO2: control (native soil), 1.5 times average NO2 concentration, 2 times average NO2 concentration.

5 time points (after sprouting), no repeated measures: 5 days, 10 days, 15 days, 20 days, 25 days.

Outcome measure: length of eighth leaves.



Here, we could propose a full 2-way ANOVA model:

$$\mathrm{Len} = \mu + \tau_{\mathrm{NO_2}} + \tau_{\mathrm{age}} + \tau_{\mathrm{NO_2} \times \mathrm{age}} + \varepsilon,$$

where $\tau_X$ denotes the average treatment effect of X, $\mu$ denotes the grand average length of eighth leaves (over all nitrogen levels and time points), and $\varepsilon$ denotes random error.

Experiment is controlled to fix the values of possible confounders: e.g. humidity, light, water, O2 levels, etc.

Levels of explanatory factors are also fixed; NO2 and age are continuous variables, but experimental control fixes the possible values these variables can assume to finite sets.




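$$\mathrm{Len} = \mu + \tau_{\mathrm{NO_2}} + \tau_{\mathrm{age}} + \tau_{\mathrm{NO_2} \times \mathrm{age}} + \varepsilon.$$

However, suppose there was some unknown confounder V that we didn’t account for: e.g. maybe 10 of the 90 seeds are less viable than the others.

But here, randomization of seeds to experimental treatments (NO2 × age) will likely remove the effect of this confounder:

$$\Pr(\text{seed } i \in \mathrm{NO_2} \times \mathrm{age} \mid V) = \Pr(\text{seed } i \in \mathrm{NO_2} \times \mathrm{age}).$$

Therefore,

$$\Pr(\mathrm{Len} \mid \mathrm{NO_2}, \mathrm{age}, V) = \Pr(\mathrm{Len} \mid \mathrm{NO_2}, \mathrm{age})$$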
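The first equality can be checked by simulating the randomization itself. The sketch below assumes a balanced design (6 seeds in each of the 15 NO2 × age cells, which the slides do not state explicitly) and arbitrarily labels seeds 0 to 9 as the less viable ones; it then estimates the probability of landing in one particular cell for each kind of seed.

```python
import numpy as np

rng = np.random.default_rng(4)

n_seeds, n_cells = 90, 15                 # 3 NO2 levels x 5 ages; assume 6 seeds per cell
less_viable = np.zeros(n_seeds, dtype=bool)
less_viable[:10] = True                   # hypothetical: seeds 0-9 are the less viable ones
cells = np.repeat(np.arange(n_cells), n_seeds // n_cells)

reps = 20000
in_cell0 = np.zeros(n_seeds)
for _ in range(reps):
    assignment = rng.permutation(cells)   # one random assignment of seeds to cells
    in_cell0 += (assignment == 0)         # count how often each seed lands in cell 0

p_less = in_cell0[less_viable].mean() / reps
p_viable = in_cell0[~less_viable].mean() / reps
print(f"Pr(cell 0 | less viable) ~ {p_less:.4f}")
print(f"Pr(cell 0 | viable)      ~ {p_viable:.4f}")
print(f"1/15                     = {1 / 15:.4f}")
```

Both estimated probabilities sit near 1/15: under randomization, assignment to a treatment cell carries no information about viability.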




$$\mathrm{Len} = \mu + \tau_{\mathrm{NO_2}} + \tau_{\mathrm{age}} + \tau_{\mathrm{NO_2} \times \mathrm{age}} + \varepsilon.$$

What about misspecifying the functional form of the model?

Not an issue in ANOVA of controlled, randomized experiments.

Notice: ANOVA model does not have to posit an explicit functional form between response and covariates because all covariates are categorized into finitely many, controlled factor levels.

Suppose Len is (positive, concave down) quadratically related to age. Then average treatment effects $\tau_{\mathrm{age}}$ will increase quadratically over the 5 fixed ages since we estimate the average effect for each fixed age.



Contrast with observational protocol: if we cannot control the age of the plants, then we are forced to quantify the average effect of a function of age on response, e.g.

$$\mathrm{Len} = \beta_0 + \tau_{\mathrm{NO_2}} + \beta_{\mathrm{age}} \cdot \mathrm{age} + \beta_{\mathrm{NO_2} \times \mathrm{age}} \cdot \tau_{\mathrm{NO_2}} \cdot \mathrm{age} + \varepsilon$$

Such a regression model assumes a linear relationship between age and response.

But since we have no control over age, we are forced to model all ages simultaneously; this is much harder to do than to simply calculate the average effect of age on response for a finite, fixed number of age categories.
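The contrast between estimating one mean per fixed age and imposing a straight line can be sketched numerically. The growth curve, the noise level, and the choice to ignore NO2 entirely (18 plants at each of the 5 ages) are simplifying assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(7)

def true_mean_length(age):
    """Hypothetical concave-down quadratic growth curve (an assumption)."""
    return 2.0 + 1.2 * age - 0.02 * age**2

ages = np.array([5.0, 10.0, 15.0, 20.0, 25.0])    # fixed by experimental control
age = np.repeat(ages, 18)                          # 90 plants, 18 per age
length = true_mean_length(age) + rng.normal(0.0, 0.5, size=age.size)

# ANOVA-style estimate: one mean per fixed age, no functional form assumed.
cell_means = np.array([length[age == a].mean() for a in ages])

# Regression-style estimate: a straight line in age (misspecified here).
design = np.column_stack([np.ones(age.size), age])
coef, *_ = np.linalg.lstsq(design, length, rcond=None)
line_pred = coef[0] + coef[1] * ages

truth = true_mean_length(ages)
for a, m, p, tr in zip(ages, cell_means, line_pred, truth):
    print(f"age {a:>4.0f}: truth = {tr:5.2f}, cell mean = {m:5.2f}, linear fit = {p:5.2f}")
```

The per-age means follow the curved truth without ever being told its shape, while the linear fit is systematically off at the youngest, middle, and oldest ages.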



A natural idea may be to simply ad hoc categorize age; i.e. we observe 90 plants in the wild with arbitrary ages, but then categorize age after the fact into 3 categories: 0–9 days, 10–19 days, 20–29 days.

But this only fixes the problem if sample units are exchangeable (over nitrogen treatment and all possible confounders) within each ad hoc age category.

However, is nitrogen level fixed in the wild? Probably not!

And older plants may be exposed to more light and water (otherwise the plants would die before reaching 10 days of age).

Therefore, in order to ensure exchangeability of sample units over treatments, we now have to account for these omitted variables, as well as the functional relationships between them, and between nitrogen... So we are back to our model misspecification issues.


Next time

The Neyman-Rubin causal model

Propensity scores

