Lecture 1: Introduction, Regressions and Causal Inference
January 10, 2016
Introduction Regressions Causal Inference Control Variables Randomized Experiments
Introduction
From PPG1004H you should understand the following concepts:
1 Understand means and variances
2 Understand p-values and Hypothesis Tests
3 Underlying Regression Assumptions (and why $E[\varepsilon \mid X] = 0$ is so important)
4 Understand what controls do
5 Fixed-effects
6 Interpreting Regression Coefficients
7 The Basics of Causal Inference
This will be the focus of the next course
Quick Review
p-values
Definition: The p-value is the probability of getting results at least as extreme as the ones you observed, given that the null hypothesis is correct
It can’t tell you the magnitude of an effect, the strength of the evidence or the probability that the finding was the result of chance.
Layman’s explanation: You suspect a coin is weighted toward heads (therefore set $H_0: p = 0.5$). You flip it 100 times and get more heads than tails. The p-value won’t tell you whether the coin is fair, but it will tell you the probability that you’d get at least as many heads as you did if the coin were fair. That’s it.
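The coin example can be made concrete with a short calculation. The sketch below assumes, purely for illustration, that the 100 flips produced 60 heads (the slide does not give a count) and computes the one-sided p-value under the null of a fair coin. The function name is hypothetical.

```python
from math import comb


def one_sided_p_value(heads: int, flips: int, p_null: float = 0.5) -> float:
    """P(at least `heads` heads in `flips` flips) if the null hypothesis is true."""
    return sum(
        comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)
        for k in range(heads, flips + 1)
    )


# Hypothetical data: 60 heads in 100 flips (an assumed count, not from the lecture).
p_value = one_sided_p_value(60, 100)
print(f"p-value = {p_value:.4f}")
```

A p-value of roughly 0.03 here says only that a fair coin would produce at least 60 heads about 3% of the time. It does not say there is a 3% chance that the coin is fair.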
Quick Review
P-value and Economic vs. Statistical Significance
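Statistical Significance: If the p-value < 0.05, your result is statistically significant at the conventional 5% level
Economic Significance: We could not care less about the p-value; what matters is whether the magnitude of the effect is large enough to be important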
Regressions
Regression: a measure of the relation between the mean value of one variable and corresponding values of other variables
There are many types of regressions (logit, probit, IV)
We focus on Ordinary Least Squares (OLS) regressions
OLS: Minimizes the sum of squared differences between the observed responses and the responses predicted by a linear regression model
$Y_i = \alpha + \beta_1 X_{1i} + \varepsilon_i \implies$ “univariate” regression
$Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i \implies$ “multivariate” regression
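For the one-regressor case, the OLS estimates have a simple closed form: the slope is the sample covariance of X and Y divided by the sample variance of X. The sketch below computes it directly on made-up numbers; the function name and the data are illustrative assumptions, not part of the lecture.

```python
def ols_simple(x, y):
    """Return (alpha_hat, beta_hat) that minimize the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


# Made-up data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
alpha_hat, beta_hat = ols_simple(x, y)
print(f"alpha_hat = {alpha_hat:.2f}, beta_hat = {beta_hat:.2f}")
```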
Regressions and STATA
A regression equation tells you what to write in STATA
$Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$
In STATA:
Unit of Analysis
Before writing down a regression equation, know the unit of analysis
Often used subscripts: i = individual, t = time, s = school, g = grade, p = province
$Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$
Unit is:
$Y_{pt} = \alpha + \beta_1 X_{1pt} + \beta_2 X_{2pt} + \varepsilon_{pt}$
Unit is:
Unit of Analysis II
Not all variables need to be at the same unit of analysis
Just the outcome (Y), the regressor of interest and the error term
“Identifying variation” comes from the unit of analysis
$Y_{sgt} = \alpha + \beta_1 X_{1sgt} + \beta_2 X_{2st} + \varepsilon_{sgt}$
The above regression uses variation from:
Underlying Regression Assumptions
The required underlying regression assumptions are (***** = IMPORTANT):
1 Correct specification: $Y_i = \alpha + \beta X_i + \varepsilon_i$ (linearity and additivity) ****
2 Exogeneity: $E[\varepsilon \mid X] = 0$ *****
3 No perfect multicollinearity
4 Homoskedasticity: $\mathrm{Var}[\varepsilon \mid X] = \sigma^2$
The other two often used assumptions are:
1 Normality: $\varepsilon \mid X \sim N(0, \sigma^2)$
2 Observations are i.i.d.
Regression Assumption 1
Specification is correct: $Y_i = \alpha + \beta X_i + \varepsilon_i$ (linearity and additivity)
While this assumption is important, we generally have to assume some functional form. If we believe this is wrong we can add interactions or polynomials:
Interaction: $Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 (X_1 \times X_2)_i + \varepsilon_i$
Polynomial: $Y_i = \alpha + \beta_1 X_i + \beta_2 X_i^2 + \varepsilon_i$
Regression Assumption 2
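Exogeneity: $E[\varepsilon \mid X] = 0$. This is BY FAR the most important assumption.
The assumption for the regression equation $Y_i = \alpha + \beta_1 X_{1i} + \varepsilon_i$ is violated when an omitted variable, $X_2$, is correlated with BOTH the outcome and $X_1$. I.e.:
$\mathrm{Corr}(Y, X_2) \neq 0$
$\mathrm{Corr}(X_1, X_2) \neq 0$
If both conditions hold, the estimate of $\beta_1$ is BIASED; if either one fails, omitting $X_2$ does not bias it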
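A small simulation can illustrate the bias from omitting a variable X2 that is correlated with both the outcome and X1. Everything in the sketch is an assumption chosen for illustration: the true coefficients (2 on X1, 3 on X2), the way X1 depends on X2, and the sample size. It compares the slope from the "short" regression of Y on X1 alone with what the standard omitted-variable-bias formula predicts, namely the true coefficient on X1 plus the coefficient on X2 times the slope from regressing X2 on X1.

```python
import random


def slope(x, y):
    """OLS slope from regressing y on x (with an intercept)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    return s_xy / s_xx


random.seed(0)
n = 20_000
true_beta1, true_beta2 = 2.0, 3.0  # assumed values for the simulation

x2 = [random.gauss(0, 1) for _ in range(n)]
x1 = [0.5 * v + random.gauss(0, 1) for v in x2]  # X1 is correlated with X2
y = [
    1.0 + true_beta1 * a + true_beta2 * b + random.gauss(0, 1)
    for a, b in zip(x1, x2)
]

short_slope = slope(x1, y)  # regression that omits X2
delta = slope(x1, x2)       # slope from regressing X2 on X1
predicted = true_beta1 + true_beta2 * delta

print(f"true beta1             = {true_beta1:.2f}")
print(f"estimate omitting X2   = {short_slope:.2f}")
print(f"OVB formula prediction = {predicted:.2f}")
```

In this setup the short regression overstates the effect of X1 because X1 picks up part of the effect of X2.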
Regression Assumption 2 Continued
Exogeneity: $E[\varepsilon \mid X] = 0$.
Much of what empirical economists do is to find a way to make this assumption hold
We do it by shutting down the link between $X_1$ and $X_2$ (i.e. make $\mathrm{Corr}(X_1, X_2) = 0$)
The Other Assumptions
No perfect multicollinearity
Not really a problem; just do not add collinear variables together in a regression (or let STATA solve the problem)
Homoskedasticity: $\mathrm{Var}[\varepsilon \mid X] = \sigma^2$
Not a problem; can allow for heteroskedastic or clustered standard errors easily. In STATA, for heteroskedasticity-robust standard errors add “, r” at the end of the “reg” command
Normality - Does not affect bias, only efficiency of OLS
i.i.d. - Important for serial or auto-correlation of error terms. We can (somewhat) correct for this.
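To see what a heteroskedasticity-robust standard error does, the sketch below computes both the classical and a robust standard error for the slope of a one-regressor OLS model. It uses the HC1 formula (the sandwich estimator with an n/(n-2) correction), which I am assuming is the variant a robust option would report; the simulated data, where the error spread grows with |x|, are made up for illustration.

```python
import random
from math import sqrt


def ols_with_standard_errors(x, y):
    """Return (beta_hat, classical_se, robust_se) for the slope of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    dx = [a - x_bar for a in x]
    s_xx = sum(d * d for d in dx)
    beta = sum(d * (b - y_bar) for d, b in zip(dx, y)) / s_xx
    alpha = y_bar - beta * x_bar
    resid = [b - alpha - beta * a for a, b in zip(x, y)]
    # Classical formula: assumes every error has the same variance.
    classical = sqrt(sum(e * e for e in resid) / (n - 2) / s_xx)
    # HC1 formula: weights each squared residual by its own (x - x_bar)^2.
    robust = sqrt(n / (n - 2) * sum((d * e) ** 2 for d, e in zip(dx, resid)) / s_xx**2)
    return beta, classical, robust


random.seed(2)
x = [random.uniform(-3, 3) for _ in range(5000)]
# Assumed data-generating process: error standard deviation equals |x|.
y = [1.0 + 2.0 * a + random.gauss(0, abs(a)) for a in x]

beta_hat, classical_se, robust_se = ols_with_standard_errors(x, y)
print(f"beta_hat = {beta_hat:.3f}")
print(f"classical SE = {classical_se:.4f}, robust SE = {robust_se:.4f}")
```

The point estimate is the same either way; only the standard error changes. In this simulated example the classical formula understates the uncertainty.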
Introduction
What do you think of these studies? (that we see all the time in newspapers)
1 http://www.nbcnews.com/health/cancer/university-texas-study-links-meat-kidney-cancer-n459811
2 http://hereandnow.wbur.org/2016/01/06/sugar-breast-cancer-study
Correlation and Causation
Correlation:
Causation:
1 http://www.tylervigen.com/spurious-correlations
Causal Inference
For any causal statement you should be able to answer all the following questions:
1 What is the unit of analysis?
2 What is the treatment?
3 What outcome are we interested in?
4 What are the counterfactual outcomes?
5 What is the causal link?
6 How is the counterfactual mimicked? Does this sound reasonable?
Causal Inference
A good way to think about these is to do a thought experiment and think about ‘treated’ and ‘untreated’
Suppose we have two types of people. People A get a drug. People B do not. We are interested in their blood pressure.
What is our treatment?
What is our counterfactual?
Causal Inference
I am going to introduce some math notation here:
The outcome for each treated person is: $Y_{1,i}$
The outcome for each untreated person is: $Y_{0,i}$
What are the outcomes for the treated?
For the untreated?
What is the treatment effect?
What is the selection bias?
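One standard way to organize these questions is the following decomposition. It reads $Y_{1,i}$ and $Y_{0,i}$ as person i's outcome with and without treatment, and it introduces a treatment indicator $D_i$ (equal to 1 for the treated), which is an assumed symbol that does not appear on the slide.

```latex
\begin{align*}
\underbrace{E[Y_{1,i} \mid D_i = 1] - E[Y_{0,i} \mid D_i = 0]}_{\text{observed difference in means}}
&= \underbrace{E[Y_{1,i} \mid D_i = 1] - E[Y_{0,i} \mid D_i = 1]}_{\text{treatment effect on the treated}} \\
&\quad + \underbrace{E[Y_{0,i} \mid D_i = 1] - E[Y_{0,i} \mid D_i = 0]}_{\text{selection bias}}
\end{align*}
```

The middle term $E[Y_{0,i} \mid D_i = 1]$ is the counterfactual: what the treated would have experienced without treatment. It is never observed, which is why the selection bias term cannot be computed directly from the data.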
Thought experiments to regressions
How can we find the difference in outcomes between people A and B?
$\mu_A - \mu_B$, or we can regress:
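One natural regression to write here, offered as an assumption about what the slide leaves blank, is the outcome on a dummy that equals 1 for people A and 0 for people B. The sketch below checks on made-up blood-pressure readings that the OLS coefficient on that dummy equals the difference in group means, and that the intercept equals the mean for people B.

```python
def ols_simple(x, y):
    """Return (alpha_hat, beta_hat) from regressing y on x with an intercept."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    beta_hat = s_xy / s_xx
    return y_bar - beta_hat * x_bar, beta_hat


# Hypothetical blood-pressure readings (not from the lecture).
group_a = [118.0, 122.0, 125.0, 130.0]         # people A: received the drug
group_b = [128.0, 131.0, 135.0, 140.0, 136.0]  # people B: did not

treated = [1.0] * len(group_a) + [0.0] * len(group_b)  # treatment dummy
outcome = group_a + group_b

alpha_hat, beta_hat = ols_simple(treated, outcome)
diff_in_means = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)

print(f"difference in means = {diff_in_means:.2f}")
print(f"OLS coefficient     = {beta_hat:.2f}")
print(f"OLS intercept       = {alpha_hat:.2f}")
```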
Example
Let’s try doing the causal statements for the following article:
http://www.nbcnews.com/health/cancer/university-texas-study-links-meat-kidney-cancer-n459811
1 What is the unit of analysis?
2 What is the treatment?
3 What outcome are we interested in?
4 What are the counterfactual outcomes?
5 What is the causal link?
6 How is the counterfactual mimicked? Does this sound reasonable?
Causal Inference
There are 5 basic empirical methods to obtain causal inference:
1 Controls (includes matching/fixed-effects)
2 Randomized Experiments
3 Difference-in-Differences
4 Instrumental Variables
5 Regression Discontinuity
Control Variables
The main problem of causal inference is the possibility of omitted variable bias (OVB)
So why do we not just control for all omitted variables?
Control Variables
So while controls do not seem great at getting causality, they help by:
Eliminating obvious OVB
Reducing standard errors
For this reason, controls are included in almost every regression
Control Variables
http://well.blogs.nytimes.com/2015/08/19/researchers-link-longer-work-hours-and-stroke-risk/?_r=0
1 What is the unit of analysis?
2 What is the treatment?
3 What outcome are we interested in?
4 What are the counterfactual outcomes?
5 What is the causal link?
6 How is the counterfactual mimicked? Does this sound reasonable?
Control Variables
$\text{stroke}_i = \alpha + \beta_1 \text{HoursWorked}_i + \beta_2 \text{Controls}_i + \varepsilon_i$
What controls could you realistically never put in the above regression that may lead to OVB?
Control Variables
Whether you add a control or not often depends on qualitative reasoning
For this reason, researchers often report many regressions, using varying sets of controls:
In general, add variables that either:
Directly affect the outcome
Proxy for another unobserved variable that affects the outcome
Table 1: Difference-in-Differences Estimates of CSR on Private School Share
Outcome Variable: Private School Share (%)
                          (1)         (2)         (3)         (4)
Treatment*Post            -1.41***    -1.35***    -0.91***    -1.32***
                          (0.17)      (0.18)      (0.28)      (0.27)
Treatment                 2.87***     2.82***     4.33***     3.32***
                          (0.25)      (0.52)      (0.60)      (0.46)
Post                      -0.73***    0.26*       -0.01       0.24*
                          (0.15)      (0.15)      (0.10)      (0.13)
Year/Grade FE             No          Yes         Yes         Yes
Demographic Controls      No          No          Yes         Yes
District FE               No          No          No          Yes
Number of Observations    253,056     253,056     188,210     188,210
Control Variables
For example,
$\text{stroke}_i = \alpha + \beta_1 \text{HoursWorked}_i + \beta_2 \text{HoursExercised}_i + \beta_3 \text{Race}_i + \varepsilon_i$
What is the control HoursExercisedi for?
What is the control Racei for?
Fixed-effects
Fixed-effects can be a bit confusing. Often they are a control. Sometimes they can be used for causal inference.
As a control:
Fixed effects are just a bunch of dummy variables. They are added as controls just like any other variable.
e.g. time FEs, province FEs, school FEs
For time FEs, if you have 10 years of data you:
Fixed-effects II
For causal inference:
Our essential concern is that people who work longer hours also differ in other dimensions (e.g. diet)
Idea: Why not control for the fact they are the same person?
Must have panel data (i.e. variation over time within the same person)
Essentially we estimate the effect of an increase/decrease in working hours for person X on that person's likelihood of stroke
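The idea of "controlling for the person" can be sketched with simulated panel data. The setup is entirely assumed: 500 people observed for 5 years, an unobserved time-invariant trait (think diet) that raises both hours worked and a continuous risk score, and a true effect of hours of 0.1. Subtracting each person's own average from every variable (the "within" transformation, which gives the same slope as adding one dummy per person) removes the trait, while the pooled regression does not.

```python
import random


def slope(x, y):
    """OLS slope from regressing y on x (with an intercept)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    return s_xy / s_xx


def demean_within(values, ids):
    """Subtract each person's own mean (the 'within' transformation)."""
    totals, counts = {}, {}
    for v, i in zip(values, ids):
        totals[i] = totals.get(i, 0.0) + v
        counts[i] = counts.get(i, 0) + 1
    return [v - totals[i] / counts[i] for v, i in zip(values, ids)]


random.seed(1)
true_effect = 0.1  # assumed effect of one extra hour on the risk score
ids, hours, risk = [], [], []
for person in range(500):
    trait = random.gauss(0, 1)  # unobserved and fixed over time (e.g. diet)
    for year in range(5):
        h = 40 + 5 * trait + random.gauss(0, 3)
        ids.append(person)
        hours.append(h)
        risk.append(true_effect * h + 2 * trait + random.gauss(0, 1))

pooled = slope(hours, risk)
within = slope(demean_within(hours, ids), demean_within(risk, ids))

print(f"true effect          = {true_effect:.2f}")
print(f"pooled OLS           = {pooled:.2f}")
print(f"person fixed effects = {within:.2f}")
```

The fixed-effects estimate only removes what is constant within a person. Anything that changes over time together with hours worked would still bias it.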
Fixed-effects in a Regression
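In a regression, you write down a fixed effect without a $\beta$ in front. The subscript then denotes the fixed-effect.
$\text{stroke}_{it} = \alpha + \beta_1 \text{HoursWorked}_{it} + \beta_2 \text{Controls}_{it} + \lambda_t + \delta_i + \varepsilon_{it}$
$\delta_i$ =
$\lambda_t$ =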
Using Fixed-effects
Implementing fixed-effects:
Open up STATA
Pros and Cons
What is the major bias concern in fixed-effects?
Pros vs. Cons of fixed-effects:
Randomized Experiments
Randomized experiments (or RCTs) are the “gold standard” of policy evaluation
Unfortunately, they are really tough to get off the ground
Also, some questions are not amenable to RCTs
For example:
Implementation
Implementation of randomized experiments can be difficult.
Before the experiment can be run you need to:
1 Find necessary sample size (use “ssi” in STATA)
2 Get funding (they are often very expensive)
3 Get ethical approval (can be very difficult)
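The sample-size step can be sketched with the usual normal-approximation formula for comparing two means with equal-sized arms: n per arm = 2 (z_{1-alpha/2} + z_{power})^2 sd^2 / effect^2. This is a minimal sketch, not necessarily what the STATA command computes; the function name, the 5% significance level, the 80% power and the effect size of half a standard deviation are all assumptions chosen for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(effect, sd, alpha=0.05, power=0.80):
    """Approximate n per arm for a two-sided test of a difference in means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    return ceil(2 * (z_alpha + z_power) ** 2 * sd**2 / effect**2)


# Hypothetical target: detect an effect of half a standard deviation.
n_per_arm = sample_size_per_arm(effect=0.5, sd=1.0)
print(f"participants needed per arm: {n_per_arm}")
```

Because the required sample grows with the inverse square of the effect size, halving the effect you want to detect roughly quadruples the sample, which is one reason these experiments are expensive.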
Implementation II
Afterwards, you need to:
1 Randomize
Because experiments are so expensive, researchers often “stratify” by some characteristic to guard against an unbalanced randomization due to bad “luck”; stratifying guarantees balance between treatment and control on that characteristic
2 Ensure there is limited attrition (often the bane of randomized trials)
3 Ensure there is no cross-contamination
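Stratified randomization can be sketched as follows: split the sample by the stratifying characteristic, then randomize separately inside each stratum so that each one contributes equally to treatment and control. The function name, the eight hypothetical students and the urban/rural characteristic are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_assignment(units, stratum_of, seed=0):
    """Randomly assign half of each stratum to treatment and the rest to control.

    units: list of unit ids; stratum_of: dict mapping each id to its stratum.
    With an odd-sized stratum, the extra unit goes to control.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for unit in units:
        by_stratum[stratum_of[unit]].append(unit)

    assignment = {}
    for members in by_stratum.values():
        rng.shuffle(members)  # random order within the stratum
        half = len(members) // 2
        for unit in members[:half]:
            assignment[unit] = "treatment"
        for unit in members[half:]:
            assignment[unit] = "control"
    return assignment


# Hypothetical sample: 8 students, stratified by an urban/rural characteristic.
students = list(range(8))
school_type = {s: ("urban" if s < 4 else "rural") for s in students}
assignment = stratified_assignment(students, school_type)

for s in students:
    print(s, school_type[s], assignment[s])
```

Simple randomization over all eight students could, by chance, put three or four urban students in the same arm. Randomizing within strata rules that out by construction.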
Evaluating Randomized Trials
To evaluate randomized trials, researchers look at internal and external validity
We will do this for Project STAR
Internal Validity
For internal validity we look at:
1 Proper Randomization (look for covariate balance)
2 Differential Attrition
3 Cross Contamination
4 Hawthorne Effects (could also be under external validity)
External Validity
For external validity we look at:
1 Generalizability (i.e. sample selection)
2 Scalability
3 General Equilibrium Effects
Project STAR
What is Project STAR?
Open up STATA....