Outline Omitted Variable Bias Multiple Regression OLS Measures of Fit Assumptions Sampling Distribution Multicollinearity
Linear Regression with Multiple Regressors
Ming-Ching Luoh
2021.3.9.
Omitted Variable Bias
The Multiple Regression Model
The OLS Estimator
Measures of Fit
Least Squares Assumptions
Sampling Distribution of OLS Estimators
Multicollinearity
Omitted Variable Bias
OLS estimate of the Test Score/STR relation
  Test Score = 698.9 − 2.28 × STR,  R² = .05
              (10.4)   (0.52)
● Is this a credible estimate of the causal effect on test scores of a change in the student-teacher ratio?
No, there are omitted confounding factors that bias the OLS estimator. STR could be “picking up” the effect of these confounding factors.
● The bias in the OLS estimator that occurs as a result of an omitted factor is called omitted variable bias.
● For omitted variable bias to occur, the omitted factor “Z” must be
  1. a determinant of Y, and
  2. correlated with the regressor X.
Both conditions must hold for the omission of Z to result in omitted variable bias.
In the test score example:

  1. English language ability (Z) (measured as the percentage of students who are still learning English) plausibly affects standardized test scores: Z is a determinant of Y.
  2. Immigrant communities tend to be less affluent and thus have smaller school budgets and higher STR: Z is correlated with X.

Accordingly, β̂1 is biased.

● What is the direction of the bias?
● What does common sense suggest?
● If common sense is not very obvious, there is a formula.
A formula for omitted variable bias
Recall the equation (sums run over i = 1,⋯, n),

  β̂1 = β1 + [ (1/n) ∑ᵢ (Xi − X̄)ui ] / [ (1/n) ∑ᵢ (Xi − X̄)² ] →p β1 + ρXu (σu / σX)

because

  (1/n) ∑ᵢ (Xi − X̄)²  →p  σ²X

  (1/n) ∑ᵢ (Xi − X̄)ui →p  Cov(Xi, ui) = ρXu σX σu
Omitted variable bias formula:

  β̂1 →p β1 + ρXu (σu / σX)

● If an omitted factor Z is both
  ● a determinant of Y (that is, it is contained in u), and
  ● correlated with X,
then ρXu ≠ 0 and the OLS estimator β̂1 is not consistent.
● The larger is |ρXu|, the larger is the bias.
● The direction of the bias in β̂1 depends on whether X and u are positively or negatively correlated.
To be more specific, suppose that the true model is

  Yi = β0 + β1Xi + β2Zi + ui ,  Cov(Xi, ui) = 0

The estimated model when Z is omitted is

  Yi = β0 + β1Xi + єi ,  єi ≡ β2Zi + ui

The covariance between Xi and the error term єi is

  Cov(Xi, єi) = Cov(Xi, β2Zi + ui) = β2 Cov(Xi, Zi)
Therefore,

  β̂1 →p β1 + Cov(Xi, єi) / Var(Xi) = β1 + β2 Cov(Xi, Zi) / Var(Xi)

Let PctEL (percentage of English learners) be Z. Since

● β2 < 0: the effect of PctEL on TestScore is negative, and
● Cov(Xi, Zi) > 0: districts with high STR also have high PctEL,

we have β̂1 < β1; that is, β̂1 is more negative than β1. The (negative) effect of STR on test scores is overestimated.
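The bias formula can be checked in a small simulation. This is an illustrative sketch, not the lecture's data: the coefficients, sample size, and the way X and Z are generated are all assumptions, chosen so that β2 < 0 and Cov(X, Z) > 0 as in the PctEL story.

```python
import numpy as np

# Simulated illustration (invented DGP): beta1 = -2, beta2 = -0.7,
# X and Z positively correlated, so the short regression is biased downward.
rng = np.random.default_rng(0)
n = 200_000
beta1, beta2 = -2.0, -0.7

Z = rng.normal(size=n)
X = 0.5 * Z + rng.normal(size=n)     # Cov(X, Z) = 0.5 > 0
u = beta2 * Z + rng.normal(size=n)   # the omitted Z sits inside the error
Y = 10.0 + beta1 * X + u

# OLS slope from regressing Y on X alone (Z omitted)
b1_hat = np.cov(X, Y)[0, 1] / np.var(X, ddof=1)

# The formula's prediction: beta1 + beta2 * Cov(X, Z) / Var(X) = -2 - 0.35/1.25
predicted_plim = beta1 + beta2 * 0.5 / 1.25
print(b1_hat, predicted_plim)        # both close to -2.28: bias away from zero
```

Here ρXu σu/σX equals β2 Cov(X, Z)/Var(X), so the short-regression slope lands near −2.28 rather than the true −2.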
Is this actually going on in the CA data?

● Districts with fewer English Learners have higher test scores.
● Districts with lower percent EL have smaller classes.
● Among districts with comparable PctEL, the effect of class size is small. (Recall the overall “test score gap” = 7.4.)
Causality and regression analysis

● The test score/STR/fraction English Learners example shows that, if an omitted variable satisfies the two conditions for omitted variable bias, then the OLS estimator in the regression omitting that variable is biased and inconsistent. So, even if n is large, β̂1 will not be close to β1.
● This raises a deeper question: how do we define β1? That is, what precisely do we want to estimate when we run a regression?
What precisely do we want to estimate when we run a regression? There are (at least) three possible answers to this question:

1. We want to estimate the slope of a line through a scatterplot as a simple summary of the data, to which we attach no substantive meaning.

This can be useful at times, but isn’t very interesting intellectually and isn’t what this course is about.
2. We want to make forecasts, or predictions, of the value of Y for an entity not in the data set, for which we know the value of X.

Forecasting is an important job for economists, and excellent forecasts are possible using regression methods without needing to know causal effects.
3. We want to estimate the causal effect on Y of a change in X. This is why we are interested in the class size effect. Suppose the school board decided to cut class size by 2 students per class. What would be the effect on test scores? This is a causal question (what is the causal effect on test scores of STR?), so we need to estimate this causal effect. Except when we discuss forecasting, the aim of this course is the estimation of causal effects using regression methods.
What, precisely, is a causal effect?

● “Causality” is a complex concept! The common-sense definition of causality is not precise enough for our purposes.
● In this course, we take a practical approach to defining causality:

A causal effect is defined to be the effect measured in an ideal randomized controlled experiment.
Ideal Randomized Controlled Experiment

● Ideal: subjects all follow the treatment protocol: perfect compliance, no errors in reporting, etc.
● Randomized: subjects from the population of interest are randomly assigned to a treatment or control group (so there are no confounding factors).
● Controlled: having a control group permits measuring the differential effect of the treatment.
● Experiment: the treatment is assigned as part of the experiment; the subjects have no choice, which means that there is no “reverse causality” in which subjects choose the treatment they think will work best.
Back to the case of class size
What is an ideal randomized controlled experiment for measuring the effect on Test Score of reducing STR?

● In that experiment, students would be randomly assigned to classes, which would have different sizes.
● Because they are randomly assigned, all student characteristics (and thus ui) would be distributed independently of STRi.
● Thus, E(ui|STRi) = 0; that is, LSA #1 holds in a randomized controlled experiment.
How does our regression analysis of observational data differ from this ideal?

● The treatment is not randomly assigned.
● Consider PctEL, the percent of English learners in the district. It plausibly satisfies the two criteria for omitted variable bias: Z = PctEL is
  ● a determinant of Y; and
  ● correlated with the regressor X.
● The “control” and “treatment” groups differ in a systematic way: corr(STR, PctEL) ≠ 0.
Examples of omitted variable bias:

● Mozart Effect? Listening to Mozart for 10-15 minutes could raise IQ by 8 or 9 points. (Nature 1993)
● Students who take optional music or arts courses in high school have higher English and math test scores than those who don’t.
Three ways to overcome omitted variable bias:

● Run a randomized controlled experiment in which treatment (STR) is randomly assigned: then PctEL is still a determinant of Test Score, but PctEL is uncorrelated with STR. (But this is unrealistic in practice.)
● Adopt the “cross tabulation” approach, with finer gradations of STR and PctEL. (But soon we will run out of data, and what about other determinants like family income and parental education?)
● Use a regression in which the omitted variable (PctEL) is no longer omitted: include PctEL as an additional regressor in a multiple regression.
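The third fix can be sketched in a simulation. The data-generating process here is invented for illustration (β1 = −2, β2 = −0.7, Corr(X, Z) > 0 are assumptions): once the omitted Z enters the regression, the OLS slope on X recovers β1.

```python
import numpy as np

# Invented DGP: Y = 10 - 2.0*X - 0.7*Z + noise, with Corr(X, Z) > 0.
rng = np.random.default_rng(1)
n = 100_000
Z = rng.normal(size=n)
X = 0.5 * Z + rng.normal(size=n)
Y = 10.0 - 2.0 * X - 0.7 * Z + rng.normal(size=n)

# Short regression: constant and X only (Z omitted) -> biased slope
A_short = np.column_stack([np.ones(n), X])
b_short, *_ = np.linalg.lstsq(A_short, Y, rcond=None)

# Long regression: constant, X, and Z -> bias is gone
A_long = np.column_stack([np.ones(n), X, Z])
b_long, *_ = np.linalg.lstsq(A_long, Y, rcond=None)

print(b_short[1], b_long[1])  # roughly -2.28 vs roughly -2.00
```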
The Multiple Regression Model

Consider the case of two regressors:

  Yi = β0 + β1X1i + β2X2i + ui ,  i = 1,⋯, n

● X1, X2 are the two independent variables (regressors).
● (Yi, X1i, X2i) denote the ith observation on Y, X1, and X2.
● β0 = unknown population intercept.
● β1 = effect on Y of a change in X1, holding X2 constant.
● β2 = effect on Y of a change in X2, holding X1 constant.
● ui = “error term” (omitted factors).
Interpretation of multiple regression coefficients

  Yi = β0 + β1X1i + β2X2i + ui ,  i = 1,⋯, n

Consider changing X1 by ∆X1 while holding X2 constant.
Population regression line before the change:

  Y = β0 + β1X1 + β2X2

Population regression line after the change:

  Y + ∆Y = β0 + β1(X1 + ∆X1) + β2X2
That is,

  β1 = ∆Y/∆X1, holding X2 constant

also,

  β2 = ∆Y/∆X2, holding X1 constant

and

  β0 = predicted value of Y when X1 = X2 = 0
The OLS Estimator in Multiple Regression
With two regressors, the OLS estimator solves

  min over b0, b1, b2:  ∑ᵢ (Yi − (b0 + b1X1i + b2X2i))²

● The OLS estimator minimizes the sum of squared differences between the actual values of Yi and the prediction (predicted value) based on the estimated line.
● This minimization problem yields the OLS estimators of β0, β1 and β2.
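The minimization has the familiar closed-form solution β̂ = (X′X)⁻¹X′Y, obtained from the normal equations. A minimal sketch with simulated data (the true coefficients are assumptions for illustration):

```python
import numpy as np

# Simulated data: true coefficients (1, 2, -3) are invented for illustration.
rng = np.random.default_rng(2)
n = 500
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
y = 1.0 + 2.0 * X1 - 3.0 * X2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), X1, X2])     # design matrix with intercept
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # normal equations (X'X) b = X'y

# At the minimum, the residuals are orthogonal to every column of X.
resid = y - X @ beta_hat
print(beta_hat)
print(np.abs(X.T @ resid).max())              # numerically zero
```

The orthogonality of the residuals to each regressor is exactly the first-order condition of the minimization problem above.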
Example: The California test score data
Regression of Test Score against STR:

  Test Score = 698.9 − 2.28 × STR

Now include percent English Learners in the district (PctEL):

  Test Score = 696.0 − 1.10 × STR − 0.65 × PctEL

● What happens to the coefficient on STR?
● Why? (Note: corr(STR, PctEL) = 0.19)
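As a quick usage check of the “holding PctEL constant” interpretation, the two estimated equations above imply very different predictions for the school board's 2-student cut:

```python
# Slide coefficients: multiple regression vs. single-regressor slope.
b_STR_multiple = -1.10
b_STR_single = -2.28

# Predicted change in test scores from cutting STR by 2, holding PctEL fixed.
delta_multiple = b_STR_multiple * (-2)   # +2.2 points
delta_single = b_STR_single * (-2)       # +4.56 points, more than twice as large
print(delta_multiple, delta_single)
```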
Measures of Fit in Multiple Regression

Actual = predicted + residual:  Yi = Ŷi + ûi

● SER (standard error of regression) = std. error of ûi (with d.f. correction)
● RMSE = std. error of ûi (without d.f. correction)
● R² = fraction of the sample variance of Y explained by X.
● R̄² = “adjusted R²”
SER and RMSE

● As in regression with a single regressor, the SER and the RMSE are measures of the spread of the Y’s around the regression line.

  SER = √[ (1/(n − k − 1)) ∑ᵢ ûᵢ² ]

  RMSE = √[ (1/n) ∑ᵢ ûᵢ² ]
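Both formulas are mechanical once the residuals are in hand. A sketch with simulated data (the DGP, n, and k are assumptions):

```python
import numpy as np

# Simulated regression with k = 2 regressors plus an intercept; true error s.d. is 1.
rng = np.random.default_rng(3)
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
u_hat = y - X @ beta_hat                         # residuals

SER  = np.sqrt(np.sum(u_hat**2) / (n - k - 1))   # with d.f. correction
RMSE = np.sqrt(np.sum(u_hat**2) / n)             # without d.f. correction
print(SER, RMSE)  # SER slightly exceeds RMSE; both near the true s.d. of 1
```

The two differ only by the factor √(n/(n − k − 1)), which vanishes as n grows.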
R² and R̄²

● The R² is the fraction of the sample variance of Y explained; same definition as in regression with a single regressor:

  R² = ESS/TSS = 1 − SSR/TSS

where ESS = ∑ᵢ(Ŷi − Ȳ)², SSR = ∑ᵢ ûᵢ², TSS = ∑ᵢ(Yi − Ȳ)².

● The R² always increases when you add another regressor. (Why?)
● The R̄² (the “adjusted R²”) corrects this problem by “penalizing” you for including another regressor, so that the R̄² does not necessarily increase.

  Adjusted R²:  R̄² = 1 − ((n − 1)/(n − k − 1)) × (SSR/TSS)

● Facts about R̄²:
  ● (n − 1)/(n − k − 1) > 1, therefore R̄² < R².
  ● Adding a regressor has two opposite effects. On the one hand, the SSR falls, which increases the R̄². On the other hand, the factor (n − 1)/(n − k − 1) increases, which reduces the R̄².
  ● R̄² can be negative.
  ● When n is large, R² and R̄² are very close.
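The mechanics above can be demonstrated directly: add a pure-noise regressor and R² rises while the adjusted version is penalized. All data here are simulated for illustration.

```python
import numpy as np

# Simulated illustration: y depends on x only; "junk" is unrelated by construction.
rng = np.random.default_rng(4)
n = 100
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
junk = rng.normal(size=n)

def r2_and_adjusted(cols, y):
    X = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    SSR = np.sum((y - X @ beta) ** 2)
    TSS = np.sum((y - y.mean()) ** 2)
    n, k = X.shape[0], X.shape[1] - 1      # k regressors besides the constant
    return 1 - SSR / TSS, 1 - (n - 1) / (n - k - 1) * SSR / TSS

r2_1, adj_1 = r2_and_adjusted([x], y)
r2_2, adj_2 = r2_and_adjusted([x, junk], y)
print(r2_1, r2_2)    # R^2 never falls when a regressor is added
print(adj_1, adj_2)  # the adjusted version is always below R^2
```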
Using the R² and R̄²

● The R² is useful because it quantifies the extent to which the regressors account for, or explain, the variation in the dependent variable.
● Nevertheless, heavy reliance on the R² (or R̄²) can be a trap. In applications, “maximize the R²” is rarely the answer to any economically or statistically meaningful question.
● The decision about whether to include a variable in a multiple regression should be based on whether including that variable allows us better to estimate the causal effect of interest.
The Least Squares Assumptions in Multiple Regression

  Yi = β0 + β1X1i + β2X2i + ⋯ + βkXki + ui ,  i = 1,⋯, n

1. The conditional distribution of ui given the X’s has a mean of zero.
2. (X1i,⋯, Xki, Yi), i = 1,⋯, n, are i.i.d.
3. Large outliers are unlikely.
4. There is no perfect multicollinearity.
Assumption #1: The conditional mean of u given the included X’s is zero.

● This has the same interpretation as in regression with a single regressor.
● If an omitted variable (1) belongs in the equation (so is in u) and (2) is correlated with an included X, then this condition fails.
● Failure of this condition leads to omitted variable bias.
● The solution, if possible, is to include the omitted variable in the regression.
Assumption #2: (X1i,⋯, Xki, Yi), i = 1,⋯, n, are i.i.d.

This is satisfied automatically if the data are collected by simple random sampling.

Assumption #3: Large outliers are rare.

X1,⋯, Xk, and u have finite fourth moments: E(X1i⁴) < ∞, ⋯, E(Xki⁴) < ∞, E(ui⁴) < ∞.

This is the same assumption as we had before for a single regressor. As in the case of a single regressor, OLS can be sensitive to large outliers, so you need to check your data (scatterplots!) to make sure there are no crazy values (typos or coding errors).
Assumption #4: There is no perfect multicollinearity.

The regressors are said to be perfectly multicollinear if one of the regressors is an exact linear function of the other regressors. Example: suppose you accidentally include STR twice.
Perfect multicollinearity is when one of the regressors is an exact linear function of the other regressors.

● In the previous regression, β1 is the effect on TestScore of a unit change in STR, holding STR constant. (???)
● Second example: regress Test Score on a constant, D, and B, where Di = 1 if STR ≤ 20, = 0 otherwise; Bi = 1 if STR > 20, = 0 otherwise. So Bi = 1 − Di and there is perfect multicollinearity.
● Perfect multicollinearity usually reflects a mistake in the definition of the regressors.
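The include-STR-twice example can be made concrete: with a duplicated column, the design matrix loses rank and X′X is singular, so the normal equations have no unique solution. The STR values below are simulated, not the California data.

```python
import numpy as np

# Simulated STR values; entering STR twice makes one column an exact
# linear function of another.
rng = np.random.default_rng(5)
n = 50
STR = rng.uniform(14.0, 26.0, size=n)

X = np.column_stack([np.ones(n), STR, STR])   # constant, STR, STR again
rank = np.linalg.matrix_rank(X)
cond = np.linalg.cond(X.T @ X)                # condition number of X'X

print(rank)   # 2, not 3: X'X is singular, so there is no unique OLS solution
print(cond)   # astronomically large (or inf)
```

This is exactly the situation statistical software detects before erroring out or dropping a column.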
The Distribution of the OLS Estimators

Under the four Least Squares Assumptions,

● The exact (finite-sample) distribution of β̂1 has mean β1, and Var(β̂1) is inversely proportional to n. So too for β̂2.
● Other than its mean and variance, the exact distribution of β̂1 is complicated.
● β̂1 is consistent: β̂1 →p β1. (law of large numbers)
● (β̂1 − E(β̂1)) / √Var(β̂1) is approximately distributed N(0, 1). (CLT)
● So too for β̂2,⋯, β̂k.
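A Monte Carlo sketch of these claims (the DGP, sample sizes, and replication count are assumptions): across repeated samples, β̂1 centers on β1, and quadrupling n cuts its variance by roughly a factor of four.

```python
import numpy as np

# Invented DGP: Y = 1 + 2*X1 - 1*X2 + u, u ~ N(0, 1).
rng = np.random.default_rng(6)

def beta1_hats(n, reps=2000):
    out = np.empty(reps)
    for r in range(reps):
        X1 = rng.normal(size=n)
        X2 = rng.normal(size=n)
        y = 1.0 + 2.0 * X1 - 1.0 * X2 + rng.normal(size=n)
        X = np.column_stack([np.ones(n), X1, X2])
        out[r] = np.linalg.lstsq(X, y, rcond=None)[0][1]   # beta1_hat
    return out

b_50 = beta1_hats(n=50)
b_200 = beta1_hats(n=200)
print(b_50.mean(), b_200.mean())   # both close to beta1 = 2
print(b_50.var() / b_200.var())    # close to 200/50 = 4
```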
Multicollinearity
Some more examples of perfect multicollinearity

● The example from earlier: we include STR twice.
● Second example: regress TestScore on a constant, D, and B, where Di = 1 if STR ≤ 20, = 0 otherwise; Bi = 1 if STR > 20, = 0 otherwise, so Bi = 1 − Di and there is perfect multicollinearity.
● Would there still be perfect multicollinearity if the intercept (constant term) were somehow excluded (that is, omitted or suppressed) from this regression?
The dummy variable trap

● Suppose you have a set of multiple binary (dummy) variables, which are mutually exclusive and exhaustive; that is, there are multiple categories and every observation falls in one and only one category.
● If you include all these dummy variables and a constant, you will have perfect multicollinearity; this is sometimes called the dummy variable trap.
● Why is there perfect multicollinearity?
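The answer can be seen numerically: the full set of dummies sums to the constant column, so the design matrix is rank-deficient. The three-category setup below is invented for illustration.

```python
import numpy as np

# Three mutually exclusive, exhaustive categories (invented): each observation
# gets exactly one dummy equal to 1.
n = 30
cat = np.arange(n) % 3                 # categories 0, 1, 2, cycling

D = np.zeros((n, 3))
D[np.arange(n), cat] = 1.0             # full set of dummies

const = np.ones(n)
X = np.column_stack([const, D])        # constant + ALL three dummies

print(np.allclose(D.sum(axis=1), 1.0)) # True: D1 + D2 + D3 = constant column
print(np.linalg.matrix_rank(X))        # 3, not 4: the trap

# Either fix restores a full-rank design matrix:
print(np.linalg.matrix_rank(np.column_stack([const, D[:, :2]])))  # omit a group
print(np.linalg.matrix_rank(D))                                   # omit the intercept
```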
Solutions to the dummy variable trap:
(1) Omit one of the groups, or
(2) Omit the intercept term.
● What are the implications of solutions (1) and (2) for the interpretation of the coefficients?
● Perfect multicollinearity usually reflects a mistake in the definition of the regressors, or an oddity in the data.
● If you have perfect multicollinearity, your statistical software will let you know: either by crashing, giving an error message, or by “dropping” one of the variables arbitrarily.
● The solution to perfect multicollinearity is to modify your list of regressors so that you no longer have perfect multicollinearity.
Imperfect multicollinearity

● Imperfect and perfect multicollinearity are quite different despite the similarity of the names.
● Imperfect multicollinearity occurs when two or more regressors are very highly correlated.
● Why this term? If two regressors are very highly correlated, then their scatterplot will pretty much look like a straight line; they are collinear. But unless the correlation is exactly ±1, that collinearity is imperfect.
● Imperfect multicollinearity implies that one or more of the regression coefficients will be imprecisely estimated.
● Intuition: the coefficient on X1 is the effect of X1 holding X2 constant; but if X1 and X2 are highly correlated, there is very little variation in X1 once X2 is held constant. So the data are pretty much uninformative about what happens when X1 changes but X2 doesn’t, and the variance of the OLS estimator of the coefficient on X1 will be large.
● Imperfect multicollinearity (correctly) results in large standard errors for one or more of the OLS coefficients.
● Imperfect multicollinearity does not lead the OLS estimators to be biased or inconsistent.
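This variance inflation can be simulated. The data-generating process and correlation level are assumptions for illustration; the sampling standard deviation of β̂1 scales roughly with 1/√(1 − ρ²).

```python
import numpy as np

# Invented DGP with Corr(X1, X2) = rho; beta1 = 2 throughout.
rng = np.random.default_rng(8)

def sd_of_beta1_hat(rho, n=100, reps=2000):
    out = np.empty(reps)
    for r in range(reps):
        X1 = rng.normal(size=n)
        X2 = rho * X1 + np.sqrt(1.0 - rho**2) * rng.normal(size=n)
        y = 1.0 + 2.0 * X1 - 1.0 * X2 + rng.normal(size=n)
        X = np.column_stack([np.ones(n), X1, X2])
        out[r] = np.linalg.lstsq(X, y, rcond=None)[0][1]   # beta1_hat
    return out.std()

sd_lo = sd_of_beta1_hat(rho=0.0)    # uncorrelated regressors
sd_hi = sd_of_beta1_hat(rho=0.95)   # highly (but imperfectly) correlated
print(sd_lo, sd_hi)  # sd_hi / sd_lo is near 1/sqrt(1 - 0.95^2), about 3.2
```

Both estimators remain centered on β1 = 2; only the precision deteriorates, consistent with the last bullet above.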