© Christopher Dougherty 1999–2006
A.1: The model is linear in parameters and correctly specified.
A.2: There does not exist an exact linear relationship among the regressors in the sample.
A.3: The disturbance term has zero expectation.
A.4: The disturbance term is homoscedastic.
A.5: The values of the disturbance term have independent distributions.
A.6: The disturbance term has a normal distribution.
PROPERTIES OF THE MULTIPLE REGRESSION COEFFICIENTS
$Y = \beta_1 + \beta_2 X_2 + \dots + \beta_k X_k + u$
Moving from the simple to the multiple regression model, we start by restating the regression model assumptions. Only A.2 is different. Previously it was stated that there must be some variation in the X variable. We will explain the difference in one of the following lectures. Provided that the regression model assumptions are valid, the OLS estimators in the multiple regression model are unbiased and efficient, as in the simple regression model.
$Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + u \qquad\qquad \hat{Y} = b_1 + b_2 X_2 + b_3 X_3$
$b_2 = \dfrac{\sum (X_{2i}-\bar{X}_2)(Y_i-\bar{Y}) \sum (X_{3i}-\bar{X}_3)^2 - \sum (X_{3i}-\bar{X}_3)(Y_i-\bar{Y}) \sum (X_{2i}-\bar{X}_2)(X_{3i}-\bar{X}_3)}{\sum (X_{2i}-\bar{X}_2)^2 \sum (X_{3i}-\bar{X}_3)^2 - \left[\sum (X_{2i}-\bar{X}_2)(X_{3i}-\bar{X}_3)\right]^2}$
We will not attempt to prove efficiency. We will however outline a proof of unbiasedness.
$Y_i - \bar{Y} = \beta_2 (X_{2i}-\bar{X}_2) + \beta_3 (X_{3i}-\bar{X}_3) + (u_i - \bar{u})$
The first step, as always, is to substitute for Y from the true relationship. The Y ingredients of b2 are actually in the form of Yi minus its mean, so it is convenient to obtain an expression for this.
$b_2 = \beta_2 + \sum a_i^* u_i$
After simplifying, we find that b2 can be decomposed into the true value β2 plus a weighted linear combination of the values of the disturbance term in the sample. This is what we found in the simple regression model. The difference is that the expression for the weights, which depend on all the values of X2 and X3 in the sample, is considerably more complicated.
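This decomposition lends itself to a quick numerical check. The following sketch (not from the text; the parameter values, sample design, and use of numpy are all illustrative assumptions) fits the two-regressor model on many simulated samples and confirms that b2 averages out close to the true β2:

```python
import numpy as np

# Illustrative simulation: in Y = beta1 + beta2*X2 + beta3*X3 + u,
# the OLS estimator b2 equals beta2 plus a weighted sum of the u values,
# so it should average out to beta2 over repeated samples.
rng = np.random.default_rng(0)
beta1, beta2, beta3 = 10.0, 2.0, -1.5      # assumed true parameters
X2 = np.linspace(1, 20, 40)                # fixed (nonstochastic) regressors
X3 = 0.5 * X2 + rng.normal(0, 3, 40)       # correlated with X2, but not exactly

X = np.column_stack([np.ones_like(X2), X2, X3])
b2_estimates = []
for _ in range(5000):
    u = rng.normal(0, 5, 40)               # disturbance term with E(u) = 0
    Y = beta1 + beta2 * X2 + beta3 * X3 + u
    b = np.linalg.lstsq(X, Y, rcond=None)[0]
    b2_estimates.append(b[1])

print(np.mean(b2_estimates))               # close to beta2 = 2.0
```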
$E(b_2) = E\!\left(\beta_2 + \sum a_i^* u_i\right) = \beta_2 + E\!\left(\sum a_i^* u_i\right) = \beta_2 + \sum E(a_i^* u_i) = \beta_2 + \sum a_i^* E(u_i) = \beta_2$
Having reached this point, proving unbiasedness is easy. Taking expectations, β2 is unaffected, being a constant. The expectation of a sum is equal to the sum of expectations.
The a* terms are nonstochastic since they depend only on the values of X2 and X3, and these are assumed to be nonstochastic. Hence the a* terms may be taken out of the expectations as factors.
By Assumption A.3, E(ui) = 0 for all i. Hence E(b2) is equal to β2 and so b2 is an unbiased estimator. Similarly b3 is an unbiased estimator of β3.
Finally we will show that b1 is an unbiased estimator of β1. This is quite simple, so you should attempt to do it yourself before looking at the rest of this sequence.
$b_1 = \bar{Y} - b_2 \bar{X}_2 - b_3 \bar{X}_3 = (\beta_1 + \beta_2 \bar{X}_2 + \beta_3 \bar{X}_3 + \bar{u}) - b_2 \bar{X}_2 - b_3 \bar{X}_3$
First substitute for the sample mean of Y.
$E(b_1) = \beta_1 + \beta_2 \bar{X}_2 + \beta_3 \bar{X}_3 + E(\bar{u}) - \bar{X}_2 E(b_2) - \bar{X}_3 E(b_3) = \beta_1$
Now take expectations. The first three terms are nonstochastic, so they are unaffected by taking expectations.
The expected value of the mean of the disturbance term is zero since E(u) is zero in each observation. We have just shown that E(b2) is equal to β2 and that E(b3) is equal to β3.
Hence b1 is an unbiased estimator of β1.
PRECISION OF THE MULTIPLE REGRESSION COEFFICIENTS
This sequence investigates the variances and standard errors of the slope coefficients in a model with two explanatory variables. The expression for the variance of b2 is shown below. The expression for the variance of b3 is the same, with the subscripts 2 and 3 interchanged.
$\sigma_{b_2}^2 = \dfrac{\sigma_u^2}{\sum (X_{2i}-\bar{X}_2)^2} \times \dfrac{1}{1 - r_{X_2 X_3}^2} = \dfrac{\sigma_u^2}{n\,\mathrm{MSD}(X_2)} \times \dfrac{1}{1 - r_{X_2 X_3}^2}$
The first factor in the expression is identical to that for the variance of the slope coefficient in a simple regression model. The variance of b2 depends on the variance of the disturbance term, the number of observations, and the mean square deviation of X2 for exactly the same reasons as in a simple regression model.
The difference is that in multiple regression analysis the expression is multiplied by a factor which depends on the correlation between X2 and X3.
The higher the correlation between the explanatory variables, positive or negative, the greater the variance will be. This is easy to understand intuitively. The greater the correlation, the harder it is to discriminate between the effects of the explanatory variables on Y, and the less accurate will be the regression estimates. Note that the variance expression above is valid only for a model with two explanatory variables. When there are more than two, the expression becomes much more complex and it is sensible to switch to matrix algebra.
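The effect of the 1/(1 – r²) factor can be seen in a small simulation (a sketch, not from the text; all parameter values and the use of numpy are arbitrary choices): the empirical variance of b2 is many times larger when X3 is highly correlated with X2.

```python
import numpy as np

# Compare the sampling variance of b2 when X2 and X3 are nearly
# uncorrelated versus highly correlated. Everything else is held fixed.
rng = np.random.default_rng(1)
n = 50
X2 = rng.normal(0, 1, n)

def var_b2(X3, reps=2000):
    X = np.column_stack([np.ones(n), X2, X3])
    slopes = []
    for _ in range(reps):
        u = rng.normal(0, 1, n)
        Y = 1.0 + 2.0 * X2 + 3.0 * X3 + u
        slopes.append(np.linalg.lstsq(X, Y, rcond=None)[0][1])
    return np.var(slopes)

v_low = var_b2(rng.normal(0, 1, n))                     # r close to 0
v_high = var_b2(0.95 * X2 + 0.1 * rng.normal(0, 1, n))  # r close to 1
print(v_low, v_high)   # v_high is many times larger than v_low
```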
The standard deviation of the distribution of b2 is of course given by the square root of its variance. With the exception of the variance of u, we can calculate the components of the standard deviation from the sample data.
$\text{standard deviation of } b_2 = \sqrt{\dfrac{\sigma_u^2}{\sum (X_{2i}-\bar{X}_2)^2} \times \dfrac{1}{1 - r_{X_2 X_3}^2}}$
The variance of u has to be estimated. The mean square of the residuals provides a consistent estimator, but in a finite sample it is biased downwards by a factor (n – k) / n, where k is the number of parameters.
$E\!\left(\dfrac{1}{n}\sum e_i^2\right) = \dfrac{n-k}{n}\,\sigma_u^2$
Obviously we can obtain an unbiased estimator by dividing the sum of the squares of the residuals by n – k instead of n. We denote this unbiased estimator su².
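A simulation sketch of this bias (illustrative values, not from the text; the design matrix, parameters, and use of numpy are all assumptions): dividing RSS by n averages out below the true variance, while dividing by n – k recovers it.

```python
import numpy as np

# True disturbance variance is 4.0; with n = 30 observations and k = 3
# parameters, E(RSS/n) = (27/30)*4.0 = 3.6, while E(RSS/(n-k)) = 4.0.
rng = np.random.default_rng(3)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta = np.array([1.0, 2.0, 3.0])
sigma2 = 4.0

biased, unbiased = [], []
for _ in range(4000):
    u = rng.normal(0, np.sqrt(sigma2), n)
    Y = X @ beta + u
    b = np.linalg.lstsq(X, Y, rcond=None)[0]
    e = Y - X @ b                             # residuals
    biased.append(np.sum(e**2) / n)
    unbiased.append(np.sum(e**2) / (n - k))

print(np.mean(biased), np.mean(unbiased))     # roughly 3.6 and 4.0
```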
$s_u^2 = \dfrac{1}{n-k}\sum e_i^2$
$\mathrm{s.e.}(b_2) = \sqrt{\dfrac{s_u^2}{\sum (X_{2i}-\bar{X}_2)^2} \times \dfrac{1}{1 - r_{X_2 X_3}^2}}$
Thus the estimate of the standard deviation of the probability distribution of b2, known as the standard error of b2 for short, is given by the expression above.
We will use this expression to analyze why the standard error of S is larger for the union subsample than for the non-union subsample in earnings function regressions using Data Set 21.
. reg EARNINGS S EXP if COLLBARG==1
      Source |       SS       df       MS              Number of obs =     101
-------------+------------------------------           F(  2,    98) =    9.72
       Model |  3076.31726     2  1538.15863           Prob > F      =  0.0001
    Residual |  15501.9762    98   158.18343           R-squared     =  0.1656
-------------+------------------------------           Adj R-squared =  0.1486
       Total |  18578.2934   100  185.782934           Root MSE      =  12.577

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.333846   .5492604     4.25   0.000     1.243857    3.423836
         EXP |   .2235095   .3389455     0.66   0.511    -.4491169    .8961358
       _cons |  -15.12427   11.38141    -1.33   0.187    -37.71031    7.461779
------------------------------------------------------------------------------
To select a subsample in Stata, you add an ‘if’ statement to a command. The COLLBARG variable is equal to 1 for respondents whose rates of pay are determined by collective bargaining, and it is 0 for the others. Note that in tests for equality, Stata requires the = sign to be duplicated.
In the case of the union subsample, the standard error of S is 0.5493.
. reg EARNINGS S EXP if COLLBARG==0
      Source |       SS       df       MS              Number of obs =     439
-------------+------------------------------           F(  2,   436) =   57.77
       Model |  19540.1761     2  9770.08805           Prob > F      =  0.0000
    Residual |   73741.593   436  169.132094           R-squared     =  0.2095
-------------+------------------------------           Adj R-squared =  0.2058
       Total |  93281.7691   438  212.972076           Root MSE      =  13.005

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.721698   .2604411    10.45   0.000     2.209822    3.233574
         EXP |   .6077342   .1400846     4.34   0.000     .3324091    .8830592
       _cons |  -28.00805   4.643211    -6.03   0.000    -37.13391   -18.88219
------------------------------------------------------------------------------
In the case of the non-union subsample, the standard error of S is 0.2604, less than half as large.
$\mathrm{s.e.}(b_2) = s_u \times \sqrt{\dfrac{1}{n}} \times \sqrt{\dfrac{1}{\mathrm{MSD}(X_2)}} \times \sqrt{\dfrac{1}{1 - r_{X_2 X_3}^2}}$
We will explain the difference by looking at the components of the standard error.
Decomposition of the standard error of S

Component     s_u       n       MSD(S)    r_{S,EXP}    s.e.
Union         12.577    101     6.2325    -0.4087      0.5493
Non-union     13.005    439     5.8666    -0.1784      0.2604

Factor        s_u       √(1/n)  √(1/MSD)  √(1/(1-r²))  product
Union         12.577    0.0995  0.4006    1.0957       0.5493
Non-union     13.005    0.0477  0.4129    1.0163       0.2603
$s_u = \sqrt{\dfrac{\mathrm{RSS}}{n-k}}$
We will start with su. RSS for the union subsample, 15501.9762, appears in the regression output above.
There are 101 observations in the union subsample. k is equal to 3. Thus n – k is equal to 98.
RSS / (n – k) is equal to 158.183. To obtain su, we take the square root, which is 12.577.
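The arithmetic can be checked directly (a sketch; RSS and the degrees of freedom are taken from the regression output above):

```python
import math

# s_u = sqrt(RSS / (n - k)) for the union subsample.
rss, n, k = 15501.9762, 101, 3
s_u = math.sqrt(rss / (n - k))
print(round(rss / (n - k), 3), round(s_u, 3))   # 158.183 and 12.577
```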
We place this in the table, along with the number of observations.
Similarly, in the case of the non-union subsample, su is the square root of 169.132, which is 13.005. We also note that the number of observations in that subsample is 439.
We place these in the table.
We calculate the mean square deviation of S for the two subsamples from the sample data.
. cor S EXP if COLLBARG==1
(obs=101)

        |      S    EXP
--------+------------------
      S |  1.0000
    EXP | -0.4087  1.0000

. cor S EXP if COLLBARG==0
(obs=439)

        |      S    EXP
--------+------------------
      S |  1.0000
    EXP | -0.1784  1.0000
The correlation coefficients for S and EXP are –0.4087 and –0.1784 for the union and non-union subsamples, respectively. (Note that "cor" is the Stata command for computing correlations.)
These entries complete the top half of the table. We will now look at the impact of each item on the standard error, using the mathematical expression above.
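The bottom half of the table can be reproduced from the components in the top half (a sketch; the union-subsample numbers are those quoted in the text):

```python
import math

# s.e.(b2) = s_u * sqrt(1/n) * sqrt(1/MSD(S)) * sqrt(1/(1 - r^2))
s_u, n, msd, r = 12.577, 101, 6.2325, -0.4087   # union subsample

f1 = math.sqrt(1 / n)             # 0.0995
f2 = math.sqrt(1 / msd)           # 0.4006
f3 = math.sqrt(1 / (1 - r**2))    # 1.0957
se = s_u * f1 * f2 * f3
print(round(se, 4))               # 0.5493, matching the regression output
```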
The su component needs no modification. It is a little larger for the non-union subsample, and this has an adverse effect on that subsample’s standard error.
The number of observations is much larger for the non-union subsample, so the second factor is much smaller than that for the union subsample.
Perhaps surprisingly, the variance in schooling is a little larger for the union subsample.
The correlation between schooling and work experience is greater for the union subsample, and this has an adverse effect on its standard error. Note that the sign of the correlation makes no difference since it is squared.
We see that the reason the standard error is smaller for the non-union subsample is that it contains far more observations than the union subsample. Otherwise the standard errors would have been about the same. The greater correlation between S and EXP has an adverse effect on the union standard error, but this is just about offset by the smaller su and the larger variance of S.
X2 X3 Y
10 19 51
11 21 56
12 23 61
13 25 66
14 27 71
15 29 76
MULTICOLLINEARITY
$Y = 2 + 3X_2 + X_3 \qquad\qquad X_3 = 2X_2 - 1$
Suppose that Y = 2 + 3X2 + X3 and that X3 = 2X2 – 1. There is no disturbance term in the equation for Y, but that is not important. Suppose that we have the six observations shown.
The three variables are plotted as line graphs above. Looking at the data, it is impossible to tell whether the changes in Y are caused by changes in X2, by changes in X3, or jointly by changes in both X2 and X3.
[Figure: line graphs of Y, X3, and X2 plotted against the observation number, 1 to 6.]
X2    X3     Y    Change in X2   Change in X3   Change in Y
10    19    51         1              2              5
11    21    56         1              2              5
12    23    61         1              2              5
13    25    66         1              2              5
14    27    71         1              2              5
15    29    76         1              2              5
Numerically, Y increases by 5 in each observation when X2 changes by 1.
Hence the true relationship could have been Y = 1 + 5X2.
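A two-line check (a sketch using the six observations in the table) confirms that the two relationships generate identical Y values:

```python
# Y = 2 + 3*X2 + X3 with X3 = 2*X2 - 1 collapses to Y = 1 + 5*X2.
X2 = [10, 11, 12, 13, 14, 15]
X3 = [2 * x - 1 for x in X2]
Y = [2 + 3 * x2 + x3 for x2, x3 in zip(X2, X3)]
print(Y)                          # [51, 56, 61, 66, 71, 76]
print([1 + 5 * x for x in X2])    # the same values
```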
[Figure: the same line graphs, annotated with the question Y = 1 + 5X2 ?]
$Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + u \qquad\qquad X_3 = 2X_2$
What would happen if you tried to run a regression when there is an exact linear relationship among the explanatory variables? We will investigate, using the model with two explanatory variables shown above. [Note: A disturbance term has now been included in the true model, but it makes no difference to the analysis.]
$b_2 = \dfrac{\sum (X_{2i}-\bar{X}_2)(Y_i-\bar{Y}) \sum (X_{3i}-\bar{X}_3)^2 - \sum (X_{3i}-\bar{X}_3)(Y_i-\bar{Y}) \sum (X_{2i}-\bar{X}_2)(X_{3i}-\bar{X}_3)}{\sum (X_{2i}-\bar{X}_2)^2 \sum (X_{3i}-\bar{X}_3)^2 - \left[\sum (X_{2i}-\bar{X}_2)(X_{3i}-\bar{X}_3)\right]^2}$
The expression for the multiple regression coefficient b2 is shown above. We will substitute for X3 using its relationship with X2.
$\sum (X_{3i}-\bar{X}_3)^2 = \sum (2X_{2i}-2\bar{X}_2)^2 = 4\sum (X_{2i}-\bar{X}_2)^2$
First, we replace the $\sum (X_{3i}-\bar{X}_3)^2$ terms in the numerator and denominator, using the expression just derived.
$\sum (X_{2i}-\bar{X}_2)(X_{3i}-\bar{X}_3) = \sum (X_{2i}-\bar{X}_2)(2X_{2i}-2\bar{X}_2) = 2\sum (X_{2i}-\bar{X}_2)^2$
Next, we replace the $\sum (X_{2i}-\bar{X}_2)(X_{3i}-\bar{X}_3)$ terms in the same way.
$\sum (X_{3i}-\bar{X}_3)(Y_i-\bar{Y}) = \sum (2X_{2i}-2\bar{X}_2)(Y_i-\bar{Y}) = 2\sum (X_{2i}-\bar{X}_2)(Y_i-\bar{Y})$
Finally, we replace the $\sum (X_{3i}-\bar{X}_3)(Y_i-\bar{Y})$ term in the numerator.
$b_2 = \dfrac{\sum (X_{2i}-\bar{X}_2)(Y_i-\bar{Y}) \cdot 4\sum (X_{2i}-\bar{X}_2)^2 - 2\sum (X_{2i}-\bar{X}_2)(Y_i-\bar{Y}) \cdot 2\sum (X_{2i}-\bar{X}_2)^2}{\sum (X_{2i}-\bar{X}_2)^2 \cdot 4\sum (X_{2i}-\bar{X}_2)^2 - \left[2\sum (X_{2i}-\bar{X}_2)^2\right]^2} = \dfrac{0}{0}$
After all the replacements, it turns out that the numerator and the denominator are both equal to zero. The regression coefficient is not defined. It is unusual for there to be an exact relationship among the explanatory variables in a regression. When this occurs, it is typically because there is a logical error in the specification.
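In matrix terms the same failure shows up as a singular cross-product matrix (a sketch assuming numpy; the data values are arbitrary):

```python
import numpy as np

# With X3 = 2*X2 exactly, the columns of the regressor matrix are
# linearly dependent, so X'X cannot be inverted and the normal
# equations have no unique solution.
X2 = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])
X3 = 2 * X2
X = np.column_stack([np.ones_like(X2), X2, X3])

rank = np.linalg.matrix_rank(X.T @ X)
print(rank)   # 2, not 3: one column is redundant
```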
. reg EARNINGS S EXP EXPSQ
      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  3,   536) =   45.57
       Model |  22762.4472     3  7587.48241           Prob > F      =  0.0000
    Residual |  89247.7839   536  166.507059           R-squared     =  0.2032
-------------+------------------------------           Adj R-squared =  0.1988
       Total |  112010.231   539  207.811189           Root MSE      =  12.904

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.754372   .2417286    11.39   0.000     2.279521    3.229224
         EXP |  -.2353907    .665197    -0.35   0.724    -1.542103    1.071322
       EXPSQ |   .0267843   .0219115     1.22   0.222    -.0162586    .0698272
       _cons |  -22.21964   5.514827    -4.03   0.000    -33.05297   -11.38632
------------------------------------------------------------------------------
$EARNINGS = \beta_1 + \beta_2 S + \beta_3 EXP + \beta_4 EXPSQ + u$
However, it often happens that there is an approximate relationship. For example, when relating earnings to schooling and work experience, it is often reasonable to suppose that the effect of work experience is subject to diminishing returns. A standard way of allowing for this is to include EXPSQ, the square of EXP, in the specification. According to the hypothesis of diminishing returns, β4 should be negative.
We fit this specification using Data Set 21. The schooling component of the regression results is not much affected by the inclusion of the EXPSQ term. The coefficient of S indicates that an extra year of schooling increases hourly earnings by $2.75.
. reg EARNINGS S EXP
      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =   67.54
       Model |  22513.6473     2  11256.8237           Prob > F      =  0.0000
    Residual |  89496.5838   537  166.660305           R-squared     =  0.2010
-------------+------------------------------           Adj R-squared =  0.1980
       Total |  112010.231   539  207.811189           Root MSE      =   12.91

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.678125   .2336497    11.46   0.000     2.219146    3.137105
         EXP |   .5624326   .1285136     4.38   0.000     .3099816    .8148837
       _cons |  -26.48501    4.27251    -6.20   0.000    -34.87789   -18.09213
------------------------------------------------------------------------------
In the specification without EXPSQ, shown above, it was 2.68, not much different.
The standard error, 0.23 in the specification without EXPSQ, is also little changed and the coefficient remains highly significant.
By contrast, the inclusion of the new term has had a dramatic effect on the coefficient of EXP. Now it is negative, which makes little sense, and insignificant!
Previously it had been positive and highly significant.
. reg EARNINGS S EXP
Source | SS df MS Number of obs = 540-------------+------------------------------ F( 2, 537) = 67.54 Model | 22513.6473 2 11256.8237 Prob > F = 0.0000 Residual | 89496.5838 537 166.660305 R-squared = 0.2010-------------+------------------------------ Adj R-squared = 0.1980 Total | 112010.231 539 207.811189 Root MSE = 12.91
------------------------------------------------------------------------------ EARNINGS | Coef. Std. Err. t P>|t| [95% Conf. Interval]-------------+---------------------------------------------------------------- S | 2.678125 .2336497 11.46 0.000 2.219146 3.137105 EXP | .5624326 .1285136 4.38 0.000 .3099816 .8148837 _cons | -26.48501 4.27251 -6.20 0.000 -34.87789 -18.09213------------------------------------------------------------------------------
MULTICOLLINEARITY
© Christopher Dougherty 1999–2006
. reg EARNINGS S EXP EXPSQ
Source | SS df MS Number of obs = 540-------------+------------------------------ F( 3, 536) = 45.57 Model | 22762.4472 3 7587.48241 Prob > F = 0.0000 Residual | 89247.7839 536 166.507059 R-squared = 0.2032-------------+------------------------------ Adj R-squared = 0.1988 Total | 112010.231 539 207.811189 Root MSE = 12.904
The reason for these problems is that EXPSQ is highly correlated with EXP. This makes it difficult to discriminate between the individual effects of EXP and EXPSQ, and the regression estimates tend to be erratic.
. cor EXP EXPSQ
(obs=540)

        |    EXP  EXPSQ
--------+------------------
    EXP | 1.0000
  EXPSQ | 0.9812  1.0000
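The near-unit correlation between a variable and its square is easy to reproduce with a quick simulation. The sketch below uses made-up experience values on a plausible range; nothing here comes from the actual EAEF data.

```python
import numpy as np

# Illustrative only: simulate work experience on a plausible range
# (the actual EAEF data are not reproduced here).
rng = np.random.default_rng(0)
exp_years = rng.uniform(0, 25, size=540)   # hypothetical EXP values
expsq = exp_years ** 2                     # EXPSQ = EXP squared

r = np.corrcoef(exp_years, expsq)[0, 1]
print(round(r, 3))  # typically close to 0.97 for a spread like this
```

Because EXP is always positive, EXP and EXPSQ rise together over the whole sample, which is what drives the correlation so high.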
When high correlations among the explanatory variables lead to erratic point estimates of the coefficients, large standard errors, and unsatisfactorily low t statistics, the regression is said to be suffering from multicollinearity.
Multicollinearity may also be caused by an approximate linear relationship among the explanatory variables. When there are only two, an approximate linear relationship implies a high correlation between them, but this is not necessarily the case when there are more than two.
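The last point can be illustrated with a hypothetical example (all data made up): three regressors where X3 is approximately X1 + X2, so that no single pairwise correlation looks alarming, yet X3 is almost perfectly explained by the other two together.

```python
import numpy as np

# Hypothetical illustration: with three regressors, an approximate linear
# relationship such as X3 ~ X1 + X2 produces severe multicollinearity
# even though no pairwise correlation looks extreme.
rng = np.random.default_rng(1)
x1 = rng.normal(size=1000)
x2 = rng.normal(size=1000)
x3 = x1 + x2 + rng.normal(scale=0.1, size=1000)   # near-exact linear combination

r12 = np.corrcoef(x1, x2)[0, 1]                   # near zero
r13 = np.corrcoef(x1, x3)[0, 1]                   # around 0.7, not extreme
# Yet X3 is almost perfectly explained by X1 and X2 together:
A = np.column_stack([np.ones(1000), x1, x2])
b, rss, *_ = np.linalg.lstsq(A, x3, rcond=None)
r_squared = 1 - rss[0] / ((x3 - x3.mean()) ** 2).sum()
print(round(abs(r12), 2), round(r13, 2), round(r_squared, 2))
```

The regression R-squared of X3 on X1 and X2, not any pairwise correlation, is what signals the problem when there are more than two regressors.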
ALLEVIATION OF MULTICOLLINEARITY
What can you do about multicollinearity if you encounter it? We will discuss some possible measures, looking at the model with two explanatory variables. Before doing this, two important points should be emphasized.
• First, multicollinearity does not cause the regression coefficients to be biased. Their probability distributions are still centered over the true values, provided that the regression is correctly specified, but they have unsatisfactorily large variances.
• Second, the standard errors and t tests remain valid. The standard errors are larger than they would have been in the absence of multicollinearity, warning us that the regression estimates are erratic.
Since the problem of multicollinearity is caused by the variances of the coefficients being unsatisfactorily large, we will seek ways of reducing them.
Possible measures for alleviating multicollinearity
(1) Reduce $\sigma_u^2$ by including further relevant variables in the model.

We will focus on the slope coefficient $b_2$ and look at the various components of its variance. We might be able to reduce the variance by bringing more variables into the model, thereby reducing $\sigma_u^2$, the variance of the disturbance term.
$\sigma_{b_2}^2 = \dfrac{\sigma_u^2}{\sum_i (X_{2i} - \bar{X}_2)^2} \times \dfrac{1}{1 - r_{X_2 X_3}^2} = \dfrac{\sigma_u^2}{n\,\mathrm{MSD}(X_2)} \times \dfrac{1}{1 - r_{X_2 X_3}^2}$
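This expression, $\sigma_{b_2}^2 = \sigma_u^2 / (n\,\mathrm{MSD}(X_2)) \times 1/(1 - r_{X_2 X_3}^2)$, can be checked numerically. The sketch below simulates repeated samples from a model with made-up coefficients (nothing here comes from the earnings data) and compares the empirical variance of $b_2$ with the formula.

```python
import numpy as np

# Minimal Monte Carlo check of the variance expression, with made-up
# parameter values.
rng = np.random.default_rng(2)
n, sigma_u = 100, 2.0
beta = np.array([1.0, 0.5, 0.5])

# Fixed regressors with a built-in correlation between X2 and X3
x2 = rng.normal(size=n)
x3 = 0.8 * x2 + rng.normal(scale=0.6, size=n)
X = np.column_stack([np.ones(n), x2, x3])

b2_draws = []
for _ in range(10000):
    y = X @ beta + rng.normal(scale=sigma_u, size=n)
    b2_draws.append(np.linalg.lstsq(X, y, rcond=None)[0][1])

msd_x2 = ((x2 - x2.mean()) ** 2).mean()   # mean square deviation of X2
r23 = np.corrcoef(x2, x3)[0, 1]           # sample correlation of X2 and X3
theory = sigma_u**2 / (n * msd_x2) / (1 - r23**2)
empirical = np.var(b2_draws)
print(abs(empirical - theory) / theory < 0.1)  # True: formula matches simulation
```

The $1/(1 - r_{X_2 X_3}^2)$ factor is the variance inflation caused by the correlation between the regressors; as the correlation approaches 1, the variance explodes.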
The estimator of the variance of the disturbance term is the residual sum of squares divided by n – k, where n is the number of observations (540) and k is the number of parameters (4). Here it is 166.5.
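The arithmetic can be checked directly from the numbers in the regression output:

```python
# The estimate quoted in the output: residual sum of squares divided by n - k.
rss, n, k = 89247.7839, 540, 4
sigma_u_sq_hat = rss / (n - k)
print(round(sigma_u_sq_hat, 1))  # 166.5, matching the Residual MS in the table
```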
. reg EARNINGS S EXP EXPSQ MALE ASVABC
Source | SS df MS Number of obs = 540-------------+------------------------------ F( 5, 534) = 37.24 Model | 28957.3532 5 5791.47063 Prob > F = 0.0000 Residual | 83052.8779 534 155.529734 R-squared = 0.2585-------------+------------------------------ Adj R-squared = 0.2516 Total | 112010.231 539 207.811189 Root MSE = 12.471
------------------------------------------------------------------------------ EARNINGS | Coef. Std. Err. t P>|t| [95% Conf. Interval]-------------+---------------------------------------------------------------- S | 2.031419 .296218 6.86 0.000 1.449524 2.613315 EXP | -.0816828 .6441767 -0.13 0.899 -1.347114 1.183748 EXPSQ | .0130223 .021334 0.61 0.542 -.0288866 .0549311 MALE | 5.762358 1.104734 5.22 0.000 3.592201 7.932515 ASVABC | .2447687 .0714294 3.43 0.001 .1044516 .3850858 _cons | -26.18541 5.452032 -4.80 0.000 -36.89547 -15.47535------------------------------------------------------------------------------
We now add two new variables that are often found to be determinants of earnings: MALE, sex of respondent, and ASVABC, the composite score on the cognitive tests in the Armed Services Vocational Aptitude Battery. MALE is a qualitative variable; the treatment of such variables will be explained in Chapter 5.
Both MALE and ASVABC have coefficients significant at the 0.1% level.
However, they account for only a small proportion of the variance in earnings, and the reduction in the estimate of the variance of the disturbance term (from 166.5 to 155.5) is likewise small.
As a consequence the impact on the standard errors of EXP and EXPSQ is negligible.
Note how unstable the coefficients are. This is often a sign of multicollinearity.
(2) Increase the number of observations.
Surveys: increase the budget, or use clustering.
Time series: use quarterly instead of annual data.
The next factor to look at is n, the number of observations. If you are working with cross-section data (individuals, households, enterprises, etc.) and you are undertaking a survey, you could increase the size of the sample by negotiating a bigger budget. Alternatively, you could make use of a clustering technique: you divide the country into localities and select a number of these randomly, perhaps using stratified random sampling to make sure that metropolitan, other urban, and rural areas are properly represented. You then confine the survey to the localities selected. This reduces the travel time and cost of the fieldworkers, allowing them to interview a greater number of respondents.
If you are working with time series data, you may be able to increase the sample by working with shorter time intervals for the data, for example quarterly or even monthly data instead of annual data.
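The effect of a larger n can be sketched with a simulation (made-up design and coefficients): holding the spread of X fixed, the standard deviation of the slope estimator falls roughly in proportion to $1/\sqrt{n}$.

```python
import numpy as np

# Sketch: how the precision of a slope estimate improves as n grows.
# All values are made up; with the same spread of X, the standard
# deviation of the slope falls roughly as 1/sqrt(n).
rng = np.random.default_rng(3)

def slope_sd(n, reps=4000):
    x = np.linspace(0, 10, n)                   # same spread for any n
    X = np.column_stack([np.ones(n), x])
    XtX_inv = np.linalg.inv(X.T @ X)
    slopes = [(XtX_inv @ X.T @ (1.0 + 0.5 * x + rng.normal(scale=2.0, size=n)))[1]
              for _ in range(reps)]
    return np.std(slopes)

ratio = slope_sd(100) / slope_sd(400)
print(round(ratio, 1))  # close to 2, i.e. sqrt(400/100)
```

Quadrupling the sample roughly halves the standard errors, which is why the larger data set below yields visibly smaller ones.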
. reg EARNINGS S EXP EXPSQ MALE ASVABC
Source | SS df MS Number of obs = 2714-------------+------------------------------ F( 5, 2708) = 183.99 Model | 161795.573 5 32359.1147 Prob > F = 0.0000 Residual | 476277.268 2708 175.877869 R-squared = 0.2536-------------+------------------------------ Adj R-squared = 0.2522 Total | 638072.841 2713 235.190874 Root MSE = 13.262
------------------------------------------------------------------------------ EARNINGS | Coef. Std. Err. t P>|t| [95% Conf. Interval]-------------+---------------------------------------------------------------- S | 2.312461 .135428 17.08 0.000 2.046909 2.578014 EXP | -.3270651 .308231 -1.06 0.289 -.9314569 .2773268 EXPSQ | .023743 .0101558 2.34 0.019 .0038291 .0436569 MALE | 5.947206 .5221755 11.39 0.000 4.923303 6.971108 ASVABC | .2086846 .0336869 6.19 0.000 .1426301 .2747392 _cons | -27.40462 2.579435 -10.62 0.000 -32.46248 -22.34676------------------------------------------------------------------------------
Here is the result of running the regression with all 2,714 observations in the EAEF data set.
Comparing this result with that using Data Set 21, we see that the standard errors are much smaller, as expected.
As a consequence, the t statistics of the variables are higher. However, the correlation between EXP and EXPSQ is as high as in the smaller sample, and the increase in the sample size has not been large enough to have much impact on the problem of multicollinearity.
(3) Increase $\mathrm{MSD}(X_2)$, the mean square deviation of $X_2$.
A third possible way of reducing the problem of multicollinearity might be to increase the variation in the explanatory variables. This is possible only at the design stage of a survey. For example, if you were planning a household survey with the aim of investigating how expenditure patterns vary with income, you should make sure that the sample included relatively rich and relatively poor households as well as middle-income households.
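A simulation (entirely made-up values) shows why the spread of X matters: two designs with the same sample size and disturbance variance, but one with ten times the spread in X, give very different precision for the slope.

```python
import numpy as np

# Sketch: same n and disturbance variance, different spreads of X.
# The wider design estimates the slope far more precisely, as the
# MSD term in the variance formula implies. All values are made up.
rng = np.random.default_rng(4)

def slope_sd(x, reps=5000):
    X = np.column_stack([np.ones(len(x)), x])
    XtX_inv = np.linalg.inv(X.T @ X)
    slopes = [(XtX_inv @ X.T @ (2.0 + 0.3 * x + rng.normal(scale=1.0, size=len(x))))[1]
              for _ in range(reps)]
    return np.std(slopes)

sd_narrow = slope_sd(np.linspace(4.5, 5.5, 50))   # little variation in X
sd_wide = slope_sd(np.linspace(0.0, 10.0, 50))    # ten times the spread
print(sd_narrow / sd_wide > 5)  # True: the narrow design is far noisier
```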
(4) Reduce $r_{X_2 X_3}$.
Another possibility might be to reduce the correlation between the explanatory variables. This is possible only at the design stage of a survey and even then it is not easy.
(5) Combine the correlated variables.
If the correlated variables are similar conceptually, it may be reasonable to combine them into some overall index.
That is precisely what has been done with the three cognitive ASVAB variables. ASVABC has been calculated as a weighted average of ASVAB02 (arithmetic reasoning), ASVAB03 (word knowledge), and ASVAB04 (paragraph comprehension). The three components are highly correlated, and by combining them as a weighted average, rather than using them individually, one avoids a potential problem of multicollinearity.
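The idea can be sketched with simulated test scores driven by a common underlying factor. The weights below are hypothetical equal weights, chosen purely for illustration, not the weights actually used to construct ASVABC.

```python
import numpy as np

# Sketch of the idea behind a composite index like ASVABC, with
# made-up data: three components share a common underlying factor,
# so they are highly correlated with one another.
rng = np.random.default_rng(5)
ability = rng.normal(size=540)                       # common underlying factor
asvab02 = ability + rng.normal(scale=0.4, size=540)  # arithmetic reasoning
asvab03 = ability + rng.normal(scale=0.4, size=540)  # word knowledge
asvab04 = ability + rng.normal(scale=0.4, size=540)  # paragraph comprehension

r = np.corrcoef(asvab02, asvab03)[0, 1]
composite = (asvab02 + asvab03 + asvab04) / 3        # equal weights, for illustration
print(round(r, 2))  # the components are highly correlated with each other
```

Entering `composite` as a single regressor avoids the near-collinearity that would arise from entering the three components separately.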
(6) Drop some of the correlated variables.
Dropping some of the correlated variables, if they have insignificant coefficients, may alleviate multicollinearity. However, this approach is dangerous. It is possible that some of the variables with insignificant coefficients really do belong in the model and that their coefficients are insignificant only because of multicollinearity. If that is the case, omitting them may cause omitted variable bias, to be discussed in Chapter 6.
(7) Empirical restriction
$Y = \beta_1 + \beta_2 X + \beta_3 P + u$
A further way of dealing with the problem of multicollinearity is to use extraneous information, if available, concerning the coefficient of one of the variables. For example, suppose that Y in the equation above is the demand for a category of consumer expenditure, X is aggregate disposable personal income, and P is a price index for the category. To fit a model of this type you would use time series data. If X and P are highly correlated, which is often the case with time series variables, the problem of multicollinearity might be eliminated in the following way.
$Y = \beta_1 + \beta_2 X + \beta_3 P + u$
$Y' = \beta_1' + \beta_2' X' + u'$
$\hat{Y}' = b_1' + b_2' X'$
Obtain data on income and expenditure on the category from a household survey and regress Y' on X'. (The prime marks indicate that the data are household data, not aggregate data.) This is a simple regression because there will be relatively little variation in the price paid by the households.
$Y = \beta_1 + \beta_2 X + \beta_3 P + u$
$Y' = \beta_1' + \beta_2' X' + u'$, fitted as $\hat{Y}' = b_1' + b_2' X'$

Imposing $\beta_2 = b_2'$:
$Y = \beta_1 + b_2' X + \beta_3 P + u$
$Z = Y - b_2' X = \beta_1 + \beta_3 P + u$
Now substitute $b_2'$ for $\beta_2$ in the time series model. Subtract $b_2' X$ from both sides, and regress $Z = Y - b_2' X$ on price. This is a simple regression, so multicollinearity has been eliminated.
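The two-step procedure can be sketched with simulated data; every number below (coefficients, sample sizes, noise levels) is made up for illustration. Step 1 estimates the income coefficient from "household" data where price barely varies; step 2 regresses $Z = Y - b_2' X$ on price alone.

```python
import numpy as np

# Sketch of the extraneous-information procedure with simulated data.
rng = np.random.default_rng(6)
beta1, beta2, beta3 = 5.0, 0.8, -2.0   # made-up true coefficients

# 'Household' cross-section: income varies, price effectively fixed
x_h = rng.uniform(10, 50, 300)
y_h = beta1 + beta2 * x_h + rng.normal(scale=2.0, size=300)
b2_prime = np.polyfit(x_h, y_h, 1)[0]           # simple regression slope

# 'Time series': income and price highly correlated
x_t = np.linspace(20, 40, 40) + rng.normal(scale=0.5, size=40)
p_t = 0.1 * x_t + rng.normal(scale=0.2, size=40)     # price tracks income
y_t = beta1 + beta2 * x_t + beta3 * p_t + rng.normal(scale=1.0, size=40)

z = y_t - b2_prime * x_t                        # impose the restriction
b3 = np.polyfit(p_t, z, 1)[0]                   # simple regression of Z on P
print(round(b3, 1))  # close to the true beta3 of -2
```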
There are some problems with this technique. First, the $\beta_2$ coefficients may be conceptually different in time series and cross-section contexts. Second, since we subtract the estimated income component $b_2' X$, not the true income component $\beta_2 X$, from Y when constructing Z, we have introduced an element of measurement error into the dependent variable.
(8) Theoretical restriction
$S = \beta_1 + \beta_2 ASVABC + \beta_3 SM + \beta_4 SF + u$
Last, but by no means least, is the use of a theoretical restriction, which is defined as a hypothetical relationship among the parameters of a regression model. It will be explained using an educational attainment model as an example. Suppose that we hypothesize that highest grade completed, S, depends on ASVABC and on the highest grade completed by the respondent's mother and father, SM and SF respectively.
. reg S ASVABC SM SF
Source | SS df MS Number of obs = 540-------------+------------------------------ F( 3, 536) = 104.30 Model | 1181.36981 3 393.789935 Prob > F = 0.0000 Residual | 2023.61353 536 3.77539837 R-squared = 0.3686-------------+------------------------------ Adj R-squared = 0.3651 Total | 3204.98333 539 5.94616574 Root MSE = 1.943
------------------------------------------------------------------------------ S | Coef. Std. Err. t P>|t| [95% Conf. Interval]-------------+---------------------------------------------------------------- ASVABC | .1257087 .0098533 12.76 0.000 .1063528 .1450646 SM | .0492424 .0390901 1.26 0.208 -.027546 .1260309 SF | .1076825 .0309522 3.48 0.001 .04688 .1684851 _cons | 5.370631 .4882155 11.00 0.000 4.41158 6.329681------------------------------------------------------------------------------
A one-point increase in ASVABC increases S by 0.13 years.
S increases by 0.05 years for every extra year of schooling of the mother and 0.11 years for every extra year of schooling of the father. Mother's education is generally held to be at least as important as, if not more important than, father's education for educational attainment, so this outcome is unexpected.
It is also surprising that the coefficient of SM is not significant, even at the 5% level, using a one-sided test.
However, assortative mating leads to correlation between SM and SF and the regression appears to be suffering from multicollinearity.
. cor SM SF
(obs=540)

        |     SM     SF
--------+------------------
     SM | 1.0000
     SF | 0.6241  1.0000
Suppose that we hypothesize that mother's and father's education are equally important. We can then impose the restriction $\beta_3 = \beta_4$.
$S = \beta_1 + \beta_2 ASVABC + \beta_3 (SM + SF) + u = \beta_1 + \beta_2 ASVABC + \beta_3 SP + u$
This allows us to rewrite the equation as shown.
Defining SP to be the sum of SM and SF, the equation may be rewritten as shown. The problem caused by the correlation between SM and SF has been eliminated.
. g SP=SM+SF
. reg S ASVABC SP
Source | SS df MS Number of obs = 540-------------+------------------------------ F( 2, 537) = 156.04 Model | 1177.98338 2 588.991689 Prob > F = 0.0000 Residual | 2026.99996 537 3.77467403 R-squared = 0.3675-------------+------------------------------ Adj R-squared = 0.3652 Total | 3204.98333 539 5.94616574 Root MSE = 1.9429
------------------------------------------------------------------------------ S | Coef. Std. Err. t P>|t| [95% Conf. Interval]-------------+---------------------------------------------------------------- ASVABC | .1253106 .0098434 12.73 0.000 .1059743 .1446469 SP | .0828368 .0164247 5.04 0.000 .0505722 .1151014 _cons | 5.29617 .4817972 10.99 0.000 4.349731 6.242608------------------------------------------------------------------------------
The estimate of $\beta_3$ is now 0.083 and highly significant.
Not surprisingly, this is a compromise between the coefficients of SM and SF in the previous specification.
The standard error of SP is much smaller than those of SM and SF. The use of the restriction has led to a large gain in efficiency and the problem of multicollinearity has been eliminated.
The t statistic is very high. Thus it would appear that imposing the restriction has improved the regression results. However, the restriction may not be valid. We should test it. Testing theoretical restrictions is one of the topics in Chapter 6.
F TESTS OF GOODNESS OF FIT
This sequence describes two F tests of goodness of fit in a multiple regression model. The first relates to the goodness of fit of the equation as a whole. We will consider the general case where there are k – 1 explanatory variables. For the F test of goodness of fit of the equation as a whole, the null hypothesis, in words, is that the model has no explanatory power at all. The model will have no explanatory power if it turns out that Y is unrelated to any of the explanatory variables. Mathematically, therefore, the null hypothesis is that all the coefficients β2, ..., βk are zero.
The alternative hypothesis is that at least one of these coefficients is different from zero. In the multiple regression model there is a difference between the roles of the F and t tests. The F test tests the joint explanatory power of the variables, while the t tests test their explanatory power individually. In the simple regression model the F test was equivalent to the (two-sided) t test on the slope coefficient because the ‘group’ consisted of just one variable.
Y = β1 + β2X2 + ... + βkXk + u
H0: β2 = ... = βk = 0
H1: at least one βj ≠ 0
F(k – 1, n – k) = [ESS/(k – 1)] / [RSS/(n – k)] = [R2/(k – 1)] / [(1 – R2)/(n – k)]
ESS / TSS is the definition of R2. RSS / TSS is equal to (1 – R2). (See the last sequence in Chapter 2.)
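Because of these identities, the two forms of the F statistic must agree. A quick numerical check in Python, using the sums of squares from the educational attainment regression used in this sequence:

```python
# Sums of squares from the Stata ANOVA table for: reg S ASVABC SM SF
ESS, RSS, TSS = 1181.36981, 2023.61353, 3204.98333
n, k = 540, 4  # 540 observations, 4 parameters (intercept + 3 slopes)

# F in terms of sums of squares
F_ss = (ESS / (k - 1)) / (RSS / (n - k))

# F in terms of R-squared, using R2 = ESS/TSS and 1 - R2 = RSS/TSS
R2 = ESS / TSS
F_r2 = (R2 / (k - 1)) / ((1 - R2) / (n - k))

print(round(F_ss, 2), round(F_r2, 2))  # both print as 104.3
```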
H0: β2 = β3 = β4 = 0
The educational attainment model will be used as an example. We will suppose that S depends on ASVABC, the ability score, and on SM and SF, the highest grades completed by the mother and father of the respondent, respectively. The null hypothesis for the F test of goodness of fit is that all three slope coefficients are equal to zero. The alternative hypothesis is that at least one of them is non-zero.
S = β1 + β2ASVABC + β3SM + β4SF + u
. reg S ASVABC SM SF
      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  3,   536) =  104.30
       Model |  1181.36981     3  393.789935           Prob > F      =  0.0000
    Residual |  2023.61353   536  3.77539837           R-squared     =  0.3686
-------------+------------------------------           Adj R-squared =  0.3651
       Total |  3204.98333   539  5.94616574           Root MSE      =   1.943

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |   .1257087   .0098533    12.76   0.000     .1063528    .1450646
          SM |   .0492424   .0390901     1.26   0.208     -.027546    .1260309
          SF |   .1076825   .0309522     3.48   0.001       .04688    .1684851
       _cons |   5.370631   .4882155    11.00   0.000      4.41158    6.329681
------------------------------------------------------------------------------
Here is the regression output using Data Set 21.
F(k – 1, n – k) = [ESS/(k – 1)] / [RSS/(n – k)]
F(3, 536) = (1181/3) / (2024/536) = 104.3
The numerator of the F statistic is the explained sum of squares divided by k – 1. In the Stata output these numbers are given in the Model row.
The denominator is the residual sum of squares divided by the number of degrees of freedom remaining.
Hence the F statistic is 104.3. All serious regression packages compute it for you as part of the diagnostics in the regression output.
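The same calculation can be reproduced directly from the Model and Residual rows of the output (a minimal sketch in Python):

```python
# Each row of the Stata ANOVA table gives a sum of squares (SS), its
# degrees of freedom (df), and their ratio, the mean square (MS).
# F is MS(Model) / MS(Residual).
ms_model = 1181.36981 / 3    # ESS / (k - 1)
ms_resid = 2023.61353 / 536  # RSS / (n - k)
F = ms_model / ms_resid
print(round(F, 2))           # 104.3, matching F(3, 536) in the header
```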
The critical value for F(3,536) is not given in the F tables, but we know it must be lower than F(3,500), which is given. At the 0.1% level, this is 5.51. Hence we easily reject H0 at the 0.1% level.
Fcrit,0.1%(3,500) = 5.51
It is unusual for the F statistic not to be significant if some of the t statistics are significant. In principle it could happen, though.
• Suppose that you ran a regression with 40 explanatory variables, none being a true determinant of the dependent variable. Then the F statistic should be low enough for H0 not to be rejected. However, if you are performing t tests on the slope coefficients at the 5% level, with a 5% chance of a Type I error, on average 2 of the 40 variables could be expected to have ‘significant’ coefficients.
• The opposite can easily happen, though. Suppose you have a multiple regression model which is correctly specified and the R2 is high. You would expect to have a highly significant F statistic. However, if the explanatory variables are highly correlated and the model is subject to severe multicollinearity, the standard errors of the slope coefficients could all be so large that none of the t statistics is significant. In this situation you would know that your model is a good one, but you are not in a position to pinpoint the contributions made by the explanatory variables individually.
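The arithmetic behind the first point can be sketched as follows; the ‘at least one’ probability is an added illustration assuming independent tests, not a figure from the slides:

```python
# With 40 irrelevant regressors and 5% t tests, the expected number of
# 'significant' coefficients purely by chance is 40 x 0.05 = 2.
n_vars, alpha = 40, 0.05
expected_false_positives = n_vars * alpha
print(expected_false_positives)       # 2.0

# If the tests were independent, the chance of at least one spurious
# 'significant' t statistic would be high.
p_at_least_one = 1 - (1 - alpha) ** n_vars
print(round(p_at_least_one, 2))       # 0.87
```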
Original model: Y = β1 + β2X2 + u,  residual sum of squares RSS1
Revised model:  Y = β1 + β2X2 + β3X3 + β4X4 + u,  residual sum of squares RSS2
We now come to the other F test of goodness of fit. This is a test of the joint explanatory power of a group of variables when they are added to a regression model. For example, in the original specification, Y may be written as a simple function of X2. In the second, we add X3 and X4.
H0: β3 = β4 = 0
H1: β3 ≠ 0 or β4 ≠ 0 or both
The null hypothesis for the F test is that neither X3 nor X4 belongs in the model. The alternative hypothesis is that at least one of them does, perhaps both.
F(cost, d.f. remaining) = (improvement / cost) / (remaining unexplained / degrees of freedom remaining)
For this F test, and for several others which we will encounter, it is useful to think of the F statistic as having the structure indicated above. The ‘improvement’ is the reduction in the residual sum of squares when the change is made, in this case, when the group of new variables is added. The ‘cost’ is the reduction in the number of degrees of freedom remaining after making the change. In the present case it is equal to the number of new variables added, because that number of new parameters are estimated. (Remember that the number of degrees of freedom in a regression equation is the number of observations, less the number of parameters estimated. In this example, it would fall from n – 2 to n – 4 when X3 and X4 are added.)
The ‘remaining unexplained’ is the residual sum of squares after making the change. The ‘degrees of freedom remaining’ is the number of degrees of freedom remaining after making the change.
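This structure can be captured in a short helper (an illustrative Python sketch; the function name is invented here):

```python
# A minimal sketch of the 'improvement / cost' F statistic for testing the
# joint significance of a group of variables added to a regression.
def nested_f(rss_restricted, rss_unrestricted, n_new_vars, df_remaining):
    """F statistic for adding a group of variables.

    improvement           = RSS1 - RSS2 (fall in the residual sum of squares)
    cost                  = number of new variables added
    remaining unexplained = RSS2
    """
    improvement = rss_restricted - rss_unrestricted
    return (improvement / n_new_vars) / (rss_unrestricted / df_remaining)

# Example with the educational attainment numbers used in this sequence:
print(round(nested_f(2123.01275, 2023.61353, 2, 536), 2))  # 13.16
```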
. reg S ASVABC
      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =  274.19
       Model |  1081.97059     1  1081.97059           Prob > F      =  0.0000
    Residual |  2123.01275   538  3.94612035           R-squared     =  0.3376
-------------+------------------------------           Adj R-squared =  0.3364
       Total |  3204.98333   539  5.94616574           Root MSE      =  1.9865

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |    .148084   .0089431    16.56   0.000     .1305165    .1656516
       _cons |   6.066225   .4672261    12.98   0.000     5.148413    6.984036
------------------------------------------------------------------------------
We will illustrate the test with an educational attainment example. Here is S regressed on ASVABC using Data Set 21. We make a note of the residual sum of squares.
. reg S ASVABC SM SF
      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  3,   536) =  104.30
       Model |  1181.36981     3  393.789935           Prob > F      =  0.0000
    Residual |  2023.61353   536  3.77539837           R-squared     =  0.3686
-------------+------------------------------           Adj R-squared =  0.3651
       Total |  3204.98333   539  5.94616574           Root MSE      =   1.943

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |   .1257087   .0098533    12.76   0.000     .1063528    .1450646
          SM |   .0492424   .0390901     1.26   0.208     -.027546    .1260309
          SF |   .1076825   .0309522     3.48   0.001       .04688    .1684851
       _cons |   5.370631   .4882155    11.00   0.000      4.41158    6.329681
------------------------------------------------------------------------------
Now we have added the highest grade completed by each parent. Does parental education have a significant impact? Well, we can see that a t test would show that SF has a highly significant coefficient, but we will perform the F test anyway. We make a note of RSS.
The F statistic is 13.16. The critical value of F(2,500) at the 0.1% level is 7.00. The critical value of F(2,536) must be lower, so we reject H0 and conclude that the parental education variables do have significant joint explanatory power.
F(2, 540 – 4) = [(RSS1 – RSS2)/2] / [RSS2/(540 – 4)] = [(2123.0 – 2023.6)/2] / [2023.6/536] = 13.16
Fcrit,0.1%(2,500) = 7.00
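The calculation, reproduced in Python with the residual sums of squares from the two regressions above:

```python
# Test of the joint explanatory power of SM and SF.
RSS1 = 2123.01275        # reg S ASVABC
RSS2 = 2023.61353        # reg S ASVABC SM SF
cost = 2                 # two new variables, SM and SF
df_remaining = 540 - 4

F = ((RSS1 - RSS2) / cost) / (RSS2 / df_remaining)
print(round(F, 2))       # 13.16

# Compare with the tabulated critical value at the 0.1% level:
F_crit = 7.00            # F(2,500); the value for F(2,536) is lower
print(F > F_crit)        # True: reject H0
```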
Original model: Y = β1 + β2X2 + β3X3 + u,  residual sum of squares RSS1
Revised model:  Y = β1 + β2X2 + β3X3 + β4X4 + u,  residual sum of squares RSS2
This sequence will conclude by showing that t tests are equivalent to marginal F tests when the additional group of variables consists of just one variable. Suppose that in the original model Y is a function of X2 and X3, and that in the revised model X4 is added.
H0: β4 = 0
H1: β4 ≠ 0
The null hypothesis for the F test of the explanatory power of the additional ‘group’ is that all the new slope coefficients are equal to zero. There is of course only one new slope coefficient, β4.
The F test has the usual structure. We will illustrate it with an educational attainment model where S depends on ASVABC and SM in the original model and on SF as well in the revised model.
. reg S ASVABC SM
      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =  147.36
       Model |  1135.67473     2  567.837363           Prob > F      =  0.0000
    Residual |  2069.30861   537  3.85346109           R-squared     =  0.3543
-------------+------------------------------           Adj R-squared =  0.3519
       Total |  3204.98333   539  5.94616574           Root MSE      =   1.963

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |   .1328069   .0097389    13.64   0.000     .1136758     .151938
          SM |   .1235071   .0330837     3.73   0.000     .0585178    .1884963
       _cons |   5.420733   .4930224    10.99   0.000     4.452244    6.389222
------------------------------------------------------------------------------
Here is the regression of S on ASVABC and SM. We make a note of the residual sum of squares.
Now we add SF and again make a note of the residual sum of squares.
. reg S ASVABC SM SF
      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  3,   536) =  104.30
       Model |  1181.36981     3  393.789935           Prob > F      =  0.0000
    Residual |  2023.61353   536  3.77539837           R-squared     =  0.3686
-------------+------------------------------           Adj R-squared =  0.3651
       Total |  3204.98333   539  5.94616574           Root MSE      =   1.943

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |   .1257087   .0098533    12.76   0.000     .1063528    .1450646
          SM |   .0492424   .0390901     1.26   0.208     -.027546    .1260309
          SF |   .1076825   .0309522     3.48   0.001       .04688    .1684851
       _cons |   5.370631   .4882155    11.00   0.000      4.41158    6.329681
------------------------------------------------------------------------------
The improvement on adding SF is the reduction in the residual sum of squares.
F(1, 540 – 4) = [(RSS1 – RSS2)/1] / [RSS2/(540 – 4)] = [(2069.3 – 2023.6)/1] / [2023.6/536] = 12.10
The cost is just the single degree of freedom lost when estimating β4.
The remaining unexplained is the residual sum of squares after adding SF.
The number of degrees of freedom remaining after adding SF is 540 – 4 = 536.
Fcrit,0.1%(1,500) = 10.96
The critical value of F at the 0.1% significance level with 500 degrees of freedom is 10.96. The critical value with 536 degrees of freedom must be lower, so we reject H0 at the 0.1% level.
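Reproducing the marginal F test in Python, with the residual sums of squares from the two regressions above:

```python
# The marginal F test for adding SF alone.
RSS1 = 2069.30861        # reg S ASVABC SM
RSS2 = 2023.61353        # reg S ASVABC SM SF
F = ((RSS1 - RSS2) / 1) / (RSS2 / (540 - 4))
print(round(F, 2))       # 12.1

F_crit = 10.96           # F(1,500) at the 0.1% level; F(1,536) is lower
print(F > F_crit)        # True: reject H0
```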
The null hypothesis we are testing is exactly the same as for a two-sided t test on the coefficient of SF.
We will perform the t test. The t statistic is 3.48.
The critical value of t at the 0.1% level with 500 degrees of freedom is 3.31. The critical value with 536 degrees of freedom must be lower. So we reject H0 again.
tcrit,0.1% = 3.31
It can be shown that the F statistic for the F test of the explanatory power of a ‘group’ of one variable must be equal to the square of the t statistic for that variable. (The difference in the last digit is due to rounding error.)
3.48² = 12.11
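The rounding discrepancy disappears if the unrounded t statistic is used (a quick Python check):

```python
# The t statistic for SF and the marginal F statistic for adding SF.
coef_SF, se_SF = 0.1076825, 0.0309522
t = coef_SF / se_SF
print(round(t, 2))                    # 3.48

RSS1, RSS2 = 2069.30861, 2023.61353
F = (RSS1 - RSS2) / (RSS2 / 536)
print(round(t ** 2, 2), round(F, 2))  # 12.1 and 12.1: t squared equals F
```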
3.31² = 10.96
It can also be shown that the critical value of F must be equal to the square of the critical value of t. (The critical values shown are for 500 degrees of freedom, but this must also be true for 536 degrees of freedom.)
Hence the conclusions of the two tests must coincide.
This result means that the t test of the coefficient of a variable is a test of its marginal explanatory power, after all the other variables have been included in the equation.
• If the variable is correlated with one or more of the other variables, its marginal explanatory power may be quite low, even if it genuinely belongs in the model.
• If all the variables are correlated, it is possible for all of them to have low marginal explanatory power and for none of the t tests to be significant, even though the F test for their joint explanatory power is highly significant.
• If this is the case, the model is said to be suffering from the problem of multicollinearity discussed earlier.