
© Christopher Dougherty 1999–2006

A.1: The model is linear in parameters and correctly specified.

A.2: There does not exist an exact linear relationship among the regressors in the sample.

A.3: The disturbance term has zero expectation.

A.4: The disturbance term is homoscedastic.

A.5: The values of the disturbance term have independent distributions.

A.6: The disturbance term has a normal distribution.

PROPERTIES OF THE MULTIPLE REGRESSION COEFFICIENTS

Y = β1 + β2X2 + ... + βkXk + u

Moving from the simple to the multiple regression model, we start by restating the regression model assumptions. Only A.2 is different. Previously it was stated that there must be some variation in the X variable. We will explain the difference in one of the following lectures. Provided that the regression model assumptions are valid, the OLS estimators in the multiple regression model are unbiased and efficient, as in the simple regression model.


Y = β1 + β2X2 + β3X3 + u        Ŷ = b1 + b2X2 + b3X3

b2 = [Σ(X2i − X̄2)(Yi − Ȳ) · Σ(X3i − X̄3)² − Σ(X3i − X̄3)(Yi − Ȳ) · Σ(X2i − X̄2)(X3i − X̄3)]
     / [Σ(X2i − X̄2)² · Σ(X3i − X̄3)² − (Σ(X2i − X̄2)(X3i − X̄3))²]

We will not attempt to prove efficiency. We will however outline a proof of unbiasedness.
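As a numerical sanity check, here is a minimal pure-Python sketch of the b2 formula (hypothetical data, not from the text): when Y is an exact linear function of X2 and X3, with no disturbance term, the formula returns the true slope.

```python
# Sketch of the b2 formula for a model with two explanatory variables.
# All sums are over deviations of each variable from its sample mean.

def mean(v):
    return sum(v) / len(v)

def b2_hat(X2, X3, Y):
    x2 = [x - mean(X2) for x in X2]
    x3 = [x - mean(X3) for x in X3]
    y = [v - mean(Y) for v in Y]
    Sx2y = sum(a * b for a, b in zip(x2, y))
    Sx3y = sum(a * b for a, b in zip(x3, y))
    Sx2x2 = sum(a * a for a in x2)
    Sx3x3 = sum(a * a for a in x3)
    Sx2x3 = sum(a * b for a, b in zip(x2, x3))
    return (Sx2y * Sx3x3 - Sx3y * Sx2x3) / (Sx2x2 * Sx3x3 - Sx2x3 ** 2)

X2 = [1, 2, 3, 4, 5, 6]
X3 = [2, 1, 4, 3, 6, 5]                          # not an exact function of X2
Y = [1 + 2 * a + 3 * b for a, b in zip(X2, X3)]  # true slope on X2 is 2
print(b2_hat(X2, X3, Y))                         # 2.0 (up to rounding)
```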


Yi − Ȳ = (β1 + β2X2i + β3X3i + ui) − (β1 + β2X̄2 + β3X̄3 + ū)
       = β2(X2i − X̄2) + β3(X3i − X̄3) + (ui − ū)

The first step, as always, is to substitute for Y from the true relationship. The Y ingredients of b2 are actually in the form of Yi minus its mean, so it is convenient to obtain an expression for this.


b2 = β2 + Σ ai* ui

After simplifying, we find that b2 can be decomposed into the true value β2 plus a weighted linear combination of the values of the disturbance term in the sample. This is what we found in the simple regression model. The difference is that the expression for the weights, which depend on all the values of X2 and X3 in the sample, is considerably more complicated.
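This decomposition can be verified numerically. In the sketch below (hypothetical data), the weights ai* are written out in deviation form, as implied by the b2 formula; this is an illustrative check, not code from the text.

```python
import random

def mean(v):
    return sum(v) / len(v)

random.seed(0)
n = 20
X2 = [random.uniform(0, 10) for _ in range(n)]
X3 = [random.uniform(0, 10) for _ in range(n)]
beta1, beta2, beta3 = 1.0, 2.0, 3.0
u = [random.gauss(0, 1) for _ in range(n)]
Y = [beta1 + beta2 * a + beta3 * b + e for a, b, e in zip(X2, X3, u)]

x2 = [x - mean(X2) for x in X2]
x3 = [x - mean(X3) for x in X3]
y = [v - mean(Y) for v in Y]
Sx2x2 = sum(a * a for a in x2)
Sx3x3 = sum(a * a for a in x3)
Sx2x3 = sum(a * b for a, b in zip(x2, x3))
den = Sx2x2 * Sx3x3 - Sx2x3 ** 2

b2 = (sum(a * b for a, b in zip(x2, y)) * Sx3x3
      - sum(a * b for a, b in zip(x3, y)) * Sx2x3) / den

# The weights a*_i depend only on the X2 and X3 values in the sample.
a_star = [(a * Sx3x3 - b * Sx2x3) / den for a, b in zip(x2, x3)]
diff = b2 - (beta2 + sum(w * e for w, e in zip(a_star, u)))
print(abs(diff) < 1e-8)   # True: b2 = beta2 + sum of a*_i u_i
```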


E(b2) = E(β2 + Σ ai* ui) = β2 + E(Σ ai* ui) = β2 + Σ E(ai* ui)

Having reached this point, proving unbiasedness is easy. Taking expectations, β2 is unaffected, being a constant. The expectation of a sum is equal to the sum of expectations.


The a* terms are nonstochastic since they depend only on the values of X2 and X3, and these are assumed to be nonstochastic. Hence the a* terms may be taken out of the expectations as factors.


By Assumption A.3, E(ui) = 0 for all i. Hence E(b2) is equal to β2 and so b2 is an unbiased estimator. Similarly b3 is an unbiased estimator of β3.
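Unbiasedness can also be illustrated by simulation. The Monte Carlo sketch below (hypothetical parameter values, pure Python) holds X2 and X3 fixed, redraws the disturbance term many times, and averages the resulting b2 estimates; the average settles near the true β2.

```python
import random

def mean(v):
    return sum(v) / len(v)

def b2_hat(X2, X3, Y):
    x2 = [x - mean(X2) for x in X2]
    x3 = [x - mean(X3) for x in X3]
    y = [v - mean(Y) for v in Y]
    num = (sum(a * b for a, b in zip(x2, y)) * sum(a * a for a in x3)
           - sum(a * b for a, b in zip(x3, y)) * sum(a * b for a, b in zip(x2, x3)))
    den = (sum(a * a for a in x2) * sum(a * a for a in x3)
           - sum(a * b for a, b in zip(x2, x3)) ** 2)
    return num / den

random.seed(1)
n = 30
X2 = [random.uniform(0, 10) for _ in range(n)]   # treated as nonstochastic
X3 = [random.uniform(0, 10) for _ in range(n)]
beta1, beta2, beta3 = 5.0, 2.0, -1.0

estimates = []
for _ in range(2000):
    u = [random.gauss(0, 2) for _ in range(n)]
    Y = [beta1 + beta2 * a + beta3 * b + e for a, b, e in zip(X2, X3, u)]
    estimates.append(b2_hat(X2, X3, Y))

print(mean(estimates))   # close to the true value 2.0
```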


Finally we will show that b1 is an unbiased estimator of β1. This is quite simple, so you should attempt to do this yourself, before looking at the rest of this sequence.

b1 = Ȳ − b2X̄2 − b3X̄3 = (β1 + β2X̄2 + β3X̄3 + ū) − b2X̄2 − b3X̄3
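A quick pure-Python check (hypothetical data): with no disturbance term, b1 = Ȳ − b2X̄2 − b3X̄3 recovers the true intercept exactly.

```python
def mean(v):
    return sum(v) / len(v)

def slope(Xa, Xb, Y):
    # Coefficient on Xa in a regression of Y on Xa and Xb (deviation form)
    xa = [x - mean(Xa) for x in Xa]
    xb = [x - mean(Xb) for x in Xb]
    y = [v - mean(Y) for v in Y]
    den = (sum(a * a for a in xa) * sum(b * b for b in xb)
           - sum(a * b for a, b in zip(xa, xb)) ** 2)
    return (sum(a * c for a, c in zip(xa, y)) * sum(b * b for b in xb)
            - sum(b * c for b, c in zip(xb, y))
              * sum(a * b for a, b in zip(xa, xb))) / den

X2 = [1, 2, 3, 4, 5, 6]
X3 = [2, 1, 4, 3, 6, 5]
Y = [4 + 2 * a + 3 * b for a, b in zip(X2, X3)]   # true intercept is 4

b2, b3 = slope(X2, X3, Y), slope(X3, X2, Y)
b1 = mean(Y) - b2 * mean(X2) - b3 * mean(X3)
print(b1)   # 4.0 (up to rounding)
```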


First substitute for the sample mean of Y.


E(b1) = β1 + β2X̄2 + β3X̄3 + E(ū) − X̄2E(b2) − X̄3E(b3)
      = β1 + β2X̄2 + β3X̄3 + 0 − β2X̄2 − β3X̄3 = β1

Now take expectations. The first three terms are nonstochastic, so they are unaffected by taking expectations.


The expected value of the mean of the disturbance term is zero since E(u) is zero in each observation. We have just shown that E(b2) is equal to β2 and that E(b3) is equal to β3.


Hence b1 is an unbiased estimator of β1.


PRECISION OF THE MULTIPLE REGRESSION COEFFICIENTS

This sequence investigates the variances and standard errors of the slope coefficients in a model with two explanatory variables. The expression for the variance of b2 is shown below. The expression for the variance of b3 is the same, with the subscripts 2 and 3 interchanged.

σb2² = [σu² / Σ(X2i − X̄2)²] · [1 / (1 − rX2X3²)] = [σu² / n] · [1 / MSD(X2)] · [1 / (1 − rX2X3²)]


The first factor in the expression is identical to that for the variance of the slope coefficient in a simple regression model. The variance of b2 depends on the variance of the disturbance term, the number of observations, and the mean square deviation of X2 for exactly the same reasons as in a simple regression model.


The difference is that in multiple regression analysis the expression is multiplied by a factor which depends on the correlation between X2 and X3.

The higher the correlation between the explanatory variables, positive or negative, the greater the variance. This is easy to understand intuitively: the greater the correlation, the harder it is to discriminate between the effects of the explanatory variables on Y, and the less accurate the regression estimates will be.

Note that this variance expression is valid only for a model with two explanatory variables. When there are more than two, the expression becomes much more complex and it is sensible to switch to matrix algebra.
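The size of this effect is easy to tabulate. The multiplier 1/(1 − r²) applied to the variance is modest for moderate correlations but explodes as |r| approaches 1 (a quick illustrative computation):

```python
# Variance multiplier 1/(1 - r^2) for various correlations between X2 and X3
for r in [0.0, 0.5, 0.9, 0.99]:
    print(r, 1 / (1 - r ** 2))
```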


The standard deviation of the distribution of b2 is of course given by the square root of its variance. With the exception of the variance of u, we can calculate the components of the standard deviation from the sample data.


standard deviation of b2 = √( [σu² / Σ(X2i − X̄2)²] · [1 / (1 − rX2X3²)] )


The variance of u has to be estimated. The mean square of the residuals provides a consistent estimator, but in a finite sample it is biased downwards by a factor (n – k)/n, where k is the number of parameters.


E[(1/n) Σ ei²] = ((n − k)/n) · σu²


Obviously we can obtain an unbiased estimator by dividing the sum of the squares of the residuals by n – k instead of n. We denote this unbiased estimator su².


su² = [1/(n − k)] Σ ei²


s.e.(b2) = √( [su² / Σ(X2i − X̄2)²] · [1 / (1 − rX2X3²)] )

Thus the estimate of the standard deviation of the probability distribution of b2, known as the standard error of b2 for short, is given by the expression above.


We will use this expression to analyze why the standard error of S is larger for the union subsample than for the non-union subsample in earnings function regressions using Data Set 21.

. reg EARNINGS S EXP if COLLBARG==1

      Source |       SS       df       MS              Number of obs =     101
-------------+------------------------------           F(  2,    98) =    9.72
       Model |  3076.31726     2  1538.15863           Prob > F      =  0.0001
    Residual |  15501.9762    98   158.18343           R-squared     =  0.1656
-------------+------------------------------           Adj R-squared =  0.1486
       Total |  18578.2934   100  185.782934           Root MSE      =  12.577

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.333846   .5492604     4.25   0.000     1.243857    3.423836
         EXP |   .2235095   .3389455     0.66   0.511    -.4491169    .8961358
       _cons |  -15.12427   11.38141    -1.33   0.187    -37.71031    7.461779
------------------------------------------------------------------------------


To select a subsample in Stata, you add an 'if' statement to a command. The COLLBARG variable is equal to 1 for respondents whose rates of pay are determined by collective bargaining, and 0 for the others. Note that in tests for equality, Stata requires the = sign to be duplicated: COLLBARG==1.


In the case of the union subsample, the standard error of S is 0.5493.


. reg EARNINGS S EXP if COLLBARG==0

      Source |       SS       df       MS              Number of obs =     439
-------------+------------------------------           F(  2,   436) =   57.77
       Model |  19540.1761     2  9770.08805           Prob > F      =  0.0000
    Residual |   73741.593   436  169.132094           R-squared     =  0.2095
-------------+------------------------------           Adj R-squared =  0.2058
       Total |  93281.7691   438  212.972076           Root MSE      =  13.005

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.721698   .2604411    10.45   0.000     2.209822    3.233574
         EXP |   .6077342   .1400846     4.34   0.000     .3324091    .8830592
       _cons |  -28.00805   4.643211    -6.03   0.000    -37.13391   -18.88219
------------------------------------------------------------------------------

In the case of the non-union subsample, the standard error of S is 0.2604, less than half as large.


s.e.(b2) = su · √(1/n) · √(1/MSD(X2)) · √(1/(1 − rX2X3²))

We will explain the difference by looking at the components of the standard error.

Decomposition of the standard error of S

                 su        n     MSD(S)    rS,EXP      s.e.
Component
  Union        12.577     101    6.2325   –0.4087    0.5493
  Non-union    13.005     439    5.8666   –0.1784    0.2604

                 su    √(1/n)   √(1/MSD(S))  √(1/(1−r²))  product
Factor
  Union        12.577   0.0995     0.4006      1.0957     0.5493
  Non-union    13.005   0.0477     0.4129      1.0163     0.2603
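The factor rows of the table can be reproduced from the component rows. A short Python check for the union row (values taken from the table):

```python
import math

# Union-subsample components: s_u, n, MSD(S), and r between S and EXP
s_u, n, msd, r = 12.577, 101, 6.2325, -0.4087
factors = [s_u,
           math.sqrt(1 / n),
           math.sqrt(1 / msd),
           math.sqrt(1 / (1 - r ** 2))]
se = 1.0
for f in factors:
    se *= f
print([round(f, 4) for f in factors])   # [12.577, 0.0995, 0.4006, 1.0957]
print(round(se, 4))                     # 0.5493
```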


su² = RSS / (n − k)


We will start with su. For the union subsample, RSS is 15501.9762.


There are 101 observations in the union subsample. k is equal to 3. Thus n – k is equal to 98.


RSS / (n – k) is equal to 158.183. To obtain su, we take the square root.

This is 12.577.
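In Python, using the numbers in the regression output (RSS = 15501.9762, n − k = 98):

```python
# Reproducing the Residual MS and Root MSE from the union-subsample output
RSS, n, k = 15501.9762, 101, 3
s_u_squared = RSS / (n - k)
s_u = s_u_squared ** 0.5
print(round(s_u_squared, 5))   # 158.18343, the Residual MS in the output
print(round(s_u, 3))           # 12.577, the Root MSE
```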



We place this in the table, along with the number of observations.


Similarly, in the case of the non-union subsample, su is the square root of 169.132, which is 13.005. We also note that the number of observations in that subsample is 439.



We place these in the table.


We calculate the mean square deviation of S for the two subsamples from the sample data.


. cor S EXP if COLLBARG==1
(obs=101)
        |        S      EXP
--------+------------------
      S |   1.0000
    EXP |  -0.4087   1.0000

. cor S EXP if COLLBARG==0
(obs=439)
        |        S      EXP
--------+------------------
      S |   1.0000
    EXP |  -0.1784   1.0000

The correlation coefficients for S and EXP are –0.4087 and –0.1784 for the union and non-union subsamples, respectively. (Note that "cor" is the Stata command for computing correlations.)


These entries complete the top half of the table. We will now look at the impact of each item on the standard error, using the mathematical expression at the top.


The su component needs no modification. It is a little larger for the non-union subsample, which has an adverse effect on that standard error.


The number of observations is much larger for the non-union subsample, so the second factor is much smaller than that for the union subsample.


Perhaps surprisingly, the variance in schooling is a little larger for the union subsample.


The correlation between schooling and work experience is greater for the union subsample, and this has an adverse effect on its standard error. Note that the sign of the correlation makes no difference since it is squared.


We see that the reason the standard error is smaller for the non-union subsample is that there are far more observations than in the union subsample; otherwise the standard errors would have been about the same. The greater correlation between S and EXP has an adverse effect on the union standard error, but this is just about offset by the smaller su and the larger variance of S.


X2 X3 Y

10 19 51

11 21 56

12 23 61

13 25 66

14 27 71

15 29 76

MULTICOLLINEARITY

Y = 2 + 3X2 + X3        X3 = 2X2 − 1

Suppose that Y = 2 + 3X2 + X3 and that X3 = 2X2 – 1. There is no disturbance term in the equation for Y, but that is not important. Suppose that we have the six observations shown.


The three variables are plotted as line graphs above. Looking at the data, it is impossible to tell whether the changes in Y are caused by changes in X2, by changes in X3, or jointly by changes in both X2 and X3.

[Figure: line graphs of Y, X3, and X2 plotted against observation number 1–6, on a vertical scale from 0 to 80]


 X2   X3    Y    Change in X2   Change in X3   Change in Y
 10   19   51         1              2              5
 11   21   56         1              2              5
 12   23   61         1              2              5
 13   25   66         1              2              5
 14   27   71         1              2              5
 15   29   76         1              2              5


Numerically, Y increases by 5 in each observation when X2 changes by 1.


Hence the true relationship could have been Y = 1 + 5X2.

[Figure: the same line graphs of X2, X3, and Y against observation number (1–6), annotated "Y = 1 + 5X2 ?".]
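This ambiguity can be checked directly. Since X3 = 2X2 − 1, any coefficients b1, b2, b3 with b1 − b3 = 1 and b2 + 2b3 = 5 reproduce Y exactly. A small numpy sketch (the data follow the text; the code is illustrative):

```python
import numpy as np

X2 = np.array([10, 11, 12, 13, 14, 15], dtype=float)
X3 = 2 * X2 - 1
Y = 2 + 3 * X2 + X3   # the true relationship from the text

# b1 + b2*X2 + b3*(2*X2 - 1) = (b1 - b3) + (b2 + 2*b3)*X2, so every
# coefficient set satisfying b1 - b3 = 1 and b2 + 2*b3 = 5 fits perfectly.
candidates = [(2, 3, 1),   # the true coefficients
              (1, 5, 0),   # Y = 1 + 5*X2, as suggested in the text
              (3, 1, 2)]   # yet another observationally equivalent set
for b1, b2, b3 in candidates:
    assert np.allclose(b1 + b2 * X2 + b3 * X3, Y)
print("all candidates fit exactly")
```

The data alone cannot distinguish between these representations, which is exactly why the coefficients are not identified.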


Y = β1 + β2X2 + β3X3 + u    X3 = 2X2 − 1

What would happen if you tried to run a regression when there is an exact linear relationship among the explanatory variables? We will investigate, using the model with two explanatory variables shown above. [Note: A disturbance term has now been included in the true model, but it makes no difference to the analysis.]
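Before working through the algebra, a quick numerical sketch (using numpy with the six observations above; the setup follows the text, the code itself is illustrative) shows what goes wrong: the X'X matrix of the normal equations is singular.

```python
import numpy as np

# The six observations from the text: X3 = 2*X2 - 1 and Y = 2 + 3*X2 + X3
X2 = np.array([10, 11, 12, 13, 14, 15], dtype=float)
X3 = 2 * X2 - 1
Y = 2 + 3 * X2 + X3

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(X2), X2, X3])

# Because the columns are exactly linearly dependent, X'X has rank 2 rather
# than 3, so the OLS normal equations (X'X)b = X'Y have no unique solution.
XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))  # 2
```

Any regression package asked to invert this matrix must either report an error or drop one of the offending variables.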


Y = β1 + β2X2 + β3X3 + u    X3 = 2X2 − 1

$$b_2 = \frac{\sum(X_{2i}-\bar{X}_2)(Y_i-\bar{Y})\sum(X_{3i}-\bar{X}_3)^2 - \sum(X_{3i}-\bar{X}_3)(Y_i-\bar{Y})\sum(X_{2i}-\bar{X}_2)(X_{3i}-\bar{X}_3)}{\sum(X_{2i}-\bar{X}_2)^2\sum(X_{3i}-\bar{X}_3)^2 - \left(\sum(X_{2i}-\bar{X}_2)(X_{3i}-\bar{X}_3)\right)^2}$$

The expression for the multiple regression coefficient b2 is shown above. We will substitute for X3 using its relationship with X2.


Y = β1 + β2X2 + β3X3 + u    X3 = 2X2 − 1

$$\sum(X_{3i}-\bar{X}_3)^2 = \sum\left[(2X_{2i}-1)-(2\bar{X}_2-1)\right]^2 = \sum\left[2(X_{2i}-\bar{X}_2)\right]^2 = 4\sum(X_{2i}-\bar{X}_2)^2$$

First, we replace the Σ(X3i − X̄3)² terms in the numerator and denominator, using the relationship X3 = 2X2 − 1 as derived above.


Y = β1 + β2X2 + β3X3 + u    X3 = 2X2 − 1

$$\sum(X_{2i}-\bar{X}_2)(X_{3i}-\bar{X}_3) = \sum(X_{2i}-\bar{X}_2)\left[(2X_{2i}-1)-(2\bar{X}_2-1)\right] = 2\sum(X_{2i}-\bar{X}_2)^2$$

Next, the Σ(X2i − X̄2)(X3i − X̄3) terms are replaced in the same way.


Y = β1 + β2X2 + β3X3 + u    X3 = 2X2 − 1

$$\sum(X_{3i}-\bar{X}_3)(Y_i-\bar{Y}) = \sum\left[2(X_{2i}-\bar{X}_2)\right](Y_i-\bar{Y}) = 2\sum(X_{2i}-\bar{X}_2)(Y_i-\bar{Y})$$

Finally, the Σ(X3i − X̄3)(Yi − Ȳ) term is replaced, using the derivation above.


Y = β1 + β2X2 + β3X3 + u    X3 = 2X2 − 1

$$b_2 = \frac{\sum(X_{2i}-\bar{X}_2)(Y_i-\bar{Y})\cdot 4\sum(X_{2i}-\bar{X}_2)^2 - 2\sum(X_{2i}-\bar{X}_2)(Y_i-\bar{Y})\cdot 2\sum(X_{2i}-\bar{X}_2)^2}{\sum(X_{2i}-\bar{X}_2)^2\cdot 4\sum(X_{2i}-\bar{X}_2)^2 - \left(2\sum(X_{2i}-\bar{X}_2)^2\right)^2} = \frac{0}{0}$$

After all the replacements, it turns out that the numerator and the denominator are both equal to zero, so the regression coefficient is not defined. It is unusual for there to be an exact relationship among the explanatory variables in a regression. When this occurs, it is typically because there is a logical error in the specification.
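The 0/0 outcome can be confirmed numerically with the six observations used earlier (adding a disturbance term would not change it, since the degeneracy comes from the X's alone):

```python
import numpy as np

X2 = np.array([10, 11, 12, 13, 14, 15], dtype=float)
X3 = 2 * X2 - 1
Y = 2 + 3 * X2 + X3

# Work with deviations from the sample means
x2, x3, y = X2 - X2.mean(), X3 - X3.mean(), Y - Y.mean()

# Numerator and denominator of the expression for b2
num = (x2 @ y) * (x3 @ x3) - (x3 @ y) * (x2 @ x3)
den = (x2 @ x2) * (x3 @ x3) - (x2 @ x3) ** 2

print(num, den)  # both 0: b2 is the undefined ratio 0/0
```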


. reg EARNINGS S EXP EXPSQ

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  3,   536) =   45.57
       Model |  22762.4472     3  7587.48241           Prob > F      =  0.0000
    Residual |  89247.7839   536  166.507059           R-squared     =  0.2032
-------------+------------------------------           Adj R-squared =  0.1988
       Total |  112010.231   539  207.811189           Root MSE      =  12.904

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.754372   .2417286    11.39   0.000     2.279521    3.229224
         EXP |  -.2353907    .665197    -0.35   0.724    -1.542103    1.071322
       EXPSQ |   .0267843   .0219115     1.22   0.222    -.0162586    .0698272
       _cons |  -22.21964   5.514827    -4.03   0.000    -33.05297   -11.38632
------------------------------------------------------------------------------

EARNINGS = β1 + β2S + β3EXP + β4EXPSQ + u

However, it often happens that there is an approximate relationship. For example, when relating earnings to schooling and work experience, it is often reasonable to suppose that the effect of work experience is subject to diminishing returns. A standard way of allowing for this is to include EXPSQ, the square of EXP, in the specification. According to the hypothesis of diminishing returns, β4 should be negative.
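Mechanically, the quadratic term is just an extra column in the regression. A minimal sketch on invented data (the variable names follow the text, but the numbers and coefficients are assumptions for illustration only):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 540
S = rng.integers(8, 21, size=n).astype(float)   # invented schooling years
EXP = rng.uniform(0, 30, size=n)                # invented experience years
EXPSQ = EXP ** 2                                # the quadratic term
u = rng.normal(scale=2.0, size=n)

# Diminishing returns built in: positive EXP effect, negative EXPSQ effect
EARNINGS = -20 + 2.5 * S + 0.8 * EXP - 0.02 * EXPSQ + u

X = np.column_stack([np.ones(n), S, EXP, EXPSQ])
b, *_ = np.linalg.lstsq(X, EARNINGS, rcond=None)
print(b[3] < 0)  # True: the fitted EXPSQ coefficient is negative here
```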



We fit this specification using Data Set 21. The schooling component of the regression results is not much affected by the inclusion of the EXPSQ term. The coefficient of S indicates that an extra year of schooling increases hourly earnings by $2.75.



. reg EARNINGS S EXP

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =   67.54
       Model |  22513.6473     2  11256.8237           Prob > F      =  0.0000
    Residual |  89496.5838   537  166.660305           R-squared     =  0.2010
-------------+------------------------------           Adj R-squared =  0.1980
       Total |  112010.231   539  207.811189           Root MSE      =   12.91

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.678125   .2336497    11.46   0.000     2.219146    3.137105
         EXP |   .5624326   .1285136     4.38   0.000     .3099816    .8148837
       _cons |  -26.48501    4.27251    -6.20   0.000    -34.87789   -18.09213
------------------------------------------------------------------------------

In the specification without EXPSQ, shown above, the coefficient was 2.68, not much different.




The standard error, 0.23 in the specification without EXPSQ, is also little changed and the coefficient remains highly significant.



By contrast, the inclusion of the new term has had a dramatic effect on the coefficient of EXP. Now it is negative, which makes little sense, and insignificant!



Previously it had been positive and highly significant.






The reason for these problems is that EXPSQ is highly correlated with EXP. This makes it difficult to discriminate between the individual effects of EXP and EXPSQ, and the regression estimates tend to be erratic.

. cor EXP EXPSQ
(obs=540)

             |      EXP    EXPSQ
-------------+------------------
         EXP |   1.0000
       EXPSQ |   0.9812   1.0000
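A correlation of this size is what one should expect for any variable and its square over a positive range. With illustrative values of experience from 0 to 30 years (an assumption for the sketch, not the actual data set):

```python
import numpy as np

EXP = np.arange(0, 31, dtype=float)   # illustrative experience values
EXPSQ = EXP ** 2

r = np.corrcoef(EXP, EXPSQ)[0, 1]
print(r > 0.95)  # True: EXP and EXPSQ move almost in lockstep
```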


When high correlations among the explanatory variables lead to erratic point estimates of the coefficients, large standard errors, and unsatisfactorily low t statistics, the regression is said to be suffering from multicollinearity.

Multicollinearity may also be caused by an approximate linear relationship among the explanatory variables. When there are only two, an approximate linear relationship implies a high correlation between them, but this is not always the case when there are more than two.
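A constructed example illustrates the last point (the setup, X4 = X2 + X3 with X2 and X3 independent, is an assumption chosen for illustration): every pairwise correlation is moderate, yet the three regressors are exactly linearly related.

```python
import numpy as np

rng = np.random.default_rng(0)
X2 = rng.normal(size=1000)
X3 = rng.normal(size=1000)
X4 = X2 + X3   # exact linear relationship among the three regressors

corr = np.corrcoef([X2, X3, X4])
print(np.round(corr, 2))   # pairwise correlations only around 0.7 at most

# ... yet the columns are linearly dependent: rank 2, not 3
print(np.linalg.matrix_rank(np.column_stack([X2, X3, X4])))  # 2
```

Inspecting the pairwise correlation matrix alone would therefore miss the problem entirely.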


ALLEVIATION OF MULTICOLLINEARITY

What can you do about multicollinearity if you encounter it? We will discuss some possible measures, looking at the model with two explanatory variables. Before doing this, two important points should be emphasized.

• First, multicollinearity does not cause the regression coefficients to be biased. Their probability distributions are still centered over the true values, provided that the regression specification is correct, but they have unsatisfactorily large variances.

• Second, the standard errors and t tests remain valid. The standard errors are larger than they would have been in the absence of multicollinearity, warning us that the regression estimates are erratic.

Since the problem of multicollinearity is caused by the variances of the coefficients being unsatisfactorily large, we will seek ways of reducing them.
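Both points can be verified in a small Monte Carlo experiment (an illustrative sketch, with assumed parameter values): with two highly correlated regressors, the estimates of β2 are centered on the true value but widely dispersed.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 50, 2000
true_b2, true_b3 = 3.0, 1.0

b2_estimates = []
for _ in range(reps):
    X2 = rng.normal(size=n)
    X3 = X2 + 0.1 * rng.normal(size=n)   # highly, but not perfectly, correlated
    Y = 2.0 + true_b2 * X2 + true_b3 * X3 + rng.normal(size=n)
    X = np.column_stack([np.ones(n), X2, X3])
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    b2_estimates.append(b[1])

b2_estimates = np.array(b2_estimates)
print(abs(b2_estimates.mean() - true_b2) < 0.2)   # True: no bias
print(b2_estimates.std() > 0.8)                   # True: but a large variance
```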


Possible measures for alleviating multicollinearity

(1) Reduce σu² by including further relevant variables in the model.

We will focus on the slope coefficient and look at the various components of its variance. We might be able to reduce it by bringing more variables into the model and reducing σu², the variance of the disturbance term.

$$\sigma_{b_2}^2 = \frac{\sigma_u^2}{\sum(X_{2i}-\bar{X}_2)^2}\cdot\frac{1}{1-r_{X_2,X_3}^2} = \frac{\sigma_u^2}{n\,\mathrm{MSD}(X_2)}\cdot\frac{1}{1-r_{X_2,X_3}^2}$$
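This variance decomposition can be verified against the exact OLS variance σu²(X'X)⁻¹ on simulated data (a sketch; the sample size and coefficients are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma_u = 200, 2.0
X2 = rng.normal(size=n)
X3 = 0.8 * X2 + rng.normal(size=n)   # correlated regressors

X = np.column_stack([np.ones(n), X2, X3])
# Exact OLS variance of b2: sigma_u^2 times the (b2, b2) element of (X'X)^-1
var_exact = sigma_u**2 * np.linalg.inv(X.T @ X)[1, 1]

# The decomposition: sigma_u^2 over the sum of squared deviations of X2,
# inflated by the factor 1 / (1 - r^2)
r = np.corrcoef(X2, X3)[0, 1]
x2 = X2 - X2.mean()
var_decomposed = sigma_u**2 / (x2 @ x2) / (1 - r**2)

print(np.isclose(var_exact, var_decomposed))  # True
```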


The estimator of the variance of the disturbance term is the residual sum of squares divided by n – k, where n is the number of observations (540) and k is the number of parameters (4). Here it is 166.5.
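The arithmetic can be checked directly from the regression output shown earlier:

```python
# Values read from the Stata output for the EXPSQ specification
rss = 89247.7839      # residual sum of squares
n, k = 540, 4         # observations and parameters (including the intercept)

s_u_squared = rss / (n - k)
print(round(s_u_squared, 1))        # 166.5, the Residual MS in the output
print(round(s_u_squared**0.5, 3))   # 12.904, the Root MSE in the output
```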



. reg EARNINGS S EXP EXPSQ MALE ASVABC

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  5,   534) =   37.24
       Model |  28957.3532     5  5791.47063           Prob > F      =  0.0000
    Residual |  83052.8779   534  155.529734           R-squared     =  0.2585
-------------+------------------------------           Adj R-squared =  0.2516
       Total |  112010.231   539  207.811189           Root MSE      =  12.471

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.031419    .296218     6.86   0.000     1.449524    2.613315
         EXP |  -.0816828   .6441767    -0.13   0.899    -1.347114    1.183748
       EXPSQ |   .0130223    .021334     0.61   0.542    -.0288866    .0549311
        MALE |   5.762358   1.104734     5.22   0.000     3.592201    7.932515
      ASVABC |   .2447687   .0714294     3.43   0.001     .1044516    .3850858
       _cons |  -26.18541   5.452032    -4.80   0.000    -36.89547   -15.47535
------------------------------------------------------------------------------

We now add two new variables that are often found to be determinants of earnings: MALE, the sex of the respondent, and ASVABC, the composite score on the cognitive tests in the Armed Services Vocational Aptitude Battery. MALE is a qualitative variable, and the treatment of such variables will be explained in Chapter 5.



Both MALE and ASVABC have coefficients significant at the 0.1% level.



However, they account for only a small proportion of the variance in earnings, and the reduction in the estimate of the variance of the disturbance term (from 166.5 to 155.5) is likewise small.



As a consequence the impact on the standard errors of EXP and EXPSQ is negligible.



Note how unstable the coefficients are. This is often a sign of multicollinearity.



(2) Increase the number of observations.
    Surveys: increase the budget, or use clustering.
    Time series: use quarterly instead of annual data.

2,2

2

2,

222

22

3232

2 11

)(MSD11

XX

u

XXi

ub rXnrXX

The next factor to look at is n, the number of observations. If you are working with cross-section data (individuals, households, enterprises, etc.) and you are undertaking a survey, you could increase the size of the sample by negotiating a bigger budget. Alternatively, you could make the budget go further by using clustering: the survey area is divided into localities, and you select a number of these randomly, perhaps using stratified random sampling to make sure that metropolitan, other urban, and rural areas are properly represented. You then confine the survey to the localities selected. This reduces the travel time and cost of the fieldworkers, allowing them to interview a greater number of respondents.



If you are working with time series data, you may be able to increase the sample by working with shorter time intervals for the data, for example quarterly or even monthly data instead of annual data.



. reg EARNINGS S EXP EXPSQ MALE ASVABC

      Source |       SS       df       MS              Number of obs =    2714
-------------+------------------------------           F(  5,  2708) =  183.99
       Model |  161795.573     5  32359.1147           Prob > F      =  0.0000
    Residual |  476277.268  2708  175.877869           R-squared     =  0.2536
-------------+------------------------------           Adj R-squared =  0.2522
       Total |  638072.841  2713  235.190874           Root MSE      =  13.262

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.312461    .135428    17.08   0.000     2.046909    2.578014
         EXP |  -.3270651    .308231    -1.06   0.289    -.9314569    .2773268
       EXPSQ |    .023743   .0101558     2.34   0.019     .0038291    .0436569
        MALE |   5.947206   .5221755    11.39   0.000     4.923303    6.971108
      ASVABC |   .2086846   .0336869     6.19   0.000     .1426301    .2747392
       _cons |  -27.40462   2.579435   -10.62   0.000    -32.46248   -22.34676
------------------------------------------------------------------------------

Here is the result of running the regression with all 2,714 observations in the EAEF data set.


Comparing this result with that using Data Set 21, we see that the standard errors are much smaller, as expected.


As a consequence, the t statistics of the variables are higher. However, the correlation between EXP and EXPSQ is as high as in the smaller sample, and the increase in the sample size has not been large enough to have much impact on the problem of multicollinearity.



(3) Increase MSD(X2).

A third possible way of reducing the problem of multicollinearity might be to increase the variation in the explanatory variables. This is possible only at the design stage of a survey. For example, if you were planning a household survey with the aim of investigating how expenditure patterns vary with income, you should make sure that the sample included relatively rich and relatively poor households as well as middle-income households.
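The payoff shows up directly in the variance formula: multiplying MSD(X2) by four, other things equal, cuts the variance of b2 to a quarter. A sketch with assumed numbers:

```python
# Variance of b2 from the decomposition:
# sigma_b2^2 = sigma_u^2 / (n * MSD(X2)) * 1 / (1 - r^2)
def var_b2(sigma_u2, n, msd_x2, r):
    return sigma_u2 / (n * msd_x2) / (1 - r**2)

# Illustrative values (assumed, not taken from the text)
narrow = var_b2(sigma_u2=160.0, n=540, msd_x2=6.0, r=0.6)
wide = var_b2(sigma_u2=160.0, n=540, msd_x2=24.0, r=0.6)   # 4x the MSD

print(round(narrow / wide, 6))  # 4.0: quadrupling MSD quarters the variance
```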




(4) Reduce r_{X2,X3}, the correlation between the explanatory variables.

Another possibility might be to reduce the correlation between the explanatory variables. This is possible only at the design stage of a survey and even then it is not easy.


(5) Combine the correlated variables.

If the correlated variables are similar conceptually, it may be reasonable to combine them into some overall index.


That is precisely what has been done with the three cognitive ASVAB variables. ASVABC has been calculated as a weighted average of ASVAB02 (arithmetic reasoning), ASVAB03 (word knowledge), and ASVAB04 (paragraph comprehension). The three components are highly correlated, and by combining them as a weighted average, rather than using them individually, one avoids a potential problem of multicollinearity.
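A minimal sketch of the idea with simulated scores. The component names mirror the ASVAB variables, but the data and the weights are illustrative, not the weights actually used to construct ASVABC:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Simulated stand-ins for the three cognitive ASVAB components: all
# are noisy measures of the same underlying ability, so they are
# highly intercorrelated.
ability = rng.normal(50.0, 10.0, size=n)
arithmetic = ability + rng.normal(0.0, 3.0, size=n)
word       = ability + rng.normal(0.0, 3.0, size=n)
paragraph  = ability + rng.normal(0.0, 3.0, size=n)

# Combine them into a single index with hypothetical weights.
index = 0.5 * arithmetic + 0.25 * word + 0.25 * paragraph

# Any pair of components is highly correlated; using the index in a
# regression sidesteps that multicollinearity.
r = np.corrcoef(arithmetic, word)[0, 1]
print(round(r, 2))
```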

. reg EARNINGS S EXP EXPSQ MALE ASVABC

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  5,   534) =   37.24
       Model |  28957.3532     5  5791.47063           Prob > F      =  0.0000
    Residual |  83052.8779   534  155.529734           R-squared     =  0.2585
-------------+------------------------------           Adj R-squared =  0.2516
       Total |  112010.231   539  207.811189           Root MSE      =  12.471

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.031419    .296218     6.86   0.000     1.449524    2.613315
         EXP |  -.0816828   .6441767    -0.13   0.899    -1.347114    1.183748
       EXPSQ |   .0130223    .021334     0.61   0.542    -.0288866    .0549311
        MALE |   5.762358   1.104734     5.22   0.000     3.592201    7.932515
      ASVABC |   .2447687   .0714294     3.43   0.001     .1044516    .3850858
       _cons |  -26.18541   5.452032    -4.80   0.000    -36.89547   -15.47535
------------------------------------------------------------------------------


(6) Drop some of the correlated variables.

Dropping some of the correlated variables, if they have insignificant coefficients, may alleviate multicollinearity. However, this approach is dangerous. It is possible that some of the variables with insignificant coefficients really do belong in the model, and that the only reason their coefficients are insignificant is that there is a problem of multicollinearity. If that is the case, their omission may cause omitted variable bias, to be discussed in Chapter 6.
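The danger can be illustrated with a small simulation (our own sketch, not from the text): X3 genuinely belongs in the model, and dropping it biases the coefficient of X2 towards β2 + β3 × cov(X3, X2)/var(X2):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Two correlated regressors, both of which truly belong in the model.
x2 = rng.normal(size=n)
x3 = 0.8 * x2 + 0.6 * rng.normal(size=n)   # corr(X2, X3) = 0.8
y = 1.0 + 2.0 * x2 + 3.0 * x3 + rng.normal(size=n)

# Full model: the X2 coefficient is close to its true value, 2.
X_full = np.column_stack([np.ones(n), x2, x3])
b_full = np.linalg.lstsq(X_full, y, rcond=None)[0]

# Dropping X3: the X2 coefficient is biased towards
# beta2 + beta3 * cov(X3, X2)/var(X2) = 2 + 3 * 0.8 = 4.4.
X_short = np.column_stack([np.ones(n), x2])
b_short = np.linalg.lstsq(X_short, y, rcond=None)[0]
print(round(b_full[1], 2), round(b_short[1], 2))
```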


(7) Empirical restriction

Y = β1 + β2X + β3P + u

A further way of dealing with the problem of multicollinearity is to use extraneous information, if available, concerning the coefficient of one of the variables. For example, suppose that Y in the equation above is the demand for a category of consumer expenditure, X is aggregate disposable personal income, and P is a price index for the category. To fit a model of this type you would use time series data. If X and P are highly correlated, which is often the case with time series variables, the problem of multicollinearity might be eliminated in the following way.


Y = β1 + β2X + β3P + u
Y' = β1' + β2'X' + u'
Ŷ' = b1' + b2'X'

Obtain data on income and expenditure on the category from a household survey and regress Y' on X'. (The ' marks are to indicate that the data are household data, not aggregate data.) This is a simple regression because there will be relatively little variation in the price paid by the households.


Y − b2'X = β1 + β3P + u
Z = Y − b2'X, so Z = β1 + β3P + u

Now substitute b2' for β2 in the time series model. Subtract b2'X from both sides, and regress Z = Y − b2'X on price. This is a simple regression, so multicollinearity has been eliminated.
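The two-step procedure can be sketched with simulated data; all variable names and parameter values below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Step 1: cross-section of households, where price is effectively
# constant, to estimate the income coefficient b2'.
n_cs = 2000
x_cs = rng.normal(100.0, 20.0, size=n_cs)          # household income X'
y_cs = 10.0 + 0.6 * x_cs + rng.normal(0.0, 5.0, size=n_cs)
b2_prime = np.polyfit(x_cs, y_cs, 1)[0]            # estimate of beta2

# Time series: income X and price P trend together and so are
# highly correlated.
n_ts = 50
t = np.arange(n_ts)
x_ts = 100.0 + 2.0 * t + rng.normal(0.0, 2.0, size=n_ts)
p_ts = 50.0 + 1.0 * t + rng.normal(0.0, 2.0, size=n_ts)
y_ts = 10.0 + 0.6 * x_ts - 0.3 * p_ts + rng.normal(0.0, 2.0, size=n_ts)

# Step 2: subtract the estimated income component and regress
# Z = Y - b2'X on price alone, a simple regression, so the
# collinearity between X and P no longer matters.
z = y_ts - b2_prime * x_ts
b3 = np.polyfit(p_ts, z, 1)[0]                     # estimate of beta3
print(round(b3, 1))
```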


There are some problems with this technique. First, the β2 coefficients may be conceptually different in time series and cross-section contexts. Second, since we subtract the estimated income component b2'X, not the true income component β2X, from Y when constructing Z, we have introduced an element of measurement error into the dependent variable.


(8) Theoretical restriction

S = β1 + β2ASVABC + β3SM + β4SF + u

Last, but by no means least, is the use of a theoretical restriction, which is defined as a hypothetical relationship among the parameters of a regression model. It will be explained using an educational attainment model as an example. Suppose that we hypothesize that highest grade completed, S, depends on ASVABC and on the highest grade completed by the respondent's mother and father, SM and SF, respectively.


. reg S ASVABC SM SF

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  3,   536) =  104.30
       Model |  1181.36981     3  393.789935           Prob > F      =  0.0000
    Residual |  2023.61353   536  3.77539837           R-squared     =  0.3686
-------------+------------------------------           Adj R-squared =  0.3651
       Total |  3204.98333   539  5.94616574           Root MSE      =   1.943

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |   .1257087   .0098533    12.76   0.000     .1063528    .1450646
          SM |   .0492424   .0390901     1.26   0.208     -.027546    .1260309
          SF |   .1076825   .0309522     3.48   0.001       .04688    .1684851
       _cons |   5.370631   .4882155    11.00   0.000      4.41158    6.329681
------------------------------------------------------------------------------

A one-point increase in ASVABC increases S by 0.13 years.


S increases by 0.05 years for every extra year of schooling of the mother and by 0.11 years for every extra year of schooling of the father. Mother's education is generally held to be at least as important as father's education for educational attainment, if not more so, so this outcome is unexpected.


It is also surprising that the coefficient of SM is not significant, even at the 5% level, using a one-sided test.


However, assortative mating leads to correlation between SM and SF, and the regression appears to be suffering from multicollinearity.

. cor SM SF
(obs=540)

        |      SM       SF
--------+------------------
     SM |  1.0000
     SF |  0.6241   1.0000


S = β1 + β2ASVABC + β3SM + β4SF + u

β3 = β4

Suppose that we hypothesize that mother's and father's education are equally important. We can then impose the restriction β3 = β4.


S = β1 + β2ASVABC + β3SM + β4SF + u, with β3 = β4

S = β1 + β2ASVABC + β3(SM + SF) + u

This allows us to rewrite the equation as shown.


S = β1 + β2ASVABC + β3(SM + SF) + u = β1 + β2ASVABC + β3SP + u

Defining SP to be the sum of SM and SF, the equation may be rewritten as shown. The problem caused by the correlation between SM and SF has been eliminated.
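A sketch of the restricted regression with simulated data (hypothetical values, not Data Set 21). The true coefficients of SM and SF are set equal by construction, so regressing on SP = SM + SF estimates the common coefficient:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 540

# Simulated data: assortative mating makes SM and SF highly
# correlated, and the true model has beta3 = beta4 = 0.08.
asvabc = rng.normal(50.0, 10.0, size=n)
sm = rng.normal(12.0, 2.0, size=n)
sf = 0.8 * sm + rng.normal(2.4, 1.2, size=n)
s = 5.0 + 0.125 * asvabc + 0.08 * sm + 0.08 * sf \
    + rng.normal(0.0, 2.0, size=n)

# Impose the restriction beta3 = beta4 by regressing on SP = SM + SF;
# the collinear pair never enters the regression separately.
sp = sm + sf
X = np.column_stack([np.ones(n), asvabc, sp])
b = np.linalg.lstsq(X, s, rcond=None)[0]
print(round(b[2], 2))   # estimate of the common coefficient beta3
```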


. g SP=SM+SF

. reg S ASVABC SP

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =  156.04
       Model |  1177.98338     2  588.991689           Prob > F      =  0.0000
    Residual |  2026.99996   537  3.77467403           R-squared     =  0.3675
-------------+------------------------------           Adj R-squared =  0.3652
       Total |  3204.98333   539  5.94616574           Root MSE      =  1.9429

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |   .1253106   .0098434    12.73   0.000     .1059743    .1446469
          SP |   .0828368   .0164247     5.04   0.000     .0505722    .1151014
       _cons |    5.29617   .4817972    10.99   0.000     4.349731    6.242608
------------------------------------------------------------------------------

The estimate of β3 is now 0.083 and highly significant.


Not surprisingly, this is a compromise between the coefficients of SM and SF in the previous specification.


The standard error of the coefficient of SP is much smaller than those of SM and SF. The use of the restriction has led to a large gain in efficiency, and the problem of multicollinearity has been eliminated.


The t statistic is very high. Thus it would appear that imposing the restriction has improved the regression results. However, the restriction may not be valid. We should test it. Testing theoretical restrictions is one of the topics in Chapter 6.


F TESTS OF GOODNESS OF FIT

This sequence describes two F tests of goodness of fit in a multiple regression model. The first relates to the goodness of fit of the equation as a whole. We will consider the general case where there are k – 1 explanatory variables. For the F test of goodness of fit of the equation as a whole, the null hypothesis, in words, is that the model has no explanatory power at all. The model will have no explanatory power if it turns out that Y is unrelated to any of the explanatory variables. Mathematically, therefore, the null hypothesis is that all the coefficients β2, ..., βk are zero.

The alternative hypothesis is that at least one of these coefficients is different from zero. In the multiple regression model there is a difference between the roles of the F and t tests. The F test tests the joint explanatory power of the variables, while the t tests test their explanatory power individually. In the simple regression model the F test was equivalent to the (two-sided) t test on the slope coefficient because the 'group' consisted of just one variable.

Y = β1 + β2X2 + ... + βkXk + u

H0: β2 = ... = βk = 0
H1: at least one of β2, ..., βk ≠ 0


F(k − 1, n − k) = [ESS / (k − 1)] / [RSS / (n − k)] = [R² / (k − 1)] / [(1 − R²) / (n − k)]

ESS / TSS is the definition of R2. RSS / TSS is equal to (1 – R2). (See the last sequence in Chapter 2.)
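Using the figures from the regression of S on ASVABC, SM and SF (ESS = 1181.37, RSS = 2023.61, n = 540, k = 4), a quick check that the two forms of the statistic agree:

```python
# ESS and RSS from the Stata output for the regression of S on
# ASVABC, SM and SF (n = 540, k = 4).
ess = 1181.36981
rss = 2023.61353
n, k = 540, 4

# F computed from the sums of squares ...
f_from_ss = (ess / (k - 1)) / (rss / (n - k))

# ... and from R-squared: the TSS in numerator and denominator cancels.
r2 = ess / (ess + rss)                       # ESS / TSS
f_from_r2 = (r2 / (k - 1)) / ((1 - r2) / (n - k))

print(round(f_from_ss, 1), round(f_from_r2, 1))  # 104.3 104.3
```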


H0: β2 = β3 = β4 = 0

The educational attainment model will be used as an example. We will suppose that S depends on ASVABC, the ability score, and on SM and SF, the highest grade completed by the mother and father of the respondent, respectively. The null hypothesis for the F test of goodness of fit is that all three slope coefficients are equal to zero. The alternative hypothesis is that at least one of them is non-zero.

S = β1 + β2ASVABC + β3SM + β4SF + u


. reg S ASVABC SM SF

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  3,   536) =  104.30
       Model |  1181.36981     3  393.789935           Prob > F      =  0.0000
    Residual |  2023.61353   536  3.77539837           R-squared     =  0.3686
-------------+------------------------------           Adj R-squared =  0.3651
       Total |  3204.98333   539  5.94616574           Root MSE      =   1.943

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |   .1257087   .0098533    12.76   0.000     .1063528    .1450646
          SM |   .0492424   .0390901     1.26   0.208     -.027546    .1260309
          SF |   .1076825   .0309522     3.48   0.001       .04688    .1684851
       _cons |   5.370631   .4882155    11.00   0.000      4.41158    6.329681
------------------------------------------------------------------------------

Here is the regression output using Data Set 21.

S = β1 + β2ASVABC + β3SM + β4SF + u

H0: β2 = β3 = β4 = 0


F(k − 1, n − k) = [ESS / (k − 1)] / [RSS / (n − k)]

F(3, 536) = (1181 / 3) / (2024 / 536) = 104.3

The numerator of the F statistic is the explained sum of squares divided by k – 1. In the Stata output these numbers are given in the Model row.


The denominator is the residual sum of squares divided by the number of degrees of freedom remaining.


Hence the F statistic is 104.3. All serious regression packages compute it for you as part of the diagnostics in the regression output.


The critical value for F(3,536) is not given in the F tables, but we know it must be lower than F(3,500), which is given. At the 0.1% level, this is 5.51. Hence we easily reject H0 at the 0.1% level.

Fcrit, 0.1%(3, 500) = 5.51
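With a statistics library there is no need to interpolate in the printed tables. A sketch assuming SciPy is available:

```python
from scipy.stats import f

# 0.1% upper-tail critical value of F(3, 500), matching the table value.
crit_500 = f.ppf(0.999, dfn=3, dfd=500)

# The critical value for the actual degrees of freedom, F(3, 536),
# is slightly lower, so rejection at the 0.1% level is confirmed.
crit_536 = f.ppf(0.999, dfn=3, dfd=536)
print(round(crit_500, 2), round(crit_536, 2))
```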


It is unusual for the F statistic not to be significant if some of the t statistics are significant. In principle it could happen, though.

• Suppose that you ran a regression with 40 explanatory variables, none being a true determinant of the dependent variable. Then the F statistic should be low enough for H0 not to be rejected. However, if you are performing t tests on the slope coefficients at the 5% level, with a 5% chance of a Type I error, on average 2 of the 40 variables could be expected to have 'significant' coefficients.

• The opposite can easily happen, though. Suppose you have a multiple regression model which is correctly specified and the R2 is high. You would expect to have a highly significant F statistic. However, if the explanatory variables are highly correlated and the model is subject to severe multicollinearity, the standard errors of the slope coefficients could all be so large that none of the t statistics is significant. In this situation you would know that your model is a good one, but you are not in a position to pinpoint the contributions made by the explanatory variables individually.
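The first scenario can be checked by simulation. The sketch below is our own illustration; it hard-codes 1.975 as the approximate two-sided 5% critical value of t with 159 degrees of freedom:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k_vars, trials = 200, 40, 300
t_crit = 1.975   # approx. two-sided 5% critical value of t, 159 d.f.

counts = []
for _ in range(trials):
    # 40 pure-noise regressors plus an intercept; y is unrelated to all.
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k_vars))])
    y = rng.normal(size=n)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    s2 = resid @ resid / (n - k_vars - 1)
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    counts.append(int(np.sum(np.abs(b[1:] / se[1:]) > t_crit)))

# On average about 0.05 * 40 = 2 of the 40 slope coefficients appear
# 'significant' purely by chance.
print(round(float(np.mean(counts)), 1))
```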


Y = β1 + β2X2 + u    (residual sum of squares RSS1)
Y = β1 + β2X2 + β3X3 + β4X4 + u    (residual sum of squares RSS2)

We now come to the other F test of goodness of fit. This is a test of the joint explanatory power of a group of variables when they are added to a regression model. For example, in the original specification, Y may be written as a simple function of X2. In the second, we add X3 and X4.


H0: β3 = β4 = 0
H1: β3 ≠ 0 or β4 ≠ 0, or both

The null hypothesis for the F test is that neither X3 nor X4 belongs in the model. The alternative hypothesis is that at least one of them does, perhaps both.


F(cost, d.f. remaining) = (improvement / cost) / (remaining unexplained / degrees of freedom remaining)

For this F test, and for several others which we will encounter, it is useful to think of the F statistic as having the structure indicated above. The 'improvement' is the reduction in the residual sum of squares when the change is made, in this case, when the group of new variables is added. The 'cost' is the reduction in the number of degrees of freedom remaining after making the change. In the present case it is equal to the number of new variables added, because that number of new parameters are estimated. (Remember that the number of degrees of freedom in a regression equation is the number of observations, less the number of parameters estimated. In this example, it would fall from n – 2 to n – 4 when X3 and X4 are added.)

The 'remaining unexplained' is the residual sum of squares after making the change. The 'degrees of freedom remaining' is the number of degrees of freedom remaining after making the change.


. reg S ASVABC

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =  274.19
       Model |  1081.97059     1  1081.97059           Prob > F      =  0.0000
    Residual |  2123.01275   538  3.94612035           R-squared     =  0.3376
-------------+------------------------------           Adj R-squared =  0.3364
       Total |  3204.98333   539  5.94616574           Root MSE      =  1.9865

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |    .148084   .0089431    16.56   0.000     .1305165    .1656516
       _cons |   6.066225   .4672261    12.98   0.000     5.148413    6.984036
------------------------------------------------------------------------------

We will illustrate the test with an educational attainment example. Here is S regressed on ASVABC using Data Set 21. We make a note of the residual sum of squares.


. reg S ASVABC SM SF

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  3,   536) =  104.30
       Model |  1181.36981     3  393.789935           Prob > F      =  0.0000
    Residual |  2023.61353   536  3.77539837           R-squared     =  0.3686
-------------+------------------------------           Adj R-squared =  0.3651
       Total |  3204.98333   539  5.94616574           Root MSE      =   1.943

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |   .1257087   .0098533    12.76   0.000     .1063528    .1450646
          SM |   .0492424   .0390901     1.26   0.208     -.027546    .1260309
          SF |   .1076825   .0309522     3.48   0.001       .04688    .1684851
       _cons |   5.370631   .4882155    11.00   0.000      4.41158    6.329681
------------------------------------------------------------------------------

Now we have added the highest grade completed by each parent. Does parental education have a significant impact? Well, we can see that a t test would show that SF has a highly significant coefficient, but we will perform the F test anyway. We make a note of RSS.


H₀: β₃ = 0 and β₄ = 0
H₁: β₃ ≠ 0 or β₄ ≠ 0 or both

Y = β₁ + β₂X₂ + u                              RSS₁
Y = β₁ + β₂X₂ + β₃X₃ + β₄X₄ + u                RSS₂

The F statistic is 13.16. The critical value of F(2,500) at the 0.1% level is 7.00. The critical value of F(2,536) must be lower, so we reject H₀ and conclude that the parental education variables do have significant joint explanatory power.
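The arithmetic can be checked with the unrounded residual sums of squares from the two regression outputs; a minimal sketch (the variable names are mine):

```python
rss1 = 2123.01275        # residual SS: S on ASVABC only
rss2 = 2023.61353        # residual SS: after adding SM and SF
n, cost = 540, 2
df_remaining = n - 4     # four parameters in the larger model

F = ((rss1 - rss2) / cost) / (rss2 / df_remaining)
print(f"{F:.2f}")        # 13.16, far above the 0.1% critical value of 7.00
```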

F(2, 540 − 4) = ((RSS₁ − RSS₂)/2) / (RSS₂/(540 − 4))
              = ((2123.0 − 2023.6)/2) / (2023.6/536) = 13.16

F crit, 0.1%(2, 500) = 7.00


Y = β₁ + β₂X₂ + β₃X₃ + u                       RSS₁
Y = β₁ + β₂X₂ + β₃X₃ + β₄X₄ + u                RSS₂

This sequence will conclude by showing that t tests are equivalent to marginal F tests when the additional group of variables consists of just one variable. Suppose that in the original model Y is a function of X₂ and X₃, and that in the revised model X₄ is added.


H₀: β₄ = 0
H₁: β₄ ≠ 0

Y = β₁ + β₂X₂ + β₃X₃ + u                       RSS₁
Y = β₁ + β₂X₂ + β₃X₃ + β₄X₄ + u                RSS₂

The null hypothesis for the F test of the explanatory power of the additional ‘group’ is that all the new slope coefficients are equal to zero. There is of course only one new slope coefficient, β₄.


The F test has the usual structure. We will illustrate it with an educational attainment model where S depends on ASVABC and SM in the original model and on SF as well in the revised model.


. reg S ASVABC SM

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =  147.36
       Model |  1135.67473     2  567.837363           Prob > F      =  0.0000
    Residual |  2069.30861   537  3.85346109           R-squared     =  0.3543
-------------+------------------------------           Adj R-squared =  0.3519
       Total |  3204.98333   539  5.94616574           Root MSE      =   1.963

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |   .1328069   .0097389    13.64   0.000     .1136758     .151938
          SM |   .1235071   .0330837     3.73   0.000     .0585178    .1884963
       _cons |   5.420733   .4930224    10.99   0.000     4.452244    6.389222
------------------------------------------------------------------------------

Here is the regression of S on ASVABC and SM. We make a note of the residual sum of squares.


Now we add SF and again make a note of the residual sum of squares.

. reg S ASVABC SM SF

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  3,   536) =  104.30
       Model |  1181.36981     3  393.789935           Prob > F      =  0.0000
    Residual |  2023.61353   536  3.77539837           R-squared     =  0.3686
-------------+------------------------------           Adj R-squared =  0.3651
       Total |  3204.98333   539  5.94616574           Root MSE      =   1.943

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |   .1257087   .0098533    12.76   0.000     .1063528    .1450646
          SM |   .0492424   .0390901     1.26   0.208     -.027546    .1260309
          SF |   .1076825   .0309522     3.48   0.001       .04688    .1684851
       _cons |   5.370631   .4882155    11.00   0.000      4.41158    6.329681
------------------------------------------------------------------------------


The improvement on adding SF is the reduction in the residual sum of squares.

F(1, 540 − 4) = ((RSS₁ − RSS₂)/1) / (RSS₂/(540 − 4))
              = ((2069.3 − 2023.6)/1) / (2023.6/536) = 12.10


The cost is just the single degree of freedom lost when estimating β₄.


The remaining unexplained is the residual sum of squares after adding SF.


The number of degrees of freedom remaining after adding SF is 540 – 4 = 536.
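Putting the four ingredients together, a minimal sketch of the arithmetic (the variable names are my own):

```python
rss1 = 2069.3            # improvement baseline: S on ASVABC and SM
rss2 = 2023.6            # remaining unexplained: after adding SF
cost = 1                 # one new parameter estimated
df_remaining = 540 - 4   # degrees of freedom remaining after adding SF

F = ((rss1 - rss2) / cost) / (rss2 / df_remaining)
print(f"{F:.2f}")        # 12.10
```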


F crit, 0.1%(1, 500) = 10.96

The critical value of F at the 0.1% significance level with 500 degrees of freedom is 10.96. The critical value with 536 degrees of freedom must be lower, so we reject H0 at the 0.1% level.

The null hypothesis we are testing is exactly the same as for a two-sided t test on the coefficient of SF.


We will perform the t test. The t statistic is 3.48.



The critical value of t at the 0.1% level with 500 degrees of freedom is 3.31. The critical value with 536 degrees of freedom must be lower. So we reject H0 again.

t crit, 0.1% (500 d.f.) = 3.31


It can be shown that the F statistic for the F test of the explanatory power of a ‘group’ of one variable must be equal to the square of the t statistic for that variable. (The difference in the last digit is due to rounding error.)

3.48² = 12.11
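This equivalence can be checked numerically with the figures from the output above (a sketch; nothing here beyond the printed statistics comes from the slides):

```python
t_sf = 3.48                          # t statistic for SF
F = ((2069.3 - 2023.6) / 1) / (2023.6 / 536)

print(f"{t_sf ** 2:.2f}")            # 12.11
print(f"{F:.2f}")                    # 12.10 (difference is rounding error)
print(f"{3.31 ** 2:.2f}")            # 10.96, the squared critical value of t
```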


It can also be shown that the critical value of F must be equal to the square of the critical value of t: 3.31² = 10.96. (The critical values shown are for 500 degrees of freedom, but this must also be true for 536 degrees of freedom.)


Hence the conclusions of the two tests must coincide.

This result means that the t test of the coefficient of a variable is a test of its marginal explanatory power, after all the other variables have been included in the equation.

• If the variable is correlated with one or more of the other variables, its marginal explanatory power may be quite low, even if it genuinely belongs in the model.

• If all the variables are correlated, it is possible for all of them to have low marginal explanatory power and for none of the t tests to be significant, even though the F test for their joint explanatory power is highly significant.

• If this is the case, the model is said to be suffering from the problem of multicollinearity discussed earlier.
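The multicollinearity point can be illustrated with a hedged simulation: two nearly collinear regressors, each genuinely in the model, whose individual t statistics are typically insignificant even though the joint F test is decisive. Everything below is simulated; nothing comes from the data set used in the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x2 = rng.normal(size=n)
x3 = x2 + 0.01 * rng.normal(size=n)           # nearly an exact copy of x2
y = 1.0 + 0.5 * x2 + 0.5 * x3 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x2, x3])
b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b
rss2 = resid @ resid                          # unrestricted RSS

# Restricted model for H0: beta2 = beta3 = 0 (intercept only)
rss1 = np.sum((y - y.mean()) ** 2)
F = ((rss1 - rss2) / 2) / (rss2 / (n - 3))

# Conventional OLS standard errors and t statistics
s2 = rss2 / (n - 3)
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
print(round(F, 1))                  # joint F: large, H0 clearly rejected
print(np.round(b[1:] / se[1:], 2))  # individual t's: small in absolute value
```

The blown-up standard errors, not the coefficient estimates themselves, are what make the individual t tests uninformative here.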
