© Christopher Dougherty 1999–2006
A.1: The model is linear in parameters and correctly specified.
A.2: There does not exist an exact linear relationship among the regressors in the sample.
A.3: The disturbance term has zero expectation.
A.4: The disturbance term is homoscedastic.
A.5: The values of the disturbance term have independent distributions.
A.6: The disturbance term has a normal distribution.
PROPERTIES OF THE MULTIPLE REGRESSION COEFFICIENTS
$Y = \beta_1 + \beta_2 X_2 + \dots + \beta_k X_k + u$
Moving from the simple to the multiple regression model, we start by restating the regression model assumptions. Only A.2 is different. Previously it was stated that there must be some variation in the X variable. We will explain the difference in one of the following lectures. Provided that the regression model assumptions are valid, the OLS estimators in the multiple regression model are unbiased and efficient, as in the simple regression model.
$Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + u \qquad\qquad \hat{Y} = b_1 + b_2 X_2 + b_3 X_3$
$b_2 = \dfrac{\sum (X_{2i}-\bar{X}_2)(Y_i-\bar{Y}) \sum (X_{3i}-\bar{X}_3)^2 - \sum (X_{3i}-\bar{X}_3)(Y_i-\bar{Y}) \sum (X_{2i}-\bar{X}_2)(X_{3i}-\bar{X}_3)}{\sum (X_{2i}-\bar{X}_2)^2 \sum (X_{3i}-\bar{X}_3)^2 - \left[\sum (X_{2i}-\bar{X}_2)(X_{3i}-\bar{X}_3)\right]^2}$
We will not attempt to prove efficiency. We will however outline a proof of unbiasedness.
$Y_i - \bar{Y} = \beta_2 (X_{2i}-\bar{X}_2) + \beta_3 (X_{3i}-\bar{X}_3) + (u_i - \bar{u})$
The first step, as always, is to substitute for Y from the true relationship. The Y ingredients of b2 are actually in the form of Yi minus its mean, so it is convenient to obtain an expression for this.
$b_2 = \beta_2 + \sum a_i^* u_i$
After simplifying, we find that b2 can be decomposed into the true value β2 plus a weighted linear combination of the values of the disturbance term in the sample. This is what we found in the simple regression model. The difference is that the expression for the weights, which depend on all the values of X2 and X3 in the sample, is considerably more complicated.
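This decomposition lends itself to a quick numerical check. The following sketch (not from the text; the parameter values, sample design, and use of numpy are all illustrative assumptions) fits the two-regressor model on many simulated samples and confirms that b2 averages out close to the true β2:

```python
import numpy as np

# Illustrative simulation: in Y = beta1 + beta2*X2 + beta3*X3 + u,
# the OLS estimator b2 equals beta2 plus a weighted sum of the u values,
# so it should average out to beta2 over repeated samples.
rng = np.random.default_rng(0)
beta1, beta2, beta3 = 10.0, 2.0, -1.5      # assumed true parameters
X2 = np.linspace(1, 20, 40)                # fixed (nonstochastic) regressors
X3 = 0.5 * X2 + rng.normal(0, 3, 40)       # correlated with X2, but not exactly

X = np.column_stack([np.ones_like(X2), X2, X3])
b2_estimates = []
for _ in range(5000):
    u = rng.normal(0, 5, 40)               # disturbance term with E(u) = 0
    Y = beta1 + beta2 * X2 + beta3 * X3 + u
    b = np.linalg.lstsq(X, Y, rcond=None)[0]
    b2_estimates.append(b[1])

print(np.mean(b2_estimates))               # close to beta2 = 2.0
```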
$E(b_2) = E\!\left(\beta_2 + \sum a_i^* u_i\right) = \beta_2 + E\!\left(\sum a_i^* u_i\right) = \beta_2 + \sum E(a_i^* u_i) = \beta_2 + \sum a_i^* E(u_i) = \beta_2$
Having reached this point, proving unbiasedness is easy. Taking expectations, β2 is unaffected, being a constant. The expectation of a sum is equal to the sum of expectations.
The a* terms are nonstochastic since they depend only on the values of X2 and X3, and these are assumed to be nonstochastic. Hence the a* terms may be taken out of the expectations as factors.
By Assumption A.3, E(ui) = 0 for all i. Hence E(b2) is equal to β2 and so b2 is an unbiased estimator. Similarly b3 is an unbiased estimator of β3.
Finally we will show that b1 is an unbiased estimator of β1. This is quite simple, so you should attempt to do it yourself before looking at the rest of this sequence.
$b_1 = \bar{Y} - b_2 \bar{X}_2 - b_3 \bar{X}_3 = (\beta_1 + \beta_2 \bar{X}_2 + \beta_3 \bar{X}_3 + \bar{u}) - b_2 \bar{X}_2 - b_3 \bar{X}_3$
First substitute for the sample mean of Y.
$E(b_1) = \beta_1 + \beta_2 \bar{X}_2 + \beta_3 \bar{X}_3 + E(\bar{u}) - \bar{X}_2 E(b_2) - \bar{X}_3 E(b_3) = \beta_1$
Now take expectations. The first three terms are nonstochastic, so they are unaffected by taking expectations.
The expected value of the mean of the disturbance term is zero since E(u) is zero in each observation. We have just shown that E(b2) is equal to β2 and that E(b3) is equal to β3.
Hence b1 is an unbiased estimator of β1.
PRECISION OF THE MULTIPLE REGRESSION COEFFICIENTS
This sequence investigates the variances and standard errors of the slope coefficients in a model with two explanatory variables. The expression for the variance of b2 is shown below. The expression for the variance of b3 is the same, with the subscripts 2 and 3 interchanged.
$\sigma_{b_2}^2 = \dfrac{\sigma_u^2}{\sum (X_{2i}-\bar{X}_2)^2} \times \dfrac{1}{1 - r_{X_2 X_3}^2} = \dfrac{\sigma_u^2}{n\,\mathrm{MSD}(X_2)} \times \dfrac{1}{1 - r_{X_2 X_3}^2}$
The first factor in the expression is identical to that for the variance of the slope coefficient in a simple regression model. The variance of b2 depends on the variance of the disturbance term, the number of observations, and the mean square deviation of X2 for exactly the same reasons as in a simple regression model.
The difference is that in multiple regression analysis the expression is multiplied by a factor which depends on the correlation between X2 and X3.
The higher the correlation between the explanatory variables, positive or negative, the greater the variance will be. This is easy to understand intuitively. The greater the correlation, the harder it is to discriminate between the effects of the explanatory variables on Y, and the less accurate will be the regression estimates. Note that the variance expression above is valid only for a model with two explanatory variables. When there are more than two, the expression becomes much more complex and it is sensible to switch to matrix algebra.
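The effect of the 1/(1 – r²) factor can be seen in a small simulation (a sketch, not from the text; all parameter values and the use of numpy are arbitrary choices): the empirical variance of b2 is many times larger when X3 is highly correlated with X2.

```python
import numpy as np

# Compare the sampling variance of b2 when X2 and X3 are nearly
# uncorrelated versus highly correlated. Everything else is held fixed.
rng = np.random.default_rng(1)
n = 50
X2 = rng.normal(0, 1, n)

def var_b2(X3, reps=2000):
    X = np.column_stack([np.ones(n), X2, X3])
    slopes = []
    for _ in range(reps):
        u = rng.normal(0, 1, n)
        Y = 1.0 + 2.0 * X2 + 3.0 * X3 + u
        slopes.append(np.linalg.lstsq(X, Y, rcond=None)[0][1])
    return np.var(slopes)

v_low = var_b2(rng.normal(0, 1, n))                     # r close to 0
v_high = var_b2(0.95 * X2 + 0.1 * rng.normal(0, 1, n))  # r close to 1
print(v_low, v_high)   # v_high is many times larger than v_low
```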
The standard deviation of the distribution of b2 is of course given by the square root of its variance. With the exception of the variance of u, we can calculate the components of the standard deviation from the sample data.
$\text{standard deviation of } b_2 = \sqrt{\dfrac{\sigma_u^2}{\sum (X_{2i}-\bar{X}_2)^2} \times \dfrac{1}{1 - r_{X_2 X_3}^2}}$
The variance of u has to be estimated. The mean square of the residuals provides a consistent estimator, but in a finite sample it is biased downwards by a factor (n – k) / n, where k is the number of parameters.
$E\!\left(\dfrac{1}{n}\sum e_i^2\right) = \dfrac{n-k}{n}\,\sigma_u^2$
Obviously we can obtain an unbiased estimator by dividing the sum of the squares of the residuals by n – k instead of n. We denote this unbiased estimator su².
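A simulation sketch of this bias (illustrative values, not from the text; the design matrix, parameters, and use of numpy are all assumptions): dividing RSS by n averages out below the true variance, while dividing by n – k recovers it.

```python
import numpy as np

# True disturbance variance is 4.0; with n = 30 observations and k = 3
# parameters, E(RSS/n) = (27/30)*4.0 = 3.6, while E(RSS/(n-k)) = 4.0.
rng = np.random.default_rng(3)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta = np.array([1.0, 2.0, 3.0])
sigma2 = 4.0

biased, unbiased = [], []
for _ in range(4000):
    u = rng.normal(0, np.sqrt(sigma2), n)
    Y = X @ beta + u
    b = np.linalg.lstsq(X, Y, rcond=None)[0]
    e = Y - X @ b                             # residuals
    biased.append(np.sum(e**2) / n)
    unbiased.append(np.sum(e**2) / (n - k))

print(np.mean(biased), np.mean(unbiased))     # roughly 3.6 and 4.0
```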
$s_u^2 = \dfrac{1}{n-k}\sum e_i^2$
$\mathrm{s.e.}(b_2) = \sqrt{\dfrac{s_u^2}{\sum (X_{2i}-\bar{X}_2)^2} \times \dfrac{1}{1 - r_{X_2 X_3}^2}}$
Thus the estimate of the standard deviation of the probability distribution of b2, known as the standard error of b2 for short, is given by the expression above.
We will use this expression to analyze why the standard error of S is larger for the union subsample than for the non-union subsample in earnings function regressions using Data Set 21.
. reg EARNINGS S EXP if COLLBARG==1
      Source |       SS       df       MS              Number of obs =     101
-------------+------------------------------           F(  2,    98) =    9.72
       Model |  3076.31726     2  1538.15863           Prob > F      =  0.0001
    Residual |  15501.9762    98   158.18343           R-squared     =  0.1656
-------------+------------------------------           Adj R-squared =  0.1486
       Total |  18578.2934   100  185.782934           Root MSE      =  12.577

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.333846   .5492604     4.25   0.000     1.243857    3.423836
         EXP |   .2235095   .3389455     0.66   0.511    -.4491169    .8961358
       _cons |  -15.12427   11.38141    -1.33   0.187    -37.71031    7.461779
------------------------------------------------------------------------------
To select a subsample in Stata, you add an ‘if’ statement to a command. The COLLBARG variable is equal to 1 for respondents whose rates of pay are determined by collective bargaining, and it is 0 for the others. Note that in tests for equality, Stata requires the = sign to be duplicated.
In the case of the union subsample, the standard error of S is 0.5493.
. reg EARNINGS S EXP if COLLBARG==0
      Source |       SS       df       MS              Number of obs =     439
-------------+------------------------------           F(  2,   436) =   57.77
       Model |  19540.1761     2  9770.08805           Prob > F      =  0.0000
    Residual |   73741.593   436  169.132094           R-squared     =  0.2095
-------------+------------------------------           Adj R-squared =  0.2058
       Total |  93281.7691   438  212.972076           Root MSE      =  13.005

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.721698   .2604411    10.45   0.000     2.209822    3.233574
         EXP |   .6077342   .1400846     4.34   0.000     .3324091    .8830592
       _cons |  -28.00805   4.643211    -6.03   0.000    -37.13391   -18.88219
------------------------------------------------------------------------------
In the case of the non-union subsample, the standard error of S is 0.2604, less than half as large.
$\mathrm{s.e.}(b_2) = s_u \times \sqrt{\dfrac{1}{n}} \times \sqrt{\dfrac{1}{\mathrm{MSD}(X_2)}} \times \sqrt{\dfrac{1}{1 - r_{X_2 X_3}^2}}$
We will explain the difference by looking at the components of the standard error.
Decomposition of the standard error of S

Component     s_u       n       MSD(S)    r_{S,EXP}    s.e.
Union         12.577    101     6.2325    -0.4087      0.5493
Non-union     13.005    439     5.8666    -0.1784      0.2604

Factor        s_u       √(1/n)  √(1/MSD)  √(1/(1-r²))  product
Union         12.577    0.0995  0.4006    1.0957       0.5493
Non-union     13.005    0.0477  0.4129    1.0163       0.2603
$s_u = \sqrt{\dfrac{\mathrm{RSS}}{n-k}}$
We will start with su. RSS for the union subsample, 15501.9762, appears in the regression output above.
There are 101 observations in the union subsample. k is equal to 3. Thus n – k is equal to 98.
RSS / (n – k) is equal to 158.183. To obtain su, we take the square root, which is 12.577.
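The arithmetic can be checked directly (a sketch; RSS and the degrees of freedom are taken from the regression output above):

```python
import math

# s_u = sqrt(RSS / (n - k)) for the union subsample.
rss, n, k = 15501.9762, 101, 3
s_u = math.sqrt(rss / (n - k))
print(round(rss / (n - k), 3), round(s_u, 3))   # 158.183 and 12.577
```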
We place this in the table, along with the number of observations.
Similarly, in the case of the non-union subsample, su is the square root of 169.132, which is 13.005. We also note that the number of observations in that subsample is 439.
We place these in the table.
We calculate the mean square deviation of S for the two subsamples from the sample data.
. cor S EXP if COLLBARG==1
(obs=101)

        |      S    EXP
--------+------------------
      S |  1.0000
    EXP | -0.4087  1.0000

. cor S EXP if COLLBARG==0
(obs=439)

        |      S    EXP
--------+------------------
      S |  1.0000
    EXP | -0.1784  1.0000
The correlation coefficients for S and EXP are –0.4087 and –0.1784 for the union and non-union subsamples, respectively. (Note that "cor" is the Stata command for computing correlations.)
These entries complete the top half of the table. We will now look at the impact of each item on the standard error, using the mathematical expression above.
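The bottom half of the table can be reproduced from the components in the top half (a sketch; the union-subsample numbers are those quoted in the text):

```python
import math

# s.e.(b2) = s_u * sqrt(1/n) * sqrt(1/MSD(S)) * sqrt(1/(1 - r^2))
s_u, n, msd, r = 12.577, 101, 6.2325, -0.4087   # union subsample

f1 = math.sqrt(1 / n)             # 0.0995
f2 = math.sqrt(1 / msd)           # 0.4006
f3 = math.sqrt(1 / (1 - r**2))    # 1.0957
se = s_u * f1 * f2 * f3
print(round(se, 4))               # 0.5493, matching the regression output
```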
The su component needs no modification. It is a little larger for the non-union subsample, and this has an adverse effect on that subsample’s standard error.
The number of observations is much larger for the non-union subsample, so the second factor is much smaller than that for the union subsample.
Perhaps surprisingly, the variance in schooling is a little larger for the union subsample.
The correlation between schooling and work experience is greater for the union subsample, and this has an adverse effect on its standard error. Note that the sign of the correlation makes no difference since it is squared.
We see that the reason the standard error is smaller for the non-union subsample is that it contains far more observations than the union subsample. Otherwise the standard errors would have been about the same. The greater correlation between S and EXP has an adverse effect on the union standard error, but this is just about offset by the smaller su and the larger variance of S.
X2 X3 Y
10 19 51
11 21 56
12 23 61
13 25 66
14 27 71
15 29 76
MULTICOLLINEARITY
$Y = 2 + 3X_2 + X_3 \qquad\qquad X_3 = 2X_2 - 1$
Suppose that Y = 2 + 3X2 + X3 and that X3 = 2X2 – 1. There is no disturbance term in the equation for Y, but that is not important. Suppose that we have the six observations shown.
The three variables are plotted as line graphs above. Looking at the data, it is impossible to tell whether the changes in Y are caused by changes in X2, by changes in X3, or jointly by changes in both X2 and X3.
[Figure: line graphs of Y, X3, and X2 plotted against the observation number, 1 to 6.]
X2    X3     Y    Change in X2   Change in X3   Change in Y
10    19    51         1              2              5
11    21    56         1              2              5
12    23    61         1              2              5
13    25    66         1              2              5
14    27    71         1              2              5
15    29    76         1              2              5
Numerically, Y increases by 5 in each observation when X2 changes by 1.
Hence the true relationship could have been Y = 1 + 5X2.
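A two-line check (a sketch using the six observations in the table) confirms that the two relationships generate identical Y values:

```python
# Y = 2 + 3*X2 + X3 with X3 = 2*X2 - 1 collapses to Y = 1 + 5*X2.
X2 = [10, 11, 12, 13, 14, 15]
X3 = [2 * x - 1 for x in X2]
Y = [2 + 3 * x2 + x3 for x2, x3 in zip(X2, X3)]
print(Y)                          # [51, 56, 61, 66, 71, 76]
print([1 + 5 * x for x in X2])    # the same values
```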
[Figure: the same line graphs, annotated with the question Y = 1 + 5X2 ?]
$Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + u \qquad\qquad X_3 = 2X_2$
What would happen if you tried to run a regression when there is an exact linear relationship among the explanatory variables? We will investigate, using the model with two explanatory variables shown above. [Note: A disturbance term has now been included in the true model, but it makes no difference to the analysis.]
$b_2 = \dfrac{\sum (X_{2i}-\bar{X}_2)(Y_i-\bar{Y}) \sum (X_{3i}-\bar{X}_3)^2 - \sum (X_{3i}-\bar{X}_3)(Y_i-\bar{Y}) \sum (X_{2i}-\bar{X}_2)(X_{3i}-\bar{X}_3)}{\sum (X_{2i}-\bar{X}_2)^2 \sum (X_{3i}-\bar{X}_3)^2 - \left[\sum (X_{2i}-\bar{X}_2)(X_{3i}-\bar{X}_3)\right]^2}$
The expression for the multiple regression coefficient b2 is shown above. We will substitute for X3 using its relationship with X2.
$\sum (X_{3i}-\bar{X}_3)^2 = \sum (2X_{2i}-2\bar{X}_2)^2 = 4\sum (X_{2i}-\bar{X}_2)^2$
First, we replace the $\sum (X_{3i}-\bar{X}_3)^2$ terms in the numerator and denominator, using the expression just derived.
$\sum (X_{2i}-\bar{X}_2)(X_{3i}-\bar{X}_3) = \sum (X_{2i}-\bar{X}_2)(2X_{2i}-2\bar{X}_2) = 2\sum (X_{2i}-\bar{X}_2)^2$
Next, we replace the $\sum (X_{2i}-\bar{X}_2)(X_{3i}-\bar{X}_3)$ terms in the same way.
$\sum (X_{3i}-\bar{X}_3)(Y_i-\bar{Y}) = \sum (2X_{2i}-2\bar{X}_2)(Y_i-\bar{Y}) = 2\sum (X_{2i}-\bar{X}_2)(Y_i-\bar{Y})$
Finally, we replace the $\sum (X_{3i}-\bar{X}_3)(Y_i-\bar{Y})$ term in the numerator.
$b_2 = \dfrac{\sum (X_{2i}-\bar{X}_2)(Y_i-\bar{Y}) \cdot 4\sum (X_{2i}-\bar{X}_2)^2 - 2\sum (X_{2i}-\bar{X}_2)(Y_i-\bar{Y}) \cdot 2\sum (X_{2i}-\bar{X}_2)^2}{\sum (X_{2i}-\bar{X}_2)^2 \cdot 4\sum (X_{2i}-\bar{X}_2)^2 - \left[2\sum (X_{2i}-\bar{X}_2)^2\right]^2} = \dfrac{0}{0}$
After all the replacements, it turns out that the numerator and the denominator are both equal to zero. The regression coefficient is not defined. It is unusual for there to be an exact relationship among the explanatory variables in a regression. When this occurs, it is typically because there is a logical error in the specification.
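In matrix terms the same failure shows up as a singular cross-product matrix (a sketch assuming numpy; the data values are arbitrary):

```python
import numpy as np

# With X3 = 2*X2 exactly, the columns of the regressor matrix are
# linearly dependent, so X'X cannot be inverted and the normal
# equations have no unique solution.
X2 = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])
X3 = 2 * X2
X = np.column_stack([np.ones_like(X2), X2, X3])

rank = np.linalg.matrix_rank(X.T @ X)
print(rank)   # 2, not 3: one column is redundant
```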
. reg EARNINGS S EXP EXPSQ
      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  3,   536) =   45.57
       Model |  22762.4472     3  7587.48241           Prob > F      =  0.0000
    Residual |  89247.7839   536  166.507059           R-squared     =  0.2032
-------------+------------------------------           Adj R-squared =  0.1988
       Total |  112010.231   539  207.811189           Root MSE      =  12.904

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.754372   .2417286    11.39   0.000     2.279521    3.229224
         EXP |  -.2353907    .665197    -0.35   0.724    -1.542103    1.071322
       EXPSQ |   .0267843   .0219115     1.22   0.222    -.0162586    .0698272
       _cons |  -22.21964   5.514827    -4.03   0.000    -33.05297   -11.38632
------------------------------------------------------------------------------
$EARNINGS = \beta_1 + \beta_2 S + \beta_3 EXP + \beta_4 EXPSQ + u$
However, it often happens that there is an approximate relationship. For example, when relating earnings to schooling and work experience, it is often reasonable to suppose that the effect of work experience is subject to diminishing returns. A standard way of allowing for this is to include EXPSQ, the square of EXP, in the specification. According to the hypothesis of diminishing returns, β4 should be negative.
We fit this specification using Data Set 21. The schooling component of the regression results is not much affected by the inclusion of the EXPSQ term. The coefficient of S indicates that an extra year of schooling increases hourly earnings by $2.75.
. reg EARNINGS S EXP
      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =   67.54
       Model |  22513.6473     2  11256.8237           Prob > F      =  0.0000
    Residual |  89496.5838   537  166.660305           R-squared     =  0.2010
-------------+------------------------------           Adj R-squared =  0.1980
       Total |  112010.231   539  207.811189           Root MSE      =   12.91

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.678125   .2336497    11.46   0.000     2.219146    3.137105
         EXP |   .5624326   .1285136     4.38   0.000     .3099816    .8148837
       _cons |  -26.48501    4.27251    -6.20   0.000    -34.87789   -18.09213
------------------------------------------------------------------------------
In the specification without EXPSQ, shown above, it was 2.68, not much different.
The standard error, 0.23 in the specification without EXPSQ, is also little changed and the coefficient remains highly significant.
By contrast, the inclusion of the new term has had a dramatic effect on the coefficient of EXP. Now it is negative, which makes little sense, and insignificant!
Previously it had been positive and highly significant.
. reg EARNINGS S EXP
Source | SS df MS Number of obs = 540-------------+------------------------------ F( 2, 537) = 67.54 Model | 22513.6473 2 11256.8237 Prob > F = 0.0000 Residual | 89496.5838 537 166.660305 R-squared = 0.2010-------------+------------------------------ Adj R-squared = 0.1980 Total | 112010.231 539 207.811189 Root MSE = 12.91
------------------------------------------------------------------------------ EARNINGS | Coef. Std. Err. t P>|t| [95% Conf. Interval]-------------+---------------------------------------------------------------- S | 2.678125 .2336497 11.46 0.000 2.219146 3.137105 EXP | .5624326 .1285136 4.38 0.000 .3099816 .8148837 _cons | -26.48501 4.27251 -6.20 0.000 -34.87789 -18.09213------------------------------------------------------------------------------
MULTICOLLINEARITY
© Christopher Dougherty 1999–2006
. reg EARNINGS S EXP EXPSQ
Source | SS df MS Number of obs = 540-------------+------------------------------ F( 3, 536) = 45.57 Model | 22762.4472 3 7587.48241 Prob > F = 0.0000 Residual | 89247.7839 536 166.507059 R-squared = 0.2032-------------+------------------------------ Adj R-squared = 0.1988 Total | 112010.231 539 207.811189 Root MSE = 12.904
The reason for these problems is that EXPSQ is highly correlated with EXP. This makes it difficult to discriminate between the individual effects of EXP and EXPSQ, and the regression estimates tend to be erratic.
. cor EXP EXPSQ
(obs=540)

        |    EXP  EXPSQ
--------+------------------
    EXP | 1.0000
  EXPSQ | 0.9812  1.0000
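The near-unit correlation between a variable and its square is easy to reproduce with a quick simulation. The sketch below uses made-up experience values on a plausible range; nothing here comes from the actual EAEF data.

```python
import numpy as np

# Illustrative only: simulate work experience on a plausible range
# (the actual EAEF data are not reproduced here).
rng = np.random.default_rng(0)
exp_years = rng.uniform(0, 25, size=540)   # hypothetical EXP values
expsq = exp_years ** 2                     # EXPSQ = EXP squared

r = np.corrcoef(exp_years, expsq)[0, 1]
print(round(r, 3))  # typically close to 0.97 for a spread like this
```

Because EXP is always positive, EXP and EXPSQ rise together over the whole sample, which is what drives the correlation so high.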
When high correlations among the explanatory variables lead to erratic point estimates of the coefficients, large standard errors, and unsatisfactorily low t statistics, the regression is said to be suffering from multicollinearity.
Multicollinearity may also be caused by an approximate linear relationship among the explanatory variables. When there are only two, an approximate linear relationship implies a high correlation between them, but this is not necessarily the case when there are more than two.
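The last point can be illustrated with a hypothetical example (all data made up): three regressors where X3 is approximately X1 + X2, so that no single pairwise correlation looks alarming, yet X3 is almost perfectly explained by the other two together.

```python
import numpy as np

# Hypothetical illustration: with three regressors, an approximate linear
# relationship such as X3 ~ X1 + X2 produces severe multicollinearity
# even though no pairwise correlation looks extreme.
rng = np.random.default_rng(1)
x1 = rng.normal(size=1000)
x2 = rng.normal(size=1000)
x3 = x1 + x2 + rng.normal(scale=0.1, size=1000)   # near-exact linear combination

r12 = np.corrcoef(x1, x2)[0, 1]                   # near zero
r13 = np.corrcoef(x1, x3)[0, 1]                   # around 0.7, not extreme
# Yet X3 is almost perfectly explained by X1 and X2 together:
A = np.column_stack([np.ones(1000), x1, x2])
b, rss, *_ = np.linalg.lstsq(A, x3, rcond=None)
r_squared = 1 - rss[0] / ((x3 - x3.mean()) ** 2).sum()
print(round(abs(r12), 2), round(r13, 2), round(r_squared, 2))
```

The regression R-squared of X3 on X1 and X2, not any pairwise correlation, is what signals the problem when there are more than two regressors.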
ALLEVIATION OF MULTICOLLINEARITY
What can you do about multicollinearity if you encounter it? We will discuss some possible measures, looking at the model with two explanatory variables. Before doing this, two important points should be emphasized.
• First, multicollinearity does not cause the regression coefficients to be biased. Their probability distributions are still centered over the true values, provided that the regression is correctly specified, but they have unsatisfactorily large variances.
• Second, the standard errors and t tests remain valid. The standard errors are larger than they would have been in the absence of multicollinearity, warning us that the regression estimates are erratic.
Since the problem of multicollinearity is caused by the variances of the coefficients being unsatisfactorily large, we will seek ways of reducing them.
Possible measures for alleviating multicollinearity
(1) Reduce $\sigma_u^2$ by including further relevant variables in the model.

We will focus on the slope coefficient $b_2$ and look at the various components of its variance. We might be able to reduce the variance by bringing more variables into the model, thereby reducing $\sigma_u^2$, the variance of the disturbance term.
$\sigma_{b_2}^2 = \dfrac{\sigma_u^2}{\sum_i (X_{2i} - \bar{X}_2)^2} \times \dfrac{1}{1 - r_{X_2 X_3}^2} = \dfrac{\sigma_u^2}{n\,\mathrm{MSD}(X_2)} \times \dfrac{1}{1 - r_{X_2 X_3}^2}$
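This expression, $\sigma_{b_2}^2 = \sigma_u^2 / (n\,\mathrm{MSD}(X_2)) \times 1/(1 - r_{X_2 X_3}^2)$, can be checked numerically. The sketch below simulates repeated samples from a model with made-up coefficients (nothing here comes from the earnings data) and compares the empirical variance of $b_2$ with the formula.

```python
import numpy as np

# Minimal Monte Carlo check of the variance expression, with made-up
# parameter values.
rng = np.random.default_rng(2)
n, sigma_u = 100, 2.0
beta = np.array([1.0, 0.5, 0.5])

# Fixed regressors with a built-in correlation between X2 and X3
x2 = rng.normal(size=n)
x3 = 0.8 * x2 + rng.normal(scale=0.6, size=n)
X = np.column_stack([np.ones(n), x2, x3])

b2_draws = []
for _ in range(10000):
    y = X @ beta + rng.normal(scale=sigma_u, size=n)
    b2_draws.append(np.linalg.lstsq(X, y, rcond=None)[0][1])

msd_x2 = ((x2 - x2.mean()) ** 2).mean()   # mean square deviation of X2
r23 = np.corrcoef(x2, x3)[0, 1]           # sample correlation of X2 and X3
theory = sigma_u**2 / (n * msd_x2) / (1 - r23**2)
empirical = np.var(b2_draws)
print(abs(empirical - theory) / theory < 0.1)  # True: formula matches simulation
```

The $1/(1 - r_{X_2 X_3}^2)$ factor is the variance inflation caused by the correlation between the regressors; as the correlation approaches 1, the variance explodes.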
The estimator of the variance of the disturbance term is the residual sum of squares divided by n – k, where n is the number of observations (540) and k is the number of parameters (4). Here it is 166.5.
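The arithmetic can be checked directly from the numbers in the regression output:

```python
# The estimate quoted in the output: residual sum of squares divided by n - k.
rss, n, k = 89247.7839, 540, 4
sigma_u_sq_hat = rss / (n - k)
print(round(sigma_u_sq_hat, 1))  # 166.5, matching the Residual MS in the table
```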
. reg EARNINGS S EXP EXPSQ MALE ASVABC
Source | SS df MS Number of obs = 540-------------+------------------------------ F( 5, 534) = 37.24 Model | 28957.3532 5 5791.47063 Prob > F = 0.0000 Residual | 83052.8779 534 155.529734 R-squared = 0.2585-------------+------------------------------ Adj R-squared = 0.2516 Total | 112010.231 539 207.811189 Root MSE = 12.471
------------------------------------------------------------------------------ EARNINGS | Coef. Std. Err. t P>|t| [95% Conf. Interval]-------------+---------------------------------------------------------------- S | 2.031419 .296218 6.86 0.000 1.449524 2.613315 EXP | -.0816828 .6441767 -0.13 0.899 -1.347114 1.183748 EXPSQ | .0130223 .021334 0.61 0.542 -.0288866 .0549311 MALE | 5.762358 1.104734 5.22 0.000 3.592201 7.932515 ASVABC | .2447687 .0714294 3.43 0.001 .1044516 .3850858 _cons | -26.18541 5.452032 -4.80 0.000 -36.89547 -15.47535------------------------------------------------------------------------------
We now add two new variables that are often found to be determinants of earnings: MALE, sex of respondent, and ASVABC, the composite score on the cognitive tests in the Armed Services Vocational Aptitude Battery. MALE is a qualitative variable; the treatment of such variables will be explained in Chapter 5.
Both MALE and ASVABC have coefficients significant at the 0.1% level.
However, they account for only a small proportion of the variance in earnings, and the reduction in the estimate of the variance of the disturbance term (from 166.5 to 155.5) is likewise small.
As a consequence the impact on the standard errors of EXP and EXPSQ is negligible.
Note how unstable the coefficients are. This is often a sign of multicollinearity.
(2) Increase the number of observations.
Surveys: increase the budget, or use clustering.
Time series: use quarterly instead of annual data.
The next factor to look at is n, the number of observations. If you are working with cross-section data (individuals, households, enterprises, etc.) and you are undertaking a survey, you could increase the size of the sample by negotiating a bigger budget. Alternatively, you could make use of a clustering technique: you divide the country into localities and select a number of these randomly, perhaps using stratified random sampling to make sure that metropolitan, other urban, and rural areas are properly represented. You then confine the survey to the localities selected. This reduces the travel time and cost of the fieldworkers, allowing them to interview a greater number of respondents.
If you are working with time series data, you may be able to increase the sample by working with shorter time intervals for the data, for example quarterly or even monthly data instead of annual data.
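The effect of a larger n can be sketched with a simulation (made-up design and coefficients): holding the spread of X fixed, the standard deviation of the slope estimator falls roughly in proportion to $1/\sqrt{n}$.

```python
import numpy as np

# Sketch: how the precision of a slope estimate improves as n grows.
# All values are made up; with the same spread of X, the standard
# deviation of the slope falls roughly as 1/sqrt(n).
rng = np.random.default_rng(3)

def slope_sd(n, reps=4000):
    x = np.linspace(0, 10, n)                   # same spread for any n
    X = np.column_stack([np.ones(n), x])
    XtX_inv = np.linalg.inv(X.T @ X)
    slopes = [(XtX_inv @ X.T @ (1.0 + 0.5 * x + rng.normal(scale=2.0, size=n)))[1]
              for _ in range(reps)]
    return np.std(slopes)

ratio = slope_sd(100) / slope_sd(400)
print(round(ratio, 1))  # close to 2, i.e. sqrt(400/100)
```

Quadrupling the sample roughly halves the standard errors, which is why the larger data set below yields visibly smaller ones.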
. reg EARNINGS S EXP EXPSQ MALE ASVABC
Source | SS df MS Number of obs = 2714-------------+------------------------------ F( 5, 2708) = 183.99 Model | 161795.573 5 32359.1147 Prob > F = 0.0000 Residual | 476277.268 2708 175.877869 R-squared = 0.2536-------------+------------------------------ Adj R-squared = 0.2522 Total | 638072.841 2713 235.190874 Root MSE = 13.262
------------------------------------------------------------------------------ EARNINGS | Coef. Std. Err. t P>|t| [95% Conf. Interval]-------------+---------------------------------------------------------------- S | 2.312461 .135428 17.08 0.000 2.046909 2.578014 EXP | -.3270651 .308231 -1.06 0.289 -.9314569 .2773268 EXPSQ | .023743 .0101558 2.34 0.019 .0038291 .0436569 MALE | 5.947206 .5221755 11.39 0.000 4.923303 6.971108 ASVABC | .2086846 .0336869 6.19 0.000 .1426301 .2747392 _cons | -27.40462 2.579435 -10.62 0.000 -32.46248 -22.34676------------------------------------------------------------------------------
Here is the result of running the regression with all 2,714 observations in the EAEF data set.
Comparing this result with that using Data Set 21, we see that the standard errors are much smaller, as expected.
As a consequence, the t statistics of the variables are higher. However, the correlation between EXP and EXPSQ is as high as in the smaller sample, and the increase in the sample size has not been large enough to have much impact on the problem of multicollinearity.
(3) Increase $\mathrm{MSD}(X_2)$, the mean square deviation of $X_2$.
A third possible way of reducing the problem of multicollinearity might be to increase the variation in the explanatory variables. This is possible only at the design stage of a survey. For example, if you were planning a household survey with the aim of investigating how expenditure patterns vary with income, you should make sure that the sample included relatively rich and relatively poor households as well as middle-income households.
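A simulation (entirely made-up values) shows why the spread of X matters: two designs with the same sample size and disturbance variance, but one with ten times the spread in X, give very different precision for the slope.

```python
import numpy as np

# Sketch: same n and disturbance variance, different spreads of X.
# The wider design estimates the slope far more precisely, as the
# MSD term in the variance formula implies. All values are made up.
rng = np.random.default_rng(4)

def slope_sd(x, reps=5000):
    X = np.column_stack([np.ones(len(x)), x])
    XtX_inv = np.linalg.inv(X.T @ X)
    slopes = [(XtX_inv @ X.T @ (2.0 + 0.3 * x + rng.normal(scale=1.0, size=len(x))))[1]
              for _ in range(reps)]
    return np.std(slopes)

sd_narrow = slope_sd(np.linspace(4.5, 5.5, 50))   # little variation in X
sd_wide = slope_sd(np.linspace(0.0, 10.0, 50))    # ten times the spread
print(sd_narrow / sd_wide > 5)  # True: the narrow design is far noisier
```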
(4) Reduce $r_{X_2 X_3}$.
Another possibility might be to reduce the correlation between the explanatory variables. This is possible only at the design stage of a survey and even then it is not easy.
(5) Combine the correlated variables.
If the correlated variables are similar conceptually, it may be reasonable to combine them into some overall index.
That is precisely what has been done with the three cognitive ASVAB variables. ASVABC has been calculated as a weighted average of ASVAB02 (arithmetic reasoning), ASVAB03 (word knowledge), and ASVAB04 (paragraph comprehension). The three components are highly correlated, and by combining them as a weighted average, rather than using them individually, one avoids a potential problem of multicollinearity.
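The idea can be sketched with simulated test scores driven by a common underlying factor. The weights below are hypothetical equal weights, chosen purely for illustration, not the weights actually used to construct ASVABC.

```python
import numpy as np

# Sketch of the idea behind a composite index like ASVABC, with
# made-up data: three components share a common underlying factor,
# so they are highly correlated with one another.
rng = np.random.default_rng(5)
ability = rng.normal(size=540)                       # common underlying factor
asvab02 = ability + rng.normal(scale=0.4, size=540)  # arithmetic reasoning
asvab03 = ability + rng.normal(scale=0.4, size=540)  # word knowledge
asvab04 = ability + rng.normal(scale=0.4, size=540)  # paragraph comprehension

r = np.corrcoef(asvab02, asvab03)[0, 1]
composite = (asvab02 + asvab03 + asvab04) / 3        # equal weights, for illustration
print(round(r, 2))  # the components are highly correlated with each other
```

Entering `composite` as a single regressor avoids the near-collinearity that would arise from entering the three components separately.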
(6) Drop some of the correlated variables.
Dropping some of the correlated variables, if they have insignificant coefficients, may alleviate multicollinearity. However, this approach is dangerous. It is possible that some of the variables with insignificant coefficients really do belong in the model and that their coefficients are insignificant only because of multicollinearity. If that is the case, omitting them may cause omitted variable bias, to be discussed in Chapter 6.
(7) Empirical restriction
$Y = \beta_1 + \beta_2 X + \beta_3 P + u$
A further way of dealing with the problem of multicollinearity is to use extraneous information, if available, concerning the coefficient of one of the variables. For example, suppose that Y in the equation above is the demand for a category of consumer expenditure, X is aggregate disposable personal income, and P is a price index for the category. To fit a model of this type you would use time series data. If X and P are highly correlated, which is often the case with time series variables, the problem of multicollinearity might be eliminated in the following way.
$Y = \beta_1 + \beta_2 X + \beta_3 P + u$
$Y' = \beta_1' + \beta_2' X' + u'$
$\hat{Y}' = b_1' + b_2' X'$
Obtain data on income and expenditure on the category from a household survey and regress Y' on X'. (The prime marks indicate that the data are household data, not aggregate data.) This is a simple regression because there will be relatively little variation in the price paid by the households.
$Y = \beta_1 + \beta_2 X + \beta_3 P + u$
$Y' = \beta_1' + \beta_2' X' + u'$, fitted as $\hat{Y}' = b_1' + b_2' X'$

Imposing $\beta_2 = b_2'$:
$Y = \beta_1 + b_2' X + \beta_3 P + u$
$Z = Y - b_2' X = \beta_1 + \beta_3 P + u$
Now substitute $b_2'$ for $\beta_2$ in the time series model. Subtract $b_2' X$ from both sides, and regress $Z = Y - b_2' X$ on price. This is a simple regression, so multicollinearity has been eliminated.
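The two-step procedure can be sketched with simulated data; every number below (coefficients, sample sizes, noise levels) is made up for illustration. Step 1 estimates the income coefficient from "household" data where price barely varies; step 2 regresses $Z = Y - b_2' X$ on price alone.

```python
import numpy as np

# Sketch of the extraneous-information procedure with simulated data.
rng = np.random.default_rng(6)
beta1, beta2, beta3 = 5.0, 0.8, -2.0   # made-up true coefficients

# 'Household' cross-section: income varies, price effectively fixed
x_h = rng.uniform(10, 50, 300)
y_h = beta1 + beta2 * x_h + rng.normal(scale=2.0, size=300)
b2_prime = np.polyfit(x_h, y_h, 1)[0]           # simple regression slope

# 'Time series': income and price highly correlated
x_t = np.linspace(20, 40, 40) + rng.normal(scale=0.5, size=40)
p_t = 0.1 * x_t + rng.normal(scale=0.2, size=40)     # price tracks income
y_t = beta1 + beta2 * x_t + beta3 * p_t + rng.normal(scale=1.0, size=40)

z = y_t - b2_prime * x_t                        # impose the restriction
b3 = np.polyfit(p_t, z, 1)[0]                   # simple regression of Z on P
print(round(b3, 1))  # close to the true beta3 of -2
```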
There are some problems with this technique. First, the $\beta_2$ coefficients may be conceptually different in time series and cross-section contexts. Second, since we subtract the estimated income component $b_2' X$, not the true income component $\beta_2 X$, from Y when constructing Z, we have introduced an element of measurement error into the dependent variable.
(8) Theoretical restriction
$S = \beta_1 + \beta_2 ASVABC + \beta_3 SM + \beta_4 SF + u$
Last, but by no means least, is the use of a theoretical restriction, which is defined as a hypothetical relationship among the parameters of a regression model. It will be explained using an educational attainment model as an example. Suppose that we hypothesize that highest grade completed, S, depends on ASVABC and on the highest grade completed by the respondent's mother and father, SM and SF respectively.
. reg S ASVABC SM SF
Source | SS df MS Number of obs = 540-------------+------------------------------ F( 3, 536) = 104.30 Model | 1181.36981 3 393.789935 Prob > F = 0.0000 Residual | 2023.61353 536 3.77539837 R-squared = 0.3686-------------+------------------------------ Adj R-squared = 0.3651 Total | 3204.98333 539 5.94616574 Root MSE = 1.943
------------------------------------------------------------------------------ S | Coef. Std. Err. t P>|t| [95% Conf. Interval]-------------+---------------------------------------------------------------- ASVABC | .1257087 .0098533 12.76 0.000 .1063528 .1450646 SM | .0492424 .0390901 1.26 0.208 -.027546 .1260309 SF | .1076825 .0309522 3.48 0.001 .04688 .1684851 _cons | 5.370631 .4882155 11.00 0.000 4.41158 6.329681------------------------------------------------------------------------------
A one-point increase in ASVABC increases S by 0.13 years.
S increases by 0.05 years for every extra year of schooling of the mother and 0.11 years for every extra year of schooling of the father. Mother's education is generally held to be at least as important as, if not more important than, father's education for educational attainment, so this outcome is unexpected.
It is also surprising that the coefficient of SM is not significant, even at the 5% level, using a one-sided test.
However, assortative mating leads to correlation between SM and SF and the regression appears to be suffering from multicollinearity.
. cor SM SF
(obs=540)

        |     SM     SF
--------+------------------
     SM | 1.0000
     SF | 0.6241  1.0000
Suppose that we hypothesize that mother's and father's education are equally important. We can then impose the restriction $\beta_3 = \beta_4$.
$S = \beta_1 + \beta_2 ASVABC + \beta_3 (SM + SF) + u = \beta_1 + \beta_2 ASVABC + \beta_3 SP + u$
This allows us to rewrite the equation as shown.
Defining SP to be the sum of SM and SF, the equation may be rewritten as shown. The problem caused by the correlation between SM and SF has been eliminated.
. g SP=SM+SF
. reg S ASVABC SP
Source | SS df MS Number of obs = 540-------------+------------------------------ F( 2, 537) = 156.04 Model | 1177.98338 2 588.991689 Prob > F = 0.0000 Residual | 2026.99996 537 3.77467403 R-squared = 0.3675-------------+------------------------------ Adj R-squared = 0.3652 Total | 3204.98333 539 5.94616574 Root MSE = 1.9429
------------------------------------------------------------------------------ S | Coef. Std. Err. t P>|t| [95% Conf. Interval]-------------+---------------------------------------------------------------- ASVABC | .1253106 .0098434 12.73 0.000 .1059743 .1446469 SP | .0828368 .0164247 5.04 0.000 .0505722 .1151014 _cons | 5.29617 .4817972 10.99 0.000 4.349731 6.242608------------------------------------------------------------------------------
The estimate of $\beta_3$ is now 0.083 and highly significant.
Not surprisingly, this is a compromise between the coefficients of SM and SF in the previous specification.
The standard error of SP is much smaller than those of SM and SF. The use of the restriction has led to a large gain in efficiency and the problem of multicollinearity has been eliminated.
The t statistic is very high. Thus it would appear that imposing the restriction has improved the regression results. However, the restriction may not be valid. We should test it. Testing theoretical restrictions is one of the topics in Chapter 6.
F TESTS OF GOODNESS OF FIT
This sequence describes two F tests of goodness of fit in a multiple regression model. The first relates to the goodness of fit of the equation as a whole. We will consider the general case where there are k – 1 explanatory variables. For the F test of goodness of fit of the equation as a whole, the null hypothesis, in words, is that the model has no explanatory power at all. The model will have no explanatory power if it turns out that Y is unrelated to any of the explanatory variables. Mathematically, therefore, the null hypothesis is that all the coefficients β2, ..., βk are zero.
The alternative hypothesis is that at least one of these coefficients is different from zero. In the multiple regression model there is a difference between the roles of the F and t tests. The F test tests the joint explanatory power of the variables, while the t tests test their explanatory power individually. In the simple regression model the F test was equivalent to the (two-sided) t test on the slope coefficient because the ‘group’ consisted of just one variable.
Y = β1 + β2X2 + ... + βkXk + u
H0: β2 = ... = βk = 0
H1: at least one βj ≠ 0
F(k – 1, n – k) = [ESS/(k – 1)] / [RSS/(n – k)] = [R2/(k – 1)] / [(1 – R2)/(n – k)]
ESS / TSS is the definition of R2. RSS / TSS is equal to (1 – R2). (See the last sequence in Chapter 2.)
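Because of these identities, the two forms of the F statistic must agree. A quick numerical check in Python, using the sums of squares from the educational attainment regression used in this sequence:

```python
# Sums of squares from the Stata ANOVA table for: reg S ASVABC SM SF
ESS, RSS, TSS = 1181.36981, 2023.61353, 3204.98333
n, k = 540, 4  # 540 observations, 4 parameters (intercept + 3 slopes)

# F in terms of sums of squares
F_ss = (ESS / (k - 1)) / (RSS / (n - k))

# F in terms of R-squared, using R2 = ESS/TSS and 1 - R2 = RSS/TSS
R2 = ESS / TSS
F_r2 = (R2 / (k - 1)) / ((1 - R2) / (n - k))

print(round(F_ss, 2), round(F_r2, 2))  # both print as 104.3
```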
H0: β2 = β3 = β4 = 0
The educational attainment model will be used as an example. We will suppose that S depends on ASVABC, the ability score, and on SM and SF, the highest grades completed by the mother and father of the respondent, respectively. The null hypothesis for the F test of goodness of fit is that all three slope coefficients are equal to zero. The alternative hypothesis is that at least one of them is non-zero.
S = β1 + β2ASVABC + β3SM + β4SF + u
. reg S ASVABC SM SF
      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  3,   536) =  104.30
       Model |  1181.36981     3  393.789935           Prob > F      =  0.0000
    Residual |  2023.61353   536  3.77539837           R-squared     =  0.3686
-------------+------------------------------           Adj R-squared =  0.3651
       Total |  3204.98333   539  5.94616574           Root MSE      =   1.943

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |   .1257087   .0098533    12.76   0.000     .1063528    .1450646
          SM |   .0492424   .0390901     1.26   0.208     -.027546    .1260309
          SF |   .1076825   .0309522     3.48   0.001       .04688    .1684851
       _cons |   5.370631   .4882155    11.00   0.000      4.41158    6.329681
------------------------------------------------------------------------------
Here is the regression output using Data Set 21.
F(k – 1, n – k) = [ESS/(k – 1)] / [RSS/(n – k)]
F(3, 536) = (1181/3) / (2024/536) = 104.3
The numerator of the F statistic is the explained sum of squares divided by k – 1. In the Stata output these numbers are given in the Model row.
The denominator is the residual sum of squares divided by the number of degrees of freedom remaining.
Hence the F statistic is 104.3. All serious regression packages compute it for you as part of the diagnostics in the regression output.
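The same calculation can be reproduced directly from the Model and Residual rows of the output (a minimal sketch in Python):

```python
# Each row of the Stata ANOVA table gives a sum of squares (SS), its
# degrees of freedom (df), and their ratio, the mean square (MS).
# F is MS(Model) / MS(Residual).
ms_model = 1181.36981 / 3    # ESS / (k - 1)
ms_resid = 2023.61353 / 536  # RSS / (n - k)
F = ms_model / ms_resid
print(round(F, 2))           # 104.3, matching F(3, 536) in the header
```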
The critical value for F(3,536) is not given in the F tables, but we know it must be lower than F(3,500), which is given. At the 0.1% level, this is 5.51. Hence we easily reject H0 at the 0.1% level.
Fcrit,0.1%(3,500) = 5.51
It is unusual for the F statistic not to be significant if some of the t statistics are significant. In principle it could happen, though.
• Suppose that you ran a regression with 40 explanatory variables, none being a true determinant of the dependent variable. Then the F statistic should be low enough for H0 not to be rejected. However, if you are performing t tests on the slope coefficients at the 5% level, with a 5% chance of a Type I error, on average 2 of the 40 variables could be expected to have ‘significant’ coefficients.
• The opposite can easily happen, though. Suppose you have a multiple regression model which is correctly specified and the R2 is high. You would expect to have a highly significant F statistic. However, if the explanatory variables are highly correlated and the model is subject to severe multicollinearity, the standard errors of the slope coefficients could all be so large that none of the t statistics is significant. In this situation you would know that your model is a good one, but you are not in a position to pinpoint the contributions made by the explanatory variables individually.
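The arithmetic behind the first point can be sketched as follows; the ‘at least one’ probability is an added illustration assuming independent tests, not a figure from the slides:

```python
# With 40 irrelevant regressors and 5% t tests, the expected number of
# 'significant' coefficients purely by chance is 40 x 0.05 = 2.
n_vars, alpha = 40, 0.05
expected_false_positives = n_vars * alpha
print(expected_false_positives)       # 2.0

# If the tests were independent, the chance of at least one spurious
# 'significant' t statistic would be high.
p_at_least_one = 1 - (1 - alpha) ** n_vars
print(round(p_at_least_one, 2))       # 0.87
```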
Original model: Y = β1 + β2X2 + u,  residual sum of squares RSS1
Revised model:  Y = β1 + β2X2 + β3X3 + β4X4 + u,  residual sum of squares RSS2
We now come to the other F test of goodness of fit. This is a test of the joint explanatory power of a group of variables when they are added to a regression model. For example, in the original specification, Y may be written as a simple function of X2. In the second, we add X3 and X4.
H0: β3 = β4 = 0
H1: β3 ≠ 0 or β4 ≠ 0 or both
The null hypothesis for the F test is that neither X3 nor X4 belongs in the model. The alternative hypothesis is that at least one of them does, perhaps both.
F(cost, d.f. remaining) = (improvement / cost) / (remaining unexplained / degrees of freedom remaining)
For this F test, and for several others which we will encounter, it is useful to think of the F statistic as having the structure indicated above. The ‘improvement’ is the reduction in the residual sum of squares when the change is made, in this case, when the group of new variables is added. The ‘cost’ is the reduction in the number of degrees of freedom remaining after making the change. In the present case it is equal to the number of new variables added, because that number of new parameters are estimated. (Remember that the number of degrees of freedom in a regression equation is the number of observations, less the number of parameters estimated. In this example, it would fall from n – 2 to n – 4 when X3 and X4 are added.)
The ‘remaining unexplained’ is the residual sum of squares after making the change. The ‘degrees of freedom remaining’ is the number of degrees of freedom remaining after making the change.
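This structure can be captured in a short helper (an illustrative Python sketch; the function name is invented here):

```python
# A minimal sketch of the 'improvement / cost' F statistic for testing the
# joint significance of a group of variables added to a regression.
def nested_f(rss_restricted, rss_unrestricted, n_new_vars, df_remaining):
    """F statistic for adding a group of variables.

    improvement           = RSS1 - RSS2 (fall in the residual sum of squares)
    cost                  = number of new variables added
    remaining unexplained = RSS2
    """
    improvement = rss_restricted - rss_unrestricted
    return (improvement / n_new_vars) / (rss_unrestricted / df_remaining)

# Example with the educational attainment numbers used in this sequence:
print(round(nested_f(2123.01275, 2023.61353, 2, 536), 2))  # 13.16
```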
. reg S ASVABC
      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =  274.19
       Model |  1081.97059     1  1081.97059           Prob > F      =  0.0000
    Residual |  2123.01275   538  3.94612035           R-squared     =  0.3376
-------------+------------------------------           Adj R-squared =  0.3364
       Total |  3204.98333   539  5.94616574           Root MSE      =  1.9865

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |    .148084   .0089431    16.56   0.000     .1305165    .1656516
       _cons |   6.066225   .4672261    12.98   0.000     5.148413    6.984036
------------------------------------------------------------------------------
We will illustrate the test with an educational attainment example. Here is S regressed on ASVABC using Data Set 21. We make a note of the residual sum of squares.
. reg S ASVABC SM SF
      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  3,   536) =  104.30
       Model |  1181.36981     3  393.789935           Prob > F      =  0.0000
    Residual |  2023.61353   536  3.77539837           R-squared     =  0.3686
-------------+------------------------------           Adj R-squared =  0.3651
       Total |  3204.98333   539  5.94616574           Root MSE      =   1.943

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |   .1257087   .0098533    12.76   0.000     .1063528    .1450646
          SM |   .0492424   .0390901     1.26   0.208     -.027546    .1260309
          SF |   .1076825   .0309522     3.48   0.001       .04688    .1684851
       _cons |   5.370631   .4882155    11.00   0.000      4.41158    6.329681
------------------------------------------------------------------------------
Now we have added the highest grade completed by each parent. Does parental education have a significant impact? Well, we can see that a t test would show that SF has a highly significant coefficient, but we will perform the F test anyway. We make a note of RSS.
The F statistic is 13.16. The critical value of F(2,500) at the 0.1% level is 7.00. The critical value of F(2,536) must be lower, so we reject H0 and conclude that the parental education variables do have significant joint explanatory power.
F(2, 540 – 4) = [(RSS1 – RSS2)/2] / [RSS2/(540 – 4)] = [(2123.0 – 2023.6)/2] / [2023.6/536] = 13.16
Fcrit,0.1%(2,500) = 7.00
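The calculation, reproduced in Python with the residual sums of squares from the two regressions above:

```python
# Test of the joint explanatory power of SM and SF.
RSS1 = 2123.01275        # reg S ASVABC
RSS2 = 2023.61353        # reg S ASVABC SM SF
cost = 2                 # two new variables, SM and SF
df_remaining = 540 - 4

F = ((RSS1 - RSS2) / cost) / (RSS2 / df_remaining)
print(round(F, 2))       # 13.16

# Compare with the tabulated critical value at the 0.1% level:
F_crit = 7.00            # F(2,500); the value for F(2,536) is lower
print(F > F_crit)        # True: reject H0
```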
Original model: Y = β1 + β2X2 + β3X3 + u,  residual sum of squares RSS1
Revised model:  Y = β1 + β2X2 + β3X3 + β4X4 + u,  residual sum of squares RSS2
This sequence will conclude by showing that t tests are equivalent to marginal F tests when the additional group of variables consists of just one variable. Suppose that in the original model Y is a function of X2 and X3, and that in the revised model X4 is added.
H0: β4 = 0
H1: β4 ≠ 0
The null hypothesis for the F test of the explanatory power of the additional ‘group’ is that all the new slope coefficients are equal to zero. There is of course only one new slope coefficient, β4.
The F test has the usual structure. We will illustrate it with an educational attainment model where S depends on ASVABC and SM in the original model and on SF as well in the revised model.
. reg S ASVABC SM
      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =  147.36
       Model |  1135.67473     2  567.837363           Prob > F      =  0.0000
    Residual |  2069.30861   537  3.85346109           R-squared     =  0.3543
-------------+------------------------------           Adj R-squared =  0.3519
       Total |  3204.98333   539  5.94616574           Root MSE      =   1.963

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |   .1328069   .0097389    13.64   0.000     .1136758     .151938
          SM |   .1235071   .0330837     3.73   0.000     .0585178    .1884963
       _cons |   5.420733   .4930224    10.99   0.000     4.452244    6.389222
------------------------------------------------------------------------------
Here is the regression of S on ASVABC and SM. We make a note of the residual sum of squares.
Now we add SF and again make a note of the residual sum of squares.
. reg S ASVABC SM SF
      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  3,   536) =  104.30
       Model |  1181.36981     3  393.789935           Prob > F      =  0.0000
    Residual |  2023.61353   536  3.77539837           R-squared     =  0.3686
-------------+------------------------------           Adj R-squared =  0.3651
       Total |  3204.98333   539  5.94616574           Root MSE      =   1.943

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |   .1257087   .0098533    12.76   0.000     .1063528    .1450646
          SM |   .0492424   .0390901     1.26   0.208     -.027546    .1260309
          SF |   .1076825   .0309522     3.48   0.001       .04688    .1684851
       _cons |   5.370631   .4882155    11.00   0.000      4.41158    6.329681
------------------------------------------------------------------------------
The improvement on adding SF is the reduction in the residual sum of squares.
F(1, 540 – 4) = [(RSS1 – RSS2)/1] / [RSS2/(540 – 4)] = [(2069.3 – 2023.6)/1] / [2023.6/536] = 12.10
The cost is just the single degree of freedom lost when estimating β4.
The remaining unexplained is the residual sum of squares after adding SF.
The number of degrees of freedom remaining after adding SF is 540 – 4 = 536.
Fcrit,0.1%(1,500) = 10.96
The critical value of F at the 0.1% significance level with 500 degrees of freedom is 10.96. The critical value with 536 degrees of freedom must be lower, so we reject H0 at the 0.1% level.
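Reproducing the marginal F test in Python, with the residual sums of squares from the two regressions above:

```python
# The marginal F test for adding SF alone.
RSS1 = 2069.30861        # reg S ASVABC SM
RSS2 = 2023.61353        # reg S ASVABC SM SF
F = ((RSS1 - RSS2) / 1) / (RSS2 / (540 - 4))
print(round(F, 2))       # 12.1

F_crit = 10.96           # F(1,500) at the 0.1% level; F(1,536) is lower
print(F > F_crit)        # True: reject H0
```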
The null hypothesis we are testing is exactly the same as for a two-sided t test on the coefficient of SF.
We will perform the t test. The t statistic is 3.48.
The critical value of t at the 0.1% level with 500 degrees of freedom is 3.31. The critical value with 536 degrees of freedom must be lower. So we reject H0 again.
tcrit,0.1% = 3.31
It can be shown that the F statistic for the F test of the explanatory power of a ‘group’ of one variable must be equal to the square of the t statistic for that variable. (The difference in the last digit is due to rounding error.)
3.48² = 12.11
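The rounding discrepancy disappears if the unrounded t statistic is used (a quick Python check):

```python
# The t statistic for SF and the marginal F statistic for adding SF.
coef_SF, se_SF = 0.1076825, 0.0309522
t = coef_SF / se_SF
print(round(t, 2))                    # 3.48

RSS1, RSS2 = 2069.30861, 2023.61353
F = (RSS1 - RSS2) / (RSS2 / 536)
print(round(t ** 2, 2), round(F, 2))  # 12.1 and 12.1: t squared equals F
```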
3.31² = 10.96
It can also be shown that the critical value of F must be equal to the square of the critical value of t. (The critical values shown are for 500 degrees of freedom, but this must also be true for 536 degrees of freedom.)
Hence the conclusions of the two tests must coincide.
This result means that the t test of the coefficient of a variable is a test of its marginal explanatory power, after all the other variables have been included in the equation.
• If the variable is correlated with one or more of the other variables, its marginal explanatory power may be quite low, even if it genuinely belongs in the model.
• If all the variables are correlated, it is possible for all of them to have low marginal explanatory power and for none of the t tests to be significant, even though the F test for their joint explanatory power is highly significant.
• If this is the case, the model is said to be suffering from the problem of multicollinearity discussed earlier.