
Transcript of Regression Lectures Covariance and Correlation

Page 1: Regression Lectures Covariance and Correlation

Regression Lectures

So far we have talked only about statistics that describe one variable. What we are going to be discussing for much of the remainder of the course is relationships between two or more variables. To do this there are several concepts that it would be good to have a grasp of.

Covariance and Correlation

Recall that

$$\sigma_x^2 = Var(x) = E(x - \mu_x)^2$$

which we estimate with the sample variance

$$S_x^2 = \frac{\sum_i (x_i - \bar{x})^2}{n - 1}$$

The covariance is the two-variable analog to the variance. The formula for the covariance between two variables is

$$\sigma_{xy} = Cov(X, Y) = E(x - \mu_x)(y - \mu_y)$$

which we estimate with the sample covariance

$$S_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

When two variables are negatively related the covariance will be negative. When two variables are positively related the covariance will be positive. (Go through some scatter plots.) One potential problem with the covariance is that its magnitude is hard to interpret. With the variance we could take the square root to get the standard deviation, and if we knew that x was approximately normal this standard deviation conveyed useful information. We cannot do anything comparable with the covariance. Another problem with the covariance is that it depends on the units of measurement. If we are interested in the relationship between income and years of education, we would get a different covariance if we defined income in thousands of dollars than we would defining income in dollars.
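As a rough illustration, the sketch below uses made-up education and income numbers to show that rescaling income from dollars to thousands of dollars rescales the covariance, while (previewing the correlation defined on the next page) the correlation is unchanged:

import numpy as np

# Made-up data: years of education and annual income (in dollars) for five people.
educ = np.array([10, 12, 14, 16, 18], dtype=float)
income_dollars = np.array([25_000, 32_000, 41_000, 55_000, 70_000], dtype=float)
income_thousands = income_dollars / 1_000   # same information, different units

def sample_cov(x, y):
    """Sample covariance: sum((x_i - xbar)(y_i - ybar)) / (n - 1)."""
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

def sample_corr(x, y):
    """Sample correlation: covariance divided by both sample standard deviations."""
    return sample_cov(x, y) / (np.std(x, ddof=1) * np.std(y, ddof=1))

print(sample_cov(educ, income_dollars))     # about 56,500  (income in dollars)
print(sample_cov(educ, income_thousands))   # about 56.5    (income in thousands)
print(sample_corr(educ, income_dollars))    # about 0.99 ...
print(sample_corr(educ, income_thousands))  # ... and identical in either units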

Page 2: Regression Lectures Covariance and Correlation

To correct these problems we can use the correlation between x and y. The correlation is given by

$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$

which we estimate with the sample correlation

$$r_{xy} = \frac{S_{xy}}{S_x S_y} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$$

This is often called the Pearson product moment correlation coefficient. Both the population correlation and the sample correlation must be between –1 and 1. If two variables are perfectly negatively correlated the correlation will be –1; if they are perfectly positively correlated it will be 1. The covariance and correlation describe the strength of the linear relationship between two variables. Neither measure, however, describes the relationship itself (for example, by how much y changes when x changes). To describe the linear relationship between two variables we need simple linear regression analysis.

Simple Linear Regression Analysis

What is going on in the population? There are two variables that are related.

y is the dependent variable.

x is the independent variable. In theory y should be caused by (or dependent on) x. Suppose that there is a linear relationship between x and y such that

$$y = \beta_0 + \beta_1 x + e$$

The way to think about this is that in the population y is being generated as the sum of $\beta_0$ (an intercept term), $\beta_1 x$ (a function of the independent variable), and $e$ (an error term).

Page 3: Regression Lectures Covariance and Correlation

We can think of $\beta_0$ and $\beta_1$ as population parameters the same way we thought about $\mu$ as a population parameter. Note that

$$E(y) = \beta_0 + \beta_1 \cdot E(x) = \beta_0 + \beta_1 \cdot \mu_x \qquad \text{and} \qquad E(y \mid x) = \beta_0 + \beta_1 \cdot x$$

We have already dealt with a special case of this model,

$$y = \beta_0 + e$$

In this case $E(y) = \beta_0 = \mu_y$ and $E(y \mid x) = \beta_0 = \mu_y$. Suppose that we had a sample that looked something like the following

$$\begin{array}{cc}
y_1 & x_1 \\
y_2 & x_2 \\
\vdots & \vdots \\
y_n & x_n
\end{array}$$

Then we would be able to estimate the population parameters $\beta_0$ and $\beta_1$. Let's consider the data that you had last week for a homework assignment. This was the data on unemployment duration and age that you used in the case problem. The sample covariance between these two variables is 76.60 and the correlation coefficient is 0.66. So we know (from looking at the scatter plot and from these measures) that there is a positive relationship between weeks unemployed and age. What sorts of things would lead to this type of positive association? Which of these variables would be the independent variable and which would be the dependent variable if we were setting this up as a simple linear regression?

Page 4: Regression Lectures Covariance and Correlation

To fit a line to these data we would basically have to pick values for (or estimate) $\beta_0$ and $\beta_1$. The estimate of the parameter $\beta_0$ will give us our y-intercept and the estimate of the parameter $\beta_1$ will give us the slope. How can I go about finding estimates for $\beta_0$ and $\beta_1$?

[Scatterplot of Weeks Unemployed against Age, annotated with $S_{xy} = 76.60$, $r_{xy} = 0.66$, $S_x^2 = 142.69$, and $\bar{x} = 36.6$.]

Page 5: Regression Lectures Covariance and Correlation

If we have sample data we can write our original equation as

$$\begin{aligned}
e_1 &= y_1 - \beta_0 - \beta_1 x_1 \\
e_2 &= y_2 - \beta_0 - \beta_1 x_2 \\
&\;\;\vdots \\
e_n &= y_n - \beta_0 - \beta_1 x_n
\end{aligned}$$

We want these errors to be as small as possible in absolute value. The way we make them as small as possible is to find the $\beta_0$ and $\beta_1$ that minimize

$$\sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2$$

with respect to $\beta_0$ and $\beta_1$. When we do this we find that

$$\hat{\beta}_1 = \frac{S_{xy}}{S_x^2} = \frac{76.6}{142.69} = 0.537$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \cdot \bar{x} = 15.54 - 0.54 \cdot 36.6 = -4.11$$

So

$$\widehat{WU} = -4.11 + 0.54 \cdot Age$$
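A small computational sketch of these formulas (the data below are simulated for illustration, not the case-problem data): it computes $\hat{\beta}_1 = S_{xy}/S_x^2$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$ and checks them against numpy's built-in straight-line fit.

import numpy as np

rng = np.random.default_rng(0)

# Simulated (age, weeks unemployed)-style data; NOT the case-problem data.
age = rng.uniform(20, 65, size=50)
weeks = -4.0 + 0.5 * age + rng.normal(0, 7.0, size=50)

# Slope and intercept from the formulas in the notes:
#   beta1_hat = S_xy / S_x^2   and   beta0_hat = ybar - beta1_hat * xbar
s_xy = np.sum((age - age.mean()) * (weeks - weeks.mean())) / (len(age) - 1)
s_xx = np.var(age, ddof=1)
beta1_hat = s_xy / s_xx
beta0_hat = weeks.mean() - beta1_hat * age.mean()

# Cross-check against numpy's degree-1 polynomial (straight line) fit.
slope, intercept = np.polyfit(age, weeks, 1)
print(beta1_hat, slope)        # the two slopes agree
print(beta0_hat, intercept)    # and so do the intercepts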

Page 6: Regression Lectures Covariance and Correlation

A Few Special Properties of Least Squares Estimates

1) The least squares line always goes through the means. This means that $(\bar{x}, \bar{y})$ is always on the least squares line. Algebraically, it means that

$$\bar{y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{x}$$

Graphically it means that

[Scatterplot of Weeks Unemployed against Age with the fitted least squares line passing through $(\bar{x}, \bar{y}) = (36.6, 15.54)$.]

2) $\sum_i x_i e_i = \sum_i x_i \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) = 0$. This is just a product of how we chose $\hat{\beta}_0$ and $\hat{\beta}_1$. It is a mathematical fact. I cannot really give you any intuition for it without confusing you.
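These two properties are easy to verify numerically. The sketch below uses simulated data (the population line is made up) and checks both properties for the fitted line.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 + 3.0 * x + rng.normal(size=200)    # made-up population line

# Least squares fit and residuals.
b1 = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
e = y - b0 - b1 * x

# Property 1: the fitted line passes through (xbar, ybar).
print(np.isclose(y.mean(), b0 + b1 * x.mean()))   # True

# Property 2: sum of x_i * e_i is zero (and, with an intercept, so is sum of e_i).
print(np.isclose(np.sum(x * e), 0.0))             # True, up to rounding error
print(np.isclose(np.sum(e), 0.0))                 # True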

Page 7: Regression Lectures Covariance and Correlation

The Coefficient of Determination

We might be interested in how well our estimated line fits. In particular we might be interested in determining how much better our line fits than the sample mean of y. We could estimate y with its sample mean (one parameter) or we could fit a line (by estimating two parameters, $\beta_0$ and $\beta_1$). How much better do we do in terms of fit by estimating two parameters? Note that

$$y_i - \bar{y} = \left( \hat{\beta}_0 + \hat{\beta}_1 x_i + e_i \right) - \left( \hat{\beta}_0 + \hat{\beta}_1 \bar{x} \right) = \hat{\beta}_1 (x_i - \bar{x}) + e_i$$

Using the above we can write

$$\sum_i (y_i - \bar{y})^2 = \hat{\beta}_1^2 \sum_i (x_i - \bar{x})^2 + \sum_i e_i^2 + 2 \hat{\beta}_1 \sum_i (x_i - \bar{x}) e_i = \hat{\beta}_1^2 \sum_i (x_i - \bar{x})^2 + \sum_i e_i^2$$

(the cross term vanishes because of the two properties above)

$$\Rightarrow \quad SST = SSR + SSE$$

where

$$SST = \sum_i (y_i - \bar{y})^2, \qquad SSR = \hat{\beta}_1^2 \sum_i (x_i - \bar{x})^2, \qquad SSE = \sum_i e_i^2$$

The coefficient of determination (R-squared), defined as

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST},$$

is a measure of goodness of fit. It basically tells us how much better the line fits than the sample mean of y. Another way to think about R-squared is that it tells us the percentage of the variation in y away from its mean that is explained by x rather than by unknown factors (the residuals). It is useful to think of a couple of specific cases.

Page 8: Regression Lectures Covariance and Correlation

[Scatterplot of Weeks Unemployed against Age with the fitted least squares line.]

OK, let's think about the case when x and y are not correlated. The mean of y does just as well as any line. Indeed the slope estimate should be close to zero, so the mean of y is close to the least squares line. One thing you should know: for simple linear regression analysis (with one independent variable) the square root of R-squared is the absolute value of the sample correlation.
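A quick numerical check of the decomposition and of the link between R-squared and the sample correlation (again with simulated data and made-up coefficients):

import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(20, 65, size=50)
y = 5.0 + 0.4 * x + rng.normal(0, 8.0, size=50)   # made-up coefficients

b1 = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
e = y - y_hat

sst = np.sum((y - y.mean()) ** 2)      # total variation in y
ssr = np.sum((y_hat - y.mean()) ** 2)  # variation explained by the line
sse = np.sum(e ** 2)                   # variation left in the residuals

print(np.isclose(sst, ssr + sse))                  # SST = SSR + SSE
r_squared = ssr / sst
r = np.corrcoef(x, y)[0, 1]
print(np.isclose(r_squared, 1 - sse / sst))        # True
print(np.isclose(np.sqrt(r_squared), abs(r)))      # sqrt(R^2) = |r| with one x variable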

Page 9: Regression Lectures Covariance and Correlation

Inference and Simple Linear Regression

Thus far we have talked only about the mechanics of least squares. We have not talked about how to use estimates to make inferences about population parameters. This is the next step. Before we do this we need to make some assumptions.

Assumptions

1) We got the functional form correct: $y_i = \beta_0 + \beta_1 x_i + e_i \;\; \forall i$
2) Zero mean disturbance: $E(e_i) = 0 \;\; \forall i$
3) Constant variance disturbance: $Var(e_i) = \sigma^2 \;\; \forall i$
4) Errors not autocorrelated: $Cov(e_i, e_j) = 0 \;\; \forall i \neq j$
5) Regressors are non-stochastic: $Var(x_i) = 0 \;\; \forall i$
6) $e_i \sim N(0, \sigma^2) \;\; \forall i$. This is not strictly necessary but does allow us to make inferences in the case when sample sizes are not very large.

The Sampling Distribution of $\hat{\beta}_1$

If we want to find the sampling distribution of $\hat{\beta}_1$ we need to specify the mean, the variance, and the distribution of $\hat{\beta}_1$.

1) Let's work on finding the mean.

$$\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{\sum_i (x_i - \bar{x}) \left[ \beta_1 (x_i - \bar{x}) + e_i \right]}{\sum_i (x_i - \bar{x})^2} = \beta_1 \frac{\sum_i (x_i - \bar{x})^2}{\sum_i (x_i - \bar{x})^2} + \frac{\sum_i (x_i - \bar{x}) e_i}{\sum_i (x_i - \bar{x})^2} = \beta_1 + \frac{\sum_i (x_i - \bar{x}) e_i}{\sum_i (x_i - \bar{x})^2}$$

Taking the expected value (and using the facts that the x's are non-stochastic and that $E(e_i) = 0$) we find

$$E(\hat{\beta}_1) = \beta_1 + E\left[ \frac{\sum_i (x_i - \bar{x}) e_i}{\sum_i (x_i - \bar{x})^2} \right] = \beta_1$$

So $\hat{\beta}_1$ has mean $\beta_1$. It is an unbiased estimator. This is equivalent to the result $E(\bar{x}) = \mu$ for the sample mean.

Page 10: Regression Lectures Covariance and Correlation

2) Let's find the variance.

$$Var(\hat{\beta}_1) = E(\hat{\beta}_1 - \beta_1)^2 = E\left[ \frac{\sum_i (x_i - \bar{x}) e_i}{\sum_i (x_i - \bar{x})^2} \right]^2 = \frac{1}{\left[ \sum_i (x_i - \bar{x})^2 \right]^2} E\left[ \sum_i (x_i - \bar{x}) e_i \right]^2 = \frac{\sigma^2 \sum_i (x_i - \bar{x})^2}{\left[ \sum_i (x_i - \bar{x})^2 \right]^2} = \frac{\sigma^2}{\sum_i (x_i - \bar{x})^2}$$

(The cross terms drop out because the errors are not autocorrelated, and each squared term contributes $\sigma^2 (x_i - \bar{x})^2$.) This is equivalent to the result

$$Var(\bar{x}) = \frac{\sigma^2}{n}$$

for the sample mean.

3) What is the distribution of $\hat{\beta}_1$?

If the disturbances are normal then $\hat{\beta}_1$ is linear in the disturbances, so $\hat{\beta}_1$ is normal. If the disturbances are not normal and the sample size is large enough, we have a central limit theorem that says the distribution of $\hat{\beta}_1$ is approximately normal.
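A small simulation (illustrative only; the parameter values are made up) that checks all three pieces at once: across repeated samples $\hat{\beta}_1$ averages to $\beta_1$, its variance is close to $\sigma^2/\sum_i (x_i - \bar{x})^2$, and the draws are approximately normal.

import numpy as np

rng = np.random.default_rng(3)

beta0, beta1, sigma = -4.0, 0.5, 8.0            # made-up population values
x = rng.uniform(20, 65, size=50)                # fixed ("non-stochastic") regressors
sxx = np.sum((x - x.mean()) ** 2)

# Re-draw the errors many times and re-estimate beta1 in each sample.
b1_draws = np.empty(20_000)
for m in range(b1_draws.size):
    y = beta0 + beta1 * x + rng.normal(0, sigma, size=x.size)
    b1_draws[m] = np.sum((x - x.mean()) * (y - y.mean())) / sxx

print(b1_draws.mean())        # close to beta1 = 0.5 (unbiasedness)
print(b1_draws.var())         # close to the theoretical variance below
print(sigma**2 / sxx)         # sigma^2 / sum((x_i - xbar)^2)
# A histogram of b1_draws would look approximately normal, centered at beta1.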

OK, so all that we need to do is find an estimate of $\sigma^2$ and we are in business. Remember that $\sigma^2 = Var(e_i)$. Thus the appropriate estimator is

$$S^2 = \frac{\sum_i e_i^2}{n - 2} = \frac{SSE}{n - 2}$$

Now we can make inferences about $\beta_1$.

Page 11: Regression Lectures Covariance and Correlation

Confidence Intervals and Hypothesis Testing

From our estimate with the BLS data,

$$\begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{bmatrix} = \begin{bmatrix} -4.106 \\ 0.537 \end{bmatrix}$$

Note that $S^2 = 56.91$. Now we can estimate $Var(\hat{\beta}_1)$:

$$\sigma^2_{\hat{\beta}_1} = Var(\hat{\beta}_1) = \frac{\sigma^2}{\sum_i (x_i - \bar{x})^2} = \frac{\sigma^2}{(n-1) S_x^2}$$

$$\Rightarrow \quad S_{\hat{\beta}_1} = \sqrt{\frac{S^2}{(n-1) S_x^2}} = \sqrt{\frac{56.91}{6692}} \approx 0.090$$

Example) Let's use this to test the hypothesis

$$H_0: \beta_1 = 0 \qquad \text{vs.} \qquad H_a: \beta_1 \neq 0$$

In words, we are testing the null that age has no effect on the duration of unemployment spells against the alternative that it does. Choose a level of significance (0.01). Get the value of the test statistic:

$$z = \frac{0.537 - 0}{0.090} = 5.97$$

$\Rightarrow$ we would reject the null at almost any significance level.

Example) We could also construct a 95% confidence interval.
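One way to finish this example using the estimates above (the arithmetic here is illustrative):

$$\hat{\beta}_1 \pm 1.96 \cdot S_{\hat{\beta}_1} = 0.537 \pm 1.96 \cdot 0.090 = (0.361,\ 0.713)$$

So a 95% confidence interval for the effect of an additional year of age runs from about 0.36 to 0.71 weeks of unemployment.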

Page 12: Regression Lectures Covariance and Correlation

Multiple Regression

Here the model is similar to the simple linear model we already talked about. The only difference is that y is a linear function of more than one x variable. That is,

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + e_i$$

For this model

$$E(y \mid x_1, \ldots, x_k) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$

$$E(y) = \beta_0 + \beta_1 \mu_{x_1} + \beta_2 \mu_{x_2} + \cdots + \beta_k \mu_{x_k}$$

$$Var(y_i) = Var(e_i) = \sigma^2$$

Instead of just estimating 3 population parameters ($\beta_0$, $\beta_1$, and $\sigma^2 = Var(e_i)$) we will be estimating k+2 parameters: $\beta_0$ through $\beta_k$ (k+1 parameters) and $\sigma^2 = Var(e_i)$. We will use least squares to fit our line (actually it is a plane of dimension k in k+1 dimensional space). This means that $\hat{\beta}_0$ through $\hat{\beta}_k$ are the particular $\beta$s that minimize

$$\sum_i \left( y_i - \beta_0 - \beta_1 x_{1i} - \beta_2 x_{2i} - \cdots - \beta_k x_{ki} \right)^2$$

Unlike the case of the simple linear regression model, we are going to be unable to write down simple formulas for $\hat{\beta}_1$ through $\hat{\beta}_k$. As you shall see, we will still be able to write out a formula for $\hat{\beta}_0$.
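Even though we will not write out scalar formulas, fitting the model numerically is routine. A minimal sketch (simulated data, made-up coefficients) using numpy's least squares solver:

import numpy as np

rng = np.random.default_rng(4)
n = 200

# Simulated data with made-up coefficients: y = 1 + 2*x1 - 1*x2 + 0.5*x3 + e.
X = rng.normal(size=(n, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

# Least squares with an intercept: prepend a column of ones and minimize
# the sum over i of (y_i - b0 - b1*x1i - b2*x2i - b3*x3i)^2.
X1 = np.column_stack([np.ones(n), X])
beta_hat, _, _, _ = np.linalg.lstsq(X1, y, rcond=None)

print(beta_hat)      # roughly [1, 2, -1, 0.5]
e = y - X1 @ beta_hat
print(e @ X1)        # every entry is ~0: the residuals are orthogonal to each column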

Page 13: Regression Lectures Covariance and Correlation

Goodness of Fit and the Coefficient of Determination

Almost everything that I said about the R-squared in the bivariate regression model holds for the multivariate model. That is

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

where

$$SST = \sum_i (y_i - \bar{y})^2, \qquad SSR = \sum_i (\hat{y}_i - \bar{y})^2, \qquad SSE = \sum_i e_i^2$$

Because you can always improve R-squared by adding more x variables to the model there is something called the adjusted R-squared

$$R_a^2 = 1 - (1 - R^2) \frac{n - 1}{n - (k+1)}$$
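As a check on the formula, the layoffs regression reported later in these notes has n = 50, k = 7, and $R^2 = 0.5914$, so

$$R_a^2 = 1 - (1 - 0.5914) \frac{50 - 1}{50 - (7 + 1)} = 1 - 0.4086 \cdot \frac{49}{42} \approx 0.523,$$

which matches the Adj R-squared reported in that STATA output.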

I do not expect you to remember this, but you should understand the purpose of the adjusted R-squared.

Properties of Multivariate Least Squares

1) The least squares line goes through the means:

$$\bar{y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{x}_1 + \hat{\beta}_2 \bar{x}_2 + \cdots + \hat{\beta}_k \bar{x}_k$$

Here is where the formula for $\hat{\beta}_0$ comes in:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}_1 - \hat{\beta}_2 \bar{x}_2 - \cdots - \hat{\beta}_k \bar{x}_k$$

2) The least squares residuals wipe out the x's:

$$\sum_i x_{ij} e_i = 0 \quad \forall j = 1, \ldots, k$$

Page 14: Regression Lectures Covariance and Correlation

Assumptions

1) We got the functional form correct: $y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + e_i \;\; \forall i$
2) Zero mean disturbance: $E(e_i) = 0 \;\; \forall i$
3) Constant variance disturbance: $Var(e_i) = \sigma^2 \;\; \forall i$
4) Errors not autocorrelated: $Cov(e_i, e_j) = 0 \;\; \forall i \neq j$
5) Regressors are non-stochastic: $Var(x_{ij}) = 0 \;\; \forall i = 1, \ldots, n \text{ and } j = 1, \ldots, k$
6) $e_i \sim N(0, \sigma^2) \;\; \forall i$. This is not strictly necessary but does allow us to make inferences in the case when sample sizes are not very large.

Statistical Inference and the Linear Regression Model

We are going to want to characterize the distribution of $\hat{\beta}_0, \ldots, \hat{\beta}_k$. To do this we need three things.

(1) The mean of our estimates. As in the case of the bivariate model, under the assumptions above we can say that

$$E(\hat{\beta}_j) = \beta_j \quad \text{for all } j = 0, \ldots, k$$

(2) We want to characterize the variance of our estimates. While it is possible to write down formulas here, they are excessively complicated. It suffices to know that

$$Var(\hat{\beta}_j) = \sigma^2 \cdot f(\text{the } x\text{'s}) \quad \text{for all } j = 0, \ldots, k$$

(3) The distribution of our estimates. It is going to be normal for the same reason that the distribution was normal in the bivariate case (assumption 6).

If the sample size is small we need to make a normality assumption to conduct statistical inference. If the sample size is large then we have a central limit theorem that tells us that the $\hat{\beta}$s are approximately normal. Note that we will need an estimate of $\sigma^2$ to conduct statistical inference. We can estimate $\sigma^2$ with

$$\hat{\sigma}^2 = \frac{\sum_i e_i^2}{n - (k+1)} = \frac{SSE}{n - (k+1)}$$

Page 15: Regression Lectures Covariance and Correlation

The F-Test of a Set of Linear Restrictions

Example 1: Human Capital Model. Are returns to potential experience different for men and women? There are two relevant models here.

Unrestricted Model:

$$\ln(wage_i) = \beta_0 + \beta_1 \cdot ed_i + \beta_2 \cdot ex_i + \beta_3 \cdot exsq_i + \beta_4 \cdot fem_i \cdot ex_i + \beta_5 \cdot fem_i \cdot exsq_i + \beta_6 \cdot black_i + \beta_7 \cdot hisp_i + \beta_8 \cdot fem_i + e_i$$

Restricted Model:

$$\ln(wage_i) = \beta_0 + \beta_1 \cdot ed_i + \beta_2 \cdot ex_i + \beta_3 \cdot exsq_i + \beta_6 \cdot black_i + \beta_7 \cdot hisp_i + \beta_8 \cdot fem_i + e_i$$

The restrictions implied by the restricted model are $\beta_4 = \beta_5 = 0$, therefore my hypothesis test can be constructed as

$$H_0: \beta_4 = \beta_5 = 0 \qquad H_a: \beta_4 \neq 0 \text{ and/or } \beta_5 \neq 0$$

Page 16: Regression Lectures Covariance and Correlation

To evaluate the hypothesis I must estimate both the restricted and unrestricted models Restricted Model Estimates . reg lnwage ed ex exsq black hisp fem if year==1987 Source | SS df MS Number of obs = 5000 -------------+------------------------------ F( 6, 4993) = 447.22 Model | 424.074029 6 70.6790048 Prob > F = 0.0000 Residual | 789.105425 4993 .158042344 R-squared = 0.3496 -------------+------------------------------ Adj R-squared = 0.3488 Total | 1213.17945 4999 .242684428 Root MSE = .39755 ------------------------------------------------------------------------------ lnwage | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- ed | .08817 .0023512 37.50 0.000 .0835607 .0927793 ex | .0310741 .0016831 18.46 0.000 .0277745 .0343737 exsq | -.0004682 .0000379 -12.37 0.000 -.0005424 -.000394 black | -.1137154 .0194184 -5.86 0.000 -.151784 -.0756469 hisp | -.0604423 .0258856 -2.33 0.020 -.1111895 -.0096952 fem | -.2705865 .011385 -23.77 0.000 -.292906 -.2482669 _cons | 1.290962 .035745 36.12 0.000 1.220886 1.361038 ------------------------------------------------------------------------------

Unrestricted Model Estimates . reg lnwage ed ex exsq fex fexsq black hisp fem if year==1987 Source | SS df MS Number of obs = 5000 -------------+------------------------------ F( 8, 4991) = 355.71 Model | 440.536691 8 55.0670864 Prob > F = 0.0000 Residual | 772.642762 4991 .154807205 R-squared = 0.3631 -------------+------------------------------ Adj R-squared = 0.3621 Total | 1213.17945 4999 .242684428 Root MSE = .39346 ------------------------------------------------------------------------------ lnwage | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- ed | .0875369 .0023332 37.52 0.000 .0829629 .0921109 ex | .038817 .0022281 17.42 0.000 .0344489 .043185 exsq | -.0005508 .0000491 -11.21 0.000 -.0006471 -.0004545 fex | -.0151112 .0033731 -4.48 0.000 -.0217239 -.0084985 fexsq | .000129 .0000755 1.71 0.088 -.0000191 .000277 black | -.1096847 .0192485 -5.70 0.000 -.1474203 -.0719491 hisp | -.0616344 .0256242 -2.41 0.016 -.111869 -.0113998 fem | -.0576369 .0308256 -1.87 0.062 -.1180685 .0027947 _cons | 1.19554 .0371705 32.16 0.000 1.122669 1.26841 ------------------------------------------------------------------------------ Comparison of R-squared terms from the regressions.

$$F = \frac{(R^2 - R^2_*)/r}{(1 - R^2)/(n - (k+1))} = \frac{(0.3631 - 0.3496)/2}{(1 - 0.3631)/4991} \approx 52.9$$

Page 17: Regression Lectures Covariance and Correlation

Alternatively, a series of commands in STATA will give you the test statistic and the p-value associated with the test statistic.

. reg lnwage ed ex exsq fex fexsq black hisp fem if year==1987
. test fex==0
 ( 1) fex = 0.0
      F( 1, 4991) = 20.07
      Prob > F = 0.0000
. test fexsq==0, accum
 ( 1) fex = 0.0
 ( 2) fexsq = 0.0
      F( 2, 4991) = 53.17
      Prob > F = 0.0000

Page 18: Regression Lectures Covariance and Correlation

Example: Layoffs (From Chapter 16 of the text). Do managerial and sales workers have different expected unemployment durations than production workers?

• Age - age of the worker in years
• Education - highest school grade completed
• Married - a dummy variable equal to 1 if the worker is married, 0 otherwise
• Head - a dummy variable equal to 1 if the worker is the head of household, 0 otherwise
• Tenure - the number of years on the last job
• Manager - a dummy variable equal to 1 if the worker was employed as a manager on his last job, 0 otherwise
• Sales - a dummy variable equal to 1 if the worker was employed as a sales worker on his last job, 0 otherwise

Model:

$$WeeksUnem_i = \beta_0 + \beta_1 \cdot Age_i + \beta_2 \cdot Education_i + \beta_3 \cdot Married_i + \beta_4 \cdot Head_i + \beta_5 \cdot Tenure_i + \beta_6 \cdot Manager_i + \beta_7 \cdot Sales_i + e_i$$

Below is a version of what STATA prints out with its regression tool. . reg weeks age educ married head tenure manager sales Source | SS df MS Number of obs = 50 -------------+------------------------------ F( 7, 42) = 8.68 Model | 16250.4401 7 2321.49145 Prob > F = 0.0000 Residual | 11227.0799 42 267.311426 R-squared = 0.5914 -------------+------------------------------ Adj R-squared = 0.5233 Total | 27477.52 49 560.765714 Root MSE = 16.35 ------------------------------------------------------------------------------ weeks | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- age | 1.509288 .3040413 4.96 0.000 .8957076 2.122868 educ | -.6133109 .9361832 -0.66 0.516 -2.502605 1.275983 married | -10.74299 6.01238 -1.79 0.081 -22.87647 1.390482 head | -19.77948 5.837235 -3.39 0.002 -31.5595 -7.999463 tenure | .426465 .4668567 0.91 0.366 -.5156901 1.36862 manager | -26.74239 8.325661 -3.21 0.003 -43.54426 -9.940531 sales | -18.5609 6.280606 -2.96 0.005 -31.23567 -5.886121 _cons | 22.8507 18.86681 1.21 0.233 -15.22406 60.92547 ------------------------------------------------------------------------------

Page 19: Regression Lectures Covariance and Correlation

Restricted Model:

$$WeeksUnem_i = \gamma_0 + \gamma_1 \cdot Married_i + \gamma_2 \cdot Head_i + \gamma_3 \cdot Tenure_i + \gamma_4 \cdot Manager_i + \gamma_5 \cdot Sales_i + e_i$$

. reg weeks married head tenure manager sales Source | SS df MS Number of obs = 50 -------------+------------------------------ F( 5, 44) = 4.59 Model | 9423.20491 5 1884.64098 Prob > F = 0.0019 Residual | 18054.3151 44 410.325343 R-squared = 0.3429 -------------+------------------------------ Adj R-squared = 0.2683 Total | 27477.52 49 560.765714 Root MSE = 20.256 ------------------------------------------------------------------------------ weeks | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- married | -15.21677 7.132795 -2.13 0.039 -29.59197 -.8415648 head | -18.89676 7.082394 -2.67 0.011 -33.17039 -4.623132 tenure | 1.456754 .5026624 2.90 0.006 .4437045 2.469803 manager | -19.66126 9.954804 -1.98 0.055 -39.72385 .401328 sales | -15.67198 7.694462 -2.04 0.048 -31.17915 -.164812 _cons | 59.47015 10.64719 5.59 0.000 38.01216 80.92815 ------------------------------------------------------------------------------

* In actuality you would not need to estimate the restricted model to test the hypothesis that $\beta_1 = \beta_2 = 0$. In practice, the test statistic and p-value associated with this test can be obtained with the following series of commands in STATA:

. reg weeks age educ married head tenure manager sales
. test age==0
. test educ==0, accum

Page 20: Regression Lectures Covariance and Correlation

Example: Unemployed Workers (From Chapter 16 of the text)

• Age - age of the worker in years
• Education - highest school grade completed
• Married - a dummy variable equal to 1 if the worker is married, 0 otherwise
• Head - a dummy variable equal to 1 if the worker is the head of household, 0 otherwise
• Tenure - the number of years on the last job
• Manager - a dummy variable equal to 1 if the worker was employed as a manager on his last job, 0 otherwise
• Sales - a dummy variable equal to 1 if the worker was employed as a sales worker on his last job, 0 otherwise

Model:

$$WeeksUnem_i = \beta_0 + \beta_1 \cdot Age_i + \beta_2 \cdot Education_i + \beta_3 \cdot Married_i + \beta_4 \cdot Head_i + \beta_5 \cdot Tenure_i + \beta_6 \cdot Manager_i + \beta_7 \cdot Sales_i + e_i$$

Page 21: Regression Lectures Covariance and Correlation

. reg weeks age educ married head tenure manager sales Source | SS df MS Number of obs = 50 -------------+------------------------------ F( 7, 42) = 8.68 Model | 16250.4401 7 2321.49145 Prob > F = 0.0000 Residual | 11227.0799 42 267.311426 R-squared = 0.5914 -------------+------------------------------ Adj R-squared = 0.5233 Total | 27477.52 49 560.765714 Root MSE = 16.35 ------------------------------------------------------------------------------ weeks | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- age | 1.509288 .3040413 4.96 0.000 .8957076 2.122868 educ | -.6133109 .9361832 -0.66 0.516 -2.502605 1.275983 married | -10.74299 6.01238 -1.79 0.081 -22.87647 1.390482 head | -19.77948 5.837235 -3.39 0.002 -31.5595 -7.999463 tenure | .426465 .4668567 0.91 0.366 -.5156901 1.36862 manager | -26.74239 8.325661 -3.21 0.003 -43.54426 -9.940531 sales | -18.5609 6.280606 -2.96 0.005 -31.23567 -5.886121 _cons | 22.8507 18.86681 1.21 0.233 -15.22406 60.92547 ------------------------------------------------------------------------------

How can we interpret the coefficient on the education variable?

Let's test the hypothesis that education has no effect on the unemployment duration of manufacturing workers.

How would I state this hypothesis in terms of the population parameters?

What should the first step in testing this hypothesis be?

What do I do next?

What conclusions can I draw?

Page 22: Regression Lectures Covariance and Correlation

Testing a Set of Linear Restrictions (The F-Test)

What I am going to give you is a more general version of the F-test that is in the book. What I give you is way more powerful than what is in the book, so please pay attention. This is open material for the test. Consider the model

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + e_i$$

Suppose that we want to test the hypothesis that some subset of the coefficients is zero. We can form an F-statistic for this test.

Page 23: Regression Lectures Covariance and Correlation

$$F = \frac{(SSE_* - SSE)/r}{SSE/(n - (k+1))} \sim F(r,\ n - (k+1))$$

where

$$\begin{aligned}
r &= \text{the number of restrictions} \\
SSE_* &= \text{the error sum of squares from the restricted model} \\
SSE &= \text{the error sum of squares from the unrestricted model}
\end{aligned}$$

Just for the sake of being concrete, let's suppose that

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i} + e_i$$

and we want to test the null hypothesis that $\beta_1 = \beta_2 = 0$ against the alternative that $\beta_1 \neq 0$ and/or $\beta_2 \neq 0$. To test this hypothesis it turns out we will need an F-statistic. This F-statistic can be computed as

$$F = \frac{(SSE_* - SSE)/r}{SSE/(n - (k+1))}$$

where

$$SSE = \sum_i e_i^2 = \sum_i \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{1i} - \hat{\beta}_2 x_{2i} - \hat{\beta}_3 x_{3i} - \hat{\beta}_4 x_{4i} \right)^2$$

$$SSE_* = \sum_i e_i^{*2} = \sum_i \left( y_i - \hat{\gamma}_0 - \hat{\gamma}_1 x_{3i} - \hat{\gamma}_2 x_{4i} \right)^2$$

Page 24: Regression Lectures Covariance and Correlation

where the $\hat{\gamma}$s are least squares coefficient estimates from the model

$$y_i = \gamma_0 + \gamma_1 x_{3i} + \gamma_2 x_{4i} + v_i$$

The important thing about this model is that it is a restricted version of the first model. It is essentially the first model with the coefficients on the first two x variables set to zero. If the null hypothesis that $\beta_1 = \beta_2 = 0$ is correct, I would expect that the sum of squared residuals from the restricted model would not be much larger than the sum of squared residuals from the unrestricted model. It will be larger, because the restricted model uses fewer coefficients (fewer degrees of freedom in the fit), but it should not be much larger. There is an easy way to compute the test statistic

$$F = \frac{(SSE_* - SSE)/r}{SSE/(n - (k+1))}$$

If I divide the numerator and denominator by SST I get

$$F = \frac{\left[ \dfrac{SSE_* - SSE}{SST} \right] / r}{\left[ \dfrac{SSE}{SST} \right] / (n - (k+1))} = \frac{\left[ (1 - R^2_*) - (1 - R^2) \right] / r}{(1 - R^2)/(n - (k+1))} = \frac{(R^2 - R^2_*)/r}{(1 - R^2)/(n - (k+1))}$$

Basically all I need to do to conduct the test is estimate both the restricted and the unrestricted models, get the R-squareds from these regressions, form the F-statistic, and go to the tables.

Example: Unemployment Durations

Let's test the hypothesis that age and education jointly have no effect on weeks unemployed for laid-off manufacturing workers.

$$H_0: \beta_1 = \beta_2 = 0 \qquad H_a: \beta_1 \neq 0 \text{ and/or } \beta_2 \neq 0$$

I need to compute estimates from the restricted model

Page 25: Regression Lectures Covariance and Correlation

Model:

$$WeeksUnem_i = \gamma_0 + \gamma_1 \cdot Married_i + \gamma_2 \cdot Head_i + \gamma_3 \cdot Tenure_i + \gamma_4 \cdot Manager_i + \gamma_5 \cdot Sales_i + e_i$$

. reg weeks married head tenure manager sales Source | SS df MS Number of obs = 50 -------------+------------------------------ F( 5, 44) = 4.59 Model | 9423.20491 5 1884.64098 Prob > F = 0.0019 Residual | 18054.3151 44 410.325343 R-squared = 0.3429 -------------+------------------------------ Adj R-squared = 0.2683 Total | 27477.52 49 560.765714 Root MSE = 20.256 ------------------------------------------------------------------------------ weeks | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- married | -15.21677 7.132795 -2.13 0.039 -29.59197 -.8415648 head | -18.89676 7.082394 -2.67 0.011 -33.17039 -4.623132 tenure | 1.456754 .5026624 2.90 0.006 .4437045 2.469803 manager | -19.66126 9.954804 -1.98 0.055 -39.72385 .401328 sales | -15.67198 7.694462 -2.04 0.048 -31.17915 -.164812 _cons | 59.47015 10.64719 5.59 0.000 38.01216 80.92815 ------------------------------------------------------------------------------
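To complete the example, plug the sums of squares reported in the two outputs into the F formula. With r = 2 restrictions, n = 50, and k = 7,

$$F = \frac{(SSE_* - SSE)/r}{SSE/(n - (k+1))} = \frac{(18054.32 - 11227.08)/2}{11227.08/42} \approx \frac{3413.6}{267.3} \approx 12.8,$$

or equivalently, from the R-squareds, $\big[(0.5914 - 0.3429)/2\big] / \big[(1 - 0.5914)/42\big] \approx 12.8$. This is well above conventional critical values for an F with 2 and 42 degrees of freedom, so we would reject $H_0: \beta_1 = \beta_2 = 0$.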

Alternatively, a series of commands in STATA will give you the test statistic and the p-value associated with the test statistic:

. reg weeks age educ married head tenure manager sales
. test age==0
. test educ==0, accum