CHAPTER 3 CLASSICAL LINEAR REGRESSION MODELS

Key words: Classical linear regression, Conditional heteroskedasticity, Conditional homoskedasticity, F-test, GLS, Hypothesis testing, Model selection criterion, OLS, $R^2$, t-test

Abstract: In this chapter, we will introduce the classical linear regression theory, including the classical model assumptions, the statistical properties of the OLS estimator, the t-test and the F-test, as well as the GLS estimator and related statistical procedures. This chapter will serve as a starting point from which we will develop the modern econometric theory.
3.1 Assumptions
Suppose we have a random sample $\{Z_t\}_{t=1}^n$ of size $n$, where $Z_t = (Y_t, X_t')'$, $Y_t$ is a scalar, $X_t = (1, X_{1t}, X_{2t}, \ldots, X_{kt})'$ is a $(k+1) \times 1$ vector, $t$ is an index (either cross-sectional unit or time period) for observations, and $n$ is the sample size. We are interested in modeling the conditional mean $E(Y_t|X_t)$ using an observed realization (i.e., a data set) of the random sample $\{Y_t, X_t'\}'$, $t = 1, \ldots, n$.

Notations:
(i) $K \equiv k+1$ is the number of regressors, consisting of $k$ economic variables and an intercept.
(ii) The index $t$ may denote an individual unit (e.g., a firm, a household, a country) for cross-sectional data, or a time period (e.g., day, week, month, year) in a time series context.
We first list and discuss the assumptions of the classical linear regression theory.
Assumption 3.1 [Linearity]:
$$Y_t = X_t'\beta^0 + \varepsilon_t, \quad t = 1, \ldots, n,$$
where $\beta^0$ is a $K \times 1$ unknown parameter vector, and $\varepsilon_t$ is an unobservable disturbance.
Remarks:
(i) In Assumption 3.1, $Y_t$ is the dependent variable (or regressand), $X_t$ is the vector of regressors (or independent variables, or explanatory variables), and $\beta^0$ is the regression coefficient vector. When the linear model is correctly specified for the conditional mean $E(Y_t|X_t)$, i.e., when $E(\varepsilon_t|X_t) = 0$, the parameter $\beta^0 = \frac{\partial}{\partial X_t}E(Y_t|X_t)$ is the marginal effect of $X_t$ on $Y_t$.
(ii) The key notion of linearity in the classical linear regression model is that the regression model is linear in $\beta^0$ rather than in $X_t$.
(iii) Does Assumption 3.1 imply a causal relationship from $X_t$ to $Y_t$? Not necessarily. As Kendall and Stuart (1961, Vol. 2, Ch. 26, p. 279) point out, "a statistical relationship, however strong and however suggestive, can never establish causal connection: our ideas of causation must come from outside statistics, ultimately from some theory or other." Assumption 3.1 only implies a predictive relationship: given $X_t$, can we predict $Y_t$ linearly?
(iv) Matrix Notation: Denote
$$Y = (Y_1, \ldots, Y_n)', \quad n \times 1;$$
$$\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)', \quad n \times 1;$$
$$X = (X_1, \ldots, X_n)', \quad n \times K.$$
Note that the $t$-th row of $X$ is $X_t' = (1, X_{1t}, \ldots, X_{kt})$. With these matrix notations, we have a compact expression for Assumption 3.1:
$$Y = X\beta^0 + \varepsilon,$$
$$n \times 1 = (n \times K)(K \times 1) + n \times 1.$$
The second assumption is a strict exogeneity condition.
Assumption 3.2 [Strict Exogeneity]:
$$E(\varepsilon_t|X) = E(\varepsilon_t|X_1, \ldots, X_t, \ldots, X_n) = 0, \quad t = 1, \ldots, n.$$
Remarks:
(i) Among other things, this condition implies correct model specification for $E(Y_t|X_t)$. This is because Assumption 3.2 implies $E(\varepsilon_t|X_t) = 0$ by the law of iterated expectations.
(ii) Assumption 3.2 also implies $E(\varepsilon_t) = 0$ by the law of iterated expectations.
(iii) Under Assumption 3.2, we have $E(X_s\varepsilon_t) = 0$ for any $(t,s)$, where $t, s \in \{1, \ldots, n\}$. This follows because
$$E(X_s\varepsilon_t) = E[E(X_s\varepsilon_t|X)] = E[X_s E(\varepsilon_t|X)] = E(X_s \cdot 0) = 0.$$
Note that (ii) and (iii) together imply $\mathrm{cov}(X_s, \varepsilon_t) = 0$ for all $t, s \in \{1, \ldots, n\}$.
(iv) Because $X$ contains regressors $\{X_s\}$ for both $s \leq t$ and $s > t$, Assumption 3.2 essentially requires that the error $\varepsilon_t$ not depend on the past and future values of the regressors if $t$ is a time index. This rules out dynamic time series models, for which $\varepsilon_t$ may be correlated with the future values of the regressors (because the future values of the regressors depend on the current shocks), as is illustrated in the following example.
Example 1: Consider a so-called autoregressive AR(1) model
$$Y_t = \beta_0 + \beta_1 Y_{t-1} + \varepsilon_t = X_t'\beta + \varepsilon_t, \quad t = 1, \ldots, n,$$
$$\{\varepsilon_t\} \sim \text{i.i.d.}(0, \sigma^2),$$
where $X_t = (1, Y_{t-1})'$. Obviously, $E(X_t\varepsilon_t) = E(X_t)E(\varepsilon_t) = 0$, but $E(X_{t+1}\varepsilon_t) \neq 0$. Thus, we have $E(\varepsilon_t|X) \neq 0$, and so Assumption 3.2 does not hold. In Chapter 5, we will consider linear regression models with dependent observations, which will include this example as a special case. In fact, the main reason for imposing Assumption 3.2 is to obtain a finite sample distribution theory. For a large sample theory (i.e., an asymptotic theory), the strict exogeneity condition will not be needed.
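The failure of strict exogeneity in the AR(1) model can be checked by simulation. A minimal sketch, assuming NumPy is available; the parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta0, beta1, sigma = 1.0, 0.5, 1.0

eps = rng.normal(0.0, sigma, size=n)
Y = np.empty(n)
Y[0] = beta0 / (1 - beta1)          # start at the unconditional mean
for t in range(1, n):
    Y[t] = beta0 + beta1 * Y[t - 1] + eps[t]

# The contemporaneous regressor Y_{t-1} is uncorrelated with eps_t ...
m_contemp = np.mean(Y[:-1] * eps[1:])
# ... but the future regressor Y_t (a component of X_{t+1}) is not:
m_future = np.mean(Y[1:] * eps[1:])
print(m_contemp, m_future)
```

Here `m_contemp` should be near zero, while `m_future` should be near $\mathrm{var}(\varepsilon_t) = 1$, illustrating $E(X_{t+1}\varepsilon_t) \neq 0$.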
(v) In econometrics, there are some alternative definitions of strict exogeneity. For example, one definition assumes that $\varepsilon_t$ and $X$ are independent. An example is that $X$ is nonstochastic. This rules out conditional heteroskedasticity (i.e., $\mathrm{var}(\varepsilon_t|X)$ depending on $X$). In Assumption 3.2, we still allow for conditional heteroskedasticity, because we do not assume that $\varepsilon_t$ and $X$ are independent. We only assume that the conditional mean $E(\varepsilon_t|X)$ does not depend on $X$.
(vi) Question: What happens to Assumption 3.2 if $X$ is nonstochastic?
Answer: If $X$ is nonstochastic, Assumption 3.2 becomes
$$E(\varepsilon_t|X) = E(\varepsilon_t) = 0.$$
An example of nonstochastic $X$ is $X_t = (1, t, \ldots, t^k)'$. This corresponds to a time-trend regression model
$$Y_t = X_t'\beta^0 + \varepsilon_t = \sum_{j=0}^{k} \beta_j^0 t^j + \varepsilon_t.$$
(vii) Question: What happens to Assumption 3.2 if $Z_t = (Y_t, X_t')'$ is an independent random sample (i.e., $Z_t$ and $Z_s$ are independent whenever $t \neq s$, although $Y_t$ and $X_t$ may not be independent)?
Answer: When $\{Z_t\}$ is i.i.d., Assumption 3.2 becomes
$$E(\varepsilon_t|X) = E(\varepsilon_t|X_1, X_2, \ldots, X_t, \ldots, X_n) = E(\varepsilon_t|X_t) = 0.$$
In other words, when $\{Z_t\}$ is i.i.d., $E(\varepsilon_t|X) = 0$ is equivalent to $E(\varepsilon_t|X_t) = 0$.
Assumption 3.3 [Nonsingularity]: The minimum eigenvalue of the $K \times K$ square matrix $X'X = \sum_{t=1}^{n} X_t X_t'$ satisfies
$$\lambda_{\min}(X'X) \to \infty \text{ as } n \to \infty$$
with probability one.
Question: How do we find the eigenvalues of a square matrix $A$?
Answer: By solving the characteristic equation
$$\det(A - \lambda I) = 0,$$
where $\det(\cdot)$ denotes the determinant of a square matrix, and $I$ is an identity matrix with the same dimension as $A$.
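This can be checked numerically. A small sketch assuming NumPy, with an illustrative $2 \times 2$ matrix standing in for $X'X$:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

# For a 2x2 matrix, det(A - lambda*I) = lambda^2 - tr(A)*lambda + det(A).
tr, det = np.trace(A), np.linalg.det(A)
roots = np.sort(np.roots([1.0, -tr, det]).real)

# The same eigenvalues from the library routine (returned in ascending order).
eigs = np.linalg.eigvalsh(A)
print(roots, eigs)
```

Both approaches give the eigenvalues $(7 \pm \sqrt{5})/2$ of this particular matrix.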
Remarks:
(i) Assumption 3.3 rules out multicollinearity among the $(k+1)$ regressors in $X_t$. We say that there exists multicollinearity among the $X_t$ if for all $t \in \{1, \ldots, n\}$, the variable $X_{jt}$ for some $j \in \{0, 1, \ldots, k\}$ is a linear combination of the other $K-1$ column variables.
(ii) This condition implies that $X'X$ is nonsingular, or equivalently, that $X$ has full rank $K = k+1$. Thus, we need $K \leq n$; that is, the number of regressors cannot be larger than the sample size.
(iii) It is well known that the eigenvalue $\lambda$ can be used to summarize the information contained in a matrix (recall the popular principal component analysis). Assumption 3.3 implies that new information must be available as the sample size $n \to \infty$ (i.e., $X_t$ should not merely repeat the same values as $t$ increases).
(iv) Intuitively, if there is no variation in the values of $X_t$, it will be difficult to determine the relationship between $Y_t$ and $X_t$ (indeed, the purpose of classical linear regression is to investigate how a change in $X$ causes a change in $Y$). In a certain sense, one may call $X'X$ the "information matrix" of the random sample $X$, because it is a measure of the information contained in $X$. The degree of variation in $X'X$ will affect the precision of the parameter estimates for $\beta^0$.

Question: Why can the eigenvalue $\lambda$ be used as a measure of the information contained in $X'X$?
Assumption 3.4 [Spherical Error Variance]:
(a) [conditional homoskedasticity]:
$$E(\varepsilon_t^2|X) = \sigma^2 > 0, \quad t = 1, \ldots, n;$$
(b) [conditional serial uncorrelatedness]:
$$E(\varepsilon_t\varepsilon_s|X) = 0, \quad t \neq s, \quad t, s \in \{1, \ldots, n\}.$$

Remarks:
(i) We can write Assumption 3.4 compactly as
$$E(\varepsilon_t\varepsilon_s|X) = \sigma^2\delta_{ts},$$
where $\delta_{ts} = 1$ if $t = s$ and $\delta_{ts} = 0$ otherwise. In mathematics, $\delta_{ts}$ is called the Kronecker delta function.
(ii) $\mathrm{var}(\varepsilon_t|X) = E(\varepsilon_t^2|X) - [E(\varepsilon_t|X)]^2 = E(\varepsilon_t^2|X) = \sigma^2$ given Assumption 3.2. Similarly, we have $\mathrm{cov}(\varepsilon_t, \varepsilon_s|X) = E(\varepsilon_t\varepsilon_s|X) = 0$ for all $t \neq s$.
(iii) By the law of iterated expectations, Assumption 3.4 implies that $\mathrm{var}(\varepsilon_t) = \sigma^2$ for all $t = 1, \ldots, n$, the so-called unconditional homoskedasticity. Similarly, $\mathrm{cov}(\varepsilon_t, \varepsilon_s) = 0$ for all $t \neq s$.
(iv) A compact expression for Assumptions 3.2 and 3.4 together is
$$E(\varepsilon|X) = 0 \quad \text{and} \quad E(\varepsilon\varepsilon'|X) = \sigma^2 I,$$
where $I \equiv I_n$ is an $n \times n$ identity matrix.
(v) Assumption 3.4 does not imply that $\varepsilon_t$ and $X$ are independent. It allows the possibility that the conditional higher-order moments (e.g., skewness and kurtosis) of $\varepsilon_t$ depend on $X$.
3.2 OLS Estimation
Question: How do we estimate $\beta^0$ using an observed data set generated from the random sample $\{Z_t\}_{t=1}^n$, where $Z_t = (Y_t, X_t')'$?
Definition [OLS Estimator]: Suppose Assumptions 3.1 and 3.3 hold. Define the sum of squared residuals (SSR) of the linear regression model $Y_t = X_t'\beta + u_t$ as
$$SSR(\beta) \equiv (Y - X\beta)'(Y - X\beta) = \sum_{t=1}^{n} (Y_t - X_t'\beta)^2.$$
Then the ordinary least squares (OLS) estimator $\hat{\beta}$ is the solution to
$$\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^K} SSR(\beta).$$

Remark: $SSR(\beta)$ is the sum of squared model errors $\{u_t = Y_t - X_t'\beta\}$ over all observations, with equal weight on each $t$.
Theorem 1 [Existence of OLS]: Under Assumptions 3.1 and 3.3, the OLS estimator $\hat{\beta}$ exists and
$$\hat{\beta} = (X'X)^{-1}X'Y = \left(\sum_{t=1}^{n} X_t X_t'\right)^{-1}\sum_{t=1}^{n} X_t Y_t = \left(\frac{1}{n}\sum_{t=1}^{n} X_t X_t'\right)^{-1}\frac{1}{n}\sum_{t=1}^{n} X_t Y_t.$$
The latter expression will be useful for our asymptotic analysis in subsequent chapters.
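The closed-form expression can be computed directly. A sketch assuming NumPy, with simulated data (all names and parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 500, 2
beta_true = np.array([1.0, 2.0, -0.5])   # (intercept, beta_1, beta_2)

# Design matrix X is n x K with K = k + 1 (intercept plus k regressors).
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
eps = rng.normal(size=n)
Y = X @ beta_true + eps

# OLS: beta_hat = (X'X)^{-1} X'Y, computed by solving the normal
# equations (X'X) beta = X'Y rather than inverting X'X explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)
```

Solving the normal equations is numerically preferable to forming $(X'X)^{-1}$ explicitly, though both implement the same estimator.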
Remark: Suppose $Z_t = (Y_t, X_t')'$, $t = 1, \ldots, n$, is an independent and identically distributed (i.i.d.) random sample of size $n$. Compare the population MSE criterion
$$MSE(\beta) = E(Y_t - X_t'\beta)^2$$
and the sample MSE criterion
$$\frac{SSR(\beta)}{n} = \frac{1}{n}\sum_{t=1}^{n}(Y_t - X_t'\beta)^2.$$
By the weak law of large numbers (WLLN), we have
$$\frac{SSR(\beta)}{n} \to^p E(Y_t - X_t'\beta)^2 \equiv MSE(\beta),$$
and their minimizers satisfy
$$\hat{\beta} = \left(\frac{1}{n}\sum_{t=1}^{n} X_t X_t'\right)^{-1}\frac{1}{n}\sum_{t=1}^{n} X_t Y_t \to^p [E(X_t X_t')]^{-1}E(X_t Y_t) \equiv \beta^*.$$
Here, $SSR(\beta)$, after scaling by $n^{-1}$, is the sample analogue of $MSE(\beta)$, and the OLS estimator $\hat{\beta}$ is the sample analogue of the best LS approximation coefficient $\beta^*$.
Proof of Theorem 1: Using the formula that for $a$ and $\beta$ both $K \times 1$ vectors, the derivative
$$\frac{\partial(a'\beta)}{\partial\beta} = a,$$
we have
$$\frac{dSSR(\beta)}{d\beta} = \frac{d}{d\beta}\sum_{t=1}^{n}(Y_t - X_t'\beta)^2 = \sum_{t=1}^{n}\frac{\partial}{\partial\beta}(Y_t - X_t'\beta)^2 = \sum_{t=1}^{n} 2(Y_t - X_t'\beta)\frac{\partial}{\partial\beta}(Y_t - X_t'\beta) = -2\sum_{t=1}^{n} X_t(Y_t - X_t'\beta) = -2X'(Y - X\beta).$$
The OLS estimator must satisfy the first order condition (FOC):
$$-2X'(Y - X\hat{\beta}) = 0, \quad \text{i.e.,} \quad X'(Y - X\hat{\beta}) = 0, \quad \text{i.e.,} \quad X'Y - (X'X)\hat{\beta} = 0.$$
It follows that
$$(X'X)\hat{\beta} = X'Y.$$
By Assumption 3.3, $X'X$ is nonsingular. Thus,
$$\hat{\beta} = (X'X)^{-1}X'Y.$$
Checking the second order condition (SOC), we have the $K \times K$ Hessian matrix
$$\frac{\partial^2 SSR(\beta)}{\partial\beta\partial\beta'} = -2\sum_{t=1}^{n}\frac{\partial}{\partial\beta'}[(Y_t - X_t'\beta)X_t] = 2X'X,$$
which is positive definite given $\lambda_{\min}(X'X) > 0$. Thus, $\hat{\beta}$ is a global minimizer. Note that for the existence of $\hat{\beta}$, we only need $X'X$ to be nonsingular, which is implied by the condition that $\lambda_{\min}(X'X) \to \infty$ as $n \to \infty$; the existence itself does not require $\lambda_{\min}(X'X) \to \infty$ as $n \to \infty$.
Remarks:
(i) $\hat{Y}_t \equiv X_t'\hat{\beta}$ is called the fitted value (or predicted value) for observation $Y_t$, and $e_t \equiv Y_t - \hat{Y}_t$ is the estimated residual (or prediction error) for observation $Y_t$. Note that
$$e_t = Y_t - \hat{Y}_t = (X_t'\beta^0 + \varepsilon_t) - X_t'\hat{\beta} = \varepsilon_t - X_t'(\hat{\beta} - \beta^0)$$
$$= \text{true regression disturbance} + \text{estimation error},$$
where the true disturbance $\varepsilon_t$ is unavoidable, and the estimation error $X_t'(\hat{\beta} - \beta^0)$ can be made smaller when a larger data set is available (so that $\hat{\beta}$ becomes closer to $\beta^0$).
(ii) The FOC implies that the estimated residual vector $e = Y - X\hat{\beta}$ is orthogonal to the regressors $X$ in the sense that
$$X'e = \sum_{t=1}^{n} X_t e_t = 0.$$
This is a consequence of the very nature of OLS, as implied by the FOC of $\min_{\beta \in \mathbb{R}^K} SSR(\beta)$. It always holds, no matter whether $E(\varepsilon_t|X) = 0$ (recall that we do not impose Assumption 3.2). Note that if $X_t$ contains an intercept, then $X'e = 0$ implies $\sum_{t=1}^{n} e_t = 0$.
Some Useful Identities
To investigate the statistical properties of $\hat{\beta}$, we first state some useful lemmas.

Lemma: Under Assumptions 3.1 and 3.3, we have:
(i) $X'e = 0$;
(ii) $\hat{\beta} - \beta^0 = (X'X)^{-1}X'\varepsilon$;
(iii) defining the $n \times n$ projection matrix
$$P = X(X'X)^{-1}X'$$
and
$$M = I_n - P,$$
both $P$ and $M$ are symmetric (i.e., $P = P'$ and $M = M'$) and idempotent (i.e., $P^2 = P$, $M^2 = M$), with
$$PX = X, \quad MX = 0;$$
(iv) $SSR(\hat{\beta}) = e'e = Y'MY = \varepsilon'M\varepsilon$.
Proof:
(i) The result follows immediately from the FOC of the OLS estimator.
(ii) Because $\hat{\beta} = (X'X)^{-1}X'Y$ and $Y = X\beta^0 + \varepsilon$, we have
$$\hat{\beta} - \beta^0 = (X'X)^{-1}X'(X\beta^0 + \varepsilon) - \beta^0 = (X'X)^{-1}X'\varepsilon.$$
(iii) $P$ is idempotent because
$$P^2 = PP = [X(X'X)^{-1}X'][X(X'X)^{-1}X'] = X(X'X)^{-1}X' = P.$$
Similarly, we can show $M^2 = M$.
(iv) By the definition of $M$, we have
$$e = Y - X\hat{\beta} = Y - X(X'X)^{-1}X'Y = [I - X(X'X)^{-1}X']Y = MY = M(X\beta^0 + \varepsilon) = MX\beta^0 + M\varepsilon = M\varepsilon,$$
given $MX = 0$. It follows that
$$SSR(\hat{\beta}) = e'e = (M\varepsilon)'(M\varepsilon) = \varepsilon'M'M\varepsilon = \varepsilon'M^2\varepsilon = \varepsilon'M\varepsilon,$$
where the last equality follows from $M^2 = M$.
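The algebraic properties of $P$ and $M$ in the Lemma are easy to verify numerically. A sketch assuming NumPy; the design matrix is simulated purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])

# Projection matrix P and annihilator M = I - P.
P = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - P

print(np.allclose(P, P.T), np.allclose(P @ P, P))   # symmetric, idempotent
print(np.allclose(M @ X, 0.0))                      # MX = 0
print(round(np.trace(M)))                           # tr(M) = n - K = 47
```

The trace result $\mathrm{tr}(M) = n - K$ reappears below in the proof that $s^2$ is unbiased.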
3.3 Goodness of Fit and Model Selection Criteria

Question: How well does the linear regression model fit the data? That is, how well does the linear regression model explain the variation of the observed data of $\{Y_t\}$?
We need some criteria or measures to characterize goodness of fit.
We first introduce two measures of goodness of fit. The first is called the uncentered squared multi-correlation coefficient, $R^2_{uc}$.

Definition [Uncentered $R^2$]: The uncentered squared multi-correlation coefficient is defined as
$$R^2_{uc} = \frac{\hat{Y}'\hat{Y}}{Y'Y} = 1 - \frac{e'e}{Y'Y}.$$

Remarks:
Interpretation of $R^2_{uc}$: the proportion of the uncentered sample quadratic variation in the dependent variable $\{Y_t\}$ that can be explained by the uncentered sample quadratic variation of the fitted values $\{\hat{Y}_t\}$. Note that we always have $0 \leq R^2_{uc} \leq 1$.
Next, we define a closely related measure called the centered $R^2$.

Definition [Centered $R^2$: Coefficient of Determination]: The coefficient of determination is
$$R^2 \equiv 1 - \frac{\sum_{t=1}^{n} e_t^2}{\sum_{t=1}^{n}(Y_t - \bar{Y})^2},$$
where $\bar{Y} = n^{-1}\sum_{t=1}^{n} Y_t$ is the sample mean.
Remarks:
(i) When $X_t$ contains an intercept, we have the orthogonal decomposition
$$\sum_{t=1}^{n}(Y_t - \bar{Y})^2 = \sum_{t=1}^{n}(\hat{Y}_t - \bar{Y} + Y_t - \hat{Y}_t)^2 = \sum_{t=1}^{n}(\hat{Y}_t - \bar{Y})^2 + \sum_{t=1}^{n} e_t^2 + 2\sum_{t=1}^{n}(\hat{Y}_t - \bar{Y})e_t = \sum_{t=1}^{n}(\hat{Y}_t - \bar{Y})^2 + \sum_{t=1}^{n} e_t^2,$$
where the cross-product term
$$\sum_{t=1}^{n}(\hat{Y}_t - \bar{Y})e_t = \sum_{t=1}^{n}\hat{Y}_t e_t - \bar{Y}\sum_{t=1}^{n} e_t = \hat{\beta}'\sum_{t=1}^{n} X_t e_t - \bar{Y}\sum_{t=1}^{n} e_t = \hat{\beta}'(X'e) - \bar{Y}\sum_{t=1}^{n} e_t = \hat{\beta}'\cdot 0 - \bar{Y}\cdot 0 = 0,$$
where we have used the facts that $X'e = 0$ and $\sum_{t=1}^{n} e_t = 0$, which follow from the FOC of the OLS estimation and the fact that $X_t$ contains an intercept (i.e., $X_{0t} = 1$). It follows that
$$R^2 \equiv 1 - \frac{e'e}{\sum_{t=1}^{n}(Y_t - \bar{Y})^2} = \frac{\sum_{t=1}^{n}(Y_t - \bar{Y})^2 - \sum_{t=1}^{n} e_t^2}{\sum_{t=1}^{n}(Y_t - \bar{Y})^2} = \frac{\sum_{t=1}^{n}(\hat{Y}_t - \bar{Y})^2}{\sum_{t=1}^{n}(Y_t - \bar{Y})^2},$$
and consequently we have
$$0 \leq R^2 \leq 1.$$
(ii) If $X_t$ does not contain an intercept, then the orthogonal decomposition identity
$$\sum_{t=1}^{n}(Y_t - \bar{Y})^2 = \sum_{t=1}^{n}(\hat{Y}_t - \bar{Y})^2 + \sum_{t=1}^{n} e_t^2$$
no longer holds. As a consequence, $R^2$ may be negative when there is no intercept! This is because the cross-product term
$$2\sum_{t=1}^{n}(\hat{Y}_t - \bar{Y})e_t$$
may be negative.
(iii) For any given random sample $\{Y_t, X_t'\}'$, $t = 1, \ldots, n$, $R^2$ is nondecreasing in the number of explanatory variables in $X_t$. In other words, the more explanatory variables are added to the linear regression, the higher $R^2$ becomes. This is true regardless of whether $X_t$ has any true explanatory power for $Y_t$.
Theorem: Suppose $\{Y_t, X_t'\}'$, $t = 1, \ldots, n$, is a random sample; $R_1^2$ is the centered $R^2$ from the linear regression
$$Y_t = X_t'\beta + u_t,$$
where $X_t = (1, X_{1t}, \ldots, X_{kt})'$ and $\beta$ is a $K \times 1$ parameter vector; and $R_2^2$ is the centered $R^2$ from the extended linear regression
$$Y_t = \tilde{X}_t'\gamma + v_t,$$
where $\tilde{X}_t = (1, X_{1t}, \ldots, X_{kt}, X_{k+1,t}, \ldots, X_{k+q,t})'$ and $\gamma$ is a $(K+q) \times 1$ parameter vector. Then $R_2^2 \geq R_1^2$.
Proof: By definition, we have
$$R_1^2 = 1 - \frac{e'e}{\sum_{t=1}^{n}(Y_t - \bar{Y})^2}, \qquad R_2^2 = 1 - \frac{\tilde{e}'\tilde{e}}{\sum_{t=1}^{n}(Y_t - \bar{Y})^2},$$
where $e$ is the estimated residual vector from the regression of $Y$ on $X$, and $\tilde{e}$ is the estimated residual vector from the regression of $Y$ on $\tilde{X}$. It suffices to show $\tilde{e}'\tilde{e} \leq e'e$. Because the OLS estimator $\hat{\gamma} = (\tilde{X}'\tilde{X})^{-1}\tilde{X}'Y$ minimizes $SSR(\gamma)$ for the extended model, we have
$$\tilde{e}'\tilde{e} = \sum_{t=1}^{n}(Y_t - \tilde{X}_t'\hat{\gamma})^2 \leq \sum_{t=1}^{n}(Y_t - \tilde{X}_t'\gamma)^2 \quad \text{for all } \gamma \in \mathbb{R}^{K+q}.$$
Now we choose
$$\gamma = (\hat{\beta}', 0')',$$
where $\hat{\beta} = (X'X)^{-1}X'Y$ is the OLS estimator from the first regression. It follows that
$$\tilde{e}'\tilde{e} \leq \sum_{t=1}^{n}\left(Y_t - \sum_{j=0}^{k}\hat{\beta}_j X_{jt} - \sum_{j=k+1}^{k+q} 0\cdot X_{jt}\right)^2 = \sum_{t=1}^{n}(Y_t - X_t'\hat{\beta})^2 = e'e.$$
Hence, we have $R_1^2 \leq R_2^2$. This completes the proof.
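The theorem can be illustrated by adding a regressor with no true explanatory power and seeing that $R^2$ still cannot fall. A sketch assuming NumPy; `centered_r2` is a small helper written for this illustration, not a library function:

```python
import numpy as np

def centered_r2(X, Y):
    """Centered R^2 from an OLS regression of Y on X (X includes an intercept)."""
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    e = Y - X @ beta_hat
    return 1.0 - (e @ e) / np.sum((Y - Y.mean()) ** 2)

rng = np.random.default_rng(3)
n = 200
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X1 @ np.array([1.0, 2.0]) + rng.normal(size=n)

# Append a purely irrelevant regressor: R^2 is nondecreasing regardless.
X2 = np.column_stack([X1, rng.normal(size=n)])
r2_small, r2_big = centered_r2(X1, Y), centered_r2(X2, Y)
print(r2_small, r2_big)
```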
Question: What is the implication of this result?

Answer: $R^2$ is not a suitable criterion for judging correct model specification, nor is it a suitable criterion for model selection.

Question: Does a high $R^2$ imply precise estimation of $\beta^0$?

Question: What are suitable model selection criteria?
Two Popular Model Selection Criteria

The KISS principle: Keep It Sophistically Simple!
An important idea in statistics is to use a simple model to capture as much of the essential information contained in the data as possible. Below, we introduce two popular model selection criteria that reflect this idea.

Akaike Information Criterion [AIC]: A linear regression model can be selected by minimizing the following AIC criterion with a suitable choice of $K$:
$$AIC = \ln(s^2) + \frac{2K}{n} \quad (\text{goodness of fit} + \text{model complexity}),$$
where
$$s^2 = e'e/(n-K)$$
and $K = k+1$ is the number of regressors.

Bayesian Information Criterion [BIC]: A linear regression model can be selected by minimizing the following criterion with a suitable choice of $K$:
$$BIC = \ln(s^2) + \frac{K\ln(n)}{n}.$$

Remark: When $\ln n \geq 2$, BIC gives a heavier penalty than AIC for model complexity, which is measured by the number of estimated parameters (relative to the sample size $n$). As a consequence, BIC will choose a more parsimonious linear regression model than AIC.
Remark: BIC is strongly consistent in the sense that it determines the true model asymptotically (i.e., as $n \to \infty$), whereas under AIC an overparameterized model will emerge no matter how large the sample is. Of course, such properties are not necessarily guaranteed in finite samples.
Remarks: In addition to AIC and BIC, there are other criteria, such as $\bar{R}^2$, the so-called adjusted $R^2$, that can also be used to select a linear regression model. The adjusted $\bar{R}^2$ is defined as
$$\bar{R}^2 = 1 - \frac{e'e/(n-K)}{(Y - \bar{Y})'(Y - \bar{Y})/(n-1)}.$$
This differs from
$$R^2 = 1 - \frac{e'e}{(Y - \bar{Y})'(Y - \bar{Y})}.$$
It may be shown that
$$\bar{R}^2 = 1 - \frac{n-1}{n-K}(1 - R^2).$$
All model selection criteria are structured in terms of the estimated residual variance plus a penalty adjustment involving the number of estimated parameters, and it is in the extent of this penalty that the criteria differ. For more discussion of these and other selection criteria, see Judge et al. (1985, Section 7.5).
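The relation between $\bar{R}^2$ and $R^2$ can be confirmed numerically. A sketch assuming NumPy, with simulated data (parameter values illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, K = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
Y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta_hat
tss = np.sum((Y - Y.mean()) ** 2)

r2 = 1.0 - (e @ e) / tss                              # R^2
r2_adj = 1.0 - ((e @ e) / (n - K)) / (tss / (n - 1))  # adjusted R^2

# The identity relating the two measures:
print(np.isclose(r2_adj, 1.0 - (n - 1) / (n - K) * (1.0 - r2)))
```

Since $(n-1)/(n-K) > 1$ whenever $K > 1$, the adjusted measure satisfies $\bar{R}^2 < R^2$ unless the fit is perfect.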
Question: Why is it not good practice to use a complicated model?

Answer: A complicated model contains many unknown parameters. Given a fixed amount of data information, parameter estimation becomes less precise when more parameters have to be estimated. As a consequence, the out-of-sample forecast for $Y_t$ may become less precise than the forecast of a simpler model. The simpler model may have a larger bias but more precise parameter estimates. Intuitively, a complicated model is too flexible in the sense that it may capture not only systematic components but also some features in the data that will not show up again. Thus, it cannot forecast the future well.
3.4 Consistency and Efficiency of OLS

We now investigate the statistical properties of $\hat{\beta}$. We are interested in addressing the following basic questions:
- Is $\hat{\beta}$ a good estimator for $\beta^0$ (consistency)?
- Is $\hat{\beta}$ the best estimator (efficiency)?
- What is the sampling distribution of $\hat{\beta}$ (normality)?

Question: What is the sampling distribution of $\hat{\beta}$?
Answer: The distribution of $\hat{\beta}$ is called the sampling distribution of $\hat{\beta}$ because $\hat{\beta}$ is a function of the random sample $\{Z_t\}_{t=1}^n$, where $Z_t = (Y_t, X_t')'$.

Remark: The sampling distribution of $\hat{\beta}$ is useful for any statistical inference involving $\hat{\beta}$, such as confidence interval estimation and hypothesis testing.
We first investigate the statistical properties of $\hat{\beta}$.

Theorem: Suppose Assumptions 3.1-3.4 hold. Then:
(i) [Unbiasedness] $E(\hat{\beta}|X) = \beta^0$ and $E(\hat{\beta}) = \beta^0$.
(ii) [Vanishing Variance]
$$\mathrm{var}(\hat{\beta}|X) = E\left[(\hat{\beta} - E\hat{\beta})(\hat{\beta} - E\hat{\beta})'|X\right] = \sigma^2(X'X)^{-1},$$
and for any $K \times 1$ vector $\tau$ such that $\tau'\tau = 1$, we have
$$\tau'\mathrm{var}(\hat{\beta}|X)\tau \to 0 \text{ as } n \to \infty.$$
(iii) [Orthogonality between $e$ and $\hat{\beta}$]
$$\mathrm{cov}(\hat{\beta}, e|X) = E\{[\hat{\beta} - E(\hat{\beta}|X)]e'|X\} = 0.$$
(iv) [Gauss-Markov]
$$\mathrm{var}(b|X) - \mathrm{var}(\hat{\beta}|X) \text{ is positive semi-definite (p.s.d.)}$$
for any unbiased estimator $b$ that is linear in $Y$ with $E(b|X) = \beta^0$.
(v) [Residual Variance Estimator]
$$s^2 = \frac{1}{n-K}\sum_{t=1}^{n} e_t^2 = e'e/(n-K)$$
is unbiased for $\sigma^2 = E(\varepsilon_t^2)$; that is, $E(s^2|X) = \sigma^2$.
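Parts (i) and (v) of the theorem, unbiasedness of $\hat{\beta}$ and of $s^2$, can be illustrated by Monte Carlo. A sketch assuming NumPy; the design, replication count, and parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)
n, K, sigma = 50, 2, 1.5
beta_true = np.array([1.0, 2.0])
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # fixed design across replications

B = 5000
beta_hats = np.empty((B, K))
s2s = np.empty(B)
for b in range(B):
    Y = X @ beta_true + rng.normal(0.0, sigma, size=n)
    bh = np.linalg.solve(X.T @ X, X.T @ Y)
    e = Y - X @ bh
    beta_hats[b] = bh
    s2s[b] = (e @ e) / (n - K)

print(beta_hats.mean(axis=0))   # Monte Carlo average close to beta_true
print(s2s.mean())               # Monte Carlo average close to sigma^2 = 2.25
```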
Question: What is a linear estimator $b$?

Remarks:
(a) Parts (i) and (ii) of the theorem imply that the conditional MSE
$$MSE(\hat{\beta}|X) = E[(\hat{\beta} - \beta^0)(\hat{\beta} - \beta^0)'|X] = \mathrm{var}(\hat{\beta}|X) + \mathrm{Bias}(\hat{\beta}|X)\mathrm{Bias}(\hat{\beta}|X)' = \mathrm{var}(\hat{\beta}|X) \to 0 \text{ as } n \to \infty,$$
where we have used the fact that $\mathrm{Bias}(\hat{\beta}|X) \equiv E(\hat{\beta}|X) - \beta^0 = 0$. Recall that MSE measures how close an estimator $\hat{\beta}$ is to the target $\beta^0$.
(b) Part (iv) of the theorem implies that $\hat{\beta}$ is the best linear unbiased estimator (BLUE) for $\beta^0$, because $\mathrm{var}(\hat{\beta}|X)$ is the smallest among all linear unbiased estimators for $\beta^0$.

Formally, we can define a related concept for comparing two estimators:

Definition [Efficiency]: An unbiased estimator $\hat{\beta}$ of parameter $\beta^0$ is more efficient than another unbiased estimator $b$ of parameter $\beta^0$ if
$$\mathrm{var}(b|X) - \mathrm{var}(\hat{\beta}|X) \text{ is p.s.d.}$$

Implication: For any $\tau \in \mathbb{R}^K$ such that $\tau'\tau = 1$, we have
$$\tau'\left[\mathrm{var}(b|X) - \mathrm{var}(\hat{\beta}|X)\right]\tau \geq 0.$$
Choosing $\tau = (1, 0, \ldots, 0)'$, for example, we have
$$\mathrm{var}(b_1|X) - \mathrm{var}(\hat{\beta}_1|X) \geq 0.$$
Proof: (i) Given $\hat{\beta} - \beta^0 = (X'X)^{-1}X'\varepsilon$, we have
$$E[(\hat{\beta} - \beta^0)|X] = E[(X'X)^{-1}X'\varepsilon|X] = (X'X)^{-1}X'E(\varepsilon|X) = (X'X)^{-1}X'\cdot 0 = 0.$$
(ii) Given $\hat{\beta} - \beta^0 = (X'X)^{-1}X'\varepsilon$ and $E(\varepsilon\varepsilon'|X) = \sigma^2 I$, we have
$$\mathrm{var}(\hat{\beta}|X) \equiv E\left[(\hat{\beta} - E\hat{\beta})(\hat{\beta} - E\hat{\beta})'|X\right] = E\left[(\hat{\beta} - \beta^0)(\hat{\beta} - \beta^0)'|X\right] = E[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}|X]$$
$$= (X'X)^{-1}X'E(\varepsilon\varepsilon'|X)X(X'X)^{-1} = (X'X)^{-1}X'\sigma^2 IX(X'X)^{-1} = \sigma^2(X'X)^{-1}.$$
Note that Assumption 3.4 is crucial here for obtaining the expression $\sigma^2(X'X)^{-1}$ for $\mathrm{var}(\hat{\beta}|X)$. Moreover, for any $\tau \in \mathbb{R}^K$ such that $\tau'\tau = 1$, we have
$$\tau'\mathrm{var}(\hat{\beta}|X)\tau = \sigma^2\tau'(X'X)^{-1}\tau \leq \sigma^2\lambda_{\max}[(X'X)^{-1}] = \sigma^2\lambda_{\min}^{-1}(X'X) \to 0,$$
given $\lambda_{\min}(X'X) \to \infty$ as $n \to \infty$ with probability one. Note that the condition $\lambda_{\min}(X'X) \to \infty$ ensures that $\mathrm{var}(\hat{\beta}|X)$ vanishes as $n \to \infty$.
(iii) Given $\hat{\beta} - \beta^0 = (X'X)^{-1}X'\varepsilon$, $e = Y - X\hat{\beta} = MY = M\varepsilon$ (since $MX = 0$), and $E(e) = 0$, we have
$$\mathrm{cov}(\hat{\beta}, e|X) = E\left[(\hat{\beta} - E\hat{\beta})(e - Ee)'|X\right] = E\left[(\hat{\beta} - \beta^0)e'|X\right] = E[(X'X)^{-1}X'\varepsilon\varepsilon'M|X]$$
$$= (X'X)^{-1}X'E(\varepsilon\varepsilon'|X)M = (X'X)^{-1}X'\sigma^2 IM = \sigma^2(X'X)^{-1}X'M = 0.$$
Again, Assumption 3.4 plays a crucial role in ensuring zero correlation between $\hat{\beta}$ and $e$.
(iv) Consider a linear estimator
$$b = C'Y,$$
where $C = C(X)$ is an $n \times K$ matrix depending on $X$. It is unbiased for $\beta^0$ regardless of the value of $\beta^0$ if and only if
$$E(b|X) = C'X\beta^0 + C'E(\varepsilon|X) = C'X\beta^0 = \beta^0.$$
This holds if and only if
$$C'X = I.$$
Because
$$b = C'Y = C'(X\beta^0 + \varepsilon) = C'X\beta^0 + C'\varepsilon = \beta^0 + C'\varepsilon,$$
the variance of $b$ is
$$\mathrm{var}(b|X) = E\left[(b - \beta^0)(b - \beta^0)'|X\right] = E[C'\varepsilon\varepsilon'C|X] = C'E(\varepsilon\varepsilon'|X)C = C'\sigma^2 IC = \sigma^2 C'C.$$
Using $C'X = I$, we now have
$$\mathrm{var}(b|X) - \mathrm{var}(\hat{\beta}|X) = \sigma^2 C'C - \sigma^2(X'X)^{-1} = \sigma^2[C'C - C'X(X'X)^{-1}X'C] = \sigma^2 C'[I - X(X'X)^{-1}X']C$$
$$= \sigma^2 C'MC = \sigma^2 C'MMC = \sigma^2 C'M'MC = \sigma^2(MC)'(MC) = \sigma^2 D'D = \sigma^2\sum_{t=1}^{n} D_t D_t',$$
which is p.s.d., where $D = MC$ and we have used the fact that for any real-valued matrix $D$, the square matrix $D'D$ is always p.s.d. [Question: How do we show this?]
(v) Now we show $E[e'e/(n-K)|X] = \sigma^2$. Because $e'e = \varepsilon'M\varepsilon$ and $\mathrm{tr}(AB) = \mathrm{tr}(BA)$, we have
$$E(e'e|X) = E(\varepsilon'M\varepsilon|X) = E[\mathrm{tr}(\varepsilon'M\varepsilon)|X] \quad [\text{putting } A = \varepsilon'M, B = \varepsilon]$$
$$= E[\mathrm{tr}(\varepsilon\varepsilon'M)|X] = \mathrm{tr}[E(\varepsilon\varepsilon'|X)M] = \mathrm{tr}(\sigma^2 IM) = \sigma^2\mathrm{tr}(M) = \sigma^2(n-K),$$
where
$$\mathrm{tr}(M) = \mathrm{tr}(I_n) - \mathrm{tr}[X(X'X)^{-1}X'] = \mathrm{tr}(I_n) - \mathrm{tr}[(X'X)^{-1}X'X] = n - K,$$
using $\mathrm{tr}(AB) = \mathrm{tr}(BA)$ again. It follows that
$$E(s^2|X) = \frac{E(e'e|X)}{n-K} = \frac{\sigma^2(n-K)}{n-K} = \sigma^2.$$
This completes the proof. Note that the sample residual variance $s^2 = e'e/(n-K)$ is a generalization of the sample variance $S_n^2 = (n-1)^{-1}\sum_{t=1}^{n}(Y_t - \bar{Y})^2$ for the random sample $\{Y_t\}_{t=1}^n$.
3.5 The Sampling Distribution of $\hat{\beta}$

To obtain the finite sample distribution of $\hat{\beta}$, we impose the normality assumption on $\varepsilon$.

Assumption 3.5: $\varepsilon|X \sim N(0, \sigma^2 I)$.

Remarks:
(i) Under Assumption 3.5, the conditional pdf of $\varepsilon$ given $X$ is
$$f(\varepsilon|X) = \frac{1}{(\sqrt{2\pi\sigma^2})^n}\exp\left(-\frac{\varepsilon'\varepsilon}{2\sigma^2}\right) = f(\varepsilon),$$
which does not depend on $X$; that is, the disturbance $\varepsilon$ is independent of $X$. Thus, every conditional moment of $\varepsilon$ given $X$ does not depend on $X$.
(ii) Assumption 3.5 implies Assumptions 3.2 ($E(\varepsilon|X) = 0$) and 3.4 ($E(\varepsilon\varepsilon'|X) = \sigma^2 I$).
(iii) The normality assumption may be reasonable for observations that are computed as the averages of the outcomes of many repeated experiments, due to the effect of the so-called central limit theorem (CLT). This may occur in physics, for example. In economics, the normality assumption may not always be reasonable. For example, many high-frequency financial time series usually display heavy tails (with kurtosis larger than 3).
Question: What is the sampling distribution of $\hat{\beta}$?

Sampling distribution of $\hat{\beta}$:
$$\hat{\beta} - \beta^0 = (X'X)^{-1}X'\varepsilon = (X'X)^{-1}\sum_{t=1}^{n} X_t\varepsilon_t = \sum_{t=1}^{n} C_t\varepsilon_t,$$
where the $K \times 1$ weighting vector
$$C_t = (X'X)^{-1}X_t$$
is called the leverage of observation $X_t$.
Theorem [Normality of $\hat{\beta}$]: Under Assumptions 3.1, 3.3 and 3.5,
$$(\hat{\beta} - \beta^0)|X \sim N[0, \sigma^2(X'X)^{-1}].$$

Proof: Conditional on $X$, $\hat{\beta} - \beta^0$ is a weighted sum of independent normal random variables $\{\varepsilon_t\}$, and so is also normally distributed.

The corollary below follows immediately.

Corollary [Normality of $R(\hat{\beta} - \beta^0)$]: Suppose Assumptions 3.1, 3.3 and 3.5 hold. Then for any nonstochastic $J \times K$ matrix $R$, we have
$$R(\hat{\beta} - \beta^0)|X \sim N[0, \sigma^2 R(X'X)^{-1}R'].$$
Question: What is the role of the nonstochastic matrix $R$?
Answer: $R$ is a selection matrix. For example, when $R = (1, 0, \ldots, 0)$, a $1 \times K$ row vector, we have
$$R(\hat{\beta} - \beta^0) = \hat{\beta}_0 - \beta_0^0.$$

Proof: Conditional on $X$, $\hat{\beta} - \beta^0$ is normally distributed. Therefore, conditional on $X$, the linear combination $R(\hat{\beta} - \beta^0)$ is also normally distributed, with
$$E[R(\hat{\beta} - \beta^0)|X] = RE[(\hat{\beta} - \beta^0)|X] = R\cdot 0 = 0$$
and
$$\mathrm{var}[R(\hat{\beta} - \beta^0)|X] = E\left[R(\hat{\beta} - \beta^0)(R(\hat{\beta} - \beta^0))'|X\right] = E\left[R(\hat{\beta} - \beta^0)(\hat{\beta} - \beta^0)'R'|X\right] = RE\left[(\hat{\beta} - \beta^0)(\hat{\beta} - \beta^0)'|X\right]R' = R\,\mathrm{var}(\hat{\beta}|X)R' = \sigma^2 R(X'X)^{-1}R'.$$
It follows that
$$R(\hat{\beta} - \beta^0)|X \sim N(0, \sigma^2 R(X'X)^{-1}R').$$
Question: Why would we like to know the sampling distribution of $R(\hat{\beta} - \beta^0)$?

Answer: This is mainly for confidence interval estimation and hypothesis testing.

Remark: However, $\mathrm{var}[R(\hat{\beta} - \beta^0)|X] = \sigma^2 R(X'X)^{-1}R'$ is unknown, because $\mathrm{var}(\varepsilon_t) = \sigma^2$ is unknown. We need to estimate $\sigma^2$.
3.6 Estimation of the Covariance Matrix of $\hat{\beta}$

Question: How do we estimate $\mathrm{var}(\hat{\beta}|X) = \sigma^2(X'X)^{-1}$?

Answer: To estimate $\sigma^2$, we can use the residual variance estimator
$$s^2 = e'e/(n-K).$$

Theorem [Residual Variance Estimator]: Suppose Assumptions 3.1, 3.3 and 3.5 hold. Then for all $n > K$:
(i)
$$\frac{(n-K)s^2}{\sigma^2}\Big|X = \frac{e'e}{\sigma^2}\Big|X \sim \chi^2_{n-K};$$
(ii) conditional on $X$, $s^2$ and $\hat{\beta}$ are independent.
Remarks:
(i) Question: What is a $\chi^2_q$ distribution?

Definition [Chi-square Distribution, $\chi^2_q$]: Suppose $\{Z_i\}_{i=1}^q$ are i.i.d. N(0,1) random variables. Then the random variable
$$\chi^2 = \sum_{i=1}^{q} Z_i^2$$
follows a $\chi^2_q$ distribution.

Properties of $\chi^2_q$:
- $E(\chi^2_q) = q$;
- $\mathrm{var}(\chi^2_q) = 2q$.
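Both properties follow directly from the definition and are easy to check by simulation. A sketch assuming NumPy; the degrees of freedom and replication count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
q, B = 5, 200_000

# chi2_q draws built from the definition: sums of q squared N(0,1) variables.
Z = rng.normal(size=(B, q))
chi2 = np.sum(Z ** 2, axis=1)

print(chi2.mean())   # close to q = 5
print(chi2.var())    # close to 2q = 10
```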
Remarks:
(i) Part (i) of the theorem implies
$$E\left[\frac{(n-K)s^2}{\sigma^2}\Big|X\right] = n-K, \quad \text{i.e.,} \quad \frac{n-K}{\sigma^2}E(s^2|X) = n-K.$$
It follows that $E(s^2|X) = \sigma^2$.
(ii) Part (i) of the theorem also implies
$$\mathrm{var}\left[\frac{(n-K)s^2}{\sigma^2}\Big|X\right] = 2(n-K), \quad \text{i.e.,} \quad \frac{(n-K)^2}{\sigma^4}\mathrm{var}(s^2|X) = 2(n-K),$$
so that
$$\mathrm{var}(s^2|X) = \frac{2\sigma^4}{n-K} \to 0$$
as $n \to \infty$.
(iii) Results (i) and (ii) together imply that the conditional MSE of $s^2$,
$$MSE(s^2|X) = E\left[(s^2 - \sigma^2)^2|X\right] = \mathrm{var}(s^2|X) + [E(s^2|X) - \sigma^2]^2 \to 0.$$
Thus, $s^2$ is a good estimator for $\sigma^2$.
(iv) The independence between $s^2$ and $\hat{\beta}$ is crucial for obtaining the sampling distributions of the popular t-test and F-test statistics, which will be introduced shortly.
Proof of Theorem: (i) Because $e = M\varepsilon$, we have
$$\frac{e'e}{\sigma^2} = \frac{\varepsilon'M\varepsilon}{\sigma^2} = \left(\frac{\varepsilon}{\sigma}\right)'M\left(\frac{\varepsilon}{\sigma}\right).$$
In addition, because $\varepsilon|X \sim N(0, \sigma^2 I_n)$, and $M$ is an idempotent matrix with rank $n-K$ (as shown before), we have the quadratic form
$$\frac{e'e}{\sigma^2} = \frac{\varepsilon'M\varepsilon}{\sigma^2}\Big|X \sim \chi^2_{n-K}$$
by the following lemma.

Lemma [Quadratic Form of Normal Random Variables]: If $v \sim N(0, I_n)$ and $Q$ is an $n \times n$ nonstochastic symmetric idempotent matrix with rank $q \leq n$, then the quadratic form
$$v'Qv \sim \chi^2_q.$$
In our application, we have $v = \varepsilon/\sigma \sim N(0, I)$ and $Q = M$. Since $\mathrm{rank}(M) = n-K$, we have
$$\frac{e'e}{\sigma^2}\Big|X \sim \chi^2_{n-K}.$$
(ii) Next, we show that $s^2$ and $\hat{\beta}$ are independent. Because $s^2 = e'e/(n-K)$ is a function of $e$, it suffices to show that $e$ and $\hat{\beta}$ are independent. This follows immediately because $e$ and $\hat{\beta}$ are jointly normally distributed and they are uncorrelated. It is well known that for a joint normal distribution, zero correlation is equivalent to independence.
Question: Why are $e$ and $\hat{\beta}$ jointly normally distributed?
Answer: We can write
$$\begin{bmatrix} e \\ \hat{\beta} - \beta^0 \end{bmatrix} = \begin{bmatrix} M\varepsilon \\ (X'X)^{-1}X'\varepsilon \end{bmatrix} = \begin{bmatrix} M \\ (X'X)^{-1}X' \end{bmatrix}\varepsilon.$$
Because $\varepsilon|X \sim N(0, \sigma^2 I)$, the linear transformation
$$\begin{bmatrix} M \\ (X'X)^{-1}X' \end{bmatrix}\varepsilon$$
is also normally distributed conditional on $X$.

Question: Why are these sampling distributions useful in practice?

Answer: They are useful in confidence interval estimation and hypothesis testing.
3.7 Hypothesis Testing

We now use the sampling distributions of $\hat{\beta}$ and $s^2$ to develop test procedures for hypotheses of interest. We consider testing a linear hypothesis of the form
$$H_0: R\beta^0 = r,$$
$$(J \times K)(K \times 1) = J \times 1,$$
where $R$ is called the selection matrix, and $J$ is the number of restrictions. We assume $J \leq K$.

Remark: We will test $H_0$ under correct model specification for $E(Y_t|X_t)$.
Motivations

Example 1 [Reforms Have No Effect]: Consider the extended production function
$$\ln(Y_t) = \beta_0 + \beta_1\ln(L_t) + \beta_2\ln(K_t) + \beta_3 AU_t + \beta_4 PS_t + \varepsilon_t,$$
where $AU_t$ is a dummy variable indicating whether firm $t$ is granted autonomy, and $PS_t$ is the profit share of firm $t$ with the state.
Suppose we are interested in testing whether autonomy $AU_t$ has an effect on productivity. Then we can write the null hypothesis
$$H_0^a: \beta_3^0 = 0.$$
This is equivalent to the choice of
$$\beta^0 = (\beta_0, \beta_1, \beta_2, \beta_3, \beta_4)', \quad R = (0, 0, 0, 1, 0), \quad r = 0.$$
If we are interested in testing whether profit sharing has an effect on productivity, we can consider the null hypothesis
$$H_0^b: \beta_4^0 = 0.$$
Alternatively, to test whether the production technology exhibits constant returns to scale (CRS), we can write the null hypothesis as
$$H_0^c: \beta_1^0 + \beta_2^0 = 1.$$
This is equivalent to the choice of $R = (0, 1, 1, 0, 0)$ and $r = 1$.
Finally, if we are interested in examining the joint effect of both autonomy and profit sharing, we can test the hypothesis that neither autonomy nor profit sharing has an impact:
$$H_0^d: \beta_3^0 = \beta_4^0 = 0.$$
This is equivalent to the choice of
$$R = \begin{bmatrix} 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}, \quad r = \begin{bmatrix} 0 \\ 0 \end{bmatrix}.$$
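The selection matrices above translate directly into arrays. A sketch assuming NumPy, with a hypothetical coefficient vector (all numerical values are made up for illustration):

```python
import numpy as np

# Hypothetical coefficient vector (beta_0, ..., beta_4) for the production function.
beta = np.array([0.3, 0.6, 0.4, 0.0, 0.1])

# H0^a: beta_3 = 0  ->  R is 1 x 5, r = 0.
R_a, r_a = np.array([[0, 0, 0, 1, 0]]), np.array([0.0])

# H0^c: beta_1 + beta_2 = 1  ->  R = (0, 1, 1, 0, 0), r = 1.
R_c, r_c = np.array([[0, 1, 1, 0, 0]]), np.array([1.0])

# H0^d: beta_3 = beta_4 = 0  ->  R is 2 x 5, r = (0, 0)'.
R_d = np.array([[0, 0, 0, 1, 0],
                [0, 0, 0, 0, 1]])
r_d = np.zeros(2)

print(R_a @ beta - r_a)   # zero: H0^a holds for this beta
print(R_c @ beta - r_c)   # zero: CRS holds (0.6 + 0.4 = 1)
print(R_d @ beta - r_d)   # second entry nonzero: H0^d is violated
```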
Example 3 [Optimal Predictor for the Future Spot Exchange Rate]: Consider
$$S_{t+\tau} = \beta_0 + \beta_1 F_t(\tau) + \varepsilon_{t+\tau}, \quad t = 1, \ldots, n,$$
where $S_{t+\tau}$ is the spot exchange rate at period $t+\tau$, and $F_t(\tau)$ is period $t$'s price for the foreign currency to be delivered at period $t+\tau$. The null hypothesis of interest is that the forward exchange rate $F_t(\tau)$ is an optimal predictor for the future spot rate $S_{t+\tau}$, in the sense that $E(S_{t+\tau}|I_t) = F_t(\tau)$, where $I_t$ is the information set available at time $t$. This is actually called the expectations hypothesis in economics and finance. Given the above specification, this hypothesis can be written as
$$H_0^e: \beta_0^0 = 0, \quad \beta_1^0 = 1,$$
and $E(\varepsilon_{t+\tau}|I_t) = 0$. This is equivalent to the choice of
$$R = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad r = \begin{bmatrix} 0 \\ 1 \end{bmatrix}.$$

Remark: All the examples considered above can be formulated with a suitable specification of $R$, where $R$ is a $J \times K$ matrix in the null hypothesis
$$H_0: R\beta^0 = r,$$
where $r$ is a $J \times 1$ vector.

Basic Idea of Hypothesis Testing
To test the null hypothesis
$$H_0: R\beta^0 = r,$$
we can consider the statistic
$$R\hat{\beta} - r$$
and check whether this difference is significantly different from zero.
Under $H_0: R\beta^0 = r$, we have
$$R\hat{\beta} - r = R\hat{\beta} - R\beta^0 = R(\hat{\beta} - \beta^0) \to 0 \text{ as } n \to \infty,$$
because $\hat{\beta} - \beta^0 \to 0$ as $n \to \infty$ in terms of MSE.
Under the alternative to $H_0$, $R\beta^0 \neq r$, but we still have $\hat{\beta} - \beta^0 \to 0$ in terms of MSE. It follows that
$$R\hat{\beta} - r = R(\hat{\beta} - \beta^0) + R\beta^0 - r \to R\beta^0 - r \neq 0$$
as $n \to \infty$, where the convergence is in terms of MSE. In other words, $R\hat{\beta} - r$ will converge to a nonzero limit, $R\beta^0 - r$.

Remark: The behavior of $R\hat{\beta} - r$ differs under $H_0$ and under the alternative hypothesis to $H_0$. Therefore, we can test $H_0$ by examining whether $R\hat{\beta} - r$ is significantly different from zero.

Question: How large should the magnitude of the difference $R\hat{\beta} - r$ be in order to claim that it is significantly different from zero?

Answer: We need a decision rule that specifies a threshold value with which we can compare the (absolute) value of $R\hat{\beta} - r$.
Note that $R\hat{\beta} - r$ is a random variable, and so it can take many (possibly an infinite number of) values. Given a data set, we only obtain one realization of $R\hat{\beta} - r$. Whether a realization of $R\hat{\beta} - r$ is close to zero should be judged using the critical value of its sampling distribution, which depends on the sample size $n$ and the significance level $\alpha \in (0,1)$ one preselects.
Question: What is the sampling distribution of $R\hat{\beta} - r$ under $H_0$?

Because
$$R(\hat{\beta} - \beta^0)|X \sim N(0, \sigma^2 R(X'X)^{-1}R'),$$
we have that, conditional on $X$,
$$R\hat{\beta} - r = R(\hat{\beta} - \beta^0) + R\beta^0 - r \sim N(R\beta^0 - r, \sigma^2 R(X'X)^{-1}R').$$

Corollary: Under Assumptions 3.1, 3.3 and 3.5, and $H_0: R\beta^0 = r$, we have for each $n > K$,
$$(R\hat{\beta} - r)|X \sim N(0, \sigma^2 R(X'X)^{-1}R').$$

Remarks:
(i) $\mathrm{var}(R\hat{\beta}|X) = \sigma^2 R(X'X)^{-1}R'$ is a $J \times J$ matrix.
(ii) However, $R\hat{\beta} - r$ cannot be used directly as a test statistic for $H_0$, because $\sigma^2$ is unknown, and so there is no way to calculate the critical values of the sampling distribution of $R\hat{\beta} - r$.
Question: How can we construct a feasible (i.e., computable) test statistic?

The form of the test statistic differs depending on whether $J = 1$ or $J > 1$. We first consider the case of $J = 1$.
CASE I: t-Test ($J = 1$)

Recall that we have

$$(R\hat\beta - r)|X \sim N(0, \sigma^2 R(X'X)^{-1}R').$$

When $J = 1$, the conditional variance

$$var[(R\hat\beta - r)|X] = \sigma^2 R(X'X)^{-1}R'$$

is a scalar ($1 \times 1$). It follows that, conditional on $X$,

$$\frac{R\hat\beta - r}{\sqrt{var[(R\hat\beta - r)|X]}} = \frac{R\hat\beta - r}{\sqrt{\sigma^2 R(X'X)^{-1}R'}} \sim N(0,1).$$
Question: What is the unconditional distribution of

$$\frac{R\hat\beta - r}{\sqrt{\sigma^2 R(X'X)^{-1}R'}}?$$

Answer: The unconditional distribution is also $N(0,1)$.

However, $\sigma^2$ is unknown, so we cannot use this ratio as a test statistic. We have to replace $\sigma^2$ by $s^2$, which is a good estimator for $\sigma^2$. This gives a feasible (i.e., computable) test statistic

$$T = \frac{R\hat\beta - r}{\sqrt{s^2 R(X'X)^{-1}R'}}.$$

However, the test statistic $T$ is no longer normally distributed. Instead,

$$T = \frac{R\hat\beta - r}{\sqrt{s^2 R(X'X)^{-1}R'}}
= \frac{(R\hat\beta - r)\Big/\sqrt{\sigma^2 R(X'X)^{-1}R'}}{\sqrt{\dfrac{(n-K)s^2}{\sigma^2}\Big/(n-K)}}
\sim \frac{N(0,1)}{\sqrt{\chi^2_{n-K}\big/(n-K)}} \sim t_{n-K},$$

where $t_{n-K}$ denotes the Student t-distribution with $n-K$ degrees of freedom. Note that the numerator and denominator are mutually independent conditional on $X$, because $\hat\beta$ and $s^2$ are mutually independent conditional on $X$.
Remarks:
(i) The feasible statistic $T$ is called a t-test statistic.
(ii) What is a Student $t_q$ distribution?

Definition [Student's t-distribution]: Suppose $Z \sim N(0,1)$ and $V \sim \chi^2_q$, and $Z$ and $V$ are independent. Then the ratio

$$\frac{Z}{\sqrt{V/q}} \sim t_q.$$
(iii) When $q \to \infty$, $t_q \stackrel{d}{\to} N(0,1)$, where $\stackrel{d}{\to}$ denotes convergence in distribution. This implies that

$$T = \frac{R\hat\beta - r}{\sqrt{s^2 R(X'X)^{-1}R'}} \stackrel{d}{\to} N(0,1) \text{ as } n \to \infty.$$

This result has a very important implication in practice: for a large sample size $n$, it makes no difference whether one uses the critical values from $t_{n-K}$ or from $N(0,1)$.

[Plots of Student t-distributions omitted.]
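The practical equivalence of $t_{n-K}$ and $N(0,1)$ critical values for large $n$ can be checked directly. A minimal sketch in Python using `scipy.stats` (the degrees-of-freedom values are arbitrary choices for illustration):

```python
from scipy.stats import norm, t

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)       # two-sided N(0,1) critical value, about 1.96
for df in (5, 30, 120, 1000):
    t_crit = t.ppf(1 - alpha / 2, df)  # two-sided t_{n-K} critical value
    print(df, round(t_crit, 3))

# The t critical value shrinks toward the normal one as the degrees of freedom grow.
assert t.ppf(0.975, 5) > t.ppf(0.975, 1000) > norm.ppf(0.975)
```

For small $n - K$ the t critical value is visibly larger (heavier tails), so using $N(0,1)$ there would over-reject.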
Question: What is meant by convergence in distribution?

Definition [Convergence in Distribution]: Suppose $\{Z_n, n = 1, 2, \dots\}$ is a sequence of random variables/vectors with distribution functions $F_n(z) = P(Z_n \leq z)$, and $Z$ is a random variable/vector with distribution function $F(z) = P(Z \leq z)$. We say that $Z_n$ converges to $Z$ in distribution if the distribution of $Z_n$ converges to the distribution of $Z$ at all continuity points, namely

$$\lim_{n\to\infty} F_n(z) = F(z), \quad \text{or} \quad F_n(z) \to F(z) \text{ as } n \to \infty,$$

for any continuity point $z$ (i.e., any point at which $F(z)$ is continuous). We use the notation $Z_n \stackrel{d}{\to} Z$. The distribution of $Z$ is called the asymptotic or limiting distribution of $Z_n$.
Implication: In practice, $Z_n$ is a test statistic or a parameter estimator, and its sampling distribution $F_n(z)$ is either unknown or very complicated, while $F(z)$ is known or very simple. As long as $Z_n \stackrel{d}{\to} Z$, we can use $F(z)$ as an approximation to $F_n(z)$. This gives a convenient procedure for statistical inference. The potential cost is that the approximation of $F_n(z)$ by $F(z)$ may not be good enough in finite samples (i.e., when $n$ is finite). How good the approximation is depends on the data generating process and the sample size $n$.

Example: Suppose $\{\varepsilon_n, n = 1, 2, \dots\}$ is an i.i.d. sequence with distribution function $F(z)$. Let $\varepsilon$ be a random variable with the same distribution function $F(z)$. Then $\varepsilon_n \stackrel{d}{\to} \varepsilon$.
With the sampling distribution of the test statistic $T$ in hand, we can now describe a decision rule for testing $H_0$ when $J = 1$.

Decision Rule of the t-Test Based on Critical Values

(i) Reject $H_0 : R\beta^0 = r$ at a prespecified significance level $\alpha \in (0,1)$ if

$$|T| > C_{t_{n-K}, \alpha/2},$$

where $C_{t_{n-K}, \alpha/2}$ is the so-called upper-tailed critical value of the $t_{n-K}$ distribution at level $\alpha/2$, determined by

$$P\big[t_{n-K} > C_{t_{n-K}, \alpha/2}\big] = \frac{\alpha}{2},$$

or equivalently

$$P\big[|t_{n-K}| > C_{t_{n-K}, \alpha/2}\big] = \alpha.$$

(ii) Accept $H_0$ at the significance level $\alpha$ if

$$|T| \leq C_{t_{n-K}, \alpha/2}.$$
Remarks:
(a) In testing $H_0$, there exist two types of errors, due to the limited information about the population in a given random sample $\{Z_t\}_{t=1}^n$. One possibility is that $H_0$ is true but we reject it. This is called the "Type I error." The significance level $\alpha$ is the probability of making a Type I error. If

$$P\big[|T| > C_{t_{n-K}, \alpha/2} \,\big|\, H_0\big] = \alpha,$$

we say that the decision rule is a test with size $\alpha$.

(b) The probability $P[|T| > C_{t_{n-K}, \alpha/2}]$ is called the power function of a size-$\alpha$ test. When

$$P\big[|T| > C_{t_{n-K}, \alpha/2} \,\big|\, H_0 \text{ is false}\big] < 1,$$

there exists a possibility that one may accept $H_0$ when it is false. This is called the "Type II error."

(c) Ideally one would like to minimize both the Type I error and the Type II error, but this is impossible for any given finite sample. In practice, one usually presets the level of Type I error, the so-called significance level, and then minimizes the Type II error. Conventional choices for the significance level $\alpha$ are 10%, 5% and 1%.
Next, we describe an alternative decision rule for testing $H_0$ when $J = 1$, using the so-called p-value of the test statistic $T$.

An Equivalent Decision Rule Based on p-values

Given a data set $z^n = \{y_t, x_t'\}_{t=1}^n$, which is a realization of the random sample $Z^n = \{Y_t, X_t'\}_{t=1}^n$, we can compute a realization (i.e., a number) of the t-test statistic $T$, namely

$$T(z^n) = \frac{R\hat\beta - r}{\sqrt{s^2 R(x'x)^{-1}R'}}.$$

Then the probability

$$p(z^n) = P\big[|t_{n-K}| > |T(z^n)|\big]$$

is called the p-value (i.e., probability value) of the test statistic $T$ given that $\{Y_t, X_t'\}_{t=1}^n = \{y_t, x_t'\}_{t=1}^n$ is observed, where $t_{n-K}$ is a Student t random variable with $n-K$ degrees of freedom, and $T(z^n)$ is a realization of the test statistic $T = T(Z^n)$. Intuitively, the p-value is the tail probability that the absolute value of a Student $t_{n-K}$ random variable takes values larger than the absolute value of the test statistic $T(z^n)$. If this probability is very small relative to the significance level, then it is unlikely that the test statistic $T(Z^n)$ follows a Student $t_{n-K}$ distribution. As a consequence, the null hypothesis is likely to be false.
The above decision rule can be described equivalently as follows:

Decision Rule Based on the p-value
(i) Reject $H_0$ at the significance level $\alpha$ if $p(z^n) < \alpha$.
(ii) Accept $H_0$ at the significance level $\alpha$ if $p(z^n) \geq \alpha$.

Question: What are the advantages/disadvantages of using p-values versus using critical values?

Remarks:
(i) p-values are more informative than merely rejecting or accepting the null hypothesis at some significance level. A p-value is the smallest significance level at which the null hypothesis can be rejected.
(ii) Most statistical software reports p-values of parameter estimates.
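The two decision rules above can be sketched numerically. The following Python fragment simulates a data set under $H_0$: the slope is zero (the design and all names are illustrative assumptions, not part of the notes), computes $T$ and its p-value by hand, and confirms that the p-value rule and the critical-value rule reach the same decision:

```python
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(0)
n, K = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.0]) + rng.normal(size=n)   # true slope is 0, so H0 holds

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)        # OLS estimator
e = y - X @ beta_hat
s2 = e @ e / (n - K)                                # s^2
XtX_inv = np.linalg.inv(X.T @ X)

R, r = np.array([0.0, 1.0]), 0.0                    # H0: slope = 0 (J = 1)
T = (R @ beta_hat - r) / np.sqrt(s2 * R @ XtX_inv @ R)
p_value = 2 * (1 - t_dist.cdf(abs(T), n - K))       # p(z^n) = P[|t_{n-K}| > |T(z^n)|]

alpha = 0.05
crit = t_dist.ppf(1 - alpha / 2, n - K)
# The two decision rules are equivalent by construction
assert (abs(T) > crit) == bool(p_value < alpha)
```

Rerunning with a different seed changes $T$ and $p(z^n)$, but never the equivalence of the two rules.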
Examples of t-Tests

Example 1 [Reforms have no effects (continued)]: We first consider testing the null hypothesis

$$H_0^a : \beta_3 = 0,$$

where $\beta_3$ is the coefficient of the autonomy dummy $AU_t$ in the extended production function regression model. This is equivalent to the selection of $R = (0, 0, 0, 1, 0)$ and $r = 0$. In this case, we have

$$s^2 R(X'X)^{-1}R' = \big[s^2(X'X)^{-1}\big]_{(4,4)} = S^2_{\hat\beta_3},$$

which is the estimator of $var(\hat\beta_3|X)$. The t-test statistic is

$$T = \frac{R\hat\beta - r}{\sqrt{s^2 R(X'X)^{-1}R'}} = \frac{\hat\beta_3}{\sqrt{S^2_{\hat\beta_3}}} \sim t_{n-K}.$$
Next, we consider testing the CRS hypothesis

$$H_0^c : \beta_1 + \beta_2 = 1,$$

which corresponds to $R = (0, 1, 1, 0, 0)$ and $r = 1$. In this case,

$$s^2 R(X'X)^{-1}R' = S^2_{\hat\beta_1} + S^2_{\hat\beta_2} + 2\,\widehat{cov}(\hat\beta_1, \hat\beta_2)
= \big[s^2(X'X)^{-1}\big]_{(2,2)} + \big[s^2(X'X)^{-1}\big]_{(3,3)} + 2\big[s^2(X'X)^{-1}\big]_{(2,3)}
= S^2_{\hat\beta_1+\hat\beta_2},$$

which is the estimator of $var(\hat\beta_1 + \hat\beta_2|X)$. Here, $\widehat{cov}(\hat\beta_1, \hat\beta_2)$ is the estimator of $cov(\hat\beta_1, \hat\beta_2|X)$, the covariance between $\hat\beta_1$ and $\hat\beta_2$ conditional on $X$.

The t-test statistic is

$$T = \frac{R\hat\beta - r}{\sqrt{s^2 R(X'X)^{-1}R'}} = \frac{\hat\beta_1 + \hat\beta_2 - 1}{S_{\hat\beta_1+\hat\beta_2}} \sim t_{n-K}.$$
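The variance expansion used here, the sum of the two variances plus twice the covariance, can be checked numerically with the quadratic form $R \hat{V} R'$. A Python sketch under an assumed CRS data generating process (the simulated design, coefficients, and names are illustrative only):

```python
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(1)
n = 100
logL, logK = rng.normal(size=n), rng.normal(size=n)
# Illustrative CRS technology: beta1 + beta2 = 0.6 + 0.4 = 1 holds in the DGP
y = 0.5 + 0.6 * logL + 0.4 * logK + 0.1 * rng.normal(size=n)

X = np.column_stack([np.ones(n), logL, logK])
K = X.shape[1]
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
s2 = e @ e / (n - K)
V_hat = s2 * np.linalg.inv(X.T @ X)        # s^2 (X'X)^{-1}

R, r = np.array([0.0, 1.0, 1.0]), 1.0      # H0: beta1 + beta2 = 1
# R V_hat R' equals var(b1) + var(b2) + 2 cov(b1, b2)
se = np.sqrt(R @ V_hat @ R)
T = (beta_hat[1] + beta_hat[2] - r) / se
p_value = 2 * (1 - t_dist.cdf(abs(T), n - K))
```

Writing the restriction through $R$ avoids assembling the variance and covariance terms by hand, and generalizes to any single linear restriction.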
CASE II: F-Test ($J > 1$)

Question: What happens if $J > 1$? We first state a useful lemma.

Lemma: If $Z \sim N(0, V)$, where $V = var(Z)$ is a nonsingular $J \times J$ variance-covariance matrix, then

$$Z'V^{-1}Z \sim \chi^2_J.$$
Proof: Because $V$ is symmetric and positive definite, we can find a symmetric and invertible matrix $V^{1/2}$ such that

$$V^{1/2}V^{1/2} = V, \qquad V^{-1/2}V^{-1/2} = V^{-1}.$$

(Question: What is this decomposition called?) Now, define

$$Y = V^{-1/2}Z.$$

Then we have $E(Y) = 0$, and

$$var(Y) = E\{[Y - E(Y)][Y - E(Y)]'\} = E(YY')
= E(V^{-1/2}ZZ'V^{-1/2})
= V^{-1/2}E(ZZ')V^{-1/2}
= V^{-1/2}VV^{-1/2}
= V^{-1/2}V^{1/2}V^{1/2}V^{-1/2}
= I.$$

It follows that $Y \sim N(0, I)$. Therefore, we have

$$Y'Y \sim \chi^2_J.$$
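The two square-root identities used in this proof can be verified numerically via the spectral (eigenvalue) decomposition, one standard way to construct the symmetric square root $V^{1/2}$. A small Python check with an arbitrary positive definite $V$ (chosen only for illustration):

```python
import numpy as np

V = np.array([[2.0, 0.5],
              [0.5, 1.0]])               # symmetric positive definite

# Symmetric square root via the spectral decomposition V = Q diag(w) Q'
w, Q = np.linalg.eigh(V)
V_half = Q @ np.diag(np.sqrt(w)) @ Q.T
V_neg_half = np.linalg.inv(V_half)

assert np.allclose(V_half @ V_half, V)                          # V^{1/2} V^{1/2} = V
assert np.allclose(V_neg_half @ V_neg_half, np.linalg.inv(V))   # V^{-1/2} V^{-1/2} = V^{-1}
assert np.allclose(V_neg_half @ V @ V_neg_half, np.eye(2))      # var(V^{-1/2} Z) = I
```

The last assertion is exactly the standardization step of the proof: $V^{-1/2}Z$ has identity covariance.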
Applying this lemma, and using the result that

$$(R\hat\beta - r)|X \sim N\big[0, \sigma^2 R(X'X)^{-1}R'\big]$$

under $H_0$, we have the quadratic form

$$(R\hat\beta - r)'\big[\sigma^2 R(X'X)^{-1}R'\big]^{-1}(R\hat\beta - r) \sim \chi^2_J$$

conditional on $X$, or

$$\frac{(R\hat\beta - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r)}{\sigma^2} \sim \chi^2_J$$

conditional on $X$.

Because the $\chi^2_J$ distribution does not depend on $X$, we also have

$$\frac{(R\hat\beta - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r)}{\sigma^2} \sim \chi^2_J$$

unconditionally.

As in constructing the t-test statistic, we should replace $\sigma^2$ by $s^2$ on the left-hand side:

$$\frac{(R\hat\beta - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r)}{s^2}.$$

The replacement of $\sigma^2$ by $s^2$ renders the distribution of the quadratic form no longer Chi-squared. Instead, after proper scaling, the quadratic form follows a so-called F-distribution with degrees of freedom $(J, n-K)$.
Question: Why?

Observe

$$\frac{(R\hat\beta - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r)}{s^2}
= J \cdot \frac{\Big[(R\hat\beta - r)'[\sigma^2 R(X'X)^{-1}R']^{-1}(R\hat\beta - r)\Big]\Big/J}{\dfrac{(n-K)s^2}{\sigma^2}\Big/(n-K)}
\sim J \cdot \frac{\chi^2_J/J}{\chi^2_{n-K}/(n-K)} \sim J \cdot F_{J,n-K},$$

where $F_{J,n-K}$ denotes the F-distribution with degrees of freedom $J$ and $n-K$.
Question: What is an $F_{J,n-K}$ distribution?

Definition: Suppose $U \sim \chi^2_p$ and $V \sim \chi^2_q$, and $U$ and $V$ are independent. Then the ratio

$$\frac{U/p}{V/q} \sim F_{p,q}$$

is said to follow an $F_{p,q}$ distribution with degrees of freedom $(p, q)$.

Question: Why is it called the F-distribution?

Answer: It is named after R. A. Fisher, a well-known statistician of the 20th century.

[Plots of F-distributions omitted.]
Remarks: Properties of $F_{p,q}$:
(i) If $F \sim F_{p,q}$, then $F^{-1} \sim F_{q,p}$.
(ii) $t^2_q \sim F_{1,q}$, because

$$t^2_q = \frac{\chi^2_1/1}{\chi^2_q/q} \sim F_{1,q}.$$

(iii) Given any fixed integer $p$, $p \cdot F_{p,q} \stackrel{d}{\to} \chi^2_p$ as $q \to \infty$.

Relationship (ii) implies that when $J = 1$, the t-test and the F-test deliver the same conclusion. Relationship (iii) implies that the conclusions based on $F_{p,q}$ and on $pF_{p,q}$ using the $\chi^2_p$ approximation will be approximately the same when $q$ is sufficiently large.
We can now define the following F-test statistic:

$$F \equiv \frac{(R\hat\beta - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r)/J}{s^2} \sim F_{J,n-K}.$$

Theorem: Suppose Assumptions 3.1, 3.3 and 3.5 hold. Then under $H_0 : R\beta^0 = r$, we have

$$F = \frac{(R\hat\beta - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r)/J}{s^2} \sim F_{J,n-K}$$

for all $n > K$.
A practical issue now is how to compute the F-statistic. One can of course compute it directly from the definition of $F$. However, there is a much more convenient way, which we now introduce.
Alternative Expression for the F-Test Statistic

Theorem: Suppose Assumptions 3.1 and 3.3 hold. Let $SSR_u = e'e$ be the sum of squared residuals from the unrestricted model

$$Y = X\beta^0 + \varepsilon.$$

Let $SSR_r = \tilde{e}'\tilde{e}$ be the sum of squared residuals from the restricted model

$$Y = X\beta^0 + \varepsilon \quad \text{subject to} \quad R\beta^0 = r,$$

where $\tilde\beta$ is the restricted OLS estimator. Then under $H_0$, the F-test statistic can be written as

$$F = \frac{(\tilde{e}'\tilde{e} - e'e)/J}{e'e/(n-K)} \sim F_{J,n-K}.$$

Remark: This is a convenient test statistic! One only needs to compute the sums of squared residuals in order to compute the F-test statistic.
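The algebraic equivalence of the quadratic-form and SSR expressions for $F$ can be confirmed numerically. A Python sketch (the design matrix, coefficients, and hypothesis are illustrative assumptions; the restricted model under $H_0: \beta_1 = \beta_2 = 0$ is intercept-only):

```python
import numpy as np

rng = np.random.default_rng(2)
n, K, J = 60, 3, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.0, 0.0]) + rng.normal(size=n)

R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])          # H0: beta1 = beta2 = 0 (J = 2)
r = np.zeros(2)

# Unrestricted OLS
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = e @ e / (n - K)

# Quadratic-form version of F
XtX_inv = np.linalg.inv(X.T @ X)
diff = R @ b - r
F_quad = diff @ np.linalg.solve(R @ XtX_inv @ R.T, diff) / J / s2

# SSR version: restricted residuals are deviations from the sample mean
ssr_r = np.sum((y - y.mean()) ** 2)
ssr_u = e @ e
F_ssr = ((ssr_r - ssr_u) / J) / (ssr_u / (n - K))

assert np.isclose(F_quad, F_ssr)
```

The two numbers agree to machine precision, which is exactly what the proof below establishes algebraically.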
Proof: Let $\tilde\beta$ be the OLS estimator under $H_0$, that is,

$$\tilde\beta = \arg\min_{\beta \in \mathbb{R}^K} (Y - X\beta)'(Y - X\beta)$$

subject to the constraint that $R\beta = r$. We first form the Lagrangian function

$$L(\beta, \lambda) = (Y - X\beta)'(Y - X\beta) + 2\lambda'(r - R\beta),$$

where $\lambda$ is a $J \times 1$ vector called the Lagrange multiplier vector. We have the following first order conditions:

$$\frac{\partial L(\tilde\beta, \tilde\lambda)}{\partial \beta} = -2X'(Y - X\tilde\beta) - 2R'\tilde\lambda = 0,$$

$$\frac{\partial L(\tilde\beta, \tilde\lambda)}{\partial \lambda} = r - R\tilde\beta = 0.$$

With the unconstrained OLS estimator $\hat\beta = (X'X)^{-1}X'Y$, the first equation of the FOC yields

$$-(\hat\beta - \tilde\beta) = (X'X)^{-1}R'\tilde\lambda, \qquad
R(X'X)^{-1}R'\tilde\lambda = -R(\hat\beta - \tilde\beta).$$

Hence, the Lagrange multiplier

$$\tilde\lambda = -[R(X'X)^{-1}R']^{-1}R(\hat\beta - \tilde\beta) = -[R(X'X)^{-1}R']^{-1}(R\hat\beta - r),$$

where we have used the constraint $R\tilde\beta = r$. It follows that

$$\hat\beta - \tilde\beta = (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r).$$
Now,

$$\tilde{e} = Y - X\tilde\beta = Y - X\hat\beta + X(\hat\beta - \tilde\beta) = e + X(\hat\beta - \tilde\beta).$$

It follows that

$$\tilde{e}'\tilde{e} = e'e + (\hat\beta - \tilde\beta)'X'X(\hat\beta - \tilde\beta)
= e'e + (R\hat\beta - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r),$$

where the cross-product term vanishes because $X'e = 0$. We have

$$(R\hat\beta - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r) = \tilde{e}'\tilde{e} - e'e$$

and

$$F = \frac{(R\hat\beta - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r)/J}{s^2} = \frac{(\tilde{e}'\tilde{e} - e'e)/J}{e'e/(n-K)}.$$

This completes the proof.
Question: What is the interpretation of the Lagrange multiplier $\tilde\lambda$?

Answer: Recall that we have obtained the relation

$$\tilde\lambda = -[R(X'X)^{-1}R']^{-1}R(\hat\beta - \tilde\beta) = -[R(X'X)^{-1}R']^{-1}(R\hat\beta - r).$$

Thus, $\tilde\lambda$ is an indicator of the departure of $R\hat\beta$ from $r$; that is, the value of $\tilde\lambda$ indicates whether $R\hat\beta - r$ is significantly different from zero.
Question: What happens to the distribution of $F$ as $n \to \infty$?

Recall that under $H_0$, the quadratic form

$$(R\hat\beta - r)'\big[\sigma^2 R(X'X)^{-1}R'\big]^{-1}(R\hat\beta - r) \sim \chi^2_J.$$

We have shown that

$$J \cdot F = (R\hat\beta - r)'\big[s^2 R(X'X)^{-1}R'\big]^{-1}(R\hat\beta - r)$$

is no longer $\chi^2_J$ for any finite $n > K$. However, the following theorem shows that as $n \to \infty$, the quadratic form $J \cdot F \stackrel{d}{\to} \chi^2_J$. In other words, as $n \to \infty$, the limiting distribution of $J \cdot F$ coincides with the distribution of the quadratic form

$$(R\hat\beta - r)'\big[\sigma^2 R(X'X)^{-1}R'\big]^{-1}(R\hat\beta - r).$$
Theorem: Suppose Assumptions 3.1, 3.3 and 3.5 hold. Then under $H_0$, the Wald test statistic

$$W = \frac{(R\hat\beta - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r)}{s^2} \stackrel{d}{\to} \chi^2_J$$

as $n \to \infty$.

Remarks:
(i) $W$ is called the Wald test statistic. In the present context, we have $W = J \cdot F$.
(ii) Recall that as $n \to \infty$, the F-distribution $F_{J,n-K}$, after being multiplied by $J$, approaches the $\chi^2_J$ distribution. This result has a very important implication in practice: when $n$ is large, the use of $F_{J,n-K}$ and the use of $\chi^2_J$ deliver the same conclusions for statistical inference.
Proof: The result follows immediately from the so-called Slutsky theorem and the fact that $s^2 \stackrel{p}{\to} \sigma^2$, where $\stackrel{p}{\to}$ denotes convergence in probability.

Question: What is the Slutsky theorem? What is convergence in probability?

Slutsky Theorem: Suppose $Z_n \stackrel{d}{\to} Z$, $a_n \stackrel{p}{\to} a$ and $b_n \stackrel{p}{\to} b$ as $n \to \infty$. Then

$$a_n + b_n Z_n \stackrel{d}{\to} a + bZ.$$

In our application, we have

$$Z_n = \frac{(R\hat\beta - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r)}{\sigma^2}, \quad Z \sim \chi^2_J,$$

$$a_n = 0, \quad a = 0, \quad b_n = \sigma^2/s^2, \quad b = 1.$$

Then by the Slutsky theorem, we have

$$W = b_n Z_n \stackrel{d}{\to} 1 \cdot Z \sim \chi^2_J,$$

because $Z_n \stackrel{d}{\to} Z$ and $b_n \stackrel{p}{\to} 1$. What remains is to show that $s^2 \stackrel{p}{\to} \sigma^2$ and then apply the Slutsky theorem.
Question: What is the concept of convergence in probability?

Definition [Convergence in Probability]: Suppose $\{Z_n\}$ is a sequence of random variables and $Z$ is a random variable (including a constant). We say that $Z_n$ converges to $Z$ in probability, denoted $Z_n - Z \stackrel{p}{\to} 0$ or $Z_n - Z = o_P(1)$, if for every small constant $\delta > 0$,

$$\lim_{n\to\infty} \Pr[|Z_n - Z| > \delta] = 0, \quad \text{or} \quad \Pr(|Z_n - Z| > \delta) \to 0 \text{ as } n \to \infty.$$

Interpretation: Define the event

$$A_n(\delta) = \{\omega \in \Omega : |Z_n(\omega) - Z(\omega)| > \delta\},$$

where $\omega$ is a basic outcome in the sample space $\Omega$. Convergence in probability says that the probability of the event $A_n(\delta)$ may be nonzero for any finite $n$, but this probability eventually vanishes as $n \to \infty$. In other words, it becomes less and less likely that the difference $|Z_n - Z|$ is larger than the prespecified constant $\delta > 0$; or, we have more and more confidence that the difference $|Z_n - Z|$ will be smaller than $\delta$ as $n \to \infty$.

Question: What is the interpretation of $\delta$?

Answer: It can be viewed as a prespecified tolerance level.

Remark: When $Z = b$, a constant, we write $Z_n \stackrel{p}{\to} b$, and we say that $b$ is the probability limit of $Z_n$, written

$$b = \operatorname{plim}_{n\to\infty} Z_n.$$
Next, we introduce another convergence concept which can be conveniently used to show convergence in probability.

Definition [Convergence in Quadratic Mean]: We say that $Z_n$ converges to $Z$ in quadratic mean, denoted $Z_n - Z \stackrel{q.m.}{\to} 0$ or $Z_n - Z = o_{q.m.}(1)$, if

$$\lim_{n\to\infty} E[Z_n - Z]^2 = 0.$$

Lemma: If $Z_n - Z \stackrel{q.m.}{\to} 0$, then $Z_n - Z \stackrel{p}{\to} 0$.

Proof: By Chebyshev's inequality, we have

$$P(|Z_n - Z| > \delta) \leq \frac{E[Z_n - Z]^2}{\delta^2} \to 0$$

for any given $\delta > 0$ as $n \to \infty$.
Example: Suppose Assumptions 3.1, 3.3 and 3.5 hold. Does $s^2$ converge in probability to $\sigma^2$?

Solution: Under the given assumptions,

$$(n-K)\frac{s^2}{\sigma^2} \sim \chi^2_{n-K},$$

and therefore we have $E(s^2) = \sigma^2$ and $var(s^2) = \frac{2\sigma^4}{n-K}$. It follows that $E(s^2 - \sigma^2)^2 = 2\sigma^4/(n-K) \to 0$, so $s^2 \stackrel{q.m.}{\to} \sigma^2$ and hence $s^2 \stackrel{p}{\to} \sigma^2$, because convergence in quadratic mean implies convergence in probability.

Question: If $s^2 \stackrel{p}{\to} \sigma^2$, do we have $s \stackrel{p}{\to} \sigma$?

Answer: Yes. Why? It follows from the following lemma with the choice of $g(s^2) = \sqrt{s^2} = s$.
Lemma: Suppose $Z_n \stackrel{p}{\to} Z$, and $g(\cdot)$ is a continuous function. Then

$$g(Z_n) \stackrel{p}{\to} g(Z).$$

Proof: Because $g(\cdot)$ is continuous, for any given constant $\epsilon > 0$ there exists $\delta = \delta(\epsilon)$ such that whenever $|Z_n(\omega) - Z(\omega)| \leq \delta$, we have $|g[Z_n(\omega)] - g[Z(\omega)]| \leq \epsilon$. Define the events

$$A(\delta) = \{\omega : |Z_n(\omega) - Z(\omega)| > \delta\}, \qquad B(\epsilon) = \{\omega : |g(Z_n(\omega)) - g(Z(\omega))| > \epsilon\}.$$

Then the continuity of $g(\cdot)$ implies that $B(\epsilon) \subseteq A(\delta)$. It follows that

$$P[B(\epsilon)] \leq P[A(\delta)] \to 0$$

as $n \to \infty$, where $P[A(\delta)] \to 0$ given $Z_n \stackrel{p}{\to} Z$. Because $\epsilon$ is arbitrary, we have $g(Z_n) \stackrel{p}{\to} g(Z)$. This completes the proof.
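The example above, $E(s^2 - \sigma^2)^2 = 2\sigma^4/(n-K) \to 0$, can be illustrated by a small Monte Carlo sketch in Python (the intercept-plus-slope design, seed, and replication count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2, K, reps = 1.0, 2, 500

def mse_of_s2(n):
    """Monte Carlo estimate of E(s^2 - sigma^2)^2 in a two-regressor model."""
    out = np.empty(reps)
    for i in range(reps):
        X = np.column_stack([np.ones(n), rng.normal(size=n)])
        y = X @ np.array([1.0, 1.0]) + rng.normal(size=n)   # errors ~ N(0, 1)
        b = np.linalg.solve(X.T @ X, X.T @ y)
        e = y - X @ b
        out[i] = (e @ e / (n - K) - sigma2) ** 2
    return out.mean()

# Theory: E(s^2 - sigma^2)^2 = 2 sigma^4 / (n - K), so it shrinks as n grows
m20, m200 = mse_of_s2(20), mse_of_s2(200)
assert m200 < m20
```

With $n = 20$ the theoretical value is $2/18 \approx 0.111$; with $n = 200$ it is $2/198 \approx 0.010$, and the simulated means track these closely.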
3.8 Applications

We now consider some special but important cases often encountered in economics and finance.
Case 1: Testing for the Joint Significance of Explanatory Variables

Consider a linear regression model

$$Y_t = X_t'\beta^0 + \varepsilon_t = \beta_0^0 + \sum_{j=1}^k \beta_j^0 X_{jt} + \varepsilon_t.$$

We are interested in testing the combined effect of all the regressors except the intercept. The null hypothesis is

$$H_0 : \beta_j^0 = 0 \text{ for } 1 \leq j \leq k,$$

which implies that none of the explanatory variables influences $Y_t$. The alternative hypothesis is

$$H_A : \beta_j^0 \neq 0 \text{ for at least some } j \in \{1, \dots, k\}.$$

One can use the F-test, with

$$F \sim F_{k, n-(k+1)}.$$

In fact, the restricted model under $H_0$ is very simple:

$$Y_t = \beta_0^0 + \varepsilon_t.$$

The restricted OLS estimator is $\tilde\beta = (\bar{Y}, 0, \dots, 0)'$. It follows that

$$\tilde{e} = Y - X\tilde\beta = Y - \bar{Y},$$

and hence

$$\tilde{e}'\tilde{e} = (Y - \bar{Y})'(Y - \bar{Y}).$$
Recall the definition of $R^2$:

$$R^2 = 1 - \frac{e'e}{(Y - \bar{Y})'(Y - \bar{Y})} = 1 - \frac{e'e}{\tilde{e}'\tilde{e}}.$$

It follows that

$$F = \frac{(\tilde{e}'\tilde{e} - e'e)/k}{e'e/(n-k-1)} = \frac{\big(1 - e'e/\tilde{e}'\tilde{e}\big)/k}{\big(e'e/\tilde{e}'\tilde{e}\big)/(n-k-1)} = \frac{R^2/k}{(1-R^2)/(n-k-1)}.$$

Thus, it suffices to run one regression, namely the unrestricted model, in this case. We emphasize that this formula is valid only when one is testing $H_0 : \beta_j^0 = 0$ for all $1 \leq j \leq k$.
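The $R^2$ form of the F-statistic can be checked against the SSR form in a simulated regression. A Python sketch (the design and coefficient values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([2.0, 0.5, 0.0, -0.3]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
tss = np.sum((y - y.mean()) ** 2)        # restricted SSR: intercept-only model
R2 = 1 - e @ e / tss

# F for H0: all slopes are zero, via the R^2 formula
F_r2 = (R2 / k) / ((1 - R2) / (n - k - 1))

# The same F via the restricted/unrestricted SSR formula
F_ssr = ((tss - e @ e) / k) / (e @ e / (n - k - 1))

assert np.isclose(F_r2, F_ssr)
```

The identity is exact because the restricted residuals under this particular $H_0$ are the deviations of $Y$ from its sample mean.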
Example 1 [Efficient Market Hypothesis]: Suppose $Y_t$ is the exchange rate return in period $t$, and $I_{t-1}$ is the information available at time $t-1$. Then a classical version of the efficient market hypothesis (EMH) can be stated as

$$E(Y_t|I_{t-1}) = E(Y_t).$$

To check whether exchange rate changes are unpredictable using the past history of exchange rate changes, we specify a linear regression model

$$Y_t = X_t'\beta^0 + \varepsilon_t,$$

where

$$X_t = (1, Y_{t-1}, \dots, Y_{t-k})'.$$

Under the EMH, we have

$$H_0 : \beta_j^0 = 0 \text{ for all } j = 1, \dots, k.$$

If the alternative

$$H_A : \beta_j^0 \neq 0 \text{ for at least some } j \in \{1, \dots, k\}$$

holds, then exchange rate changes are predictable using past information.
Remarks:
(i) What is the appropriate interpretation if $H_0$ is not rejected? Note that there exists a gap between the efficiency hypothesis and $H_0$, because the linear regression model is just one of many ways to check the EMH. Thus, if $H_0$ is not rejected, at most we can say that no evidence against the efficiency hypothesis is found. We should not conclude that the EMH holds.
(ii) Strictly speaking, the current theory (Assumption 3.2: $E(\varepsilon_t|X) = 0$) rules out this application, which is a dynamic time series regression model. However, we will justify in Chapter 5 that

$$k \cdot F = \frac{R^2}{(1-R^2)/(n-k-1)} \stackrel{d}{\to} \chi^2_k$$

under conditional homoskedasticity even for a linear dynamic regression model.
(iii) In fact, we can use a simpler version when $n$ is large:

$$(n-k-1)R^2 \stackrel{d}{\to} \chi^2_k.$$

This follows from the Slutsky theorem, because $R^2 \stackrel{p}{\to} 0$ under $H_0$. Although Assumption 3.5 is not needed for this result, conditional homoskedasticity is still needed, which rules out autoregressive conditional heteroskedasticity (ARCH) in the time series context.
Example 2 [Consumption Function and Wealth Effect]: Let $Y_t$ = consumption, $X_{1t}$ = labor income, $X_{2t}$ = liquid asset wealth. A regression estimation gives

$$Y_t = \underset{[1.77]}{33.88} - \underset{[-0.74]}{26.00}\,X_{1t} + \underset{[0.77]}{6.71}\,X_{2t} + e_t, \qquad R^2 = 0.742, \; n = 25,$$

where the numbers inside $[\cdot]$ are t-statistics.

Suppose we are interested in whether labor income or liquid asset wealth has an impact on consumption. We can use the F-test statistic:

$$F = \frac{R^2/2}{(1-R^2)/(n-3)} = \frac{0.742/2}{(1-0.742)/(25-3)} = 31.636 \sim F_{2,22}.$$

Comparing it with the critical value of $F_{2,22}$ at the 5% significance level, we reject the null hypothesis that neither income nor liquid asset wealth has an impact on consumption at the 5% significance level.
Case 2: Testing for Omitted Variables (or Testing for No Effect)

Suppose $X = (X^{(1)}, X^{(2)})$, where $X^{(1)}$ is an $n \times (k_1+1)$ matrix and $X^{(2)}$ is an $n \times k_2$ matrix. A random vector $X_t^{(2)}$ has no explanatory power for the conditional expectation of $Y_t$ if

$$E(Y_t|X_t) = E(Y_t|X_t^{(1)}).$$

Alternatively, it has explanatory power for the conditional expectation of $Y_t$ if

$$E(Y_t|X_t) \neq E(Y_t|X_t^{(1)}).$$

When $X_t^{(2)}$ has explanatory power for $Y_t$ but is not included in the regression, we say that $X_t^{(2)}$ is an omitted random variable or vector.

Question: How can we test whether $X_t^{(2)}$ is an omitted variable in the linear regression context?
Consider the restricted model

$$Y_t = \beta_0 + \beta_1 X_{1t} + \cdots + \beta_{k_1} X_{k_1 t} + \varepsilon_t.$$

Suppose we have $k_2$ additional variables $(X_{(k_1+1)t}, \dots, X_{(k_1+k_2)t})$, and so we consider the unrestricted regression model

$$Y_t = \beta_0 + \beta_1 X_{1t} + \cdots + \beta_{k_1} X_{k_1 t} + \beta_{k_1+1} X_{(k_1+1)t} + \cdots + \beta_{k_1+k_2} X_{(k_1+k_2)t} + \varepsilon_t.$$

The null hypothesis is that the additional variables have no effect on $Y_t$:

$$H_0 : \beta_{k_1+1} = \beta_{k_1+2} = \cdots = \beta_{k_1+k_2} = 0.$$

The alternative is that at least one of the additional variables has an effect on $Y_t$. The F-test statistic is

$$F = \frac{(\tilde{e}'\tilde{e} - e'e)/k_2}{e'e/(n - k_1 - k_2 - 1)} \sim F_{k_2, \, n-(k_1+k_2+1)}.$$
Question: Suppose we reject the null hypothesis. Then some important explanatory variables are omitted, and they should be included in the regression. On the other hand, if the F-test statistic does not reject the null hypothesis $H_0$, can we say that there is no omitted variable?

Answer: No. There may exist a nonlinear relationship in the additional variables which a linear regression specification cannot capture.
Example 1 [Testing for the Effect of Reforms]: Consider the extended production function

$$Y_t = \beta_0 + \beta_1 \ln(L_t) + \beta_2 \ln(K_t) + \beta_3 AU_t + \beta_4 PS_t + \beta_5 CM_t + \varepsilon_t,$$

where $AU_t$ is the autonomy dummy, $PS_t$ is the profit sharing ratio, and $CM_t$ is the dummy for change of manager. The null hypothesis of interest here is that none of the three reforms has an impact:

$$H_0 : \beta_3 = \beta_4 = \beta_5 = 0.$$

We can use the F-test, and $F \sim F_{3, n-6}$ under $H_0$.

Interpretation: If rejection occurs, there exists evidence against $H_0$. If no rejection occurs, then we find no evidence against $H_0$ (which is not the same as the statement that the reforms have no effect). It is possible that the effect of $X_t^{(2)}$ is of a nonlinear form; in this case, we may obtain zero coefficients for $X_t^{(2)}$, because the linear specification may not be able to capture the nonlinear effect.
Example 2 [Testing for Granger Causality]: Consider two time series $\{Y_t, Z_t\}$, where $t$ is the time index, $I_{t-1}^Y = \{Y_{t-1}, \dots, Y_1\}$ and $I_{t-1}^Z = \{Z_{t-1}, \dots, Z_1\}$. For example, $Y_t$ is GDP growth and $Z_t$ is money supply growth. We say that $Z_t$ does not Granger-cause $Y_t$ in conditional mean with respect to $I_{t-1} = \{I_{t-1}^Y, I_{t-1}^Z\}$ if

$$E(Y_t|I_{t-1}^Y, I_{t-1}^Z) = E(Y_t|I_{t-1}^Y).$$

In other words, the lagged variables of $Z_t$ have no impact on the level of $Y_t$.

Remark: Granger causality is defined in terms of incremental predictability rather than a real cause-effect relationship. From an econometric point of view, it is a test of omitted variables in a time series context. It was first introduced by Granger (1969).
Question: How can we test for Granger causality?

Consider the linear regression model

$$Y_t = \beta_0 + \beta_1 Y_{t-1} + \cdots + \beta_p Y_{t-p} + \beta_{p+1} Z_{t-1} + \cdots + \beta_{p+q} Z_{t-q} + \varepsilon_t.$$

Under Granger non-causality, we have

$$H_0 : \beta_{p+1} = \cdots = \beta_{p+q} = 0.$$

The F-test statistic satisfies

$$F \sim F_{q, \, n-(p+q+1)}.$$

Remark: The current theory (Assumption 3.2: $E(\varepsilon_t|X) = 0$) rules out this application, because it is a dynamic regression model. However, we will justify in Chapter 5 that under $H_0$,

$$q \cdot F \stackrel{d}{\to} \chi^2_q$$

as $n \to \infty$ under conditional homoskedasticity, even for a linear dynamic regression model.
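A Granger-causality F-test can be sketched as follows in Python. The bivariate data generating process, lag orders, and variable names are illustrative assumptions; the restricted model simply drops the lags of $Z_t$:

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(5)
T_obs, p, q = 300, 2, 2
z = rng.normal(size=T_obs)
y = np.zeros(T_obs)
for t in range(1, T_obs):
    # In this illustrative DGP, z does Granger-cause y through its first lag
    y[t] = 0.3 * y[t - 1] + 0.5 * z[t - 1] + rng.normal()

m = max(p, q)
idx = np.arange(m, T_obs)
X_u = np.column_stack(
    [np.ones(idx.size)]
    + [y[idx - j] for j in range(1, p + 1)]     # own lags of Y
    + [z[idx - j] for j in range(1, q + 1)]     # lags of Z
)
X_r = X_u[:, : 1 + p]                           # restricted model: drop the Z lags
Y = y[idx]

def ssr(Xm):
    b = np.linalg.lstsq(Xm, Y, rcond=None)[0]
    resid = Y - Xm @ b
    return resid @ resid

n_eff, K = Y.size, 1 + p + q
F = ((ssr(X_r) - ssr(X_u)) / q) / (ssr(X_u) / (n_eff - K))
p_value = 1 - f_dist.cdf(F, q, n_eff - K)
```

Because the simulated $Z$ lag enters $Y$ with a sizable coefficient, the test rejects non-causality decisively here; with the coefficient on $z_{t-1}$ set to zero, the rejection rate would be close to the nominal level.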
Example 3 [Testing for Structural Change (or Testing for Regime Shift)]: Consider

$$Y_t = \alpha_0 + \alpha_1 X_{1t} + \varepsilon_t,$$

where $t$ is a time index, and $\{X_t\}$ and $\{\varepsilon_t\}$ are mutually independent. Suppose there exist changes after $t = t_0$, i.e., there exist structural changes. We can consider the extended regression model

$$Y_t = (\alpha_0 + \delta_0 D_t) + (\alpha_1 + \delta_1 D_t)X_{1t} + \varepsilon_t = \alpha_0 + \alpha_1 X_{1t} + \delta_0 D_t + \delta_1 (D_t X_{1t}) + \varepsilon_t,$$

where $D_t = 1$ if $t > t_0$ and $D_t = 0$ otherwise. The variable $D_t$ is called a dummy variable, indicating whether the observation comes from the pre- or post-structural-break period.

The null hypothesis of no structural change is

$$H_0 : \delta_0 = \delta_1 = 0.$$

The alternative hypothesis, that there exists a structural change, is

$$H_A : \delta_0 \neq 0 \text{ or } \delta_1 \neq 0.$$

The F-test statistic satisfies

$$F \sim F_{2, n-4}.$$

The idea of such a test was first proposed by Chow (1960).
Case 3: Testing for Linear Restrictions

Example 1 [Testing for CRS]: Consider the extended production function

$$\ln(Y_t) = \beta_0 + \beta_1 \ln(L_t) + \beta_2 \ln(K_t) + \beta_3 AU_t + \beta_4 PS_t + \beta_5 CM_t + \varepsilon_t.$$

We test the null hypothesis of CRS:

$$H_0 : \beta_1 + \beta_2 = 1.$$

The alternative hypothesis is

$$H_A : \beta_1 + \beta_2 \neq 1.$$

What is the restricted model under $H_0$? It is given by

$$\ln(Y_t) = \beta_0 + \beta_1 \ln(L_t) + (1-\beta_1)\ln(K_t) + \beta_3 AU_t + \beta_4 PS_t + \beta_5 CM_t + \varepsilon_t,$$

or equivalently

$$\ln(Y_t/K_t) = \beta_0 + \beta_1 \ln(L_t/K_t) + \beta_3 AU_t + \beta_4 PS_t + \beta_5 CM_t + \varepsilon_t.$$

The F-test statistic satisfies

$$F \sim F_{1, n-6}.$$

Remark: Both the t-test and the F-test are applicable for testing CRS.
Example 2 [Wage Determination]: Consider the wage function

$$W_t = \beta_0 + \beta_1 P_t + \beta_2 P_{t-1} + \beta_3 U_t + \beta_4 V_t + \beta_5 W_{t-1} + \varepsilon_t,$$

where $W_t$ = wage, $P_t$ = price, $U_t$ = unemployment, and $V_t$ = unfilled vacancies.

We test the null hypothesis

$$H_0 : \beta_1 + \beta_2 = 0, \quad \beta_3 + \beta_4 = 0, \quad \text{and} \quad \beta_5 = 1.$$

Question: What is the economic interpretation of the null hypothesis $H_0$?

What is the restricted model? Under $H_0$, we have the restricted wage equation

$$\Delta W_t = \beta_0 + \beta_1 \Delta P_t + \beta_4 D_t + \varepsilon_t,$$

where $\Delta W_t = W_t - W_{t-1}$ is the wage growth, $\Delta P_t = P_t - P_{t-1}$ is the inflation rate, and $D_t = V_t - U_t$ is an index of the job market situation (excess job supply). This implies that the wage increase depends on the inflation rate and the excess labor supply.

The F-test statistic for $H_0$ satisfies

$$F \sim F_{3, n-6}.$$
Case 4: Testing under Near-Multicollinearity

Example 1: Consider the following estimation results for three separate regressions based on the same data set with $n = 25$. The first is a regression of consumption on income:

$$Y_t = \underset{[1.98]}{36.74} + \underset{[7.98]}{0.832}\,X_{1t} + e_{1t}, \qquad R^2 = 0.735.$$

The second is a regression of consumption on wealth:

$$Y_t = \underset{[1.97]}{36.61} + \underset{[7.99]}{0.208}\,X_{2t} + e_{2t}, \qquad R^2 = 0.735.$$

The third is a regression of consumption on both income and wealth:

$$Y_t = \underset{[1.77]}{33.88} - \underset{[-0.74]}{26.00}\,X_{1t} + \underset{[0.77]}{6.71}\,X_{2t} + e_t, \qquad R^2 = 0.742,$$

where the numbers inside $[\cdot]$ are t-statistics. Note that in the first two separate regressions we find significant t-test statistics for income and for wealth, but in the third joint regression both income and wealth are insignificant. This may be due to the fact that income and wealth are highly multicollinear. To test whether neither income nor wealth has an impact on consumption, we can use the F-test:

$$F = \frac{R^2/2}{(1-R^2)/(n-3)} = \frac{0.742/2}{(1-0.742)/(25-3)} = 31.636 \sim F_{2,22}.$$

This F-test shows that the null hypothesis is firmly rejected at the 5% significance level, because 31.636 far exceeds the 5% critical value of $F_{2,22}$.
3.9 Generalized Least Squares Estimation

Question: The classical linear regression theory crucially depends on the assumption that $\varepsilon|X \sim N(0, \sigma^2 I)$, or equivalently $\{\varepsilon_t\} \sim \text{i.i.d. } N(0, \sigma^2)$ with $\{X_t\}$ and $\{\varepsilon_t\}$ mutually independent. What may happen if some classical assumptions do not hold?

Question: Under what conditions do the existing procedures and results remain approximately valid?

Assumption 3.5 is unrealistic for many economic and financial data. Suppose Assumption 3.5 is replaced by the following condition:

Assumption 3.6: $\varepsilon|X \sim N(0, \sigma^2 V)$, where $0 < \sigma^2 < \infty$ is unknown and $V = V(X)$ is a known $n \times n$ finite and positive definite matrix.

Question: Is this assumption realistic in practice?

Remarks:
(i) Assumption 3.6 implies that

$$var(\varepsilon|X) = E(\varepsilon\varepsilon'|X) = \sigma^2 V = \sigma^2 V(X)$$

is known up to a constant $\sigma^2$. It allows for conditional heteroskedasticity of known form.
(ii) It is possible that $V$ is not a diagonal matrix, so $cov(\varepsilon_t, \varepsilon_s|X)$ may be nonzero. In other words, Assumption 3.6 allows conditional serial correlation of known form.
(iii) However, the assumption that $V$ is known is still very restrictive from a practical point of view. In practice, $V$ usually has an unknown form.
Question: What are the statistical properties of the OLS estimator $\hat\beta$ under Assumption 3.6?

Theorem: Suppose Assumptions 3.1, 3.3 and 3.6 hold. Then
(i) Unbiasedness: $E(\hat\beta|X) = \beta^0$.
(ii) Variance:

$$var(\hat\beta|X) = \sigma^2(X'X)^{-1}X'VX(X'X)^{-1} \neq \sigma^2(X'X)^{-1}.$$

(iii)

$$(\hat\beta - \beta^0)|X \sim N\big(0, \sigma^2(X'X)^{-1}X'VX(X'X)^{-1}\big).$$

(iv) $cov(\hat\beta, e|X) \neq 0$ in general.

Remarks:
(i) The OLS estimator $\hat\beta$ is still unbiased, and one can show that its variance goes to zero as $n \to \infty$ (see Question 6, Problem Set 03). Thus, it converges to $\beta^0$ in the sense of MSE.
(ii) However, the variance of the OLS estimator $\hat\beta$ no longer has the simple expression $\sigma^2(X'X)^{-1}$ under Assumption 3.6. As a consequence, the classical t- and F-test statistics are invalid, because they are based on an incorrect variance-covariance matrix of $\hat\beta$: they use the incorrect expression $\sigma^2(X'X)^{-1}$ rather than the correct variance formula $\sigma^2(X'X)^{-1}X'VX(X'X)^{-1}$.
(iii) Result (iv) of the theorem implies that even if we could obtain a consistent estimator of $\sigma^2(X'X)^{-1}X'VX(X'X)^{-1}$ and use it to construct tests, we would no longer obtain the Student t-distribution and the F-distribution, because the numerator and the denominator defining the t- and F-test statistics are no longer independent.
Proof: (i) Using $\hat\beta - \beta^0 = (X'X)^{-1}X'\varepsilon$, we have

$$E[(\hat\beta - \beta^0)|X] = (X'X)^{-1}X'E(\varepsilon|X) = (X'X)^{-1}X' \cdot 0 = 0.$$

(ii)

$$var(\hat\beta|X) = E[(\hat\beta - \beta^0)(\hat\beta - \beta^0)'|X]
= E[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}|X]
= (X'X)^{-1}X'E(\varepsilon\varepsilon'|X)X(X'X)^{-1}
= \sigma^2(X'X)^{-1}X'VX(X'X)^{-1}.$$

Note that we cannot simplify the expression further because $V \neq I$.

(iii) Because

$$\hat\beta - \beta^0 = (X'X)^{-1}X'\varepsilon = \sum_{t=1}^n C_t \varepsilon_t,$$

where $C_t = (X'X)^{-1}X_t$, $\hat\beta - \beta^0$ follows a normal distribution conditional on $X$, because it is a sum of normal random variables. As a result,

$$(\hat\beta - \beta^0)|X \sim N\big(0, \sigma^2(X'X)^{-1}X'VX(X'X)^{-1}\big).$$

(iv)

$$cov(\hat\beta, e|X) = E[(\hat\beta - \beta^0)e'|X]
= E[(X'X)^{-1}X'\varepsilon\varepsilon'M|X]
= (X'X)^{-1}X'E(\varepsilon\varepsilon'|X)M
= \sigma^2(X'X)^{-1}X'VM
\neq 0,$$

because $X'VM \neq 0$ in general. We can see that it is conditional heteroskedasticity and/or autocorrelation in $\{\varepsilon_t\}$ that cause $\hat\beta$ to be correlated with $e$.
Generalized Least Squares (GLS) Estimation

To introduce GLS, we first state a useful lemma.

Lemma: For any symmetric positive definite matrix $V$, we can always write

$$V^{-1} = C'C, \qquad V = C^{-1}(C')^{-1},$$

where $C$ is an $n \times n$ nonsingular matrix.

Question: What is this decomposition called?

Remark: $C$ need not be symmetric.
Consider the original linear regression model

$$Y = X\beta^0 + \varepsilon.$$

If we multiply the equation by $C$, we obtain the transformed regression model

$$CY = (CX)\beta^0 + C\varepsilon, \quad \text{or} \quad Y^* = X^*\beta^0 + \varepsilon^*,$$

where $Y^* = CY$, $X^* = CX$ and $\varepsilon^* = C\varepsilon$. Then the OLS estimator of this transformed model,

$$\hat\beta^* = (X^{*\prime}X^*)^{-1}X^{*\prime}Y^* = (X'C'CX)^{-1}(X'C'CY) = (X'V^{-1}X)^{-1}X'V^{-1}Y,$$

is called the Generalized Least Squares (GLS) estimator.
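For a diagonal $V$, the GLS estimator coincides with OLS applied to the transformed (weighted) data, which is easy to verify numerically. A Python sketch with an assumed known variance function (the exponential form and all names are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
h = np.exp(x)                                  # assumed known variance pattern: V = diag(h)
eps = np.sqrt(h) * rng.normal(size=n)          # heteroskedastic errors
y = X @ np.array([1.0, 2.0]) + eps

V_inv = np.diag(1.0 / h)
# GLS: (X' V^{-1} X)^{-1} X' V^{-1} Y
beta_gls = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ y)

# Equivalent: OLS on the transformed model C Y = (C X) b + C eps, C = diag(h^{-1/2})
Xs = X / np.sqrt(h)[:, None]
ys = y / np.sqrt(h)
beta_star = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)

assert np.allclose(beta_gls, beta_star)
```

In this diagonal case GLS is just weighted least squares: observations with large error variance are down-weighted by their conditional standard deviation.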
Question: What is the nature of GLS?

Observe that

$$E(\varepsilon^*|X) = E(C\varepsilon|X) = CE(\varepsilon|X) = C \cdot 0 = 0.$$

Also, note that

$$var(\varepsilon^*|X) = E[\varepsilon^*\varepsilon^{*\prime}|X] = E[C\varepsilon\varepsilon'C'|X] = CE(\varepsilon\varepsilon'|X)C'
= \sigma^2 CVC'
= \sigma^2 C[C^{-1}(C')^{-1}]C'
= \sigma^2 I.$$

It follows from Assumption 3.6 that

$$\varepsilon^*|X \sim N(0, \sigma^2 I).$$

The transformation makes the new error $\varepsilon^*$ conditionally homoskedastic and serially uncorrelated, while maintaining normality. Suppose that for some $t$, $\varepsilon_t$ has a large variance $\sigma_t^2$. The transformation $\varepsilon^* = C\varepsilon$ discounts $\varepsilon_t$ by dividing it by its conditional standard deviation, so that $\varepsilon_t^*$ becomes conditionally homoskedastic. In addition, the transformation also removes possible correlation between $\varepsilon_t$ and $\varepsilon_s$, $t \neq s$. As a consequence, the GLS estimator is the best linear LS estimator of $\beta^0$ in terms of the Gauss-Markov theorem.
To appreciate how the transformation by the matrix $C$ removes conditional heteroskedasticity and eliminates serial correlation, we now consider two examples.

Example 1 [Removing Heteroskedasticity]: Suppose

$$V = \begin{bmatrix} h_1 & 0 & \cdots & 0 \\ 0 & h_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & h_n \end{bmatrix}.$$

Then

$$C = \begin{bmatrix} h_1^{-1/2} & 0 & \cdots & 0 \\ 0 & h_2^{-1/2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & h_n^{-1/2} \end{bmatrix}$$

and

$$\varepsilon^* = C\varepsilon = \begin{bmatrix} \varepsilon_1/\sqrt{h_1} \\ \varepsilon_2/\sqrt{h_2} \\ \vdots \\ \varepsilon_n/\sqrt{h_n} \end{bmatrix}.$$

Example 2 [Eliminating Serial Correlation]: Suppose

$$V = \begin{bmatrix} 1 & \rho & 0 & \cdots & 0 & 0 \\ \rho & 1 & \rho & \cdots & 0 & 0 \\ 0 & \rho & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 1 & \rho \\ 0 & 0 & 0 & \cdots & \rho & 1 \end{bmatrix}.$$

Then

$$\varepsilon^* = C\varepsilon = \begin{bmatrix} \sqrt{1-\rho^2}\,\varepsilon_1 \\ \varepsilon_2 - \rho\varepsilon_1 \\ \vdots \\ \varepsilon_n - \rho\varepsilon_{n-1} \end{bmatrix}.$$
37775 :Theorem: Under Assumptions 3:1; 3:3 and 3.6,(i) E(�
�jX) = �0;(ii) var(�
�jX) = �2(X�0X�)�1 = �2(X 0V �1X)�1;
(iii) cov(��; e�jX) = 0; where e� = Y � �X��
�;
(iv) ��is BLUE.
53
(v) E(s�2jX) = �2; where s�2 = e�0e�=(n�K):
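The claim that the transformation delivers var(ε*|X) = σ²I can be verified numerically for the serial-correlation case. The AR(1) structure V_ts = ρ^{|t−s|}/(1 − ρ²) and the specific ρ below are assumptions of this sketch:

```python
import numpy as np

rho, n = 0.5, 6

# Example 2: AR(1) errors, V[t, s] = rho^{|t-s|} / (1 - rho^2).
idx = np.arange(n)
V = rho ** np.abs(np.subtract.outer(idx, idx)) / (1 - rho ** 2)

# Whitening matrix C: first row scales eps_1 by sqrt(1 - rho^2),
# the remaining rows form eps_t - rho * eps_{t-1}.
C = np.eye(n)
C[0, 0] = np.sqrt(1 - rho ** 2)
for t in range(1, n):
    C[t, t - 1] = -rho

# C V C' should be the identity: the transformed errors are
# conditionally homoskedastic and serially uncorrelated.
err = np.max(np.abs(C @ V @ C.T - np.eye(n)))
```

Since var(ε|X) = σ²V, we get var(Cε|X) = σ²CVC′ = σ²I, exactly as in the derivation above.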
Proof:
(i) GLS is the OLS estimator of the transformed model.
(ii) The transformed model satisfies Assumptions 3.1, 3.3 and 3.5 of the classical regression assumptions, with ε*|X* ∼ N(0, σ²In). It follows that GLS is BLUE by the Gauss-Markov theorem.
Remarks:
(i) Because β̂* is the OLS estimator of the transformed regression model with i.i.d. N(0, σ²I) errors, the t-test and F-test are applicable, and these test statistics are defined as follows:

T* = (Rβ̂* − r) / √[s*² R(X*′X*)⁻¹R′] ∼ t_{n−K},

F* = {(Rβ̂* − r)′[R(X*′X*)⁻¹R′]⁻¹(Rβ̂* − r)/J} / s*² ∼ F_{J,n−K}.

When testing whether all coefficients except the intercept are jointly zero, we have

(n − K)R*² →d χ²_J,

where R*² is the coefficient of determination of the transformed model.
(ii) Because the GLS estimator β̂* is BLUE, and the OLS estimator β̂ differs from β̂*, OLS β̂ cannot be BLUE:

β̂* = (X*′X*)⁻¹X*′Y* = (X′V⁻¹X)⁻¹X′V⁻¹Y,

β̂ = (X′X)⁻¹X′Y.
In fact, the most important message of GLS is the insight it provides into the impact
of conditional heteroskedasticity and serial correlation on the estimation and inference
of the linear regression model. In practice, GLS is generally not feasible, because the
n × n matrix V is of unknown form, where var(ε|X) = σ²V.
What are feasible solutions? There are two.
(i) First Approach: Adaptive Feasible GLS

In some cases, with additional assumptions, we can use a nonparametric estimator V̂ to replace the unknown V, obtaining the adaptive feasible GLS estimator

β̂*_a = (X′V̂⁻¹X)⁻¹X′V̂⁻¹Y,

where V̂ is an estimator for V. Because V is an n × n unknown matrix and we only have n data points, it is impossible to estimate V consistently using a sample of size n if we do not impose any restriction on the form of V. In other words, we have to impose some restrictions on V in order to estimate it consistently. For example, suppose we assume

σ²V = diag{σ1²(X), ..., σn²(X)} = diag{σ²(X1), ..., σ²(Xn)},

where diag{·} is an n × n diagonal matrix and σ²(Xt) = E(εt²|X) is unknown. The fact that σ²V is a diagonal matrix can arise when cov(εt, εs|X) = 0 for all t ≠ s, i.e., when there is no serial correlation. Then we can use the nonparametric kernel estimator
σ̂²(x) = [ n⁻¹ Σ_{t=1}^n et² b⁻¹K((x − Xt)/b) ] / [ n⁻¹ Σ_{t=1}^n b⁻¹K((x − Xt)/b) ] →p σ²(x),
where et is the estimated OLS residual, K(·) is a kernel function, i.e., a prespecified symmetric density function (e.g., K(u) = (2π)^{−1/2} exp(−u²/2)), and b = b(n) is a bandwidth such that b → 0 and nb → ∞ as n → ∞. The finite sample distribution of β̂*_a will differ from the finite sample distribution of β̂*, which assumes that V were known, because the sampling errors of the estimator V̂ have some impact on the estimator β̂*_a. However, under suitable conditions on V̂, β̂*_a will share the same asymptotic properties as the infeasible GLS estimator β̂* (i.e., the MSE of β̂*_a is approximately equal to the MSE of β̂*). In other words, the first stage estimation of σ²(·) has no impact on the asymptotic distribution of β̂*_a. For more discussion, see Robinson (1988) and Stinchcombe and White (1991).
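The two-stage procedure can be sketched in numpy. The variance function, the Gaussian kernel, and the rule-of-thumb bandwidth below are illustrative assumptions, not choices made in the text:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.uniform(-2, 2, size=n)
X = np.column_stack([np.ones(n), x])
beta0 = np.array([1.0, 2.0])
sigma2_true = 0.5 + x ** 2                       # assumed conditional variance
Y = X @ beta0 + rng.normal(size=n) * np.sqrt(sigma2_true)

# Step 1: OLS residuals.
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta_ols

# Step 2: kernel (Nadaraya-Watson) estimator of sigma^2(x) from squared
# residuals, with Gaussian kernel and a rule-of-thumb bandwidth.
b = 1.06 * x.std() * n ** (-1 / 5)
K = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
W = K((x[:, None] - x[None, :]) / b)             # W[i, j] = K((x_i - x_j)/b)
sigma2_hat = (W @ e ** 2) / W.sum(axis=1)

# Step 3: adaptive feasible GLS with V-hat = diag(sigma2_hat),
# i.e., weighted least squares with weights 1/sigma2_hat.
w = 1.0 / sigma2_hat
beta_fgls = np.linalg.solve((X * w[:, None]).T @ X, (X * w[:, None]).T @ Y)
```

The factors 1/n and 1/b cancel between numerator and denominator of σ̂²(x), which is why they do not appear explicitly in the code.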
Simulation studies?
(ii) Second Approach

Continue to use the OLS estimator β̂, obtaining the correct formula

var(β̂|X) = σ²(X′X)⁻¹X′VX(X′X)⁻¹

as well as a consistent estimator for var(β̂|X). The classical definitions of the t-test and F-test cannot be used, because they are based on an incorrect formula for var(β̂|X). However, some modified tests can be obtained by using a consistent estimator for the correct formula for var(β̂|X). The trick is to estimate σ²X′VX, which is a K × K unknown matrix, rather than to estimate V, which is an n × n unknown matrix. However, only asymptotic distributions can be used in this case.
Question: Suppose we assume

E(εε′|X) = σ²V = diag{σ1²(X), ..., σn²(X)}.

As pointed out earlier, this essentially assumes E(εtεs|X) = 0 for all t ≠ s; that is, there is no serial correlation in {εt} conditional on X. Instead of estimating σt²(X), one can estimate the K × K matrix σ²X′VX directly.
How can we estimate

σ²X′VX = Σ_{t=1}^n XtXt′ σt²(X)?

Answer: Use the estimator

X′D(e)D(e)′X = Σ_{t=1}^n XtXt′ et²,

where D(e) = diag(e1, ..., en) is an n × n diagonal matrix with all off-diagonal elements equal to zero. This is called White's (1980) heteroskedasticity-consistent variance-covariance estimator. See more discussion in Chapter 4.
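A minimal numpy sketch of White's estimator, plugged into the sandwich formula for var(β̂|X); the heteroskedastic DGP below is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta0 = np.array([0.5, 1.0])
eps = rng.normal(size=n) * np.sqrt(0.2 + x ** 2)   # heteroskedastic errors
Y = X @ beta0 + eps

beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta_ols
XtX_inv = np.linalg.inv(X.T @ X)

# White (1980): estimate the K x K matrix sigma^2 X'VX by
# sum_t e_t^2 X_t X_t' = X'D(e)D(e)'X, avoiding the n x n matrix V.
meat = (X * (e ** 2)[:, None]).T @ X               # sum_t e_t^2 X_t X_t'
var_white = XtX_inv @ meat @ XtX_inv               # sandwich estimator
se_white = np.sqrt(np.diag(var_white))             # robust standard errors
```

Only the K × K "meat" matrix is estimated, which is the trick described above.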
Question: For J = 1, do we have

(Rβ̂ − r) / √[R(X′X)⁻¹X′D(e)D(e)′X(X′X)⁻¹R′] ∼ t_{n−K}?

For J > 1, do we have

(Rβ̂ − r)′[R(X′X)⁻¹X′D(e)D(e)′X(X′X)⁻¹R′]⁻¹(Rβ̂ − r)/J ∼ F_{J,n−K}?

Answer: No. Why? Recall that cov(β̂, e|X) ≠ 0 under Assumption 3.6. It follows that β̂ and e are not independent.
However, when n → ∞, we have:

(i) Case I (J = 1):

(Rβ̂ − r) / √[R(X′X)⁻¹X′D(e)D(e)′X(X′X)⁻¹R′] →d N(0, 1).

This can be called a robust t-test.

(ii) Case II (J > 1):

(Rβ̂ − r)′[R(X′X)⁻¹X′D(e)D(e)′X(X′X)⁻¹R′]⁻¹(Rβ̂ − r) →d χ²_J.

This is a robust Wald test statistic.
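Both robust statistics can be sketched as follows. The DGP and the null hypotheses tested (which hold in this DGP) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta0 = np.array([0.5, 1.0])
Y = X @ beta0 + rng.normal(size=n) * np.sqrt(0.2 + x ** 2)

beta = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta
XtX_inv = np.linalg.inv(X.T @ X)
# White sandwich estimator of var(beta|X).
V_hat = XtX_inv @ ((X * (e ** 2)[:, None]).T @ X) @ XtX_inv

# Robust t-test of H0: beta_1 = 1 (J = 1); asymptotically N(0, 1).
R1, r1 = np.array([[0.0, 1.0]]), np.array([1.0])
t_robust = ((R1 @ beta - r1) / np.sqrt(R1 @ V_hat @ R1.T)).item()

# Robust Wald test of H0: R beta = r with J = 2; asymptotically chi^2_J.
R, r = np.eye(2), beta0
d = R @ beta - r
W_robust = float(d @ np.linalg.solve(R @ V_hat @ R.T, d))
```

Because both hypotheses are true in this simulated DGP, t_robust should look like a N(0, 1) draw and W_robust like a χ²₂ draw for large n.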
The above two feasible solutions are based on the assumption that E(εtεs|X) = 0 for all t ≠ s. When there exists serial correlation of unknown form, an alternative solution should be provided. This is discussed in Chapter 6.
3.10 Summary and Conclusion
In this chapter, we have presented the econometric theory for the classical linear
regression models. We first provide and discuss a set of assumptions on which the classical linear regression model is built. This set of regularity conditions serves as the starting point from which we develop modern econometric theory for linear regression models.
We derive the statistical properties of the OLS estimator. In particular, we point out that R² is not a suitable model selection criterion, because it is always nondecreasing in the number of regressors. Suitable model selection criteria, such as AIC and BIC, are discussed. We show that conditional on the regressor matrix X, the OLS estimator β̂ is unbiased, has a vanishing variance, and is BLUE. Under the additional conditional normality assumption, we derive the finite sample normal distribution of β̂, the Chi-squared distribution of (n − K)s²/σ², and the independence between β̂ and s².
Many hypotheses encountered in economics can be formulated as linear restrictions on model parameters. Depending on the number of parameter restrictions, we derive the t-test and the F-test. In the special case of testing the hypothesis that all slope coefficients are jointly zero, we also derive an asymptotically Chi-squared test based on R².
When there exist conditional heteroskedasticity and/or autocorrelation, the OLS estimator is still unbiased and has a vanishing variance, but it is no longer BLUE, and β̂ and s² are no longer mutually independent. Under the assumption of a known variance-covariance matrix up to some scale parameter, one can transform the linear regression model by correcting conditional heteroskedasticity and eliminating autocorrelation, so that the transformed regression model has conditionally homoskedastic and uncorrelated errors. The OLS estimator of this transformed linear regression model is called the GLS
estimator, which is BLUE. The t-test and F-test are applicable. When the variance-covariance structure is unknown, the GLS estimator becomes infeasible. However, if the error in the original linear regression model is serially uncorrelated (as is the case with independent observations across t), there are two feasible solutions. The first is to use a nonparametric method to obtain a consistent estimator of the conditional variance var(εt|Xt), and then obtain a feasible plug-in GLS estimator. The second is to use White's (1980) heteroskedasticity-consistent variance-covariance matrix estimator for the OLS estimator β̂. Both methods are built on asymptotic theory. When the error of the original linear regression model is serially correlated, a feasible solution is provided in Chapter 6.
References
Robinson, P. (1988)
Stinchcombe, M. and H. White (1991)
ADVANCED ECONOMETRICS
PROBLEM SET 03
1. Consider a bivariate linear regression model

Yt = Xt′β⁰ + εt, t = 1, ..., n,

where Xt = (X0t, X1t)′ = (1, X1t)′, and εt is a regression error. Let ρ̂ denote the sample correlation between Yt and X1t, namely,

ρ̂ = Σ_{t=1}^n x1t yt / √( Σ_{t=1}^n x1t² · Σ_{t=1}^n yt² ),

where yt = Yt − Ȳ, x1t = X1t − X̄1, and Ȳ and X̄1 are the sample means of {Yt} and {X1t}. Show R² = ρ̂².
2. Suppose Xt = Q for all t ≥ m, where m is a fixed integer and Q is a K × 1 constant vector. Do we have λmin(X′X) → ∞ as n → ∞? Explain.
3. The adjusted R², denoted R̄², is defined as follows:

R̄² = 1 − [e′e/(n − K)] / [(Y − Ȳ)′(Y − Ȳ)/(n − 1)].

Show that

R̄² = 1 − [(n − 1)/(n − K)](1 − R²).
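The identity in this problem can be checked numerically before proving it algebraically; the simulated data below are an assumption of the sketch, not part of the problem:

```python
import numpy as np

rng = np.random.default_rng(5)
n, K = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
Y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=n)

beta = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta
tss = np.sum((Y - Y.mean()) ** 2)
R2 = 1 - (e @ e) / tss

# Definition of adjusted R^2 ...
R2_bar_def = 1 - (e @ e / (n - K)) / (tss / (n - 1))
# ... and the identity to be shown in the problem.
R2_bar_id = 1 - ((n - 1) / (n - K)) * (1 - R2)
```

The two expressions agree exactly, for any data set with an intercept in the regression.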
4. Consider the following linear regression model

Yt = Xt′β⁰ + ut, t = 1, ..., n, (4.1)

where

ut = σ(Xt)εt,

{Xt} is a nonstochastic process, and σ(Xt) is a positive function of Xt such that

Ω = diag{σ²(X1), σ²(X2), ..., σ²(Xn)} = Ω^{1/2} Ω^{1/2},

with

Ω^{1/2} = diag{σ(X1), σ(X2), ..., σ(Xn)}.

Assume that {εt} is i.i.d. N(0, 1). Then the ut are independent, with ut ∼ N(0, σ²(Xt)). This differs from Assumption 3.5 of the classical linear regression analysis, because now {ut} exhibits conditional heteroskedasticity.
Let β̂ denote the OLS estimator of β⁰.

(a) Is β̂ unbiased for β⁰?

(b) Show that var(β̂) = (X′X)⁻¹X′ΩX(X′X)⁻¹.

Consider an alternative estimator

β̃ = (X′Ω⁻¹X)⁻¹X′Ω⁻¹Y = [ Σ_{t=1}^n σ⁻²(Xt)XtXt′ ]⁻¹ Σ_{t=1}^n σ⁻²(Xt)XtYt.

(c) Is β̃ unbiased for β⁰?

(d) Show that var(β̃) = (X′Ω⁻¹X)⁻¹.

(e) Is var(β̂) − var(β̃) positive semi-definite (p.s.d.)? Which estimator, β̂ or β̃, is more efficient?
(f) Is β̃ the Best Linear Unbiased Estimator (BLUE) for β⁰? [Hint: There are several approaches to this question. A simple one is to consider the transformed model

Yt* = Xt*′β⁰ + εt, t = 1, ..., n, (4.2)

where Yt* = Yt/σ(Xt) and Xt* = Xt/σ(Xt). This model is obtained from model (4.1) after dividing by σ(Xt). In matrix notation, model (4.2) can be written as

Y* = X*β⁰ + ε,

where the n × 1 vector Y* = Ω^{−1/2}Y and the n × k matrix X* = Ω^{−1/2}X.]
(g) Construct two test statistics for the null hypothesis of interest H0: β⁰2 = 0. One test is based on β̂, and the other is based on β̃. What are the finite sample distributions of your test statistics under H0? Can you tell which test is better?

(h) Construct two test statistics for the null hypothesis of interest H0: Rβ⁰ = r, where R is a J × k matrix with J > 0. One test is based on β̂, and the other is based on β̃. What are the finite sample distributions of your test statistics under H0?
5. Consider the following classical regression model

Yt = Xt′β⁰ + εt = β⁰0 + Σ_{j=1}^k β⁰j Xjt + εt, t = 1, ..., n. (5.1)

Suppose that we are interested in testing the null hypothesis

H0: β⁰1 = β⁰2 = ··· = β⁰k = 0.

Then the F-statistic can be written as

F = [(ẽ′ẽ − e′e)/k] / [e′e/(n − k − 1)],

where e′e is the sum of squared residuals from the unrestricted model (5.1), and ẽ′ẽ is the sum of squared residuals from the restricted model

Yt = β⁰0 + εt. (5.2)
(a) Show that under Assumptions 3.1 and 3.3,

F = [R²/k] / [(1 − R²)/(n − k − 1)],

where R² is the coefficient of determination of the unrestricted model (5.1).

(b) Suppose in addition Assumption 3.5 holds. Show that under H0,

(n − k − 1)R² →d χ²_k.
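The algebraic identity in part (a) can be checked numerically (the simulated data below are an assumption of the sketch, and no substitute for the proof):

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 80, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
Y = X @ np.array([1.0, 0.3, -0.7]) + rng.normal(size=n)

# Unrestricted model (5.1): regress Y on the intercept and k regressors.
beta = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta
# Restricted model (5.2), all slopes zero: residuals are Y_t - Ybar.
e_tilde = Y - Y.mean()

# F-statistic from the two residual sums of squares ...
F_ssr = ((e_tilde @ e_tilde - e @ e) / k) / (e @ e / (n - k - 1))
# ... and the same statistic expressed through R^2.
R2 = 1 - (e @ e) / (e_tilde @ e_tilde)
F_r2 = (R2 / k) / ((1 - R2) / (n - k - 1))
```

The agreement holds because, with an intercept, ẽ′ẽ equals the total sum of squares, so R² = 1 − e′e/ẽ′ẽ.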
6. Suppose X′X is a K × K matrix, V is an n × n matrix, and both X′X and V are symmetric and nonsingular, with the minimum eigenvalue λmin(X′X) → ∞ as n → ∞ and 0 < c ≤ λmax(V) ≤ C < ∞. Show that for any τ ∈ R^K such that τ′τ = 1,

τ′var(β̂|X)τ = σ²τ′(X′X)⁻¹X′VX(X′X)⁻¹τ → 0

as n → ∞. Thus, var(β̂|X) vanishes as n → ∞ under conditional heteroskedasticity.
7. Suppose a data generating process is given by

Yt = β⁰1 X1t + β⁰2 X2t + εt = Xt′β⁰ + εt,

where Xt = (X1t, X2t)′, and Assumptions 3.1-3.4 hold. (For simplicity, we have assumed no intercept.) Denote the OLS estimator by β̂ = (β̂1, β̂2)′.
Suppose β⁰2 = 0 and this is known. Then we can consider the simpler regression

Yt = β⁰1 X1t + εt.

Denote the OLS estimator of this simpler regression by β̃1.

Compare the relative efficiency of β̂1 and β̃1. That is, which estimator is better for β⁰1? Give your reasoning in detail.
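The comparison asked for here can be explored with a small Monte Carlo sketch. The specific DGP below, with correlated regressors and β⁰2 = 0, is an illustrative assumption and not part of the problem:

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 100, 2000
beta01 = 1.0                               # true beta_1; beta_2 = 0 in the DGP

b_full = np.empty(reps)                    # beta_1-hat from the two-regressor OLS
b_simple = np.empty(reps)                  # beta_1-tilde from the simple regression
for i in range(reps):
    x1 = rng.normal(size=n)
    x2 = 0.6 * x1 + rng.normal(size=n)     # correlated regressors (assumption)
    Y = beta01 * x1 + rng.normal(size=n)   # beta_2 = 0 in truth
    X = np.column_stack([x1, x2])
    b_full[i] = np.linalg.solve(X.T @ X, X.T @ Y)[0]
    b_simple[i] = (x1 @ Y) / (x1 @ x1)

# Compare the Monte Carlo means and sampling variances of the two estimators.
mean_full, mean_simple = b_full.mean(), b_simple.mean()
var_full, var_simple = b_full.var(), b_simple.var()
```

Comparing the two Monte Carlo variances (and checking both means against β⁰1) suggests the answer to be argued analytically.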