CHAPTER 3 CLASSICAL LINEAR REGRESSION MODELS

Key words: Classical linear regression, Conditional heteroskedasticity, Conditional homoskedasticity, F-test, GLS, Hypothesis testing, Model selection criterion, OLS, $R^2$, t-test

Abstract: In this chapter, we will introduce the classical linear regression theory, including the classical model assumptions, the statistical properties of the OLS estimator, the t-test and the F-test, as well as the GLS estimator and related statistical procedures. This chapter will serve as a starting point from which we will develop the modern econometric theory.
3.1 Assumptions
Suppose we have a random sample $\{Z_t\}_{t=1}^n$ of size $n$, where $Z_t = (Y_t, X_t')'$, $Y_t$ is a scalar, $X_t = (1, X_{1t}, X_{2t}, \ldots, X_{kt})'$ is a $(k+1) \times 1$ vector, $t$ is an index (either cross-sectional unit or time period) for observations, and $n$ is the sample size. We are interested in modeling the conditional mean $E(Y_t|X_t)$ using an observed realization (i.e., a data set) of the random sample $\{Y_t, X_t'\}'$, $t = 1, \ldots, n$.

Notations:
(i) $K \equiv k+1$ is the number of regressors, consisting of $k$ economic variables and an intercept.
(ii) The index $t$ may denote an individual unit (e.g., a firm, a household, a country) for cross-sectional data, or a time period (e.g., day, week, month, year) in a time series context.
We first list and discuss the assumptions of the classical linear regression theory.
Assumption 3.1 [Linearity]:
$$Y_t = X_t'\beta^0 + \varepsilon_t, \quad t = 1, \ldots, n,$$
where $\beta^0$ is a $K \times 1$ unknown parameter vector, and $\varepsilon_t$ is an unobservable disturbance.
Remarks:
(i) In Assumption 3.1, $Y_t$ is the dependent variable (or regressand), $X_t$ is the vector of regressors (or independent variables, or explanatory variables), and $\beta^0$ is the regression coefficient vector. When the linear model is correctly specified for the conditional mean $E(Y_t|X_t)$, i.e., when $E(\varepsilon_t|X_t) = 0$, the parameter $\beta^0 = \frac{\partial}{\partial X_t}E(Y_t|X_t)$ is the marginal effect of $X_t$ on $Y_t$.
(ii) The key notion of linearity in the classical linear regression model is that the regression model is linear in $\beta^0$ rather than in $X_t$.
(iii) Does Assumption 3.1 imply a causal relationship from $X_t$ to $Y_t$? Not necessarily. As Kendall and Stuart (1961, Vol. 2, Ch. 26, p. 279) point out, "a statistical relationship, however strong and however suggestive, can never establish causal connection: our ideas of causation must come from outside statistics, ultimately from some theory or other." Assumption 3.1 only implies a predictive relationship: given $X_t$, can we predict $Y_t$ linearly?
(iv) Matrix Notation: Denote
$$Y = (Y_1, \ldots, Y_n)', \quad n \times 1;$$
$$\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)', \quad n \times 1;$$
$$X = (X_1, \ldots, X_n)', \quad n \times K.$$
Note that the $t$-th row of $X$ is $X_t' = (1, X_{1t}, \ldots, X_{kt})$. With these matrix notations, we have a compact expression for Assumption 3.1:
$$Y = X\beta^0 + \varepsilon,$$
$$n \times 1 = (n \times K)(K \times 1) + n \times 1.$$
The second assumption is a strict exogeneity condition.
Assumption 3.2 [Strict Exogeneity]:
$$E(\varepsilon_t|X) = E(\varepsilon_t|X_1, \ldots, X_t, \ldots, X_n) = 0, \quad t = 1, \ldots, n.$$
Remarks:
(i) Among other things, this condition implies correct model specification for $E(Y_t|X_t)$. This is because Assumption 3.2 implies $E(\varepsilon_t|X_t) = 0$ by the law of iterated expectations.
(ii) Assumption 3.2 also implies $E(\varepsilon_t) = 0$ by the law of iterated expectations.
(iii) Under Assumption 3.2, we have $E(X_s\varepsilon_t) = 0$ for any $(t,s)$, where $t, s \in \{1, \ldots, n\}$. This follows because
$$E(X_s\varepsilon_t) = E[E(X_s\varepsilon_t|X)] = E[X_s E(\varepsilon_t|X)] = E(X_s \cdot 0) = 0.$$
Note that (ii) and (iii) together imply $\mathrm{cov}(X_s, \varepsilon_t) = 0$ for all $t, s \in \{1, \ldots, n\}$.
(iv) Because $X$ contains regressors $\{X_s\}$ for both $s \leq t$ and $s > t$, Assumption 3.2 essentially requires that the error $\varepsilon_t$ not depend on the past and future values of the regressors if $t$ is a time index. This rules out dynamic time series models, for which $\varepsilon_t$ may be correlated with the future values of the regressors (because the future values of the regressors depend on the current shocks), as is illustrated in the following example.
Example 1: Consider a so-called autoregressive AR(1) model
$$Y_t = \beta_0 + \beta_1 Y_{t-1} + \varepsilon_t = X_t'\beta + \varepsilon_t, \quad t = 1, \ldots, n,$$
$$\{\varepsilon_t\} \sim \text{i.i.d.}(0, \sigma^2),$$
where $X_t = (1, Y_{t-1})'$. Obviously, $E(X_t\varepsilon_t) = E(X_t)E(\varepsilon_t) = 0$, but $E(X_{t+1}\varepsilon_t) \neq 0$. Thus, we have $E(\varepsilon_t|X) \neq 0$, and so Assumption 3.2 does not hold. In Chapter 5, we will consider linear regression models with dependent observations, which will include this example as a special case. In fact, the main reason for imposing Assumption 3.2 is to obtain a finite sample distribution theory. For a large sample theory (i.e., an asymptotic theory), the strict exogeneity condition will not be needed.
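The failure of strict exogeneity in the AR(1) model can be checked by simulation. A minimal sketch, assuming NumPy is available; the parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta0, beta1, sigma = 1.0, 0.5, 1.0

eps = rng.normal(0.0, sigma, size=n)
Y = np.empty(n)
Y[0] = beta0 / (1 - beta1)          # start at the unconditional mean
for t in range(1, n):
    Y[t] = beta0 + beta1 * Y[t - 1] + eps[t]

# The contemporaneous regressor Y_{t-1} is uncorrelated with eps_t ...
m_contemp = np.mean(Y[:-1] * eps[1:])
# ... but the future regressor Y_t (a component of X_{t+1}) is not:
m_future = np.mean(Y[1:] * eps[1:])
print(m_contemp, m_future)
```

Here `m_contemp` should be near zero, while `m_future` should be near $\mathrm{var}(\varepsilon_t) = 1$, illustrating $E(X_{t+1}\varepsilon_t) \neq 0$.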
(v) In econometrics, there are some alternative definitions of strict exogeneity. For example, one definition assumes that $\varepsilon_t$ and $X$ are independent. An example is that $X$ is nonstochastic. This rules out conditional heteroskedasticity (i.e., $\mathrm{var}(\varepsilon_t|X)$ depending on $X$). In Assumption 3.2, we still allow for conditional heteroskedasticity, because we do not assume that $\varepsilon_t$ and $X$ are independent. We only assume that the conditional mean $E(\varepsilon_t|X)$ does not depend on $X$.
(vi) Question: What happens to Assumption 3.2 if $X$ is nonstochastic?
Answer: If $X$ is nonstochastic, Assumption 3.2 becomes
$$E(\varepsilon_t|X) = E(\varepsilon_t) = 0.$$
An example of nonstochastic $X$ is $X_t = (1, t, \ldots, t^k)'$. This corresponds to a time-trend regression model
$$Y_t = X_t'\beta^0 + \varepsilon_t = \sum_{j=0}^{k} \beta_j^0 t^j + \varepsilon_t.$$
(vii) Question: What happens to Assumption 3.2 if $Z_t = (Y_t, X_t')'$ is an independent random sample (i.e., $Z_t$ and $Z_s$ are independent whenever $t \neq s$, although $Y_t$ and $X_t$ may not be independent)?
Answer: When $\{Z_t\}$ is i.i.d., Assumption 3.2 becomes
$$E(\varepsilon_t|X) = E(\varepsilon_t|X_1, X_2, \ldots, X_t, \ldots, X_n) = E(\varepsilon_t|X_t) = 0.$$
In other words, when $\{Z_t\}$ is i.i.d., $E(\varepsilon_t|X) = 0$ is equivalent to $E(\varepsilon_t|X_t) = 0$.
Assumption 3.3 [Nonsingularity]: The minimum eigenvalue of the $K \times K$ square matrix $X'X = \sum_{t=1}^{n} X_t X_t'$ satisfies
$$\lambda_{\min}(X'X) \to \infty \text{ as } n \to \infty$$
with probability one.
Question: How do we find the eigenvalues of a square matrix $A$?
Answer: By solving the characteristic equation
$$\det(A - \lambda I) = 0,$$
where $\det(\cdot)$ denotes the determinant of a square matrix, and $I$ is an identity matrix with the same dimension as $A$.
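This can be checked numerically. A small sketch assuming NumPy, with an illustrative $2 \times 2$ matrix standing in for $X'X$:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

# For a 2x2 matrix, det(A - lambda*I) = lambda^2 - tr(A)*lambda + det(A).
tr, det = np.trace(A), np.linalg.det(A)
roots = np.sort(np.roots([1.0, -tr, det]).real)

# The same eigenvalues from the library routine (returned in ascending order).
eigs = np.linalg.eigvalsh(A)
print(roots, eigs)
```

Both approaches give the eigenvalues $(7 \pm \sqrt{5})/2$ of this particular matrix.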
Remarks:
(i) Assumption 3.3 rules out multicollinearity among the $(k+1)$ regressors in $X_t$. We say that there exists multicollinearity among the $X_t$ if for all $t \in \{1, \ldots, n\}$, the variable $X_{jt}$ for some $j \in \{0, 1, \ldots, k\}$ is a linear combination of the other $K-1$ column variables.
(ii) This condition implies that $X'X$ is nonsingular, or equivalently, that $X$ has full rank $K = k+1$. Thus, we need $K \leq n$; that is, the number of regressors cannot be larger than the sample size.
(iii) It is well known that the eigenvalue $\lambda$ can be used to summarize the information contained in a matrix (recall the popular principal component analysis). Assumption 3.3 implies that new information must be available as the sample size $n \to \infty$ (i.e., $X_t$ should not merely repeat the same values as $t$ increases).
(iv) Intuitively, if there is no variation in the values of $X_t$, it will be difficult to determine the relationship between $Y_t$ and $X_t$ (indeed, the purpose of classical linear regression is to investigate how a change in $X$ causes a change in $Y$). In a certain sense, one may call $X'X$ the "information matrix" of the random sample $X$, because it is a measure of the information contained in $X$. The degree of variation in $X'X$ will affect the precision of the parameter estimates for $\beta^0$.

Question: Why can the eigenvalue $\lambda$ be used as a measure of the information contained in $X'X$?
Assumption 3.4 [Spherical Error Variance]:
(a) [conditional homoskedasticity]:
$$E(\varepsilon_t^2|X) = \sigma^2 > 0, \quad t = 1, \ldots, n;$$
(b) [conditional serial uncorrelatedness]:
$$E(\varepsilon_t\varepsilon_s|X) = 0, \quad t \neq s, \quad t, s \in \{1, \ldots, n\}.$$

Remarks:
(i) We can write Assumption 3.4 compactly as
$$E(\varepsilon_t\varepsilon_s|X) = \sigma^2\delta_{ts},$$
where $\delta_{ts} = 1$ if $t = s$ and $\delta_{ts} = 0$ otherwise. In mathematics, $\delta_{ts}$ is called the Kronecker delta function.
(ii) $\mathrm{var}(\varepsilon_t|X) = E(\varepsilon_t^2|X) - [E(\varepsilon_t|X)]^2 = E(\varepsilon_t^2|X) = \sigma^2$ given Assumption 3.2. Similarly, we have $\mathrm{cov}(\varepsilon_t, \varepsilon_s|X) = E(\varepsilon_t\varepsilon_s|X) = 0$ for all $t \neq s$.
(iii) By the law of iterated expectations, Assumption 3.4 implies that $\mathrm{var}(\varepsilon_t) = \sigma^2$ for all $t = 1, \ldots, n$, the so-called unconditional homoskedasticity. Similarly, $\mathrm{cov}(\varepsilon_t, \varepsilon_s) = 0$ for all $t \neq s$.
(iv) A compact expression for Assumptions 3.2 and 3.4 together is
$$E(\varepsilon|X) = 0 \quad \text{and} \quad E(\varepsilon\varepsilon'|X) = \sigma^2 I,$$
where $I \equiv I_n$ is an $n \times n$ identity matrix.
(v) Assumption 3.4 does not imply that $\varepsilon_t$ and $X$ are independent. It allows the possibility that the conditional higher-order moments (e.g., skewness and kurtosis) of $\varepsilon_t$ depend on $X$.
3.2 OLS Estimation
Question: How do we estimate $\beta^0$ using an observed data set generated from the random sample $\{Z_t\}_{t=1}^n$, where $Z_t = (Y_t, X_t')'$?
Definition [OLS Estimator]: Suppose Assumptions 3.1 and 3.3 hold. Define the sum of squared residuals (SSR) of the linear regression model $Y_t = X_t'\beta + u_t$ as
$$SSR(\beta) \equiv (Y - X\beta)'(Y - X\beta) = \sum_{t=1}^{n} (Y_t - X_t'\beta)^2.$$
Then the ordinary least squares (OLS) estimator $\hat{\beta}$ is the solution to
$$\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^K} SSR(\beta).$$

Remark: $SSR(\beta)$ is the sum of squared model errors $\{u_t = Y_t - X_t'\beta\}$ over all observations, with equal weight on each $t$.
Theorem 1 [Existence of OLS]: Under Assumptions 3.1 and 3.3, the OLS estimator $\hat{\beta}$ exists and
$$\hat{\beta} = (X'X)^{-1}X'Y = \left(\sum_{t=1}^{n} X_t X_t'\right)^{-1}\sum_{t=1}^{n} X_t Y_t = \left(\frac{1}{n}\sum_{t=1}^{n} X_t X_t'\right)^{-1}\frac{1}{n}\sum_{t=1}^{n} X_t Y_t.$$
The latter expression will be useful for our asymptotic analysis in subsequent chapters.
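The closed-form expression can be computed directly. A sketch assuming NumPy, with simulated data (all names and parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 500, 2
beta_true = np.array([1.0, 2.0, -0.5])   # (intercept, beta_1, beta_2)

# Design matrix X is n x K with K = k + 1 (intercept plus k regressors).
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
eps = rng.normal(size=n)
Y = X @ beta_true + eps

# OLS: beta_hat = (X'X)^{-1} X'Y, computed by solving the normal
# equations (X'X) beta = X'Y rather than inverting X'X explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)
```

Solving the normal equations is numerically preferable to forming $(X'X)^{-1}$ explicitly, though both implement the same estimator.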
Remark: Suppose $Z_t = (Y_t, X_t')'$, $t = 1, \ldots, n$, is an independent and identically distributed (i.i.d.) random sample of size $n$. Compare the population MSE criterion
$$MSE(\beta) = E(Y_t - X_t'\beta)^2$$
and the sample MSE criterion
$$\frac{SSR(\beta)}{n} = \frac{1}{n}\sum_{t=1}^{n}(Y_t - X_t'\beta)^2.$$
By the weak law of large numbers (WLLN), we have
$$\frac{SSR(\beta)}{n} \to^p E(Y_t - X_t'\beta)^2 \equiv MSE(\beta),$$
and their minimizers satisfy
$$\hat{\beta} = \left(\frac{1}{n}\sum_{t=1}^{n} X_t X_t'\right)^{-1}\frac{1}{n}\sum_{t=1}^{n} X_t Y_t \to^p [E(X_t X_t')]^{-1}E(X_t Y_t) \equiv \beta^*.$$
Here, $SSR(\beta)$, after scaling by $n^{-1}$, is the sample analogue of $MSE(\beta)$, and the OLS estimator $\hat{\beta}$ is the sample analogue of the best LS approximation coefficient $\beta^*$.
Proof of Theorem 1: Using the formula that for $a$ and $\beta$ both $K \times 1$ vectors, the derivative
$$\frac{\partial(a'\beta)}{\partial\beta} = a,$$
we have
$$\frac{dSSR(\beta)}{d\beta} = \frac{d}{d\beta}\sum_{t=1}^{n}(Y_t - X_t'\beta)^2 = \sum_{t=1}^{n}\frac{\partial}{\partial\beta}(Y_t - X_t'\beta)^2 = \sum_{t=1}^{n} 2(Y_t - X_t'\beta)\frac{\partial}{\partial\beta}(Y_t - X_t'\beta) = -2\sum_{t=1}^{n} X_t(Y_t - X_t'\beta) = -2X'(Y - X\beta).$$
The OLS estimator must satisfy the first order condition (FOC):
$$-2X'(Y - X\hat{\beta}) = 0, \quad \text{i.e.,} \quad X'(Y - X\hat{\beta}) = 0, \quad \text{i.e.,} \quad X'Y - (X'X)\hat{\beta} = 0.$$
It follows that
$$(X'X)\hat{\beta} = X'Y.$$
By Assumption 3.3, $X'X$ is nonsingular. Thus,
$$\hat{\beta} = (X'X)^{-1}X'Y.$$
Checking the second order condition (SOC), we have the $K \times K$ Hessian matrix
$$\frac{\partial^2 SSR(\beta)}{\partial\beta\partial\beta'} = -2\sum_{t=1}^{n}\frac{\partial}{\partial\beta'}[(Y_t - X_t'\beta)X_t] = 2X'X,$$
which is positive definite given $\lambda_{\min}(X'X) > 0$. Thus, $\hat{\beta}$ is a global minimizer. Note that for the existence of $\hat{\beta}$, we only need $X'X$ to be nonsingular, which is implied by the condition that $\lambda_{\min}(X'X) \to \infty$ as $n \to \infty$; the existence itself does not require $\lambda_{\min}(X'X) \to \infty$ as $n \to \infty$.
Remarks:
(i) $\hat{Y}_t \equiv X_t'\hat{\beta}$ is called the fitted value (or predicted value) for observation $Y_t$, and $e_t \equiv Y_t - \hat{Y}_t$ is the estimated residual (or prediction error) for observation $Y_t$. Note that
$$e_t = Y_t - \hat{Y}_t = (X_t'\beta^0 + \varepsilon_t) - X_t'\hat{\beta} = \varepsilon_t - X_t'(\hat{\beta} - \beta^0)$$
$$= \text{true regression disturbance} + \text{estimation error},$$
where the true disturbance $\varepsilon_t$ is unavoidable, and the estimation error $X_t'(\hat{\beta} - \beta^0)$ can be made smaller when a larger data set is available (so that $\hat{\beta}$ becomes closer to $\beta^0$).
(ii) The FOC implies that the estimated residual vector $e = Y - X\hat{\beta}$ is orthogonal to the regressors $X$ in the sense that
$$X'e = \sum_{t=1}^{n} X_t e_t = 0.$$
This is a consequence of the very nature of OLS, as implied by the FOC of $\min_{\beta \in \mathbb{R}^K} SSR(\beta)$. It always holds, no matter whether $E(\varepsilon_t|X) = 0$ (recall that we do not impose Assumption 3.2). Note that if $X_t$ contains an intercept, then $X'e = 0$ implies $\sum_{t=1}^{n} e_t = 0$.
Some Useful Identities
To investigate the statistical properties of $\hat{\beta}$, we first state some useful lemmas.

Lemma: Under Assumptions 3.1 and 3.3, we have:
(i) $X'e = 0$;
(ii) $\hat{\beta} - \beta^0 = (X'X)^{-1}X'\varepsilon$;
(iii) defining the $n \times n$ projection matrix
$$P = X(X'X)^{-1}X'$$
and
$$M = I_n - P,$$
both $P$ and $M$ are symmetric (i.e., $P = P'$ and $M = M'$) and idempotent (i.e., $P^2 = P$, $M^2 = M$), with
$$PX = X, \quad MX = 0;$$
(iv) $SSR(\hat{\beta}) = e'e = Y'MY = \varepsilon'M\varepsilon$.
Proof:
(i) The result follows immediately from the FOC of the OLS estimator.
(ii) Because $\hat{\beta} = (X'X)^{-1}X'Y$ and $Y = X\beta^0 + \varepsilon$, we have
$$\hat{\beta} - \beta^0 = (X'X)^{-1}X'(X\beta^0 + \varepsilon) - \beta^0 = (X'X)^{-1}X'\varepsilon.$$
(iii) $P$ is idempotent because
$$P^2 = PP = [X(X'X)^{-1}X'][X(X'X)^{-1}X'] = X(X'X)^{-1}X' = P.$$
Similarly, we can show $M^2 = M$.
(iv) By the definition of $M$, we have
$$e = Y - X\hat{\beta} = Y - X(X'X)^{-1}X'Y = [I - X(X'X)^{-1}X']Y = MY = M(X\beta^0 + \varepsilon) = MX\beta^0 + M\varepsilon = M\varepsilon,$$
given $MX = 0$. It follows that
$$SSR(\hat{\beta}) = e'e = (M\varepsilon)'(M\varepsilon) = \varepsilon'M'M\varepsilon = \varepsilon'M^2\varepsilon = \varepsilon'M\varepsilon,$$
where the last equality follows from $M^2 = M$.
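The algebraic properties of $P$ and $M$ in the Lemma are easy to verify numerically. A sketch assuming NumPy; the design matrix is simulated purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])

# Projection matrix P and annihilator M = I - P.
P = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - P

print(np.allclose(P, P.T), np.allclose(P @ P, P))   # symmetric, idempotent
print(np.allclose(M @ X, 0.0))                      # MX = 0
print(round(np.trace(M)))                           # tr(M) = n - K = 47
```

The trace result $\mathrm{tr}(M) = n - K$ reappears below in the proof that $s^2$ is unbiased.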
3.3 Goodness of Fit and Model Selection Criteria

Question: How well does the linear regression model fit the data? That is, how well does the linear regression model explain the variation of the observed data of $\{Y_t\}$?
We need some criteria or measures to characterize goodness of fit.
We first introduce two measures of goodness of fit. The first is called the uncentered squared multi-correlation coefficient, $R^2_{uc}$.

Definition [Uncentered $R^2$]: The uncentered squared multi-correlation coefficient is defined as
$$R^2_{uc} = \frac{\hat{Y}'\hat{Y}}{Y'Y} = 1 - \frac{e'e}{Y'Y}.$$

Remarks:
Interpretation of $R^2_{uc}$: the proportion of the uncentered sample quadratic variation in the dependent variable $\{Y_t\}$ that can be explained by the uncentered sample quadratic variation of the fitted values $\{\hat{Y}_t\}$. Note that we always have $0 \leq R^2_{uc} \leq 1$.
Next, we define a closely related measure called the centered $R^2$.

Definition [Centered $R^2$: Coefficient of Determination]: The coefficient of determination is
$$R^2 \equiv 1 - \frac{\sum_{t=1}^{n} e_t^2}{\sum_{t=1}^{n}(Y_t - \bar{Y})^2},$$
where $\bar{Y} = n^{-1}\sum_{t=1}^{n} Y_t$ is the sample mean.
Remarks:
(i) When $X_t$ contains an intercept, we have the orthogonal decomposition
$$\sum_{t=1}^{n}(Y_t - \bar{Y})^2 = \sum_{t=1}^{n}(\hat{Y}_t - \bar{Y} + Y_t - \hat{Y}_t)^2 = \sum_{t=1}^{n}(\hat{Y}_t - \bar{Y})^2 + \sum_{t=1}^{n} e_t^2 + 2\sum_{t=1}^{n}(\hat{Y}_t - \bar{Y})e_t = \sum_{t=1}^{n}(\hat{Y}_t - \bar{Y})^2 + \sum_{t=1}^{n} e_t^2,$$
where the cross-product term
$$\sum_{t=1}^{n}(\hat{Y}_t - \bar{Y})e_t = \sum_{t=1}^{n}\hat{Y}_t e_t - \bar{Y}\sum_{t=1}^{n} e_t = \hat{\beta}'\sum_{t=1}^{n} X_t e_t - \bar{Y}\sum_{t=1}^{n} e_t = \hat{\beta}'(X'e) - \bar{Y}\sum_{t=1}^{n} e_t = \hat{\beta}'\cdot 0 - \bar{Y}\cdot 0 = 0,$$
where we have used the facts that $X'e = 0$ and $\sum_{t=1}^{n} e_t = 0$, which follow from the FOC of the OLS estimation and the fact that $X_t$ contains an intercept (i.e., $X_{0t} = 1$). It follows that
$$R^2 \equiv 1 - \frac{e'e}{\sum_{t=1}^{n}(Y_t - \bar{Y})^2} = \frac{\sum_{t=1}^{n}(Y_t - \bar{Y})^2 - \sum_{t=1}^{n} e_t^2}{\sum_{t=1}^{n}(Y_t - \bar{Y})^2} = \frac{\sum_{t=1}^{n}(\hat{Y}_t - \bar{Y})^2}{\sum_{t=1}^{n}(Y_t - \bar{Y})^2},$$
and consequently we have
$$0 \leq R^2 \leq 1.$$
(ii) If $X_t$ does not contain an intercept, then the orthogonal decomposition identity
$$\sum_{t=1}^{n}(Y_t - \bar{Y})^2 = \sum_{t=1}^{n}(\hat{Y}_t - \bar{Y})^2 + \sum_{t=1}^{n} e_t^2$$
no longer holds. As a consequence, $R^2$ may be negative when there is no intercept! This is because the cross-product term
$$2\sum_{t=1}^{n}(\hat{Y}_t - \bar{Y})e_t$$
may be negative.
(iii) For any given random sample $\{Y_t, X_t'\}'$, $t = 1, \ldots, n$, $R^2$ is nondecreasing in the number of explanatory variables in $X_t$. In other words, the more explanatory variables are added to the linear regression, the higher $R^2$ becomes. This is true regardless of whether $X_t$ has any true explanatory power for $Y_t$.
Theorem: Suppose $\{Y_t, X_t'\}'$, $t = 1, \ldots, n$, is a random sample; $R_1^2$ is the centered $R^2$ from the linear regression
$$Y_t = X_t'\beta + u_t,$$
where $X_t = (1, X_{1t}, \ldots, X_{kt})'$ and $\beta$ is a $K \times 1$ parameter vector; and $R_2^2$ is the centered $R^2$ from the extended linear regression
$$Y_t = \tilde{X}_t'\gamma + v_t,$$
where $\tilde{X}_t = (1, X_{1t}, \ldots, X_{kt}, X_{k+1,t}, \ldots, X_{k+q,t})'$ and $\gamma$ is a $(K+q) \times 1$ parameter vector. Then $R_2^2 \geq R_1^2$.
Proof: By definition, we have
$$R_1^2 = 1 - \frac{e'e}{\sum_{t=1}^{n}(Y_t - \bar{Y})^2}, \qquad R_2^2 = 1 - \frac{\tilde{e}'\tilde{e}}{\sum_{t=1}^{n}(Y_t - \bar{Y})^2},$$
where $e$ is the estimated residual vector from the regression of $Y$ on $X$, and $\tilde{e}$ is the estimated residual vector from the regression of $Y$ on $\tilde{X}$. It suffices to show $\tilde{e}'\tilde{e} \leq e'e$. Because the OLS estimator $\hat{\gamma} = (\tilde{X}'\tilde{X})^{-1}\tilde{X}'Y$ minimizes $SSR(\gamma)$ for the extended model, we have
$$\tilde{e}'\tilde{e} = \sum_{t=1}^{n}(Y_t - \tilde{X}_t'\hat{\gamma})^2 \leq \sum_{t=1}^{n}(Y_t - \tilde{X}_t'\gamma)^2 \quad \text{for all } \gamma \in \mathbb{R}^{K+q}.$$
Now we choose
$$\gamma = (\hat{\beta}', 0')',$$
where $\hat{\beta} = (X'X)^{-1}X'Y$ is the OLS estimator from the first regression. It follows that
$$\tilde{e}'\tilde{e} \leq \sum_{t=1}^{n}\left(Y_t - \sum_{j=0}^{k}\hat{\beta}_j X_{jt} - \sum_{j=k+1}^{k+q} 0\cdot X_{jt}\right)^2 = \sum_{t=1}^{n}(Y_t - X_t'\hat{\beta})^2 = e'e.$$
Hence, we have $R_1^2 \leq R_2^2$. This completes the proof.
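The theorem can be illustrated by adding a regressor with no true explanatory power and seeing that $R^2$ still cannot fall. A sketch assuming NumPy; `centered_r2` is a small helper written for this illustration, not a library function:

```python
import numpy as np

def centered_r2(X, Y):
    """Centered R^2 from an OLS regression of Y on X (X includes an intercept)."""
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    e = Y - X @ beta_hat
    return 1.0 - (e @ e) / np.sum((Y - Y.mean()) ** 2)

rng = np.random.default_rng(3)
n = 200
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X1 @ np.array([1.0, 2.0]) + rng.normal(size=n)

# Append a purely irrelevant regressor: R^2 is nondecreasing regardless.
X2 = np.column_stack([X1, rng.normal(size=n)])
r2_small, r2_big = centered_r2(X1, Y), centered_r2(X2, Y)
print(r2_small, r2_big)
```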
Question: What is the implication of this result?

Answer: $R^2$ is not a suitable criterion for judging correct model specification, nor is it a suitable criterion for model selection.

Question: Does a high $R^2$ imply precise estimation of $\beta^0$?

Question: What are suitable model selection criteria?
Two Popular Model Selection Criteria

The KISS principle: Keep It Sophistically Simple!
An important idea in statistics is to use a simple model to capture as much of the essential information contained in the data as possible. Below, we introduce two popular model selection criteria that reflect this idea.

Akaike Information Criterion [AIC]: A linear regression model can be selected by minimizing the following AIC criterion with a suitable choice of $K$:
$$AIC = \ln(s^2) + \frac{2K}{n} \quad (\text{goodness of fit} + \text{model complexity}),$$
where
$$s^2 = e'e/(n-K)$$
and $K = k+1$ is the number of regressors.

Bayesian Information Criterion [BIC]: A linear regression model can be selected by minimizing the following criterion with a suitable choice of $K$:
$$BIC = \ln(s^2) + \frac{K\ln(n)}{n}.$$

Remark: When $\ln n \geq 2$, BIC gives a heavier penalty than AIC for model complexity, which is measured by the number of estimated parameters (relative to the sample size $n$). As a consequence, BIC will choose a more parsimonious linear regression model than AIC.
Remark: BIC is strongly consistent in the sense that it determines the true model asymptotically (i.e., as $n \to \infty$), whereas under AIC an overparameterized model will emerge no matter how large the sample is. Of course, such properties are not necessarily guaranteed in finite samples.
Remarks: In addition to AIC and BIC, there are other criteria, such as $\bar{R}^2$, the so-called adjusted $R^2$, that can also be used to select a linear regression model. The adjusted $\bar{R}^2$ is defined as
$$\bar{R}^2 = 1 - \frac{e'e/(n-K)}{(Y - \bar{Y})'(Y - \bar{Y})/(n-1)}.$$
This differs from
$$R^2 = 1 - \frac{e'e}{(Y - \bar{Y})'(Y - \bar{Y})}.$$
It may be shown that
$$\bar{R}^2 = 1 - \frac{n-1}{n-K}(1 - R^2).$$
All model selection criteria are structured in terms of the estimated residual variance plus a penalty adjustment involving the number of estimated parameters, and it is in the extent of this penalty that the criteria differ. For more discussion of these and other selection criteria, see Judge et al. (1985, Section 7.5).
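The relation between $\bar{R}^2$ and $R^2$ can be confirmed numerically. A sketch assuming NumPy, with simulated data (parameter values illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, K = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
Y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta_hat
tss = np.sum((Y - Y.mean()) ** 2)

r2 = 1.0 - (e @ e) / tss                              # R^2
r2_adj = 1.0 - ((e @ e) / (n - K)) / (tss / (n - 1))  # adjusted R^2

# The identity relating the two measures:
print(np.isclose(r2_adj, 1.0 - (n - 1) / (n - K) * (1.0 - r2)))
```

Since $(n-1)/(n-K) > 1$ whenever $K > 1$, the adjusted measure satisfies $\bar{R}^2 < R^2$ unless the fit is perfect.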
Question: Why is it not good practice to use a complicated model?

Answer: A complicated model contains many unknown parameters. Given a fixed amount of data information, parameter estimation becomes less precise when more parameters have to be estimated. As a consequence, the out-of-sample forecast for $Y_t$ may become less precise than the forecast of a simpler model. The simpler model may have a larger bias but more precise parameter estimates. Intuitively, a complicated model is too flexible in the sense that it may capture not only systematic components but also some features in the data that will not show up again. Thus, it cannot forecast the future well.
3.4 Consistency and Efficiency of OLS

We now investigate the statistical properties of $\hat{\beta}$. We are interested in addressing the following basic questions:
- Is $\hat{\beta}$ a good estimator for $\beta^0$ (consistency)?
- Is $\hat{\beta}$ the best estimator (efficiency)?
- What is the sampling distribution of $\hat{\beta}$ (normality)?

Question: What is the sampling distribution of $\hat{\beta}$?
Answer: The distribution of $\hat{\beta}$ is called the sampling distribution of $\hat{\beta}$ because $\hat{\beta}$ is a function of the random sample $\{Z_t\}_{t=1}^n$, where $Z_t = (Y_t, X_t')'$.

Remark: The sampling distribution of $\hat{\beta}$ is useful for any statistical inference involving $\hat{\beta}$, such as confidence interval estimation and hypothesis testing.
We first investigate the statistical properties of $\hat{\beta}$.

Theorem: Suppose Assumptions 3.1-3.4 hold. Then:
(i) [Unbiasedness] $E(\hat{\beta}|X) = \beta^0$ and $E(\hat{\beta}) = \beta^0$.
(ii) [Vanishing Variance]
$$\mathrm{var}(\hat{\beta}|X) = E\left[(\hat{\beta} - E\hat{\beta})(\hat{\beta} - E\hat{\beta})'|X\right] = \sigma^2(X'X)^{-1},$$
and for any $K \times 1$ vector $\tau$ such that $\tau'\tau = 1$, we have
$$\tau'\mathrm{var}(\hat{\beta}|X)\tau \to 0 \text{ as } n \to \infty.$$
(iii) [Orthogonality between $e$ and $\hat{\beta}$]
$$\mathrm{cov}(\hat{\beta}, e|X) = E\{[\hat{\beta} - E(\hat{\beta}|X)]e'|X\} = 0.$$
(iv) [Gauss-Markov]
$$\mathrm{var}(b|X) - \mathrm{var}(\hat{\beta}|X) \text{ is positive semi-definite (p.s.d.)}$$
for any unbiased estimator $b$ that is linear in $Y$ with $E(b|X) = \beta^0$.
(v) [Residual Variance Estimator]
$$s^2 = \frac{1}{n-K}\sum_{t=1}^{n} e_t^2 = e'e/(n-K)$$
is unbiased for $\sigma^2 = E(\varepsilon_t^2)$; that is, $E(s^2|X) = \sigma^2$.
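Parts (i) and (v) of the theorem, unbiasedness of $\hat{\beta}$ and of $s^2$, can be illustrated by Monte Carlo. A sketch assuming NumPy; the design, replication count, and parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)
n, K, sigma = 50, 2, 1.5
beta_true = np.array([1.0, 2.0])
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # fixed design across replications

B = 5000
beta_hats = np.empty((B, K))
s2s = np.empty(B)
for b in range(B):
    Y = X @ beta_true + rng.normal(0.0, sigma, size=n)
    bh = np.linalg.solve(X.T @ X, X.T @ Y)
    e = Y - X @ bh
    beta_hats[b] = bh
    s2s[b] = (e @ e) / (n - K)

print(beta_hats.mean(axis=0))   # Monte Carlo average close to beta_true
print(s2s.mean())               # Monte Carlo average close to sigma^2 = 2.25
```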
Question: What is a linear estimator $b$?

Remarks:
(a) Parts (i) and (ii) of the theorem imply that the conditional MSE
$$MSE(\hat{\beta}|X) = E[(\hat{\beta} - \beta^0)(\hat{\beta} - \beta^0)'|X] = \mathrm{var}(\hat{\beta}|X) + \mathrm{Bias}(\hat{\beta}|X)\mathrm{Bias}(\hat{\beta}|X)' = \mathrm{var}(\hat{\beta}|X) \to 0 \text{ as } n \to \infty,$$
where we have used the fact that $\mathrm{Bias}(\hat{\beta}|X) \equiv E(\hat{\beta}|X) - \beta^0 = 0$. Recall that MSE measures how close an estimator $\hat{\beta}$ is to the target $\beta^0$.
(b) Part (iv) of the theorem implies that $\hat{\beta}$ is the best linear unbiased estimator (BLUE) for $\beta^0$, because $\mathrm{var}(\hat{\beta}|X)$ is the smallest among all linear unbiased estimators for $\beta^0$.

Formally, we can define a related concept for comparing two estimators:

Definition [Efficiency]: An unbiased estimator $\hat{\beta}$ of parameter $\beta^0$ is more efficient than another unbiased estimator $b$ of parameter $\beta^0$ if
$$\mathrm{var}(b|X) - \mathrm{var}(\hat{\beta}|X) \text{ is p.s.d.}$$

Implication: For any $\tau \in \mathbb{R}^K$ such that $\tau'\tau = 1$, we have
$$\tau'\left[\mathrm{var}(b|X) - \mathrm{var}(\hat{\beta}|X)\right]\tau \geq 0.$$
Choosing $\tau = (1, 0, \ldots, 0)'$, for example, we have
$$\mathrm{var}(b_1|X) - \mathrm{var}(\hat{\beta}_1|X) \geq 0.$$
Proof: (i) Given $\hat{\beta} - \beta^0 = (X'X)^{-1}X'\varepsilon$, we have
$$E[(\hat{\beta} - \beta^0)|X] = E[(X'X)^{-1}X'\varepsilon|X] = (X'X)^{-1}X'E(\varepsilon|X) = (X'X)^{-1}X'\cdot 0 = 0.$$
(ii) Given $\hat{\beta} - \beta^0 = (X'X)^{-1}X'\varepsilon$ and $E(\varepsilon\varepsilon'|X) = \sigma^2 I$, we have
$$\mathrm{var}(\hat{\beta}|X) \equiv E\left[(\hat{\beta} - E\hat{\beta})(\hat{\beta} - E\hat{\beta})'|X\right] = E\left[(\hat{\beta} - \beta^0)(\hat{\beta} - \beta^0)'|X\right] = E[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}|X]$$
$$= (X'X)^{-1}X'E(\varepsilon\varepsilon'|X)X(X'X)^{-1} = (X'X)^{-1}X'\sigma^2 IX(X'X)^{-1} = \sigma^2(X'X)^{-1}.$$
Note that Assumption 3.4 is crucial here for obtaining the expression $\sigma^2(X'X)^{-1}$ for $\mathrm{var}(\hat{\beta}|X)$. Moreover, for any $\tau \in \mathbb{R}^K$ such that $\tau'\tau = 1$, we have
$$\tau'\mathrm{var}(\hat{\beta}|X)\tau = \sigma^2\tau'(X'X)^{-1}\tau \leq \sigma^2\lambda_{\max}[(X'X)^{-1}] = \sigma^2\lambda_{\min}^{-1}(X'X) \to 0,$$
given $\lambda_{\min}(X'X) \to \infty$ as $n \to \infty$ with probability one. Note that the condition $\lambda_{\min}(X'X) \to \infty$ ensures that $\mathrm{var}(\hat{\beta}|X)$ vanishes as $n \to \infty$.
(iii) Given $\hat{\beta} - \beta^0 = (X'X)^{-1}X'\varepsilon$, $e = Y - X\hat{\beta} = MY = M\varepsilon$ (since $MX = 0$), and $E(e) = 0$, we have
$$\mathrm{cov}(\hat{\beta}, e|X) = E\left[(\hat{\beta} - E\hat{\beta})(e - Ee)'|X\right] = E\left[(\hat{\beta} - \beta^0)e'|X\right] = E[(X'X)^{-1}X'\varepsilon\varepsilon'M|X]$$
$$= (X'X)^{-1}X'E(\varepsilon\varepsilon'|X)M = (X'X)^{-1}X'\sigma^2 IM = \sigma^2(X'X)^{-1}X'M = 0.$$
Again, Assumption 3.4 plays a crucial role in ensuring zero correlation between $\hat{\beta}$ and $e$.
(iv) Consider a linear estimator
$$b = C'Y,$$
where $C = C(X)$ is an $n \times K$ matrix depending on $X$. It is unbiased for $\beta^0$ regardless of the value of $\beta^0$ if and only if
$$E(b|X) = C'X\beta^0 + C'E(\varepsilon|X) = C'X\beta^0 = \beta^0.$$
This holds if and only if
$$C'X = I.$$
Because
$$b = C'Y = C'(X\beta^0 + \varepsilon) = C'X\beta^0 + C'\varepsilon = \beta^0 + C'\varepsilon,$$
the variance of $b$ is
$$\mathrm{var}(b|X) = E\left[(b - \beta^0)(b - \beta^0)'|X\right] = E[C'\varepsilon\varepsilon'C|X] = C'E(\varepsilon\varepsilon'|X)C = C'\sigma^2 IC = \sigma^2 C'C.$$
Using $C'X = I$, we now have
$$\mathrm{var}(b|X) - \mathrm{var}(\hat{\beta}|X) = \sigma^2 C'C - \sigma^2(X'X)^{-1} = \sigma^2[C'C - C'X(X'X)^{-1}X'C] = \sigma^2 C'[I - X(X'X)^{-1}X']C$$
$$= \sigma^2 C'MC = \sigma^2 C'MMC = \sigma^2 C'M'MC = \sigma^2(MC)'(MC) = \sigma^2 D'D = \sigma^2\sum_{t=1}^{n} D_t D_t',$$
which is p.s.d., where $D = MC$ and we have used the fact that for any real-valued matrix $D$, the square matrix $D'D$ is always p.s.d. [Question: How do we show this?]
(v) Now we show $E[e'e/(n-K)|X] = \sigma^2$. Because $e'e = \varepsilon'M\varepsilon$ and $\mathrm{tr}(AB) = \mathrm{tr}(BA)$, we have
$$E(e'e|X) = E(\varepsilon'M\varepsilon|X) = E[\mathrm{tr}(\varepsilon'M\varepsilon)|X] \quad [\text{putting } A = \varepsilon'M, B = \varepsilon]$$
$$= E[\mathrm{tr}(\varepsilon\varepsilon'M)|X] = \mathrm{tr}[E(\varepsilon\varepsilon'|X)M] = \mathrm{tr}(\sigma^2 IM) = \sigma^2\mathrm{tr}(M) = \sigma^2(n-K),$$
where
$$\mathrm{tr}(M) = \mathrm{tr}(I_n) - \mathrm{tr}[X(X'X)^{-1}X'] = \mathrm{tr}(I_n) - \mathrm{tr}[(X'X)^{-1}X'X] = n - K,$$
using $\mathrm{tr}(AB) = \mathrm{tr}(BA)$ again. It follows that
$$E(s^2|X) = \frac{E(e'e|X)}{n-K} = \frac{\sigma^2(n-K)}{n-K} = \sigma^2.$$
This completes the proof. Note that the sample residual variance $s^2 = e'e/(n-K)$ is a generalization of the sample variance $S_n^2 = (n-1)^{-1}\sum_{t=1}^{n}(Y_t - \bar{Y})^2$ for the random sample $\{Y_t\}_{t=1}^n$.
3.5 The Sampling Distribution of $\hat{\beta}$

To obtain the finite sample distribution of $\hat{\beta}$, we impose the normality assumption on $\varepsilon$.

Assumption 3.5: $\varepsilon|X \sim N(0, \sigma^2 I)$.

Remarks:
(i) Under Assumption 3.5, the conditional pdf of $\varepsilon$ given $X$ is
$$f(\varepsilon|X) = \frac{1}{(\sqrt{2\pi\sigma^2})^n}\exp\left(-\frac{\varepsilon'\varepsilon}{2\sigma^2}\right) = f(\varepsilon),$$
which does not depend on $X$; that is, the disturbance $\varepsilon$ is independent of $X$. Thus, every conditional moment of $\varepsilon$ given $X$ does not depend on $X$.
(ii) Assumption 3.5 implies Assumptions 3.2 ($E(\varepsilon|X) = 0$) and 3.4 ($E(\varepsilon\varepsilon'|X) = \sigma^2 I$).
(iii) The normality assumption may be reasonable for observations that are computed as the averages of the outcomes of many repeated experiments, due to the effect of the so-called central limit theorem (CLT). This may occur in physics, for example. In economics, the normality assumption may not always be reasonable. For example, many high-frequency financial time series usually display heavy tails (with kurtosis larger than 3).
Question: What is the sampling distribution of $\hat{\beta}$?

Sampling distribution of $\hat{\beta}$:
$$\hat{\beta} - \beta^0 = (X'X)^{-1}X'\varepsilon = (X'X)^{-1}\sum_{t=1}^{n} X_t\varepsilon_t = \sum_{t=1}^{n} C_t\varepsilon_t,$$
where the $K \times 1$ weighting vector
$$C_t = (X'X)^{-1}X_t$$
is called the leverage of observation $X_t$.
Theorem [Normality of $\hat{\beta}$]: Under Assumptions 3.1, 3.3 and 3.5,
$$(\hat{\beta} - \beta^0)|X \sim N[0, \sigma^2(X'X)^{-1}].$$

Proof: Conditional on $X$, $\hat{\beta} - \beta^0$ is a weighted sum of independent normal random variables $\{\varepsilon_t\}$, and so is also normally distributed.

The corollary below follows immediately.

Corollary [Normality of $R(\hat{\beta} - \beta^0)$]: Suppose Assumptions 3.1, 3.3 and 3.5 hold. Then for any nonstochastic $J \times K$ matrix $R$, we have
$$R(\hat{\beta} - \beta^0)|X \sim N[0, \sigma^2 R(X'X)^{-1}R'].$$
Question: What is the role of the nonstochastic matrix $R$?
Answer: $R$ is a selection matrix. For example, when $R = (1, 0, \ldots, 0)$, a $1 \times K$ row vector, we have
$$R(\hat{\beta} - \beta^0) = \hat{\beta}_0 - \beta_0^0.$$

Proof: Conditional on $X$, $\hat{\beta} - \beta^0$ is normally distributed. Therefore, conditional on $X$, the linear combination $R(\hat{\beta} - \beta^0)$ is also normally distributed, with
$$E[R(\hat{\beta} - \beta^0)|X] = RE[(\hat{\beta} - \beta^0)|X] = R\cdot 0 = 0$$
and
$$\mathrm{var}[R(\hat{\beta} - \beta^0)|X] = E\left[R(\hat{\beta} - \beta^0)(R(\hat{\beta} - \beta^0))'|X\right] = E\left[R(\hat{\beta} - \beta^0)(\hat{\beta} - \beta^0)'R'|X\right] = RE\left[(\hat{\beta} - \beta^0)(\hat{\beta} - \beta^0)'|X\right]R' = R\,\mathrm{var}(\hat{\beta}|X)R' = \sigma^2 R(X'X)^{-1}R'.$$
It follows that
$$R(\hat{\beta} - \beta^0)|X \sim N(0, \sigma^2 R(X'X)^{-1}R').$$
Question: Why would we like to know the sampling distribution of $R(\hat{\beta} - \beta^0)$?

Answer: This is mainly for confidence interval estimation and hypothesis testing.

Remark: However, $\mathrm{var}[R(\hat{\beta} - \beta^0)|X] = \sigma^2 R(X'X)^{-1}R'$ is unknown, because $\mathrm{var}(\varepsilon_t) = \sigma^2$ is unknown. We need to estimate $\sigma^2$.
3.6 Estimation of the Covariance Matrix of $\hat{\beta}$

Question: How do we estimate $\mathrm{var}(\hat{\beta}|X) = \sigma^2(X'X)^{-1}$?

Answer: To estimate $\sigma^2$, we can use the residual variance estimator
$$s^2 = e'e/(n-K).$$

Theorem [Residual Variance Estimator]: Suppose Assumptions 3.1, 3.3 and 3.5 hold. Then for all $n > K$:
(i)
$$\frac{(n-K)s^2}{\sigma^2}\Big|X = \frac{e'e}{\sigma^2}\Big|X \sim \chi^2_{n-K};$$
(ii) conditional on $X$, $s^2$ and $\hat{\beta}$ are independent.
Remarks:
(i) Question: What is a $\chi^2_q$ distribution?

Definition [Chi-square Distribution, $\chi^2_q$]: Suppose $\{Z_i\}_{i=1}^q$ are i.i.d. N(0,1) random variables. Then the random variable
$$\chi^2 = \sum_{i=1}^{q} Z_i^2$$
follows a $\chi^2_q$ distribution.

Properties of $\chi^2_q$:
- $E(\chi^2_q) = q$;
- $\mathrm{var}(\chi^2_q) = 2q$.
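Both properties follow directly from the definition and are easy to check by simulation. A sketch assuming NumPy; the degrees of freedom and replication count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
q, B = 5, 200_000

# chi2_q draws built from the definition: sums of q squared N(0,1) variables.
Z = rng.normal(size=(B, q))
chi2 = np.sum(Z ** 2, axis=1)

print(chi2.mean())   # close to q = 5
print(chi2.var())    # close to 2q = 10
```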
Remarks:
(i) Part (i) of the theorem implies
$$E\left[\frac{(n-K)s^2}{\sigma^2}\Big|X\right] = n-K, \quad \text{i.e.,} \quad \frac{n-K}{\sigma^2}E(s^2|X) = n-K.$$
It follows that $E(s^2|X) = \sigma^2$.
(ii) Part (i) of the theorem also implies
$$\mathrm{var}\left[\frac{(n-K)s^2}{\sigma^2}\Big|X\right] = 2(n-K), \quad \text{i.e.,} \quad \frac{(n-K)^2}{\sigma^4}\mathrm{var}(s^2|X) = 2(n-K),$$
so that
$$\mathrm{var}(s^2|X) = \frac{2\sigma^4}{n-K} \to 0$$
as $n \to \infty$.
(iii) Results (i) and (ii) together imply that the conditional MSE of $s^2$,
$$MSE(s^2|X) = E\left[(s^2 - \sigma^2)^2|X\right] = \mathrm{var}(s^2|X) + [E(s^2|X) - \sigma^2]^2 \to 0.$$
Thus, $s^2$ is a good estimator for $\sigma^2$.
(iv) The independence between $s^2$ and $\hat{\beta}$ is crucial for obtaining the sampling distributions of the popular t-test and F-test statistics, which will be introduced shortly.
Proof of Theorem: (i) Because $e = M\varepsilon$, we have
$$\frac{e'e}{\sigma^2} = \frac{\varepsilon'M\varepsilon}{\sigma^2} = \left(\frac{\varepsilon}{\sigma}\right)'M\left(\frac{\varepsilon}{\sigma}\right).$$
In addition, because $\varepsilon|X \sim N(0, \sigma^2 I_n)$, and $M$ is an idempotent matrix with rank $n-K$ (as shown before), we have the quadratic form
$$\frac{e'e}{\sigma^2} = \frac{\varepsilon'M\varepsilon}{\sigma^2}\Big|X \sim \chi^2_{n-K}$$
by the following lemma.

Lemma [Quadratic Form of Normal Random Variables]: If $v \sim N(0, I_n)$ and $Q$ is an $n \times n$ nonstochastic symmetric idempotent matrix with rank $q \leq n$, then the quadratic form
$$v'Qv \sim \chi^2_q.$$
In our application, we have $v = \varepsilon/\sigma \sim N(0, I)$ and $Q = M$. Since $\mathrm{rank}(M) = n-K$, we have
$$\frac{e'e}{\sigma^2}\Big|X \sim \chi^2_{n-K}.$$
(ii) Next, we show that $s^2$ and $\hat{\beta}$ are independent. Because $s^2 = e'e/(n-K)$ is a function of $e$, it suffices to show that $e$ and $\hat{\beta}$ are independent. This follows immediately because $e$ and $\hat{\beta}$ are jointly normally distributed and they are uncorrelated. It is well known that for a joint normal distribution, zero correlation is equivalent to independence.
Question: Why are $e$ and $\hat{\beta}$ jointly normally distributed?
Answer: We can write
$$\begin{bmatrix} e \\ \hat{\beta} - \beta^0 \end{bmatrix} = \begin{bmatrix} M\varepsilon \\ (X'X)^{-1}X'\varepsilon \end{bmatrix} = \begin{bmatrix} M \\ (X'X)^{-1}X' \end{bmatrix}\varepsilon.$$
Because $\varepsilon|X \sim N(0, \sigma^2 I)$, the linear transformation
$$\begin{bmatrix} M \\ (X'X)^{-1}X' \end{bmatrix}\varepsilon$$
is also normally distributed conditional on $X$.

Question: Why are these sampling distributions useful in practice?

Answer: They are useful in confidence interval estimation and hypothesis testing.
3.7 Hypothesis Testing

We now use the sampling distributions of $\hat{\beta}$ and $s^2$ to develop test procedures for hypotheses of interest. We consider testing a linear hypothesis of the form
$$H_0: R\beta^0 = r,$$
$$(J \times K)(K \times 1) = J \times 1,$$
where $R$ is called the selection matrix, and $J$ is the number of restrictions. We assume $J \leq K$.

Remark: We will test $H_0$ under correct model specification for $E(Y_t|X_t)$.
Motivations

Example 1 [Reforms Have No Effect]: Consider the extended production function
$$\ln(Y_t) = \beta_0 + \beta_1\ln(L_t) + \beta_2\ln(K_t) + \beta_3 AU_t + \beta_4 PS_t + \varepsilon_t,$$
where $AU_t$ is a dummy variable indicating whether firm $t$ is granted autonomy, and $PS_t$ is the profit share of firm $t$ with the state.
Suppose we are interested in testing whether autonomy $AU_t$ has an effect on productivity. Then we can write the null hypothesis
$$H_0^a: \beta_3^0 = 0.$$
This is equivalent to the choice of
$$\beta^0 = (\beta_0, \beta_1, \beta_2, \beta_3, \beta_4)', \quad R = (0, 0, 0, 1, 0), \quad r = 0.$$
If we are interested in testing whether profit sharing has an effect on productivity, we can consider the null hypothesis
$$H_0^b: \beta_4^0 = 0.$$
Alternatively, to test whether the production technology exhibits constant returns to scale (CRS), we can write the null hypothesis as
$$H_0^c: \beta_1^0 + \beta_2^0 = 1.$$
This is equivalent to the choice of $R = (0, 1, 1, 0, 0)$ and $r = 1$.
Finally, if we are interested in examining the joint effect of both autonomy and profit sharing, we can test the hypothesis that neither autonomy nor profit sharing has an impact:
$$H_0^d: \beta_3^0 = \beta_4^0 = 0.$$
This is equivalent to the choice of
$$R = \begin{bmatrix} 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}, \quad r = \begin{bmatrix} 0 \\ 0 \end{bmatrix}.$$
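The selection matrices above translate directly into arrays. A sketch assuming NumPy, with a hypothetical coefficient vector (all numerical values are made up for illustration):

```python
import numpy as np

# Hypothetical coefficient vector (beta_0, ..., beta_4) for the production function.
beta = np.array([0.3, 0.6, 0.4, 0.0, 0.1])

# H0^a: beta_3 = 0  ->  R is 1 x 5, r = 0.
R_a, r_a = np.array([[0, 0, 0, 1, 0]]), np.array([0.0])

# H0^c: beta_1 + beta_2 = 1  ->  R = (0, 1, 1, 0, 0), r = 1.
R_c, r_c = np.array([[0, 1, 1, 0, 0]]), np.array([1.0])

# H0^d: beta_3 = beta_4 = 0  ->  R is 2 x 5, r = (0, 0)'.
R_d = np.array([[0, 0, 0, 1, 0],
                [0, 0, 0, 0, 1]])
r_d = np.zeros(2)

print(R_a @ beta - r_a)   # zero: H0^a holds for this beta
print(R_c @ beta - r_c)   # zero: CRS holds (0.6 + 0.4 = 1)
print(R_d @ beta - r_d)   # second entry nonzero: H0^d is violated
```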
Example 3 [Optimal Predictor for the Future Spot Exchange Rate]: Consider
$$S_{t+\tau} = \beta_0 + \beta_1 F_t(\tau) + \varepsilon_{t+\tau}, \quad t = 1, \ldots, n,$$
where $S_{t+\tau}$ is the spot exchange rate at period $t+\tau$, and $F_t(\tau)$ is period $t$'s price for the foreign currency to be delivered at period $t+\tau$. The null hypothesis of interest is that the forward exchange rate $F_t(\tau)$ is an optimal predictor for the future spot rate $S_{t+\tau}$, in the sense that $E(S_{t+\tau}|I_t) = F_t(\tau)$, where $I_t$ is the information set available at time $t$. This is actually called the expectations hypothesis in economics and finance. Given the above specification, this hypothesis can be written as
$$H_0^e: \beta_0^0 = 0, \quad \beta_1^0 = 1,$$
and $E(\varepsilon_{t+\tau}|I_t) = 0$. This is equivalent to the choice of
$$R = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad r = \begin{bmatrix} 0 \\ 1 \end{bmatrix}.$$

Remark: All the examples considered above can be formulated with a suitable specification of $R$, where $R$ is a $J \times K$ matrix in the null hypothesis
$$H_0: R\beta^0 = r,$$
where $r$ is a $J \times 1$ vector.

Basic Idea of Hypothesis Testing
To test the null hypothesis
$$H_0: R\beta^0 = r,$$
we can consider the statistic
$$R\hat{\beta} - r$$
and check whether this difference is significantly different from zero.
Under $H_0: R\beta^0 = r$, we have
$$R\hat{\beta} - r = R\hat{\beta} - R\beta^0 = R(\hat{\beta} - \beta^0) \to 0 \text{ as } n \to \infty,$$
because $\hat{\beta} - \beta^0 \to 0$ as $n \to \infty$ in terms of MSE.
Under the alternative to $H_0$, $R\beta^0 \neq r$, but we still have $\hat{\beta} - \beta^0 \to 0$ in terms of MSE. It follows that
$$R\hat{\beta} - r = R(\hat{\beta} - \beta^0) + R\beta^0 - r \to R\beta^0 - r \neq 0$$
as $n \to \infty$, where the convergence is in terms of MSE. In other words, $R\hat{\beta} - r$ will converge to a nonzero limit, $R\beta^0 - r$.

Remark: The behavior of $R\hat{\beta} - r$ differs under $H_0$ and under the alternative hypothesis to $H_0$. Therefore, we can test $H_0$ by examining whether $R\hat{\beta} - r$ is significantly different from zero.

Question: How large should the magnitude of the difference $R\hat{\beta} - r$ be in order to claim that it is significantly different from zero?

Answer: We need a decision rule that specifies a threshold value with which we can compare the (absolute) value of $R\hat{\beta} - r$.
Note that $R\hat{\beta} - r$ is a random variable, and so it can take many (possibly an infinite number of) values. Given a data set, we only obtain one realization of $R\hat{\beta} - r$. Whether a realization of $R\hat{\beta} - r$ is close to zero should be judged using the critical value of its sampling distribution, which depends on the sample size $n$ and the significance level $\alpha \in (0,1)$ one preselects.
Question: What is the sampling distribution of $R\hat{\beta} - r$ under $H_0$?

Because
$$R(\hat{\beta} - \beta^0)|X \sim N(0, \sigma^2 R(X'X)^{-1}R'),$$
we have that, conditional on $X$,
$$R\hat{\beta} - r = R(\hat{\beta} - \beta^0) + R\beta^0 - r \sim N(R\beta^0 - r, \sigma^2 R(X'X)^{-1}R').$$

Corollary: Under Assumptions 3.1, 3.3 and 3.5, and $H_0: R\beta^0 = r$, we have for each $n > K$,
$$(R\hat{\beta} - r)|X \sim N(0, \sigma^2 R(X'X)^{-1}R').$$

Remarks:
(i) $\mathrm{var}(R\hat{\beta}|X) = \sigma^2 R(X'X)^{-1}R'$ is a $J \times J$ matrix.
(ii) However, $R\hat{\beta} - r$ cannot be used directly as a test statistic for $H_0$, because $\sigma^2$ is unknown, and so there is no way to calculate the critical values of the sampling distribution of $R\hat{\beta} - r$.
Question: How can we construct a feasible (i.e., computable) test statistic?

The form of the test statistic differs depending on whether $J = 1$ or $J > 1$. We first consider the case of $J = 1$.
CASE I: t-Test ($J = 1$)

Recall that we have

$$(R\hat\beta - r)|X \sim N(0, \sigma^2 R(X'X)^{-1}R').$$

When $J = 1$, the conditional variance

$$var[(R\hat\beta - r)|X] = \sigma^2 R(X'X)^{-1}R'$$

is a scalar ($1 \times 1$). It follows that, conditional on $X$,

$$\frac{R\hat\beta - r}{\sqrt{var[(R\hat\beta - r)|X]}} = \frac{R\hat\beta - r}{\sqrt{\sigma^2 R(X'X)^{-1}R'}} \sim N(0,1).$$
Question: What is the unconditional distribution of

$$\frac{R\hat\beta - r}{\sqrt{\sigma^2 R(X'X)^{-1}R'}}?$$

Answer: The unconditional distribution is also $N(0,1)$.

However, $\sigma^2$ is unknown, so we cannot use this ratio as a test statistic. We have to replace $\sigma^2$ by $s^2$, which is a good estimator for $\sigma^2$. This gives a feasible (i.e., computable) test statistic

$$T = \frac{R\hat\beta - r}{\sqrt{s^2 R(X'X)^{-1}R'}}.$$

However, the test statistic $T$ is no longer normally distributed. Instead,

$$T = \frac{R\hat\beta - r}{\sqrt{s^2 R(X'X)^{-1}R'}}
= \frac{(R\hat\beta - r)\Big/\sqrt{\sigma^2 R(X'X)^{-1}R'}}{\sqrt{\dfrac{(n-K)s^2}{\sigma^2}\Big/(n-K)}}
\sim \frac{N(0,1)}{\sqrt{\chi^2_{n-K}\big/(n-K)}} \sim t_{n-K},$$

where $t_{n-K}$ denotes the Student t-distribution with $n-K$ degrees of freedom. Note that the numerator and denominator are mutually independent conditional on $X$, because $\hat\beta$ and $s^2$ are mutually independent conditional on $X$.
Remarks:
(i) The feasible statistic $T$ is called a t-test statistic.
(ii) What is a Student $t_q$ distribution?

Definition [Student's t-distribution]: Suppose $Z \sim N(0,1)$ and $V \sim \chi^2_q$, and $Z$ and $V$ are independent. Then the ratio

$$\frac{Z}{\sqrt{V/q}} \sim t_q.$$
(iii) When $q \to \infty$, $t_q \stackrel{d}{\to} N(0,1)$, where $\stackrel{d}{\to}$ denotes convergence in distribution. This implies that

$$T = \frac{R\hat\beta - r}{\sqrt{s^2 R(X'X)^{-1}R'}} \stackrel{d}{\to} N(0,1) \text{ as } n \to \infty.$$

This result has a very important implication in practice: for a large sample size $n$, it makes no difference whether one uses the critical values from $t_{n-K}$ or from $N(0,1)$.

[Plots of Student t-distributions omitted.]
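The practical equivalence of $t_{n-K}$ and $N(0,1)$ critical values for large $n$ can be checked directly. A minimal sketch in Python using `scipy.stats` (the degrees-of-freedom values are arbitrary choices for illustration):

```python
from scipy.stats import norm, t

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)       # two-sided N(0,1) critical value, about 1.96
for df in (5, 30, 120, 1000):
    t_crit = t.ppf(1 - alpha / 2, df)  # two-sided t_{n-K} critical value
    print(df, round(t_crit, 3))

# The t critical value shrinks toward the normal one as the degrees of freedom grow.
assert t.ppf(0.975, 5) > t.ppf(0.975, 1000) > norm.ppf(0.975)
```

For small $n - K$ the t critical value is visibly larger (heavier tails), so using $N(0,1)$ there would over-reject.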
Question: What is meant by convergence in distribution?

Definition [Convergence in Distribution]: Suppose $\{Z_n, n = 1, 2, \dots\}$ is a sequence of random variables/vectors with distribution functions $F_n(z) = P(Z_n \leq z)$, and $Z$ is a random variable/vector with distribution function $F(z) = P(Z \leq z)$. We say that $Z_n$ converges to $Z$ in distribution if the distribution of $Z_n$ converges to the distribution of $Z$ at all continuity points, namely

$$\lim_{n\to\infty} F_n(z) = F(z), \quad \text{or} \quad F_n(z) \to F(z) \text{ as } n \to \infty,$$

for any continuity point $z$ (i.e., any point at which $F(z)$ is continuous). We use the notation $Z_n \stackrel{d}{\to} Z$. The distribution of $Z$ is called the asymptotic or limiting distribution of $Z_n$.
Implication: In practice, $Z_n$ is a test statistic or a parameter estimator, and its sampling distribution $F_n(z)$ is either unknown or very complicated, while $F(z)$ is known or very simple. As long as $Z_n \stackrel{d}{\to} Z$, we can use $F(z)$ as an approximation to $F_n(z)$. This gives a convenient procedure for statistical inference. The potential cost is that the approximation of $F_n(z)$ by $F(z)$ may not be good enough in finite samples (i.e., when $n$ is finite). How good the approximation is depends on the data generating process and the sample size $n$.

Example: Suppose $\{\varepsilon_n, n = 1, 2, \dots\}$ is an i.i.d. sequence with distribution function $F(z)$. Let $\varepsilon$ be a random variable with the same distribution function $F(z)$. Then $\varepsilon_n \stackrel{d}{\to} \varepsilon$.
With the sampling distribution of the test statistic $T$ in hand, we can now describe a decision rule for testing $H_0$ when $J = 1$.

Decision Rule of the t-Test Based on Critical Values

(i) Reject $H_0 : R\beta^0 = r$ at a prespecified significance level $\alpha \in (0,1)$ if

$$|T| > C_{t_{n-K}, \alpha/2},$$

where $C_{t_{n-K}, \alpha/2}$ is the so-called upper-tailed critical value of the $t_{n-K}$ distribution at level $\alpha/2$, determined by

$$P\big[t_{n-K} > C_{t_{n-K}, \alpha/2}\big] = \frac{\alpha}{2},$$

or equivalently

$$P\big[|t_{n-K}| > C_{t_{n-K}, \alpha/2}\big] = \alpha.$$

(ii) Accept $H_0$ at the significance level $\alpha$ if

$$|T| \leq C_{t_{n-K}, \alpha/2}.$$
Remarks:
(a) In testing $H_0$, there exist two types of errors, due to the limited information about the population in a given random sample $\{Z_t\}_{t=1}^n$. One possibility is that $H_0$ is true but we reject it. This is called the "Type I error." The significance level $\alpha$ is the probability of making a Type I error. If

$$P\big[|T| > C_{t_{n-K}, \alpha/2} \,\big|\, H_0\big] = \alpha,$$

we say that the decision rule is a test with size $\alpha$.

(b) The probability $P[|T| > C_{t_{n-K}, \alpha/2}]$ is called the power function of a size-$\alpha$ test. When

$$P\big[|T| > C_{t_{n-K}, \alpha/2} \,\big|\, H_0 \text{ is false}\big] < 1,$$

there exists a possibility that one may accept $H_0$ when it is false. This is called the "Type II error."

(c) Ideally one would like to minimize both the Type I error and the Type II error, but this is impossible for any given finite sample. In practice, one usually presets the level of Type I error, the so-called significance level, and then minimizes the Type II error. Conventional choices for the significance level $\alpha$ are 10%, 5% and 1%.
Next, we describe an alternative decision rule for testing $H_0$ when $J = 1$, using the so-called p-value of the test statistic $T$.

An Equivalent Decision Rule Based on p-values

Given a data set $z^n = \{y_t, x_t'\}_{t=1}^n$, which is a realization of the random sample $Z^n = \{Y_t, X_t'\}_{t=1}^n$, we can compute a realization (i.e., a number) of the t-test statistic $T$, namely

$$T(z^n) = \frac{R\hat\beta - r}{\sqrt{s^2 R(x'x)^{-1}R'}}.$$

Then the probability

$$p(z^n) = P\big[|t_{n-K}| > |T(z^n)|\big]$$

is called the p-value (i.e., probability value) of the test statistic $T$ given that $\{Y_t, X_t'\}_{t=1}^n = \{y_t, x_t'\}_{t=1}^n$ is observed, where $t_{n-K}$ is a Student t random variable with $n-K$ degrees of freedom, and $T(z^n)$ is a realization of the test statistic $T = T(Z^n)$. Intuitively, the p-value is the tail probability that the absolute value of a Student $t_{n-K}$ random variable takes values larger than the absolute value of the test statistic $T(z^n)$. If this probability is very small relative to the significance level, then it is unlikely that the test statistic $T(Z^n)$ follows a Student $t_{n-K}$ distribution. As a consequence, the null hypothesis is likely to be false.
The above decision rule can be described equivalently as follows:

Decision Rule Based on the p-value
(i) Reject $H_0$ at the significance level $\alpha$ if $p(z^n) < \alpha$.
(ii) Accept $H_0$ at the significance level $\alpha$ if $p(z^n) \geq \alpha$.

Question: What are the advantages/disadvantages of using p-values versus using critical values?

Remarks:
(i) p-values are more informative than merely rejecting or accepting the null hypothesis at some significance level. A p-value is the smallest significance level at which the null hypothesis can be rejected.
(ii) Most statistical software reports p-values of parameter estimates.
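The two decision rules above can be sketched numerically. The following Python fragment simulates a data set under $H_0$: the slope is zero (the design and all names are illustrative assumptions, not part of the notes), computes $T$ and its p-value by hand, and confirms that the p-value rule and the critical-value rule reach the same decision:

```python
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(0)
n, K = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.0]) + rng.normal(size=n)   # true slope is 0, so H0 holds

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)        # OLS estimator
e = y - X @ beta_hat
s2 = e @ e / (n - K)                                # s^2
XtX_inv = np.linalg.inv(X.T @ X)

R, r = np.array([0.0, 1.0]), 0.0                    # H0: slope = 0 (J = 1)
T = (R @ beta_hat - r) / np.sqrt(s2 * R @ XtX_inv @ R)
p_value = 2 * (1 - t_dist.cdf(abs(T), n - K))       # p(z^n) = P[|t_{n-K}| > |T(z^n)|]

alpha = 0.05
crit = t_dist.ppf(1 - alpha / 2, n - K)
# The two decision rules are equivalent by construction
assert (abs(T) > crit) == bool(p_value < alpha)
```

Rerunning with a different seed changes $T$ and $p(z^n)$, but never the equivalence of the two rules.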
Examples of t-Tests

Example 1 [Reforms have no effects (continued)]: We first consider testing the null hypothesis

$$H_0^a : \beta_3 = 0,$$

where $\beta_3$ is the coefficient of the autonomy dummy $AU_t$ in the extended production function regression model. This is equivalent to the selection of $R = (0, 0, 0, 1, 0)$ and $r = 0$. In this case, we have

$$s^2 R(X'X)^{-1}R' = \big[s^2(X'X)^{-1}\big]_{(4,4)} = S^2_{\hat\beta_3},$$

which is the estimator of $var(\hat\beta_3|X)$. The t-test statistic is

$$T = \frac{R\hat\beta - r}{\sqrt{s^2 R(X'X)^{-1}R'}} = \frac{\hat\beta_3}{\sqrt{S^2_{\hat\beta_3}}} \sim t_{n-K}.$$
Next, we consider testing the CRS hypothesis

$$H_0^c : \beta_1 + \beta_2 = 1,$$

which corresponds to $R = (0, 1, 1, 0, 0)$ and $r = 1$. In this case,

$$s^2 R(X'X)^{-1}R' = S^2_{\hat\beta_1} + S^2_{\hat\beta_2} + 2\,\widehat{cov}(\hat\beta_1, \hat\beta_2)
= \big[s^2(X'X)^{-1}\big]_{(2,2)} + \big[s^2(X'X)^{-1}\big]_{(3,3)} + 2\big[s^2(X'X)^{-1}\big]_{(2,3)}
= S^2_{\hat\beta_1+\hat\beta_2},$$

which is the estimator of $var(\hat\beta_1 + \hat\beta_2|X)$. Here, $\widehat{cov}(\hat\beta_1, \hat\beta_2)$ is the estimator of $cov(\hat\beta_1, \hat\beta_2|X)$, the covariance between $\hat\beta_1$ and $\hat\beta_2$ conditional on $X$.

The t-test statistic is

$$T = \frac{R\hat\beta - r}{\sqrt{s^2 R(X'X)^{-1}R'}} = \frac{\hat\beta_1 + \hat\beta_2 - 1}{S_{\hat\beta_1+\hat\beta_2}} \sim t_{n-K}.$$
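The variance expansion used here, the sum of the two variances plus twice the covariance, can be checked numerically with the quadratic form $R \hat{V} R'$. A Python sketch under an assumed CRS data generating process (the simulated design, coefficients, and names are illustrative only):

```python
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(1)
n = 100
logL, logK = rng.normal(size=n), rng.normal(size=n)
# Illustrative CRS technology: beta1 + beta2 = 0.6 + 0.4 = 1 holds in the DGP
y = 0.5 + 0.6 * logL + 0.4 * logK + 0.1 * rng.normal(size=n)

X = np.column_stack([np.ones(n), logL, logK])
K = X.shape[1]
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
s2 = e @ e / (n - K)
V_hat = s2 * np.linalg.inv(X.T @ X)        # s^2 (X'X)^{-1}

R, r = np.array([0.0, 1.0, 1.0]), 1.0      # H0: beta1 + beta2 = 1
# R V_hat R' equals var(b1) + var(b2) + 2 cov(b1, b2)
se = np.sqrt(R @ V_hat @ R)
T = (beta_hat[1] + beta_hat[2] - r) / se
p_value = 2 * (1 - t_dist.cdf(abs(T), n - K))
```

Writing the restriction through $R$ avoids assembling the variance and covariance terms by hand, and generalizes to any single linear restriction.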
CASE II: F-Test ($J > 1$)

Question: What happens if $J > 1$? We first state a useful lemma.

Lemma: If $Z \sim N(0, V)$, where $V = var(Z)$ is a nonsingular $J \times J$ variance-covariance matrix, then

$$Z'V^{-1}Z \sim \chi^2_J.$$
Proof: Because $V$ is symmetric and positive definite, we can find a symmetric and invertible matrix $V^{1/2}$ such that

$$V^{1/2}V^{1/2} = V, \qquad V^{-1/2}V^{-1/2} = V^{-1}.$$

(Question: What is this decomposition called?) Now, define

$$Y = V^{-1/2}Z.$$

Then we have $E(Y) = 0$, and

$$var(Y) = E\{[Y - E(Y)][Y - E(Y)]'\} = E(YY')
= E(V^{-1/2}ZZ'V^{-1/2})
= V^{-1/2}E(ZZ')V^{-1/2}
= V^{-1/2}VV^{-1/2}
= V^{-1/2}V^{1/2}V^{1/2}V^{-1/2}
= I.$$

It follows that $Y \sim N(0, I)$. Therefore, we have

$$Y'Y \sim \chi^2_J.$$
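The two square-root identities used in this proof can be verified numerically via the spectral (eigenvalue) decomposition, one standard way to construct the symmetric square root $V^{1/2}$. A small Python check with an arbitrary positive definite $V$ (chosen only for illustration):

```python
import numpy as np

V = np.array([[2.0, 0.5],
              [0.5, 1.0]])               # symmetric positive definite

# Symmetric square root via the spectral decomposition V = Q diag(w) Q'
w, Q = np.linalg.eigh(V)
V_half = Q @ np.diag(np.sqrt(w)) @ Q.T
V_neg_half = np.linalg.inv(V_half)

assert np.allclose(V_half @ V_half, V)                          # V^{1/2} V^{1/2} = V
assert np.allclose(V_neg_half @ V_neg_half, np.linalg.inv(V))   # V^{-1/2} V^{-1/2} = V^{-1}
assert np.allclose(V_neg_half @ V @ V_neg_half, np.eye(2))      # var(V^{-1/2} Z) = I
```

The last assertion is exactly the standardization step of the proof: $V^{-1/2}Z$ has identity covariance.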
Applying this lemma, and using the result that

$$(R\hat\beta - r)|X \sim N\big[0, \sigma^2 R(X'X)^{-1}R'\big]$$

under $H_0$, we have the quadratic form

$$(R\hat\beta - r)'\big[\sigma^2 R(X'X)^{-1}R'\big]^{-1}(R\hat\beta - r) \sim \chi^2_J$$

conditional on $X$, or

$$\frac{(R\hat\beta - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r)}{\sigma^2} \sim \chi^2_J$$

conditional on $X$.

Because the $\chi^2_J$ distribution does not depend on $X$, we also have

$$\frac{(R\hat\beta - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r)}{\sigma^2} \sim \chi^2_J$$

unconditionally.

As in constructing the t-test statistic, we should replace $\sigma^2$ by $s^2$ on the left-hand side:

$$\frac{(R\hat\beta - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r)}{s^2}.$$

The replacement of $\sigma^2$ by $s^2$ renders the distribution of the quadratic form no longer Chi-squared. Instead, after proper scaling, the quadratic form follows a so-called F-distribution with degrees of freedom $(J, n-K)$.
Question: Why?

Observe

$$\frac{(R\hat\beta - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r)}{s^2}
= J \cdot \frac{\Big[(R\hat\beta - r)'[\sigma^2 R(X'X)^{-1}R']^{-1}(R\hat\beta - r)\Big]\Big/J}{\dfrac{(n-K)s^2}{\sigma^2}\Big/(n-K)}
\sim J \cdot \frac{\chi^2_J/J}{\chi^2_{n-K}/(n-K)} \sim J \cdot F_{J,n-K},$$

where $F_{J,n-K}$ denotes the F-distribution with degrees of freedom $J$ and $n-K$.
Question: What is an $F_{J,n-K}$ distribution?

Definition: Suppose $U \sim \chi^2_p$ and $V \sim \chi^2_q$, and $U$ and $V$ are independent. Then the ratio

$$\frac{U/p}{V/q} \sim F_{p,q}$$

is said to follow an $F_{p,q}$ distribution with degrees of freedom $(p, q)$.

Question: Why is it called the F-distribution?

Answer: It is named after R. A. Fisher, a well-known statistician of the 20th century.

[Plots of F-distributions omitted.]
Remarks: Properties of $F_{p,q}$:
(i) If $F \sim F_{p,q}$, then $F^{-1} \sim F_{q,p}$.
(ii) $t^2_q \sim F_{1,q}$, because

$$t^2_q = \frac{\chi^2_1/1}{\chi^2_q/q} \sim F_{1,q}.$$

(iii) Given any fixed integer $p$, $p \cdot F_{p,q} \stackrel{d}{\to} \chi^2_p$ as $q \to \infty$.

Relationship (ii) implies that when $J = 1$, the t-test and the F-test deliver the same conclusion. Relationship (iii) implies that the conclusions based on $F_{p,q}$ and on $pF_{p,q}$ using the $\chi^2_p$ approximation will be approximately the same when $q$ is sufficiently large.
We can now define the following F-test statistic:

$$F \equiv \frac{(R\hat\beta - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r)/J}{s^2} \sim F_{J,n-K}.$$

Theorem: Suppose Assumptions 3.1, 3.3 and 3.5 hold. Then under $H_0 : R\beta^0 = r$, we have

$$F = \frac{(R\hat\beta - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r)/J}{s^2} \sim F_{J,n-K}$$

for all $n > K$.
A practical issue now is how to compute the F-statistic. One can of course compute it directly from the definition of $F$. However, there is a much more convenient way, which we now introduce.
Alternative Expression for the F-Test Statistic

Theorem: Suppose Assumptions 3.1 and 3.3 hold. Let $SSR_u = e'e$ be the sum of squared residuals from the unrestricted model

$$Y = X\beta^0 + \varepsilon.$$

Let $SSR_r = \tilde{e}'\tilde{e}$ be the sum of squared residuals from the restricted model

$$Y = X\beta^0 + \varepsilon \quad \text{subject to} \quad R\beta^0 = r,$$

where $\tilde\beta$ is the restricted OLS estimator. Then under $H_0$, the F-test statistic can be written as

$$F = \frac{(\tilde{e}'\tilde{e} - e'e)/J}{e'e/(n-K)} \sim F_{J,n-K}.$$

Remark: This is a convenient test statistic! One only needs to compute the sums of squared residuals in order to compute the F-test statistic.
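The algebraic equivalence of the quadratic-form and SSR expressions for $F$ can be confirmed numerically. A Python sketch (the design matrix, coefficients, and hypothesis are illustrative assumptions; the restricted model under $H_0: \beta_1 = \beta_2 = 0$ is intercept-only):

```python
import numpy as np

rng = np.random.default_rng(2)
n, K, J = 60, 3, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.0, 0.0]) + rng.normal(size=n)

R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])          # H0: beta1 = beta2 = 0 (J = 2)
r = np.zeros(2)

# Unrestricted OLS
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = e @ e / (n - K)

# Quadratic-form version of F
XtX_inv = np.linalg.inv(X.T @ X)
diff = R @ b - r
F_quad = diff @ np.linalg.solve(R @ XtX_inv @ R.T, diff) / J / s2

# SSR version: restricted residuals are deviations from the sample mean
ssr_r = np.sum((y - y.mean()) ** 2)
ssr_u = e @ e
F_ssr = ((ssr_r - ssr_u) / J) / (ssr_u / (n - K))

assert np.isclose(F_quad, F_ssr)
```

The two numbers agree to machine precision, which is exactly what the proof below establishes algebraically.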
Proof: Let $\tilde\beta$ be the OLS estimator under $H_0$, that is,

$$\tilde\beta = \arg\min_{\beta \in \mathbb{R}^K} (Y - X\beta)'(Y - X\beta)$$

subject to the constraint that $R\beta = r$. We first form the Lagrangian function

$$L(\beta, \lambda) = (Y - X\beta)'(Y - X\beta) + 2\lambda'(r - R\beta),$$

where $\lambda$ is a $J \times 1$ vector called the Lagrange multiplier vector. We have the following first order conditions:

$$\frac{\partial L(\tilde\beta, \tilde\lambda)}{\partial \beta} = -2X'(Y - X\tilde\beta) - 2R'\tilde\lambda = 0,$$

$$\frac{\partial L(\tilde\beta, \tilde\lambda)}{\partial \lambda} = r - R\tilde\beta = 0.$$

With the unconstrained OLS estimator $\hat\beta = (X'X)^{-1}X'Y$, the first equation of the FOC yields

$$-(\hat\beta - \tilde\beta) = (X'X)^{-1}R'\tilde\lambda, \qquad
R(X'X)^{-1}R'\tilde\lambda = -R(\hat\beta - \tilde\beta).$$

Hence, the Lagrange multiplier

$$\tilde\lambda = -[R(X'X)^{-1}R']^{-1}R(\hat\beta - \tilde\beta) = -[R(X'X)^{-1}R']^{-1}(R\hat\beta - r),$$

where we have used the constraint $R\tilde\beta = r$. It follows that

$$\hat\beta - \tilde\beta = (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r).$$
Now,

$$\tilde{e} = Y - X\tilde\beta = Y - X\hat\beta + X(\hat\beta - \tilde\beta) = e + X(\hat\beta - \tilde\beta).$$

It follows that

$$\tilde{e}'\tilde{e} = e'e + (\hat\beta - \tilde\beta)'X'X(\hat\beta - \tilde\beta)
= e'e + (R\hat\beta - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r),$$

where the cross-product term vanishes because $X'e = 0$. We have

$$(R\hat\beta - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r) = \tilde{e}'\tilde{e} - e'e$$

and

$$F = \frac{(R\hat\beta - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r)/J}{s^2} = \frac{(\tilde{e}'\tilde{e} - e'e)/J}{e'e/(n-K)}.$$

This completes the proof.
Question: What is the interpretation of the Lagrange multiplier $\tilde\lambda$?

Answer: Recall that we have obtained the relation

$$\tilde\lambda = -[R(X'X)^{-1}R']^{-1}R(\hat\beta - \tilde\beta) = -[R(X'X)^{-1}R']^{-1}(R\hat\beta - r).$$

Thus, $\tilde\lambda$ is an indicator of the departure of $R\hat\beta$ from $r$; that is, the value of $\tilde\lambda$ indicates whether $R\hat\beta - r$ is significantly different from zero.
Question: What happens to the distribution of $F$ as $n \to \infty$?

Recall that under $H_0$, the quadratic form

$$(R\hat\beta - r)'\big[\sigma^2 R(X'X)^{-1}R'\big]^{-1}(R\hat\beta - r) \sim \chi^2_J.$$

We have shown that

$$J \cdot F = (R\hat\beta - r)'\big[s^2 R(X'X)^{-1}R'\big]^{-1}(R\hat\beta - r)$$

is no longer $\chi^2_J$ for any finite $n > K$. However, the following theorem shows that as $n \to \infty$, the quadratic form $J \cdot F \stackrel{d}{\to} \chi^2_J$. In other words, as $n \to \infty$, the limiting distribution of $J \cdot F$ coincides with the distribution of the quadratic form

$$(R\hat\beta - r)'\big[\sigma^2 R(X'X)^{-1}R'\big]^{-1}(R\hat\beta - r).$$
Theorem: Suppose Assumptions 3.1, 3.3 and 3.5 hold. Then under $H_0$, the Wald test statistic

$$W = \frac{(R\hat\beta - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r)}{s^2} \stackrel{d}{\to} \chi^2_J$$

as $n \to \infty$.

Remarks:
(i) $W$ is called the Wald test statistic. In the present context, we have $W = J \cdot F$.
(ii) Recall that as $n \to \infty$, the F-distribution $F_{J,n-K}$, after being multiplied by $J$, approaches the $\chi^2_J$ distribution. This result has a very important implication in practice: when $n$ is large, the use of $F_{J,n-K}$ and the use of $\chi^2_J$ deliver the same conclusions for statistical inference.
Proof: The result follows immediately from the so-called Slutsky theorem and the fact that $s^2 \stackrel{p}{\to} \sigma^2$, where $\stackrel{p}{\to}$ denotes convergence in probability.

Question: What is the Slutsky theorem? What is convergence in probability?

Slutsky Theorem: Suppose $Z_n \stackrel{d}{\to} Z$, $a_n \stackrel{p}{\to} a$ and $b_n \stackrel{p}{\to} b$ as $n \to \infty$. Then

$$a_n + b_n Z_n \stackrel{d}{\to} a + bZ.$$

In our application, we have

$$Z_n = \frac{(R\hat\beta - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r)}{\sigma^2}, \quad Z \sim \chi^2_J,$$

$$a_n = 0, \quad a = 0, \quad b_n = \sigma^2/s^2, \quad b = 1.$$

Then by the Slutsky theorem, we have

$$W = b_n Z_n \stackrel{d}{\to} 1 \cdot Z \sim \chi^2_J,$$

because $Z_n \stackrel{d}{\to} Z$ and $b_n \stackrel{p}{\to} 1$. What remains is to show that $s^2 \stackrel{p}{\to} \sigma^2$ and then apply the Slutsky theorem.
Question: What is the concept of convergence in probability?

Definition [Convergence in Probability]: Suppose $\{Z_n\}$ is a sequence of random variables and $Z$ is a random variable (including a constant). We say that $Z_n$ converges to $Z$ in probability, denoted $Z_n - Z \stackrel{p}{\to} 0$ or $Z_n - Z = o_P(1)$, if for every small constant $\delta > 0$,

$$\lim_{n\to\infty} \Pr[|Z_n - Z| > \delta] = 0, \quad \text{or} \quad \Pr(|Z_n - Z| > \delta) \to 0 \text{ as } n \to \infty.$$

Interpretation: Define the event

$$A_n(\delta) = \{\omega \in \Omega : |Z_n(\omega) - Z(\omega)| > \delta\},$$

where $\omega$ is a basic outcome in the sample space $\Omega$. Convergence in probability says that the probability of the event $A_n(\delta)$ may be nonzero for any finite $n$, but this probability eventually vanishes as $n \to \infty$. In other words, it becomes less and less likely that the difference $|Z_n - Z|$ is larger than the prespecified constant $\delta > 0$; or, we have more and more confidence that the difference $|Z_n - Z|$ will be smaller than $\delta$ as $n \to \infty$.

Question: What is the interpretation of $\delta$?

Answer: It can be viewed as a prespecified tolerance level.

Remark: When $Z = b$, a constant, we write $Z_n \stackrel{p}{\to} b$, and we say that $b$ is the probability limit of $Z_n$, written

$$b = \operatorname{plim}_{n\to\infty} Z_n.$$
Next, we introduce another convergence concept which can be conveniently used to show convergence in probability.

Definition [Convergence in Quadratic Mean]: We say that $Z_n$ converges to $Z$ in quadratic mean, denoted $Z_n - Z \stackrel{q.m.}{\to} 0$ or $Z_n - Z = o_{q.m.}(1)$, if

$$\lim_{n\to\infty} E[Z_n - Z]^2 = 0.$$

Lemma: If $Z_n - Z \stackrel{q.m.}{\to} 0$, then $Z_n - Z \stackrel{p}{\to} 0$.

Proof: By Chebyshev's inequality, we have

$$P(|Z_n - Z| > \delta) \leq \frac{E[Z_n - Z]^2}{\delta^2} \to 0$$

for any given $\delta > 0$ as $n \to \infty$.
Example: Suppose Assumptions 3.1, 3.3 and 3.5 hold. Does $s^2$ converge in probability to $\sigma^2$?

Solution: Under the given assumptions,

$$(n-K)\frac{s^2}{\sigma^2} \sim \chi^2_{n-K},$$

and therefore we have $E(s^2) = \sigma^2$ and $var(s^2) = \frac{2\sigma^4}{n-K}$. It follows that $E(s^2 - \sigma^2)^2 = 2\sigma^4/(n-K) \to 0$, so $s^2 \stackrel{q.m.}{\to} \sigma^2$ and hence $s^2 \stackrel{p}{\to} \sigma^2$, because convergence in quadratic mean implies convergence in probability.

Question: If $s^2 \stackrel{p}{\to} \sigma^2$, do we have $s \stackrel{p}{\to} \sigma$?

Answer: Yes. Why? It follows from the following lemma with the choice of $g(s^2) = \sqrt{s^2} = s$.
Lemma: Suppose $Z_n \stackrel{p}{\to} Z$, and $g(\cdot)$ is a continuous function. Then

$$g(Z_n) \stackrel{p}{\to} g(Z).$$

Proof: Because $g(\cdot)$ is continuous, for any given constant $\epsilon > 0$ there exists $\delta = \delta(\epsilon)$ such that whenever $|Z_n(\omega) - Z(\omega)| \leq \delta$, we have $|g[Z_n(\omega)] - g[Z(\omega)]| \leq \epsilon$. Define the events

$$A(\delta) = \{\omega : |Z_n(\omega) - Z(\omega)| > \delta\}, \qquad B(\epsilon) = \{\omega : |g(Z_n(\omega)) - g(Z(\omega))| > \epsilon\}.$$

Then the continuity of $g(\cdot)$ implies that $B(\epsilon) \subseteq A(\delta)$. It follows that

$$P[B(\epsilon)] \leq P[A(\delta)] \to 0$$

as $n \to \infty$, where $P[A(\delta)] \to 0$ given $Z_n \stackrel{p}{\to} Z$. Because $\epsilon$ is arbitrary, we have $g(Z_n) \stackrel{p}{\to} g(Z)$. This completes the proof.
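The example above, $E(s^2 - \sigma^2)^2 = 2\sigma^4/(n-K) \to 0$, can be illustrated by a small Monte Carlo sketch in Python (the intercept-plus-slope design, seed, and replication count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2, K, reps = 1.0, 2, 500

def mse_of_s2(n):
    """Monte Carlo estimate of E(s^2 - sigma^2)^2 in a two-regressor model."""
    out = np.empty(reps)
    for i in range(reps):
        X = np.column_stack([np.ones(n), rng.normal(size=n)])
        y = X @ np.array([1.0, 1.0]) + rng.normal(size=n)   # errors ~ N(0, 1)
        b = np.linalg.solve(X.T @ X, X.T @ y)
        e = y - X @ b
        out[i] = (e @ e / (n - K) - sigma2) ** 2
    return out.mean()

# Theory: E(s^2 - sigma^2)^2 = 2 sigma^4 / (n - K), so it shrinks as n grows
m20, m200 = mse_of_s2(20), mse_of_s2(200)
assert m200 < m20
```

With $n = 20$ the theoretical value is $2/18 \approx 0.111$; with $n = 200$ it is $2/198 \approx 0.010$, and the simulated means track these closely.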
3.8 Applications

We now consider some special but important cases often encountered in economics and finance.
Case 1: Testing for the Joint Significance of Explanatory Variables

Consider a linear regression model

$$Y_t = X_t'\beta^0 + \varepsilon_t = \beta_0^0 + \sum_{j=1}^k \beta_j^0 X_{jt} + \varepsilon_t.$$

We are interested in testing the combined effect of all the regressors except the intercept. The null hypothesis is

$$H_0 : \beta_j^0 = 0 \text{ for } 1 \leq j \leq k,$$

which implies that none of the explanatory variables influences $Y_t$. The alternative hypothesis is

$$H_A : \beta_j^0 \neq 0 \text{ for at least some } j \in \{1, \dots, k\}.$$

One can use the F-test, with

$$F \sim F_{k, n-(k+1)}.$$

In fact, the restricted model under $H_0$ is very simple:

$$Y_t = \beta_0^0 + \varepsilon_t.$$

The restricted OLS estimator is $\tilde\beta = (\bar{Y}, 0, \dots, 0)'$. It follows that

$$\tilde{e} = Y - X\tilde\beta = Y - \bar{Y},$$

and hence

$$\tilde{e}'\tilde{e} = (Y - \bar{Y})'(Y - \bar{Y}).$$
Recall the definition of $R^2$:

$$R^2 = 1 - \frac{e'e}{(Y - \bar{Y})'(Y - \bar{Y})} = 1 - \frac{e'e}{\tilde{e}'\tilde{e}}.$$

It follows that

$$F = \frac{(\tilde{e}'\tilde{e} - e'e)/k}{e'e/(n-k-1)} = \frac{\big(1 - e'e/\tilde{e}'\tilde{e}\big)/k}{\big(e'e/\tilde{e}'\tilde{e}\big)/(n-k-1)} = \frac{R^2/k}{(1-R^2)/(n-k-1)}.$$

Thus, it suffices to run one regression, namely the unrestricted model, in this case. We emphasize that this formula is valid only when one is testing $H_0 : \beta_j^0 = 0$ for all $1 \leq j \leq k$.
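The $R^2$ form of the F-statistic can be checked against the SSR form in a simulated regression. A Python sketch (the design and coefficient values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([2.0, 0.5, 0.0, -0.3]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
tss = np.sum((y - y.mean()) ** 2)        # restricted SSR: intercept-only model
R2 = 1 - e @ e / tss

# F for H0: all slopes are zero, via the R^2 formula
F_r2 = (R2 / k) / ((1 - R2) / (n - k - 1))

# The same F via the restricted/unrestricted SSR formula
F_ssr = ((tss - e @ e) / k) / (e @ e / (n - k - 1))

assert np.isclose(F_r2, F_ssr)
```

The identity is exact because the restricted residuals under this particular $H_0$ are the deviations of $Y$ from its sample mean.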
Example 1 [Efficient Market Hypothesis]: Suppose $Y_t$ is the exchange rate return in period $t$, and $I_{t-1}$ is the information available at time $t-1$. Then a classical version of the efficient market hypothesis (EMH) can be stated as

$$E(Y_t|I_{t-1}) = E(Y_t).$$

To check whether exchange rate changes are unpredictable using the past history of exchange rate changes, we specify a linear regression model

$$Y_t = X_t'\beta^0 + \varepsilon_t,$$

where

$$X_t = (1, Y_{t-1}, \dots, Y_{t-k})'.$$

Under the EMH, we have

$$H_0 : \beta_j^0 = 0 \text{ for all } j = 1, \dots, k.$$

If the alternative

$$H_A : \beta_j^0 \neq 0 \text{ for at least some } j \in \{1, \dots, k\}$$

holds, then exchange rate changes are predictable using past information.
Remarks:
(i) What is the appropriate interpretation if $H_0$ is not rejected? Note that there exists a gap between the efficiency hypothesis and $H_0$, because the linear regression model is just one of many ways to check the EMH. Thus, if $H_0$ is not rejected, at most we can say that no evidence against the efficiency hypothesis is found. We should not conclude that the EMH holds.
(ii) Strictly speaking, the current theory (Assumption 3.2: $E(\varepsilon_t|X) = 0$) rules out this application, which is a dynamic time series regression model. However, we will justify in Chapter 5 that

$$k \cdot F = \frac{R^2}{(1-R^2)/(n-k-1)} \stackrel{d}{\to} \chi^2_k$$

under conditional homoskedasticity even for a linear dynamic regression model.
(iii) In fact, we can use a simpler version when $n$ is large:

$$(n-k-1)R^2 \stackrel{d}{\to} \chi^2_k.$$

This follows from the Slutsky theorem, because $R^2 \stackrel{p}{\to} 0$ under $H_0$. Although Assumption 3.5 is not needed for this result, conditional homoskedasticity is still needed, which rules out autoregressive conditional heteroskedasticity (ARCH) in the time series context.
Example 2 [Consumption Function and Wealth Effect]: Let $Y_t$ = consumption, $X_{1t}$ = labor income, $X_{2t}$ = liquid asset wealth. A regression estimation gives

$$Y_t = \underset{[1.77]}{33.88} - \underset{[-0.74]}{26.00}\,X_{1t} + \underset{[0.77]}{6.71}\,X_{2t} + e_t, \qquad R^2 = 0.742, \; n = 25,$$

where the numbers inside $[\cdot]$ are t-statistics.

Suppose we are interested in whether labor income or liquid asset wealth has an impact on consumption. We can use the F-test statistic:

$$F = \frac{R^2/2}{(1-R^2)/(n-3)} = \frac{0.742/2}{(1-0.742)/(25-3)} = 31.636 \sim F_{2,22}.$$

Comparing it with the critical value of $F_{2,22}$ at the 5% significance level, we reject the null hypothesis that neither income nor liquid asset wealth has an impact on consumption at the 5% significance level.
Case 2: Testing for Omitted Variables (or Testing for No Effect)

Suppose $X = (X^{(1)}, X^{(2)})$, where $X^{(1)}$ is an $n \times (k_1+1)$ matrix and $X^{(2)}$ is an $n \times k_2$ matrix. A random vector $X_t^{(2)}$ has no explanatory power for the conditional expectation of $Y_t$ if

$$E(Y_t|X_t) = E(Y_t|X_t^{(1)}).$$

Alternatively, it has explanatory power for the conditional expectation of $Y_t$ if

$$E(Y_t|X_t) \neq E(Y_t|X_t^{(1)}).$$

When $X_t^{(2)}$ has explanatory power for $Y_t$ but is not included in the regression, we say that $X_t^{(2)}$ is an omitted random variable or vector.

Question: How can we test whether $X_t^{(2)}$ is an omitted variable in the linear regression context?
Consider the restricted model

$$Y_t = \beta_0 + \beta_1 X_{1t} + \cdots + \beta_{k_1} X_{k_1 t} + \varepsilon_t.$$

Suppose we have $k_2$ additional variables $(X_{(k_1+1)t}, \dots, X_{(k_1+k_2)t})$, and so we consider the unrestricted regression model

$$Y_t = \beta_0 + \beta_1 X_{1t} + \cdots + \beta_{k_1} X_{k_1 t} + \beta_{k_1+1} X_{(k_1+1)t} + \cdots + \beta_{k_1+k_2} X_{(k_1+k_2)t} + \varepsilon_t.$$

The null hypothesis is that the additional variables have no effect on $Y_t$:

$$H_0 : \beta_{k_1+1} = \beta_{k_1+2} = \cdots = \beta_{k_1+k_2} = 0.$$

The alternative is that at least one of the additional variables has an effect on $Y_t$. The F-test statistic is

$$F = \frac{(\tilde{e}'\tilde{e} - e'e)/k_2}{e'e/(n - k_1 - k_2 - 1)} \sim F_{k_2, \, n-(k_1+k_2+1)}.$$
Question: Suppose we reject the null hypothesis. Then some important explanatory variables are omitted, and they should be included in the regression. On the other hand, if the F-test statistic does not reject the null hypothesis $H_0$, can we say that there is no omitted variable?

Answer: No. There may exist a nonlinear relationship in the additional variables which a linear regression specification cannot capture.
Example 1 [Testing for the Effect of Reforms]: Consider the extended production function

$$Y_t = \beta_0 + \beta_1 \ln(L_t) + \beta_2 \ln(K_t) + \beta_3 AU_t + \beta_4 PS_t + \beta_5 CM_t + \varepsilon_t,$$

where $AU_t$ is the autonomy dummy, $PS_t$ is the profit sharing ratio, and $CM_t$ is the dummy for change of manager. The null hypothesis of interest here is that none of the three reforms has an impact:

$$H_0 : \beta_3 = \beta_4 = \beta_5 = 0.$$

We can use the F-test, and $F \sim F_{3, n-6}$ under $H_0$.

Interpretation: If rejection occurs, there exists evidence against $H_0$. If no rejection occurs, then we find no evidence against $H_0$ (which is not the same as the statement that the reforms have no effect). It is possible that the effect of $X_t^{(2)}$ is of a nonlinear form; in this case, we may obtain zero coefficients for $X_t^{(2)}$, because the linear specification may not be able to capture the nonlinear effect.
Example 2 [Testing for Granger Causality]: Consider two time series $\{Y_t, Z_t\}$, where $t$ is the time index, $I_{t-1}^Y = \{Y_{t-1}, \dots, Y_1\}$ and $I_{t-1}^Z = \{Z_{t-1}, \dots, Z_1\}$. For example, $Y_t$ is GDP growth and $Z_t$ is money supply growth. We say that $Z_t$ does not Granger-cause $Y_t$ in conditional mean with respect to $I_{t-1} = \{I_{t-1}^Y, I_{t-1}^Z\}$ if

$$E(Y_t|I_{t-1}^Y, I_{t-1}^Z) = E(Y_t|I_{t-1}^Y).$$

In other words, the lagged variables of $Z_t$ have no impact on the level of $Y_t$.

Remark: Granger causality is defined in terms of incremental predictability rather than a real cause-effect relationship. From an econometric point of view, it is a test of omitted variables in a time series context. It was first introduced by Granger (1969).
Question: How can we test for Granger causality?

Consider the linear regression model

$$Y_t = \beta_0 + \beta_1 Y_{t-1} + \cdots + \beta_p Y_{t-p} + \beta_{p+1} Z_{t-1} + \cdots + \beta_{p+q} Z_{t-q} + \varepsilon_t.$$

Under Granger non-causality, we have

$$H_0 : \beta_{p+1} = \cdots = \beta_{p+q} = 0.$$

The F-test statistic satisfies

$$F \sim F_{q, \, n-(p+q+1)}.$$

Remark: The current theory (Assumption 3.2: $E(\varepsilon_t|X) = 0$) rules out this application, because it is a dynamic regression model. However, we will justify in Chapter 5 that under $H_0$,

$$q \cdot F \stackrel{d}{\to} \chi^2_q$$

as $n \to \infty$ under conditional homoskedasticity, even for a linear dynamic regression model.
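A Granger-causality F-test can be sketched as follows in Python. The bivariate data generating process, lag orders, and variable names are illustrative assumptions; the restricted model simply drops the lags of $Z_t$:

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(5)
T_obs, p, q = 300, 2, 2
z = rng.normal(size=T_obs)
y = np.zeros(T_obs)
for t in range(1, T_obs):
    # In this illustrative DGP, z does Granger-cause y through its first lag
    y[t] = 0.3 * y[t - 1] + 0.5 * z[t - 1] + rng.normal()

m = max(p, q)
idx = np.arange(m, T_obs)
X_u = np.column_stack(
    [np.ones(idx.size)]
    + [y[idx - j] for j in range(1, p + 1)]     # own lags of Y
    + [z[idx - j] for j in range(1, q + 1)]     # lags of Z
)
X_r = X_u[:, : 1 + p]                           # restricted model: drop the Z lags
Y = y[idx]

def ssr(Xm):
    b = np.linalg.lstsq(Xm, Y, rcond=None)[0]
    resid = Y - Xm @ b
    return resid @ resid

n_eff, K = Y.size, 1 + p + q
F = ((ssr(X_r) - ssr(X_u)) / q) / (ssr(X_u) / (n_eff - K))
p_value = 1 - f_dist.cdf(F, q, n_eff - K)
```

Because the simulated $Z$ lag enters $Y$ with a sizable coefficient, the test rejects non-causality decisively here; with the coefficient on $z_{t-1}$ set to zero, the rejection rate would be close to the nominal level.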
Example 3 [Testing for Structural Change (or Testing for Regime Shift)]: Consider

$$Y_t = \alpha_0 + \alpha_1 X_{1t} + \varepsilon_t,$$

where $t$ is a time index, and $\{X_t\}$ and $\{\varepsilon_t\}$ are mutually independent. Suppose there exist changes after $t = t_0$, i.e., there exist structural changes. We can consider the extended regression model

$$Y_t = (\alpha_0 + \delta_0 D_t) + (\alpha_1 + \delta_1 D_t)X_{1t} + \varepsilon_t = \alpha_0 + \alpha_1 X_{1t} + \delta_0 D_t + \delta_1 (D_t X_{1t}) + \varepsilon_t,$$

where $D_t = 1$ if $t > t_0$ and $D_t = 0$ otherwise. The variable $D_t$ is called a dummy variable, indicating whether the observation comes from the pre- or post-structural-break period.

The null hypothesis of no structural change is

$$H_0 : \delta_0 = \delta_1 = 0.$$

The alternative hypothesis, that there exists a structural change, is

$$H_A : \delta_0 \neq 0 \text{ or } \delta_1 \neq 0.$$

The F-test statistic satisfies

$$F \sim F_{2, n-4}.$$

The idea of such a test was first proposed by Chow (1960).
Case 3: Testing for Linear Restrictions

Example 1 [Testing for CRS]: Consider the extended production function

$$\ln(Y_t) = \beta_0 + \beta_1 \ln(L_t) + \beta_2 \ln(K_t) + \beta_3 AU_t + \beta_4 PS_t + \beta_5 CM_t + \varepsilon_t.$$

We test the null hypothesis of CRS:

$$H_0 : \beta_1 + \beta_2 = 1.$$

The alternative hypothesis is

$$H_A : \beta_1 + \beta_2 \neq 1.$$

What is the restricted model under $H_0$? It is given by

$$\ln(Y_t) = \beta_0 + \beta_1 \ln(L_t) + (1-\beta_1)\ln(K_t) + \beta_3 AU_t + \beta_4 PS_t + \beta_5 CM_t + \varepsilon_t,$$

or equivalently

$$\ln(Y_t/K_t) = \beta_0 + \beta_1 \ln(L_t/K_t) + \beta_3 AU_t + \beta_4 PS_t + \beta_5 CM_t + \varepsilon_t.$$

The F-test statistic satisfies

$$F \sim F_{1, n-6}.$$

Remark: Both the t-test and the F-test are applicable for testing CRS.
Example 2 [Wage Determination]: Consider the wage function

$$W_t = \beta_0 + \beta_1 P_t + \beta_2 P_{t-1} + \beta_3 U_t + \beta_4 V_t + \beta_5 W_{t-1} + \varepsilon_t,$$

where $W_t$ = wage, $P_t$ = price, $U_t$ = unemployment, and $V_t$ = unfilled vacancies.

We test the null hypothesis

$$H_0 : \beta_1 + \beta_2 = 0, \quad \beta_3 + \beta_4 = 0, \quad \text{and} \quad \beta_5 = 1.$$

Question: What is the economic interpretation of the null hypothesis $H_0$?

What is the restricted model? Under $H_0$, we have the restricted wage equation

$$\Delta W_t = \beta_0 + \beta_1 \Delta P_t + \beta_4 D_t + \varepsilon_t,$$

where $\Delta W_t = W_t - W_{t-1}$ is the wage growth, $\Delta P_t = P_t - P_{t-1}$ is the inflation rate, and $D_t = V_t - U_t$ is an index of the job market situation (excess job supply). This implies that the wage increase depends on the inflation rate and the excess labor supply.

The F-test statistic for $H_0$ satisfies

$$F \sim F_{3, n-6}.$$
Case 4: Testing under Near-Multicollinearity

Example 1: Consider the following estimation results for three separate regressions based on the same data set with $n = 25$. The first is a regression of consumption on income:

$$Y_t = \underset{[1.98]}{36.74} + \underset{[7.98]}{0.832}\,X_{1t} + e_{1t}, \qquad R^2 = 0.735.$$

The second is a regression of consumption on wealth:

$$Y_t = \underset{[1.97]}{36.61} + \underset{[7.99]}{0.208}\,X_{2t} + e_{2t}, \qquad R^2 = 0.735.$$

The third is a regression of consumption on both income and wealth:

$$Y_t = \underset{[1.77]}{33.88} - \underset{[-0.74]}{26.00}\,X_{1t} + \underset{[0.77]}{6.71}\,X_{2t} + e_t, \qquad R^2 = 0.742,$$

where the numbers inside $[\cdot]$ are t-statistics. Note that in the first two separate regressions we find significant t-test statistics for income and for wealth, but in the third joint regression both income and wealth are insignificant. This may be due to the fact that income and wealth are highly multicollinear. To test whether neither income nor wealth has an impact on consumption, we can use the F-test:

$$F = \frac{R^2/2}{(1-R^2)/(n-3)} = \frac{0.742/2}{(1-0.742)/(25-3)} = 31.636 \sim F_{2,22}.$$

This F-test shows that the null hypothesis is firmly rejected at the 5% significance level, because 31.636 far exceeds the 5% critical value of $F_{2,22}$.
3.9 Generalized Least Squares Estimation

Question: The classical linear regression theory crucially depends on the assumption that $\varepsilon|X \sim N(0, \sigma^2 I)$, or equivalently $\{\varepsilon_t\} \sim \text{i.i.d. } N(0, \sigma^2)$ with $\{X_t\}$ and $\{\varepsilon_t\}$ mutually independent. What may happen if some classical assumptions do not hold?

Question: Under what conditions do the existing procedures and results remain approximately valid?

Assumption 3.5 is unrealistic for many economic and financial data. Suppose Assumption 3.5 is replaced by the following condition:

Assumption 3.6: $\varepsilon|X \sim N(0, \sigma^2 V)$, where $0 < \sigma^2 < \infty$ is unknown and $V = V(X)$ is a known $n \times n$ finite and positive definite matrix.

Question: Is this assumption realistic in practice?

Remarks:
(i) Assumption 3.6 implies that

$$var(\varepsilon|X) = E(\varepsilon\varepsilon'|X) = \sigma^2 V = \sigma^2 V(X)$$

is known up to a constant $\sigma^2$. It allows for conditional heteroskedasticity of known form.
(ii) It is possible that $V$ is not a diagonal matrix, so $cov(\varepsilon_t, \varepsilon_s|X)$ may be nonzero. In other words, Assumption 3.6 allows conditional serial correlation of known form.
(iii) However, the assumption that $V$ is known is still very restrictive from a practical point of view. In practice, $V$ usually has an unknown form.
Question: What are the statistical properties of the OLS estimator $\hat\beta$ under Assumption 3.6?

Theorem: Suppose Assumptions 3.1, 3.3 and 3.6 hold. Then
(i) Unbiasedness: $E(\hat\beta|X) = \beta^0$.
(ii) Variance:

$$var(\hat\beta|X) = \sigma^2(X'X)^{-1}X'VX(X'X)^{-1} \neq \sigma^2(X'X)^{-1}.$$

(iii)

$$(\hat\beta - \beta^0)|X \sim N\big(0, \sigma^2(X'X)^{-1}X'VX(X'X)^{-1}\big).$$

(iv) $cov(\hat\beta, e|X) \neq 0$ in general.

Remarks:
(i) The OLS estimator $\hat\beta$ is still unbiased, and one can show that its variance goes to zero as $n \to \infty$ (see Question 6, Problem Set 03). Thus, it converges to $\beta^0$ in the sense of MSE.
(ii) However, the variance of the OLS estimator $\hat\beta$ no longer has the simple expression $\sigma^2(X'X)^{-1}$ under Assumption 3.6. As a consequence, the classical t- and F-test statistics are invalid, because they are based on an incorrect variance-covariance matrix of $\hat\beta$: they use the incorrect expression $\sigma^2(X'X)^{-1}$ rather than the correct variance formula $\sigma^2(X'X)^{-1}X'VX(X'X)^{-1}$.
(iii) Result (iv) of the theorem implies that even if we could obtain a consistent estimator of $\sigma^2(X'X)^{-1}X'VX(X'X)^{-1}$ and use it to construct tests, we would no longer obtain the Student t-distribution and the F-distribution, because the numerator and the denominator defining the t- and F-test statistics are no longer independent.
Proof: (i) Using $\hat\beta - \beta^0 = (X'X)^{-1}X'\varepsilon$, we have

$$E[(\hat\beta - \beta^0)|X] = (X'X)^{-1}X'E(\varepsilon|X) = (X'X)^{-1}X' \cdot 0 = 0.$$

(ii)

$$var(\hat\beta|X) = E[(\hat\beta - \beta^0)(\hat\beta - \beta^0)'|X]
= E[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}|X]
= (X'X)^{-1}X'E(\varepsilon\varepsilon'|X)X(X'X)^{-1}
= \sigma^2(X'X)^{-1}X'VX(X'X)^{-1}.$$

Note that we cannot simplify the expression further because $V \neq I$.

(iii) Because

$$\hat\beta - \beta^0 = (X'X)^{-1}X'\varepsilon = \sum_{t=1}^n C_t \varepsilon_t,$$

where $C_t = (X'X)^{-1}X_t$, $\hat\beta - \beta^0$ follows a normal distribution conditional on $X$, because it is a sum of normal random variables. As a result,

$$(\hat\beta - \beta^0)|X \sim N\big(0, \sigma^2(X'X)^{-1}X'VX(X'X)^{-1}\big).$$

(iv)

$$cov(\hat\beta, e|X) = E[(\hat\beta - \beta^0)e'|X]
= E[(X'X)^{-1}X'\varepsilon\varepsilon'M|X]
= (X'X)^{-1}X'E(\varepsilon\varepsilon'|X)M
= \sigma^2(X'X)^{-1}X'VM
\neq 0,$$

because $X'VM \neq 0$ in general. We can see that it is conditional heteroskedasticity and/or autocorrelation in $\{\varepsilon_t\}$ that cause $\hat\beta$ to be correlated with $e$.
Generalized Least Squares (GLS) Estimation

To introduce GLS, we first state a useful lemma.

Lemma: For any symmetric positive definite matrix $V$, we can always write

$$V^{-1} = C'C, \qquad V = C^{-1}(C')^{-1},$$

where $C$ is an $n \times n$ nonsingular matrix.

Question: What is this decomposition called?

Remark: $C$ need not be symmetric.
Consider the original linear regression model

$$Y = X\beta^0 + \varepsilon.$$

If we multiply the equation by $C$, we obtain the transformed regression model

$$CY = (CX)\beta^0 + C\varepsilon, \quad \text{or} \quad Y^* = X^*\beta^0 + \varepsilon^*,$$

where $Y^* = CY$, $X^* = CX$ and $\varepsilon^* = C\varepsilon$. Then the OLS estimator of this transformed model,

$$\hat\beta^* = (X^{*\prime}X^*)^{-1}X^{*\prime}Y^* = (X'C'CX)^{-1}(X'C'CY) = (X'V^{-1}X)^{-1}X'V^{-1}Y,$$

is called the Generalized Least Squares (GLS) estimator.
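For a diagonal $V$, the GLS estimator coincides with OLS applied to the transformed (weighted) data, which is easy to verify numerically. A Python sketch with an assumed known variance function (the exponential form and all names are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
h = np.exp(x)                                  # assumed known variance pattern: V = diag(h)
eps = np.sqrt(h) * rng.normal(size=n)          # heteroskedastic errors
y = X @ np.array([1.0, 2.0]) + eps

V_inv = np.diag(1.0 / h)
# GLS: (X' V^{-1} X)^{-1} X' V^{-1} Y
beta_gls = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ y)

# Equivalent: OLS on the transformed model C Y = (C X) b + C eps, C = diag(h^{-1/2})
Xs = X / np.sqrt(h)[:, None]
ys = y / np.sqrt(h)
beta_star = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)

assert np.allclose(beta_gls, beta_star)
```

In this diagonal case GLS is just weighted least squares: observations with large error variance are down-weighted by their conditional standard deviation.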
Question: What is the nature of GLS?

Observe that

$$E(\varepsilon^*|X) = E(C\varepsilon|X) = CE(\varepsilon|X) = C \cdot 0 = 0.$$

Also, note that

$$var(\varepsilon^*|X) = E[\varepsilon^*\varepsilon^{*\prime}|X] = E[C\varepsilon\varepsilon'C'|X] = CE(\varepsilon\varepsilon'|X)C'
= \sigma^2 CVC'
= \sigma^2 C[C^{-1}(C')^{-1}]C'
= \sigma^2 I.$$

It follows from Assumption 3.6 that

$$\varepsilon^*|X \sim N(0, \sigma^2 I).$$

The transformation makes the new error $\varepsilon^*$ conditionally homoskedastic and serially uncorrelated, while maintaining normality. Suppose that for some $t$, $\varepsilon_t$ has a large variance $\sigma_t^2$. The transformation $\varepsilon^* = C\varepsilon$ discounts $\varepsilon_t$ by dividing it by its conditional standard deviation, so that $\varepsilon_t^*$ becomes conditionally homoskedastic. In addition, the transformation also removes possible correlation between $\varepsilon_t$ and $\varepsilon_s$, $t \neq s$. As a consequence, the GLS estimator is the best linear LS estimator of $\beta^0$ in terms of the Gauss-Markov theorem.
To appreciate how the transformation by the matrix $C$ removes conditional heteroskedasticity and eliminates serial correlation, we now consider two examples.

Example 1 [Removing Heteroskedasticity]: Suppose

$$V = \begin{bmatrix} h_1 & 0 & \cdots & 0 \\ 0 & h_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & h_n \end{bmatrix}.$$

Then

$$C = \begin{bmatrix} h_1^{-1/2} & 0 & \cdots & 0 \\ 0 & h_2^{-1/2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & h_n^{-1/2} \end{bmatrix}$$

and

$$\varepsilon^* = C\varepsilon = \begin{bmatrix} \varepsilon_1/\sqrt{h_1} \\ \varepsilon_2/\sqrt{h_2} \\ \vdots \\ \varepsilon_n/\sqrt{h_n} \end{bmatrix}.$$

Example 2 [Eliminating Serial Correlation]: Suppose

$$V = \begin{bmatrix} 1 & \rho & 0 & \cdots & 0 & 0 \\ \rho & 1 & \rho & \cdots & 0 & 0 \\ 0 & \rho & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 1 & \rho \\ 0 & 0 & 0 & \cdots & \rho & 1 \end{bmatrix}.$$

Then

$$\varepsilon^* = C\varepsilon = \begin{bmatrix} \sqrt{1-\rho^2}\,\varepsilon_1 \\ \varepsilon_2 - \rho\varepsilon_1 \\ \vdots \\ \varepsilon_n - \rho\varepsilon_{n-1} \end{bmatrix}.$$
37775 :Theorem: Under Assumptions 3:1; 3:3 and 3.6,(i) E(�
�jX) = �0;(ii) var(�
�jX) = �2(X�0X�)�1 = �2(X 0V �1X)�1;
(iii) cov(��; e�jX) = 0; where e� = Y � �X��
�;
(iv) ��is BLUE.
53
(v) E(s�2jX) = �2; where s�2 = e�0e�=(n�K):
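The claim that the transformation delivers var(ε*|X) = σ²I can be verified numerically for the serial-correlation case. The AR(1) structure V_ts = ρ^{|t−s|}/(1 − ρ²) and the specific ρ below are assumptions of this sketch:

```python
import numpy as np

rho, n = 0.5, 6

# Example 2: AR(1) errors, V[t, s] = rho^{|t-s|} / (1 - rho^2).
idx = np.arange(n)
V = rho ** np.abs(np.subtract.outer(idx, idx)) / (1 - rho ** 2)

# Whitening matrix C: first row scales eps_1 by sqrt(1 - rho^2),
# the remaining rows form eps_t - rho * eps_{t-1}.
C = np.eye(n)
C[0, 0] = np.sqrt(1 - rho ** 2)
for t in range(1, n):
    C[t, t - 1] = -rho

# C V C' should be the identity: the transformed errors are
# conditionally homoskedastic and serially uncorrelated.
err = np.max(np.abs(C @ V @ C.T - np.eye(n)))
```

Since var(ε|X) = σ²V, we get var(Cε|X) = σ²CVC′ = σ²I, exactly as in the derivation above.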
Proof:
(i) GLS is the OLS estimator of the transformed model.
(ii) The transformed model satisfies Assumptions 3.1, 3.3 and 3.5 of the classical regression assumptions, with ε*|X* ∼ N(0, σ²In). It follows that GLS is BLUE by the Gauss-Markov theorem.
Remarks:
(i) Because β̂* is the OLS estimator of the transformed regression model with i.i.d. N(0, σ²I) errors, the t-test and F-test are applicable, and these test statistics are defined as follows:

T* = (Rβ̂* − r) / √[s*² R(X*′X*)⁻¹R′] ∼ t_{n−K},

F* = {(Rβ̂* − r)′[R(X*′X*)⁻¹R′]⁻¹(Rβ̂* − r)/J} / s*² ∼ F_{J,n−K}.

When testing whether all coefficients except the intercept are jointly zero, we have

(n − K)R*² →d χ²_J,

where R*² is the coefficient of determination of the transformed model.
(ii) Because the GLS estimator β̂* is BLUE, and the OLS estimator β̂ differs from β̂*, OLS β̂ cannot be BLUE:

β̂* = (X*′X*)⁻¹X*′Y* = (X′V⁻¹X)⁻¹X′V⁻¹Y,

β̂ = (X′X)⁻¹X′Y.
In fact, the most important message of GLS is the insight it provides into the impact
of conditional heteroskedasticity and serial correlation on the estimation and inference
of the linear regression model. In practice, GLS is generally not feasible, because the
n × n matrix V is of unknown form, where var(ε|X) = σ²V.
What are feasible solutions? There are two.
(i) First Approach: Adaptive Feasible GLS

In some cases, with additional assumptions, we can use a nonparametric estimator V̂ to replace the unknown V, obtaining the adaptive feasible GLS estimator

β̂*_a = (X′V̂⁻¹X)⁻¹X′V̂⁻¹Y,

where V̂ is an estimator for V. Because V is an n × n unknown matrix and we only have n data points, it is impossible to estimate V consistently using a sample of size n if we do not impose any restriction on the form of V. In other words, we have to impose some restrictions on V in order to estimate it consistently. For example, suppose we assume

σ²V = diag{σ1²(X), ..., σn²(X)} = diag{σ²(X1), ..., σ²(Xn)},

where diag{·} is an n × n diagonal matrix and σ²(Xt) = E(εt²|X) is unknown. The fact that σ²V is a diagonal matrix can arise when cov(εt, εs|X) = 0 for all t ≠ s, i.e., when there is no serial correlation. Then we can use the nonparametric kernel estimator
σ̂²(x) = [ n⁻¹ Σ_{t=1}^n et² b⁻¹K((x − Xt)/b) ] / [ n⁻¹ Σ_{t=1}^n b⁻¹K((x − Xt)/b) ] →p σ²(x),
where et is the estimated OLS residual, K(·) is a kernel function, i.e., a prespecified symmetric density function (e.g., K(u) = (2π)^{−1/2} exp(−u²/2)), and b = b(n) is a bandwidth such that b → 0 and nb → ∞ as n → ∞. The finite sample distribution of β̂*_a will differ from the finite sample distribution of β̂*, which assumes that V were known, because the sampling errors of the estimator V̂ have some impact on the estimator β̂*_a. However, under suitable conditions on V̂, β̂*_a will share the same asymptotic properties as the infeasible GLS estimator β̂* (i.e., the MSE of β̂*_a is approximately equal to the MSE of β̂*). In other words, the first stage estimation of σ²(·) has no impact on the asymptotic distribution of β̂*_a. For more discussion, see Robinson (1988) and Stinchcombe and White (1991).
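The two-stage procedure can be sketched in numpy. The variance function, the Gaussian kernel, and the rule-of-thumb bandwidth below are illustrative assumptions, not choices made in the text:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.uniform(-2, 2, size=n)
X = np.column_stack([np.ones(n), x])
beta0 = np.array([1.0, 2.0])
sigma2_true = 0.5 + x ** 2                       # assumed conditional variance
Y = X @ beta0 + rng.normal(size=n) * np.sqrt(sigma2_true)

# Step 1: OLS residuals.
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta_ols

# Step 2: kernel (Nadaraya-Watson) estimator of sigma^2(x) from squared
# residuals, with Gaussian kernel and a rule-of-thumb bandwidth.
b = 1.06 * x.std() * n ** (-1 / 5)
K = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
W = K((x[:, None] - x[None, :]) / b)             # W[i, j] = K((x_i - x_j)/b)
sigma2_hat = (W @ e ** 2) / W.sum(axis=1)

# Step 3: adaptive feasible GLS with V-hat = diag(sigma2_hat),
# i.e., weighted least squares with weights 1/sigma2_hat.
w = 1.0 / sigma2_hat
beta_fgls = np.linalg.solve((X * w[:, None]).T @ X, (X * w[:, None]).T @ Y)
```

The factors 1/n and 1/b cancel between numerator and denominator of σ̂²(x), which is why they do not appear explicitly in the code.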
Simulation studies?
(ii) Second Approach

Continue to use the OLS estimator β̂, obtaining the correct formula

var(β̂|X) = σ²(X′X)⁻¹X′VX(X′X)⁻¹

as well as a consistent estimator for var(β̂|X). The classical definitions of the t-test and F-test cannot be used, because they are based on an incorrect formula for var(β̂|X). However, some modified tests can be obtained by using a consistent estimator for the correct formula for var(β̂|X). The trick is to estimate σ²X′VX, which is a K × K unknown matrix, rather than to estimate V, which is an n × n unknown matrix. However, only asymptotic distributions can be used in this case.
Question: Suppose we assume

E(εε′|X) = σ²V = diag{σ1²(X), ..., σn²(X)}.

As pointed out earlier, this essentially assumes E(εtεs|X) = 0 for all t ≠ s; that is, there is no serial correlation in {εt} conditional on X. Instead of estimating σt²(X), one can estimate the K × K matrix σ²X′VX directly.
How can we estimate

σ²X′VX = Σ_{t=1}^n XtXt′ σt²(X)?

Answer: Use the estimator

X′D(e)D(e)′X = Σ_{t=1}^n XtXt′ et²,

where D(e) = diag(e1, ..., en) is an n × n diagonal matrix with all off-diagonal elements equal to zero. This is called White's (1980) heteroskedasticity-consistent variance-covariance estimator. See more discussion in Chapter 4.
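A minimal numpy sketch of White's estimator, plugged into the sandwich formula for var(β̂|X); the heteroskedastic DGP below is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta0 = np.array([0.5, 1.0])
eps = rng.normal(size=n) * np.sqrt(0.2 + x ** 2)   # heteroskedastic errors
Y = X @ beta0 + eps

beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta_ols
XtX_inv = np.linalg.inv(X.T @ X)

# White (1980): estimate the K x K matrix sigma^2 X'VX by
# sum_t e_t^2 X_t X_t' = X'D(e)D(e)'X, avoiding the n x n matrix V.
meat = (X * (e ** 2)[:, None]).T @ X               # sum_t e_t^2 X_t X_t'
var_white = XtX_inv @ meat @ XtX_inv               # sandwich estimator
se_white = np.sqrt(np.diag(var_white))             # robust standard errors
```

Only the K × K "meat" matrix is estimated, which is the trick described above.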
Question: For J = 1, do we have

(Rβ̂ − r) / √[R(X′X)⁻¹X′D(e)D(e)′X(X′X)⁻¹R′] ∼ t_{n−K}?

For J > 1, do we have

(Rβ̂ − r)′[R(X′X)⁻¹X′D(e)D(e)′X(X′X)⁻¹R′]⁻¹(Rβ̂ − r)/J ∼ F_{J,n−K}?

Answer: No. Why? Recall that cov(β̂, e|X) ≠ 0 under Assumption 3.6. It follows that β̂ and e are not independent.
However, when n → ∞, we have:

(i) Case I (J = 1):

(Rβ̂ − r) / √[R(X′X)⁻¹X′D(e)D(e)′X(X′X)⁻¹R′] →d N(0, 1).

This can be called a robust t-test.

(ii) Case II (J > 1):

(Rβ̂ − r)′[R(X′X)⁻¹X′D(e)D(e)′X(X′X)⁻¹R′]⁻¹(Rβ̂ − r) →d χ²_J.

This is a robust Wald test statistic.
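Both robust statistics can be sketched as follows. The DGP and the null hypotheses tested (which hold in this DGP) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta0 = np.array([0.5, 1.0])
Y = X @ beta0 + rng.normal(size=n) * np.sqrt(0.2 + x ** 2)

beta = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta
XtX_inv = np.linalg.inv(X.T @ X)
# White sandwich estimator of var(beta|X).
V_hat = XtX_inv @ ((X * (e ** 2)[:, None]).T @ X) @ XtX_inv

# Robust t-test of H0: beta_1 = 1 (J = 1); asymptotically N(0, 1).
R1, r1 = np.array([[0.0, 1.0]]), np.array([1.0])
t_robust = ((R1 @ beta - r1) / np.sqrt(R1 @ V_hat @ R1.T)).item()

# Robust Wald test of H0: R beta = r with J = 2; asymptotically chi^2_J.
R, r = np.eye(2), beta0
d = R @ beta - r
W_robust = float(d @ np.linalg.solve(R @ V_hat @ R.T, d))
```

Because both hypotheses are true in this simulated DGP, t_robust should look like a N(0, 1) draw and W_robust like a χ²₂ draw for large n.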
The above two feasible solutions are based on the assumption that E(εtεs|X) = 0 for all t ≠ s. When there exists serial correlation of unknown form, an alternative solution should be provided. This is discussed in Chapter 6.
3.10 Summary and Conclusion
In this chapter, we have presented the econometric theory for the classical linear
regression models. We first provide and discuss a set of assumptions on which the classical linear regression model is built. This set of regularity conditions serves as the starting point from which we develop modern econometric theory for linear regression models.
We derive the statistical properties of the OLS estimator. In particular, we point out that R² is not a suitable model selection criterion, because it is always nondecreasing in the number of regressors. Suitable model selection criteria, such as AIC and BIC, are discussed. We show that conditional on the regressor matrix X, the OLS estimator β̂ is unbiased, has a vanishing variance, and is BLUE. Under the additional conditional normality assumption, we derive the finite sample normal distribution of β̂, the Chi-squared distribution of (n − K)s²/σ², and the independence between β̂ and s².
Many hypotheses encountered in economics can be formulated as linear restrictions on model parameters. Depending on the number of parameter restrictions, we derive the t-test and the F-test. In the special case of testing the hypothesis that all slope coefficients are jointly zero, we also derive an asymptotically Chi-squared test based on R².
When there exist conditional heteroskedasticity and/or autocorrelation, the OLS estimator is still unbiased and has a vanishing variance, but it is no longer BLUE, and β̂ and s² are no longer mutually independent. Under the assumption of a known variance-covariance matrix up to some scale parameter, one can transform the linear regression model by correcting conditional heteroskedasticity and eliminating autocorrelation, so that the transformed regression model has conditionally homoskedastic and uncorrelated errors. The OLS estimator of this transformed linear regression model is called the GLS
estimator, which is BLUE. The t-test and F-test are applicable. When the variance-covariance structure is unknown, the GLS estimator becomes infeasible. However, if the error in the original linear regression model is serially uncorrelated (as is the case with independent observations across t), there are two feasible solutions. The first is to use a nonparametric method to obtain a consistent estimator of the conditional variance var(εt|Xt), and then obtain a feasible plug-in GLS estimator. The second is to use White's (1980) heteroskedasticity-consistent variance-covariance matrix estimator for the OLS estimator β̂. Both methods are built on asymptotic theory. When the error of the original linear regression model is serially correlated, a feasible solution is provided in Chapter 6.
References
Robinson, P. (1988)
Stinchcombe, M. and H. White (1991)
ADVANCED ECONOMETRICS
PROBLEM SET 03
1. Consider a bivariate linear regression model

Yt = Xt′β⁰ + εt, t = 1, ..., n,

where Xt = (X0t, X1t)′ = (1, X1t)′, and εt is a regression error. Let ρ̂ denote the sample correlation between Yt and X1t, namely,

ρ̂ = Σ_{t=1}^n x1t yt / √( Σ_{t=1}^n x1t² · Σ_{t=1}^n yt² ),

where yt = Yt − Ȳ, x1t = X1t − X̄1, and Ȳ and X̄1 are the sample means of {Yt} and {X1t}. Show R² = ρ̂².
2. Suppose Xt = Q for all t ≥ m, where m is a fixed integer and Q is a K × 1 constant vector. Do we have λmin(X′X) → ∞ as n → ∞? Explain.
3. The adjusted R², denoted R̄², is defined as follows:

R̄² = 1 − [e′e/(n − K)] / [(Y − Ȳ)′(Y − Ȳ)/(n − 1)].

Show that

R̄² = 1 − [(n − 1)/(n − K)](1 − R²).
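The identity in this problem can be checked numerically before proving it algebraically; the simulated data below are an assumption of the sketch, not part of the problem:

```python
import numpy as np

rng = np.random.default_rng(5)
n, K = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
Y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=n)

beta = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta
tss = np.sum((Y - Y.mean()) ** 2)
R2 = 1 - (e @ e) / tss

# Definition of adjusted R^2 ...
R2_bar_def = 1 - (e @ e / (n - K)) / (tss / (n - 1))
# ... and the identity to be shown in the problem.
R2_bar_id = 1 - ((n - 1) / (n - K)) * (1 - R2)
```

The two expressions agree exactly, for any data set with an intercept in the regression.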
4. Consider the following linear regression model

Yt = Xt′β⁰ + ut, t = 1, ..., n, (4.1)

where

ut = σ(Xt)εt,

{Xt} is a nonstochastic process, and σ(Xt) is a positive function of Xt such that

Ω = diag{σ²(X1), σ²(X2), ..., σ²(Xn)} = Ω^{1/2} Ω^{1/2},

with

Ω^{1/2} = diag{σ(X1), σ(X2), ..., σ(Xn)}.

Assume that {εt} is i.i.d. N(0, 1). Then the ut are independent, with ut ∼ N(0, σ²(Xt)). This differs from Assumption 3.5 of the classical linear regression analysis, because now {ut} exhibits conditional heteroskedasticity.
Let β̂ denote the OLS estimator of β⁰.

(a) Is β̂ unbiased for β⁰?

(b) Show that var(β̂) = (X′X)⁻¹X′ΩX(X′X)⁻¹.

Consider an alternative estimator

β̃ = (X′Ω⁻¹X)⁻¹X′Ω⁻¹Y = [ Σ_{t=1}^n σ⁻²(Xt)XtXt′ ]⁻¹ Σ_{t=1}^n σ⁻²(Xt)XtYt.

(c) Is β̃ unbiased for β⁰?

(d) Show that var(β̃) = (X′Ω⁻¹X)⁻¹.

(e) Is var(β̂) − var(β̃) positive semi-definite (p.s.d.)? Which estimator, β̂ or β̃, is more efficient?
(f) Is β̃ the Best Linear Unbiased Estimator (BLUE) for β⁰? [Hint: There are several approaches to this question. A simple one is to consider the transformed model

Yt* = Xt*′β⁰ + εt, t = 1, ..., n, (4.2)

where Yt* = Yt/σ(Xt) and Xt* = Xt/σ(Xt). This model is obtained from model (4.1) after dividing by σ(Xt). In matrix notation, model (4.2) can be written as

Y* = X*β⁰ + ε,

where the n × 1 vector Y* = Ω^{−1/2}Y and the n × k matrix X* = Ω^{−1/2}X.]
(g) Construct two test statistics for the null hypothesis of interest H0: β⁰2 = 0. One test is based on β̂, and the other is based on β̃. What are the finite sample distributions of your test statistics under H0? Can you tell which test is better?

(h) Construct two test statistics for the null hypothesis of interest H0: Rβ⁰ = r, where R is a J × k matrix with J > 0. One test is based on β̂, and the other is based on β̃. What are the finite sample distributions of your test statistics under H0?
5. Consider the following classical regression model

Yt = Xt′β⁰ + εt = β⁰0 + Σ_{j=1}^k β⁰j Xjt + εt, t = 1, ..., n. (5.1)

Suppose that we are interested in testing the null hypothesis

H0: β⁰1 = β⁰2 = ··· = β⁰k = 0.

Then the F-statistic can be written as

F = [(ẽ′ẽ − e′e)/k] / [e′e/(n − k − 1)],

where e′e is the sum of squared residuals from the unrestricted model (5.1), and ẽ′ẽ is the sum of squared residuals from the restricted model

Yt = β⁰0 + εt. (5.2)
(a) Show that under Assumptions 3.1 and 3.3,

F = [R²/k] / [(1 − R²)/(n − k − 1)],

where R² is the coefficient of determination of the unrestricted model (5.1).

(b) Suppose in addition Assumption 3.5 holds. Show that under H0,

(n − k − 1)R² →d χ²_k.
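The algebraic identity in part (a) can be checked numerically (the simulated data below are an assumption of the sketch, and no substitute for the proof):

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 80, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
Y = X @ np.array([1.0, 0.3, -0.7]) + rng.normal(size=n)

# Unrestricted model (5.1): regress Y on the intercept and k regressors.
beta = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta
# Restricted model (5.2), all slopes zero: residuals are Y_t - Ybar.
e_tilde = Y - Y.mean()

# F-statistic from the two residual sums of squares ...
F_ssr = ((e_tilde @ e_tilde - e @ e) / k) / (e @ e / (n - k - 1))
# ... and the same statistic expressed through R^2.
R2 = 1 - (e @ e) / (e_tilde @ e_tilde)
F_r2 = (R2 / k) / ((1 - R2) / (n - k - 1))
```

The agreement holds because, with an intercept, ẽ′ẽ equals the total sum of squares, so R² = 1 − e′e/ẽ′ẽ.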
6. Suppose X′X is a K × K matrix, V is an n × n matrix, and both X′X and V are symmetric and nonsingular, with the minimum eigenvalue λmin(X′X) → ∞ as n → ∞ and 0 < c ≤ λmax(V) ≤ C < ∞. Show that for any τ ∈ R^K such that τ′τ = 1,

τ′var(β̂|X)τ = σ²τ′(X′X)⁻¹X′VX(X′X)⁻¹τ → 0

as n → ∞. Thus, var(β̂|X) vanishes as n → ∞ under conditional heteroskedasticity.
7. Suppose a data generating process is given by

Yt = β⁰1 X1t + β⁰2 X2t + εt = Xt′β⁰ + εt,

where Xt = (X1t, X2t)′, and Assumptions 3.1-3.4 hold. (For simplicity, we have assumed no intercept.) Denote the OLS estimator by β̂ = (β̂1, β̂2)′.
Suppose β⁰2 = 0 and this is known. Then we can consider the simpler regression

Yt = β⁰1 X1t + εt.

Denote the OLS estimator of this simpler regression by β̃1.

Compare the relative efficiency of β̂1 and β̃1. That is, which estimator is better for β⁰1? Give your reasoning in detail.
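The comparison asked for here can be explored with a small Monte Carlo sketch. The specific DGP below, with correlated regressors and β⁰2 = 0, is an illustrative assumption and not part of the problem:

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 100, 2000
beta01 = 1.0                               # true beta_1; beta_2 = 0 in the DGP

b_full = np.empty(reps)                    # beta_1-hat from the two-regressor OLS
b_simple = np.empty(reps)                  # beta_1-tilde from the simple regression
for i in range(reps):
    x1 = rng.normal(size=n)
    x2 = 0.6 * x1 + rng.normal(size=n)     # correlated regressors (assumption)
    Y = beta01 * x1 + rng.normal(size=n)   # beta_2 = 0 in truth
    X = np.column_stack([x1, x2])
    b_full[i] = np.linalg.solve(X.T @ X, X.T @ Y)[0]
    b_simple[i] = (x1 @ Y) / (x1 @ x1)

# Compare the Monte Carlo means and sampling variances of the two estimators.
mean_full, mean_simple = b_full.mean(), b_simple.mean()
var_full, var_simple = b_full.var(), b_simple.var()
```

Comparing the two Monte Carlo variances (and checking both means against β⁰1) suggests the answer to be argued analytically.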